On Friday 25 Feb 2005 01:03, John Peacock wrote:
> Bob wrote:
> > Is there an existing filter that could determine if a username@
> > is 60% or more mis-spelled as compared to real usernames?
> > 60% is arbitrary and would be configurable. If so, that would
> > serve to make a fuzzy honeypot filter for dictionary spam.
>
> There are a couple of modules on CPAN which do something like this, called
> Text::Metaphone and newer and better Text::DoubleMetaphone. They both
> convert a word into something like the Soundex algorithm that was invented
> for the US Census. They produce a value for what a given word "sounds
> like" so that similarly pronounced words have similar values.

Double Metaphone is very useful and much better than Soundex. You can generate metaphone keys for all your existing names up front, and then when you get an email you generate its key and look for matches.

You can also run an Edit Distance check (how many edits are required to get from one string to another), but in this case you need to calculate the edit distance from the source to each target string in turn, so it's more CPU intensive, though not too much so. There are Perl modules for it, but here's a good page describing it:

http://www.merriampark.com/ld.htm

Metaphone is particularly designed with names in mind, whereas edit distance is used for matching any words (but is very good at matching simple typos like "cahty" or "catthy" for "cathy").

I'm developing a tool for law enforcement markets (www.i2.co.uk) that, amongst other things, does this particularly for names, and we're using a mix of double metaphone, edit distance, regexps, some hand-tuned tricks, and a manual list of synonyms (James = Jim = Jimmy = Jack?).

I've got a (Perl) webpage that illustrates results from various methods. If I get a chance I'll put it up somewhere public and let you know.

Cheers
--
Tim
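
To make the edit distance check concrete, here is a minimal Python sketch (not the Perl modules mentioned above, and not the tool described in this message). It computes plain Levenshtein distance and compares an incoming username against each known username in turn, which is the per-target cost noted earlier. The function names, the sample usernames, and the `max_distance` cutoff of 2 are all assumptions made for illustration. Note that plain Levenshtein counts a swapped pair of letters as two edits.

```python
def edit_distance(source: str, target: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions, and substitutions turning source into target."""
    # previous[j] holds the distance between the first i-1 characters
    # of source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # delete s_char
                current[j - 1] + 1,      # insert t_char
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def closest_usernames(candidate, known_usernames, max_distance=2):
    """Hypothetical helper: return (name, distance) pairs for known
    usernames within max_distance edits of candidate, nearest first.
    The distance is computed against every known name in turn."""
    matches = []
    for name in known_usernames:
        distance = edit_distance(candidate.lower(), name.lower())
        if distance <= max_distance:
            matches.append((name, distance))
    return sorted(matches, key=lambda pair: (pair[1], pair[0]))


if __name__ == "__main__":
    known = ["cathy", "bob", "tim"]  # illustrative usernames only
    for attempt in ("cahty", "catthy"):
        print(attempt, closest_usernames(attempt, known))
```

A filter along the lines Bob asked about could treat a non-empty result for an address that does not exactly match a real username as a likely dictionary-spam near miss. The cutoff would need tuning, for example as a fraction of the username length instead of a fixed count.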
