On Friday 25 Feb 2005 01:03, John Peacock wrote:
> Bob wrote:
> > Is there an existing filter that could determine if a username@
> > is 60% or more mis-spelled as compared to real usernames?
> > 60% is arbitrary and would be configurable. If so, that would
> > serve to make a fuzzy honeypot filter for dictionary spam.
>
> There are a couple of modules on CPAN which do something like this, called
> Text::Metaphone and newer and better Text::DoubleMetaphone.  They both
> convert a word into something like the Soundex algorithm that was invented
> for the US Census.  They produce a value for what a given word "sounds
> like" so that similarly pronounced words have similar values.

Double Metaphone is very useful and much better than Soundex.
You can generate metaphone keys for all your existing names up front, and then 
when you get an email you generate its key and look for matches.
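
A rough sketch of that "generate keys up front, then look for matches" flow,
in Python for brevity. Soundex stands in for Double Metaphone here only
because it fits in a few lines; the function names (soundex, build_index,
phonetic_matches) are made up for illustration and are not any module's API.

```python
from collections import defaultdict

# Soundex digit for each consonant; vowels, h, w and y carry no digit.
_CODES = {}
for _letters, _digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(word):
    """Return the 4-character Soundex key (a stand-in for a metaphone key)."""
    word = "".join(ch for ch in word.lower() if ch.isalpha())
    if not word:
        return ""
    key = word[0].upper()
    prev = _CODES.get(word[0], "")
    for ch in word[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            key += code
        # h and w do not separate repeated codes; vowels do.
        if ch not in "hw":
            prev = code
    return (key + "000")[:4]


def build_index(usernames):
    """Precompute key -> [usernames] once, before any mail arrives."""
    index = defaultdict(list)
    for name in usernames:
        index[soundex(name)].append(name)
    return index


def phonetic_matches(index, candidate):
    """Real usernames whose key equals the key of the incoming local part."""
    return index.get(soundex(candidate), [])


index = build_index(["cathy", "robert", "tim"])
print(phonetic_matches(index, "catthy"))
print(phonetic_matches(index, "rupert"))
```

The lookup is a single hash access per incoming address, which is why the
keys are worth generating ahead of time.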

You can also run an Edit Distance check (how many edits are required to get 
from one string to another). In this case you need to calculate the edit 
distance from the source to each target string in turn, so it's more CPU 
intensive, but not too much so. There are Perl modules for it, and here's a 
good page describing it: http://www.merriampark.com/ld.htm

Metaphone is particularly designed with names in mind, whereas edit distance 
is used for matching any words (but is very good at matching simple typos 
like "cahty" or "catthy" for "cathy").
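
As an illustration of the edit distance check, here is a minimal Levenshtein
sketch in Python that compares one incoming name against each real username
in turn. The names edit_distance and closest_usernames, and the max_distance
cutoff of 2, are assumptions chosen for the example. Note that plain
Levenshtein counts a transposition as two edits.

```python
def edit_distance(source, target):
    """Levenshtein distance: inserts, deletes and substitutions cost 1 each."""
    previous = list(range(len(target) + 1))
    for i, s_ch in enumerate(source, start=1):
        current = [i]
        for j, t_ch in enumerate(target, start=1):
            cost = 0 if s_ch == t_ch else 1
            current.append(min(previous[j] + 1,          # delete from source
                               current[j - 1] + 1,       # insert into source
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]


def closest_usernames(candidate, usernames, max_distance=2):
    """Score the candidate against every username; keep the near misses."""
    scored = [(edit_distance(candidate, name), name) for name in usernames]
    return sorted((d, n) for d, n in scored if d <= max_distance)


users = ["cathy", "robert", "tim"]
print(closest_usernames("catthy", users))
print(closest_usernames("cahty", users))
```

Each comparison costs time proportional to the product of the two string
lengths, and it has to be repeated for every username, which is where the
extra CPU goes compared with a precomputed key lookup.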

I'm developing a tool for law enforcement markets (www.i2.co.uk) that, amongst 
other things, does this particularly for names, and we're using a mix of 
double metaphone, edit distance, regexps, some hand-tuned tricks, and a 
manual list of synonyms (James = Jim = Jimmy = Jack?).

I've got a (perl) webpage that illustrates results from various methods - if I 
get a chance I'll put it up somewhere public and let you know.

Cheers

--
Tim
