Thanks, Tim. Both algorithms are in perl already.
With all usernames being known and a small set,
it might even run fast and require little or no hand
work on synonyms or regex.

"edit distance" is the perfect cross-check for soundex.
"Levenshtein distance (LD)". That would correct for
the downfall of phonetic blacklisting, which would give
false positives for typos consisting of 1) hitting the adjacent
key, and of 2) reversing two chars. Levenshtein distance
checking would catch those. It's a perfect marriage
(Text::DoubleMetaphone and Levenshtein for distinguishing
username@ as being not even close to any username, then
take your pick of options such as DENY_whatever or divert
to spam folder).

"edit distance" can non-blacklist the *mechanical* typos
and metaphone can non-blacklist the *mental* typos
("kathy" "cathy" "kathey[allow this favorite joe-job!]").
When they both do not weakly non-blacklist, blacklist the
wankers, or accept to spam folder instead of bouncing
no such user(as if somebody ought to know).

http://www.merriampark.com/ld.htm#OTHER

" Jorge Mas Trullenque points out that "the calculation needs
O(n) memory, so using a two-dimensional matrix in a practical
implementation is wasteful." He has written an implementation
in Perl that uses only one one-dimensional vector."

http://www.mgilleland.com/ld/ldperl2.htm

Levenshtein Distance Algorithm: Perl Implementation

by Jorge Mas Trullenque


sub levenshtein($$){ my @A=split //, lc shift; my @B=split //, lc shift; my @W=([EMAIL PROTECTED]); my ($i, $j, $cur, $next); for $i (0..$#A){ $cur=$i+1; for $j (0..$#B){ $next=min( $W[$j+1]+1, $cur+1, ($A[$i] ne $B[$j])+$W[$j] ); $W[$j]=$cur; $cur=$next; } [EMAIL PROTECTED]; } return $next; }

sub min($$$){
 if ($_[0] < $_[2]){ pop @_; } else { shift @_; }
 return $_[0] < $_[1]? $_[0]:$_[1];
}

print levenshtein("gambol","gumbo");
#print " ";
print levenshtein("gumbo", "gambol");
#print " ";
print levenshtein("gumbo", "bumble");


http://search.cpan.org/dist/Text-DoubleMetaphone/ http://search.cpan.org/src/MAURICE/Text-DoubleMetaphone-0.07/DoubleMetaphone.pm

Tim Meadowcroft wrote:

On Friday 25 Feb 2005 01:03, John Peacock wrote:


Bob wrote:


Is there an existing filter that could determine if a username@
is 60% or more mis-spelled as compared to real usernames?
60% is arbitrary and would be configurable. If so, that would
serve to make a fuzzy honeypot filter for dictionary spam.


There are a couple of modules on CPAN which do something like this, called
Text::Metaphone and newer and better Text::DoubleMetaphone. They both
convert a word into something like the Soundex algorithm that was invented
for the US Census. They produce a value for what a given word "sounds
like" so that similarly pronounced words have similar values.



Double Metaphone is very useful and much better than Soundex.
You can generate metaphone keys for all your existing names up front, and then when you get an email you generate its key and look for matches.


You can also run an Edit Distance check (how many edits are required to get from one string to another) but in this case you need to calculate the edit distance from the source to each target string in turn so it's more CPU intensive, but not too much so. There are perl modules for it, but here's a good page describing it http://www.merriampark.com/ld.htm

Metaphone is particularly designed with names in mind, whereas edit distance is used for matching any words (but is very good at matching simple typos like "cahty" or "catthy" for "cathy").

I'm developing a tool for......

I've got a (perl) webpage that illustrates results from various methods - if I 
get a chance I'll put it up somewhere public and let you know.

Cheers

--
Tim







Reply via email to