Hi Chris -- sorry to hear it didn't work out, but thanks for the great analysis!
--j.

Chris St. Pierre writes:

> If anyone's curious, I did some follow-up research on the ideas below
> and found them to be, generally, totally infeasible.
>
> I downloaded the TREC corpus and generated a list of words that
> commonly appeared in spam. I used the top 1000 most common words of
> greater than four letters in the TREC spam that were NOT in the top
> 1000 most common >4 letter words in the TREC ham.
>
> I then did two sets of tests on a few sample hams and spams, and the
> results convinced me that it was not even necessary to run the tests
> on the whole corpus.
>
> For each message, I compared each word of greater than four letters
> against each word in my spam wordlist using the Wagner-Fischer
> distance, a slightly modified Levenshtein distance. With W-F, I was
> able to give greater weight to letter replacements, so "viagna" would
> be further from "viagra" than, say, "viagrra." I also compared the
> Metaphone representation of each word of >4 letters with the Metaphone
> hashes of each word in my spam wordlist, again with Wagner-Fischer. I
> discarded those distances that were too high and then computed a score
> for each message with the following formula:
>
>     <metaphone_length> ^ 2 / (<metaphone_distance> + 1) +
>     <word_length> ^ 2 / (<distance> + 1)
>
> I ran this on the first ten spams and hams in the corpus. The mean
> score for spams was 365.7 and the median was 12.5; the mean score for
> hams was 3715.565 and the median was 1103.6. More than anything, the
> results seem to indicate the length of the message rather than its
> spamminess.
>
> Processor time was also a problem; the largest message scanned took
> over 23 minutes to process. The quickest was under 3 seconds, but the
> average was around 45 seconds, with ham taking much longer to process
> than spam.
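For anyone who wants to take Chris up on his offer and experiment, here is a minimal sketch (my own illustration, not Chris's actual code) of the two pieces he describes: a Wagner-Fischer edit distance with substitutions weighted above insertions/deletions, and the per-word scoring formula quoted above. A real reimplementation would also need a Metaphone encoder, e.g. from a phonetics library, which is omitted here.

```python
def weighted_edit_distance(a, b, sub_cost=2, indel_cost=1):
    """Wagner-Fischer dynamic program; substitutions cost more than
    insertions or deletions, so a replaced letter ("viagna") lands
    further from "viagra" than a doubled one ("viagrra")."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i * indel_cost
    for j in range(n + 1):
        d[0][j] = j * indel_cost
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if a[i - 1] == b[j - 1]:
                d[i][j] = d[i - 1][j - 1]
            else:
                d[i][j] = min(d[i - 1][j - 1] + sub_cost,   # substitution
                              d[i - 1][j] + indel_cost,     # deletion
                              d[i][j - 1] + indel_cost)     # insertion
    return d[m][n]

def word_score(word_len, dist, meta_len, meta_dist):
    """The scoring formula from the message above, transcribed directly."""
    return meta_len ** 2 / (meta_dist + 1) + word_len ** 2 / (dist + 1)

print(weighted_edit_distance("viagna", "viagra"))   # 2 (one substitution)
print(weighted_edit_distance("viagrra", "viagra"))  # 1 (one insertion)
print(word_score(6, 2, 5, 1))                       # 24.5
```

This reproduces the ordering Chris mentions: the letter-replacement typo scores as more distant than the letter-doubling one.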
>
> Running either test individually -- the plain-text W-F distance or the
> Metaphone W-F distance -- did not show an appreciable improvement in
> the accuracy of the algorithm, although the processing time improved.
>
> It's too bad this won't work, although if someone else wants to take a
> crack at it, I'd be happy to share my code, word lists, etc.
>
> Chris St. Pierre
> Unix Systems Administrator
> Nebraska Wesleyan University
>
> On Thu, 5 Oct 2006, Chris St. Pierre wrote:
>
> >One thing I've wondered/thought about is using the Levenshtein
> >distance between the words in an email and a list of spam words
> >(ideally pulled from the bayes db). In this case, all of the
> >misspelled words in that sample have an L-distance of 1 from the real
> >word -- in other words, they're *very* close.
> >
> >I think the problem would be that this would consume tons of
> >resources. Anything else, though, would be susceptible to other typo
> >attacks. For instance, say you took each email and replaced all
> >doubled letters with single letters: it wouldn't be long before you
> >were getting spam advertising "analr bictches" or the like.
> >
> >Chris St. Pierre
> >Unix Systems Administrator
> >Nebraska Wesleyan University
> >
> >On Wed, 4 Oct 2006, Eric A. Hall wrote:
> >
> >>On 10/4/2006 5:57 PM, Richard Doyle wrote:
> >>> I've been getting lots of porn site spam containing words with doubled
> >>> letters, like this one:
> >>>
> >>> Can anybody suggest a rule or ruleset to catch these double-letter
> >>> obfuscations? I'm using Spamassassin 3.1.4.
> >>
> >>You'd probably need to write a plug-in that used some kind of
> >>typo-matching logic to find porno words.
> >>
> >>Would be a good plug-in actually. Get busy :)
> >>
> >>--
> >>Eric A. Hall                                  http://www.ehsco.com/
> >>Internet Core Protocols  http://www.oreilly.com/catalog/coreprot/
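To make the "other typo attacks" point concrete: the naive countermeasure discussed in the quoted thread, collapsing doubled letters before matching, can be sketched in a few lines (my own illustration; `collapse_doubles` is a hypothetical name). It catches simple doublings but, exactly as Chris predicts, does nothing against inserted extra letters like "bictches".

```python
import re

def collapse_doubles(text):
    """Naively fold any doubled letter down to a single letter
    before matching against a spam wordlist. This only defeats
    the doubling trick, not insertions of a *different* letter."""
    return re.sub(r'([a-z])\1', r'\1', text.lower())

print(collapse_doubles("porrn sitte"))  # "porn site" -- doubling caught
print(collapse_doubles("bictches"))     # "bictches" -- insertion slips through
```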