Chris, thanks for your detailed analysis!

Please don't be discouraged, as you're generally on the right track;
you just need to do some fine-tuning.

Since last spring, I've been running some word tests that include
something similar to the de-obfuscation approach you've described, and
have had good performance and excellent efficacy.

>I downloaded the TREC corpus and generated a list of words that 
>commonly appeared in spam. I used the top 1000 most common words of 
>greater than four letters in the TREC spam that were NOT in the top 
>1000 most common >4 letter words in the TREC ham.

That's a great approach for eliminating tokens found in ham; however,
it may be weak at picking spam tokens, mostly due to spammer
obfuscations.  I would be VERY interested in seeing your word list.

For your next iteration, perhaps use your de-obfuscation algorithm to
find and merge matches in the initial spam list, then continue as
before.  That should somewhat improve the list quality.
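Something like this merge step, in other words (a rough sketch; the
deobfuscate() normalizer here is hypothetical, and yours will differ):

```python
import re
from collections import Counter

def deobfuscate(token):
    # Hypothetical normalizer: keep letters only, collapse repeats
    # ("v-i-a-g-r-a" and "viiagra" both become "viagra").  Note that
    # legitimate double letters collapse too, so compare word lists
    # in the same normalized space.
    letters = re.sub(r"[^a-z]", "", token.lower())
    return re.sub(r"(.)\1+", r"\1", letters)

def merge_obfuscated(spam_counts):
    """Fold obfuscated variants of a token into one merged count."""
    merged = Counter()
    for token, count in spam_counts.items():
        merged[deobfuscate(token)] += count
    return merged
```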

The length of your list is a big part of your performance issues.
Do a careful manual review of the list, both reducing it and
classifying tokens by type of spam they're most likely to occur in,
for example: stock scams, fake degrees, sundry, and porn.

What I do is group, then sub-group the tokens, with each sub-group
having a different weighting, then score only if the total from any
ONE entire group is high enough.  Typically this means about 5 words
need to hit.  For example, my fake degrees group includes (among
others): nonaccredited, bachelor, classroom, degree, doctorate,
experience, graduation, mba, phd, prestigious, qualifications,
university.  Those are split into 4 different weighting sub-groups,
with "nonaccredited" being by itself and having the highest
weighting, and "university" having the lowest.
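In sketch form, the grouping looks like this (the weights and the
threshold here are illustrative, not my actual values):

```python
# Sub-groups keyed by weight; each group is scored independently.
GROUPS = {
    "fake_degrees": {
        3.0: {"nonaccredited"},
        2.0: {"doctorate", "mba", "phd"},
        1.0: {"bachelor", "degree", "graduation", "prestigious"},
        0.5: {"classroom", "experience", "qualifications", "university"},
    },
}
THRESHOLD = 5.0  # roughly "about 5 words need to hit"

def group_scores(tokens):
    """A group contributes only if its OWN total clears the threshold."""
    seen = set(tokens)
    hits = {}
    for group, subgroups in GROUPS.items():
        total = sum(weight
                    for weight, words in subgroups.items()
                    for word in words if word in seen)
        if total >= THRESHOLD:
            hits[group] = total
    return hits
```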

I also score differently depending on the type of matching:
exact, gappy, fuzzy.  "Exact" is self-explanatory, "gappy" looks for
tokens divided only by whitespace and/or non-alphanumerics, and
"fuzzy" is pretty much the algorithm you described (favors duplicated
letters). There's an optional bonus score for matches that occur at
the beginning of lines (which I only use for my stock group).
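A rough sketch of the three match types (the patterns here are
illustrative, not my production regexes):

```python
import re

SEP = r"[\W_]*"  # any run of whitespace and/or non-alphanumerics

def gappy_pattern(token):
    # token letters divided only by whitespace/non-alphanumerics
    return re.compile(SEP.join(re.escape(c) for c in token), re.IGNORECASE)

def fuzzy_pattern(token):
    # like gappy, but each letter may also be duplicated ("viiagra")
    return re.compile(SEP.join(re.escape(c) + "+" for c in token),
                      re.IGNORECASE)

def match_kind(token, text):
    """Return the strongest match type for token in text, or None."""
    if re.search(r"\b%s\b" % re.escape(token), text, re.IGNORECASE):
        return "exact"
    if gappy_pattern(token).search(text):
        return "gappy"
    if fuzzy_pattern(token).search(text):
        return "fuzzy"
    return None
```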

The single most useful group uses "exact"+"gappy" tests on a set of
stock symbols and scammer phone numbers.  I typically check for new
symbols daily, and update my list.  This has all but eliminated text
stock spams.

I've implemented this all in a little filter (written in a compiled
language) that runs after SA.  Average run time JUST for word tests
is about 60 milliseconds, using about 150-200 tokens.  The code was
written for clarity, so I'm sure I could speed that up some, but
haven't had the incentive (yet).  FP rate has been zero for the
groups I've classified as reliable (stocks, degrees, porn), and very
low for the more aggressive groups.

Your system is much larger than mine, so not all of this would work
as well for you, but I had to give you some encouragement. :)

Thanks for the great algorithm description, including terminology.
I'll review some of that the next time I tweak my tests.
        - "Chip"

