Here's a 30-second overview:

The most popular spam filtering technology is called "Bayesian filtering".
This method works, essentially, by assigning a score to every word of a
message.  The scores of all the words in a message are added up and
averaged - if the average is (let's say) above zero then the message is
not spam.

This is how most "learning" spam filters work: when you mark a message as
spam the score of each word in it is reduced a bit.  When you mark a message
as "good" the score of each word in it is increased a bit.  New words are
scored neutrally.
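
As a rough sketch of that learning step (following the simplified "add a
bit, subtract a bit" picture above, not the probability math a real
Bayesian filter uses), something like the following would do.  The names
tokenize and train and the STEP value are made up for illustration.

```python
import re
from collections import defaultdict

STEP = 0.1  # hypothetical size of "a bit"; not taken from any real filter


def tokenize(text):
    """Split a message into lowercase words (a deliberately crude tokenizer)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(scores, text, is_spam):
    """Lower each word's score for spam, raise it for good mail."""
    delta = -STEP if is_spam else STEP
    for word in set(tokenize(text)):  # count each word once per message
        scores[word] += delta


# New words default to 0.0, i.e. they are scored neutrally.
scores = defaultdict(float)
train(scores, "cheap viagra for the win", is_spam=True)
train(scores, "lunch at the office tomorrow", is_spam=False)

print(scores["viagra"], scores["office"], scores["the"])
```

Here "the" appears once in each kind of message, so the two adjustments
cancel and it ends up neutral again.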

This usually means that common words like "a", "the" and "and" are
effectively neutral while common spam words like "penis", "Viagra", "sex",
etc. are scored very low (since they appear mostly in spam).  Words very
specific to you like your name, your children's names, your workplace, etc.
are then scored very high (since they appear almost only in good mail).

All the scores of all the words in a message are added up and averaged and
the resulting message score is used to determine whether the message is spam
or not.
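
The averaging itself might look roughly like this.  The word scores and
the zero threshold are invented for the example, and message_score and
looks_good are hypothetical names, not any particular filter's API.

```python
import re

# Hypothetical word database: negative = spammy, positive = good, 0 = neutral.
WORD_SCORES = {"viagra": -1.0, "sex": -1.0, "jim": 1.0, "the": 0.0, "and": 0.0}


def message_score(text, word_scores):
    """Average the scores of all words; unknown words count as neutral (0.0)."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    if not words:
        return 0.0
    return sum(word_scores.get(w, 0.0) for w in words) / len(words)


def looks_good(text, word_scores, threshold=0.0):
    """A message is treated as not spam when its average is above the threshold."""
    return message_score(text, word_scores) > threshold


print(looks_good("Jim, the meeting is at noon", WORD_SCORES))
print(looks_good("viagra and sex", WORD_SCORES))
```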

From that it should be clear that there are two ways (at least) to muck up
this system:

1) Misspelled words may not be in the database and are thus scored
neutrally.  This is why you see so many variations (the addition of symbols
and numbers for example) on common spam words.  Of course the more these
misspellings are used the worse their scores will get - and the more
creative the misspellings will have to get.
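
To illustrate the first trick with made-up numbers: a misspelling starts
out neutral because it isn't in the database, and only picks up a bad
score once somebody marks it as spam.  The functions word_score and
mark_as_spam are hypothetical.

```python
# Hypothetical word database; the scores are invented for illustration.
word_scores = {"viagra": -1.0, "cheap": -0.5}


def word_score(word, scores):
    """Look up a word; anything not in the database is neutral (0.0)."""
    return scores.get(word.lower(), 0.0)


def mark_as_spam(words, scores, step=0.25):
    """Reduce the score of every word in a message reported as spam."""
    for w in words:
        w = w.lower()
        scores[w] = scores.get(w, 0.0) - step


before = word_score("V1agra", word_scores)  # not in the database yet
mark_as_spam(["V1agra"], word_scores)       # a user reports the message
after = word_score("V1agra", word_scores)   # the misspelling is now penalized

print(before, after)
```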

2) Lots of "good" words in a message can outweigh a few "bad" words in the
average.  This is exactly why you see blocks of non-spam text appear in spam
messages.  Throwing a few paragraphs of "Tom Sawyer" into a message can push
its score up to the "good" level.
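
The padding trick is easy to see with a toy average.  The scores below
are invented, and average_score is just an illustrative helper.

```python
# Hypothetical scores: two spammy words and a few harmless "novel" words.
WORD_SCORES = {
    "cheap": -1.0, "viagra": -1.0,
    "tom": 0.5, "fence": 0.5, "aunt": 0.5, "river": 0.5,
}


def average_score(words):
    """Average the per-word scores; unknown words count as neutral (0.0)."""
    return sum(WORD_SCORES.get(w, 0.0) for w in words) / len(words)


spam = ["cheap", "viagra"]
padding = ["tom", "fence", "aunt", "river", "tom", "fence"]

plain = average_score(spam)             # both words are bad
padded = average_score(spam + padding)  # same bad words, diluted by padding

print(plain, padded)
```

With these numbers the bare spam averages well below zero, while the
padded version creeps just above it.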

Word scoring is also the reason that so many spams place all of their text
in images - there's no way for the filter to "read" the message.  This is
losing popularity now, however, as more and more people are just blocking
all images in mail.

Of course this is only one of the many spam filters around and any decent
tool will use several in conjunction.  But every filter has a weak spot and
it's definitely a game of leapfrog - which is why we still get spam even
with the best filters.

Jim Davis