On May 16, 2004, at 9:27 AM, Paul Berkowitz wrote:
Spammers can pretty quickly work around a lot of the "learning" of Bayesian filters. That's why you're seeing lots of plain text spam that includes long paragraphs of long Latinate words - they often get around filters which have learned to allow messages with some of these words.
It's clear that the random text included with many of today's spam messages is *meant* to throw off content-based filters. But I have yet to see *any* Bayesian filter fall into this trap. There are basically two reasons why it doesn't work. First, however much random stuff is in the message, there is still going to be some spammy content, and the filter will find it. Second, the random paragraphs just don't look like your good mail--to you, obviously, or to the filter.
It's possible that the spammer will get lucky, and one or two of the words will be ones that previously only appeared in your good mail. But this rarely happens, and it's generally not enough to tip the filter away from thinking that the message is spam. In fact, I've found that the random words generally make it *easier* for the filter to identify the message as spam, as it learns to recognize the patterns. Some of the random words become good spam indicators, because, although they sound innocuous, they also never appear in my good mail.
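To make that concrete, here is a toy sketch of a word-counting Bayesian scorer. It is only an illustration under my own assumptions, not the algorithm of SpamSieve or any other real filter: the TinyBayes class, the sample messages, and the add-one smoothing are all made up for the example. It shows a padded spam message still scoring as spam despite one "lucky" good-mail word, and a padding word turning into a spam indicator once the filter has been trained on that message.

```python
import math
from collections import Counter


class TinyBayes:
    """Hypothetical toy filter: counts how many messages of each kind contain a word."""

    def __init__(self):
        self.counts = {"spam": Counter(), "good": Counter()}
        self.messages = {"spam": 0, "good": 0}

    def train(self, text, label):
        self.messages[label] += 1
        # Count each word once per message (presence, not frequency).
        self.counts[label].update(set(text.lower().split()))

    def token_log_odds(self, token):
        # Add-one smoothing (an assumption of this sketch) keeps unseen words neutral.
        p_spam = (self.counts["spam"][token] + 1) / (self.messages["spam"] + 2)
        p_good = (self.counts["good"][token] + 1) / (self.messages["good"] + 2)
        return math.log(p_spam / p_good)

    def score(self, text):
        # Positive total means the message looks more like spam than good mail.
        return sum(self.token_log_odds(t) for t in set(text.lower().split()))


f = TinyBayes()
for msg in ("lunch meeting tomorrow about the budget report",
            "please review the attached budget draft",
            "see you at lunch tomorrow"):
    f.train(msg, "good")
for msg in ("cheap pills buy now",
            "buy cheap watches now",
            "cheap pills discount now"):
    f.train(msg, "spam")

# Spam padded with two never-seen words plus one "lucky" good-mail word.
padded = "buy cheap pills now perspicacious obfuscation lunch"

before = f.token_log_odds("perspicacious")  # never seen: neutral
padded_score = f.score(padded)              # spammy words outweigh "lunch"

f.train(padded, "spam")                     # user marks the message as spam
after = f.token_log_odds("perspicacious")   # padding word now leans spam

print(before, padded_score, after)
```

In this sketch the unseen padding words contribute nothing to the score, so the message is judged on its spammy words, and the single good-mail word is outvoted. After training, the padding word has appeared in spam but never in good mail, so it counts against any later message that reuses it.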
Instead, MS plan to update the filter from time to time. They're using the same Junk filter engines as Outlook, MSN and other MS programs - which gives them a HUGE sample to practice on - and learn from - much more than you could ever do with your own little Bayesian checker.
Personal Bayesian filters don't need a huge sample of messages. They achieve their high accuracy because they learn from the actual mail that *you* receive. Training my filter with other people's mail would probably reduce its effectiveness.
Anyway, my intent is not to get religious over filter technologies. In the end, what matters is how well the filter works. I just want to dispel the myth that spammers can "pretty quickly work around" Bayesian filters. They've certainly tried, but they haven't been successful.
--
Michael Tsai (SpamSieve developer) <http://www.c-command.com>
