Paul Boven wrote:

Hi everyone,

Here are some observations on using Bayes and autolearning I would like to share, and have your input on.

Okay!


Some suggestions on improving the performance of the Bayes system:

1.) Messages that have been manually submitted should have a higher 'weight' in the Bayes statistics than autolearned messages.

I agree with you there. It seems to make good sense.


2.) There should be a framework within SpamAssassin that makes it easy for end-users to submit their spam for training. Currently, there are all kinds of scripts available outside the main SpamAssassin distribution (I've written my own, too) that attempt to get the message out of the mail-client or server and as close as possible to the original, to feed back to Bayes. That is close to impossible with some of the mail-servers out there. SpamAssassin currently only includes half the Bayes interface: you can have auto-learning, but for manual learning or retraining you're on your own to some extent.

This I have to disagree with you on. SA is used on too many different types of systems in too many different environments for it to make any sort of sense to try to concoct a one-size-fits-all solution to learning. A better approach would be a one-stop source of information on how to implement learning in various environments, perhaps here: http://wiki.apache.org/spamassassin/BayesInSpamAssassin
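To illustrate why the site-specific approach is usually enough: most glue scripts boil down to collecting user-submitted messages into a folder and handing it to sa-learn, the trainer that ships with SpamAssassin. A minimal sketch (the folder paths and wrapper function are hypothetical, not part of SpamAssassin):

```python
# Sketch: build an sa-learn invocation for an mbox folder of
# user-submitted mail. sa-learn and its --spam/--ham/--mbox options
# are the stock SpamAssassin trainer; the paths here are made up.
def sa_learn_command(mbox_path, as_spam):
    """Return the command line to train Bayes on one mbox folder."""
    kind = "--spam" if as_spam else "--ham"
    return ["sa-learn", kind, "--mbox", mbox_path]

cmd = sa_learn_command("/home/user/mail/spam-to-learn", as_spam=True)
print(" ".join(cmd))
# At a real site you would then run it, e.g.:
#   subprocess.run(cmd, check=True)
```

The hard, site-specific part is everything before this step: getting the message out of the MUA or IMAP store unmangled, which is exactly what no one-size-fits-all framework can solve.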



3.) Message classification should not be based on something as fragile as a mail-header or a checksum thereof, but on the actual content. The goal of this classifier should be to identify a message as having been learned before, despite whatever has happened to it after it went through SpamAssassin.

I agree that basing the classification on message IDs is "fragile", but I'm not sure that any other approach would be better. Perhaps an MD5 sum of the contents, not including headers or attachments? It would require a fair bit of testing of various methods in real-world environments before you could authoritatively say that one method is clearly superior to the one currently used.
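The MD5-of-body idea could look something like this. This is only a sketch of one candidate method, not SpamAssassin's actual classification: it hashes the decoded text of the first text/plain part, so rewritten headers or re-encoded MIME boundaries don't change the digest.

```python
import email
import hashlib

def body_digest(raw_message: bytes) -> str:
    """Hash only the message body, ignoring headers and attachments.

    Assumption: the first text/plain part is 'the content'. Line
    endings are normalised so the same body always hashes the same.
    """
    msg = email.message_from_bytes(raw_message)
    for part in msg.walk():
        if part.get_content_type() == "text/plain":
            payload = part.get_payload(decode=True) or b""
            return hashlib.md5(payload.replace(b"\r\n", b"\n")).hexdigest()
    return hashlib.md5(b"").hexdigest()

raw = b"From: a@example.org\nSubject: test\n\nHello, world.\n"
print(body_digest(raw))
```

Note what this sketch does not survive: footers appended by list software or disclaimers added by gateways would still change the digest, which is why real-world testing would be needed before trusting any one method.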



4.) The Bayes subsystem should store this classification, and all the tokens it learned. That way we can be sure that we correctly unlearn an autolearned message. The entries in this database could be timestamped so they can be removed after some months, to prevent unlimited growth.



Sounds like a good idea. However, my Bayes database is already about 60MB. A significantly larger database may be a problem for some systems with limited storage space.
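For concreteness, the bookkeeping point 4 asks for might look like the sketch below: remember which tokens each learned message contributed, timestamped so old entries can be expired. The schema and names are illustrative only, not SpamAssassin's actual store, and they show where the extra storage cost comes from: one row of tokens per learned message.

```python
import sqlite3
import time

# Illustrative store for point 4: per-message learning records,
# keyed by a content-based digest (see point 3), with a timestamp
# so entries can be expired to bound database growth.
db = sqlite3.connect(":memory:")
db.execute("""CREATE TABLE learned (
    digest     TEXT PRIMARY KEY,  -- content-based message identifier
    tokens     TEXT,              -- tokens credited to this message
    as_spam    INTEGER,           -- 1 = learned as spam, 0 = as ham
    learned_at INTEGER            -- unix timestamp, for expiry
)""")

def remember(digest, tokens, as_spam):
    db.execute("INSERT OR REPLACE INTO learned VALUES (?,?,?,?)",
               (digest, " ".join(tokens), int(as_spam), int(time.time())))

def tokens_to_unlearn(digest):
    """Return exactly what was learned, so unlearning reverses it."""
    row = db.execute("SELECT tokens, as_spam FROM learned WHERE digest=?",
                     (digest,)).fetchone()
    return (row[0].split(), bool(row[1])) if row else None

def expire(max_age_days=90):
    """Drop records older than max_age_days to prevent unlimited growth."""
    cutoff = int(time.time()) - max_age_days * 86400
    db.execute("DELETE FROM learned WHERE learned_at < ?", (cutoff,))

remember("abc123", ["viagra", "free", "click"], as_spam=True)
print(tokens_to_unlearn("abc123"))
```

Storing the token list per message is what makes correct unlearning possible, but it is also exactly the growth the timestamped expiry has to keep in check.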


Bayes is a very powerful system, especially for recognising site-specific ham. But at the moment, approx. 30% of the spam that slips through my filter has 'autolearn=ham' set, and another 60% of the spam slipping through has a negative Bayes score to help it along. For the moment, I've disabled the autolearning in my Bayes system.

I'm not sure that my experiences are similar. I don't think that many of my false negatives are doing better than BAYES_50, but I'll take a closer look.



