Hi everyone,

Here are some observations on using Bayes and autolearning that I would like to share and get your input on.

Autolearning is turning out to be more trouble than it's worth. Although it helps the system get to know the ham we send and receive, and learn some of the spam on its own, it also tends to 'reward' the 'best' spammers out there. Spam that hits none of the rules (e.g. the current deluge of stock spam) drives the scores for all kinds of misspelled words towards the 'hammy' side of the curve, which makes it possible for more of that kind of junk to slip through even if it hits SURBLs or other rules.
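One partial mitigation, assuming the option names from a recent SpamAssassin 3.x local.cf (check your version's Conf documentation), is to tighten the autolearn thresholds rather than rely on the defaults:

```
bayes_auto_learn                    1      # set to 0 to disable autolearning entirely
bayes_auto_learn_threshold_nonspam  0.1    # only messages scoring below this are learned as ham
bayes_auto_learn_threshold_spam     12.0   # only messages scoring above this are learned as spam
```

Lowering the nonspam threshold (even to a negative value) makes it harder for rule-dodging stock spam to be autolearned as ham, at the cost of learning less legitimate ham automatically.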

The second weakness in the current Bayes setup concerns the 're-training' of the filter. Bayes assumes that when a mail is submitted for training, it will first be 'forgotten' and then correctly learned as spam (or ham). But in order to 'forget', SpamAssassin must be able to recognise that the submitted message is the same as a previously autolearned one. Currently this is done by checking the Msg-ID or a checksum of the headers. There are two pitfalls here. Firstly, the resubmitted message is never exactly the same as the original: it has made another hop to the mailstore, or has been mangled by Exchange or some user agent. Secondly, if the original Msg-ID was not used by the autolearner, the SA-generated Msg-ID will not match the original either. As soon as that happens, retraining becomes far less powerful: because the original faulty autolearning doesn't get 'forgotten', the retraining will mostly cancel it out, but never gets a chance to correct the Bayes scores for those tokens.
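A minimal sketch of the problem (not SpamAssassin's actual implementation; the function names and the normalisation are my own): a checksum over the headers changes as soon as a relay adds a Received line, while a digest over the whitespace-normalised body survives the extra hop.

```python
import hashlib

def header_checksum(raw_message: str) -> str:
    """Naive header-based ID: hash everything above the first blank line."""
    headers = raw_message.split("\n\n", 1)[0]
    return hashlib.sha1(headers.encode()).hexdigest()

def body_fingerprint(raw_message: str) -> str:
    """Content-based ID: hash the body with whitespace and case normalised."""
    body = raw_message.split("\n\n", 1)[1]
    normalised = " ".join(body.split()).lower()
    return hashlib.sha1(normalised.encode()).hexdigest()

original = "From: a@example.com\nSubject: test\n\nBuy now!"
# The same message after one more hop and some client re-wrapping:
relayed = ("Received: from mx.example.net\n"
           "From: a@example.com\nSubject: test\n\nBuy  now!\n")

print(header_checksum(original) == header_checksum(relayed))   # False
print(body_fingerprint(original) == body_fingerprint(relayed)) # True
```

Anything keyed on the headers fails to match after the extra hop, so the 'forget' step silently does nothing; a body-derived fingerprint still matches.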

The end-users at my site are fairly good at submitting their spam to the filter (and fairly vocal if the filter misses too much). But there are also accounts that are not read by humans, such as accounts that gate onto mailing lists. All of these get spam too, and that spam gets autolearned, sometimes in the wrong direction. With retraining only partially effective, as shown above, what happens in the end is that some spam, by sheer volume and sameness, manages to bias the filter in the wrong direction. Surely I'm not the only one who experiences this, because 'My Bayes has gone bad' is a frequent subject in this forum.

Some suggestions on improving the performance of the Bayes system:

1.) Messages that have been manually submitted should have a higher 'weight' in the Bayes statistics than autolearned messages.

2.) There should be a framework within SpamAssassin that makes it easy for end-users to submit their spam for training. Currently, all kinds of scripts are available outside the main SpamAssassin distribution (I've written my own, too) that attempt to get the message out of the mail client or server, as close as possible to the original, to feed back to Bayes; that is close to impossible with some of the mail servers out there. SpamAssassin currently includes only half the Bayes interface: you get auto-learning, but for manual learning and retraining you're largely on your own.

3.) Message identification should not rest on something as fragile as a mail header or a checksum thereof, but on the actual content. The goal should be to identify a message as having been learned before, regardless of what has happened to it after it passed through SpamAssassin.

4.) The Bayes subsystem should store this identification, and all the tokens it learned from the message. This way we can be sure that we correctly unlearn an autolearned message. The entries in this database could be timestamped so they can be removed after some months, to prevent unlimited growth.
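To make suggestions 1, 3 and 4 concrete, here is a small sketch of what such a journal could look like. Everything in it is hypothetical (the class, the weights, the fingerprint scheme are mine, not SpamAssassin's): each learned message is keyed by a content fingerprint, stored with its weight, token set and timestamp, so a later correction can first undo exactly what the autolearner did, and a manual submission counts more than an automatic one.

```python
import hashlib
import time
from collections import defaultdict

AUTO_WEIGHT, MANUAL_WEIGHT = 1, 3  # hypothetical: manual training counts 3x

class BayesJournal:
    def __init__(self):
        self.spam = defaultdict(int)  # token -> weighted spam count
        self.ham = defaultdict(int)   # token -> weighted ham count
        self.seen = {}                # fingerprint -> (is_spam, weight, tokens, timestamp)

    @staticmethod
    def fingerprint(body: str) -> str:
        """Content-based ID: survives extra hops and header mangling."""
        return hashlib.sha1(" ".join(body.split()).lower().encode()).hexdigest()

    def learn(self, body: str, is_spam: bool, manual: bool = False):
        fp = self.fingerprint(body)
        if fp in self.seen:          # re-training: forget the earlier entry first
            self.unlearn(fp)
        weight = MANUAL_WEIGHT if manual else AUTO_WEIGHT
        tokens = set(body.lower().split())
        table = self.spam if is_spam else self.ham
        for t in tokens:
            table[t] += weight
        self.seen[fp] = (is_spam, weight, tokens, time.time())

    def unlearn(self, fp: str):
        """Reverse exactly the counts a previous learn() added."""
        is_spam, weight, tokens, _ = self.seen.pop(fp)
        table = self.spam if is_spam else self.ham
        for t in tokens:
            table[t] -= weight

    def expire(self, max_age_seconds: float):
        """Drop journal entries older than max_age to bound database growth."""
        cutoff = time.time() - max_age_seconds
        for fp, (_, _, _, ts) in list(self.seen.items()):
            if ts < cutoff:
                del self.seen[fp]  # only the undo record expires; counts stay

j = BayesJournal()
j.learn("cheap st0ck tip", is_spam=False)               # faulty autolearn as ham
j.learn("cheap st0ck tip", is_spam=True, manual=True)   # user correction
print(j.ham["st0ck"], j.spam["st0ck"])  # 0 3
```

Because the correction finds the original entry via the content fingerprint, the wrong ham counts are removed entirely instead of merely being cancelled out, and the manual weight then pulls the tokens firmly to the spammy side.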

Bayes is a very powerful system, especially for recognising site-specific ham. But at the moment, approx. 30% of the spam that slips through my filter has 'autolearn=ham' set, and another 60% has a negative Bayes score to help it along. For now, I've disabled autolearning in my Bayes setup.

Regards, Paul Boven.
