Hey there, thanks for responding. That's an interesting point. Are you saying I should not use autolearning at all?
I don't have any way to review a large corpus of messages because we don't have access to them - after they run through our servers they are sent on, and the text of the message is not stored on our server. Man, I wish there was an easier way to feed Bayes an initial set of spam/ham to teach it properly... I've been told that letting it autolearn for a few hours/days would make it learn well enough, though. If only our mail server got 100 messages a day - then I could just manually mark them! :)

> -----Original Message-----
> From: RW [mailto:rwmailli...@googlemail.com]
> Sent: Wednesday, May 01, 2013 6:24 PM
> To: users@spamassassin.apache.org
> Subject: Re: Bayes Autolearning
>
> On Wed, 01 May 2013 22:02:43 +0100
> Steve Freegard wrote:
>
> > On 01/05/13 19:40, Andrew Talbot wrote:
> > > Hi, Steve -
> > >
> > > Thanks for your response. Is that just for performance reasons?
> > >
> >
> > Performance is one of the things that bayes_auto_learn_on_error 1
> > will give you. It means that if the message was already considered
> > spam by Bayes, then the message won't be autolearnt again, which
> > means a bit less IO. It will also result in the Bayes databases
> > being smaller, as with this option it is likely that fewer tokens
> > will be present overall, which will also save disk IO and space.
> >
> > But the key reason I like this option is that it doesn't allow Bayes
> > to overtrain in one direction (e.g. spam or ham). It only autolearns
> > when Bayes either has the wrong result or isn't sure, which IMO has
> > to be better for accuracy in the long run.
>
> The evidence from trials with Bogofilter (which is similar to Bayes)
> showed that initially train-on-everything significantly outperforms
> train-on-error. The latter asymptotically catches up after thousands
> of errors. It seems that the most important thing is to learn a few
> thousand hams and spams by any means; and train-on-error can take a
> long time to get there.
> For this reason DSPAM only allows train-on-error once 2500 hams have
> been learned.
>
> There *may* be advantages to train-on-error after this point, in
> preventing Bayes from becoming insensitive to learning.
>
> The chief problem with autolearning is learning ham. If you set a
> positive threshold, you end up learning a lot of spam as ham; if you
> set a negative threshold, you effectively turn over ham training to
> the DNS whitelists, since they are the only tests with significant
> negative scores that aren't excluded from autolearning. Any problems
> with mis-learning are likely to be exacerbated by train-on-error.
>
> If I had to use autolearning, I'd mark the DNS whitelists as
> noautolearn and write some negative-scoring, site-specific rules.
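
For reference, that last suggestion might look something like this in
local.cf (the DNSWL rule names are the stock SpamAssassin ones; the
site-specific rule below is a made-up illustration - swap in patterns
that match mail you know to be ham on your own site):

    # Keep the DNS whitelists from driving ham autolearning
    tflags RCVD_IN_DNSWL_LOW noautolearn
    tflags RCVD_IN_DNSWL_MED noautolearn
    tflags RCVD_IN_DNSWL_HI  noautolearn

    # Hypothetical site-specific negative-scoring rule; the rule name,
    # address and score are placeholders for illustration only
    header   LOCAL_FROM_HELPDESK From =~ /helpdesk\@example\.com/i
    score    LOCAL_FROM_HELPDESK -2.0
    describe LOCAL_FROM_HELPDESK Mail from our internal helpdesk

And on the initial-training wish above: if you can ever collect even a
small sample of mailboxes, sa-learn --spam and sa-learn --ham are the
usual way to feed Bayes a starting corpus by hand.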