Hey there, thanks for responding. That's an interesting point.

Are you saying I should not use autolearning at all? 

I don't have any way to review a large corpus of messages because we don't
have access to them - after they run through our servers they are sent on,
and the text of the message is not stored on our server. 

Man, I wish there was an easier way to feed Bayes an initial set of spam/ham
to teach it properly. I've been told that letting it autolearn for a few
hours/days would make it learn well enough, though.

If our mail server only got 100 messages a day, I could just manually mark
them! :)

> -----Original Message-----
> From: RW [mailto:rwmailli...@googlemail.com]
> Sent: Wednesday, May 01, 2013 6:24 PM
> To: users@spamassassin.apache.org
> Subject: Re: Bayes Autolearning
> 
> On Wed, 01 May 2013 22:02:43 +0100
> Steve Freegard wrote:
> 
> > On 01/05/13 19:40, Andrew Talbot wrote:
> > > Hi, Steve -
> > >
> > > Thanks for your response. Is that just for performance reasons?
> > >
> >
> > Performance is one of the things that bayes_auto_learn_on_error 1 will
> > give you.  It means that if the message was already considered spam by
> > Bayes, then the message won't be autolearnt again, which means
> > a bit less IO.  The Bayes databases will also likely be smaller,
> > since fewer tokens will be present overall, which also saves disk IO
> > and space.
> >
> > But the key reason I like this option is that it doesn't allow bayes
> > to overtrain in one direction (e.g. spam or ham).  It only autolearns
> > when Bayes either has the wrong result or isn't sure which IMO has to
> > be better for accuracy in the long run.
> 
> The evidence from trials with Bogofilter (which is similar to Bayes) showed
> that initially train-on-everything significantly outperforms train-on-error.
> The latter asymptotically catches up after thousands of errors. It seems
> that the most important thing is to learn a few thousand hams and spams by
> any means; and train-on-error can take a long time to get there. For this
> reason DSPAM only allows train-on-error when 2500 hams have been learned.
> 
> There *may* be advantages to train-on-error after this in preventing BAYES
> becoming insensitive to learning.
> 
> The chief problem with autolearning is learning ham. If you set a positive
> threshold you end up learning a lot of spam as ham; if you set a negative
> threshold you effectively turn over ham training to the DNS whitelists, since
> they are the only tests with significant negative scores that aren't
> excluded from autolearning. Any problems with mislearning are likely to be
> exacerbated by train-on-error.
> 
> If I had to use autolearning I'd mark the DNS whitelists as noautolearn and
> write some negative-scoring, site-specific rules.
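
For reference, the settings discussed in the quoted messages might look
roughly like this in local.cf. This is only a sketch under assumptions: the
threshold values are illustrative, the DNSWL rule names and their existing
tflags ("nice net") are assumed to match the stock rules, and
LOCAL_TRUSTED_PARTNER and partner.example.com are made-up placeholders.

```
# Autolearn only when Bayes was wrong or unsure (Steve's suggestion).
bayes_auto_learn           1
bayes_auto_learn_on_error  1

# Illustrative thresholds only; tune for your own mail flow.
bayes_auto_learn_threshold_nonspam  -0.5
bayes_auto_learn_threshold_spam     12.0

# Keep the DNS whitelists from driving ham autolearning (RW's suggestion).
# A tflags line is assumed to replace the rule's existing flags, so the
# stock ones (assumed to be "nice net") are repeated here.
tflags RCVD_IN_DNSWL_LOW  nice net noautolearn
tflags RCVD_IN_DNSWL_MED  nice net noautolearn
tflags RCVD_IN_DNSWL_HI   nice net noautolearn

# Hypothetical site-specific rule with a negative score.
header   LOCAL_TRUSTED_PARTNER  From:addr =~ /\@partner\.example\.com$/i
describe LOCAL_TRUSTED_PARTNER  Mail from a known business partner
score    LOCAL_TRUSTED_PARTNER  -1.0
```

Running "spamassassin --lint" after editing should catch any syntax errors
before the change goes live.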
