[Andrew] >> At what point is SPAMBayes sufficiently trained?

[Alex] > First, spambayes tends to work better when trained with
> similar amounts of spam and ham; you've currently got about a
> 4:1 ratio. I'd suggest retraining with closer to a 1:1 ratio,
> and turning off training while filtering (which will tend to
> drive you towards severely unbalanced training).
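[As an illustration of the rebalancing idea above: a minimal sketch of getting back to a 1:1 ratio by subsampling the larger corpus. The list-of-message-identifiers setup here is purely hypothetical, not SpamBayes's actual API.]

```python
import random

def balance_corpora(ham, spam, seed=0):
    """Subsample the larger corpus so ham and spam are trained 1:1.

    `ham` and `spam` are lists of message identifiers (a hypothetical
    stand-in; SpamBayes itself trains on message objects).
    """
    n = min(len(ham), len(spam))
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return rng.sample(ham, n), rng.sample(spam, n)

# With a 4:1 ham:spam ratio, only 100 of the 400 ham are kept:
ham, spam = balance_corpora([f"h{i}" for i in range(400)],
                            [f"s{i}" for i in range(100)])
```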
If you've got plenty of time to spend on this, you could figure out a way to use Skip's tte.py script (in contrib/ in the source) with your setup (sb_server, from memory). This enforces a 1:1 ratio, and also reduces the number of messages trained on. You can get SpamBayes to keep the cached messages around by increasing the cache expiry times.

You'd still want to use the review pages to correct any misclassifications, so I guess you'd have to modify the source (sb_server.py, ProxyUI.py or Corpus.py, probably) to not actually train when you do that (just move the message). Then you'd have two directories of classified messages that you could periodically give to tte.py* to build a database.

* I don't recall whether tte.py wants directories of individual messages or an mbox of messages. No doubt it could be modified to work either way.

=Tony.Meyer

_______________________________________________
spambayes-dev mailing list
spambayes-dev@python.org
http://mail.python.org/mailman/listinfo/spambayes-dev
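[For the record, the "just move the message" change suggested above could look roughly like the helper below: the message is relocated between the classified holding directories without touching the classifier database, so a later tte.py run can train from scratch on the corrected corpora. This is a hypothetical sketch, not the actual sb_server.py/ProxyUI.py code.]

```python
import os
import shutil

def reclassify_without_training(msg_filename, from_dir, to_dir):
    """Move a misclassified cached message from one holding directory
    to the other, deliberately *without* training on it.

    Hypothetical helper: in SpamBayes itself this would replace the
    train-on-correction step in the review-page handling.
    """
    os.makedirs(to_dir, exist_ok=True)
    shutil.move(os.path.join(from_dir, msg_filename),
                os.path.join(to_dir, msg_filename))
```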