[Andrew] >> At what point is SPAMBayes sufficiently trained?

[Alex] > First, spambayes tends to work better when trained with
> similar amounts of spam and ham; you've currently got about a
> 4:1 ratio. I'd suggest retraining with closer to a 1:1 ratio,
> and turning off training while filtering (which will tend to
> drive you towards severely unbalanced training).
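[As an illustration of the rebalancing idea above: a minimal sketch of getting back to a 1:1 ratio by subsampling the larger corpus. The list-of-message-identifiers setup here is purely hypothetical, not SpamBayes's actual API.]

```python
import random

def balance_corpora(ham, spam, seed=0):
    """Subsample the larger corpus so ham and spam are trained 1:1.

    `ham` and `spam` are lists of message identifiers (a hypothetical
    stand-in; SpamBayes itself trains on message objects).
    """
    n = min(len(ham), len(spam))
    rng = random.Random(seed)  # fixed seed so the subsample is reproducible
    return rng.sample(ham, n), rng.sample(spam, n)

# With a 4:1 ham:spam ratio, only 100 of the 400 ham are kept:
ham, spam = balance_corpora([f"h{i}" for i in range(400)],
                            [f"s{i}" for i in range(100)])
```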
If you've got plenty of time to spend on this, you could figure out a way to use Skip's tte.py script (in contrib/ in the source) with your setup (sb_server, from memory). This enforces a 1:1 ratio, and also reduces the number of messages trained on. You can get SpamBayes to keep the cached messages around by increasing the cache expiry times.

You'd still want to use the review pages to correct any misclassifications, so I guess you'd have to modify the source (sb_server.py, ProxyUI.py or Corpus.py, probably) to not actually train when you do that (just move the message). Then you'd have two directories of classified messages that you could periodically give to tte.py* to build a database.

* I don't recall whether tte.py wants directories of individual messages or an mbox of messages. No doubt it could be modified to work either way.

=Tony.Meyer

_______________________________________________
spambayes-dev mailing list
spambayes-dev@python.org
http://mail.python.org/mailman/listinfo/spambayes-dev
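[For the record, the "just move the message" change suggested above could look roughly like the helper below: the message is relocated between the classified holding directories without touching the classifier database, so a later tte.py run can train from scratch on the corrected corpora. This is a hypothetical sketch, not the actual sb_server.py/ProxyUI.py code.]

```python
import os
import shutil

def reclassify_without_training(msg_filename, from_dir, to_dir):
    """Move a misclassified cached message from one holding directory
    to the other, deliberately *without* training on it.

    Hypothetical helper: in SpamBayes itself this would replace the
    train-on-correction step in the review-page handling.
    """
    os.makedirs(to_dir, exist_ok=True)
    shutil.move(os.path.join(from_dir, msg_filename),
                os.path.join(to_dir, msg_filename))
```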