I haven't done as much testing on this as I'd like, but I've drifted away from it and wanted to write my thoughts down before I forget them.
I have a strong suspicion that SA's Bayes implementation sucks. The two major problems, as I see them:

1) Lack of learn-on-fail.
2) Lack of multi-word tokens.

While looking into this, I rediscovered that 9 years ago I did some testing that showed multi-word tokens work better than single-word tokens:
http://www.chaosreigns.com/adventures/entry.php?date=2003-10-06&num=01

It really blows my mind that we don't have these two features.

Learn-on-fail means that when you train an email as spam or ham, it first checks whether the email would already have been classified correctly, and only trains if it would have gotten it wrong. So it doesn't modify the database unless there's actual evidence that doing so would be beneficial. It was implemented for auto-learning in 2010, but not for manual training:
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6447

Multi-word tokens are probably obvious? Currently, SA's Bayes tokens are single words, and results are better when you also have two-word tokens. In 2003-2004 I wrote a Bayesian filter from scratch, and thought it was pretty neat how much you can get out of some tweaks to tokenization (do you only split on whitespace? What about non-alphanumeric characters?). The two-word token idea has been listed on http://wiki.apache.org/spamassassin/WeLoveVolunteers since 2004-02-24.

One of my questions is: does it make sense to continue to maintain Bayesian stuff within SA at all? Or should we drop it, and encourage people to run a pure Bayesian classifier before SA (like spamprobe), then have rules that read the headers from those classifiers? Are there options better than spamprobe?

On one hand, spamprobe has been around forever and almost certainly does Bayes at least as well as SA is ever likely to, it's pretty easy to run it before SA, and creating the rules to read those scores would be easy.
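
To make the two features concrete, here is a minimal Python sketch of two-word tokens and learn-on-fail. It is not SA's code: ToyBayes, tokenize and learn_on_fail are hypothetical names, and the classifier is a plain naive Bayes with add-one smoothing, used only so there is something to check a message against before training.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Split on runs of non-alphanumeric characters, then emit every
    single word followed by every pair of adjacent words."""
    words = [w for w in re.split(r"[^a-z0-9]+", text.lower()) if w]
    pairs = [a + " " + b for a, b in zip(words, words[1:])]
    return words + pairs


class ToyBayes:
    """Hypothetical toy classifier for illustration; not SpamAssassin's."""

    def __init__(self):
        self.tokens = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def classify(self, text):
        """Return 'spam' or 'ham' by comparing smoothed log-likelihoods.
        Ties (including an empty database) fall back to 'ham'."""
        vocab = len(set(self.tokens["spam"]) | set(self.tokens["ham"])) + 1
        seen = sum(self.messages.values())
        scores = {}
        for label in ("spam", "ham"):
            total = sum(self.tokens[label].values())
            score = math.log((self.messages[label] + 1) / (seen + 2))
            for tok in tokenize(text):
                score += math.log((self.tokens[label][tok] + 1) / (total + vocab))
            scores[label] = score
        return "spam" if scores["spam"] > scores["ham"] else "ham"

    def learn(self, text, label):
        """Unconditional training: always modifies the database."""
        self.tokens[label].update(tokenize(text))
        self.messages[label] += 1

    def learn_on_fail(self, text, label):
        """Train only if the current verdict disagrees with the label.
        Returns True if the database was modified."""
        if self.classify(text) == label:
            return False
        self.learn(text, label)
        return True
```

Training the same spam twice with learn_on_fail only touches the database the first time; the second call sees that the message is already classified correctly and leaves the counts alone.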
On the other hand, keeping the Bayes functionality within SA provides a tidier package that is a little easier to set up, with one fewer process spawned per email, and adding these two features really shouldn't be hard. Without even breaking backward compatibility with the existing database format, I hope.

The reason I'm playing with Bayes is my interest in the possible usefulness of shared Bayes data. I want to do more testing of other people's Bayes data on my corpora. My assumption is that most end users don't do their own training, so I haven't been using Bayes for some time, in an attempt to better see what typical end users see. But I suspect that taking multiple other people's Bayes databases, merging them, and using them on my corpora could be very useful. And if I can prove that, then we / I could distribute the merged data to more people.

I tried it with patdk-wk's (from IRC) data, and 1.18% of my ham hit BAYES_99, which I call terrible. But my hope is to see better results with data merged from multiple people.

So, please send me your Bayes data. Mailing me the output of "sa-learn --backup | gzip >> sa-learn.backup.yourname.gz" off list should do. Please let me know how much of it is hand-verified vs. auto-trained, and whether you're comfortable with it being distributed to others. Mine is:
http://www.chaosreigns.com/tmp/sa-learn.backup.darxus.gz
No auto-training.

Strong concern was expressed about the idea of merging Bayes DBs eight years ago:
http://mail-archives.apache.org/mod_mbox/spamassassin-users/200412.mbox/%3c20041204160616.ga2...@mail.herk.net%3E
I don't share that concern, but I also plan to find evidence that it's useful before suggesting anybody else try it.
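
As a rough idea of what merging could look like, here is a minimal Python sketch that sums counts across several dumps. It assumes the backup is tab-separated text with "v" lines for the num_spam / num_nonspam totals and "t" lines holding spam count, ham count, atime and token; treat that layout as an assumption to check against a real dump. merge_backups is a hypothetical name, and plain summing is just one possible strategy (it lets the largest database dominate).

```python
from collections import defaultdict


def merge_backups(dumps):
    """Merge several backup dumps (each an iterable of text lines).

    ASSUMED record layout (tab-separated), not verified here:
      v <value> num_spam | num_nonspam           message totals
      t <spam_count> <ham_count> <atime> <token>  per-token counts
    All other lines (version header, seen-message records) are ignored.
    """
    totals = {"num_spam": 0, "num_nonspam": 0}
    # token -> [spam count, ham count, most recent atime]
    tokens = defaultdict(lambda: [0, 0, 0])
    for lines in dumps:
        for line in lines:
            fields = line.rstrip("\n").split("\t")
            if fields[0] == "v" and len(fields) >= 3 and fields[2] in totals:
                totals[fields[2]] += int(fields[1])
            elif fields[0] == "t" and len(fields) >= 5:
                entry = tokens[fields[4]]
                entry[0] += int(fields[1])
                entry[1] += int(fields[2])
                entry[2] = max(entry[2], int(fields[3]))
    return totals, dict(tokens)
```

Feeding it the lines of two or more decompressed dumps gives combined totals and per-token counts, which would then need to be written back out in whatever form the restore side expects.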
To test Bayes, I grepped BAYES from the default rule set into ~/sa/bayesonly/local.cf, copied /etc/spamassassin/*.pre to ~/sa/bayesonly/, symlinked ~/.spamassassin/bayes* into masses/spamassassin, and ran:

./mass-check --bayes -c ~/sa/bayesonly/ --progress ham:dir:/home/darxus/masscheckwork/ham/ spam:dir:/home/darxus/masscheckwork/spam/

It's annoying that it doesn't seem possible to just run spamassassin with only the Bayes rules, instead of mass-check. It gives an error about not having any rules defined.

--
"I offer the modest proposal that our Universe is simply one of those things which happen from time to time."
- Is the Universe a Vacuum Fluctuation?
http://www.ChaosReigns.com