Future of SA's bayes implementation

darxus Fri, 09 Nov 2012 09:48:38 -0800

I haven't done as much testing on this as I'd like, but I've gotten away
from it, and wanted to get my thoughts in here before I forget them.

I have a strong suspicion that SA's bayes implementation sucks.

The two major problems, as I see them:
1) Lack of learn-on-fail.
2) Lack of multi-word tokens.

In the process I discovered that 9 years ago I did some testing that showed
multi-word tokens work better than single-word tokens:
http://www.chaosreigns.com/adventures/entry.php?date=2003-10-06&num=01

It really blows my mind that we don't have these two features.

Learn-on-fail means that when you train an email as spam or ham, the
classifier first checks whether it would already have classified the email
correctly, and only trains if it would have gotten it wrong. So it doesn't
modify the database unless there's actually evidence that the change
would be beneficial. It was implemented for auto-learning in 2010, but
not for manual training:
https://issues.apache.org/SpamAssassin/show_bug.cgi?id=6447
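
To make the idea concrete, here is a minimal sketch of learn-on-fail in
Python. It is not SA's code: ToyBayes, its whitespace tokenizer, and its
add-one smoothing are all assumptions made up for illustration. The only
part that matters is learn_on_fail(), which classifies first and trains
only on a miss.

```python
import math
from collections import Counter


class ToyBayes:
    """Tiny naive Bayes classifier (hypothetical, for illustration only)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def tokens(self, text):
        # Assumption: lowercase, split on whitespace, one count per message.
        return set(text.lower().split())

    def classify(self, text):
        """Return 'spam' or 'ham', or None if nothing has been trained yet."""
        if not any(self.messages.values()):
            return None
        total = sum(self.messages.values())
        scores = {}
        for label in ("spam", "ham"):
            n = self.messages[label]
            # Log prior and per-token likelihoods, both with add-one smoothing.
            score = math.log((n + 1) / (total + 2))
            for tok in self.tokens(text):
                score += math.log((self.counts[label][tok] + 1) / (n + 2))
            scores[label] = score
        return max(scores, key=scores.get)

    def learn(self, text, label):
        """Unconditional training: always modifies the database."""
        self.counts[label].update(self.tokens(text))
        self.messages[label] += 1

    def learn_on_fail(self, text, label):
        """Train only if the message would currently be misclassified.

        Returns True if the database was modified.
        """
        if self.classify(text) == label:
            return False
        self.learn(text, label)
        return True
```

Training the same message twice with learn_on_fail() changes the counts
only the first time, because the second attempt is already classified
correctly.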

Multi-word tokens are probably self-explanatory. Currently, SA's bayes
tokens are single words, and results are better when you also have
two-word tokens. In 2003-2004, I wrote a bayesian filter from scratch,
and thought it was pretty neat how much you can get out of some
tweaks to tokenization (do you only split on white space? What about
non-alpha-numeric characters?)
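
A rough sketch of what single-word plus two-word tokenization could look
like, in Python. The function name, the max_words parameter, and the
choice to split on anything that is not a letter, digit, or apostrophe
are all assumptions for illustration, not what SA or my old filter does.

```python
import re


def tokenize(text, max_words=2):
    """Return single-word tokens plus adjacent multi-word tokens.

    Hypothetical helper: with max_words=2 it emits every word and every
    pair of neighbouring words.
    """
    # Assumption: split on any run of characters that is not a-z, 0-9 or '.
    words = [w for w in re.split(r"[^a-z0-9']+", text.lower()) if w]
    tokens = []
    for n in range(1, max_words + 1):
        for i in range(len(words) - n + 1):
            tokens.append(" ".join(words[i:i + n]))
    return tokens


print(tokenize("Buy cheap pills!"))
# ['buy', 'cheap', 'pills', 'buy cheap', 'cheap pills']
```

Changing the regular expression is where the "split on white space vs.
non-alpha-numeric characters" question gets decided.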

The two-word token idea has been listed on
http://wiki.apache.org/spamassassin/WeLoveVolunteers since 2004-02-24.

One of my questions is, does it make sense to continue to maintain bayesian
stuff within SA at all? Or should we drop it, and encourage people to run
a pure bayesian classifier before SA (like spamprobe), then have rules that
read the headers from those classifiers? Are there options better than
spamprobe?

On one hand, spamprobe has been around forever and almost certainly does
bayes at least as well as SA is ever likely to. It's pretty easy to run it
before SA, and creating the rules to read those scores would be easy.

On the other hand, keeping the bayes functionality within SA provides a
tidier package, a little easier to set up, one fewer process spawned per
email, and adding these two features really shouldn't be hard, even
without breaking backward compatibility with the existing database
format. I hope.

The reason I'm playing with bayes is my interest in the possible usefulness
of shared bayes data.

I want to do more testing of using other people's bayes data on
my corpora. My assumption is that most end users don't do their own
training. So I haven't been using bayes for some time, in an attempt
to better see what typical end users see. But I suspect that taking
multiple other people's bayes databases, merging them, and using them
on my corpora, could be very useful. And if I can prove that, then we /
I could distribute it to more people.
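
By "merging" I mean roughly the following, sketched in Python. This is an
assumption about how a merge could work, not an existing tool: BayesData
and merge() are hypothetical names, the in-memory layout is made up, and
it assumes tokens from different databases are directly comparable. It
just sums the per-token spam/ham counts and the message totals.

```python
from dataclasses import dataclass, field


@dataclass
class BayesData:
    """Hypothetical in-memory view of one person's bayes data."""
    num_spam: int = 0
    num_ham: int = 0
    # token -> (spam_count, ham_count)
    tokens: dict = field(default_factory=dict)


def merge(databases):
    """Sum message totals and per-token counts across several databases."""
    merged = BayesData()
    for db in databases:
        merged.num_spam += db.num_spam
        merged.num_ham += db.num_ham
        for tok, (spam, ham) in db.tokens.items():
            cur_spam, cur_ham = merged.tokens.get(tok, (0, 0))
            merged.tokens[tok] = (cur_spam + spam, cur_ham + ham)
    return merged
```

A plain sum lets the largest database dominate. Whether that is
desirable, or whether each database should be scaled first, is one of
the things testing would have to show.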

I tried it with patdk-wk's (from IRC) data, and 1.18% of my ham hit
BAYES_99, which I call terrible. But my hope is to see better results with
data merged from multiple people.

So, please send me your bayes data. Mailing me the output of
"sa-learn --backup | gzip >> sa-learn.backup.yourname.gz" off list should do.
Please let me know how much it's hand verified vs. auto-trained. And let
me know if you're comfortable with it being distributed to others.

Mine is: http://www.chaosreigns.com/tmp/sa-learn.backup.darxus.gz
No auto-training.

There was strong concern expressed about the idea of merging bayes DBs
eight years ago:
http://mail-archives.apache.org/mod_mbox/spamassassin-users/200412.mbox/%3c20041204160616.ga2...@mail.herk.net%3E
I don't share that concern, but I also plan to find evidence that it's
useful before suggesting anybody else try it.

To test bayes, I grepped BAYES from the default rule set into
~/sa/bayesonly/local.cf, then copied /etc/spamassassin/*.pre to
~/sa/bayesonly/, symlinked ~/.spamassassin/bayes* into masses/spamassassin,
and ran:
  ./mass-check --bayes -c ~/sa/bayesonly/ --progress \
      ham:dir:/home/darxus/masscheckwork/ham/ \
      spam:dir:/home/darxus/masscheckwork/spam/

It's annoying that it doesn't seem possible to just run spamassassin with
only the bayes rules, instead of going through mass-check. spamassassin
gives an error about not having any rules defined.

--
"I offer the modest proposal that our Universe is simply one of those
things which happen from time to time."
- Is the Universe a Vacuum Fluctuation?
http://www.ChaosReigns.com
