On Fri, 9 Nov 2012 12:48:11 -0500 dar...@chaosreigns.com wrote:

> I haven't done as much testing on this as I'd like, but I've gotten
> away from it, and wanted to get my thoughts in here before I forget
> them.
>
> I have a strong suspicion that SA's bayes implementation sucks.
>
> The two major problems, as I see them:
> 1) Lack of learn-on-fail.
> 2) Lack of multi-word tokens.
>
> In the process I discovered that 9 years ago I did some testing that
> showed multi-word tokens work better than single-word tokens:
> http://www.chaosreigns.com/adventures/entry.php?date=2003-10-06&num=01
>
> It really blows my mind that we don't have these two features.
>
> Learn-on-fail means, when you train an email as spam or ham, it
> first checks the email to see if it would already have been classified
> correctly, and then only does any training if it would've gotten it
> wrong.
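For reference, "learn-on-fail" is what DSPAM calls TOE, train-on-error.
A minimal Python sketch of the idea; classify(), learn(), and
ham_count() here are hypothetical stand-ins, not SA's actual Bayes API:

    def train(db, msg, is_spam, min_ham=2500):
        # While the database is immature, train on everything;
        # DSPAM only enables TOE once there are 2500 hams (see below).
        if ham_count(db) < min_ham:
            learn(db, msg, is_spam)
            return
        # Train on error: only modify the database if its current
        # state would have classified this message wrongly.
        if classify(db, msg) != is_spam:
            learn(db, msg, is_spam)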
It wouldn't hurt to have the option, but I think a lot of people are
already doing this simply by being selective about what they learn. One
problem with it is that you get a lot of unnecessary failures before the
accuracy levels out. DSPAM's TOE mode only switches on once there are
2500 ham messages in the database. I think that's sensible, particularly
for per-user databases.

> So it doesn't modify the database unless there's actually
> evidence that it would be beneficial (reducing non-beneficial
> modifications).

I've never found that argument particularly compelling. Correctly
identified mails are often rich in useful tokens, whereas errors often
happen precisely because there isn't much to go on.

> The two word token thing was mentioned on
> http://wiki.apache.org/spamassassin/WeLoveVolunteers since 2004-02-24.
>
> One of my questions is, does it make sense to continue to maintain
> bayesian stuff within SA at all? Or should we drop it, and encourage
> people to run a pure bayesian classifier before SA (like spamprobe),
> then have rules that read the headers from those classifiers?

One advantage of keeping it in SA is access to message metadata and an
interface that lets plugins contribute tokens. I think there is probably
scope for a lot more to be done with Bayes in this area. It might also
be useful if plugins could get back the ham/spam counts for the tokens
they contribute.

> The reason I'm playing with bayes is my interest in the possible
> usefulness of shared bayes data.
>
> I want to do more testing of using other people's bayes data on
> my corpora. My assumption is that most end users don't do their own
> training. So I haven't been using bayes, for some time, in an attempt
> to better see what typical end users see. But I suspect that taking
> multiple other people's bayes databases, merging them, and using them
> on my corpora, could be very useful. And if I can prove that, then
> we / I could distribute it to more people.

I think merging needs to be done per token, so that the global data
contributes most strongly on tokens with low local counts.
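Something like this, say (a Python sketch of the idea, with each
database as a token -> (spam_count, ham_count) map; the weighting
constant k is my own assumption, not anything SA or DSPAM defines):

    def merge(local, remote, k=10.0):
        # Fold a shared/global Bayes db into a local one, per token.
        # Remote counts get full weight on tokens the local db has
        # barely seen, and fade out as local evidence accumulates.
        merged = {}
        for token in set(local) | set(remote):
            ls, lh = local.get(token, (0, 0))
            rs, rh = remote.get(token, (0, 0))
            w = k / (k + ls + lh)   # 1.0 for locally unseen tokens
            merged[token] = (ls + w * rs, lh + w * rh)
        return merged

(Merged counts become fractional, which is fine for the probability
maths but would need care with integer count storage.)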
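And back on the two-word token point, for anyone who hasn't seen it:
the change is just to emit adjacent pairs alongside the single words.
A trivial Python sketch (real tokenization is obviously more involved
than a whitespace split):

    def tokens(text):
        # Single-word tokens plus each adjacent pair as one token.
        words = text.split()
        pairs = [a + " " + b for a, b in zip(words, words[1:])]
        return words + pairs

    >>> tokens("buy cheap pills now")
    ['buy', 'cheap', 'pills', 'now', 'buy cheap', 'cheap pills', 'pills now']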