On Fri, 9 Nov 2012 12:48:11 -0500
dar...@chaosreigns.com wrote:

> I haven't done as much testing on this as I'd like, but I've gotten
> away from it, and wanted to get my thoughts in here before I forget
> them.
> 
> I have a strong suspicion that SA's bayes implementation sucks.
> 
> The two major problems, as I see them:
> 1) Lack of learn-on-fail.
> 2) Lack of multi-word tokens.
> 
> In the process I discovered that 9 years ago I did some testing that
> showed multi-word tokens work better than single-word tokens:
> http://www.chaosreigns.com/adventures/entry.php?date=2003-10-06&num=01
> 
> It really blows my mind that we don't have these two features.
> 
> Learn-on-fail means that when you train an email as spam or ham, it
> first checks whether the email would already have been classified
> correctly, and only does any training if it would've gotten it wrong.

It wouldn't hurt to have the option, but I think a lot of people are
already doing this simply by being selective about what they learn.

One problem with it is that you get a lot of unnecessary failures
before the accuracy levels out. DSPAM's TOE mode only switches on once
there are 2500 ham messages in the database. I think this is sensible
- particularly for per-user databases.
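
The gating logic itself is trivial; a minimal sketch in Python, where
db.ham_count, db.classify() and db.learn() stand in for whatever the
real store exposes - none of this is SA's actual API:

    MIN_HAM = 2500   # DSPAM-style threshold: learn everything until
                     # the corpus is big enough to trust the classifier

    def train_on_error(db, msg, is_spam):
        # Bootstrap phase: unconditional learning.
        if db.ham_count < MIN_HAM:
            db.learn(msg, is_spam)
            return
        # After that, only learn when the current classifier
        # would have misclassified the message.
        predicted_spam = db.classify(msg) > 0.5
        if predicted_spam != is_spam:
            db.learn(msg, is_spam)

The point being that the threshold check belongs in front of the
train-on-error test, not behind it.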

> So it doesn't modify the database unless there's actually
> evidence that it would be beneficial (reducing non-beneficial
> modifications).

I've never really found that argument particularly compelling.
Correctly identified mails are often rich in useful tokens, whereas
errors often occur precisely because there's not much to go on - so
training only on errors skips the messages with the best training data.


> The two-word token idea has been listed on
> http://wiki.apache.org/spamassassin/WeLoveVolunteers since 2004-02-24.
> 
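It's also trivial to prototype outside SA. A rough Python sketch of
emitting word-pair tokens alongside the single words (the tokenizer
regex is a stand-in, not what SA actually uses):

    import re

    def tokens(text):
        words = re.findall(r"\S+", text.lower())
        for w in words:                      # single-word tokens
            yield w
        for a, b in zip(words, words[1:]):   # two-word tokens keep phrase
            yield a + " " + b                # context ("click here") that
                                             # the single words lose

Everything downstream (counting, scoring) stays the same; only the
token stream changes.
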
> One of my questions is, does it make sense to continue to maintain
> Bayesian stuff within SA at all?  Or should we drop it, and encourage
> people to run a pure Bayesian classifier before SA (like spamprobe),
> then have rules that read the headers from those classifiers?
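
For reference, the glue for that is already easy; assuming spamprobe
(or a wrapper script) has added an X-SpamProbe header before SA runs,
a rule like this would do it (header name and score are illustrative,
so check what your setup actually emits):

    header   SPAMPROBE_SPAM   X-SpamProbe =~ /^SPAM/
    describe SPAMPROBE_SPAM   spamprobe classified this as spam
    score    SPAMPROBE_SPAM   3.0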

One advantage of keeping it in SA is access to metadata and an
interface that allows plugins to contribute tokens. I think there is
probably scope for a lot more to be done with Bayes in this area.

Maybe it would also be useful if plugins could get back the ham/spam
counts for tokens they contribute. 
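
Shape-wise I'm imagining something like this (Python pseudocode for
the interface only - the real plugin API is Perl and doesn't expose
this today):

    class TokenStore:
        def __init__(self):
            self.counts = {}                  # token -> [ham, spam]

        def contribute(self, token, is_spam):
            # plugins add synthetic tokens alongside SA's own...
            c = self.counts.setdefault(token, [0, 0])
            c[1 if is_spam else 0] += 1

        def lookup(self, token):
            # ...and can read back the ham/spam counts for them
            return tuple(self.counts.get(token, (0, 0)))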


> The reason I'm playing with bayes is my interest in the possible
> usefulness of shared bayes data.
> 
> I want to do more testing of using other people's bayes data on
> my corpora.  My assumption is that most end users don't do their own
> training.  So for some time I haven't been using bayes, in an attempt
> to see what typical end users see.  But I suspect that taking
> multiple other people's bayes databases, merging them, and using them
> on my corpora could be very useful.  And if I can prove that, then
> we/I could distribute it to more people.

I think merging needs to be done per token, so that the global
database contributes most strongly on tokens with low local counts.
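
A sketch of what I mean, with the blending rule being my own guess
rather than anything established - weight the global counts so they
dominate only where local evidence is thin:

    def merged_counts(local, glob, token, k=10):
        # local, glob: dicts of token -> (ham_count, spam_count)
        # k: roughly how many local sightings it takes before the
        #    local data mostly drowns out the global data
        lh, ls = local.get(token, (0, 0))
        gh, gs = glob.get(token, (0, 0))
        w = k / (k + lh + ls)   # 1.0 when unseen locally, -> 0 with data
        return lh + w * gh, ls + w * gs

A token never seen locally gets the full global counts; one seen
hundreds of times locally barely moves.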
