I want to see what people think of this before I put something in Bugzilla.
There are two Bugzilla entries having to do with rule order and short-circuiting: bug 2912 (http://bugzilla.spamassassin.org/show_bug.cgi?id=2912), which is closed, and bug 3109 (http://bugzilla.spamassassin.org/show_bug.cgi?id=3109), which is still open.
They led me to think about Bayes processing as a special case because it is so expensive. I base that statement on sonic.net's experience of having difficulty deploying the latest SpamAssassin because of the I/O requirements of Bayes. The recent optimizations help, but I'm not sure whether they are enough.
If Bayes were done last, per bug #2912, or if we had a short-circuit mechanism as in bug #3109, Bayes calculations could be skipped whenever the accumulated score fell outside some threshold range.
A very conservative approach would be to set the threshold limits at (required_score - BAYES_99) and (required_score - BAYES_00), which means skipping Bayes processing whenever the pre-Bayes score falls outside that range, i.e. whenever the Bayes result cannot possibly change the verdict. If there is enough high-scoring spam and low-scoring ham in the mail stream, this would save a lot of processing load.
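Here is a rough sketch of that skip test, in Python rather than SpamAssassin's Perl, just to pin down the arithmetic. The score values are made up for illustration, not the stock rule scores:

    def bayes_can_matter(pre_bayes_score, required_score, bayes_max, bayes_min):
        # Bayes can only change the verdict if the pre-Bayes score lies
        # inside the range [required_score - bayes_max, required_score - bayes_min).
        lower = required_score - bayes_max  # below this, even BAYES_99 can't make it spam
        upper = required_score - bayes_min  # at/above this, even BAYES_00 can't make it ham
        return lower <= pre_bayes_score < upper

    # Illustrative values only; the real scores come from the rule set.
    REQUIRED = 5.0         # required_score
    BAYES_99_SCORE = 4.0   # most spammy Bayes result (assumed value)
    BAYES_00_SCORE = -2.3  # most hammy Bayes result (assumed value)

    for s in (0.5, 3.0, 8.0):
        if bayes_can_matter(s, REQUIRED, BAYES_99_SCORE, BAYES_00_SCORE):
            print(f"score {s}: run Bayes")
        else:
            print(f"score {s}: skip Bayes (verdict already decided)")

With those numbers the window is 1.0 to 7.3: a message already at 0.5 stays ham even with BAYES_99 added, and one at 8.0 stays spam even with BAYES_00, so only the 3.0 message actually needs the Bayes run.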
Since the Bayes score is not used in deciding whether something should be autolearned, the interaction between short-circuiting and autolearning is not a factor here. Does that mean we should use a special mechanism for Bayes that is simpler than whatever we eventually do for general short-circuiting?
Does this make sense to people, or should we just dedicate ourselves to making sure that Bayes processing is so efficient that there will be no need to treat it as a special case?
--sidney
