At 01:04 PM 12/29/2004, Matt Linzbach wrote:
Can someone point me to a thread that would discuss the scoring of the Bayesian rules in 3.0. Specifically why BAYES_99 would score less than BAYES_95 for bayes+net tests?

Why would you expect it to be higher? It's a common human perception that everything is simple and linear. Unfortunately, nothing about SA scoring is simple or linear.


The big thing to keep in mind is that rules are NOT independent entities. They are NOT scored based on their individual performance.

The score that BAYES_99 gets is not a function of its own performance, but of the performance of every other rule in the ruleset. It is most heavily biased by the rules that match the same messages in the corpus test, but it is still affected by rules that match none of the same messages, because those rules affect other rules, which eventually get around to affecting BAYES_99. (This relationship works a bit like the Kevin Bacon number.)

As an example, consider the case where another rule that also performs well is added to the system, with similar false-positive problems and similar spam matches. In this situation the perceptron (or the GA in older versions of SA) has to trade off the score between the two rules. Generally, it will heavily bias towards the rule with the fewest FPs, but only because it is trying to tune the scores to get the lowest value of (FP + (FN/100)) it can.
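To make that tradeoff concrete, here's a toy sketch (NOT SpamAssassin's actual perceptron; the rule names, corpus counts, and brute-force grid search are all made up for illustration). Two rules catch mostly the same spam, but rule_a also hits some ham; minimizing FP + FN/100 at a fixed 5.0 threshold pushes the weight onto the clean rule:

```python
# Toy illustration, not SA's real rescoring: brute-force search for the
# pair of rule scores minimizing the cost FP + FN/100 at threshold 5.0.
import itertools

THRESHOLD = 5.0

# Hypothetical corpus: (is_spam, rule_a_hits, rule_b_hits).
# Both rules catch mostly the same spam, but rule_a false-positives on ham.
corpus = (
    [(True, True, True)] * 60     # spam both rules catch
    + [(True, True, False)] * 20  # spam only rule_a catches
    + [(True, False, True)] * 20  # spam only rule_b catches
    + [(False, True, False)] * 2  # ham that rule_a falsely hits
    + [(False, False, False)] * 98  # clean ham
)

def cost(score_a, score_b):
    """FP + FN/100 over the corpus for a given pair of rule scores."""
    fp = fn = 0
    for is_spam, hit_a, hit_b in corpus:
        total = score_a * hit_a + score_b * hit_b
        flagged = total >= THRESHOLD
        if flagged and not is_spam:
            fp += 1
        elif not flagged and is_spam:
            fn += 1
    return fp + fn / 100.0

grid = [x / 2 for x in range(13)]  # candidate scores 0.0 .. 6.0
best = min(itertools.product(grid, grid), key=lambda s: cost(*s))
print("best (score_a, score_b):", best)  # -> (0.0, 5.0)
```

Even though rule_a catches 80% of the spam by itself, the optimizer gives it nothing: pushing it over the threshold would buy back 20 FNs at the cost of 2 FPs, and under the FP + FN/100 objective one FP costs as much as 100 FNs.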

In this case, I suspect BAYES was heavily biased by the URIBL rules. Some of those have VERY high hit rates and VERY low FP rates, much lower than the theoretical 0.5% FP rate that BAYES_99 has.
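Rough back-of-the-envelope numbers show how much pressure that 0.5% puts on the optimizer (the 10,000-message ham corpus size and the URIBL FP rate here are illustrative assumptions, not measured values):

```python
# Rough numbers only: FP pressure from each rule under the FP + FN/100
# objective, assuming a hypothetical 10,000-message ham corpus.
ham_messages = 10_000
bayes_99_fps = 0.005 * ham_messages   # theoretical 0.5% FP rate -> 50 FPs
uribl_fps = 0.0005 * ham_messages     # assumed 0.05% FP rate -> 5 FPs

# Each FP costs as much as 100 FNs, so express each rule's FP burden in
# FN-equivalents: the spam it must rescue just to break even at full score.
print(bayes_99_fps * 100)  # -> 5000.0
print(uribl_fps * 100)     # -> 500.0
```

With a tenth of the FP burden, the URIBL rules can soak up score that would otherwise have gone to BAYES_99.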

