Re: I don't understand the Bayes scoring logic
At 07:08 PM 3/5/2005, Nigel Wilkinson wrote:
> Why does a 99-100% probability score less than an 80-95% probability???

This is more-or-less a FAQ in SA now.

Rule scores in SA are not in any way linear. They are not assigned based on each rule's individual performance; they come from tuning the scores of ALL of the rules together so as to minimize the total of FPs and FNs with a 1:100 ratio (i.e. find the lowest FP + 100*FN). Because of this, a rule's score reflects not just the performance of that one rule, but its interactions with every other rule in the ruleset.

In the case of BAYES_99, it would appear that most spam messages that hit it also hit a lot of other rules, so SA's score optimizer could sacrifice the score slightly to reduce the FPs without introducing a significant number of FNs.

However, the story may be different for BAYES_80. Here the spams are likely to be more evasive, and might need a higher score from this rule to avoid large numbers of FNs.

The other off-chance possibility is that there may be some misplaced spams in the corpus the devs used. Actually, there's almost certainly one or two in the lot, but if there's a decent number of them they can really screw up the scores.
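To make the joint tuning described above concrete, here is a minimal Python sketch. It is not SA's actual optimizer: the toy corpus, the stand-in rule OTHER_RULES with its fixed 3.0 score, the 0.5-point grid, and the brute-force search are all assumptions made for illustration. The cost weights default to the FP + 100*FN formula quoted in the message and can be changed.

```python
from itertools import product

THRESHOLD = 5.0  # the 5-point spam threshold discussed in this thread

# Toy corpus of (rules hit, is_spam). Entirely made up for illustration.
CORPUS = [
    ({"BAYES_99", "OTHER_RULES"}, True),   # typical spam: Bayes plus other hits
    ({"BAYES_99", "OTHER_RULES"}, True),
    ({"BAYES_80"}, True),                  # evasive spam: only Bayes fires
    ({"BAYES_99"}, False),                 # rare ham that Bayes misjudges
    ({"OTHER_RULES"}, False),
    (set(), False),
]


def cost(scores, fp_weight=1, fn_weight=100):
    """Weighted count of false positives and false negatives over CORPUS."""
    fp = fn = 0
    for rules, is_spam in CORPUS:
        flagged = sum(scores[r] for r in rules) >= THRESHOLD
        if flagged and not is_spam:
            fp += 1
        elif is_spam and not flagged:
            fn += 1
    return fp_weight * fp + fn_weight * fn


def best_scores():
    """Brute-force both Bayes scores together over a 0.0 to 5.0 grid.

    OTHER_RULES is held at a hypothetical 3.0. Ties go to the first
    candidate found, which here means the lowest scores.
    """
    grid = [i * 0.5 for i in range(11)]
    candidates = (
        {"BAYES_80": b80, "BAYES_99": b99, "OTHER_RULES": 3.0}
        for b80, b99 in product(grid, grid)
    )
    return min(candidates, key=cost)


if __name__ == "__main__":
    print(best_scores())
```

On this toy corpus the search settles on BAYES_80 = 5.0 and BAYES_99 = 2.0. The spam that hits BAYES_99 is already carried over the threshold by its other hits, so a low BAYES_99 score costs nothing and keeps the misjudged ham safe, while the evasive spam needs BAYES_80 to do all the work.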
Re: I don't understand the Bayes scoring logic
Nigel Wilkinson wrote:
> Why does a 99-100% probability score less than an 80-95% probability???

Because the Bayes engine is not the only factor in classifying a message as spam. All of the other rules are factored in too. A message with a 99-100% probability is going to trigger many of the other SA rules, and the total is enough to push the message over the 5 point threshold. The scoring program therefore did not need to make the BAYES_99 score any higher than it did.

I also believe the SA development team holds that no single rule's score should be too large, since that can lead to false positives. It is better to be conservative and avoid false positives for the masses.

However, *I* don't like seeing the same spam again and again. With the default values I would see a spam, train for it, and still see the same spam again and again, because it would only hit BAYES_99 and stay below the threshold. Often this is before it is reported and before network tests such as RBLs and SURBL can tag the sender. So I increase the BAYES_95 and BAYES_99 scores to 4.0 and 5.0 for my own personal use. That way if the same spam comes through again, as I know it will, it will get tagged.

But I can't say with any authority that this won't generate false positives. I can only say that I have only myself to blame in that case, and that since I know what it is doing I won't be surprised by it.

Bob
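For reference, the override described above could be written with SA's `score` configuration directive. This is a sketch only: the values are the ones given in the message, while the file location (a per-user `~/.spamassassin/user_prefs`, or the site-wide `local.cf`) is an assumption, since the message does not say where the change was made.

```
# Raise the top two Bayes buckets (values from the message above).
# With the 5 point threshold, a BAYES_99 hit alone now tags the message.
score BAYES_95 4.0
score BAYES_99 5.0
```

As the message itself cautions, a single rule that can reach the threshold on its own trades a higher false positive risk for fewer repeat spams.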