Bowie Bailey wrote:
>
> The Bayes rules are not individual unrelated rules. Bayes is a
> series of rules indicating a range of probability that a message is
> spam or ham. You can argue over the exact scoring, but I can't see
> any reason to score BAYES_99 lower than BAYES_95. Since a BAYES_99
> message is even more likely to be spam than a BAYES_95 message, it
> should have at least a slightly higher score.
No, it should not. I've given a concrete reason why it may not always
be higher, and that reason has a solid statistical basis. It is
supported by real-world testing and real-world data.
You've given your opinion to the contrary, but no facts to support it
other than declaring the rules to be related and concluding that the
score should therefore correlate with the Bayes-calculated probability
of spam.
While I don't disagree with you that BAYES_99 scoring lower than
BAYES_95 is counter-intuitive, I do not believe intuition alone is a
reason to defy reality.
If there are other rules with better performance (i.e., fewer FPs) that
consistently coincide with the hits of BAYES_99, those rules should
soak up the lion's share of the score. However, if there are a lot of
spam messages with no other rules hit, BAYES_99 should get a strong
boost from those.
The perceptron results show that the former is largely true. BAYES_99
is mostly redundant. To back it up, I'm going to verify it with my
own maillog data.
Looking at my own current real-world maillogs, BAYES_99 matched 6,643
messages last week. Of those, only 24 had total scores under 9.0.
(With BAYES_99 scoring 3.5, it would take a message with a total score
of less than 8.5 to drop below the threshold of 5.0 if BAYES_99 were
omitted entirely, so a cutoff of 9.0 is, if anything, generous to
BAYES_99.)
So less than 0.37% of BAYES_99's hits actually mattered on my system
last week.
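
To make the cutoff reasoning concrete, here is a minimal sketch of the
test being applied: a rule only "mattered" for a message if the message
is spam with the rule's score counted and would not be spam without it.
The function name rule_mattered is made up for illustration; the 3.5
and 5.0 values are the ones quoted above.

```shell
# rule_mattered TOTAL RULE_SCORE THRESHOLD
# Exit status 0 if a message with total score TOTAL reaches THRESHOLD
# only because of RULE_SCORE, i.e. it would fall below THRESHOLD if
# the rule's score were removed. (Hypothetical helper for this sketch.)
rule_mattered() {
    LC_ALL=C awk -v t="$1" -v r="$2" -v th="$3" \
        'BEGIN { exit !(t >= th && t - r < th) }'
}

# 8.4 - 3.5 = 4.9, below 5.0: BAYES_99 made the difference here.
rule_mattered 8.4 3.5 5.0 && echo "8.4: BAYES_99 decided it"
# 9.0 - 3.5 = 5.5, still at or above 5.0: spam with or without the rule.
rule_mattered 9.0 3.5 5.0 || echo "9.0: spam either way"
```
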
BAYES_95, on the other hand, hit 468 messages, 20 of which scored less
than 9.0. That's about 4.3% of messages with BAYES_95 hits, a
considerably larger percentage. Bring the cutoff down to 8.0 to
compensate for the score difference and you still get 17 messages,
which is still a much larger 3.6% of its hits.
On my system, a BAYES_95 hit is about 10 times more likely than a
BAYES_99 hit to be what pushes a message over the spam threshold.
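
For anyone who wants to check the arithmetic, the percentages and the
roughly tenfold ratio can be reproduced from the counts quoted above.
This is only a sketch; pct and share_ratio are helper names invented
here, not part of any tool.

```shell
# pct HITS TOTAL: print HITS as a percentage of TOTAL, two decimals.
pct() {
    LC_ALL=C awk -v h="$1" -v t="$2" 'BEGIN { printf "%.2f\n", 100 * h / t }'
}

# share_ratio H1 T1 H2 T2: print (H1/T1) / (H2/T2), one decimal.
share_ratio() {
    LC_ALL=C awk -v a="$1" -v b="$2" -v c="$3" -v d="$4" \
        'BEGIN { printf "%.1f\n", (a / b) / (c / d) }'
}

pct 24 6643                  # BAYES_99, total under 9.0 -> 0.36
pct 20 468                   # BAYES_95, total under 9.0 -> 4.27
pct 17 468                   # BAYES_95, total under 8.0 -> 3.63
share_ratio 17 468 24 6643   # BAYES_95 (8.0) vs BAYES_99 (9.0) -> 10.1
```
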
What are your results?
These are the greps I used, based on MailScanner log formats. They
should work for spamd users, perhaps with slight modifications.

  zgrep BAYES_99 maillog.1.gz | wc -l
  zgrep BAYES_99 maillog.1.gz | grep -v "score=[1-9][0-9]\." | grep -v "score=9\." | wc -l
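
If you want to try other rules or cutoffs without rewriting the grep
patterns each time, something along these lines may help. It is a
variant of the greps above rather than what I actually ran: it assumes
each log line carries a "score=N.N" field (as the patterns above do),
and count_under is a name made up for this sketch. Adjust the field
name for your own log format.

```shell
# count_under RULE LIMIT FILE
# Count lines in gzip-compressed FILE that mention RULE and whose
# "score=" value is numerically below LIMIT.
# Assumption: the log writes the total as score=<number>.
count_under() {
    gzip -dc -- "$3" | grep -F -- "$1" | LC_ALL=C awk -v limit="$2" '
        match($0, /score=-?[0-9]+(\.[0-9]+)?/) {
            # Skip the 6 characters of "score=" and force a number.
            s = substr($0, RSTART + 6, RLENGTH - 6) + 0
            if (s < limit) n++
        }
        END { print n + 0 }'
}

# Example (not run here):
#   count_under BAYES_99 9.0 maillog.1.gz
#   count_under BAYES_95 8.0 maillog.1.gz
```

Unlike the two-step grep -v approach, this compares the score as a
number, so negative scores and scores of 100 or more are handled the
same way as everything else.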