From: "Bowie Bailey" <[EMAIL PROTECTED]>

Matt Kettler wrote:
Bowie Bailey wrote:
> The Bayes rules are not individual unrelated rules. Bayes is a
> series of rules indicating a range of probability that a message is
> spam or ham.  You can argue over the exact scoring, but I can't see
> any reason to score BAYES_99 lower than BAYES_95.  Since a BAYES_99
> message is even more likely to be spam than a BAYES_95 message, it
> should have at least a slightly higher score.

No, it should not. I've given a conclusive reason why it may not
always be higher. My reason has a solid statistical reason behind it.
This reasoning is supported by real-world testing and real-world data.

You've given your opinion to the contrary, but no facts to support it
other than declaring the rules to be related, and therefore the
score should correlate with the Bayes-calculated probability of spam.

While I don't disagree with you that BAYES_99 scoring lower than
BAYES_95 is counter-intuitive, I do not believe intuition alone is a
reason to defy reality.

If there are other rules with better performance (i.e., fewer FPs) that
consistently coincide with the hits of BAYES_99, those rules should
soak up the lion's share of the score. However, if there are a lot of
spam messages with no other rules hit, BAYES_99 should get a strong
boost from those.

The perceptron results show that the former is largely true. BAYES_99
is mostly redundant. To back it up, I'm going to verify it with my
own maillog data.

Looking at my own current real-world maillogs, BAYES_99 matched 6,643
messages last week. Of those, only 24 had total scores under 9.0.
(With BAYES_99 scoring 3.5, a message would need a total score below
8.5 to drop under the 5.0 threshold if BAYES_99 were omitted entirely.)

So less than 0.37% of BAYES_99's hits actually mattered on my system
last week.

BAYES_95, on the other hand, hit 468 messages, 20 of which scored less
than 9.0. That's 4.2% of messages with BAYES_95 hits, a considerably
larger percentage. Bring the cutoff down to 8.0 to compensate for the
score difference and you still get 17 messages, which is still a much
larger 3.6% of its hits.

On my system, BAYES_95 is significant in pushing mail over the spam
threshold about 10 times more often, as a share of its hits, than
BAYES_99 is.
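For anyone who wants to check the arithmetic against their own counts,
here is a minimal sketch. The function name pct_below is made up for
illustration; the counts are the ones quoted above.

```shell
# Share of a rule's hits that fall below a score cutoff, as a percentage.
# Usage: pct_below <hits_below_cutoff> <total_hits>
pct_below() {
    awk -v low="$1" -v total="$2" 'BEGIN { printf "%.2f\n", 100 * low / total }'
}

pct_below 24 6643   # BAYES_99 hits scoring under 9.0
pct_below 20 468    # BAYES_95 hits scoring under 9.0
pct_below 17 468    # BAYES_95 hits scoring under 8.0
```

This prints 0.36, 4.27 and 3.63, so the BAYES_95 share at the 8.0 cutoff
is roughly ten times the BAYES_99 share (3.63 / 0.36).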

What are your results?

These are the greps I used, based on MailScanner log formats. Should
work for spamd users, perhaps with slight modifications.

zgrep BAYES_99 maillog.1.gz | wc -l
zgrep BAYES_99 maillog.1.gz | grep -v "score=[1-9][0-9]\." | grep -v "score=9\." | wc -l

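The two greps can be folded into one small function. This is only a
sketch under the same assumption about the log format: the total
appears as "score=NN.N" on the same line as the rule names. rule_stats
is a hypothetical name, not part of MailScanner or SpamAssassin.

```shell
# Hypothetical helper: total hits for a rule, and hits with a total score under 9.0.
# Assumes each log line carries the message total as "score=<number>.<digits>".
# Usage: rule_stats <RULE_NAME> <gzipped-log>
rule_stats() {
    rule="$1"
    log="$2"
    # Every log line that mentions the rule.
    total=$(zgrep -c "$rule" "$log")
    # Drop two-digit totals (10.0 and up) and 9.x totals, then count what is left.
    low=$(zgrep "$rule" "$log" | grep -v 'score=[1-9][0-9]\.' | grep -vc 'score=9\.')
    echo "$rule total=$total under_9=$low"
}
```

Usage would be something like "rule_stats BAYES_99 maillog.1.gz",
repeated for BAYES_95 to compare the two.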
I think we are arguing from slightly different viewpoints.

You are saying that higher scores are not needed since the lower score
is made up for by other rules.  I have 13,935 hits for BAYES_99.  412
of them are lower than 9.0.  This seems to be caused by either AWL hits
lowering the score or very few other rules hitting.  BAYES_95 hit 469
times with 18 hits lower than 9.0.  This means that, for me, BAYES_95
is significant slightly more often, percentage-wise, than BAYES_99.
But considering volume, I would say that BAYES_99 is the more useful
rule.
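To put the percentage and volume views side by side, the counts above
can be run through a short awk one-liner. This is just a sketch;
summarize_hits is a made-up name and the input columns are my own
layout.

```shell
# Columns on stdin: rule name, total hits, hits with a total score under 9.0.
summarize_hits() {
    awk '{ printf "%s: %d of %d under 9.0 (%.2f%%)\n", $1, $3, $2, 100 * $3 / $2 }'
}

summarize_hits <<'EOF'
BAYES_99 13935 412
BAYES_95 469 18
EOF
```

That gives 2.96% for BAYES_99 against 3.84% for BAYES_95, but 412
messages against 18 in absolute terms.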

However, that's not what I was arguing about to begin with.  Because
of the way the Bayes algorithm works, I should be able to have more
confidence in a BAYES_99 hit than a BAYES_95 hit.  Therefore, it
should have a higher score.  Otherwise, you get the very strange
occurrence that if you train Bayes too well and the spams go from
BAYES_95 to BAYES_99, the SA score actually goes down.

The better you train your Bayes database, the more confidence it
should have in picking out the spams.  As the scoring moves from
BAYES_50 up to BAYES_99, the SA score should increase to reflect the
higher confidence level of the Bayes engine.
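One way to see whether a given set of scores has this inversion is to
check that the BAYES_* scores never decrease as the probability bucket
goes up. A rough sketch follows. check_bayes_monotonic is a
hypothetical helper; it assumes "score RULE value" lines as used in
SpamAssassin config files and only looks at the first value on each
line. The sample scores are made up for illustration.

```shell
# Print every BAYES_NN rule whose score is lower than the next-lower bucket's score.
# Reads "score BAYES_NN <value>" lines on stdin; prints nothing if scores never decrease.
check_bayes_monotonic() {
    grep '^score BAYES_[0-9]' |
    sed 's/^score BAYES_//' |
    sort -n |
    awk 'NR > 1 && $2 + 0 < prev + 0 {
             printf "BAYES_%s (%s) scores below BAYES_%s (%s)\n", $1, $2, name, prev
         }
         { name = $1; prev = $2 }'
}

# Made-up scores, not taken from any SpamAssassin release.
check_bayes_monotonic <<'EOF'
score BAYES_50 0.0
score BAYES_95 3.0
score BAYES_99 2.5
EOF
```

With these sample values it reports that BAYES_99 (2.5) scores below
BAYES_95 (3.0), which is exactly the case where better training would
lower the SA score.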

Bingo - the trick that's been tickling my brain and the name not making
it through the fog of old age is the Kalman Filter. You grade inputs
per their confidence factor rather than punish them for being too good.

This might be a better way to put together the rules scores and the
Bayes scores.

{^_^}
