Kelson wrote:
> Dhaval Patel wrote:
>>  1.2 BAYES_50               BODY: Bayesian spam probability is 40 to 60%
>>                             [score: 0.4999]
>
> Possibly silly and slightly off-topic question, but why are you giving
> BAYES_50 a positive score?  BAYES_50 means Bayes gives it a 50/50
> chance of being either spam or not.  Essentially, you're giving all
> messages a starting point of 1.2 instead of a starting point of 0.
>
That's the default score for BAYES_50 with set2 (bayes, no network) in
SA 3.0.x.

score BAYES_50 0 0 1.567 0.001
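
(If anyone is puzzled by the format: those four values are SA's four
score sets. If I have the layout right, that's

score BAYES_50 <set0> <set1> <set2> <set3>

where set0 = no bayes/no net, set1 = net only, set2 = bayes only, and
set3 = bayes+net; so the 1.567 is the bayes-without-network score and
the 0.001 is the bayes-plus-network one.)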

In SA 3.0.x, the perceptron was allowed to generate a score for
BAYES_50 based on real-world data. When you start looking at
perceptron output, you need to understand that rule scores aren't a
function of the rule alone. They're a function of the rule AND how it
interacts with every other rule in the ruleset.
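
As a toy illustration (made-up rules and corpus, nothing like SA's
actual trainer), here's a bare-bones perceptron in Python. When a
second rule reliably fires on the same spam that BAYES_50 hits, the
trainer can split the needed points between the two rules; when
nothing else fires, BAYES_50 has to carry the whole score itself:

import random

random.seed(42)

RULES = ["BAYES_50", "OTHER_RULE"]
THRESHOLD = 5.0
LEARN_RATE = 0.1

def make_corpus(overlap):
    # (hit-vector, is_spam) pairs; `overlap` is the chance that
    # OTHER_RULE also fires on a spam message that hit BAYES_50
    corpus = []
    for _ in range(2000):
        is_spam = random.random() < 0.5
        bayes = random.random() < (0.9 if is_spam else 0.1)
        other = is_spam and bayes and random.random() < overlap
        corpus.append(([float(bayes), float(other)], is_spam))
    return corpus

def train(corpus):
    # plain perceptron: bump scores up on missed spam, down on
    # false positives
    w = [0.0] * len(RULES)
    for _ in range(50):
        for hits, is_spam in corpus:
            score = sum(wi * hi for wi, hi in zip(w, hits))
            if (score >= THRESHOLD) != is_spam:
                sign = 1.0 if is_spam else -1.0
                w = [wi + LEARN_RATE * sign * hi
                     for wi, hi in zip(w, hits)]
    return w

for overlap in (0.0, 1.0):
    w = train(make_corpus(overlap))
    print("overlap=%.1f:" % overlap,
          ", ".join("%s=%.2f" % (r, s) for r, s in zip(RULES, w)))

With these made-up numbers, BAYES_50 ends up near the full 5-point
threshold when nothing overlaps it, and near 2.5 when OTHER_RULE
shares the load: same rule, different score, purely because of the
rest of the ruleset.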

Also, even though the "theoretical" performance of BAYES_50 should be
50/50, the real-world performance doesn't match that. In SA 3.0's
mass-checks, the set2 data came up with an S/O of 0.936, i.e. 93.6%
spam / 6.4% nonspam. That accounts for a lot of the positive score.
However, set3 came up with 0.306, which accounts for the near-zero
set3 score.
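
(S/O, for anyone unfamiliar with mass-check output, is just
hits-in-spam divided by hits-overall. The counts below are invented
to match the quoted ratio:

spam_hits, ham_hits = 936, 64
print(spam_hits / float(spam_hits + ham_hits))   # 0.936

i.e. 93.6% of the messages hitting BAYES_50 in that corpus were
spam.)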

SA 3.1.x's mass-checks came up with much more consistent numbers:
0.756 and 0.746 for set2 and set3, respectively.

It's possible there were some errors in testing in the 3.0 data, but
the 3.1 numbers feel about right to me. Even though BAYES_50 is
theoretically 50/50, I find that a message matching BAYES_50 is more
likely to be spam than not. This could be an artifact of good training
on my part. Although BAYES_50 sounds like it means "equal chance", it
really means "not fitting any existing training" OR "what training it
does match is roughly evenly divided between the two". If your nonspam
and spam training rates are pretty good, most of your BAYES_50 mail is
going to be "new variants" of spam, because spam changes much more
dramatically than nonspam from a Bayes perspective.
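
To make that concrete, here's a stripped-down Graham-style token
combiner (SA's real Bayes code uses a chi-squared combiner, so treat
this purely as a sketch). Both "no trained tokens" and "tokens split
evenly between the corpora" land on 0.5:

def combine(token_probs):
    # naive combination of per-token spam probabilities
    s = h = 1.0
    for p in token_probs:
        s *= p
        h *= 1.0 - p
    return s / (s + h)

# tokens never seen in training default to a neutral 0.5
print(combine([0.5, 0.5, 0.5]))           # -> 0.5
# tokens trained, but split evenly between spam and ham
print(combine([0.99, 0.01, 0.99, 0.01]))  # -> 0.5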
