Kelson wrote:
> Dhaval Patel wrote:
>> 1.2 BAYES_50 BODY: Bayesian spam probability is 40 to 60%
>>     [score: 0.4999]
>
> Possibly silly and slightly off-topic question, but why are you giving
> BAYES_50 a positive score? BAYES_50 means Bayes gives it a 50/50
> chance of being either spam or not. Essentially, you're giving all
> messages a starting point of 1.2 instead of a starting point of 0.

That's the default score for BAYES_50 with set2 (bayes, no network) in
SA 3.0.x:
    score BAYES_50 0 0 1.567 0.001

In SA 3.0.x, the perceptron was allowed to generate a real-world-based
score for BAYES_50. When you start looking at perceptron output, you
need to understand that rule scores aren't a function of the rule
alone; they're a function of the rule AND how it interacts with every
other rule in the ruleset.

Also, even though the "theoretical" performance of BAYES_50 should be
50/50, the real-world performance doesn't match that. In SA 3.0's
mass-checks, the set2 data came up with an S/O of 0.936, i.e. of the
messages hitting the rule, 93.6% were spam and 6.4% were nonspam. That
accounts for a lot of the positive score. However, set3 came up with
0.306, which accounts for the near-zero set3 score. SA 3.1.x's
mass-checks came up with much more consistent numbers: 0.756 and 0.746
for set2 and set3, respectively. It's possible there were some errors
in the 3.0 test data, but the 3.1 numbers feel about right to me.

Even though BAYES_50 is theoretically 50/50, I find that a message
matching BAYES_50 is more likely to be spam than not. This could be an
artifact of good training on my part. Although BAYES_50 sounds like it
means "equal chance," it really means "not fitting any existing
training" OR "what training it does match is roughly divided equally
between the two." If your nonspam and spam training rates are pretty
good, most of your BAYES_50 hits are going to be "new variants" of
spam, because spam changes much more dramatically than nonspam from a
Bayes perspective.
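To make the numbers above concrete, here's a small sketch (my own
illustrative helpers, not SpamAssassin code) of how the four columns of
a "score" line map to score sets, and how the S/O ratio from a
mass-check is computed. The hit counts are made up to reproduce the
0.936 figure; they are not the actual 3.0 mass-check data.

```python
def parse_score_line(line: str) -> dict:
    """Split an SA 'score RULE s0 s1 s2 s3' line into per-set scores.

    Per the SA convention: set0 = no bayes/no net, set1 = net only,
    set2 = bayes only, set3 = bayes + net.
    """
    _, rule, *scores = line.split()
    sets = ["set0", "set1", "set2", "set3"]
    return {"rule": rule, **dict(zip(sets, map(float, scores)))}

def s_o_ratio(spam_hits: int, ham_hits: int) -> float:
    """S/O: of all messages hitting a rule, the fraction that were spam."""
    total = spam_hits + ham_hits
    return spam_hits / total if total else 0.0

scores = parse_score_line("score BAYES_50 0 0 1.567 0.001")
print(scores["set2"])        # 1.567 -- the bayes-only (set2) score
# 936 spam hits vs. 64 ham hits gives the S/O of 0.936 discussed above.
print(s_o_ratio(936, 64))
```

The point of the second function is the one made in the text: the
score the perceptron assigns tracks the observed S/O of the rule in
the mass-check corpus, not the rule's theoretical 50/50 meaning.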