What are the odds of a mail hitting exactly 50.00%? Normal statistics would say 1 out
of 10,000, and I certainly did not get anywhere close to 50,000 emails today.
"Normal" statistics would only apply to statistically even distributions of data.
This is most definitely not a system with any kind of even distribution. SA's Bayes output is going to be heavily weighted towards 0 and 99, with a small spike at exactly 50. (Note that "BAYES_50" in SA 2.6x does not match the exact value of 50.)
Bayes results are a function of your bayes training. If the tokens in that email aren't in your database it gets a 50. If there are very few tokens present, the bayes engine decides it doesn't have enough of a sample to make a confident decision, so it gets a 50.
You can reduce the number of 50's through training.
Looking at just my spam-hits (I don't log nonspam hits) I've got a bit under 5% of them having BAYES_ absent:
[EMAIL PROTECTED] log]# grep "score\=" maillog |grep BAYES_ |wc -l
4977
[EMAIL PROTECTED] log]# grep "score\=" maillog |grep -v BAYES_ |wc -l
253

253 / (253 + 4977) = 0.04837, or about 4.84%
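
The same tally can be done in a single pass over the log. This is only a minimal sketch, assuming the log layout used above (each spam report line contains "score=" and lists the rule names it hit on that same line); the function name bayes_absent_pct is made up for illustration.

```shell
#!/bin/sh
# bayes_absent_pct (hypothetical helper): read a mail log on stdin and
# report how many "score=" lines have no BAYES_ rule hit at all.
bayes_absent_pct() {
    awk '
        /score=/ {
            total++                          # every scored message
            if ($0 !~ /BAYES_/) absent++     # scored, but no BAYES_ rule listed
        }
        END {
            if (total == 0) { print "no score= lines found"; exit 1 }
            printf "%d of %d (%.2f%%)\n", absent, total, 100 * absent / total
        }
    '
}

# Usage: bayes_absent_pct < maillog
```

Counting both cases in one awk pass avoids reading the log twice and keeps the division next to the counts it depends on.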
Also look at STATISTICS-set3.txt from SA 2.63. Note the heavy weights towards 00 and 99. I also went and totaled up the overall percentages:

OVERALL%   SPAM%     HAM%    S/O   RANK  SCORE  NAME
 44.741   0.0861  95.8440  0.001  1.00  -5.20  BAYES_00
 48.689  91.2161   0.0200  1.000  1.00   3.01  BAYES_99
  1.075   1.9973   0.0193  0.990  0.85   3.00  BAYES_90
  0.397   0.7286   0.0171  0.977  0.81   2.86  BAYES_80
  0.300   0.5502   0.0143  0.975  0.81   2.31  BAYES_70
  0.546   0.0574   1.1057  0.049  0.75  -5.40  BAYES_01
  0.314   0.5539   0.0400  0.933  0.71   1.10  BAYES_60
  0.197   0.0362   0.3812  0.087  0.66  -4.70  BAYES_10
  0.176   0.0374   0.3348  0.101  0.63  -2.60  BAYES_20
  0.178   0.0586   0.3141  0.157  0.52  -0.93  BAYES_30
  0.000   0.0000   0.0000  0.500  0.09   0.00  BAYES_56
  0.000   0.0000   0.0000  0.500  0.09   0.00  BAYES_50
  0.000   0.0000   0.0000  0.500  0.09  -0.00  BAYES_44
  0.000   0.0000   0.0000  0.500  0.09  -0.00  BAYES_40
 ------
 96.613% of all email accounted for.

3.387% didn't get any BAYES_ rule matches, making this classification the third most popular category.
(Note: there may be some small deviation here due to cumulative rounding errors, but it's going to be quite small. Even assuming an error of 0.0009 per rule across the 14 rules, you only get 0.0126 as a total deviation. The remainder is still somewhere over 3%.)
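
The totals and the rounding bound can be re-checked mechanically. This is a rough sketch, assuming the BAYES_ lines are fed in one rule per line with the overall percentage in the first column and the rule name in the last; bayes_coverage is a hypothetical helper name, and the 0.0009 per-rule error is the same assumption made in the note above.

```shell
#!/bin/sh
# bayes_coverage (hypothetical helper): read STATISTICS-style lines on stdin
# (assumed layout: overall% first, rule name last), sum the overall% of the
# BAYES_ rules, and report the remainder that matched no BAYES_ rule.
bayes_coverage() {
    awk '
        $NF ~ /^BAYES_/ { sum += $1; rules++ }   # only count BAYES_ rule lines
        END {
            # 0.0009 is the assumed worst-case rounding error per rule
            printf "covered=%.3f%% uncovered=%.3f%% max_rounding_error=%.4f\n",
                   sum, 100 - sum, rules * 0.0009
        }
    '
}

# Usage: grep BAYES_ STATISTICS-set3.txt | bayes_coverage
```

Fed the fourteen BAYES_ lines quoted from STATISTICS-set3.txt, this should reproduce the 96.613% covered, 3.387% uncovered, and 0.0126 worst-case figures given above.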
