Re: Bye Bye Bayes

Kai Schaetzl Wed, 04 Mar 2009 08:31:54 -0800

Karsten Bräckelmann wrote on Wed, 04 Mar 2009 02:25:51 +0100:

> That's bayes_auto_learn_threshold_spam and nonspam respectively, I
> guess? Keep in mind that threshold is not the actual score, so you
> aren't learning all spam with a score of 8+ then.


Right, I know. That's where the spam quarantine comes into play. All spam
in it (= everything with a score of 5 or higher) is learned overnight.
That's absolutely necessary as we don't get much spam. 96% of the mail
that is accepted is ham (or spam that comes in because the user opted out;
there's no distinction because there's no detection). The remainder is
either a virus, other bad content, or high-scoring spam. Low-scoring spam
is almost non-existent.

> 
> Kai, given a nonspam threshold of -2, how exactly do you (manually)
> learn ham? That would be interesting. And what's the ham/spam ratio?

I just checked and have to admit we must have removed the
bayes_auto_learn_threshold_nonspam -2
some time ago, as 0.01 seems to be reliable enough. Only the
bayes_auto_learn_threshold_spam 8
is in effect now.
But I believe -2 would also deliver enough ham for autolearning. Here is
the score distribution of the last 40,000 or so messages on the same server:

Score   Messages
  -15          6
   -4      3,364
   -3      4,249
   -2      9,982
   -1      4,760
    0     13,995
    1      1,267
    2        789
    3        387
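
To put a rough number on "enough ham", here is a minimal Python sketch that counts how many of the messages in the table sit at -2 or lower. It assumes the bucket labels are rounded final scores, and it uses the final score as a stand-in for the autolearn score, which (as Karsten notes) is not the same thing, so treat the result as an estimate only. The names score_counts and share_at_or_below are made up for this illustration.

```python
# Score buckets copied from the table above: score -> number of messages.
# Assumption: the bucket labels are rounded final scores.
score_counts = {
    -15: 6,
    -4: 3364,
    -3: 4249,
    -2: 9982,
    -1: 4760,
    0: 13995,
    1: 1267,
    2: 789,
    3: 387,
}


def share_at_or_below(counts, threshold):
    """Return (matching, total, fraction) for buckets with score <= threshold."""
    total = sum(counts.values())
    matching = sum(n for score, n in counts.items() if score <= threshold)
    return matching, total, matching / total


# Hypothetical ham threshold of -2, as discussed in the text.
matching, total, fraction = share_at_or_below(score_counts, -2)
print(f"{matching} of {total} messages ({fraction:.1%}) score -2 or lower")
```

With the figures above, a bit under half of the listed messages fall at or below -2.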

Bayes from that machine:

0.000          0      66285          0  non-token data: nspam
0.000          0      85888          0  non-token data: nham
0.000          0    1864402          0  non-token data: ntokens
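
For anyone who wants the ham/spam ratio from such a dump, a minimal Python sketch follows. It assumes the three lines above have the usual "non-token data" layout, with the count in the third column and the label (nspam, nham, ntokens) as the last word. The helper name parse_magic is invented for this example.

```python
# The three lines quoted above, pasted verbatim.
dump = """\
0.000          0      66285          0  non-token data: nspam
0.000          0      85888          0  non-token data: nham
0.000          0    1864402          0  non-token data: ntokens
"""


def parse_magic(text):
    """Map each trailing label (nspam, nham, ...) to the value in column 3."""
    values = {}
    for line in text.splitlines():
        fields = line.split()
        # Assumption: value is the third field, label is the last field.
        if len(fields) >= 5 and "non-token" in line:
            values[fields[-1]] = int(fields[2])
    return values


magic = parse_magic(dump)
learned = magic["nspam"] + magic["nham"]
ham_share = magic["nham"] / learned
print(f"learned messages: {learned}, ham share: {ham_share:.1%}")
```

For the figures above this gives roughly 56% ham among the learned messages.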

As you can see, because of the structure of the incoming mail, ham exceeds
spam and the gap is probably growing steadily. This is also reflected
in the rule hits. The no. 1 rule is BAYES_00 (it hits 99.7% of
all ham). BAYES_99 is only at around position 25, but it has 100% accuracy
and is the no. 1 rule hitting spam (it hits about 50% of spam).

On servers where I receive some spam trap email and let part of it
through the MTA rejection, the picture is very different. For instance, the
server for my own domains has only 25% ham. BAYES_99 is the no. 1 hitting
rule with an accuracy of 95.8% (again, I have not checked whether the
remainder really was ham), with all the URIBL rules and BAYES_00
(accuracy 99.9%) as runners-up.

So, all in all, Bayes works very well for me. Especially in those cases
where no other rule hits (typically some spamvertized site not yet on a
URIBL), it's most often the only rule that hits. That's why I moved it to
5.0 a while ago. Works very well. I think if you use DCC or Razor you may
get similar results from those rules and may not need to rely so much on
Bayes. I do not use *any* network rules except for the URIBL stuff, which
isn't shut off by "skip_rbl_checks 1".

(Figures are taken from mailwatch rule hits tables.)

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com
