Karsten Bräckelmann wrote on Wed, 04 Mar 2009 02:25:51 +0100:

> That's bayes_auto_learn_threshold_spam and nonspam respectively, I
> guess? Keep in mind that threshold is not the actual score, so you
> aren't learning all spam with a score of 8+ then.

Right, I know. That's where the spam quarantine comes into play. All spam in it (= everything with a score of 5 or higher) gets learned overnight. That's absolutely necessary, as we don't get much spam. 96% of the mail that is accepted is ham (or spam that comes in because the user opted out; there's no distinction because there's no detection). The remainder is either a virus, other bad content, or high-scoring spam. Low-scoring spam is almost non-existent.

>
> Kai, given a nonspam threshold of -2, how exactly do you (manually)
> learn ham? That would be interesting. And what's the ham/spam ratio?

I just checked and have to admit we must have removed the "bayes_auto_learn_threshold_nonspam -2" some time ago, as 0.01 seems to be reliable enough. Only "bayes_auto_learn_threshold_spam 8" is in effect now. But I believe -2 would also deliver enough ham for autolearning.

Score distribution of the last 40,000 or so messages on the same server:

Score   Messages
  -15          6
   -4      3,364
   -3      4,249
   -2      9,982
   -1      4,760
    0     13,995
    1      1,267
    2        789
    3        387

Bayes from that machine:

0.000          0      66285          0  non-token data: nspam
0.000          0      85888          0  non-token data: nham
0.000          0    1864402          0  non-token data: ntokens

As you see, because of the structure of the incoming mail, the ham exceeds the spam and the gap is probably steadily growing. This is also reflected in the rule hits. The no. 1 rule that hits is BAYES_00 (it hits 99.7% of all ham). BAYES_99 is only at around position 25, but with 100% accuracy, and it is the no. 1 rule hitting spam (hitting about 50% of spam).

On servers where I take in some spam-trap email and let part of it flow through past the MTA rejection, the picture is very different. For instance, the server for my own domains has only 25% ham. BAYES_99 is the no. 1 hitting rule with an accuracy of 95.8% (again, I have not checked whether the remainder really was ham), with all the URIBL rules and BAYES_00 (accuracy 99.9%) as runners-up.

So, all in all, Bayes works very well for me.
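
As a rough cross-check of the figures above, here is a minimal Python sketch. It only reuses the numbers quoted in this message; the helper names (parse_magic, share_at_or_below) are made up for illustration and are not part of SpamAssassin. It computes the learned ham/spam ratio from the nham/nspam lines and the share of messages that ended up at a score of -2 or lower.

```python
# Sketch only: all figures are copied from the post; helper names are hypothetical.

# Bayes counters as quoted above (whitespace-separated columns).
DUMP = """\
0.000 0 66285 0 non-token data: nspam
0.000 0 85888 0 non-token data: nham
0.000 0 1864402 0 non-token data: ntokens
"""

# Score bucket -> message count, from the score distribution above.
SCORES = {-15: 6, -4: 3364, -3: 4249, -2: 9982, -1: 4760,
          0: 13995, 1: 1267, 2: 789, 3: 387}


def parse_magic(dump):
    """Return {name: count} for the 'non-token data' lines of the dump."""
    counts = {}
    for line in dump.splitlines():
        fields = line.split()
        # Expected layout: prob, spam count, value, atime, "non-token", "data:", name
        if len(fields) >= 7 and fields[4:6] == ["non-token", "data:"]:
            counts[fields[6]] = int(fields[2])
    return counts


def share_at_or_below(scores, limit):
    """Fraction of messages whose score bucket is <= limit."""
    total = sum(scores.values())
    below = sum(n for s, n in scores.items() if s <= limit)
    return below / total


counts = parse_magic(DUMP)
ham_per_spam = counts["nham"] / counts["nspam"]
low_share = share_at_or_below(SCORES, -2)

print(f"learned ham per learned spam: {ham_per_spam:.2f}")
print(f"messages at score -2 or lower: {low_share:.1%}")
```

With the numbers from this message that comes to about 1.30 learned ham per learned spam, and about 45.4% of the roughly 38,800 listed messages at -2 or lower. Since the auto-learn threshold is not compared against the final score (as Karsten points out), the second figure is only a rough indication of how much ham a -2 threshold would deliver.
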
Bayes is especially valuable in those cases where no other rule hits (typically some spamvertized site not yet on a URIBL); there it's most often the only rule that hits. That's why I moved BAYES_99 to 5.0 a while ago. Works very well. I think if you use DCC or Razor you may get similar results from those rules and may not need to rely so much on Bayes. I do not use *any* network rules except for the URIBL stuff, which isn't shut off by "skip_rbl_checks 1".

(Figures are taken from the MailWatch rule hits tables.)

Kai

-- 
Kai Schätzl, Berlin, Germany
Get your web at Conactive Internet Services: http://www.conactive.com