maillist wrote:
Kim Christensen wrote:
Hey list,

I've recently started training our bayesian filter with spam/ham from my
personal mailbox, to prepare for live usage on our customer accounts.

% sa-learn --dump magic
...
0.000          0        340          0  non-token data: nspam
0.000          0        475          0  non-token data: nham
0.000          0      53404          0  non-token data: ntokens
...

So far so good, and spamd is actually using the bayesian db when
examining incoming mails. However, I find that a few legit ham mails
(not a majority) get unusually high bayesian points, while some
of the real spam (which gets scored as spam by sa) often gets bayesian
points < 1.
Now, I'm sure I haven't trained the database with wrong messages. Is it
a good idea to continue feeding sa-learn with example spam and ham until
it reaches a few thousand messages, before relying on the results?

I would think my current amount is sufficient, but I guess something's
wrong with this picture :-)


Best regards

Run spamassassin --test-mode on the messages that are scoring high and low, and see whether they are actually hitting any BAYES_* tests. I'm not 100% sure, but I think that by default Bayes does not even kick in until you have trained 500 messages each of spam and ham.

You can of course get around this by setting bayes_min_ham_num and bayes_min_spam_num in your local.cf file.
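
As a rough sketch, the two options named above would go into local.cf like this. The value 100 is purely illustrative and not a recommendation; with very little training data the Bayes scores are likely to be unreliable.

```
# local.cf (illustrative values only)
bayes_min_ham_num  100
bayes_min_spam_num 100
```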

-=Aubrey=-

Correction: the default for 3.* is 200 messages each of spam and ham. Sorry dude.
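
To check the counts against that threshold, here is a minimal Python sketch. It parses the three `sa-learn --dump magic` lines quoted at the top of the thread and compares nspam and nham with a minimum. The helper names `parse_magic` and `bayes_active` are made up for illustration and are not part of SpamAssassin, and the sketch assumes the count sits in the third column as in the quoted output.

```python
import re

# Lines copied from the `sa-learn --dump magic` output quoted in this thread.
DUMP = """\
0.000          0        340          0  non-token data: nspam
0.000          0        475          0  non-token data: nham
0.000          0      53404          0  non-token data: ntokens
"""

# Assumed layout: four whitespace-separated columns, then "non-token data: <name>".
LINE_RE = re.compile(r"\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (.+)$")


def parse_magic(text):
    """Return a dict mapping each non-token data name to its count (third column)."""
    counts = {}
    for line in text.splitlines():
        m = LINE_RE.match(line)
        if m:
            counts[m.group(2).strip()] = int(m.group(1))
    return counts


def bayes_active(counts, min_spam=200, min_ham=200):
    """True if both trained counts reach the given minimums (hypothetical helper)."""
    return counts.get("nspam", 0) >= min_spam and counts.get("nham", 0) >= min_ham


counts = parse_magic(DUMP)
print(counts)
print("meets 200/200 minimum:", bayes_active(counts))
print("meets 500/500 minimum:", bayes_active(counts, 500, 500))
```

With 340 spam and 475 ham, both counts clear a minimum of 200 but not one of 500, so under the 3.* default the BAYES_* tests should already be running on this database.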

-=Aubrey=-
