This is what automatic training attempts to solve.

If you are reliably nailing spam with your current setup, you can experiment
with automatic learning. But I'd widen the score ranges a little, as far as
is practical for your mail mix.

{^_^}
----- Original Message -----
From: "Clay Davis" <[EMAIL PROTECTED]>


Does anyone have any good techniques for capturing a sample of ham that can be used as the ham corpus? I'm in a corporate environment and am not keen on the idea of intercepting non-spam messages. I will if I have to, but was hoping someone had a better idea.

Regards,
Clay


On 2/7/2006 at 3:16 pm, in message <[EMAIL PROTECTED]>, Matt Kettler
<[EMAIL PROTECTED]> wrote:
[EMAIL PROTECTED] wrote:
Can you just feed spamassassin spam or do you need to give it ham also?

I read the docs and it didn't say you had to feed it ham.

I then read another doc and it said you should feed it equal amounts of
spam and ham.

Yes, you really should feed it both. You also should strive for a 1:1 ratio
of spam and nonspam, but don't kill yourself to get there.

SA's use of chi-squared combining makes it very tolerant of wild imbalances
in training. However, the closer you are to a 1:1 ratio the better SA will
be able to distinguish tokens that are present in both kinds of mail and
ignore them. So this is a worthwhile goal to strive for as long as it
doesn't become a burden.

My current training ratio is about 7:1 spam:nonspam, but in the past it's
been as bad as 20:1. Both of those are very far off from equal amounts, but
the imbalance has never caused me any problems.

From my sa-learn --dump magic output as of today:
0.000          0     995764          0  non-token data: nspam
0.000          0     145377          0  non-token data: nham

That works out to a ratio of about 6.85:1.
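
For anyone who wants to check their own ratio the same way, here is a rough Python sketch. It assumes the nspam and nham lines always look like the two quoted above (count in the third column, name after "non-token data:"). The function name training_ratio is made up for illustration; it is not part of SpamAssassin.

```python
def training_ratio(dump_magic_output: str) -> float:
    """Return the spam:ham training ratio from `sa-learn --dump magic` text.

    Assumes each relevant line has the whitespace-separated shape
    <score> <n> <count> <n> non-token data: <name>
    as in the two lines quoted in this thread.
    """
    counts = {}
    for line in dump_magic_output.splitlines():
        fields = line.split()
        if len(fields) >= 7 and fields[4:6] == ["non-token", "data:"]:
            # fields[2] is the count column, fields[6] the first word of the name
            counts[fields[6]] = int(fields[2])
    return counts["nspam"] / counts["nham"]


sample = """\
0.000          0     995764          0  non-token data: nspam
0.000          0     145377          0  non-token data: nham
"""

print(f"{training_ratio(sample):.2f}:1 spam:nonspam")
```

Paste your own dump output in place of sample. A KeyError means the nspam or nham line was not found in the text.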
