Does anyone have any good techniques for capturing a sample of ham that can be used as the ham corpus? I'm in a corporate environment and am not keen on the idea of intercepting non-spam messages. I will if I have to, but was hoping someone had a better idea.

Regards,
Clay

>>> On 2/7/2006 at 3:16 pm, in message <[EMAIL PROTECTED]>, Matt Kettler <[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote:
>> Can you just feed spamassassin spam or do you need to give it ham also?
>>
>> I read the docs and it didn't say you had to feed it ham.
>>
>> I then read another doc and it said you should feed it equal amounts of
>> spam and ham.
>
> Yes, you really should feed it both. You also should strive for a 1:1
> ratio of spam and nonspam, but don't kill yourself to get there.
>
> SA's use of chi-squared combining makes it very tolerant of wild
> imbalances in training. However, the closer you are to a 1:1 ratio the
> better SA will be able to distinguish tokens that are present in both
> kinds of mail and ignore them. So this is a worthwhile goal to strive for
> as long as it doesn't become a burden.
>
> My current training ratio is about 7:1 spam:nonspam, but in the past
> it's been as bad as 20:1. Both of those are very far off from equal
> amounts, but the imbalance has never caused me any problems.
>
> From my sa-learn --dump magic output as of today:
> 0.000 0 995764 0 non-token data: nspam
> 0.000 0 145377 0 non-token data: nham
>
> That works out to a ratio of 6.85:1
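
For anyone who wants to check their own training ratio the same way, here is a minimal Python sketch. It assumes the `sa-learn --dump magic` output looks like the two lines quoted above, with the message count in the third column; the function name and the regular expression are my own and not part of SpamAssassin.

```python
import re

# Matches e.g. "0.000 0 995764 0 non-token data: nspam" and captures the
# third column (the count) plus the label. Assumed layout, based only on
# the two lines quoted in this thread.
MAGIC_LINE = re.compile(r"\b(\d+)\s+\d+\s+non-token data:\s+(nspam|nham)\b")


def training_counts(dump_text):
    """Return a dict like {'nspam': N, 'nham': M} from dump-magic text."""
    counts = {}
    for line in dump_text.splitlines():
        match = MAGIC_LINE.search(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def spam_to_ham_ratio(counts):
    """Spam messages learned per ham message learned."""
    if counts.get("nham", 0) == 0:
        raise ValueError("no ham has been learned yet")
    return counts["nspam"] / counts["nham"]


if __name__ == "__main__":
    sample = (
        "0.000 0 995764 0 non-token data: nspam\n"
        "0.000 0 145377 0 non-token data: nham\n"
    )
    counts = training_counts(sample)
    print(f"{spam_to_ham_ratio(counts):.2f}:1")
```

In practice you would feed it the real output, for example by piping `sa-learn --dump magic` into a small script that reads standard input and calls `training_counts` on it.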