Does anyone have any good techniques for capturing a sample of ham that can be
used as the ham corpus?  I'm in a corporate environment and am not keen on the
idea of intercepting non-spam messages.  I will if I have to, but was hoping
someone had a better idea.

Regards,
Clay


>>> On 2/7/2006 at 3:16 pm, in message <[EMAIL PROTECTED]>, Matt Kettler
<[EMAIL PROTECTED]> wrote:
> [EMAIL PROTECTED] wrote:
>> Can you just feed spamassassin spam or do you need to give it ham also?
>> 
>> I read the docs and it didn't say you had to feed it ham.
>> 
>> I then read another doc and it said you should feed it equal amounts of
>> spam and ham.
> 
> Yes, you really should feed it both. You also should strive for a 1:1 ratio of
> spam and nonspam, but don't kill yourself to get there.
> 
> SA's use of chi-squared combining makes it very tolerant of wild imbalances in
> training. However, the closer you are to a 1:1 ratio the better SA will be able
> to distinguish tokens that are present in both kinds of mail and ignore them. So
> this is a worthwhile goal to strive for as long as it doesn't become a burden.
> 
> My current training ratio is about 7:1 spam:nonspam, but in the past it's been
> as bad as 20:1. Both of those are very far off from equal amounts, but the
> imbalance has never caused me any problems.
> 
> From my sa-learn --dump magic output as of today:
> 0.000          0     995764          0  non-token data: nspam
> 0.000          0     145377          0  non-token data: nham
> 
> That works out to a ratio of 6.85:1
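
For anyone who wants to check their own balance the same way, here is a rough
Python sketch that pulls the nspam and nham counts out of dump text and divides
them. It assumes the column layout seen in the two lines quoted above (count in
the third column, label as the last field); the function names are made up for
illustration and are not part of SpamAssassin.

```python
# Sample lines copied from the `sa-learn --dump magic` output quoted above.
DUMP = """\
0.000          0     995764          0  non-token data: nspam
0.000          0     145377          0  non-token data: nham
"""


def training_counts(dump_text):
    """Return a dict with the 'nspam' and 'nham' counts found in dump_text.

    Assumption: the count is the third whitespace-separated column and the
    label is the last field, as in the sample lines above.
    """
    counts = {}
    for line in dump_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[-1] in ("nspam", "nham"):
            counts[fields[-1]] = int(fields[2])
    return counts


def spam_ham_ratio(counts):
    """Spam messages learned per ham message learned."""
    return counts["nspam"] / counts["nham"]


counts = training_counts(DUMP)
print(f"{spam_ham_ratio(counts):.2f}:1")
```

To use it on a live database, the DUMP string could be replaced with the text
captured from running the command yourself.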
