Re: Forcing autolearn

Magnus Holmgren Sun, 31 Jul 2005 10:23:48 -0700

Matt Kettler wrote:
>Magnus Holmgren wrote:
>>Kai Schaetzl wrote:
>>>Magnus Holmgren wrote on Thu, 28 Jul 2005 09:06:20 +0200:
>>>
>>>>In other words, is there a way to bypass the 3 points minimum for header 
>>>>and body? (Why isn't that limit configurable, by the way?)
>>>
>>>It's trying to prevent you from accidentally poisoning your Bayes db.
>>
>>That explains the limit but not its non-configurability, IMHO. Hey, why
>>can't I shoot myself in my foot if I really want to! There is always a
>>possibility to re-learn (provided you save the learnt-from messages).
>>
> It is reconfigurable.. It's just harder than most SA options as you have to
> hack the source code to change it. :)
> 
> DISCLAIMER: I *really* think it's a bad idea to adjust this. But if you
> insist, it is possible.
> 
> I want there to still be some difficulty to intimidate you from changing this
> without some consideration. (it shouldn't be hard to find the setting knowing
> what file it's in, so this isn't much of a hurdle)


You can always hack the source, and yes, it was easy to find. :-)

Now for the consideration part:

First, we don't want to learn anything as spam that isn't. With a
default lower limit of 12 points that's very unlikely, and as already
mentioned, I haven't yet noticed a single false positive in my case.
Second, we don't want bayes poisoning, i.e. "hammy" words recorded as
"spammy". I guess the reasoning is that if the header scores lots of
points while the body scores low or even zero, then the body isn't
spammy enough and shouldn't be learnt from. Conversely, if the header is
clean then any (at least 9!) body points are probably just coincidence.
Right?
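
To make the rule under discussion concrete, here is a minimal sketch in
Python of the decision as it is described in this thread: a total of at
least 12 points, with at least 3 points each from header and body. It is
not SpamAssassin's actual code. The function name, the constant names and
the require_minimums switch are hypothetical, the total is assumed to be
simply header plus body points, and treating the limits as inclusive is
also an assumption.

```python
# Thresholds as quoted in this thread; the names are made up for illustration.
SPAM_LEARN_THRESHOLD = 12.0  # default lower limit for learning as spam
MIN_HEADER_POINTS = 3.0      # minimum points required from header rules
MIN_BODY_POINTS = 3.0        # minimum points required from body rules


def would_autolearn_as_spam(header_points, body_points, require_minimums=True):
    """Return True if a message would be learnt as spam under the sketched rule.

    Assumption: the total is just header + body points, and all limits
    are inclusive.
    """
    total = header_points + body_points
    if total < SPAM_LEARN_THRESHOLD:
        return False
    if require_minimums:
        # The per-part minimums are the limit this thread wants to bypass.
        if header_points < MIN_HEADER_POINTS or body_points < MIN_BODY_POINTS:
            return False
    return True


if __name__ == "__main__":
    # Header-heavy message: blocked by the body minimum, learnt once it is bypassed.
    print(would_autolearn_as_spam(14.0, 0.0))
    print(would_autolearn_as_spam(14.0, 0.0, require_minimums=False))
```

With require_minimums=False the only remaining condition is the total,
which is the behaviour being asked for here.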

Now, whether bayes poisoning really is an issue is debated. Someone
pointed out that the random words hidden by spammers in the message in
various ways aren't likely to resemble typical legit correspondence;
indeed they are just random noise that doesn't contribute in any
direction. In my case most real messages are in Swedish, meaning less
problem with those (but slightly more with English ones). Also, a high
body score doesn't mean there is no bayes poison. Finally, when spam
slips through, the user would want to feed it to sa-learn regardless of
any bayes poison.

In conclusion, I feel confident letting SA learn from every message
that I am sure it can reliably identify as spam.

-- 
Magnus Holmgren
[EMAIL PROTECTED]
