Magnus Holmgren wrote:
> Matt Kettler wrote:
> 
>>Yes, bayes poison should be trained without worry. However, bayes poison is 
>>not
>>the topic of discussion here. We are talking about mis-learning, something
>>COMPLETELY different.
> 
> 
> Kai Schaetzl talked about "prevent[ing] you from accidently poisoning
> your Bayes db", so I assumed we were talking about bayes poisoning.

This gets into a subtlety of usage of the words.

"bayes poison"  is a noun, and unless otherwise stated means text inserted by
spammers in an attempt to make the message look like nonspam.

"bayes poisoning" is a verb, and refers to the act of successfully unbalancing a
bayes database. Most bayes poison, despite its name, is very ineffective at
causing this, although it does try.

There are two sources of bayes poisoning, but only one is commonly called "bayes
poison", and it's mostly harmless. Mislearning is just called mislearning,
although it's a much more potent cause of bayes poisoning.

So this thread is about bayes poisoning, but it's about poisoning as a result of
mislearning, not poisoning as a result of bayes poison.

(Isn't the clarity of human language wonderful?)

>>Are you sure your conclusions are based on accurate perceptions of the 
>>consequences?
>>
> 
> I am sure that there will be no mislearning, even if I lower the body
> and/or header limits a bit, and that any mislearning that nevertheless
> may occur can be rectified by relearning. The mail volumes are low

In that case, it doesn't matter a whole lot if the message gets autolearned or
not. You'll manually train it correctly one way or the other.

> 
> What I still would like to know is the theory behind the hardcoded 3
> point limits. Can someone give as an example a message that would be
> mislearnt if it weren't for those limits?
> 


A whole lot of messages posted to this list will score very high in body points
due to quoted spam, but 0 or near-0 header points. Many messages sent by
people on shady ISPs will score high in header points but low in body points.

Ideally, SA takes the approach of autolearning a message as spam only when it's
quite sure of itself. Anything that doesn't get autolearned can always be
manually trained to compensate.
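To make the idea concrete, here's an illustrative sketch (NOT actual
SpamAssassin code; the function name and the 12.0 threshold are just stand-ins)
of what per-area minimums buy you: a message has to look spammy in *both* the
headers and the body before it's learned, so a one-sided score (like a list
reply quoting spam) doesn't get in.

```python
def should_autolearn_as_spam(header_points, body_points, total_score,
                             spam_threshold=12.0, area_minimum=3.0):
    """Hypothetical autolearn guard: learn as spam only if the total
    score clears the autolearn threshold AND each rule area (header,
    body) contributes at least area_minimum points on its own."""
    if total_score < spam_threshold:
        return False
    # A ham reply quoting spam racks up body points but ~0 header
    # points; this check keeps it out of the spam corpus.
    if header_points < area_minimum or body_points < area_minimum:
        return False
    return True

# A list message quoting spam: high body score, clean headers.
print(should_autolearn_as_spam(header_points=0.2, body_points=11.5,
                               total_score=11.7))  # False: not autolearned
```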


Basically you have two paths:

 1) aggressively autolearn and try to fix any errors with manual training. Risk
of FPs is slightly increased in the interim.

 2) autolearn normally and manually train anything that didn't autolearn. Risk
of FNs is slightly increased in the interim.

So, it boils down to which is worse for you, FPs or FNs.

SA in general takes the standpoint that FPs are much worse than FNs. Thus, it
is natural for SA to be very conservative about learning spam, and liberal
about learning ham. Such a learning pattern fits SA's general design.
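For what it's worth, the overall autolearn score thresholds are tunable in
local.cf even though the per-area 3-point minimums under discussion are
hardcoded; the values shown here are the defaults:

```
bayes_auto_learn 1
bayes_auto_learn_threshold_nonspam 0.1   # learn as ham at or below this score
bayes_auto_learn_threshold_spam 12.0     # learn as spam at or above this score
```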
