poifgh wrote:
> Bowie Bailey wrote:
>   
>> For auto-learning, the high and low scoring messages are fed to Bayes. 
>> However, for an optimal setup, you should manually train Bayes on as
>> much of your (verified) ham and spam as possible.  The more of your mail
>> stream Bayes sees, the better the results will be.
>>
>> Your description of Bayes is pretty close.  It breaks down the message
>> into "tokens" (words and character sequences) and then keeps track of
>> how likely each of those tokens is to appear in either a ham or spam
>> message.  When a new message comes in, Bayes breaks it into tokens and
>> then scores it depending on which tokens were found in the message.
>>
>>     
>
> Suppose we do not have manual Bayesian training. We only do online
> training, in which high and low scoring mails are fed to the learner
> [is this a usual thing to do? How many people manually train their
> Bayesian filter?]
> A high-scoring spam is then fed to the learner. The spam is
> high-scoring because a few rules [regexes] matched. Now the Bayesian
> learner would learn all the tokens from this mail. Next time a mail
> [say M] with similar tokens is seen, it would be flagged as spam
> [using Bayes' rule]. Why would Bayesian learning be needed for us to
> say M is spam? Since it contains words very similar to the earlier
> high-scoring mails, shouldn't we expect the regex rules to work for M
> as well, given that M closely resembles the mails we learnt from?
>   

Look at it this way -- Bayes is learning what your spam looks like and
what your ham looks like.  Most of your spam will be caught by other
rules, but there are times when an email will come in that the main
rules do not catch.  Bayes is frequently able to catch these because it
is looking at the message as a whole rather than looking for particular
words or phrases as the main regex rules do.
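
To make "looking at the message as a whole" concrete, here is a toy
sketch of that kind of token scoring. To be clear, this is not
SpamAssassin's actual code; the tokenizer and the combining formula
are simplified stand-ins, roughly along the Graham/Robinson lines:

import math
import re
from collections import Counter

def tokens(text):
    # Crude stand-in for the real tokenizer: lowercased words only.
    return set(re.findall(r"[a-z0-9']+", text.lower()))

class Bayes:
    """Toy token scorer, not SpamAssassin's implementation."""

    def __init__(self):
        self.spam = Counter()  # token -> spam messages containing it
        self.ham = Counter()   # token -> ham messages containing it
        self.nspam = 0
        self.nham = 0

    def learn(self, text, is_spam):
        counts = self.spam if is_spam else self.ham
        for t in tokens(text):
            counts[t] += 1
        if is_spam:
            self.nspam += 1
        else:
            self.nham += 1

    def spam_prob(self, token):
        # Relative frequency of the token in each corpus.
        s = self.spam[token] / max(self.nspam, 1)
        h = self.ham[token] / max(self.nham, 1)
        p = 0.5 if s + h == 0 else s / (s + h)
        # Tokens seen only a few times get pulled toward the neutral
        # 0.5, so one rare word cannot swing the verdict by itself.
        n = self.spam[token] + self.ham[token]
        return (0.5 + n * p) / (1 + n)

    def score(self, text):
        # Combine every token's probability in log space; the result
        # is the usual naive-Bayes posterior: 0 = hammy, 1 = spammy.
        lp = sum(math.log(self.spam_prob(t)) for t in tokens(text))
        lh = sum(math.log(1 - self.spam_prob(t)) for t in tokens(text))
        return 1 / (1 + math.exp(lh - lp))

The point is that no single token decides the outcome; every token in
the message nudges the score one way or the other, which is why Bayes
can catch mail that slips past any particular phrase-matching rule.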

Manual training is not strictly required for Bayes, but the more manual
training you do, the higher the accuracy and the more useful it
becomes.  At the least, you should manually train Bayes on all of your
false positives and false negatives.  This can be scripted to happen
automatically based on folders which are expected to contain hand-sorted
spam and ham.
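
For example, a nightly cron job along these lines does it (a rough
sketch only; the folder paths are made up, and sa-learn needs to be on
the PATH):

#!/usr/bin/env python3
"""Feed hand-sorted folders to sa-learn.

The maildir paths below are placeholders; point them at wherever your
users file their misclassified mail.
"""
import subprocess

FOLDERS = [
    ("--spam", "/var/mail/training/missed-spam"),      # false negatives
    ("--ham", "/var/mail/training/false-positives"),   # false positives
]

for flag, maildir in FOLDERS:
    # sa-learn remembers which messages it has already learned, so
    # re-running over the same folders is harmless.
    subprocess.run(["sa-learn", "--no-sync", flag, maildir], check=True)

# Fold the journal into the Bayes database in one pass at the end.
subprocess.run(["sa-learn", "--sync"], check=True)

Using --no-sync during the learning runs and syncing once at the end
cuts down on database writes when the folders are large.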

> Here is how I think Bayesian filtering helps [which could be entirely
> my misunderstanding]. Suppose a set of spam mails looks like
>
> "Please buy M3d1C1NE X at store Y for cheap". 
>
> Now spammers have obfuscated the word "medicine" in the mail.
> Spammers send, say, a thousand spams, each spelling "medicine" a
> different way, but all the other words around it remain nearly the
> same. Only some of the first 100 of these mails would hit a MEDICINE
> rule [regex], if such a rule exists. Those particular mails would
> have high spam scores, and hence the Bayesian filter would learn that
> mails containing the words "Please", "buy", "at", "store", "for", and
> "cheap" have a high spam probability.
>
> For the 101st mail, if the MEDICINE regex fails to match the
> obfuscated text, the mail would have a low score, but the Bayesian
> learner, seeing the words surrounding the obfuscated text, would say
> this mail is spam.
>
> Does it work this way? Does it work only this way [if not manually trained]? 
>   

That is a pretty fair description of how it works regardless of how you
train it.  The advantage of manual training is that you allow it to
learn from the lower scoring spam (and higher scoring ham), which are
the kinds of messages that can most use the extra points from the Bayes
rules.
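
You can watch exactly that happen with the toy Bayes class I sketched
earlier in this message (paste this below it; the training messages
are obviously made up):

bayes = Bayes()

# 100 auto-learned spams: the obfuscated word changes every time,
# the surrounding words do not.
for i in range(100):
    bayes.learn(f"Please buy M3d1C1NE{i} X at store Y for cheap",
                is_spam=True)

# Some verified ham, so the everyday words are not pure spam signals.
for msg in ("lunch at noon?", "the store report is attached",
            "please review the budget for Q3"):
    bayes.learn(msg, is_spam=False)

# Mail 101: a spelling no MEDICINE regex has seen. The unknown token
# scores a neutral 0.5, but "please", "buy", "store", "cheap", and the
# rest carry the verdict.
print(bayes.score("Please buy MED1C1N3 X at store Y for cheap"))
# -> close to 1.0
print(bayes.score("please review the budget for Q3"))
# -> close to 0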

-- 
Bowie
