Wow... you definitely went the opposite direction.  Although, I did
appreciate a well-written explanation of the bayes system.  

I could be evil and forward this to them (thoughts?)... Maybe they won't
ask again. >:) 

-Brent

-----Original Message-----
From: Matt Kettler [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, May 03, 2006 1:47 PM
To: Brent Kennedy
Cc: users@spamassassin.apache.org
Subject: Re: Silly Question

Brent Kennedy wrote:

> Here is my explanation for how spamassassin learns email as 
> spam (layman's
> terms):
> 
> 1. Users receive the junk email.
> 2. The users who received the junk email drag and drop the email to 
> the spammail public folder in Outlook.
> 3. Spamassassin connects to the internal email server and downloads 
> the email from the spammail public folder.
> 4. After the email is downloaded, the mail is scanned, the different 
> aspects of the email are noted, and point rules inside the spamassassin 
> engine are updated (raised) based on the number of instances a rule is 
> found in emails from the spammail public folder.

Statement 4 is untrue, but I can understand the reason to over-simplify
here.
The real story is a bit complex and long-winded. However, it is important
that you at least know that learning spam does not adjust the score of any
rule in SA. The sa-learn process never even considers any of the rules in
SA; it is completely unaware that they exist.

The learning only affects what BAYES category the message will be placed in.
It does not adjust scores, but it does indirectly affect the score by
shifting which BAYES_* rule will later match.

Let me take a moment to explain bayes. Bayes is one of the many subsystems
contained in SpamAssassin. While it is quite complex if you get down to the
fine-grained details, it's not that hard at a high-level view. As a simple
explanation, it's a statistical "probability of spam" tracker for all the
words found in the body of messages. It establishes spam probabilities for
words through training, and later applies these to compute the overall
chance that a message is spam when you scan it.


In greater detail:

Fundamentally, bayes works by breaking a message into tokens.  These tokens
are mostly just the words in the body of the message, but in SA's bayes
implementation, tokens are also extracted from some parts of the message
headers and from URLs.
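As a rough sketch (in Python here, not SA's actual Perl code), tokenization
boils down to something like this; the real tokenizer also handles headers,
URLs, and token-length rules:

```python
import re

def tokenize(body):
    # Split a message body into word-like tokens. This is a loose
    # approximation: SA's real tokenizer also pulls tokens from selected
    # headers and URLs, and applies its own length and formatting rules.
    return re.findall(r"[A-Za-z0-9'$!-]+", body.lower())

tokenize("Cheap meds!!! Click here now")
# yields word-like tokens such as "cheap", "meds!!!", "click", ...
```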

When you train messages with sa-learn, it records how many times each token
is seen in spam messages and how many times in nonspam. These statistics are
stored in a database for future reference. So when you train a message as
spam, sa-learn breaks the message into tokens, finds all the ones already
present in the database, and increments their spam counter. All the other
tokens get added as new tokens with the spam counter set to 1 and the
nonspam counter set to 0.
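In sketch form (again Python, standing in for SA's real database code), the
per-token bookkeeping looks roughly like this:

```python
from collections import defaultdict

# token -> [spam_count, nonspam_count]; a stand-in for SA's bayes database
db = defaultdict(lambda: [0, 0])

def learn(tokens, is_spam):
    # Increment the spam counter (index 0) or nonspam counter (index 1)
    # for each token; unseen tokens start out at [0, 0] automatically.
    for tok in set(tokens):
        db[tok][0 if is_spam else 1] += 1

learn(["cheap", "meds", "now"], is_spam=True)
learn(["meeting", "agenda", "now"], is_spam=False)
# "now" has now been counted once in spam and once in nonspam
```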


Based on the frequency of spam vs nonspam, SA can compute that a token has a
particular probability of being present in a spam message vs a nonspam
message.
When SA uses bayes to scan a message, it looks up all the tokens present in
the database, computes the probability for each token, then combines them to
get an overall probability that the whole message is spam or not. This is
expressed as a percentage chance of the message being spam (from 0% to
100%).
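As an illustration only, here is a naive way to combine per-token
probabilities. SA's real implementation uses a more robust chi-squared
combination over the most significant tokens, plus smoothing for
rarely-seen tokens, but the basic idea is the same:

```python
def token_prob(spam_count, nonspam_count):
    # Fraction of this token's sightings that were in spam.
    # SA applies smoothing so rare tokens don't dominate; omitted here.
    return spam_count / (spam_count + nonspam_count)

def message_prob(token_probs):
    # Naive product combination: multiply the odds of each token being
    # spammy vs hammy, then normalize to a 0.0-1.0 probability.
    p_spam, p_ham = 1.0, 1.0
    for p in token_probs:
        p_spam *= p
        p_ham *= 1.0 - p
    return p_spam / (p_spam + p_ham)

# Three tokens: two strongly spammy, one roughly neutral
message_prob([0.9, 0.9, 0.4])  # comes out around 0.98, i.e. likely spam
```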

So through your training, SA's bayes effectively learns to associate some
words with spam, some with nonspam, and some as being "in between". Let's
say, for example, that a scanned message has 5 words strongly associated
with spam, 10 that are "in between", and 1 strongly associated with nonspam.
Since there are so many more strong spam tokens than nonspam tokens, bayes
is going to declare this message to have a fairly high probability of being
spam.

These bayes probabilities show up as a fixed set of rules that SA assigns
scores to. These scores are pre-generated, like every other score in SA, and
do not change unless you run sa-update or otherwise upgrade your SA to get
new rules.

In SA 3.1.x the rules that bayes shows up as are BAYES_00 (0%-1%), BAYES_05
(1%-5%), BAYES_20 (5%-20%), BAYES_40 (20%-40%), BAYES_50 (40%-60%), BAYES_60
(60%-80%), BAYES_80 (80%-95%), BAYES_95 (95%-99%), and BAYES_99 (99%-100%).
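Mapping a bayes probability onto those rule names is just a bucket lookup; a
sketch using the ranges above (the exact boundary behavior in SA may differ
slightly):

```python
# Upper bound (exclusive) of each bucket and the rule name it maps to,
# following the SA 3.1.x ranges listed above. Boundary handling here is
# approximate, not SA's exact logic.
BAYES_BUCKETS = [
    (0.01, "BAYES_00"), (0.05, "BAYES_05"), (0.20, "BAYES_20"),
    (0.40, "BAYES_40"), (0.60, "BAYES_50"), (0.80, "BAYES_60"),
    (0.95, "BAYES_80"), (0.99, "BAYES_95"),
]

def bayes_rule(prob):
    for upper, name in BAYES_BUCKETS:
        if prob < upper:
            return name
    return "BAYES_99"  # the 99%-100% bucket

bayes_rule(0.97)  # lands in the 95%-99% band, i.e. BAYES_95
```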

By shifting which of these rules a message matches, sa-learn winds up
affecting the overall score of future messages with similar content.
