Brent Kennedy wrote:

> Here is my explanation for how spamassassin learns email as spam(laymans
> terms):
> 
> 1. Users receive the junk email
> 2. The users who received the junk email drags and drops the email to the
> spammail public folder in outlook
> 3. Spamassassin connects to the internal email server and downloads the
> email from the spammail public folder.
> 4. After the email is downloaded, the mail is scanned, the different aspects
> of the email are noted and point rules inside the spamassassin engine are
> updated(raised) based on the number of instances a rule is found in emails
> from the spammail public folder.

Statement 4 is untrue, but I can understand the reason to over-simplify here.
The real story is a bit complex and long-winded. However, it is important that
you at least know that learning spam does not adjust the score of any rule in
SA. The sa-learn process never even considers the rules in SA; it is completely
unaware that they exist.

The learning only affects what BAYES category the message will be placed in. It
does not adjust scores, but it does indirectly affect the score by shifting
which BAYES_* rule will later match.

Let me take a moment to explain bayes. Bayes is one of the many subsystems
contained in SpamAssassin. While it is quite complex if you get down to the
fine-grained details, it's not that hard at a high-level view. As a simple
explanation, it's a statistical "probability of spam" tracker for all the words
found in the body of messages. It establishes spam probabilities for words
through training, and later applies them to compute the overall chance that a
message is spam when you scan it.


In greater detail:

Fundamentally, bayes works by breaking a message into tokens.  These tokens are
mostly just the words in the body of the message, but in SA's bayes
implementation there are also tokens extracted from some parts of the message
headers and URLs.

When you train messages with sa-learn, it records how many times each token is
seen in spam messages and how many times in nonspam. These statistics are
stored in a database for future reference. So when you train a message as spam,
sa-learn breaks the message into tokens, finds all the ones already present in
the database, and increments their spam counter. All the other tokens get added
as new tokens with the spam counter set to 1 and the nonspam counter set to 0.
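The training step can be sketched roughly like this. This is a toy in-memory
token database, not SA's actual storage format or tokenizer, and the function
names are my own:

```python
# Toy sketch of bayes training: count spam/nonspam sightings per token.
from collections import defaultdict

# token -> [spam_count, nonspam_count]
db = defaultdict(lambda: [0, 0])

def tokenize(message):
    # SA's real tokenizer also pulls tokens from headers and URLs;
    # here we just split the body into lowercase words.
    return set(message.lower().split())

def learn(message, is_spam):
    # Bump the appropriate counter for every token in the message.
    for token in tokenize(message):
        db[token][0 if is_spam else 1] += 1

learn("cheap pills buy now", is_spam=True)
learn("meeting agenda for monday", is_spam=False)
```

After these two calls, "cheap" has a spam count of 1 and a nonspam count of 0,
and "agenda" has the reverse.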


Based on the frequency of spam vs nonspam, SA can compute that a token has a
particular probability of appearing in a spam message vs a nonspam message.
When SA uses bayes to scan a message, it looks up all the tokens present in the
database, computes the probability for each token, then combines them to get an
overall probability that the whole message is spam or not. This is expressed as
a percentage chance of the message being spam (from 0% to 100%).
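The scanning side might look like the sketch below. Note that SA actually uses
a more sophisticated (chi-squared) combining method; this naive-Bayes-style
combination in log space is a simplified stand-in, and the counts in `db` are
made up for illustration:

```python
import math

def tokenize(message):
    return set(message.lower().split())

# token -> (spam_count, nonspam_count), as built up by prior training
db = {"pills": (9, 1), "cheap": (8, 2), "agenda": (0, 5)}
n_spam, n_ham = 10, 10  # number of messages trained in each class

def spam_probability(message):
    # Sum log-probabilities per class (+1 smoothing so unseen tokens
    # don't zero everything out), then convert back to a 0..1 chance.
    log_spam = log_ham = 0.0
    for token in tokenize(message):
        s, h = db.get(token, (0, 0))
        log_spam += math.log((s + 1) / (n_spam + 2))
        log_ham += math.log((h + 1) / (n_ham + 2))
    return 1 / (1 + math.exp(log_ham - log_spam))
```

With these counts, a message containing "cheap pills" scores well above 90%,
while one containing only "agenda" scores well below 50%.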

So through your training, SA's bayes effectively learns to associate certain
words with spam, some with nonspam, and some as being "in between". When a
message is scanned, let's say for example it has 5 words that are strongly
associated with spam, 10 that are "in between", and 1 that's strongly
associated with nonspam. Since there are so many more strong spam tokens than
nonspam tokens, bayes is going to declare this message to have a fairly high
probability of being spam.

These bayes probabilities show up as a fixed set of rules that SA assigns
scores to. These scores are pre-generated, like every other score in SA, and do
not change unless you run sa-update or otherwise upgrade your SA to new rules.

In SA 3.1.x the rules that bayes shows up as are BAYES_00 (0%-1%), BAYES_05
(1%-5%), BAYES_20 (5%-20%), BAYES_40 (20%-40%), BAYES_50 (40%-60%), BAYES_60
(60%-80%), BAYES_80 (80%-95%), BAYES_95 (95%-99%) and BAYES_99 (99%-100%).
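The bucketing is just a lookup against those fixed ranges. Here is a sketch of
that mapping, using the real SA 3.1.x rule names from the list above but with
boundary handling assumed for illustration:

```python
# Map a bayes probability (0.0-1.0) to its SA 3.1.x BAYES_* rule bucket.
def bayes_rule(p):
    buckets = [
        (0.01, "BAYES_00"), (0.05, "BAYES_05"), (0.20, "BAYES_20"),
        (0.40, "BAYES_40"), (0.60, "BAYES_50"), (0.80, "BAYES_60"),
        (0.95, "BAYES_80"), (0.99, "BAYES_95"),
    ]
    # Return the first bucket whose upper bound exceeds p.
    for upper, name in buckets:
        if p < upper:
            return name
    return "BAYES_99"
```

For example, a 50% bayes probability lands in BAYES_50, while 99.9% lands in
BAYES_99, and it's the pre-generated score attached to that rule (not the
probability itself) that gets added to the message's total.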

By shifting which of these rules a message matches, sa-learn winds up affecting
the overall score of future messages with similar content.

