Wow... You definitely went the opposite direction. Although I did appreciate a well-written explanation of the Bayes system.
I could be evil and forward this to them (thoughts?)... Maybe they won't ask again. >:)

-Brent

-----Original Message-----
From: Matt Kettler [mailto:[EMAIL PROTECTED]
Sent: Wednesday, May 03, 2006 1:47 PM
To: Brent Kennedy
Cc: users@spamassassin.apache.org
Subject: Re: Silly Question

Brent Kennedy wrote:
> Here is my explanation for how SpamAssassin learns email as spam (layman's terms):
>
> 1. Users receive the junk email.
> 2. The users who received the junk email drag and drop the email to the spammail public folder in Outlook.
> 3. SpamAssassin connects to the internal email server and downloads the email from the spammail public folder.
> 4. After the email is downloaded, the mail is scanned, the different aspects of the email are noted, and point rules inside the SpamAssassin engine are updated (raised) based on the number of instances a rule is found in emails from the spammail public folder.

Statement 4 is untrue, but I can understand the reason to over-simplify here. The real story is a bit complex and long-winded. However, it is important that you at least know that learning spam does not adjust the score of any rule in SA. The sa-learn process never even considers the rules in SA at all; it is completely unaware that they exist.

The learning only affects which BAYES category the message will be placed in. It does not adjust scores, but it does indirectly affect the score by shifting which BAYES_* rule will later match.

Let me take a moment to explain Bayes. Bayes is one of the many subsystems contained in SpamAssassin. While it is quite complex if you get down to the fine-grained details, it's not that hard at a high-level view. As a simple explanation, it's a statistical "probability of spam" tracker for all the words found in the body of messages. It establishes spam probabilities for words through training, and later applies these to compute the overall chance that a message is spam when you scan it.
In greater detail:

Fundamentally, Bayes works by breaking a message into tokens. These tokens are mostly just the words in the body of the message, but in SA's Bayes implementation there are also tokens extracted from some parts of the message headers and URLs.

When you train messages with sa-learn, it records how many times each token is seen in a spam message and how many times in nonspam. These statistics are stored in a database for future reference. So when you train a message as spam, sa-learn breaks the message into tokens, finds all the ones already present in the database, and increments their spam counter. All the other tokens get added as new tokens with the spam counter set to 1 and the nonspam counter set to 0.

Based on the frequency of spam vs. nonspam, SA can compute that a token has a particular probability of being present in a spam message vs. a nonspam message. When SA uses Bayes to scan a message, it looks up all the tokens present in the database, computes the probability of each token, then combines them to get an overall probability that the whole message is spam. This is expressed as a percentage chance of the message being spam (from 0% to 100%).

So through your training, SA's Bayes effectively learns to associate certain words with spam, some with nonspam, and some as being "in between". Say a scanned message has 5 words strongly associated with spam, 10 that are "in between", and 1 strongly associated with nonspam. Since there are so many more strong spam tokens than nonspam tokens, Bayes will declare this message to have a fairly high probability of being spam.

These Bayes probabilities show up as a fixed set of rules that SA assigns scores to. These scores are pre-generated, like every other score in SA, and do not change unless you perform an sa-update or otherwise upgrade your SA to have new rules.
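The token counting and probability combination described above can be sketched in a few lines of Python. This is a hypothetical toy model, not SpamAssassin's actual code: the class name, the Laplace smoothing, and the simple Graham-style combining formula are illustrative choices; real SA also tokenizes headers and URLs, keeps its counters in a database, and uses a more robust combining method.

```python
from collections import defaultdict
from math import prod

class TinyBayes:
    """Toy sketch of the token-counting scheme described above.

    Hypothetical and simplified; real SpamAssassin also tokenizes
    headers and URLs, stores counts in a database, and combines
    probabilities with a more robust formula.
    """

    def __init__(self):
        self.spam = defaultdict(int)  # token -> times seen in spam
        self.ham = defaultdict(int)   # token -> times seen in nonspam

    def learn(self, text, is_spam):
        # Training: bump the spam or nonspam counter for each token.
        counts = self.spam if is_spam else self.ham
        for token in set(text.lower().split()):
            counts[token] += 1

    def token_prob(self, token):
        # Laplace-smoothed per-token spam probability (the smoothing
        # is an illustrative choice, not SA's exact math).
        s, h = self.spam[token], self.ham[token]
        return (s + 1) / (s + h + 2)

    def spam_prob(self, text):
        # Scanning: look up every token and combine the per-token
        # probabilities (Graham-style combining formula).
        ps = [self.token_prob(t) for t in set(text.lower().split())]
        p = prod(ps)
        q = prod(1 - x for x in ps)
        return p / (p + q)

bayes = TinyBayes()
bayes.learn("cheap viagra deal", is_spam=True)
bayes.learn("meeting agenda attached", is_spam=False)
print(bayes.spam_prob("cheap deal today"))   # leans spam (~0.8)
print(bayes.spam_prob("meeting agenda"))     # leans nonspam (~0.2)
```

Note how an unseen token ("today") contributes a neutral 0.5, exactly the "in between" case described above, while previously trained tokens pull the overall probability toward spam or nonspam.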
In SA 3.1.x, the rules that Bayes shows up as are BAYES_00 (0%-1%), BAYES_05 (1%-5%), BAYES_20 (5%-20%), BAYES_40 (20%-40%), BAYES_50 (40%-60%), BAYES_60 (60%-80%), BAYES_80 (80%-95%), BAYES_95 (95%-99%), and BAYES_99 (99%-100%). By shifting which of these rules a message matches, sa-learn winds up affecting the overall score of future messages with similar content.
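The bucketing into those rule names can be illustrated with a hypothetical helper. SpamAssassin does this mapping internally; `bayes_rule` is not a real SA function, and the treatment of exact boundary values here is an assumption.

```python
# Hypothetical helper showing the SA 3.1.x buckets listed above;
# the handling of exact boundary values is an illustrative guess.
def bayes_rule(prob):
    """Map a Bayes probability (0.0-1.0) to its SA 3.1.x rule name."""
    buckets = [
        (0.01, "BAYES_00"), (0.05, "BAYES_05"), (0.20, "BAYES_20"),
        (0.40, "BAYES_40"), (0.60, "BAYES_50"), (0.80, "BAYES_60"),
        (0.95, "BAYES_80"), (0.99, "BAYES_95"),
    ]
    for upper, name in buckets:
        if prob < upper:
            return name
    return "BAYES_99"  # the 99%-100% bucket

print(bayes_rule(0.97))  # 95%-99% band -> BAYES_95
```

This is the "indirect" score effect: training never edits rule scores, it just moves future messages into a different bucket, and each bucket's pre-generated score is what changes the total.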