At 11:27 AM 8/10/2004, Jim Maul wrote:
X-Spam-Status: No, hits=1.1 required=5.0 tests=CLICK_BELOW,DEAR_SOMETHING,
         HTML_LINK_CLICK_HERE,HTML_MESSAGE,HTML_WEB_BUGS,INVALID_MSGID,
         RCVD_IN_BSP_TRUSTED autolearn=ham version=2.63


What I was concerned with was things like click below, dear something, html messages, web bugs, etc. getting into bayes as ham. I'm not saying the message isn't ham; I just don't want bayes getting confused.

Since you're still apprehensive, let me fill in some extra details here. Please don't take my wording as an intent to insult; my intent is to bring a point across in a clear, if somewhat blunt, manner.


1) None of those are very good spam signs, despite what you might think from their names.
2) Bayes doesn't learn SA rules, so which rules hit won't affect bayes per se.
3) Bayes won't even see the HTML code of a message; that's all stripped out before bayes examines the message.
4) Again, it's all about being realistic, not about what you perceive as "possibly spam-like nonspam".



To justify my statement in 1), let's look at the STATISTICS.txt data for those rules, focusing on the simple metric of S/O.


S/O is the ratio of spam hits to overall hits for a given rule: it is literally the percentage of a rule's hits that were spam in the corpus tests. A rule with an S/O of 0.90 has 90% of its hits being spam and 10% being nonspam.
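If it helps, here's a trivial sketch of that arithmetic (my own illustration, not code from SA):

```python
# S/O = spam hits / overall hits for a rule.
# This is my own illustration of the arithmetic, not SpamAssassin code.

def s_over_o(spam_hits, nonspam_hits):
    """Fraction of a rule's total hits that were spam."""
    return spam_hits / (spam_hits + nonspam_hits)

# A rule that hit 900 spams and 100 nonspams has an S/O of 0.90,
# i.e. 10% of its hits were nonspam.
print(s_over_o(900, 100))  # -> 0.9
```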

STATISTICS.txt is a table of results from the mass-check corpus test used to generate the scores for a particular version of SA. 2.63's corpus run consisted of 543,473 messages, a fairly decent-sized statistical sampling of email. It's not a perfect statistical sample, but it's certainly not grossly undersized.


Here's the results from 2.63's STATISTICS.txt (I've trimmed off the first few columns so we can look at S/O easily):

 S/O    RANK   SCORE  NAME
 0.902   0.75    0.10  CLICK_BELOW
 0.880   0.65    1.61  DEAR_SOMETHING
 0.953   0.87    0.10  HTML_LINK_CLICK_HERE
 0.896   0.81    0.16  HTML_MESSAGE
 0.964   0.86    1.12  HTML_WEB_BUGS
 0.957   0.83    1.17  INVALID_MSGID


Not bad, but none are altogether impressive. Two of these have more than 10% of their hits being nonspam; the best of the lot has 3.6% nonspam, and the worst has 12%.

Now here's the results from the recent SA 3.0-pre4's STATISTICS.txt:

 S/O    RANK   SCORE  NAME
 0.687   0.29    0.01  CLICK_BELOW
 0.857   0.40    1.23  DEAR_SOMETHING
 0.832   0.30    0.01  HTML_LINK_CLICK_HERE
 0.908   0.34    0.01  HTML_MESSAGE
 0.896   0.39    0.37  HTML_WEB_BUGS
 0.890   0.43    1.08  INVALID_MSGID

Ouch. It would appear that given current trends in email, all of these rules are very poor performers indeed! Five of the six are over 10%, and the other is not far behind: the best of the lot has 9.2% of its hits being nonspam messages, and the worst has roughly 31% of its hits being nonspam!
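Those percentages fall straight out of (1 - S/O). Here's the conversion for the 3.0-pre4 table above (S/O values copied from the table; the arithmetic is mine):

```python
# Nonspam share implied by each S/O value from the 3.0-pre4 table above.
so_values = {
    "CLICK_BELOW": 0.687,
    "DEAR_SOMETHING": 0.857,
    "HTML_LINK_CLICK_HERE": 0.832,
    "HTML_MESSAGE": 0.908,
    "HTML_WEB_BUGS": 0.896,
    "INVALID_MSGID": 0.890,
}
for name, so in so_values.items():
    # (1 - S/O) is the fraction of the rule's hits that were nonspam
    print(f"{name}: {100 * (1 - so):.1f}% nonspam")
```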

Clearly your message is not statistically that "out of line" for a nonspam message; all of these rules have high enough false positive rates that it's not unexpected for nonspam messages to hit them.

In general, I'd still emphasize that you should worry less about poisoning bayes by feeding it all of your mail, and more about poisoning it with your preconceptions of what it needs to see. Bayes really is designed for the real world; you don't need to isolate it from the facts of reality.

Bayes is a very broad statistics-based tool that tokenizes nearly every word in an email it learns. Little disturbances like this might bother you, but they'll hardly influence bayes at all. Bayes naturally accommodates "neutral" tokens that appear in both spam and nonspam: a word like "kitten" could appear in a child's email or in a porn spam, so bayes will learn that this token is neutral. On the other hand, "teenxxx" is not likely to appear in anything but spam, and bayes will learn to recognize it as important. Bayes treats each and every token it finds as a statistic with a percentage chance of being spam, then looks at the overall collection of them when making its decisions. It doesn't look at just one or two things in an email; it usually looks at more like 15-30.
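To make the "collection of tokens" idea concrete, here's a toy sketch of Bayesian-style combining (a simplified illustration of the general technique, not SpamAssassin's actual algorithm; the probabilities are made up):

```python
# Toy Bayesian token combining: each token carries a spam probability,
# and the classifier combines many of them, so one or two "spammy"
# tokens rarely decide the outcome on their own.
# Illustration only -- NOT SpamAssassin's real combining formula.

def combine(token_probs):
    """Combine per-token spam probabilities into one overall score."""
    prod_spam = 1.0
    prod_ham = 1.0
    for p in token_probs:
        prod_spam *= p
        prod_ham *= (1.0 - p)
    return prod_spam / (prod_spam + prod_ham)

# Neutral tokens like "kitten" sit near 0.5 and barely move the score;
# a strong token like "teenxxx" near 0.99 pulls it toward spam.
mostly_neutral = [0.5, 0.48, 0.52, 0.5]
with_spammy_token = mostly_neutral + [0.99]

print(combine(mostly_neutral))      # stays near 0.5
print(combine(with_spammy_token))   # jumps close to 1.0
```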

Sometime, run a message through spamassassin -D and you'll get a better understanding of just how much it looks at. Look for lines like these; there should be several:


debug: bayes token 'FEATURED' => 0.00273649317207415
debug: bayes token 'pdf' => 0.00511688439191974
debug: bayes token 'UD:pdf' => 0.00539832400516067
debug: bayes token 'ranges' => 0.00659281885375479
debug: bayes token 'RED' => 0.00664197530864198

(The message I took this from matched 152 tokens. At 36 kbytes it's a bit long, but you should get the idea that bayes examines lots of things, not just a few. Another message, 861 bytes including headers and with only one line of body text, hit 8 tokens.)




