Re: SpamAssassins bayes mechanism and message headers

Jeff Mincy Wed, 18 Mar 2009 12:24:26 -0700

   From: Matt Kettler <mkettler...@verizon.net>
   Date: Tue, 17 Mar 2009 21:30:02 -0400
   
   fl...@pbartels.info wrote:
   > Hello,
   >
   > instead of disabling a lot possibly set message headers using
   > "bayes_ignore_header" and ending up in strange configs like:
   >
   > bayes_ignore_header Return-Path
   ...
   > (found on the net)
   Where?
   >
   > shouldn't SpamAssassins bayes mechanism just ignore the complete
   > message header and just look at the body?
   > This seems useful in my opinion.
   It seems like a very misguided idea to me.
   
   Is there any reason to think headers make bad tokens?
   Do you have any test data showing this improves your bayes accuracy?


Yes - I think some headers make extremely bad tokens for bayes, for
example the X-Mailer/User-Agent headers.  40% of the spam I get
claims to have Microsoft Outlook as an X-Mailer.  So bayes rapidly
determines that *UAMicrosoft (etc) is an extremely strong spam token.
These *UA tokens were enough to push a short ham message to BAYES_99.
When I added a bayes_ignore_header the score dropped to ~BAYES_40.
Obfuscated words like 'st0ck' are 100% indications of spam (or of
messages that discuss spam), so these words work great for bayes.
An 'X-Mailer: Microsoft Office Outlook' header doesn't really tell you
anything about the message, at least not to the extent that bayes
treats these tokens.
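
To illustrate how a few strong header tokens can swamp a short message,
here is a rough Python sketch using a plain naive-Bayes combination.
This is not SpamAssassin's actual combining code, and every probability
below is made up for illustration; combine(), body_tokens and ua_tokens
are hypothetical names.

```python
from math import prod  # Python 3.8+


def combine(probs):
    """Plain naive-Bayes combination of per-token spam probabilities.

    Simplified stand-in for illustration only; SpamAssassin's real
    combiner works differently.
    """
    spam = prod(probs)
    ham = prod(1.0 - p for p in probs)
    return spam / (spam + ham)


# Hypothetical per-token probabilities for a short ham message body.
body_tokens = [0.40, 0.45, 0.35, 0.50]
# Hypothetical probabilities learned for *UA (X-Mailer) tokens.
ua_tokens = [0.99, 0.99, 0.98]

with_ua = combine(body_tokens + ua_tokens)   # header tokens included
without_ua = combine(body_tokens)            # header tokens ignored

print("with *UA tokens:    %.4f" % with_ua)
print("without *UA tokens: %.4f" % without_ua)
```

With only four mildly hammy body tokens, three near-certain header
tokens dominate the product, which is the effect described above.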

The Message-ID tokens are also low quality tokens.  Most of these
tokens are hapaxes that are never used by other messages.  These just
fill up the bayes database.  If the Message-ID tokens were processed
further, they might be more useful for bayes - e.g. replace
1234.56789 with a format like %4d.%5d, or throw out all of the
timestamp numbers and keep just the stuff after the @.
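
A minimal Python sketch of that kind of Message-ID normalization
follows.  The function name and the exact rules are assumptions made
for illustration, not anything SpamAssassin currently does.

```python
import re


def normalize_message_id(message_id):
    """Return (shape_of_local_part, domain) for a Message-ID value.

    Hypothetical helper: each run of digits left of the '@' becomes a
    %Nd placeholder recording its length, and the part after the '@'
    is kept, lowercased.
    """
    value = message_id.strip().strip("<>")
    if "@" in value:
        local, _, domain = value.rpartition("@")
    else:
        # No '@' present: treat the whole value as the local part.
        local, domain = value, ""
    shape = re.sub(r"\d+", lambda m: "%{}d".format(len(m.group())), local)
    return shape, domain.lower()


shape, domain = normalize_message_id("<1234.56789@example.com>")
print(shape, domain)
```

Either piece could then be used as a token: the shape groups IDs built
by the same software, and the domain drops the per-message numbers.
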
-jeff
