> I know there are theoretical reasons why this might make sense, but I don't
> see any benefit in the real world for scores like these. The high scores
> increase the chance of a random false positive - regardless of the size of
> the existing corpus - and if the negative ones indicate that the rules are
> useless, they should just be removed.

To me, -ve scores on tests can also be used to "offset" spammy-looking 
content in clean email.  I have several such rules of my own creation:

body CORRECT_FOR_EXCHANGE       /This message is in MIME format/
describe CORRECT_FOR_EXCHANGE   Correct for MIME 'null block'

body GROUPS_YAHOO               /http:\/\/groups\.yahoo\.com\/group\//
describe GROUPS_YAHOO           Yahoo Groups message list

header FWD_MSG                  Subject =~ /\[?Fwd?:?\s*/
describe FWD_MSG                Forwarded email

header GROUPS_MSN               Message-Id =~ /.*\@groups\.msn\.com/
describe GROUPS_MSN             MSN Groups Message List

body MAILBITS_EMAIL             /This is a free service provided by 

body HOTMAIL_FOOTER1            /Send and receive Hotmail on your mobile device: /
body HOTMAIL_FOOTER2            /Get your FREE download of MSN Explorer at /
body HOTMAIL_FOOTER3            /Get Your Private, Free E-mail from MSN Hotmail at http:\/\/www\.hotmail\.com\./
body HOTMAIL_FOOTER4            /Join the world.s largest e-mail service with MSN Hotmail\./
body HOTMAIL_FOOTER5            /Chat with friends online, try MSN Messenger:/
body MSN_FOOTER1                /MSN Photos is the easiest way to share and print your photos: /
body MSN_FOOTER2                /Remove my e-mail address from Gaming Zone /

These are all assigned mid-size (-1 to -2.4) negative scores to try to 
counteract some of the +ve scored tests that these emails trigger.  IMO a 
-ve score doesn't show that the test is bad, but rather that it is a test 
for NON-spam email.
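
As a rough sketch, the score lines for rules like these might look something 
like the following.  The individual values are hypothetical, picked only to 
fall inside the -1 to -2.4 range mentioned above; they are not the actual 
settings:

# illustrative values only (assumed, not taken from a real config)
score CORRECT_FOR_EXCHANGE      -1.0
score GROUPS_YAHOO              -2.0
score HOTMAIL_FOOTER1           -1.5

With the default threshold of 5.0, a clean Hotmail message that picked up, 
say, 5.6 points from +ve tests would come out at 4.1 once HOTMAIL_FOOTER1 
fired, and so would no longer be tagged as spam.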

> Anyway, I still have a sneaking suspicion that there are a few thousand
> messages from the spamassassin-talk mailing list (talking about spam, and
> sometimes quoting it) in the non-spam corpus.

Very likely.  I am maintaining a folder of mis-detected email (non-spam 
detected as spam) so I can run these into the GA and help out with the 
"hairy-assed edge" of spam and nonspam.  :-)


Spamassassin-talk mailing list