I've been running Spambayes on my Linux server for my family and a couple of friends.
I trained it for each user using the current spam-free inbox as ham and a large folder of spam as the spam seed for each user.
Since then we have all been getting a LOT of false positives, mostly mailing list traffic. It's getting really annoying to have to go through my Junk folder several times a day, poring over spam to find a few non-spam messages.
What's most annoying is that I've been trying to teach Spambayes that the messages I find in the Junk folder are NOT spam, and I don't seem to be getting anywhere. In most cases similar messages continue to be classified as spam.
Here is a prime example. I get several Google news alerts and about 99% of these end up in my Junk folder. Rarely do they ever get past Spambayes.
Here are the headers of a recent message:
Return-Path: <[EMAIL PROTECTED]>
Received: from sproxy.google.com (sproxy.google.com [64.233.170.130])
by server.gagme.com (8.12.8/8.12.8) with ESMTP id j36EnIY3000455
for <[EMAIL PROTECTED]>; Wed, 6 Apr 2005 09:49:18 -0500
Received: by sproxy.google.com with SMTP id i51so652127rne
for <[EMAIL PROTECTED]>; Wed, 06 Apr 2005 07:49:21 -0700 (PDT)
Received: by 10.38.153.44 with SMTP id a44mr1934530rne;
Wed, 06 Apr 2005 07:49:21 -0700 (PDT)
Message-ID: <[EMAIL PROTECTED]>
Date: Wed, 06 Apr 2005 07:49:21 -0700 (PDT)
From: Google Alerts <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: Google Alert - formula 1
MIME-Version: 1.0
Content-Type: text/html; charset="UTF-8"
X-Spambayes-Classification: spam; 0.98
X-Spambayes-Evidence: '*H*': 0.01; '*S*': 0.98; 'manage': 0.07; 'team': 0.09;
'to:addr:greg': 0.15; 'alert': 0.16; 'winner': 0.16; 'create': 0.16;
'working': 0.17; 'year': 0.18; 'management': 0.21; 'see': 0.23; 'race': 0.25;
'past': 0.28; 'header:Received:3': 0.28; 'for': 0.34; 'the': 0.34;
'make': 0.37; 'old': 0.38; 'url:0': 0.38; 'this': 0.39; 'what': 0.39;
'been': 0.39; 'subject: ': 0.60; 'proto:http': 0.61; 'url:info': 0.62;
'to:no real name:2**0': 0.62; 'url:1': 0.62; 'url:asp': 0.63;
'url:com': 0.63; 'url:www': 0.64; 'san': 0.65; 'skip:n 10': 0.71;
'subject: - ': 0.73; 'content-type:text/html': 0.75; '...': 0.84;
'500': 0.84; 'car': 0.84; 'director': 0.84; 'happens': 0.84; 'marino': 0.84;
'old,': 0.84; 'topic': 0.84; 'url:ie': 0.84; 'url:oe': 0.84; 'ceo': 0.91;
'created': 0.91; 'formula': 0.91; 'friday': 0.91; 'previous': 0.91;
'url:news': 0.91; 'url:s': 0.91; 'url:remove': 0.95; 'alert.': 0.96;
'url:src': 0.96; 'brought': 0.97; 'url:http': 0.97
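To see which clues push the score up, the evidence header above can be sorted by probability. Here is a rough Python sketch. It assumes the header value has been joined into one string and that every clue is single-quoted as shown above. The names parse_evidence and CLUE_RE are made up for illustration and are not part of Spambayes.

```python
import re

# Matches one 'token': probability pair as it appears in the header above.
# Assumption: tokens are always single-quoted; other quoting is not handled.
CLUE_RE = re.compile(r"'((?:[^'\\]|\\.)*)':\s*([01](?:\.\d+)?)")

def parse_evidence(header_value):
    """Return (token, probability) pairs, strongest spam clues first."""
    clues = [(tok, float(prob)) for tok, prob in CLUE_RE.findall(header_value)]
    # As I understand it, '*H*' and '*S*' are the combined ham and spam
    # indicators rather than individual word clues, so leave them out.
    clues = [(t, p) for t, p in clues if t not in ("*H*", "*S*")]
    return sorted(clues, key=lambda c: c[1], reverse=True)

# A few clues copied from the message above.
sample = ("'*H*': 0.01; '*S*': 0.98; 'manage': 0.07; 'alert': 0.16; "
          "'subject: - ': 0.73; 'formula': 0.91; 'url:http': 0.97")

for token, prob in parse_evidence(sample)[:3]:
    print(f"{prob:.2f}  {token}")
```

In this message a lot of the high-scoring clues are url: tokens and ordinary news words such as 'formula' and 'car'.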
Here is the procedure I use to re-train the Spambayes database. I forward the message as an attachment to a special mailbox. There, a script pulls the original message out (I've verified that it is indeed identical to what I received) and runs it through the following command line to train it as non-spam (ham):
/usr/bin/sb_filter.py -d /home/greg/.hammiedb -g < message-file.txt
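For anyone who wants to reproduce this, the extraction step looks roughly like the Python sketch below. It is not my actual script. The function names are hypothetical, it assumes the forward arrives as a message/rfc822 attachment, and the only external call is the same sb_filter.py command line shown above.

```python
import email
import subprocess

def extract_forwarded(raw_bytes):
    """Return the bytes of the first message/rfc822 attachment, or None."""
    outer = email.message_from_bytes(raw_bytes)
    for part in outer.walk():
        if part.get_content_type() == "message/rfc822":
            # A message/rfc822 part carries exactly one embedded message.
            inner = part.get_payload(0)
            # Note: re-serializing may not be byte-for-byte identical to
            # the original, so compare the output against the real message.
            return inner.as_bytes()
    return None

def train_as_ham(message_bytes, db_path="/home/greg/.hammiedb"):
    """Feed one message to sb_filter.py on stdin with -g (train as ham)."""
    subprocess.run(
        ["/usr/bin/sb_filter.py", "-d", db_path, "-g"],
        input=message_bytes,
        check=True,  # raise if sb_filter.py exits with an error
    )
```

The script would call extract_forwarded() on each message in the special mailbox and pass any non-None result to train_as_ham().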
Am I doing something wrong??
This is getting very annoying and I sometimes feel I'd be better off not running it.
BTW, I know several other people who run Spambayes, and they all complain about excessive false positives, though none of their setups appears to be as bad as mine.
-- Greg Gulik http://www.gulik.org/greg/ greg @ gulik.org
