RE: [Spambayes] Still getting tons of false positives...

Tony Meyer Wed, 06 Apr 2005 16:56:46 -0700

> I trained it for each user using the current spam-free inbox 
> as ham and a large folder of spam as the spam seed for each user.


Do the databases have roughly the same amount of ham and spam?  SpamBayes
works best if the ratio is close to 1:1.
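
As a rough illustration of that check, here is a minimal sketch that compares
the trained ham and spam counts.  The function name and the tolerance of 2.0
are hypothetical, chosen only for the example; they are not SpamBayes
settings.

```python
def is_balanced(nham, nspam, tolerance=2.0):
    """Return True if the ham:spam ratio is within `tolerance` of 1:1.

    `tolerance` is an illustrative assumption: 2.0 accepts anything
    between 1:2 and 2:1.
    """
    if nham <= 0 or nspam <= 0:
        raise ValueError("need at least one ham and one spam message")
    ratio = nham / nspam
    return 1.0 / tolerance <= ratio <= tolerance


# An inbox of 500 ham against 600 spam is close to 1:1 ...
print(is_balanced(500, 600))
# ... but 200 ham against a 5000-message spam folder is not.
print(is_balanced(200, 5000))
```

A small inbox trained against a large spam folder is the kind of setup that
fails this check.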

[...]
> Here is a prime example.  I get several Google news alerts 
> and about 99% of these end up in my Junk folder.  Rarely do they
> ever get past Spambayes.
[...]
> 'proto:http': 0.61; 
> 'url:info': 0.62;
> 'url:com': 0.63;
> 'url:www': 0.64;

These indicate that it's a weak spam clue if a message has just about any
URL in it (any URL with an http protocol, that has "com" or "www" in it).
Would you say that that is the case?  I would have thought that these would
be much closer to 0.5 (and therefore ignored).
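
To make "close to 0.5 and therefore ignored" concrete, here is a minimal
sketch that keeps only the clues far enough from 0.5 to count.  The 0.1
cutoff and the function name are assumptions for illustration; check your own
configuration for the value actually in use.

```python
def strong_clues(clues, min_strength=0.1):
    """Keep tokens whose score differs from 0.5 by at least min_strength.

    `min_strength` is an assumed cutoff for this example.
    """
    return {tok: p for tok, p in clues.items()
            if abs(p - 0.5) >= min_strength}


# The four URL clues quoted above, plus a made-up near-neutral token.
clues = {
    'proto:http': 0.61,
    'url:info': 0.62,
    'url:com': 0.63,
    'url:www': 0.64,
    'example-neutral': 0.55,   # hypothetical token, not from the message
}
for tok, p in sorted(strong_clues(clues).items()):
    print(tok, p)
```

With that cutoff all four URL clues are far enough from 0.5 to be used, which
is why they show up in the clue list at all; only the made-up 0.55 token is
dropped.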

> 'content-type:text/html': 0.75; 

This says that being in HTML is a moderately strong spam clue - would you
say that this is the case?

> 'car': 0.84;
> 'marino': 0.84;
> 'old,': 0.84;
> 'topic': 0.84;
> 'formula': 0.91;
> 'friday': 0.91; 
> 'previous': 0.91;
> 'brought': 0.97;

These are very odd tokens to be such strong spam clues (I gather from the
subject of the example message that ham about racing cars is not uncommon).
Do you really get a lot of spam that would have "car"/"formula"/"friday"/etc.
in it?

This looks like a training problem to me.  How many ham and how many spam
have you trained on?  Are you certain that none of the training messages are
in the wrong category?

(The contrib/showclues.py script (in CVS or the forthcoming 1.1a1) displays
a table that has the spam/ham count for each clue as well as the score.  It
would be interesting to know what the counts for some of those clues were,
if you're able to run that.)
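
To show why the counts matter, here is a minimal sketch of a Robinson-style
token score computed from per-token counts.  The function name and the
parameter values s=0.45 and x=0.5 are assumptions for illustration, and this
is not guaranteed to match what the classifier computes in every detail.

```python
def token_score(spamcount, hamcount, nspam, nham, s=0.45, x=0.5):
    """Estimate a token's spam probability from training counts.

    spamcount/hamcount: trained spam/ham messages containing the token.
    nspam/nham: total trained spam/ham messages.
    s, x: assumed strength and prior for rarely seen tokens.
    """
    if nspam <= 0 or nham <= 0:
        raise ValueError("need at least one ham and one spam message")
    n = spamcount + hamcount
    if n == 0:
        return x                      # never seen: fall back to the prior
    spamratio = spamcount / nspam
    hamratio = hamcount / nham
    raw = spamratio / (spamratio + hamratio)
    # Pull the raw estimate toward x when there is little evidence.
    return (s * x + n * raw) / (s + n)


# A token seen in one spam and no ham, with balanced training data:
print(round(token_score(1, 0, 1000, 1000), 2))
# The same token seen in two spam and no ham:
print(round(token_score(2, 0, 1000, 1000), 2))
```

Under these assumed parameters, a token seen in only one or two spam messages
and never in ham already scores in the 0.8 to 0.9 range, so a high score can
rest on very little evidence.  The counts would show whether that is what is
happening here.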

> BTW, I know several other people who run Spambayes and they're all 
> complaining about excessive false positives but none of them 
> appear to be as bad as mine.

Personally, I wouldn't continue to use a spam filter that continually gave
false positives <0.1 wink>.  With a well-trained classifier, you should get
almost none.

=Tony.Meyer

-- 
Please always include the list ([email protected]) in your replies
(reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.

