> I trained it for each user using the current spam-free inbox
> as ham and a large folder of spam as the spam seed for each user.

Do the databases have roughly the same amount of ham and spam?  SpamBayes works best if the ratio is close to 1:1.

[...]
> Here is a prime example. I get several Google news alerts
> and about 99% of these end up in my Junk folder. Rarely do they
> ever get past Spambayes.
[...]
> 'proto:http': 0.61;
> 'url:info': 0.62;
> 'url:com': 0.63;
> 'url:www': 0.64;

These indicate that it's a weak spam clue if a message has just about any URL in it (any URL with an http protocol that has "com" or "www" in it).  Would you say that that is the case?  I would have thought that these would be much closer to 0.5 (and therefore ignored).

> 'content-type:text/html': 0.75;

This says that being in HTML is a moderately strong spam clue - would you say that this is the case?

> 'car': 0.84;
> 'marino': 0.84;
> 'old,': 0.84;
> 'topic': 0.84;
> 'formula': 0.91;
> 'friday': 0.91;
> 'previous': 0.91;
> 'brought': 0.97;

These are very odd tokens to be such strong spam clues (I gather from the subject of the example message that ham about racing cars is not uncommon).  Do you really get a lot of spam that would have "car"/"formula"/"friday"/etc in it?

This looks like a training problem to me.  How many ham and spam messages have you trained on?  Are there definitely no mistakes in the training data?

(The contrib/showclues.py script, in CVS or the forthcoming 1.1a1, displays a table that has the spam/ham count for each clue as well as the score.  It would be interesting to know what the counts for some of those clues were, if you're able to run that.)

> BTW, I know several other people who run Spambayes and they're all
> complaining about excessive false positives but none of them
> appear to be as bad as mine.

Personally, I wouldn't continue to use a spam filter that continually gave false positives <0.1 wink>.  You should get almost none with a well-trained classifier.

=Tony.Meyer

-- 
Please always include the list ([email protected]) in your replies (reply-all), and please don't send me personal mail about SpamBayes.
http://www.massey.ac.nz/~tameyer/writing/reply_all.html explains this.
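
As a rough illustration of why the spam/ham counts behind each clue matter, here is a minimal sketch of how a per-token score can be derived from training counts and then filtered by its distance from 0.5. It is not the SpamBayes source: the function names `token_score` and `is_used` are hypothetical, and the constants (0.45, 0.5, 0.1) are assumed to match the default classifier options.

```python
# Sketch only: names are hypothetical, constants are assumed defaults.
UNKNOWN_WORD_STRENGTH = 0.45   # S: weight given to the prior
UNKNOWN_WORD_PROB = 0.5        # X: score assumed for an unseen token
MINIMUM_PROB_STRENGTH = 0.1    # clues closer to 0.5 than this are ignored


def token_score(spamcount, hamcount, nspam, nham,
                s=UNKNOWN_WORD_STRENGTH, x=UNKNOWN_WORD_PROB):
    """Score a token from the number of trained spam/ham messages containing it.

    nspam and nham are the total numbers of spam and ham messages trained.
    """
    n = spamcount + hamcount
    if n == 0:
        # Never seen in training: fall back to the prior.
        return x
    # Use per-class ratios so an unbalanced database is partly compensated for.
    spamratio = spamcount / float(nspam or 1)
    hamratio = hamcount / float(nham or 1)
    raw = spamratio / (hamratio + spamratio)
    # Pull the raw estimate towards the prior when there is little evidence.
    return (s * x + n * raw) / (s + n)


def is_used(score, min_strength=MINIMUM_PROB_STRENGTH):
    """Return True if the clue is far enough from 0.5 to count."""
    return abs(score - 0.5) >= min_strength


if __name__ == "__main__":
    # A token seen in a handful of spam and never in ham scores high
    # even though the evidence is thin.
    for spamcount in (1, 2, 7):
        print(spamcount, round(token_score(spamcount, 0, 1000, 1000), 2))
    # A token seen equally often in both classes stays at 0.5 and is ignored.
    print(is_used(token_score(10, 10, 100, 100)))
```

Under these assumed constants, a token seen in one spam and no ham scores about 0.84, and one seen in two spam and no ham about 0.91. Scores like those in the quoted clue list can therefore come from very small counts, which is why the real counts from the clue table would be worth checking.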
