Tony Meyer wrote:
Do the databases have roughly the same amount of ham and spam?  SpamBayes
works best if the ratio is close to 1:1.

Hmm, maybe not. The SPAM folder likely had more spam than the ham folder had non-spam.

These indicate that it's a weak spam clue if a message has just about any
URL in it (any URL with an http protocol, that has "com" or "www" in it).
Would you say that that is the case?  I would have thought that these would
be much closer to 0.5 (and therefore ignored).

Probably should be closer to .5 since a lot of the mailing lists I'm on include links. All my Slashdot E-mails end up in my SPAM folder as well.


This says that being in HTML is a moderately strong spam clue - would you
say that this is the case?

Probably accurate, since I rarely get HTML E-mail from my friends. Only some of my automatically generated mail, such as Google news alerts, is in HTML format.


These are very odd tokens to be such strong spam clues (I gather from the
subject of the example message that ham about racing cars is not uncommon).
Do you really get a lot of spam that would have "car"/"formula"/"friday"/etc
in it?

This looks like a training problem to me.  How many ham/spam have you
trained on?  Are there definitely no mistakes?

(The contrib/showclues.py script (in CVS or the forthcoming 1.1a1) displays
a table that has the spam/ham count for each clue as well as the score.  It
would be interesting to know what the counts for some of those clues were,
if you're able to run that).

Personally, I wouldn't continue to use a spam filter that continually gave
false positives <0.1 wink>.  You should get almost none, with a well trained
classifier.

So I'm guessing that my best bet is to re-train the spam filter with a more accurate representation of my E-mail?


I guess in hindsight the mistake I made is that my HAM training folder contained none of these types of messages. Those are the types I read and then delete, so they don't remain in my inbox for very long.
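
To illustrate why that matters, here is a rough sketch of how a per-token spam score can be derived from training counts. It is not SpamBayes' actual code: the function name, the example counts, and the default values for unknown_prob and strength are all assumptions made up for illustration. It just shows that a word seen only in spam scores very high, and drops once some ham containing it is trained as well.

```python
def token_spam_prob(spam_count, ham_count, n_spam, n_ham,
                    unknown_prob=0.5, strength=0.45):
    """Estimate a token's spam probability from training counts.

    unknown_prob and strength are illustrative defaults (assumptions),
    used to pull rarely seen tokens toward a neutral score.
    """
    # Fraction of spam and of ham messages that contain the token.
    spam_ratio = spam_count / n_spam if n_spam else 0.0
    ham_ratio = ham_count / n_ham if n_ham else 0.0
    total = spam_ratio + ham_ratio
    if total == 0:
        return unknown_prob  # never seen: treat as neutral
    raw = spam_ratio / total
    # Blend the raw estimate with the neutral prior, weighted by evidence.
    n = spam_count + ham_count
    return (strength * unknown_prob + n * raw) / (strength + n)


# Hypothetical counts: "formula" appears in 6 trained spam and 0 trained ham.
before = token_spam_prob(spam_count=6, ham_count=0, n_spam=1000, n_ham=200)

# Same token after also training 6 ham messages that contain it.
after = token_spam_prob(spam_count=6, ham_count=6, n_spam=1000, n_ham=200)

print("before retraining: %.3f" % before)
print("after retraining:  %.3f" % after)
```

With numbers like these the token goes from a strong spam clue to a ham clue, which is about what I'd expect once the racing and mailing-list messages are in the ham training set.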


-- Greg Gulik http://www.gulik.org/greg/ greg @ gulik.org

