On Sun, Feb 22, 2004, James Gregory wrote:
> Anyway, I've gone back to almost 100% accuracy by deleting all but the
> last year's worth of spam from my junkmail folder and rebuilding
> bogofilter's database. So it seems to me that because both my
> legitimate email-senders and junkmail changes over time (and people's
> writing style I guess), the database should reflect that and decay
> input based on age.

The alternative is that you distorted the probability estimates. By training bogofilter on all of your spam but only a fraction of your non-spam, you made it think that the prior probability of a message being spam is very high, much higher than the actual ratio of spam:non-spam you receive. I haven't returned to the sources to check this, but I believe that the Bayes estimates do take into account the prior probability of any message X (or "bag of words X", since it assumes words are independent) falling into category Y.

For what it's worth, I train SpamAssassin's Bayesian filter on *all* known spam and non-spam messages I receive (I only let it train on read messages, and I move any false negatives or positives into the right folders). I find that it currently has a false negative (spam in my inbox) about once every two or three days (out of approx 150 spams received in that period) and a false positive (legitimate mail marked as spam) about once a month. False positives are, without exception, solicited commercial email. Telling solicited and unsolicited commercial email apart is, as far as I can tell, simply a hard problem, and the best solution I have is whitelists.

One weakness of current implementations of Bayes's theorem for mail filtering is that they almost always have the training categories hard-coded as "spam" and "non-spam" or "good" and "bad". The theorem is capable of handling arbitrary numbers of categories (although you have to increase the size of the training data), so there's no reason why it couldn't have "spam", "viruses" and "good", or "spam", "viruses", "impersonal" and "personal". Most people really only care about "spam" vs "non-spam", but it sounds from your mail like a "spam"/"virus"/"non-spam" categorisation might work. Of course, at present (and perhaps inevitably) pattern matching for viruses is much, much more reliable than pattern matching for spam.

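To make the two points above concrete, here is a rough sketch of a textbook multinomial naive Bayes classifier. It is not the code used by bogofilter or SpamAssassin, whose actual estimates I haven't checked; the class name, the `build` helper and the toy messages are all made up for illustration. It assumes the prior for each category is simply the fraction of training messages in that category, which is exactly what gets distorted when you train on all the spam but only some of the good mail. Categories are just dictionary keys, so adding "virus" needs no code changes.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Toy multinomial naive Bayes over any number of categories (illustrative only)."""

    def __init__(self):
        self.doc_counts = Counter()              # training messages seen per category
        self.word_counts = defaultdict(Counter)  # word frequencies per category
        self.vocab = set()                       # every word seen in training

    def train(self, category, text):
        words = text.lower().split()
        self.doc_counts[category] += 1
        self.word_counts[category].update(words)
        self.vocab.update(words)

    def posteriors(self, text):
        """Return {category: P(category | text)} under the naive independence assumption."""
        words = text.lower().split()
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for cat, n_docs in self.doc_counts.items():
            # Prior: fraction of training messages in this category (an assumption).
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[cat].values())
            for w in words:
                # Add-one smoothing so an unseen word does not zero out a category.
                count = self.word_counts[cat][w] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            log_scores[cat] = score
        # Normalise the log scores into probabilities that sum to 1.
        top = max(log_scores.values())
        exp = {c: math.exp(s - top) for c, s in log_scores.items()}
        norm = sum(exp.values())
        return {c: v / norm for c, v in exp.items()}


def build(n_spam, n_good):
    """Hypothetical helper: train on n_spam spam and n_good good toy messages."""
    nb = NaiveBayes()
    for _ in range(n_spam):
        nb.train("spam", "cheap pills offer")
    for _ in range(n_good):
        nb.train("good", "meeting agenda offer")
    return nb


# "offer" is equally common in both categories, so only the prior can tip the result.
balanced = build(10, 10).posteriors("offer")["spam"]
skewed = build(100, 10).posteriors("offer")["spam"]

# A third category is just another label passed to train().
three_way = build(10, 10)
for _ in range(5):
    three_way.train("virus", "open attachment exe")
virus_scores = three_way.posteriors("attachment")

print(balanced, skewed)
print(virus_scores)
```

With equal amounts of training mail the ambiguous word "offer" leaves the classifier undecided, while the 100:10 training set pushes the same message strongly towards spam purely because of the prior. Whether a real filter behaves this way depends on whether it uses such a prior at all.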
-Mary
--
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html