On Sun, Feb 22, 2004, James Gregory wrote:
> Anyway, I've gone back to almost 100% accuracy by deleting all but the
> last year's worth of spam from my junkmail folder and rebuilding
> bogofilter's database. So it seems to me that because both my
> legitimate email-senders and junkmail changes over time (and people's
> writing style I guess), the database should reflect that and decay
> input based on age.

An alternative explanation is that you had distorted the probability
estimates by making bogofilter think that the prior probability of
things being spam is very high, much higher than the actual ratio of
spam:non-spam you receive, by training it on all of your spam but only
a fraction of the non-spam you receive.

I haven't returned to the sources to check this, but I believe that the
Bayes estimates do take into account the prior probability of any message X
(or "bag of words X" since it assumes words are independent) falling
into category Y.
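
To make the effect of the prior concrete, here is a minimal sketch of
the textbook two-category form of Bayes's theorem. It is not checked
against bogofilter's sources, and bogofilter's actual calculation may
well differ. The function name and the likelihood values are made up
for illustration only.

```python
def posterior_spam(likelihood_spam, likelihood_ham, prior_spam):
    """Return P(spam | X) for two categories using Bayes's theorem.

    likelihood_spam and likelihood_ham are P(X | spam) and P(X | non-spam);
    prior_spam is the fraction of training messages that were spam.
    All names and numbers here are hypothetical, for illustration.
    """
    prior_ham = 1.0 - prior_spam
    joint_spam = likelihood_spam * prior_spam
    joint_ham = likelihood_ham * prior_ham
    return joint_spam / (joint_spam + joint_ham)


# A message whose words are equally likely under either category.
# With balanced training the evidence leaves the filter undecided ...
balanced = posterior_spam(0.01, 0.01, prior_spam=0.5)

# ... but if 90% of the training corpus was spam, the same message
# now looks like probable spam purely because of the prior.
skewed = posterior_spam(0.01, 0.01, prior_spam=0.9)

print(round(balanced, 3), round(skewed, 3))
```

With identical word evidence, the second result is higher only because
the training corpus contained proportionally more spam.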

For what it's worth, I train SpamAssassin's Bayesian filter on *all*
known spam and non-spam messages I receive. (I only let it train on read
messages, and I move any false negatives or positives into the right
folders.) It currently has a false negative (spam in my inbox) about
once every two or three days, out of approx 150 spams received in that
period, and a false positive (legitimate mail marked as spam) about once
a month. False positives are, without exception, solicited commercial
email.

Telling solicited and unsolicited commercial email apart is, as far as I
can tell, simply a hard problem, and the best solution I have is
whitelists.

One weakness of current implementations of Bayes's theorem for mail
filtering is that they almost always have the training categories
hard-coded as "spam" and "non-spam" or "good" and "bad". The theorem is
capable of handling arbitrary numbers of categories (although you have to
increase the size of the training data) so there's no reason why it
couldn't have "spam", "viruses" and "good" or "spam", "viruses",
"impersonal" and "personal". Most people really care about "spam" vs
"non-spam" but it sounds from your mail like a "spam"/"virus"/"non-spam"
categorisation might work.
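
As a rough illustration of the multi-category idea, here is a small
naive Bayes sketch that accepts any number of category labels. It is
not how bogofilter or SpamAssassin are implemented; the class name, the
add-one smoothing and the toy training messages are all assumptions
chosen to keep the example short.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Toy naive Bayes classifier with an arbitrary set of categories."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # category -> word -> count
        self.message_counts = Counter()          # category -> messages seen
        self.vocabulary = set()

    def train(self, category, words):
        self.message_counts[category] += 1
        self.word_counts[category].update(words)
        self.vocabulary.update(words)

    def classify(self, words):
        total_messages = sum(self.message_counts.values())
        scores = {}
        for category, seen in self.message_counts.items():
            # Log prior: the share of training messages in this category.
            score = math.log(seen / total_messages)
            total_words = sum(self.word_counts[category].values())
            for word in words:
                # Add-one smoothing so unseen words do not zero the score.
                count = self.word_counts[category][word]
                score += math.log(
                    (count + 1) / (total_words + len(self.vocabulary))
                )
            scores[category] = score
        return max(scores, key=scores.get)


# Hypothetical three-way training set: nothing is hard-coded to two labels.
nb = NaiveBayes()
nb.train("spam", "cheap pills buy now".split())
nb.train("virus", "open attached file exe".split())
nb.train("good", "meeting agenda for monday".split())

print(nb.classify("buy cheap pills".split()))
```

Adding a fourth category such as "impersonal" is just another train()
call with a new label, though each extra category needs its own share
of training messages before the estimates mean much.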

Of course, at present (and perhaps inevitably) pattern matching for
viruses is much, much more reliable than pattern matching for spam.

-Mary
-- 
SLUG - Sydney Linux User's Group Mailing List - http://slug.org.au/
Subscription info and FAQs: http://slug.org.au/faq/mailinglists.html
