On Thu, 31 May 2007, Jérôme Charaoui wrote:
> I'm wondering whether it's worthwhile to use that kind of data to feed
> sa-learn, since a) a lot more spam than ham gets reported and b) most
> of the ham reported is mail that just moves between different Exchange
> mailboxes and never passes through the gateway.
>
> If indeed it's mostly useless (or maybe even harmful for the Bayes
> filter) then I was wondering if it would be more logical to have only
> the technical team feed the SPAM and HAM folders with proper messages
> (i.e. in the case of HAM, good mail that comes from an external source).
>
> In that case, I'm wondering if the fact that only specific users report
> SPAM and HAM could lead the Bayes filter to think that a message
> is more hammy or spammy depending on the recipient.
Use per-user filtering. Seriously. As you're aware, your users are
better at poisoning your Bayesian filter than any spammer could ever
be. There are three approaches:
1. Hold their hands, carefully combing over the reported false
positives/negatives and writing polite emails saying "Tut tut!
That's not actually spam!";
2. Only let your tech team tweak your filtering, which excludes a lot
of people (and a lot of input); or
3. Let people train their filter to their hearts' content, but only
their filter. If they want to report mail as spam, let them! If
it's not spam, so what? They're only harming themselves.
We have users who report all sorts of absurd stuff as spam, but I
don't care. If they think it's spam, then we'll do whatever's
reasonable to filter it. (In our case, we blacklist the sender for
that recipient and run the message through sa-learn.)
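For anyone wanting to script something similar, here is a rough sketch of
the two steps just described (per-recipient blacklist entry plus per-user
sa-learn). It is not our actual setup. The function names are made up for
illustration, and it assumes a per-user Bayes store that sa-learn selects
with --username (which applies to SQL-backed setups) and a per-user prefs
file that accepts blacklist_from lines. It only builds the command and the
prefs line; it does not run or write anything.

```python
import shlex


def sa_learn_command(recipient, message_path, as_spam=True):
    """Build (but do not run) an sa-learn call for one recipient.

    Assumption: the per-user Bayes data is selected with --username;
    adjust this for however your site stores per-user data.
    """
    kind = "--spam" if as_spam else "--ham"
    return ["sa-learn", kind, f"--username={recipient}", message_path]


def blacklist_entry(sender):
    """Return a line for the recipient's own prefs (hypothetical layout)."""
    return f"blacklist_from {sender}"


if __name__ == "__main__":
    # Hypothetical user, path, and sender, for illustration only.
    cmd = sa_learn_command("alice", "/tmp/reported.eml")
    print(shlex.join(cmd))
    print(blacklist_entry("spammer@example.com"))
```

Because both steps are keyed on the recipient, a bad report only ever
touches that one user's filtering.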
At that volume, there's really no reason to be concerned about the
difference in the amount of spam and ham getting reported.
Chris St. Pierre
Unix Systems Administrator
Nebraska Wesleyan University