Using sa-learn on an anti-spam gateway

2007-05-31 Thread Jérôme Charaoui
Hi,

I'm setting up a new anti-spam gateway for a fairly busy site (about 20k
messages a day) using Postfix/Amavis/SpamAssassin/ClamAV on a Debian
etch system that delivers incoming (ham) mail to an Exchange 2003
server.

Since the old gateway was using a similar setup, there are already SPAM
and HAM public mail folders which our users contribute to. The SPAM
folder usually gets a lot of (untagged) spam, about 500 every day, while
the HAM gets very little, and most of it is internal (within Exchange)
mail that never passes through the gateway.

I'm wondering whether it's worthwhile to use that kind of data to feed
sa-learn, since a) a lot more spam than ham gets reported and b) most
of the ham reported is mail that just moves within different Exchange
mailboxes and never passes through the gateway.
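
One way to weed out the internal-only ham before training would be to check each reported message for a Received header that names the gateway. A minimal Python sketch, assuming the gateway's hostname shows up in the Received headers and that those headers survive the copy into the Exchange public folder (worth verifying); `mx.example.org` and the function name are made up for illustration:

```python
from email import message_from_string


def passed_through_gateway(raw_message: str, gateway_host: str) -> bool:
    """Return True if any Received header mentions the gateway host."""
    msg = message_from_string(raw_message)
    needle = gateway_host.lower()
    # get_all returns every Received header, or the default list if none exist.
    for received in msg.get_all("Received", []):
        if needle in received.lower():
            return True
    return False


# Hypothetical sample messages, not real traffic.
external = (
    "Received: from mail.sender.example by mx.example.org (Postfix)\n"
    "Subject: quote request\n"
    "\n"
    "Hello\n"
)
internal = "Subject: lunch?\n\nNoon?\n"

print(passed_through_gateway(external, "mx.example.org"))  # True
print(passed_through_gateway(internal, "mx.example.org"))  # False
```

Only messages for which this returns True would then be handed to sa-learn as ham.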

If indeed it's mostly useless (or maybe even harmful for the Bayes
filter) then I was wondering if it would be more logical to have only
the technical team feed the SPAM and HAM folders with proper messages
(i.e. good mail that comes from an external source, in the case of HAM).

In that case, I'm wondering if the fact that only specific users report
SPAM and HAM could trigger the Bayes filter to think that a message
would be more hammy or spammy depending on the recipient.

In short, I'm looking for a way to feed sa-learn that's at least
minimally effective in a situation where only a little useful HAM is
being reported by our users at large.


-- 
Jérôme Charaoui [EMAIL PROTECTED]
Service informatique - Collège de Maisonneuve


Re: Using sa-learn on an anti-spam gateway

2007-05-31 Thread Chris St. Pierre

On Thu, 31 May 2007, Jérôme Charaoui wrote:


> I'm wondering whether it's worthwhile to use that kind of data to feed
> sa-learn, since a) a lot more spam than ham gets reported and b) most
> of the ham reported is mail that just moves within different Exchange
> mailboxes and never passes through the gateway.
>
> If indeed it's mostly useless (or maybe even harmful for the Bayes
> filter) then I was wondering if it would be more logical to have only
> the technical team feed the SPAM and HAM folders with proper messages
> (i.e. good mail that comes from an external source, in the case of HAM).
>
> In that case, I'm wondering if the fact that only specific users report
> SPAM and HAM could trigger the Bayes filter to think that a message
> would be more hammy or spammy depending on the recipient.


Use per-user filtering.  Seriously.  As you're aware, your users are
better at poisoning your Bayesian filter than any spammer could ever
be.  There are three approaches:

1.  Hold their hands, carefully combing over the reported false
positives/negatives and writing polite emails saying "Tut tut!
That's not actually spam!";

2.  Only let your tech team tweak your filtering, which excludes a lot
of people (and a lot of input); or

3.  Let people train their filter to their hearts' content, but only
their filter.  If they want to report mail as spam, let them!  If
it's not spam, so what?  They're only harming themselves.

We have users who report all sorts of absurd stuff as spam, but I
don't care.  If they think it's spam, then we'll do whatever's
reasonable to filter it.  (In our case, we blacklist the sender for
that recipient and run the message through sa-learn.)
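
The per-user training step could be sketched roughly as below. This is only an illustration, not our actual script: it assumes a Bayes backend that honours sa-learn's `--username` option (for example an SQL store; with file-based databases `--dbpath` may be what you need instead), and the report path and helper names are hypothetical.

```python
import subprocess


def sa_learn_command(kind: str, path: str, username: str) -> list:
    """Build an sa-learn command line that trains a single user's database."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    return ["sa-learn", "--" + kind, "--username", username, path]


def train(kind: str, path: str, username: str) -> None:
    # Runs sa-learn; raises CalledProcessError if it exits non-zero.
    subprocess.run(sa_learn_command(kind, path, username), check=True)


# Hypothetical per-user report directory.
print(sa_learn_command("spam", "/var/spool/reports/alice/spam", "alice"))
```

Because each report only touches the reporting user's database, a bad report cannot poison anyone else's filter.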

At that volume, there's really no reason to be concerned about the
difference in the amount of spam and ham getting reported.
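
What does matter is having enough of each: by default SpamAssassin's Bayes rules stay inactive until the database holds at least 200 spam and 200 ham messages (`bayes_min_spam_num` and `bayes_min_ham_num`). A rough Python sketch for checking this from `sa-learn --dump magic` output follows; the column layout it parses is an assumption about that output, and the sample numbers are invented for illustration.

```python
def bayes_counts(dump_text: str) -> dict:
    """Extract nspam and nham from `sa-learn --dump magic` output."""
    counts = {}
    for line in dump_text.splitlines():
        fields, sep, label = line.partition("non-token data:")
        if not sep:
            continue  # not a magic row
        label = label.strip()
        if label in ("nspam", "nham"):
            # Assumed layout: the value for magic rows is the third column.
            counts[label] = int(fields.split()[2])
    return counts


# Invented sample in the assumed format, not real measurements.
sample = (
    "0.000          0          3          0  non-token data: bayes db version\n"
    "0.000          0       5210          0  non-token data: nspam\n"
    "0.000          0        412          0  non-token data: nham\n"
)
counts = bayes_counts(sample)
print(counts)
# True when both classes reach the default minimum of 200 messages.
print(all(counts.get(k, 0) >= 200 for k in ("nspam", "nham")))
```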

Chris St. Pierre
Unix Systems Administrator
Nebraska Wesleyan University

LOPSA Sysadmin Days: Professional Training for Professional SysAdmins
August 6-7, Cherry Hill, NJ
http://lopsa.org/SysadminDays