Miles Keaton wrote to [EMAIL PROTECTED]:

> I work at a webhosting company that hosts a few thousand domains for people.

> Anyone done this kind of setup successfully for thousands of users?

Yes.

> We use Qmail + Vpopmail (+ Squirrelmail, if they want to read their
> mail on the web) - on FreeBSD computers.

Sounds reasonable.

> We'd like to help block all possible spam from our clients' boxes, of
> course, dumping it all into a [EMAIL PROTECTED] box - if they
> want to go look at it.

That should be relatively easy. We do something similar using MIMEDefang.
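
As a sketch of the delivery side (not our exact setup), a qmail/vpopmail install can divert tagged messages into a per-domain spam mailbox with a maildrop filter; the header test is SpamAssassin's standard `X-Spam-Flag`, but the mailbox path here is purely illustrative:

```
# Illustrative maildrop filter (e.g. invoked from a .qmail-default);
# assumes SpamAssassin has already scanned and tagged the message.
if (/^X-Spam-Flag: *YES/)
    to "/home/vpopmail/domains/example.com/spam/Maildir/"
```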

> Because many of our clients like to use the webmail, we can't rely on
> their home-PC client-side email app to catch the spam for them.

Right. We made some local additions to our SquirrelMail install to cope better with spam, so, by default, it works better than a regular email client with no local filtering set up.

> What's the best plan for this kind of setup?

Be conservative. You're filtering for thousands of users, who, if you're doing general web hosting, will come from very different walks of life, and have conflicting opinions on what constitutes spam.

> I'm assuming:

> - clamd first kills all emails with viruses

Yes.

> - a system-wide spam-filter kills all positively-spam emails with a
> very high spam-probability count (or coming from known spam servers,
> etc)

Check with your users first; some more conservative users might reasonably object to this. Many others will beg you to do it. Your best bet is to configure this on a per-user basis, and ensure that *all* mail is recoverable for a limited period of time (7-10 days minimum). The last thing you want to do is lose a morning's worth of everyone's mail when your filter goes nuts. :-)
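
One way to make the aggressive kill-threshold opt-in is SpamAssassin's per-user preference files; the values below are illustrative, not a recommendation:

```
# ~/.spamassassin/user_prefs -- per-user overrides (illustrative values)
required_score 5.0
# A conservative user could raise the bar before mail is treated as spam:
# required_score 8.0
```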

> - a user-controlled, user-taught Bayesian filter catches spams they've
> taught their preferences to catch.

Personally, I'd recommend against user Bayes. *Theoretically*, you'll
get more accurate Bayes results from individually trained filters... in practice, however, that's often not the case, because:


1) Your customers will be lazy. They'll accidentally classify ham as
   spam and vice-versa. Ironically, as diverse and incompatible as your
   clients might be, you might well get *higher* accuracy by training
   the filter yourself, based on ham and spam that you recover, and by
   tweaking your rules to do lots of site-wide autolearning.

2) Individual email accounts and even domain-wide accounts for semi-busy
   domains see very low volumes of email from a Bayes training
   perspective. Learning will be sub-optimal.

   a) Partly because of #2, your users will always be struggling to
      keep up with the latest spam. With a site-wide configuration, your
      users benefit from the (auto)learning that has already been done.
      With site-wide autolearning, we've actually watched the Bayes and
      AWL scores of the same spam rise steadily as it hits one account
      after another.

3) Time. If you ask your users to spend time hand-classifying messages
   to train their filter, how much time does that really save them over
   just looking at the spam? Conversely, if you spend a few hours a week
   training a site-wide filter, you can easily save *hundreds* of
   person-hours for your users. As your userbase grows, this becomes a
   *huge* win. That's what I'd sell as one awesome value-added benefit.

4) As discussed, there is no guarantee that user Bayes will necessarily
   improve Bayes accuracy in practical applications. Further, even if
   there *is* a difference in accuracy between site-wide and user Bayes,

   a) It probably isn't going to be a huge difference, even for
      relatively "diverse" user bases.

   b) The Bayes classifier is one (admittedly very helpful) test in a
      wide array of checks that SpamAssassin uses to classify email.
      Asking users to spend their time on training just won't be worth
      it, time-wise.

So, given the above, I endorse site-wide Bayes for almost all large
installations. The practical advantage of per-user Bayes is highly
debatable, and, even if there *is* a practical advantage, it is probably
small, and not worth the time required.
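
For reference, a site-wide autolearning setup in SpamAssassin looks roughly like this; the thresholds are illustrative, so check the Mail::SpamAssassin::Conf documentation for sensible values:

```
# /etc/mail/spamassassin/local.cf -- illustrative site-wide Bayes setup
use_bayes 1
bayes_auto_learn 1
# Only autolearn from messages the other rules are very sure about:
bayes_auto_learn_threshold_nonspam 0.1
bayes_auto_learn_threshold_spam    12.0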

You *can* encourage manual mistake-based training, though it might not
be necessary. Configure SquirrelMail with a button for every message. If
it was auto-classified as ham, put in a "No, this is spam!" button, and
vice-versa for spam. If the score is borderline, learn the message on
the spot. If the score is highly contradictory (e.g., the message scored
50 points and they click "No, this is ham!"), have it go to an admin for
review, to curb accidental or malicious Bayes misclassifications.
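
The button logic above can be sketched as a small decision function; the thresholds and return values here are invented for illustration, and the actual learning step would be whatever your SquirrelMail plugin does (e.g. piping the message to `sa-learn --spam` or `sa-learn --ham`):

```python
# Hypothetical sketch of the correction-button logic described above.
# REQUIRED_SCORE mirrors SpamAssassin's required_score setting;
# CONTRADICTORY_GAP is an invented knob for "the user's verdict wildly
# disagrees with the score".

REQUIRED_SCORE = 5.0
CONTRADICTORY_GAP = 20.0

def correction_action(score, user_says_spam):
    """Decide what to do when a user clicks 'No, this is spam/ham!'."""
    if user_says_spam and score <= REQUIRED_SCORE - CONTRADICTORY_GAP:
        # Scored very hammy, yet the user insists it's spam:
        # hold for admin review rather than training blindly.
        return "admin-review"
    if not user_says_spam and score >= REQUIRED_SCORE + CONTRADICTORY_GAP:
        # e.g. the message scored 50 points and they click
        # "No, this is ham!" -- also held for admin review.
        return "admin-review"
    # Borderline or plausible correction: learn the message on the spot.
    return "learn"
```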

If your FP/FN rates are already low (as they should be), the buttons
should seldom need to be clicked, and your users get a warm-fuzzy
feeling *once in a while*.

Trained conservatively site-wide, you'll gain the benefit (and
responsibility!) of your users trusting you to do the right thing, and
your users will enjoy not having to look at spam or worry about false
positives. Isn't that the whole point?

> ... or am I approaching this wrong?

> Any benefit to having two different kinds of spam filters, so that one
> (SpamAssassin?) goes system-wide, and another (Dspam?) is the
> user-trained one?

Interesting idea; it would let users who really *want* to do their own training do so, while users who don't want to invest that time still benefit from your well-trained site-wide installation. Personally, though, I'd stay site-wide to ease administration.

> Anyone done this kind of setup successfully for thousands of users?
>
> Any suggestions or pointers to articles appreciated.

- Ryan

--
  Ryan Thompson <[EMAIL PROTECTED]>

  SaskNow Technologies - http://www.sasknow.com
  901-1st Avenue North - Saskatoon, SK - S7K 1Y4

        Tel: 306-664-3600   Fax: 306-244-7037   Saskatoon
  Toll-Free: 877-727-5669     (877-SASKNOW)     North America
