Miles Keaton wrote to [EMAIL PROTECTED]:
I work at a webhosting company that hosts a few thousand domains for people.
Anyone done this kind of setup successfully for thousands of users?
Yes.
We use Qmail + Vpopmail (+ Squirrelmail, if they want to read their mail on the web) - on FreeBSD computers.
Sounds reasonable.
We'd like to help block all possible spam from our client's boxes, of course, dumping it all into a [EMAIL PROTECTED] box - if they want to go look at it.
That should be relatively easy. We do something similar using MIMEDefang.
Because many of our clients like to use the webmail, we can't rely on their home-PC client-side email app to catch the spam for them.
Right. We made some local additions to our SquirrelMail install to better cope with spam filtering, so, by default, it works better than a regular email client without any local filtering set up.
What's the best plan for this kind of setup?
Be conservative. You're filtering for thousands of users, who, if you're doing general web hosting, will come from very different walks of life, and have conflicting opinions on what constitutes spam.
I'm assuming:
- clamd first kills all emails with viruses
Yes.
- a system-wide spam-filter kills all positively-spam emails with a very high spam-probability count (or coming from known spam servers, etc)
Check with your users first; some more conservative users might reasonably object to this. Many others will beg you to do it. Your best bet is to configure this on a per-user basis, and ensure that *all* mail is recoverable for a limited period of time (7-10 days minimum). The last thing you want to do is lose a morning's worth of everyone's mail when your filter goes nuts. :-)
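To make the per-user and recoverability points concrete, here is a minimal sketch. Every name and number in it is hypothetical (this is not our actual code): each user may opt in to a quarantine threshold, nothing is deleted at filter time, and quarantined mail only becomes eligible for purging after a retention window.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Hypothetical value, picked from the "7-10 days minimum" advice above.
RETENTION = timedelta(days=10)


@dataclass
class UserPolicy:
    # None means the user has opted out of high-score quarantining.
    quarantine_score: Optional[float] = None


def disposition(score: float, policy: UserPolicy) -> str:
    """Return 'inbox' or 'quarantine'; mail is never discarded here."""
    if policy.quarantine_score is not None and score >= policy.quarantine_score:
        return "quarantine"
    return "inbox"


def is_expired(quarantined_at: datetime, now: datetime) -> bool:
    """Quarantined mail may be purged only after the retention window."""
    return now - quarantined_at > RETENTION
```

A conservative user keeps the default `UserPolicy()` and sees everything in the inbox; a user who begs for filtering gets something like `UserPolicy(quarantine_score=15.0)`.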
- a user-controlled, user-taught Bayesian filter catches spams they've taught their preferences to catch.
Personally, I'd recommend against user Bayes. *Theoretically*, you'll get more accurate Bayes results from individually trained filters... in practice, however, that's often not the case, because:
1) Your customers will be lazy. They'll accidentally classify ham as spam and vice-versa. Ironically, as diverse and incompatible as your clients might be, you might well have *higher* accuracy by training yourself, based on ham and spam that you recover, and tweaking your rules so that you do lots of site-wide autolearning.
2) Individual email accounts and even domain-wide accounts for semi-busy domains see very low volumes of email from a Bayes training perspective. Learning will be sub-optimal.
a) Partly because of #2, your users will always be struggling to keep up with the latest spam. With a site-wide configuration, your users benefit from the (auto)learning that has already been done. With site-wide autolearning, we've actually watched the Bayes and AWL scores of the same spam rise steadily as it hits one account after another.
3) Time. By asking your users to spend time hand-classifying and training their filter, how does this save them from having to look at spam? Conversely, if you spend a few hours a week training a site-wide filter, you can easily save *hundreds* of person hours for your users. As your userbase increases in size, this becomes a *huge* win. That's what I'd sell as one awesome value-added benefit.
4) As discussed, there is no guarantee that user Bayes will necessarily improve Bayes accuracy in practical applications. Further, even if there *is* a difference in accuracy between site-wide and user Bayes,
a) It probably isn't going to be a huge difference, even for relatively "diverse" user bases.
b) The Bayes classifier is one (admittedly very helpful) test in a wide array of checks that SpamAssassin uses to classify email. Asking users to spend time doing training won't be worth it, time-wise.
So, given the above, I endorse site-wide Bayes for almost all large installations. The practical advantage of per-user Bayes is highly debatable, and, even if there *is* a practical advantage, it is probably small, and not worth the time required.
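For anyone unfamiliar with autolearning, the idea behind the site-wide approach can be pictured as a simple threshold rule: only messages the filter is already very sure about get fed back into the shared Bayes database. The sketch below is a simplification with hypothetical names and threshold values; real autolearning in SpamAssassin applies more conditions than a single score comparison.

```python
from typing import Optional

# Hypothetical thresholds; tune them conservatively for your own site.
AUTOLEARN_SPAM_MIN = 12.0  # learn as spam only at or above this score
AUTOLEARN_HAM_MAX = 0.1    # learn as ham only at or below this score


def autolearn_label(score: float) -> Optional[str]:
    """Return 'spam', 'ham', or None (too uncertain to learn from)."""
    if score >= AUTOLEARN_SPAM_MIN:
        return "spam"
    if score <= AUTOLEARN_HAM_MAX:
        return "ham"
    return None
```

The wide gap between the two thresholds is deliberate: anything in the middle is left alone, so one questionable message can't poison the database for every user.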
You *can* encourage manual mistake-based training, though it might not be necessary. Configure SquirrelMail with a button for every message. If it was auto-classified as ham, put in a "No, this is spam!" button, and vice-versa for spam. If the score is borderline, learn the message on the spot. If the score is highly contradictory (e.g., the message scored 50 points and they click "No, this is ham!"), have it go to an admin for review, to curb accidental or malicious Bayes misclassifications.
If your FP/FN rates are already low (as they should be), the buttons should seldom need to be clicked, and your users get a warm-fuzzy feeling *once in a while*.
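The button logic described above could be sketched roughly as follows. The function name, the cutoff, and the "borderline" margin are all assumptions for illustration; pick values that match your own scoring.

```python
# Hypothetical values; adjust to your own site-wide configuration.
SPAM_THRESHOLD = 5.0     # assumed score at which mail is classified as spam
BORDERLINE_MARGIN = 5.0  # distance from the cutoff still treated as borderline


def route_feedback(score: float, user_says_spam: bool) -> str:
    """Decide what to do with a button click: 'ignore', 'learn', or 'review'."""
    auto_says_spam = score >= SPAM_THRESHOLD
    if auto_says_spam == user_says_spam:
        # The user agrees with the filter, so there is nothing to correct.
        return "ignore"
    if abs(score - SPAM_THRESHOLD) <= BORDERLINE_MARGIN:
        # Borderline score: trust the user and learn on the spot.
        return "learn"
    # Highly contradictory click: queue it for an admin to look at.
    return "review"
```

With these numbers, a message that scored 50 and gets a "No, this is ham!" click goes to review, while one that scored 6 is relearned as ham immediately.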
Trained conservatively site-wide, you'll gain the benefit (and responsibility!) of your users trusting you to do the right thing, and your users will enjoy not having to look at spam or worry about false positives. Isn't that the whole point?
... or am I approaching this wrong?
Any benefit to having two different kinds of spam filters, so that one (SpamAssassin?) goes system-wide, and another (Dspam?) is the user-trained one?
Interesting idea; it might allow users who really *want* to do their own training to do so... where, if a user doesn't want to invest that time, he/she will still benefit from your well-trained site-wide installation. Personally, I'd just stay site-wide to ease administration.
Anyone done this kind of setup successfully for thousands of users?
Any suggestions or pointers to articles appreciated.
- Ryan
-- Ryan Thompson <[EMAIL PROTECTED]>
SaskNow Technologies - http://www.sasknow.com 901-1st Avenue North - Saskatoon, SK - S7K 1Y4
Tel: 306-664-3600 Fax: 306-244-7037 Saskatoon Toll-Free: 877-727-5669 (877-SASKNOW) North America
