> Forget about this. Most of your users will only report spam,
> not ham, so they're going to screw up the Bayes database. As a
> consequence, you'll have more spam, or more FPs.
>
> You should find another solution or educate your users (but
> it takes too much time) so they feed the Bayesian filters correctly.
I've heard this many times, but my experience thus far hasn't borne it out. We've got SA w/Bayes running site-wide on our 400-user system, and BAYES_99 is consistently our highest-scoring test system-wide, even outscoring the various SBL and URIBL tests. That said, the Ham corpus is entirely my own; I don't bother to have my users submit anything but Spam. This works surprisingly well, so I guess I have good Ham. :)

My method is simple and fairly manual. I have my users put Spam in an Exchange Public Folder (substitute a shared IMAP folder if you're using a more standard e-mail server), and I copy the messages down into a local MBOX. Thunderbird is handy for this. I upload the MBOX file to the SA server, run sa-learn, and it's done. Initially I had to do this fairly often, but once I had it well trained and enough SARE rules in place it became less of an issue. I now run it only every other month or so.

Bayes covers a number of corner cases that aren't covered by rules, so it's an important part of my overall strategy. It's also handy for training in new spam that hasn't hit the URIBLs or other rules yet, which is much easier than writing custom rules. Bayes hasn't given any false positives that I'm aware of in the last year, despite the theoretical skew that ought to be introduced by using everyone's Spam and only my Ham. I cannot tell you why, but it works and it works well.

Aaron Grewell
Network Administrator
University of Washington Bothell
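
For anyone who wants to script the "upload the MBOX, run sa-learn" step described above, here is a minimal sketch in Python. It is not the exact procedure used in the post, only one way it could look. The function names (`count_messages`, `build_sa_learn_command`, `train`) are made up for illustration, and it assumes `sa-learn` is on the PATH of the SA server and that the Bayes database in use is the one belonging to the account running the script. The `--spam`, `--ham` and `--mbox` options are standard sa-learn options. By default the sketch only reports what it would run.

```python
import mailbox
import subprocess


def count_messages(mbox_path):
    """Return the number of messages in an existing mbox file."""
    # create=False: fail instead of silently creating an empty mailbox.
    box = mailbox.mbox(str(mbox_path), create=False)
    try:
        return len(box)
    finally:
        box.close()


def build_sa_learn_command(mbox_path, kind="spam"):
    """Build the sa-learn argument list for an mbox of spam or ham."""
    if kind not in ("spam", "ham"):
        raise ValueError("kind must be 'spam' or 'ham'")
    # --mbox tells sa-learn the file holds many messages, not just one.
    return ["sa-learn", "--" + kind, "--mbox", str(mbox_path)]


def train(mbox_path, kind="spam", dry_run=True):
    """Count the messages, then (unless dry_run) feed them to sa-learn.

    Returns (message_count, command) so the caller can log both.
    """
    count = count_messages(mbox_path)
    command = build_sa_learn_command(mbox_path, kind)
    if count > 0 and not dry_run:
        # Assumption: sa-learn is installed and on PATH on this host.
        subprocess.run(command, check=True)
    return count, command


if __name__ == "__main__":
    import sys

    n, cmd = train(sys.argv[1], kind="spam", dry_run=True)
    print(n, "messages; would run:", " ".join(cmd))
```

Counting the messages first is just a sanity check that the file copied down from the shared folder really is an mbox with something in it before Bayes is touched. Pass `dry_run=False` to actually run the learning step, and use `kind="ham"` for a Ham mbox.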