I am a user, tester, and occasional developer of SpamAssassin. There's a lot of semi-accurate information about SA floating around the Net.

First of all, SA does not use a specific or "arbitrary" method of identifying spam. Instead, it is an open platform that has a number of different techniques plugged into it (and is extensible if you want to write/add your own). SA techniques include:

* Internal pattern-matching on header & body parts
* Second-order (syntactic) analysis of patterns
* Analysis of embedded code (JavaScript, HTML, etc.)
* Automatic (feedback-based) whitelist and blacklist processing
* Use of external blocking lists like MAPS RBL, DUL, Osirusoft, ORDB.org, SpamCop, RFCI, et al.
* Use of Vipul's Razor (known spam database)
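To give a feel for what "plugged in" means, here is a rough Python sketch of a pluggable test registry. It is only an illustration: SA itself is written in Perl, and the registry, the rule names (SUBJ_ALL_CAPS, BODY_HAS_SCRIPT), and the patterns below are made up for this example, not real SA rules or SA's actual plugin interface.

```python
import re
from typing import Callable, Dict, List

# Hypothetical registry mapping a rule name to a test function.
# Each test receives the message headers and body and returns True on a hit.
Test = Callable[[Dict[str, str], str], bool]
TESTS: Dict[str, Test] = {}


def register(name: str):
    """Decorator that plugs a new test into the registry under `name`."""
    def wrap(fn: Test) -> Test:
        TESTS[name] = fn
        return fn
    return wrap


@register("SUBJ_ALL_CAPS")  # illustrative header test
def subj_all_caps(headers: Dict[str, str], body: str) -> bool:
    letters = [c for c in headers.get("Subject", "") if c.isalpha()]
    return len(letters) >= 5 and all(c.isupper() for c in letters)


@register("BODY_HAS_SCRIPT")  # illustrative embedded-code test
def body_has_script(headers: Dict[str, str], body: str) -> bool:
    return re.search(r"<script\b", body, re.IGNORECASE) is not None


def run_tests(headers: Dict[str, str], body: str) -> List[str]:
    """Return the sorted names of all registered tests that hit."""
    return sorted(name for name, test in TESTS.items() if test(headers, body))
```

Adding a new technique in this sketch is just a matter of writing one more decorated function; nothing else needs to change.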

Future plans include hooks for Bayesian Filtering.

The scores are not assigned by hand. The rulesets are run repeatedly against corpora of spam, nonspam, and mixed messages, and a genetic algorithm uses the results to determine the score for each rule. The scores are absolutely not "arbitrary" in the sense of a person deciding that a particular word or phrase is or is not "spam".
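For the curious, the general idea of fitting scores with a genetic algorithm can be sketched in a few lines of Python. This is a toy, not SA's actual optimizer: the corpus, the rule names A through D, the threshold of 5.0, and the population settings are all assumptions chosen for illustration.

```python
import random

THRESHOLD = 5.0  # assumed spam threshold for this toy example
RULES = ["A", "B", "C", "D"]  # hypothetical rule names

# Toy labeled corpus: (set of rules that hit, is_spam)
CORPUS = [
    ({"A", "B"}, True),
    ({"A"}, True),
    ({"B", "C"}, True),
    ({"A", "D"}, True),
    ({"C"}, False),
    ({"D"}, False),
    (set(), False),
]


def errors(scores, corpus=CORPUS):
    """Count messages misclassified by summing the scores of the rules hit."""
    wrong = 0
    for hits, is_spam in corpus:
        total = sum(scores[r] for r in hits)
        if (total >= THRESHOLD) != is_spam:
            wrong += 1
    return wrong


def evolve(generations=200, pop_size=30, seed=0):
    """Evolve a score table: keep the better half, breed and mutate the rest."""
    rng = random.Random(seed)
    pop = [{r: rng.uniform(-5, 5) for r in RULES} for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=errors)
        survivors = pop[: pop_size // 2]  # selection (survivors are kept unchanged)
        children = []
        while len(survivors) + len(children) < pop_size:
            mom, dad = rng.sample(survivors, 2)
            child = {r: rng.choice((mom[r], dad[r])) for r in RULES}  # crossover
            gene = rng.choice(RULES)
            child[gene] += rng.gauss(0, 1.0)  # mutation
            children.append(child)
        pop = survivors + children
    return min(pop, key=errors)
```

Calling `evolve()` returns the score table with the fewest misclassifications found. The point is that the numbers come out of the corpus, not out of anyone's head.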

The reason that SA works so well (and I believe that it's the best at what it does) is that there is no one "best" way to identify spam. There are multiple techniques with varying degrees of success. If you combine them all and let a self-correcting feedback technique determine the score (the likelihood of a message being spam), you get a very high degree of success. On top of that, any user can override individual rules and scores to meet his/her own needs.
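As a rough Python illustration of combining tests and applying per-user overrides: the rule names and default scores here are invented, and the threshold of 5.0 is an assumption for the example, so treat this as a sketch of the idea and not as SA's code.

```python
# Hypothetical default scores; real SA scores are fitted against corpora.
DEFAULT_SCORES = {
    "SUBJ_ALL_CAPS": 1.5,
    "BODY_HAS_SCRIPT": 2.0,
    "LISTED_IN_BLOCKLIST": 3.0,
}


def classify(hits, user_scores=None, threshold=5.0):
    """Sum the scores of the rules that hit and compare against the threshold.

    `user_scores` lets an individual user override any default score.
    Returns (total_score, is_spam).
    """
    scores = {**DEFAULT_SCORES, **(user_scores or {})}
    total = sum(scores.get(name, 0.0) for name in hits)
    return total, total >= threshold
```

With these made-up numbers, no single test is enough to mark a message as spam, but all three together are. A user who distrusts the blocklist test can set its score to 0.0 and the same message drops back under the threshold.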

--
Michael C. Berch
[EMAIL PROTECTED]


