Hi Alexander,

Does your implementation respect the additional constraints required by SpamAssassin? The constraints are as follows:

1. Only "nice" rules may have scores less than 0.
2. No rule may have a score above 5.

Constraint 1 is required because it must be impossible for a spammer to add content to a message that will reduce its score. This is a hard-learned lesson: in the past, we had some rules that detected headers added by common e-mail clients, and spammers added those headers deliberately to cause SpamAssassin to misclassify their messages.

Constraint 2 is required by SpamAssassin's (unofficial?) development policy to reduce the risk of false positives. If no single rule can cause a message to be marked as spam, then at least two rules must fire on a legitimate message for it to be incorrectly marked. This also slows the degradation of the true positive rate over time, as developers are forced to look for multiple features that identify spam instead of relying on one brittle feature.

Is it possible for you to modify the SVM learning algorithm to satisfy these constraints?
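For what it's worth, here is a minimal sketch (Python, purely illustrative; the 0/1 rule-hit matrix, the "nice" mask, and all names are my assumptions, not anything from your scripts) of one way to do it: run sub-gradient descent on the usual SVM hinge loss, and project the scores back into their allowed box after every step:

    import numpy as np

    def fit_constrained_scores(X, y, nice, lam=0.01, lr=0.1, epochs=200):
        """Fit rule scores under SpamAssassin's constraints.

        X    -- (n_msgs, n_rules) 0/1 matrix of which rules fired
        y    -- +1 for spam, -1 for ham
        nice -- boolean mask of rules that are allowed to score below 0
        """
        n, d = X.shape
        w = np.zeros(d)                     # one score per rule
        b = 0.0
        lo = np.where(nice, -np.inf, 0.0)   # constraint 1: only nice rules < 0
        hi = np.full(d, 5.0)                # constraint 2: no score above 5
        for _ in range(epochs):
            margin = y * (X @ w + b)
            viol = margin < 1               # examples inside the margin
            grad_w = lam * w - (y[viol, None] * X[viol]).sum(axis=0) / n
            grad_b = -y[viol].sum() / n
            w -= lr * grad_w
            b -= lr * grad_b
            w = np.clip(w, lo, hi)          # project back onto the feasible box
        return w, b

Projection onto a box is cheap and keeps every intermediate solution feasible, which seems like the smallest change one could make to a generic linear-SVM trainer; solving the constrained QP directly would be the more principled route.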
Lastly, something that most people do not take into consideration is that accuracy numbers measured in the lab do not at all resemble those observed in the real world. SpamAssassin's real-world accuracy depends more on the efficacy of the rules in use than on the scores assigned to them: almost every false negative that you will see in the real world has a score close to 0, because the spammer goes to great efforts to avoid triggering any rules.

In the case of textual rules, the problem is that spammers test their messages against SpamAssassin before they are sent out. As an example, Send Safe, a popular spam tool, has an embedded copy of SpamAssassin that the spammer can use to test a message before it is ever sent. In the case of network rules (RBLs and URIBLs), spammers choose attack patterns that attempt to outpace the update frequency of the blacklists. There are two notable spam gangs that rapidly rotate the links to their landing pages. Two weeks ago, one of those gangs was abusing tripod.com's free hosting service and was using each link for less than 5 minutes!

If you are interested in improving the state of the art of machine learning and spam filtering, the area that I think needs the most work is how to evaluate a model. Traditional methods such as the bootstrap or cross-validation are somewhat meaningless because they do not take into account the fact that spam evolves along a timeline and that one generally receives multiple copies of the same spam. A better evaluation method would model how responsive a learning algorithm is to spam as it changes over time. In my opinion, because of the high degree of duplication in your average corpus, current evaluation methods mostly test how well a classifier can detect duplicate (with some fuzz) entries. A rough sketch of a time-ordered alternative follows below.
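To make that concrete, here is a minimal sketch (same caveats as above: the function names and the fit/predict interface are assumptions for illustration) of a rolling, time-ordered evaluation that only ever trains on mail received before the window being tested:

    import numpy as np

    def rolling_evaluation(X, y, timestamps, fit, predict, n_windows=10):
        """Train on all mail seen so far, test on the next slice of time."""
        order = np.argsort(timestamps)          # put the corpus on its timeline
        X, y = X[order], y[order]
        cuts = np.linspace(0, len(y), n_windows + 1, dtype=int)
        accuracies = []
        for i in range(1, n_windows):
            train = slice(0, cuts[i])           # the past
            test = slice(cuts[i], cuts[i + 1])  # the near future
            model = fit(X[train], y[train])
            accuracies.append(float(np.mean(predict(model, X[test]) == y[test])))
        return accuracies                       # how accuracy holds up over time

Deduplicating near-identical messages across the train/test boundary would be the natural next refinement, given the duplication problem mentioned above.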
Cheers,
Henry

Justin Mason wrote:
>
> forwarding on behalf of Alexander...
>
> ------- Forwarded Message
>
> Date: Wed, 09 Nov 2005 13:08:47 +0100
> From: "Alexander K. Seewald" <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Subject: SA-Train
>
> Hi Justin,
>
> I've implemented a training procedure for SpamAssassin which learns
> the rule scores as well as the Bayes model. It can do a
> cross-validation, and uses a linear-kernel SVM for score learning,
> which should perform better than the perceptron (a perceptron is
> essentially a randomized version of a linear SVM that does not
> guarantee the maximum-margin hyperplane: it finds just one separating
> hyperplane if the data is linearly separable, and nothing at all if
> it is not).
>
> Papers describing the work, plus the scripts, are available at
> http://alex.seewald.at/spam. Please tell me if you find these useful,
> and possibly set a link from spamassassin.org where appropriate.
> I am also willing to contribute the code to SpamAssassin -
> everything is written in Perl, so it should be easy to integrate.
>
> On a less positive note, I have found - based on about one year of
> experiments with similar systems, during which I built up an SA-based
> filtering system at ÖFAI - that SA does not offer better performance
> than pure Bayes systems such as SpamBayes. It is still competitive,
> and the resulting models (Bayes + ruleset) are smaller and therefore
> more efficient. These experiments have been undertaken on a local
> corpus of around 100,000 mails from eight different users. We have a
> spam/ham ratio of 20:1 (i.e. 95% of incoming mail is spam).
>
> Best,
> Alex
