Hi Alexander,

Does your implementation respect the additional constraints required by
SpamAssassin?  The constraints are as follows:

1.  Only "nice" rules may have scores less than 0.
2.  No rule may have a score above 5.

Constraint 1 is required because it must be impossible for a spammer to
add content to a message that will reduce its score.  This is a
hard-learned lesson.  In the past, we had some rules that detected
headers added by common e-mail clients.  Spammers added these headers
deliberately to cause SpamAssassin to misclassify their messages.

Constraint 2 is required by SpamAssassin's (unofficial?) development
policy to reduce the risk of false positives.  The default spam
threshold is 5.0, so if no single rule can push a message past the
threshold on its own, at least two rules must fire before a legitimate
message can be incorrectly marked.  This also slows the degradation of
the true positive rate over time, because developers are forced to look
for multiple features that identify spam instead of relying on one
brittle feature.
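
As a concrete illustration, with invented rule names and scores
(required_score and score are the actual local.cf directives):

    # local.cf -- illustrative scores only
    required_score 5.0
    score HYPOTHETICAL_RULE_A 3.0   # alone: 3.0 < 5.0, stays ham
    score HYPOTHETICAL_RULE_B 2.5   # alone: 2.5 < 5.0, stays ham
                                    # together: 5.5 >= 5.0, marked spam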

Is it possible for you to modify the SVM learning algorithm to satisfy
these constraints?
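
One way this might be done (a sketch only; every name in it, such as
train_constrained and %is_nice, is hypothetical rather than taken from
your scripts) is projected sub-gradient descent on the hinge loss:
after each weight update, clamp each score back into its feasible
range.

    #!/usr/bin/perl
    # Sketch: box-constrained linear SVM via projected sub-gradient
    # descent.  Hinge loss with an L2 regularizer; after each update
    # the weights are projected back onto the feasible region.
    use strict;
    use warnings;

    # $examples: ref to a list of [\@features, $label], $label in
    # {-1, +1} with +1 = spam; $is_nice: ref to a hash marking the
    # indices of "nice" rules (the only ones allowed to go negative).
    sub train_constrained {
        my ($examples, $n_feats, $is_nice, $lambda, $eta, $epochs) = @_;
        my @w = (0) x $n_feats;

        for (1 .. $epochs) {
            for my $ex (@$examples) {
                my ($x, $y) = @$ex;
                my $margin = 0;
                $margin += $w[$_] * $x->[$_] for 0 .. $n_feats - 1;
                my $in_margin = ($y * $margin < 1);

                for my $i (0 .. $n_feats - 1) {
                    my $g = $lambda * $w[$i];           # regularizer
                    $g -= $y * $x->[$i] if $in_margin;  # hinge term
                    $w[$i] -= $eta * $g;

                    # Projection: enforce the two constraints.
                    $w[$i] = 0 if $w[$i] < 0 && !$is_nice->{$i};
                    $w[$i] = 5 if $w[$i] > 5;
                }
            }
        }
        return \@w;
    }

The projection keeps every iterate feasible, so the final scores
satisfy both constraints by construction; the price is that the usual
maximum-margin guarantee no longer holds exactly.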

Lastly, something that most people do not take into consideration is
that accuracy numbers measured in the lab bear little resemblance to
those observed in the real world.  SpamAssassin's real-world accuracy
depends more on the efficacy of the rules in use than on the scores
assigned to them:  almost every false negative you will see in the
real world has a score close to 0, because the spammer has gone to
great lengths to avoid triggering any rules.

In the case of textual rules, the problem is that spammers test their
messages against SpamAssassin before sending them.  For example, Send
Safe, a popular spam tool, has an embedded copy of SpamAssassin that
the spammer can use to test a message before it is ever sent out.

In the case of network rules (RBLs and URIBLs), spammers choose attack
patterns that attempt to outpace the update frequency of the
blacklists.  There are two notable spam gangs that rapidly rotate the
links to their landing pages.  Two weeks ago, one of those gangs was
abusing tripod.com's free hosting service and was using each link for
less than 5 minutes!

If you are interested in improving the state of the art in machine
learning and spam filtering, the area that I think needs the most work
is model evaluation.  Traditional methods such as the bootstrap or
cross-validation are somewhat meaningless here because they do not
take into account the fact that spam evolves along a timeline and that
one generally receives multiple copies of the same spam.  A better
evaluation method would measure how responsive a learning algorithm is
to spam as it changes over time.  In my opinion, because of the high
degree of duplication in your average corpus, current evaluation
methods mostly test how well a classifier can detect near-duplicate
entries.
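
To make that concrete, here is a minimal sketch of such a time-ordered
evaluation (load_corpus, retrain, and classify are assumed hooks into
whatever filter is under test, not real APIs):

    #!/usr/bin/perl
    # Sketch: rolling, time-ordered evaluation instead of
    # cross-validation.  load_corpus(), retrain() and classify()
    # are assumed to be supplied by the system under test.
    use strict;
    use warnings;

    # Each message is [$epoch_time, $is_spam, $text]; sort by arrival
    # time so training data always precedes test data.
    my @corpus = sort { $a->[0] <=> $b->[0] } load_corpus();

    my $chunk = 1000;                 # messages per evaluation step
    my ($errors, $tested) = (0, 0);

    for (my $i = $chunk; $i + $chunk <= @corpus; $i += $chunk) {
        # Train on everything seen so far, test on the next chunk.
        my $model = retrain([ @corpus[0 .. $i - 1] ]);
        for my $msg (@corpus[$i .. $i + $chunk - 1]) {
            $tested++;
            $errors++ if classify($model, $msg->[2]) != $msg->[1];
        }
    }
    printf "time-ordered error rate: %.4f\n", $errors / $tested
        if $tested;

Because training data always precedes test data here, the measured
error reflects responsiveness to new spam rather than the ability to
recognize near-duplicates of messages already seen.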

Cheers,
Henry

Justin Mason wrote:
> 
> forwarding on behalf of Alexander...
> 
> ------- Forwarded Message
> 
> Date:    Wed, 09 Nov 2005 13:08:47 +0100
> From:    "Alexander K. Seewald" <[EMAIL PROTECTED]>
> To:      [EMAIL PROTECTED]
> Subject: SA-Train
> 
> Hi Justin,
> 
> I've implemented a training procedure for SpamAssassin which learns
> the rule scores as well as the Bayes model. It can do
> cross-validation, and uses a linear-kernel SVM for score learning,
> which should perform better than the perceptron (a perceptron is
> essentially a randomized version of a linear SVM: it does not
> guarantee the maximum-margin hyperplane, only some separating
> hyperplane if the data are linearly separable, and nothing at all
> if they are not).
> 
> Papers describing the work, plus the scripts are available at
> http://alex.seewald.at/spam. Please tell me if you find these useful,
> and possibly set a link from spamassassin.org where appropriate.
> I am also willing to contribute the code for SpamAssassin -
> everything is written in Perl, so it should be easy to integrate.
> 
> On a less positive note, I have found - based on about one year of
> experiments with similar systems, during which I built up a SA-based
> filtering system at ÖFAI - that SA does not offer better performance
> than pure bayes systems such as SpamBayes. It is still competitive,
> and the resulting models (bayes+ruleset) are smaller and therefore
> more efficient. These experiments have been undertaken on a local
> corpus of around 100,000 messages from eight different users. We
> have a spam/ham ratio of 20:1 (i.e. about 95% of incoming mail is
> spam).
> 
> Best,
>   Alex
