Re: 2 + 2 != 4 - Spamassassin needs a new paradigm

Marc Perkel Wed, 04 Mar 2009 17:32:41 -0800


decoder wrote:

decoder wrote:
Justin Mason wrote:
So you're volunteering to code it up, then? ;)
I was planning to do at least some brainstorming and experiments on which learning methods would be suitable and how well they perform, whenever I have time again. Unless someone else has done that already?
OK, I did some short experiments: I built an SVM classifier from a large mail corpus (8226 mails: 5414 ham, 2812 spam) and ran a 5-fold cross-validation. The resulting classifier has an accuracy of over 99%, so it performs as well as the regular system. I then applied it to a set of 202 false negatives that I had collected, and the SVM recognized 69 of them as spam. As a second test, I pulled 2707 mails from one of my other inboxes and applied the classifier; the accuracy was again over 99% (and this set is ham only).
From my point of view, the results show that this approach has potential. It is highly accurate relative to the current system and additionally outperformed it on several false negatives.
This system has other advantages over the common system: it allows everybody to train the whole spam filter (not only Bayes) on the kind of spam that they receive, i.e. it is more adaptive than the common system.
Any opinions on this are very welcome. Maybe we should try to come up with a proof-of-concept plugin for SA?

Good work so far, but it sounds like you need to throw more data at it. Also, even though you indicate "over 99% accuracy", can you break that down better? 99.9% is 10 times as accurate as 99% (one error per thousand messages instead of one per hundred).
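
To make the factor of ten concrete, and to show the kind of breakdown I mean, here is a rough Python sketch. The function names are my own, and the error counts passed to rates() are made up for illustration; only the corpus sizes and the 69-of-202 figure come from the numbers quoted above.

```python
from fractions import Fraction

def error_rate(accuracy_percent):
    """Fraction of messages misclassified at the given accuracy (in percent)."""
    return 1 - Fraction(accuracy_percent) / 100

def rates(false_pos, false_neg, n_ham, n_spam):
    """Split one overall accuracy figure into per-class error rates."""
    return {
        "false_positive_rate": false_pos / n_ham,   # ham flagged as spam
        "false_negative_rate": false_neg / n_spam,  # spam let through
        "accuracy": 1 - (false_pos + false_neg) / (n_ham + n_spam),
    }

# 99% vs 99.9%: the error rates differ by a factor of ten.
ratio = error_rate("99") / error_rate("99.9")

# From the thread: 69 of the 202 previously missed spams were caught.
recovered = 69 / 202

# HYPOTHETICAL error counts on the 5414 ham / 2812 spam corpus,
# chosen only so that the overall accuracy lands "over 99%".
example = rates(false_pos=10, false_neg=50, n_ham=5414, n_spam=2812)

print(ratio, round(recovered, 3), example)
```

Two classifiers that are both "over 99%" can have very different false positive rates, which is why the per-class numbers matter more than the single figure.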

Also, when it identifies messages, do the spam scores go up and the ham scores go down? If so, that makes it more solid and starves the middle. I'm encouraged that the initial results are good.

My feeling is that if this works, it will work better if we have more informational tokens. For example: is the From address a freemail address? Does the message contain a freemail address? By themselves these wouldn't score points. But spam coming from yahoo, hotmail, gmail, etc. is a different kind of spam than spam coming from spambots. Maybe country tokens from the Received lines would be useful. Maybe names of banks in the message would be useful. For example, Bank of America + Nigeria = spam.
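
As a rough illustration of what I mean by informational tokens, here is a minimal Python sketch using only the standard library. Everything in it is an assumption for the sake of example: the token names, the tiny domain and bank lists, and the sample message are all made up, and the country is detected as a plain word in the body instead of being looked up from the Received lines. None of this is existing SpamAssassin code.

```python
import email
import re

# Illustrative lists only; a real filter would need far larger ones.
FREEMAIL_DOMAINS = {"yahoo.com", "hotmail.com", "gmail.com"}
BANK_NAMES = ("bank of america",)
COUNTRY_WORDS = ("nigeria",)

# Captures the domain part of anything that looks like an email address.
ADDRESS_RE = re.compile(r"[\w.+-]+@([\w-]+(?:\.[\w-]+)+)")

def informational_tokens(raw_message):
    """Return a set of score-neutral tokens describing the message."""
    msg = email.message_from_string(raw_message)
    tokens = set()

    from_domains = {d.lower() for d in ADDRESS_RE.findall(msg.get("From", ""))}
    if from_domains & FREEMAIL_DOMAINS:
        tokens.add("FROM_FREEMAIL")

    body = msg.get_payload()
    if not isinstance(body, str):
        body = ""  # multipart messages are ignored in this sketch
    body = body.lower()

    if set(ADDRESS_RE.findall(body)) & FREEMAIL_DOMAINS:
        tokens.add("BODY_FREEMAIL")
    for name in BANK_NAMES + COUNTRY_WORDS:
        if name in body:
            tokens.add("MENTIONS_" + name.upper().replace(" ", "_"))
    return tokens

# Made-up sample message.
SAMPLE = (
    'From: "Mr. Smith" <smith@yahoo.com>\n'
    "Subject: Urgent\n"
    "\n"
    "Your Bank of America account is on hold.\n"
    "Contact our agent in Nigeria at agent@hotmail.com today\n"
)

tokens = informational_tokens(SAMPLE)
print(sorted(tokens))
```

The tokens carry no score of their own; the idea is that the learner decides which combinations matter.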


I'm really glad you're junking on this. I think it will be a breakthrough.


Some of these tokens
