decoder wrote:
decoder wrote:
Justin Mason wrote:

So you're volunteering to code it up, then? ;)

I was planning to do at least some brainstorming and experiments on which learning methods seem suitable and how well they perform, whenever I have time again. Unless someone else has done that already?


Ok, I did some short experiments: I built an SVM classifier from a large mail corpus (8226 mails: 5414 ham, 2812 spam) and ran a 5-fold cross-validation. The resulting classifier has an accuracy of over 99%, so it performs as well as the regular system. I then applied it to a set of 202 false negatives that I had collected, and 69 of these (about 34%) are recognized as spam by the SVM. As a second test, I pulled 2707 mails from one of my other inboxes and applied the classifier; the accuracy was again over 99% (and this inbox contains only ham).
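For readers who want to reproduce this kind of evaluation, here is a minimal sketch of the 5-fold protocol only. The post does not say which SVM implementation or which features were used, so everything below is an assumption: the learner is passed in as a callable, and `train_centroid` is a deliberately simple stand-in (not an SVM) that exists only so the sketch runs on its own. The names `k_fold_indices`, `cross_validate` and `train_centroid` are hypothetical.

```python
from typing import Callable, List, Sequence

Vector = Sequence[float]
Model = Callable[[Vector], int]  # maps a feature vector to 1 (spam) or 0 (ham)
Trainer = Callable[[List[Vector], List[int]], Model]


def k_fold_indices(n: int, k: int) -> List[List[int]]:
    """Split indices 0..n-1 into k folds by striding (fold i gets i, i+k, ...).

    Striding assumes the corpus is already shuffled; shuffle first otherwise.
    """
    return [list(range(i, n, k)) for i in range(k)]


def cross_validate(X: List[Vector], y: List[int], train: Trainer, k: int = 5) -> float:
    """Overall accuracy where every mail is judged by a model that never saw it."""
    correct = 0
    for test_idx in k_fold_indices(len(X), k):
        held_out = set(test_idx)
        train_X = [X[i] for i in range(len(X)) if i not in held_out]
        train_y = [y[i] for i in range(len(X)) if i not in held_out]
        model = train(train_X, train_y)
        correct += sum(1 for i in test_idx if model(X[i]) == y[i])
    return correct / len(X)


def train_centroid(X: List[Vector], y: List[int]) -> Model:
    """Stand-in learner (NOT an SVM): predict the class whose mean is nearer."""
    def mean(rows: List[Vector]) -> List[float]:
        return [sum(col) / len(rows) for col in zip(*rows)]

    def dist2(a: Vector, b: Vector) -> float:
        return sum((p - q) ** 2 for p, q in zip(a, b))

    spam = mean([x for x, label in zip(X, y) if label == 1])
    ham = mean([x for x, label in zip(X, y) if label == 0])
    return lambda x: 1 if dist2(x, spam) < dist2(x, ham) else 0
```

To evaluate a real SVM, one would replace `train_centroid` with a function that fits the SVM on the training folds and returns its prediction function; the cross-validation loop itself stays the same.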

From my point of view, the results show that this approach has potential. It is as accurate as the current system, and it additionally caught 69 of the 202 false negatives that the current system missed.


This system has other advantages over the common system: it allows everybody to train the whole spam filter (not only Bayes) on the kind of spam that they receive, i.e. it is more adaptive than the common system.


Any opinions on this are greatly welcome. Maybe we should try to come up with a proof of concept plugin for SA?


Good work so far, but it sounds like you need to throw more data at it. Also, even though you indicate "over 99% accuracy", can you break that down further? 99.9% has one tenth of the error rate of 99%.
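One way to break a single accuracy figure down is to report the false positive and false negative rates separately, since two classifiers with the same accuracy can fail in very different ways. The sketch below is only an illustration: the function name `breakdown` is hypothetical and the counts are invented round numbers, not results from the experiment above.

```python
def breakdown(tp: int, fp: int, tn: int, fn: int) -> dict:
    """Split overall accuracy into its two kinds of error.

    tp: spam flagged as spam      fn: spam let through as ham
    tn: ham passed as ham         fp: ham wrongly flagged as spam
    """
    total = tp + fp + tn + fn
    return {
        "accuracy": (tp + tn) / total,
        "error_rate": (fp + fn) / total,
        "false_positive_rate": fp / (fp + tn),  # share of ham that was lost
        "false_negative_rate": fn / (fn + tp),  # share of spam that got through
    }


# Invented counts (1000 ham, 1000 spam each): identical accuracy, opposite failure modes.
misses_spam = breakdown(tp=990, fp=0, tn=1000, fn=10)
loses_ham = breakdown(tp=1000, fp=10, tn=990, fn=0)
```

Both invented classifiers reach 99.5% accuracy, but only the second one ever discards legitimate mail, which is usually the costlier mistake for a spam filter.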

Also, when it classifies messages, do the spam scores go up and the ham scores go down? If so, that makes it more solid and starves the middle. I'm encouraged that the initial results are good.
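A simple way to check whether the middle is being starved is to measure what share of messages lands in an ambiguous band around the decision boundary. This is a rough sketch under assumptions: `middle_share` is a hypothetical helper, and the band limits are whatever the tester considers "uncertain" for the score scale in use.

```python
from typing import Iterable


def middle_share(scores: Iterable[float], low: float, high: float) -> float:
    """Fraction of scores that fall strictly between low and high."""
    values = list(scores)
    return sum(1 for s in values if low < s < high) / len(values)
```

Computing this for the same set of mails under the current scoring and under the new classifier would show whether scores are being pushed away from the boundary: a smaller share in the band means fewer borderline decisions.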

My feeling is that if this works, it will work better if we have more informational tokens. For example: is the From address a freemail address? Does the message contain a freemail address? By themselves these wouldn't score points. But spam coming from Yahoo, Hotmail, Gmail, etc. is a different kind of spam than spam coming from spambots. Maybe country tokens from the Received lines would be useful. Maybe names of banks in the message would be useful. For example, Bank of America + Nigeria = spam.
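To make the idea concrete, here is a small sketch of extracting such informational tokens from a raw message with Python's standard `email` module. The domain and keyword lists, the token names and the function `informational_tokens` are all illustrative assumptions, not existing SpamAssassin rules. Country tokens from Received lines are left out because they would need an IP-to-country database; as a simplification, a country name is only detected when it is mentioned in the body.

```python
import re
from email import message_from_string
from email.utils import parseaddr

# Illustrative lists only; a real deployment would need far larger ones.
FREEMAIL_DOMAINS = {"yahoo.com", "hotmail.com", "gmail.com"}
KEYWORDS = {
    "bank of america": "MENTIONS_BANK_OF_AMERICA",
    "nigeria": "MENTIONS_NIGERIA",
}

# Captures the domain part of anything that looks like an email address.
ADDRESS_RE = re.compile(r"[\w.+-]+@([\w-]+(?:\.[\w-]+)+)")


def informational_tokens(raw: str) -> set:
    """Return tokens that carry no score themselves but can feed a learner."""
    msg = message_from_string(raw)
    tokens = set()

    # Token 1: the From address belongs to a freemail provider.
    _, from_addr = parseaddr(msg.get("From", ""))
    if from_addr.rpartition("@")[2].lower() in FREEMAIL_DOMAINS:
        tokens.add("FROM_FREEMAIL")

    # Only plain single-part bodies are handled in this sketch.
    payload = msg.get_payload()
    body = payload.lower() if isinstance(payload, str) else ""

    # Token 2: the body mentions a freemail address.
    if any(domain in FREEMAIL_DOMAINS for domain in ADDRESS_RE.findall(body)):
        tokens.add("BODY_FREEMAIL_ADDR")

    # Token 3: keyword mentions such as bank or country names.
    for phrase, token in KEYWORDS.items():
        if phrase in body:
            tokens.add(token)
    return tokens
```

None of these tokens decides anything alone; the point is that a classifier trained on them can learn combinations, such as a bank name together with a freemail reply address.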

I'm really glad you're junking on this. I think it will be a breakthrough.


Some of these tokens
