Marc Perkel wrote:
> Good work so far but sounds like you need to throw more data at it.
> Also even though you indicate "over 99% accuracy" can you break that
> down better? 99.9% is 10 times as accurate as 99%.

What do you mean by more data? Of course, some additional data might help. One should consider that _most_ of the SA rules are designed to score on spam. For an SVM, you can also use more general data like "mail has property XYZ", even though you don't know whether this property indicates ham or spam, or whether it is suitable for classification at all. This is, of course, an advantage.
With respect to the numbers: I repeated the experiments today with slight modifications to provide a more solid setup.
The input is again the dataset I used yesterday. In each run, I permute the dataset, then split it (2/3 training vs. 1/3 testing, not stratified). The training set is used to train an SVM, which is then applied to the 1/3 testing set and, additionally, to my false-negatives set.
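A minimal sketch of the permute-and-split step, assuming the dataset is a plain list of (feature-vector, label) pairs; the actual training and prediction are left to the SVM package (libsvm, judging by the attached log):

```python
import random

def permute_and_split(dataset, train_frac=2 / 3, seed=None):
    """Shuffle the labelled dataset, then cut it into a training part
    and a testing part (not stratified, matching the setup above)."""
    rng = random.Random(seed)
    data = list(dataset)
    rng.shuffle(data)
    cut = int(len(data) * train_frac)
    return data[:cut], data[cut:]

# Toy stand-in for the real SA-rule feature vectors: (features, label) pairs.
dataset = [([i % 2, i % 3], i % 2) for i in range(9)]
train, test = permute_and_split(dataset, seed=1)
print(len(train), len(test))  # 6 3
```

The training part would then be written out in the SVM package's input format, and the resulting model applied to the held-out third and to the false-negatives set.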
The SVM outputs an accuracy value, but I wrote a tool that calculates precision and recall by hand, because these values are more interesting:

  1 - Precision = fraction of mail flagged as spam that is actually ham (the kind of false-positive rate that matters in SA)
  1 - Recall    = False Negative Rate (or, consider recall as the detection rate)

I ran this 5 times; the output is attached as a text file, where you will see the exact numbers :)
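For reference, a small sketch of what such a hand-rolled precision/recall tool computes, assuming spam is the positive class (the function name is mine, not the actual tool's):

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive (spam) class, computed
    directly from the confusion counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 1 = spam, 0 = ham: tp=3, fp=1, fn=1
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
print(precision_recall(y_true, y_pred))  # (0.75, 0.75)
```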
Taking the mean over the 5 runs:

  False positive rate: 0.37908199952036 %
  Detection rate: 99.18104855859372 %
  Detection rate on false negatives (my SA has 0% on this set): 31.7821782178218 %
One should consider that my dataset might not be 100% accurate. It is combined from my inbox and my spam folder. My spam folder is unlikely to contain ham, but it is certainly possible that I forgot to delete one or another false negative from my inbox. I'm looking forward to getting Justin's set :)
> Also - when it identifies messages do the numbers on the spam scores go
> up and ham goes down? If so that makes it more solid and starves the
> middle. I'm encouraged that the initial results are good.
What do you mean by that question? I don't really understand it :)
> My feeling is that if this works that it will work better if we have
> more informational tokens. For example - is the from address a freemail
> address. Does the message contain a freemail address. By themselves
> these wouldn't score points. But spam coming from yahoo, hotmail,
> gmail, etc. is a different kind of spam than spam coming from spambots.
> Maybe country tokens from the received lines would be useful. Maybe
> names of banks in the message would be useful. For example Bank of
> America + Nigeria = spam.

Yes, this is exactly what I meant above. These tokens are of limited use for SA currently, but an SVM might be able to use them :)
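As a sketch of how such informational tokens could be fed to an SVM: binary token hits map naturally onto sparse libsvm-format input lines. The token names below are made up for illustration, not actual SA rules:

```python
# Hypothetical informational tokens; the real features would be SA rule
# hits plus tokens like "from is a freemail address" or a country code.
TOKENS = ["FROM_FREEMAIL", "BODY_FREEMAIL", "COUNTRY_NG", "MENTIONS_BANK"]

def to_libsvm_line(label, hits):
    """Encode a message's binary token hits as one libsvm input line:
    '<label> <index>:1 ...' with 1-based, ascending feature indices."""
    feats = [f"{i + 1}:1" for i, tok in enumerate(TOKENS) if tok in hits]
    return " ".join([str(label)] + feats)

# 1 = spam, -1 = ham
print(to_libsvm_line(1, {"FROM_FREEMAIL", "COUNTRY_NG"}))  # 1 1:1 3:1
```

The SVM would then learn the weight of each token (and combinations like freemail + bank name) from the training set, instead of each token needing a hand-assigned score.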
Cheers, Chris
Reading dataset...
Permutating...
Splitting and outputting...
Training...
* optimization finished, #iter = 449
nu = 0.144606
obj = -529.640159, rho = -2.227729
nSV = 802, nBSV = 785
Total nSV = 802
Predicting test set...
Accuracy = 99.2706% (2722/2742) (classification)
Predicting false negative set...
Accuracy = 31.6832% (64/202) (classification)
Evaluating results...
Results on test set:
Precision: 99.8896856039713 %
Recall: 99.01585565883 %
Results on false negative set:
Precision: 100 %
Recall: 31.6831683168317 %
=================================================================
Reading dataset...
Permutating...
Splitting and outputting...
Training...
* optimization finished, #iter = 466
nu = 0.147031
obj = -539.132218, rho = -2.297470
nSV = 817, nBSV = 791
Total nSV = 817
Predicting test set...
Accuracy = 99.2706% (2722/2742) (classification)
Predicting false negative set...
Accuracy = 32.1782% (65/202) (classification)
Evaluating results...
Results on test set:
Precision: 99.6613995485327 %
Recall: 99.2134831460674 %
Results on false negative set:
Precision: 100 %
Recall: 32.1782178217822 %
=================================================================
Reading dataset...
Permutating...
Splitting and outputting...
Training...
* optimization finished, #iter = 454
nu = 0.146568
obj = -535.034660, rho = -2.187959
nSV = 814, nBSV = 793
Total nSV = 814
Predicting test set...
Accuracy = 99.2341% (2721/2742) (classification)
Predicting false negative set...
Accuracy = 31.6832% (64/202) (classification)
Evaluating results...
Results on test set:
Precision: 99.3834080717489 %
Recall: 99.4391475042064 %
Results on false negative set:
Precision: 100 %
Recall: 31.6831683168317 %
=================================================================
Reading dataset...
Permutating...
Splitting and outputting...
Training...
* optimization finished, #iter = 447
nu = 0.144391
obj = -530.359839, rho = -2.219816
nSV = 802, nBSV = 781
Total nSV = 802
Predicting test set...
Accuracy = 99.2341% (2721/2742) (classification)
Predicting false negative set...
Accuracy = 31.6832% (64/202) (classification)
Evaluating results...
Results on test set:
Precision: 99.5589856670342 %
Recall: 99.2853216052776 %
Results on false negative set:
Precision: 100 %
Recall: 31.6831683168317 %
=================================================================
Reading dataset...
Permutating...
Splitting and outputting...
Training...
* optimization finished, #iter = 441
nu = 0.143899
obj = -520.886081, rho = -2.283996
nSV = 795, nBSV = 785
Total nSV = 795
Predicting test set...
Accuracy = 99.0518% (2716/2742) (classification)
Predicting false negative set...
Accuracy = 31.6832% (64/202) (classification)
Evaluating results...
Results on test set:
Precision: 99.6111111111111 %
Recall: 98.9514348785872 %
Results on false negative set:
Precision: 100 %
Recall: 31.6831683168317 %