Marc Perkel wrote:
> Good work so far but sounds like you need to throw more data at it.
> Also even though you indicate "over 99% accuracy" can you break that
> down better? 99.9% is 10 times as accurate as 99%.

What do you mean by more data? Of course, some additional data might help. One should consider that _most_ of the SA rules are designed to score on spam. For an SVM, you can also use more general data like "mail has property XYZ", even though you don't know whether this property indicates ham or spam, or whether it is suitable for classification at all. This is, of course, an advantage.
With respect to the numbers: I repeated the experiments today with slight modifications to provide a more solid setup.
The input is again the dataset I used yesterday. In each run, I permute the dataset, then split it (2/3 training vs. 1/3 testing, not stratified). The training set is used to train an SVM, which is then applied to the 1/3 testing set and, additionally, to my false-negatives set.
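A minimal sketch of the permute-and-split step, assuming the dataset is a plain list of (feature-vector, label) pairs; the actual training and prediction are left to the SVM package (libsvm, judging by the attached log):

```python
import random

def permute_and_split(dataset, train_frac=2 / 3, seed=None):
    """Shuffle the labelled dataset, then cut it into a training part
    and a testing part (not stratified, matching the setup above)."""
    rng = random.Random(seed)
    data = list(dataset)
    rng.shuffle(data)
    cut = int(len(data) * train_frac)
    return data[:cut], data[cut:]

# Toy stand-in for the real SA-rule feature vectors: (features, label) pairs.
dataset = [([i % 2, i % 3], i % 2) for i in range(9)]
train, test = permute_and_split(dataset, seed=1)
print(len(train), len(test))  # 6 3
```

The training part would then be written out in the SVM package's input format, and the resulting model applied to the held-out third and to the false-negatives set.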
The SVM outputs an accuracy value, but I wrote a tool that calculates precision and recall by hand, because these values are more interesting:

  1 - Precision = fraction of mail flagged as spam that is actually ham (the kind of false-positive rate that matters in SA)
  1 - Recall    = False Negative Rate (or, consider recall as the detection rate)

I ran this 5 times; the output is attached as a text file, where you will see the exact numbers :)
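For reference, a small sketch of what such a hand-rolled precision/recall tool computes, assuming spam is the positive class (the function name is mine, not the actual tool's):

```python
def precision_recall(y_true, y_pred, positive=1):
    """Precision and recall for the positive (spam) class, computed
    directly from the confusion counts."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if p != positive and t == positive)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# 1 = spam, 0 = ham: tp=3, fp=1, fn=1
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
print(precision_recall(y_true, y_pred))  # (0.75, 0.75)
```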
Taking the mean over the 5 runs:

  False positive rate: 0.37908199952036 %
  Detection rate: 99.18104855859372 %
  Detection rate on false negatives (my SA has 0% on this set): 31.7821782178218 %
One should consider that my dataset might not be 100% accurate. It is combined from my inbox and my spam folder. My spam folder is unlikely to contain ham, but it is certainly possible that I forgot to delete one or another false negative from my inbox. I'm looking forward to getting Justin's set :)
> Also - when it identifies messages do the numbers on the spam scores go
> up and ham goes down? If so that makes it more solid and starves the
> middle. I'm encouraged that the initial results are good.
What do you mean by that question? I don't really understand it :)
> My feeling is that if this works that it will work better if we have
> more informational tokens. For example - is the from address a freemail
> address. Does the message contain a freemail address. By themselves
> these wouldn't score points. But spam coming from yahoo, hotmail,
> gmail, etc. is a different kind of spam than spam coming from spambots.
> Maybe country tokens from the received lines would be useful. Maybe
> names of banks in the message would be useful. For example Bank of
> America + Nigeria = spam.

Yes, this is exactly what I meant above. These tokens are of limited use for SA currently, but an SVM might be able to use them :)
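As a sketch of how such informational tokens could be fed to an SVM: binary token hits map naturally onto sparse libsvm-format input lines. The token names below are made up for illustration, not actual SA rules:

```python
# Hypothetical informational tokens; the real features would be SA rule
# hits plus tokens like "from is a freemail address" or a country code.
TOKENS = ["FROM_FREEMAIL", "BODY_FREEMAIL", "COUNTRY_NG", "MENTIONS_BANK"]

def to_libsvm_line(label, hits):
    """Encode a message's binary token hits as one libsvm input line:
    '<label> <index>:1 ...' with 1-based, ascending feature indices."""
    feats = [f"{i + 1}:1" for i, tok in enumerate(TOKENS) if tok in hits]
    return " ".join([str(label)] + feats)

# 1 = spam, -1 = ham
print(to_libsvm_line(1, {"FROM_FREEMAIL", "COUNTRY_NG"}))  # 1 1:1 3:1
```

The SVM would then learn the weight of each token (and combinations like freemail + bank name) from the training set, instead of each token needing a hand-assigned score.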
Cheers, Chris
Reading dataset...
Permutating...
Splitting and outputting...
Training...
* optimization finished, #iter = 449
nu = 0.144606
obj = -529.640159, rho = -2.227729
nSV = 802, nBSV = 785
Total nSV = 802
Predicting test set...
Accuracy = 99.2706% (2722/2742) (classification)
Predicting false negative set...
Accuracy = 31.6832% (64/202) (classification)
Evaluating results...
Results on test set:
Precision: 99.8896856039713 %
Recall: 99.01585565883 %
Results on false negative set:
Precision: 100 %
Recall: 31.6831683168317 %
=================================================================
Reading dataset...
Permutating...
Splitting and outputting...
Training...
* optimization finished, #iter = 466
nu = 0.147031
obj = -539.132218, rho = -2.297470
nSV = 817, nBSV = 791
Total nSV = 817
Predicting test set...
Accuracy = 99.2706% (2722/2742) (classification)
Predicting false negative set...
Accuracy = 32.1782% (65/202) (classification)
Evaluating results...
Results on test set:
Precision: 99.6613995485327 %
Recall: 99.2134831460674 %
Results on false negative set:
Precision: 100 %
Recall: 32.1782178217822 %
=================================================================
Reading dataset...
Permutating...
Splitting and outputting...
Training...
* optimization finished, #iter = 454
nu = 0.146568
obj = -535.034660, rho = -2.187959
nSV = 814, nBSV = 793
Total nSV = 814
Predicting test set...
Accuracy = 99.2341% (2721/2742) (classification)
Predicting false negative set...
Accuracy = 31.6832% (64/202) (classification)
Evaluating results...
Results on test set:
Precision: 99.3834080717489 %
Recall: 99.4391475042064 %
Results on false negative set:
Precision: 100 %
Recall: 31.6831683168317 %
=================================================================
Reading dataset...
Permutating...
Splitting and outputting...
Training...
* optimization finished, #iter = 447
nu = 0.144391
obj = -530.359839, rho = -2.219816
nSV = 802, nBSV = 781
Total nSV = 802
Predicting test set...
Accuracy = 99.2341% (2721/2742) (classification)
Predicting false negative set...
Accuracy = 31.6832% (64/202) (classification)
Evaluating results...
Results on test set:
Precision: 99.5589856670342 %
Recall: 99.2853216052776 %
Results on false negative set:
Precision: 100 %
Recall: 31.6831683168317 %
=================================================================
Reading dataset...
Permutating...
Splitting and outputting...
Training...
* optimization finished, #iter = 441
nu = 0.143899
obj = -520.886081, rho = -2.283996
nSV = 795, nBSV = 785
Total nSV = 795
Predicting test set...
Accuracy = 99.0518% (2716/2742) (classification)
Predicting false negative set...
Accuracy = 31.6832% (64/202) (classification)
Evaluating results...
Results on test set:
Precision: 99.6111111111111 %
Recall: 98.9514348785872 %
Results on false negative set:
Precision: 100 %
Recall: 31.6831683168317 %