On Sat, Nov 19, 2005 at 10:35:05PM +1300, Sidney Markowitz wrote:
> That's interesting. The microarray data for cancer cells I was looking
> at was just the opposite: We have on the order of ten thousands genes
> and on the order of only a hundred training samples. The data is
> _always_ linearly separable. In that case the main advantage of the SVM

This is of course because the number of examples is much smaller than
the number of attributes, and the attributes are highly correlated,
which effectively reduces their number. But for SA, you have about 1000
attributes (one for each rule), and if you use 60,000 mails it is clear
that linear separability cannot be guaranteed in all cases. This can
easily be checked: reapply the trained perceptron model to the training
data. If you get even a single training-set error, the data is not
linearly separable.
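A minimal sketch of this check, written in Python rather than in the
code SA itself uses. The function names, the toy rule-hit vectors, and
the epoch cap are assumptions made for illustration; none of them come
from SA or SA-Train. Note the asymmetry: zero training errors prove the
data is separable, whereas remaining errors only indicate
non-separability if the perceptron was run long enough to converge.

```python
def score(w, x):
    """Weighted sum of rule hits; the last weight is the bias term."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w[-1]


def train_perceptron(samples, labels, epochs=1000):
    """Plain perceptron on +1 (spam) / -1 (ham) labels, zero-initialized."""
    w = [0.0] * (len(samples[0]) + 1)
    for _ in range(epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            if y * score(w, x) <= 0:  # misclassified or on the boundary
                w = [wi + y * xi for wi, xi in zip(w, x)] + [w[-1] + y]
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: converged
            break
    return w


def training_errors(w, samples, labels):
    """Reapply the model to its own training data and count the errors."""
    return sum(1 for x, y in zip(samples, labels) if y * score(w, x) <= 0)


# Hypothetical rule-hit vectors (1 = rule fired), one tuple per mail.
separable_x = [(1, 1, 0), (1, 0, 0), (0, 0, 1), (0, 0, 0)]
separable_y = [1, 1, -1, -1]

# Two mails hit exactly the same rules but carry different labels,
# so no linear model can classify both correctly.
conflicting_x = [(1, 0, 0), (1, 0, 0), (0, 0, 1)]
conflicting_y = [1, -1, -1]

w_sep = train_perceptron(separable_x, separable_y)
w_con = train_perceptron(conflicting_x, conflicting_y)
print(training_errors(w_sep, separable_x, separable_y))      # 0
print(training_errors(w_con, conflicting_x, conflicting_y))  # at least 1
```

The second data set shows the situation that is likely with 60,000
mails and about 1000 rules: as soon as one ham and one spam hit the
same set of rules, a training error is unavoidable.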
> Looking at the code I don't see anything to check for convergence. If,
> as you say, the data are never linearly separable, I would think that
> would make the results tend to be erratic.

Yes, but repeating several runs and averaging the weights would be
expected to counter that effect. Of course the maximum-margin
hyperplane cannot be guaranteed by even the most elaborate perceptron
approach, but I initially used linear regression (which similarly has
no such guarantee) with good results.

> For it to be adopted by the SpamAssassin developers I am sure that they
> will have to see some hard data comparing results of doing it each way.
> The perceptron as it is used now is initialized with random weights. It
> is run multiple times to use ten-fold cross-validation. I have to look
> over that again to remind myself exactly how that is used to generate
> the final rule scores.

That was not my original intention. The default score set of both SA
2.6.4 and SA 3.0.1 performs badly on our mails and rapidly gets worse
over time. During the last 18 months I have trained similar systems to
improve this, and they have been tested at the Austrian Research
Institute for Artificial Intelligence by seven colleagues of mine.
Obviously a test by myself alone would not have been sufficient. It
should be noted that I was notified of only a single false positive
during this timespan.

I am challenging the prevalent view that a single score set is
sufficient for _all_ users. My intention was to provide this tool as a
way for others who are similarly disappointed with the default score
set to train their own score set easily. Whether or not the SA
developers adopt SVMs instead of a repeated-run perceptron is of no
concern to me. I still think that this warrants a link from the SA
page. If you are of a different opinion, I will just rely on Google.
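The idea of repeating several perceptron runs and averaging the
weights, mentioned in the reply above, can be sketched as follows. This
is not the SA perceptron code: the random initialization range, the
number of runs, the fixed seed, and the toy rule-hit data are all
assumptions chosen only for illustration.

```python
import random


def score(w, x):
    """Weighted sum of rule hits; the last weight is the bias term."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w[-1]


def train_once(samples, labels, rng, epochs=1000):
    """One perceptron run with random initial weights and random order."""
    w = [rng.uniform(-1.0, 1.0) for _ in range(len(samples[0]) + 1)]
    order = list(range(len(samples)))
    for _ in range(epochs):
        rng.shuffle(order)
        mistakes = 0
        for i in order:
            x, y = samples[i], labels[i]
            if y * score(w, x) <= 0:
                w = [wi + y * xi for wi, xi in zip(w, x)] + [w[-1] + y]
                mistakes += 1
        if mistakes == 0:
            break
    return w


def averaged_weights(samples, labels, runs=10, seed=0):
    """Average the weight vectors of several independent runs."""
    rng = random.Random(seed)
    all_w = [train_once(samples, labels, rng) for _ in range(runs)]
    return [sum(column) / runs for column in zip(*all_w)]


# Hypothetical rule-hit vectors: +1 = spam, -1 = ham.
mails = [(1, 1, 0), (1, 0, 0), (1, 0, 1), (0, 0, 1), (0, 0, 0), (0, 1, 0)]
labels = [1, 1, 1, -1, -1, -1]

avg = averaged_weights(mails, labels)
print(avg)  # one averaged score per rule, plus the bias as last entry
```

Each single run ends at a different weight vector because of the random
start; the average varies much less from one batch of runs to the next.
On separable data such as this toy set, the average of separating
weight vectors is itself separating. On non-separable data, averaging
only damps the run-to-run variation and gives no margin guarantee.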
> Anyway, the results are different each time the perceptron is run, but
> results using the final scores that are calculated will have to be
> compared to results using scores calculated with the SVM.

This is not interesting to me, as SA - even when trained with SVMs -
performs very similarly to a pure NaiveBayes learner (SpamBayes), and a
single learner is preferable for reasons of performance and because it
allows incremental updates, which would be hard for SA-Train. It should
however be very easy to adapt Algorithm::SVM for this purpose, but I
would not expect it to solve the main problems: that one single score
set is not sufficient for the whole world, and that SA does not work
significantly better than a NaiveBayes learner on its own, at least if
it has enough data to work with. 60,000 mails (half ham, half spam)
were sufficient to train an institute-wide model at the Austrian
research institute, and the most recent model has been tested for half
a year now.

> The problem is that SA is licensed under the Apache Software Foundation
> (ASF) License which has fewer restrictions than the GPL. Anything that
> is licensed under GPL cannot be distributed without source code or made
> part of software that is distributed without source code. SpamAssassin's
> license does allow it to be made part of a commercial closed-source
> product, and there are companies that have done so. That prevents us
> from incorporating any GPL'd code into SpamAssassin.

Making the script part of a commercial tool does not appeal to me,
since people would be expected to pay for something I intend to remain
free. Also, without a C library and additional interfaces the script is
more lightweight. On second thought, the GPL suits me just fine.

> For you to contribute code to SpamAssassin, you would have to sign an
> agreement http://www.apache.org/licenses/icla.txt making it available
> under the ASF license http://www.apache.org/licenses/LICENSE-2.0

I'm sorry, but I cannot do this.
However, as the approach used is very simple, you should be able to
reproduce it in your scripts quite easily:

* train all mails from the example set via sa-learn
* run Algorithm::SVM (lambda=1) on the set of rules, similar to the
  perceptron. This prevents the licensing issues.
* extract weights and optionally apply model to test set.

Best,
  Alex
-- 
Dr.techn. Alexander K. Seewald
Solutions for the 21st century  +43(664)1106886
------------------------------------------------
Information wants to be free;
Information also wants to be expensive (S.Brant)
--------------- alex.seewald.at ----------------
