Alexander K. Seewald wrote:
> On Sat, Nov 19, 2005 at 10:04:19AM +1300, Sidney Markowitz wrote:
>> Alexander, did you try using the SpamAssassin perceptron and compare
>> the results of using its rule scores with using the scores from the
>> SVM?
>
> No, as above: the final hyperplane is defined by initial conditions
> (i.e. the initial weight vector), and the perceptron does not
> guarantee anything for non-linearly-separable data. In my experience
> most reasonably large mailboxes are _not_ linearly separable,

That's interesting. The microarray data for cancer cells I was looking
at was just the opposite: we have on the order of ten thousand genes
and on the order of only a hundred training samples, and the data is
_always_ linearly separable. In that case the main advantage of the
SVM is that it maximizes the margin (roughly speaking, the distance
between the hyperplane and the closest training points of the two
classes), which gives the most leeway when classifying future samples
that fall just outside the boundaries of the training data classes.
That's how I always thought it would prove beneficial when used for
spam, since spam can be expected to evolve.

I haven't thought much about the situation when the classes are not
linearly separable. SpamAssassin's perceptron uses stochastic gradient
descent with random initial weights for a fixed number of epochs.
Looking at the code I don't see anything that checks for convergence.
If, as you say, the data are never linearly separable, I would expect
that to make the results erratic.
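Just so we are talking about the same thing, the general shape I mean
is something like this -- a perl sketch of a plain perceptron, not
SpamAssassin's actual implementation: random initial weights, a fixed
number of passes over the data, and no convergence test anywhere.

  use strict;
  use warnings;

  sub train_perceptron {
      my ($examples, $nfeatures, $epochs, $rate) = @_;
      # $examples: arrayref of [ $label (+1 or -1), \@feature_vector ]
      my @w = map { rand() - 0.5 } 0 .. $nfeatures;   # w[0] is the bias
      for my $epoch (1 .. $epochs) {                  # fixed epoch count,
          for my $ex (@$examples) {                   # no convergence check
              my ($y, $x) = @$ex;
              my $sum = $w[0];
              $sum += $w[$_ + 1] * $x->[$_] for 0 .. $#$x;
              next if $y * $sum > 0;                  # correctly classified
              $w[0] += $rate * $y;                    # otherwise nudge the plane
              $w[$_ + 1] += $rate * $y * $x->[$_] for 0 .. $#$x;
          }
      }
      return \@w;  # a different hyperplane every run, thanks to the random init
  }

With non-separable data there is no pass after which the inner loop
stops making mistakes, so where it ends up after the last epoch is
largely an accident of the initial weights and the example order.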
> there is IMHO no point to compare it to the perceptron as it can be
> expected to perform worse. How much worse depends on the initial
> weight vector, and setting it to random practically ensures a
> different final hyperplane for each run of the algorithm... which one
> to compare it to?

For it to be adopted by the SpamAssassin developers, I am sure that
they will have to see some hard data comparing the results of doing it
each way. The perceptron as it is used now is initialized with random
weights and is run multiple times as part of ten-fold cross-validation.
I have to look over that again to remind myself exactly how it is used
to generate the final rule scores. In any case, the results are
different each time the perceptron is run, but results using the final
scores it produces will have to be compared against results using
scores calculated with the SVM.

> I don't know, I haven't checked. IMHO the time should be practical -
> I've used the original WEKA code for up to 60,000 mails, and it took
> several hours to run so there should be more leeway.

The comments in README.perceptron say that it takes around 15 seconds,
compared to hours for the old GA. We really should see how
Algorithm::SVM compares; it should be the fastest SVM available.

> I would also be interested in that. However, for me there would not
> be any problem licensing it under GPL as well. I gather there is a
> problem with licensing SA under GPL? Or licensing any perl program
> under GPL?

The problem is that SA is licensed under the Apache Software Foundation
(ASF) License, which has fewer restrictions than the GPL. Anything that
is licensed under the GPL cannot be distributed without source code or
made part of software that is distributed without source code.
SpamAssassin's license does allow it to be made part of a commercial
closed-source product, and there are companies that have done so. That
prevents us from incorporating any GPL'd code into SpamAssassin. For
you to contribute code to SpamAssassin, you would have to sign an
agreement (http://www.apache.org/licenses/icla.txt) making it available
under the ASF license (http://www.apache.org/licenses/LICENSE-2.0).

> I will either offer a second version using Algorithm::SVM, and -
> pending the outcome of the license debate - possibly remove the old
> ported version, should it be impossible to license under GPL.

> AFAIK, everything which contains even a part of a GPL program also
> needs to be licensed under GPL, so perhaps that is the licensing
> problem you are referring to?

Exactly. Using Algorithm::SVM instead of the WEKA code avoids that
problem, as WEKA is licensed under the GPL while CPAN modules and
libsvm have licenses that are effectively the same as the ASF license.
I suspect that it will run much faster too.
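To make it concrete, this is roughly what the training side would look
like from my quick read of the Algorithm::SVM documentation -- a sketch
with made-up toy data standing in for the real rule-hit vectors and an
arbitrary model file name, so treat the details as unverified:

  use strict;
  use warnings;
  use Algorithm::SVM;
  use Algorithm::SVM::DataSet;

  # Toy stand-in for the real training set: one vector of rule hits
  # per message, label +1 for spam and -1 for ham.
  my @tset = (
      Algorithm::SVM::DataSet->new(Label =>  1, Data => [1, 1, 0]),
      Algorithm::SVM::DataSet->new(Label =>  1, Data => [1, 0, 1]),
      Algorithm::SVM::DataSet->new(Label => -1, Data => [0, 1, 0]),
      Algorithm::SVM::DataSet->new(Label => -1, Data => [0, 0, 1]),
  );

  # Linear kernel, since what we ultimately want out of it is one
  # weight per rule.
  my $svm = Algorithm::SVM->new(Type => 'C-SVC', Kernel => 'linear');
  $svm->train(@tset);

  # my $acc = $svm->validate(10);  # ten-fold CV, once there is a real corpus

  # Classify a new message, and save the underlying libsvm model so
  # the weights can be dug out of it later.
  my $msg = Algorithm::SVM::DataSet->new(Label => 0, Data => [1, 0, 0]);
  print "prediction: ", $svm->predict($msg), "\n";
  $svm->save('sa-rules.model');

If I have any of the constructor parameters wrong, the idea is still
the same: build DataSet objects, train a linear C-SVC, and keep the
saved model around.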
In my first look at Algorithm::SVM I don't see a way to get the weights
out of the trained SVM. I know that they are there in libsvm, but the
author of Algorithm::SVM may not have written a perl function to
retrieve them. If that is the case and you are not familiar with
writing functions to access C from perl, I can help with adding that
functionality.
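In the meantime there is a stopgap that needs no changes to
Algorithm::SVM at all: save the model to a file and reconstruct the
weights from the support vectors, since for a linear kernel the weight
vector is just the sum of coefficient times support vector. This
sketch assumes libsvm's usual text model format (header lines, then an
"SV" section with one coefficient per support vector followed by
index:value pairs) and a two-class model, so it is untested guesswork
until someone checks it against a real model file:

  use strict;
  use warnings;

  my $file = shift or die "usage: $0 model-file\n";
  open my $fh, '<', $file or die "can't open $file: $!";

  my ($rho, $in_sv, @w) = (0, 0);
  while (my $line = <$fh>) {
      chomp $line;
      if (!$in_sv) {                      # still in the header
          $rho = $1 if $line =~ /^rho\s+(\S+)/;
          $in_sv = 1 if $line eq 'SV';
          next;
      }
      # Each SV line: <alpha_i * y_i> <index>:<value> <index>:<value> ...
      my ($coef, @feats) = split ' ', $line;
      for my $f (@feats) {
          my ($idx, $val) = split /:/, $f;
          $w[$idx] ||= 0;
          $w[$idx] += $coef * $val;       # w = sum_i alpha_i * y_i * x_i
      }
  }
  close $fh;

  # For a linear kernel the decision function is sign(w . x - rho), so
  # these per-feature weights play the same role as the rule scores.
  printf "bias: %g\n", -$rho;
  printf "w[%d] = %g\n", $_, ($w[$_] || 0) for 1 .. $#w;

The cleaner fix would still be a small accessor in the wrapper that
computes the same sum from the in-memory model, which is the part I
offered to help with.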
 -- sidney