Alexander K. Seewald wrote:
> On Sat, Nov 19, 2005 at 10:04:19AM +1300, Sidney Markowitz wrote:
>> Alexander, did you try using the SpamAssassin perceptron and compare the
>> results of using its rule scores with using the scores from the SVM?
> No, as above: the final hyperplane is defined by initial conditions
> (i.e. the initial weight vector), and the perceptron does not
> guarantee anything for non-linear-separable data. In my experience
> most reasonably large mailboxes are _not_ linearly separable,

That's interesting. The microarray data for cancer cells I was looking
at was just the opposite: we have on the order of ten thousand genes
and on the order of only a hundred training samples. The data is
_always_ linearly separable. In that case the main advantage of the SVM
is that it maximizes the margin (roughly speaking, the distance between
the hyperplane and the boundaries of the two classes), which gives the
most leeway when classifying future data samples that fall just outside
the boundaries of the training data classes. That's why I always thought
it would prove beneficial when used for spam, since spam can be expected
to evolve.
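To make the margin idea concrete, here is a tiny stand-alone sketch (Python, with made-up 2-D data; nothing from SpamAssassin or the microarray sets). Both hyperplanes below separate the training points, but only the larger-margin one leaves equal room on both sides for future samples that drift past the training classes:

```python
import math

def margin(w, b, points):
    """Smallest distance from any sample to the hyperplane w.x + b = 0."""
    norm = math.hypot(w[0], w[1])
    return min(abs(w[0] * x + w[1] * y + b) / norm for x, y, _ in points)

# Two tiny 2-D classes, labelled +1 / -1; any vertical line between
# x=0 and x=2 separates them, but the margins differ.
pts = [(0, 0, -1), (0, 1, -1), (2, 0, 1), (2, 1, 1)]

print(margin((1, 0), -0.5, pts))  # plane x = 0.5 -> margin 0.5
print(margin((1, 0), -1.0, pts))  # plane x = 1.0 -> margin 1.0, the maximum
```

The SVM picks the second plane; a perceptron is happy to stop at either.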

I haven't thought much about the situation when the classes are not
linearly separable. SpamAssassin's perceptron uses stochastic gradient
descent with random initial weights for a fixed number of epochs.
Looking at the code I don't see anything that checks for convergence.
If, as you say, the data are typically not linearly separable, I would
expect that to make the results erratic.
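For what it's worth, the instability is easy to reproduce with a toy model. This is a minimal fixed-epoch stochastic perceptron in Python, not SpamAssassin's actual implementation; on non-separable (XOR-style) data the mistake-driven updates never stop, so different random initial weights give different final hyperplanes:

```python
import random

def train_perceptron(data, epochs=50, seed=0):
    """Stochastic perceptron with random initial weights, run for a fixed
    number of epochs with no convergence check (mirroring the behaviour
    described above, not SA's actual code)."""
    rng = random.Random(seed)
    w = [rng.uniform(-1, 1) for _ in range(3)]  # bias + 2 feature weights
    lr = 0.1
    for _ in range(epochs):
        rng.shuffle(data)
        for x1, x2, label in data:
            activation = w[0] + w[1] * x1 + w[2] * x2
            pred = 1 if activation >= 0 else -1
            if pred != label:  # update on mistakes only
                w[0] += lr * label
                w[1] += lr * label * x1
                w[2] += lr * label * x2
    return w

# XOR-style data: not linearly separable, so the weights never settle.
xor = [(0, 0, -1), (0, 1, 1), (1, 0, 1), (1, 1, -1)]
w_a = train_perceptron(list(xor), seed=1)
w_b = train_perceptron(list(xor), seed=2)
print(w_a, w_b)  # different hyperplanes from different initial weights
```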

> there is IMHO no point to compare it to the perceptron as it can be
> expected to perform worse. How much worse depends on the initial weight
> vector, and setting it to random practically ensures a different
> final hyperplane for each run of the algorithm... which one to
> compare it to?

For it to be adopted by the SpamAssassin developers I am sure that they
will have to see some hard data comparing results of doing it each way.
The perceptron as it is used now is initialized with random weights. It
is run multiple times to use ten-fold cross-validation. I have to look
over that again to remind myself exactly how that is used to generate
the final rule scores.

Anyway, the results are different each time the perceptron is run, but
the final scores that come out of that process will still have to be
compared against scores calculated with the SVM.

> I don't know, I haven't checked. IMHO the time should be practical -
> I've used the original WEKA code for up to 60,000 mails, and it took
> several hours to run so there should be more leeway.

The comments in README.perceptron say that it takes around 15 seconds,
compared to hours for the old GA. We really should see how
Algorithm::SVM compares; since it wraps libsvm, it should be among the
fastest SVM implementations available.

> I would also be interested in that. However, for me there would not
> be any problem licensing it under GPL as well. I gather there is a
> problem with licensing SA under GPL? Or licensing any perl program
> under GPL?

The problem is that SA is licensed under the Apache Software Foundation
(ASF) License which has fewer restrictions than the GPL. Anything that
is licensed under GPL cannot be distributed without source code or made
part of software that is distributed without source code. SpamAssassin's
license does allow it to be made part of a commercial closed-source
product, and there are companies that have done so. That prevents us
from incorporating any GPL'd code into SpamAssassin.

For you to contribute code to SpamAssassin, you would have to sign an
agreement (http://www.apache.org/licenses/icla.txt) making it available
under the ASF license (http://www.apache.org/licenses/LICENSE-2.0).

> I will either offer a second version using Algorithm::SVM, and -
> pending the outcome of the license debate - possibly remove the old
> ported version, should it be impossible to license under GPL.
> AFAIK, everything which contains even a part of a GPL program also
> needs to be licensed under GPL, so perhaps that is the licensing
> problem you are referring to?

Exactly. Using Algorithm::SVM instead of WEKA code avoids that problem,
as WEKA is licensed under GPL while CPAN modules and libsvm have
licenses that are effectively the same as the ASF license. I suspect
that it will run much faster too.

In my first look at Algorithm::SVM I don't see a way to get the weights
from the trained SVM. I know that they are there in libsvm, but the
author of Algorithm::SVM may not have written a perl function to get
that information. If that is the case and you are not familiar with
writing functions to access C from perl, I can help with adding in that
functionality.
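In case it helps: for a linear kernel, the primal weights are recoverable from exactly what libsvm keeps in its model struct, namely the support vectors (SV) and their dual coefficients alpha_i * y_i (sv_coef), as w = sum_i sv_coef[i] * SV[i]. A sketch with invented numbers (Python for brevity; the same sum would be a few lines of XS or perl once those fields are exposed):

```python
def primal_weights(sv_coef, support_vectors):
    """w = sum_i (alpha_i * y_i) * SV_i; valid for a linear kernel only."""
    dim = len(support_vectors[0])
    w = [0.0] * dim
    for coef, sv in zip(sv_coef, support_vectors):
        for j in range(dim):
            w[j] += coef * sv[j]
    return w

# Hypothetical values of the kind libsvm's svm_model holds in its
# sv_coef and SV fields; the numbers are made up for illustration.
coefs = [0.5, -0.5]
svs = [[1.0, 2.0], [3.0, 0.0]]
print(primal_weights(coefs, svs))  # [-1.0, 1.0]
```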

 -- sidney
