Re: SA-Train (fwd)

Alexander K. Seewald Mon, 21 Nov 2005 00:33:46 -0800

On Sat, Nov 19, 2005 at 10:35:05PM +1300, Sidney Markowitz wrote:
> That's interesting. The microarray data for cancer cells I was looking
> at was just the opposite: We have on the order of ten thousand genes
> and on the order of only a hundred training samples. The data is
> _always_ linearly separable. In that case the main advantage of the SVM
This is of course because the number of examples is much smaller
than the number of attributes, and the attributes are highly
correlated, effectively reducing their number. But for SA, you have
about 1000 attributes (one for each rule) and if you use 60,000 mails
it is clear that linear separability cannot be guaranteed in all cases.
This can easily be checked: reapply the perceptron model to the training
data. If you still get even a single training set error after giving the
perceptron enough passes to converge, the data is not linearly separable.
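
A minimal sketch of this check in Python, not the SA-Train code itself. The
helper names, the epoch limit and the toy data are assumptions made for
illustration. It trains a plain perceptron and then reapplies it to the
training data. A nonzero error count only points to non-separability if the
epoch limit was generous enough for the perceptron to converge.

```python
import random


def train_perceptron(X, y, epochs=1000, seed=0):
    """Plain perceptron. Labels in y are +1 (spam) or -1 (ham)."""
    rng = random.Random(seed)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        mistakes = 0
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            if y[i] * score <= 0:  # misclassified (or on the boundary)
                w = [wj + y[i] * xj for wj, xj in zip(w, X[i])]
                b += y[i]
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: converged
            break
    return w, b


def training_errors(w, b, X, y):
    """Reapply the model to the training data and count the errors."""
    errors = 0
    for xi, yi in zip(X, y):
        score = sum(wj * xj for wj, xj in zip(w, xi)) + b
        if yi * score <= 0:
            errors += 1
    return errors


if __name__ == "__main__":
    # Toy data (assumed): two binary "rules", four mails.
    X = [[0, 0], [0, 1], [1, 0], [1, 1]]
    for name, y in [("AND-like", [-1, -1, -1, 1]), ("XOR-like", [-1, 1, 1, -1])]:
        w, b = train_perceptron(X, y)
        print(name, "training errors:", training_errors(w, b, X, y))
```

The AND-like labels are linearly separable and the XOR-like labels are not,
so the second case can never reach zero training errors.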



> Looking at the code I don't see anything to check for convergence. If,
> as you say, the data are never linearly separable, I would think that
> would make the results tend to be erratic.
Yes, but repeating several runs and averaging the weights would be
expected to counter that effect. Of course even the most elaborate
perceptron approach cannot guarantee finding the maximum margin
hyperplane, but I initially used linear regression
(which similarly has no such guarantee) with good results.
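
The idea of averaging over repeated runs can be sketched as follows. This is
an illustration only, not the SpamAssassin perceptron: the function names,
the range of the random initial weights, the number of runs and the toy data
are all assumptions.

```python
import random


def perceptron_run(X, y, epochs, rng):
    """One perceptron run starting from random weights (range is assumed)."""
    w = [rng.uniform(-0.5, 0.5) for _ in range(len(X[0]))]
    b = rng.uniform(-0.5, 0.5)
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        mistakes = 0
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            if y[i] * score <= 0:
                w = [wj + y[i] * xj for wj, xj in zip(w, X[i])]
                b += y[i]
                mistakes += 1
        if mistakes == 0:
            break
    return w, b


def averaged_weights(X, y, runs=10, epochs=100, seed=0):
    """Repeat the run several times and average weights and bias."""
    rng = random.Random(seed)
    results = [perceptron_run(X, y, epochs, rng) for _ in range(runs)]
    n = len(X[0])
    w_avg = [sum(w[j] for w, _ in results) / runs for j in range(n)]
    b_avg = sum(b for _, b in results) / runs
    return w_avg, b_avg


if __name__ == "__main__":
    # Toy data (assumed): three binary rules, labels +1 = spam, -1 = ham.
    X = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
    y = [1, 1, -1, -1]
    print(averaged_weights(X, y))
```

Each run ends at a different weight vector because of the random start and
the shuffled order. Averaging smooths out that variation, but it does not
turn the result into a maximum margin solution.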


> For it to be adopted by the SpamAssassin developers I am sure that they
> will have to see some hard data comparing results of doing it each way.
> The perceptron as it is used now is initialized with random weights. It
> is run multiple times to use ten-fold cross-validation. I have to look
> over that again to remind myself exactly how that is used to generate
> the final rule scores.
That was not my original intention. The default score set of both
SA 2.6.4 and SA 3.0.1 performs badly on our mails and rapidly gets
worse over time, and during the last 18 months I have trained similar
systems to improve this, which have been tested at the Austrian Research
Institute for Artificial Intelligence by seven colleagues of mine.
Obviously a test by myself would not have been sufficient.
It should be noted that I was notified of only a single
false positive during this timespan. I am challenging the prevalent
view that a single score set is sufficient for _all_ users.

My intention was to provide this tool as a way for others who are
similarly disappointed with the default score set to train their
own score set easily. Whether or not the SA developers adopt SVMs instead
of a repeated-run perceptron is of no concern to me. I still think
that this warrants a link from the SA page. If you are of a
different opinion, I will just rely on Google.


> Anyway, the results are different each time the perceptron is run, but
> results using the final scores that are calculated will have to be
> compared to results using scores calculated with the SVM.
This is not interesting to me, as SA - even when trained with SVMs
- performs very similarly to a pure NaiveBayes learner (SpamBayes),
and a single learner is preferable for reasons of performance and
the possibility of incremental updates, which would be hard for
SA-Train. It should however be very easy to adapt Algorithm::SVM
for this purpose, but I would not expect it to solve the main
problems: that one single score set is not sufficient for the whole
world, and that SA does not work significantly better than a
NaiveBayes learner on its own, at least if it has enough data to
work with. 60,000 mails (half ham, half spam) were sufficient to
train an institute-wide model at the Austrian research institute,
and the most recent model has been tested for half a year now.


> The problem is that SA is licensed under the Apache Software Foundation
> (ASF) License which has fewer restrictions than the GPL. Anything that
> is licensed under GPL cannot be distributed without source code or made
> part of software that is distributed without source code. SpamAssassin's
> license does allow it to be made part of a commercial closed-source
> product, and there are companies that have done so. That prevents us
> from incorporating any GPL'd code into SpamAssassin.
Making the script part of a commercial tool does not appeal to me, since
people would be expected to pay for something I intend to remain free.
Also, without a C-library and additional interfaces the script is more
light-weight. On second thought, the GPL suits me just fine.


> For you to contribute code to SpamAssassin, you would have to sign an
> agreement http://www.apache.org/licenses/icla.txt making it available
> under the ASF license http://www.apache.org/licenses/LICENSE-2.0
I'm sorry, but I cannot do this. However, as the approach used is very
simple, you should be able to reproduce it in your scripts quite easily:
* train on all mails from the example set via sa-learn
* run Algorithm::SVM (lambda=1) on the set of rules, similar to the
  perceptron. This avoids the licensing issues.
* extract the weights and optionally apply the model to a test set.
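
The last two steps can be sketched in Python as below. This is NOT
Algorithm::SVM and not the SA-Train script. It swaps in a small linear SVM
trained by batch subgradient descent on the hinge loss, purely to show how
per-rule weights fall out of a linear model. The rule names, the toy
rule-hit matrix and the parameters lam, lr and epochs are assumptions, and
lam is not claimed to correspond to the lambda=1 setting above. It also
assumes the rule hits per mail have already been collected into a 0/1
matrix.

```python
def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=500):
    """Linear SVM via batch subgradient descent on the L2-regularised hinge loss.

    X: one row per mail, one 0/1 column per rule. y: +1 = spam, -1 = ham.
    """
    n_samples, n_rules = len(X), len(X[0])
    w = [0.0] * n_rules
    b = 0.0
    for _ in range(epochs):
        grad_w = [lam * wj for wj in w]  # gradient of the regulariser
        grad_b = 0.0
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            if margin < 1:  # sample violates the margin: hinge loss is active
                for j, xj in enumerate(xi):
                    grad_w[j] -= yi * xj / n_samples
                grad_b -= yi / n_samples
        w = [wj - lr * gj for wj, gj in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def weights_as_scores(rule_names, w):
    """Extract the learned weights as a rule -> score mapping."""
    return dict(zip(rule_names, w))


def predict(w, b, x):
    """Apply the model to one mail: +1 = spam, -1 = ham."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1


if __name__ == "__main__":
    # Hypothetical rule names and toy rule-hit data.
    rules = ["HYPOTHETICAL_RULE_A", "HYPOTHETICAL_RULE_B", "HYPOTHETICAL_RULE_C"]
    X = [[1, 0, 0], [1, 1, 0], [0, 0, 1], [0, 1, 1]]
    y = [1, 1, -1, -1]
    w, b = train_linear_svm(X, y)
    print(weights_as_scores(rules, w))
    print([predict(w, b, x) for x in X])
```

In this toy set the first rule fires only on spam and the third only on ham,
so they receive a positive and a negative weight respectively.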


Best,
  Alex
-- 
Dr.techn. Alexander K. Seewald

Solutions for the 21st century   +43(664)1106886
------------------------------------------------
         Information wants to be free;
Information also wants to be expensive (S.Brant)
--------------- alex.seewald.at ----------------
