http://bugzilla.spamassassin.org/show_bug.cgi?id=2910

           Summary: Fast SpamAssassin score learning tool.
           Product: Spamassassin
           Version: current-CVS
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: GA
        AssignedTo: [EMAIL PROTECTED]
        ReportedBy: [EMAIL PROTECTED]


Copyright (c)2003 Henry Stern

Fast SpamAssassin Score Learning Tool

Henry Stern
Faculty of Computer Science
Dalhousie University
6050 University Avenue
Halifax, NS  Canada
B3H 1W5
[EMAIL PROTECTED]

January 8, 2004

1.  WHAT IS IT?

This program is used to compute scores for SpamAssassin rules.  It makes
use of data files generated by the suite of scripts in
spamassassin/masses.  The program outputs the generated scores in a file 
titled 'perceptron.scores'.

The advantage of this program over the genetic algorithm (GA)
implementation in spamassassin/masses/craig_evolve.c is that while the GA
requires several hours to run on high-end machines, the perceptron
requires only about 15 seconds of CPU time on an Athlon XP 1700+ system.

This makes incremental updates and score personalization practical for the
end-user and gives developers a better idea just how useful a new rule is.

2.  OPTIONS

There are four options that can be passed to the perceptron program.

  -p ham_preference

        This increases the number of non-spam messages in the training
        set.  It does this by adding 1 + (number of tests hit) *
        ham_preference instances of each non-spam message to the training
        set.  This is intended to reduce false positives by encouraging
        the training program to look at the harder-to-classify ham
        messages more often.  By default, this parameter is 2.0.
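        The replication rule above can be sketched as a small helper (a
        Python illustration with a hypothetical name; the actual tool is
        written in C):

```python
def ham_copies(num_tests_hit, ham_preference=2.0):
    # Each non-spam message appears 1 + (tests hit) * ham_preference
    # times in the training set, so hams that hit many rules (the
    # harder-to-classify ones) are trained on more often.
    return int(1 + num_tests_hit * ham_preference)
```

        A ham hitting no rules appears once; one hitting three rules
        appears seven times at the default preference of 2.0.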

  -e num_epochs

        This parameter sets how many passes the perceptron will make 
        through the training set before terminating.  On each pass, the 
        training set is shuffled and then iterated through.  By default, 
        it will make 15 passes.

  -l learning_rate

        This parameter sets the learning rate of the perceptron.  The
        error gradient is computed for each instance.  This program uses
        a logsig activation function y = 1/(1+exp(-x)), so the error
        gradient is computed as E(x) = y*(1-y)*(is_spam - y).  For each
        instance in the training set, each score that the instance hits
        is modified by adding E(x) / (number of tests hit + 1) *
        learning_rate.  The default value is 2.0, but it can be set to
        whatever you want.
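        The per-instance update described above can be sketched in Python
        (hypothetical names; the real program is implemented in C):

```python
import math

def logsig(x):
    # y = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + math.exp(-x))

def update_scores(scores, hits, is_spam, learning_rate=2.0):
    # One stochastic update for a single message.
    # scores:  dict mapping rule name -> weight
    # hits:    rules the message triggered
    # is_spam: 1.0 for spam, 0.0 for ham
    y = logsig(sum(scores[r] for r in hits))
    grad = y * (1.0 - y) * (is_spam - y)           # E(x)
    step = grad / (len(hits) + 1) * learning_rate  # damped by hit count
    for r in hits:
        scores[r] += step
    return scores
```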

  -w weight_decay

        To keep the scores from growing too large (or to actively shrink
        them, if you want), the scores and network bias are multiplied by
        the weight decay before each epoch.  This is off by default
        (1.0), but it can be useful.

3.  HOW DOES IT WORK?

This program implements the "Stochastic Gradient Descent" method of 
training a neural network.  It uses a single perceptron with a logsig 
activation function and maps the weights to SpamAssassin score space.

The perceptron is the simplest form of neural network.  It consists of a 
transfer function and an activation function.  Together, they simulate the 
average firing rate of a biological neuron over time.

The transfer function is the sum of the product of weights and inputs.  
It simulates the membrane potential of a neuron.  When the accumulated
charge on the membrane exceeds a certain threshold, the neuron fires,
sending an electrical impulse down the axon. This implementation uses a
linear transfer function:

[1]     f(x) = bias + sum_{i=1}^{n} w_i * x_i

This is quite similar to how the SpamAssassin score system works.  If you
set the bias to 0, w_i to the score for rule i and x_i to whether or not
rule i is hit by a given message, the transfer function returns the score
of the message.
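With bias 0, the transfer function is just the message's total score.  A
minimal Python sketch of equation [1] (hypothetical names):

```python
def transfer(scores, hits, bias=0.0):
    # f(x) = bias + sum of the weights (scores) of the rules hit;
    # with bias 0 this is exactly the SpamAssassin message score.
    return bias + sum(scores[r] for r in hits)
```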

The activation function simulates the electrical spike travelling down the 
axon.  It takes the output of the transfer function as input and applies 
some sort of transformation to it.  This implementation uses a logsig 
activation function:

[2]     y(x) = 1 / (1 + exp(-f(x)))

This non-linear function constrains the output of the transfer function 
to between 0 and 1.  When plotted, it is S-shaped, with horizontal 
asymptotes at y=0 as x approaches -infinity and y=1 as x approaches 
infinity.  The slope of the function is greatest at x=0 and tapers off 
as it approaches the asymptotes.
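Sampling equation [2] at a few points shows this behaviour (a short
Python illustration):

```python
import math

def logsig(x):
    # y(x) = 1 / (1 + exp(-x))
    return 1.0 / (1.0 + math.exp(-x))

# The curve passes through 0.5 at x = 0 and saturates quickly
# toward its asymptotes at 0 and 1.
samples = {x: logsig(x) for x in (-10, -2, 0, 2, 10)}
```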

Lastly, the performance of the perceptron needs to be measured using an 
error function.  Two error functions are commonly used:  mean squared 
error and entropic error.  By default, this implementation uses mean 
squared error but entropic error may be substituted by adding a compiler 
directive.

The most common method of training neural networks is called gradient
descent.  It involves iteratively tuning the parameters of the network so
that the mean error rate always decreases.  This is done by finding the
direction of steepest descent down the "error gradient," reducing the
value of the error function for the next iteration of the algorithm.

If the transfer function and activation function are both differentiable,
the error gradient of a neural network can be calculated with respect to
the weights and bias.  Without getting into calculus, the error gradient 
for a perceptron with a linear transfer function, logsig activation 
function and mean squared error function is:

[3]     E(x) = y(x) * (1-y(x)) * (y_expected - y(x))

The weights are updated using the function:

[4]     w_i = w_i + E(x) * x_i * learning_rate

Since the SpamAssassin rule hits are sparse, the basic gradient descent 
algorithm is impractical.  This implementation uses a variation called 
"Stochastic gradient descent."  Instead of doing one batch update per 
epoch, the training set is randomly walked through, doing incremental 
updates.  In addition, in this implementation, the learning rate is 
modified by the number of rule hits for a given training instance.  
Together, these allow good weights to be computed for 
infrequently-occurring rules.
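The whole training procedure might be outlined as follows (a sketch under
the defaults described in section 2, with hypothetical names; the actual
implementation is in C):

```python
import math
import random

def train(instances, scores, bias=0.0, epochs=15,
          learning_rate=2.0, weight_decay=1.0):
    # instances: list of (hits, is_spam) pairs, with ham messages
    # already replicated according to the -p option.
    for _ in range(epochs):
        # Weight decay (-w) is applied once per epoch.
        for r in scores:
            scores[r] *= weight_decay
        bias *= weight_decay
        random.shuffle(instances)  # stochastic: new random order each epoch
        for hits, is_spam in instances:
            f = bias + sum(scores[r] for r in hits)   # transfer [1]
            y = 1.0 / (1.0 + math.exp(-f))            # logsig activation [2]
            grad = y * (1.0 - y) * (is_spam - y)      # error gradient [3]
            step = grad / (len(hits) + 1) * learning_rate
            bias += step
            for r in hits:
                scores[r] += step
    return scores, bias
```

With this scheme, a rule hit only by spam drifts toward a positive
weight and a rule hit only by ham toward a negative one, regardless of
how the shuffles fall out.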

Once the perceptron has finished running, the weights are converted to 
scores and exported using the familiar file format.  Weights are converted 
to scores using this function:

[5]     score(weight) = -threshold * weight / bias
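Here threshold is SpamAssassin's spam cutoff (5.0 by default), so a
message sitting exactly on the perceptron's decision boundary (f(x) = 0,
i.e. total weight = -bias) maps to a total score equal to the threshold.
A sketch of the conversion:

```python
def weight_to_score(weight, bias, threshold=5.0):
    # Rescale a perceptron weight so that a message whose weights
    # sum to -bias (the decision boundary) scores exactly `threshold`.
    return -threshold * weight / bias
```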

4.  ACKNOWLEDGEMENTS

I would like to thank my PhD supervisor, Michael Shepherd, for not getting 
mad at me while I worked on this.  I'd also like to thank Thomas 
Trappenberg for his invaluable assistance while I was tweaking the 
performance of the learning algorithm.  I would like to thank Daniel 
Quinlan, Justin Mason and Theo Van Dinter for their valuable input and 
constructive criticism.

--
hs
8/1/2004


