http://bugzilla.spamassassin.org/show_bug.cgi?id=2910
------- Additional Comments From [EMAIL PROTECTED] 2004-01-11 12:10 -------
Subject: RE: Fast SpamAssassin score learning tool.

Thought it might be useful to archive the related discussion on the SA dev
list, so I am repeating a couple of the related e-mails below.

> From: Henry Stern [mailto:[EMAIL PROTECTED]
> Sent: Saturday, January 10, 2004 11:50 AM
> To: 'Gary Funck'; [email protected]
> Cc: 'Spam Assassin Dev'; [EMAIL PROTECTED]
> Subject: RE: Neural Net scoring
>
> > -----Original Message-----
> > From: Gary Funck [mailto:[EMAIL PROTECTED]
> > Sent: January 10, 2004 3:29 PM
> > To: [email protected]
> > Cc: Spam Assassin Dev; [EMAIL PROTECTED] (Henry Stern)
> > Subject: RE: Neural Net scoring
> >
> > Thanks. Here's the link:
> > http://bugzilla.spamassassin.org/show_bug.cgi?id=2910
> >
> > This looks interesting. I echo Sidney's follow-up:
> >
> > "That's impressive. How close are the results to those of the GA? That's
> > actually two questions: 1) How close is the scoring that the perceptron
> > comes up with to the scoring that the GA comes up with? and 2) How much
> > difference in spam categorization results is there between using the
> > scores generated by the perceptron and those generated by the GA?"
>
> Some of the scores are the same, others are different. The GA has some
> added constraints that are required because it works on a global level (it
> looks at the mean performance of solutions over the training set), whereas
> stochastic gradient descent looks at performance on individual examples.
>
> > This approach looks like it does a good job of mixing some of the
> > benefits of the current additive scoring approach and a neural net. The
> > final neural net that is derived is much simpler than a full-fledged
> > net, but it has the advantage of being simple to understand, and maps
> > well onto the existing framework.
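[Editor's note: the global-vs-local distinction Henry draws above can be sketched in a few lines. The data, loss function, and learning rate below are invented for illustration; this is not the actual GA or perceptron tool, just a minimal contrast between scoring a candidate by its mean loss over the whole training set (the GA's fitness view) and updating weights one example at a time (the SGD view).]

```python
# Toy rule-hit vectors and spam/ham labels, made up for illustration.
training_set = [([1, 0, 1], 1), ([0, 1, 0], 0), ([1, 1, 0], 1)]

def loss(w, x, y):
    # squared error of a simple additive score against the label
    s = sum(wi * xi for wi, xi in zip(w, x))
    return (s - y) ** 2

def ga_fitness(w):
    # global view: mean performance over the whole training set
    return sum(loss(w, x, y) for x, y in training_set) / len(training_set)

def sgd_step(w, x, y, lr=0.1):
    # local view: gradient step on the loss of a single example
    s = sum(wi * xi for wi, xi in zip(w, x))
    return [wi - lr * 2 * (s - y) * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0, 0.0]
before = ga_fitness(w)
for x, y in training_set * 50:   # many passes of per-example updates
    w = sgd_step(w, x, y)
after = ga_fitness(w)
print(before > after)  # the local SGD steps also drive down the global mean loss
```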
> The current additive scoring approach is precisely equivalent to a
> perceptron with a linear transfer function and a threshold activation
> function. What I do is use a different activation function for training
> (threshold activation functions are discontinuous and therefore not
> differentiable) and then map the results to a threshold perceptron.
>
> > It would've been interesting to see what sorts of scores this approach
> > produced, and how well they worked in practice. (There's also a question
> > of copyright that would need to be resolved for this approach to gain
> > wider use.)
>
> Once the preprocessing stuff is worked out, I'll write a white paper that
> discusses the results in detail. As for copyright, I've signed an Apache
> CLA.
>
> Henry
> ----------------------------------------------------------------------
> From: Phillip Evans [mailto:[EMAIL PROTECTED]
> Sent: Saturday, January 10, 2004 5:40 PM
> To: [EMAIL PROTECTED]
> Subject: Re: New rule type suggestion
>
> G'day. I think you're thinking too deeply about this <g>. To clarify:
>
> MLPs basically do two things to determine a result:
> 1. identify features; and
> 2. correlate between those features.
>
> One problem with ANNs, particularly in the area of text processing, is
> getting something meaningful into them. This is why things like Hidden
> Markov Models (i.e. statistical models) are more commonly used (NB: this
> is a completely unsubstantiated statement based on work I did years ago).
>
> SA is already identifying features, so IMO we don't need an ANN to do
> that. What we need is something that can correlate features to
> classifications. But wait! We have one of those already: the Bayes engine.
>
> NB: I don't think that the correlation between the presence of certain
> rules in a message and that message being classified as spam is all that
> complex - it certainly doesn't need a hidden layer in an MLP.
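[Editor's note: Henry's equivalence and training trick can be illustrated concretely. The sketch below trains a single perceptron with a differentiable sigmoid activation via per-example gradient descent, then reads the learned weights back as SpamAssassin-style additive rule scores with a decision threshold. The rule-firing data, learning rate, and epoch count are all invented for illustration; Henry's actual tool and preprocessing are not shown here.]

```python
import math
import random

# Hypothetical feature vectors: x[i] = 1 if rule i fired on the message.
# Labels: 1 = spam, 0 = ham. Generated data is made up for illustration.
random.seed(0)
RULES = 4

def make_msg(spam):
    # spammy messages tend to fire rules 0 and 1; hammy ones rule 3
    probs = [0.9, 0.8, 0.3, 0.1] if spam else [0.1, 0.1, 0.3, 0.7]
    return [1 if random.random() < p else 0 for p in probs], 1 if spam else 0

data = [make_msg(spam=i % 2 == 0) for i in range(400)]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Train with a differentiable (sigmoid) activation via stochastic
# gradient descent, one example at a time.
w = [0.0] * RULES
b = 0.0
lr = 0.5
for epoch in range(20):
    random.shuffle(data)
    for x, y in data:
        p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
        g = p - y  # gradient of the log-loss w.r.t. the pre-activation
        w = [wi - lr * g * xi for wi, xi in zip(w, x)]
        b -= lr * g

# Map back to a threshold perceptron, i.e. additive scoring: each weight
# is a rule score, and the sigmoid crosses 0.5 where sum(scores) >= -b.
def classify(x):
    return sum(wi * xi for wi, xi in zip(w, x)) >= -b

accuracy = sum(classify(x) == (y == 1) for x, y in data) / len(data)
```

Once trained, the weights `w` play exactly the role of per-rule scores in the existing additive framework, which is the mapping Henry describes.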
> Spam messages are being generated by people (drongos, granted, but people
> nonetheless) and are heavily constrained by the protocol they have to use.
> There's no scope here for an obtuse n-dimensional inverse bicubic
> relationship that only ANNs (and rocket scientists with too much time on
> their hands) can identify. Having said that, there's certainly an argument
> for automatically determining rule weightings.
>
> Now I haven't been working with SA for long and I don't know Perl (you
> might get sick of me saying that over the next few weeks), so I don't know
> the internals of the SA Bayes engine. I am going to assume that the Bayes
> engine works as others out there (e.g. POPFile) work: by tokenising the
> message text and then weighting features.
>
> The SA rules are identifying features that the Bayes engine doesn't
> currently identify. The idea would be to identify the features using the
> SA rules and feed those into the Bayes engine for consideration. Now the
> Bayes engine can (automagically) weight the rule-based features and hey
> presto! - we have the meta-rule rule. Not only that, you have visibility
> of the weighting assigned to each rule, so humans can easily tweak them
> without getting inexplicable results. Alternatively you could just feed
> additional tokens based upon the rules into the current Bayes processing.
>
> As a final comment, the existing SA rule weightings are manually set, and
> this seems to be causing problems that people are now trying to solve
> (using, for example, the Fast SA Score Learning Tool). If you wanted to, I
> reckon you could change the existing SA rules engine to be completely
> Bayes-driven (i.e. take away the manually set weightings altogether). This
> might require initially writing some rules for identifying valid e-mail so
> it can identify what messages *should* look like, but this rule set
> shouldn't need to change much over time.
>
> Phil.
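[Editor's note: Phil's "feed rule hits into the Bayes engine as extra tokens" idea can be sketched as a toy naive-Bayes-style weighter. The rule names, counts, and smoothing choice below are invented for illustration and are not SpamAssassin's actual Bayes implementation; the point is only that each rule's weight is learned from labelled mail yet remains visible and tweakable.]

```python
from collections import Counter

# Per-token hit counts in spam and ham, plus message totals.
spam_counts, ham_counts = Counter(), Counter()
n_spam = n_ham = 0

def train(rule_hits, is_spam):
    """Record which rules fired on one labelled message."""
    global n_spam, n_ham
    counts = spam_counts if is_spam else ham_counts
    if is_spam:
        n_spam += 1
    else:
        n_ham += 1
    for rule in rule_hits:
        counts["RULE:" + rule] += 1  # rule hit becomes a pseudo-token

def spamminess(rule):
    # Per-token spam probability with Laplace smoothing: this is the
    # "automagic" weighting, and it is inspectable by a human.
    token = "RULE:" + rule
    p_spam = (spam_counts[token] + 1) / (n_spam + 2)
    p_ham = (ham_counts[token] + 1) / (n_ham + 2)
    return p_spam / (p_spam + p_ham)

# Toy corpus with invented rule names.
for _ in range(40):
    train(["FROM_FORGED", "HTML_ONLY"], is_spam=True)
for _ in range(40):
    train(["IN_WHITELIST"], is_spam=False)

print(spamminess("FROM_FORGED"))   # learned to be a strong spam indicator
print(spamminess("IN_WHITELIST"))  # learned to be a strong ham indicator
```

A real classifier would combine these per-token probabilities across all tokens in a message, but even this fragment shows the visibility Phil wants: each rule's learned weight is a single number you can read and adjust.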
>
> PS: I don't want to imply that the Fast SA Score Learning Tool isn't the
> best thing since sliced bread. It looks like pretty cool stuff to me -
> keep up the good work Henry!

------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.
