Hmmm.  I understand.

I still think there's room to move.  Maybe I should experiment more before
commenting further.  Ah, what the hell...

There are two thoughts running around in my head (yes, only two, maybe a
background thought about lunch):

First, that you could have a separate rule which was Bayes driven, i.e. the
same Bayesian code but running from a different corpus.  This rule, or a
derivation of it, could possibly replace the existing rule weightings and
summing mechanics of SA.

Second, you could just add the rules as tokens to the set of tokens
currently evaluated by the existing Bayes rule.
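
A minimal sketch of this second idea, under stated assumptions: the class
TokenBayes, the helper with_rule_tokens, the "RULE:" prefix and the rule
names are all made up for illustration, and the Laplace-smoothed naive Bayes
below is a stand-in, not SA's actual Bayes code.  It only shows rule hits
being fed in alongside ordinary body tokens.

```python
import math
from collections import Counter


class TokenBayes:
    """Tiny naive Bayes over token sets (illustrative, not SA's Bayes code)."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.messages = {"spam": 0, "ham": 0}

    def train(self, tokens, label):
        # Count each token at most once per message.
        self.messages[label] += 1
        self.counts[label].update(set(tokens))

    def token_prob(self, token, label):
        # Laplace-smoothed estimate of P(token present | label).
        return (self.counts[label][token] + 1) / (self.messages[label] + 2)

    def spam_probability(self, tokens):
        # Sum per-token log-likelihood ratios on top of the smoothed prior.
        log_odds = math.log((self.messages["spam"] + 1) / (self.messages["ham"] + 1))
        for t in set(tokens):
            log_odds += math.log(self.token_prob(t, "spam") / self.token_prob(t, "ham"))
        return 1 / (1 + math.exp(-log_odds))


def with_rule_tokens(body_tokens, rule_hits):
    """Append one prefixed token per rule hit so rule names cannot
    collide with ordinary words (the prefix is an assumption)."""
    return list(body_tokens) + ["RULE:" + name for name in rule_hits]


# Hypothetical rule names, one "bad" rule and its "good" counterpart.
nb = TokenBayes()
nb.train(with_rule_tokens(["cheap", "pills"], ["FORGED_OUTLOOK"]), "spam")
nb.train(with_rule_tokens(["meeting", "notes"], ["GENUINE_OUTLOOK"]), "ham")
p_forged = nb.spam_probability(with_rule_tokens(["hello"], ["FORGED_OUTLOOK"]))
p_genuine = nb.spam_probability(with_rule_tokens(["hello"], ["GENUINE_OUTLOOK"]))
```

With this toy training set the forged-header message scores above 0.5 and
the genuine one below it, purely because of the rule tokens.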


In either case there would be the issue of providing "good" tokens, i.e.
rules that identify good e-mail.  Just as training the Bayesian filter today
needs a balanced training set of ham and spam, you would need the same
balance with the rule tokens.  There's no point in having a Bayes database
trained on only negative tokens - everything will be SPAM!  I
guess there could be an argument that "no tokens == ham" but Bayes won't see
it that way (even though it's probably true).
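
To make the imbalance concrete, here is a rough sketch using a simple
ratio-of-rates estimate per token.  The function name and the counts are
invented for illustration, and this is not necessarily the formula SA's
Bayes code uses.

```python
def token_spamminess(spam_hits, ham_hits, n_spam, n_ham):
    """Share of a token's rate-normalised occurrences that fall in spam.

    Returns 0.5 for a token that was never seen, i.e. no evidence either way.
    """
    spam_rate = spam_hits / n_spam if n_spam else 0.0
    ham_rate = ham_hits / n_ham if n_ham else 0.0
    if spam_rate + ham_rate == 0:
        return 0.5
    return spam_rate / (spam_rate + ham_rate)


# A "bad" rule that never fires on ham can only ever push towards spam.
bad_rule = token_spamminess(spam_hits=300, ham_hits=0, n_spam=1000, n_ham=1000)

# A "good" rule that fires mostly on ham gives Bayes something to push back with.
good_rule = token_spamminess(spam_hits=10, ham_hits=600, n_spam=1000, n_ham=1000)

# A message that hits no rules contributes no rule tokens at all.
unseen = token_spamminess(spam_hits=0, ham_hits=0, n_spam=1000, n_ham=1000)
```

If every rule token in the database looks like bad_rule, the only evidence
the rules can contribute is evidence for spam.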

For example, there is a rule that identifies messages claiming to be
composed using Outlook Express and yet not providing the full set of headers
that Outlook Express writes.  We would need a corresponding rule that says
"yes, it claims to come from Outlook and all Outlook headers are present".
At this point someone will probably tell me that rule already exists with a
negative weighting (if it doesn't, should it?).  Another example would be
the delivery time-based rules, i.e. has the message been delivered in a
timely fashion?

Henry raises an excellent point though - that some rules are interdependent
in that multiple rules may fire for the same feature in a given e-mail.  I'm
going to have to think about that (I'll have to work out which of my two
thoughts to throw out first).  Without thinking too deeply, it probably
means that we would need to track the co-occurrence of rules (which is
effectively what the ANN is doing).  There's probably some relatively simple
maths that can be done to decrease the Bayes score for a given token if a
token with high co-occurrence is also present.  The main difference between
this and the ANN approach (and believe me, I love ANNs) is that the
resulting information will be empirical and easily understood by mere
mortals such as myself.  The other possible benefit is that an approach like
this can adapt and re-train online (as the existing Bayes engine does).
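
One possible shape for that "relatively simple maths", as a sketch only: count
how often rules fire together, then shrink each rule token's contribution by
its overlap with the other rules present.  The class name, the Jaccard overlap
measure and the 1/(1 + overlap) discount are all assumptions chosen for
illustration, not something that has been tested against mass-check data.

```python
from collections import Counter
from itertools import combinations


class CoOccurrence:
    """Tracks how often rules fire alone and together (illustrative)."""

    def __init__(self):
        self.single = Counter()
        self.pair = Counter()

    def observe(self, rule_hits):
        # Can be called one message at a time, so it updates online.
        hits = sorted(set(rule_hits))
        self.single.update(hits)
        self.pair.update(combinations(hits, 2))

    def overlap(self, a, b):
        # Jaccard overlap: messages hitting both / messages hitting either.
        both = self.pair[tuple(sorted((a, b)))]
        either = self.single[a] + self.single[b] - both
        return both / either if either else 0.0

    def weights(self, rule_hits):
        # A rule that always co-occurs with one other present rule gets 0.5.
        hits = set(rule_hits)
        return {
            a: 1.0 / (1.0 + sum(self.overlap(a, b) for b in hits if b != a))
            for a in hits
        }


def discounted_log_odds(token_log_odds, cooc, rule_hits):
    """Sum per-rule log-odds, each scaled by its co-occurrence weight."""
    w = cooc.weights(rule_hits)
    return sum(w[t] * token_log_odds.get(t, 0.0) for t in w)
```

The counts stay human-readable: you can print overlap("A", "B") and see
directly why two rules are being discounted against each other.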

Phil.



-----Original Message-----
From: Henry Stern [mailto:[EMAIL PROTECTED] 
Sent: Tuesday, 13 January 2004 7:14 AM
To: [EMAIL PROTECTED];
[EMAIL PROTECTED]
Subject: RE: [Bug 2910] Fast SpamAssassin score learning tool.


Naïve Bayes will not work very well with the rules because there is far too
much mutual information in the attributes.  The reason why the neural
network performs so well (with either training algorithm) is that it is able
to learn lower weights for groups of rules that frequently co-occur.

If you'd like to learn more about machine learning, I would suggest taking a
look at Data Mining by Witten and Frank or Machine Learning by Mitchell (the
latter is much more technical).  Both cover all of the common learning
algorithms (neural networks, rule-based learning, decision trees, Bayesian
networks, support vector machines, etc.). Witten and Frank's book focuses on
running experiments using the "Weka" tool, an open source machine learning
toolkit.  Like most of the tools that I've come across, it has a hard time
dealing with large datasets, so your mileage may vary.

Henry

------- Additional Comments From [EMAIL PROTECTED]  2004-01-12 11:55 -------
Let me see if I understand what Phil is suggesting: the idea is to add
to the Bayes db a magic token for each rule that matches, along with the
other Bayes db information, and then use only Bayes for the scoring. This
should be easy to test with a 10-fold cross validation combining the output
of mass-check with construction of the Bayes db.
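
A rough sketch of what such a test harness could look like.  The function
names, the train_fn/classify_fn callbacks and the data layout (one token
collection plus a "spam"/"ham" label per message) are assumptions for
illustration; parsing real mass-check output is not shown.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle the indices 0..n_items-1 and deal them into k folds."""
    idx = list(range(n_items))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def cross_validate(messages, labels, train_fn, classify_fn, k=10, seed=0):
    """Train on k-1 folds, score the held-out fold, and tally errors.

    train_fn takes a list of (message, label) pairs and returns a model;
    classify_fn takes (model, message) and returns "spam" or "ham".
    """
    false_pos = false_neg = 0
    for held_out in k_fold_indices(len(messages), k, seed):
        held = set(held_out)
        training = [
            (messages[i], labels[i]) for i in range(len(messages)) if i not in held
        ]
        model = train_fn(training)
        for i in held_out:
            guess = classify_fn(model, messages[i])
            if guess == "spam" and labels[i] == "ham":
                false_pos += 1
            elif guess == "ham" and labels[i] == "spam":
                false_neg += 1
    return {
        "false_positives": false_pos,
        "false_negatives": false_neg,
        "total": len(messages),
    }
```

Here train_fn would build the Bayes db from each message's body tokens plus
its rule tokens, and classify_fn would score using Bayes alone.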

Is anybody up to running the test?

