Hmmm. I understand. I still think there's room to move. Maybe I should experiment more before commenting further. Ah, what the hell...
There are two thoughts running around in my head (yes, only two, maybe a background thought about lunch). First, you could have a separate rule which was Bayes-driven, i.e., the same Bayesian code but running from a different corpus. This rule, or a derivation of it, could possibly replace the existing rule weightings and summing mechanics of SA. Second, you could just add the rules as tokens to the set of tokens currently evaluated by the existing Bayes rule.

In either case there would be the issue of providing "good" tokens where they apply, i.e., rules that identify good e-mail. Just as training the Bayesian filter today requires a balanced training set of ham and spam, you would need the same balance with the rule tokens. There's no point in having a Bayes database trained on only negative tokens - everything will be SPAM! I guess there could be an argument that "no tokens == ham", but Bayes won't see it that way (even though it's probably true).

For example, there is a rule that identifies messages claiming to be composed using Outlook Express and yet not providing the full set of headers that Outlook Express writes. We would need a corresponding rule that says "yes, it claims to come from Outlook Express and all Outlook Express headers are present". At this point someone will probably tell me that rule already exists with a negative weighting (if it doesn't, should it?). Another example would be the delivery-time-based rules, i.e., has the message been delivered in a timely fashion?

Henry raises an excellent point, though - some rules are interdependent, in that multiple rules may fire for the same feature in a given e-mail. I'm going to have to think about that (I'll have to work out which of my two thoughts to throw out first). Without thinking too deeply, it probably means we would need to track the co-occurrence of rules (which is effectively what the ANN is doing).
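The second thought - feeding rule hits as extra tokens into the existing Bayes machinery - could be sketched roughly as below. This is a minimal, hypothetical naive-Bayes illustration, not SA code; the rule names (e.g. "RULE:FORGED_OE") and the Laplace smoothing are my own assumptions. It also shows why the "good" tokens matter: without the ham-indicating rule, the classifier has only spam evidence to learn from.

```python
import math
from collections import defaultdict

class RuleTokenBayes:
    """Toy naive Bayes over 'magic tokens', one per rule that fired."""

    def __init__(self):
        self.spam_count = defaultdict(int)  # token -> spam messages containing it
        self.ham_count = defaultdict(int)   # token -> ham messages containing it
        self.n_spam = 0
        self.n_ham = 0

    def train(self, rule_tokens, is_spam):
        if is_spam:
            self.n_spam += 1
            counts = self.spam_count
        else:
            self.n_ham += 1
            counts = self.ham_count
        for tok in rule_tokens:
            counts[tok] += 1

    def spam_probability(self, rule_tokens):
        # Log-space naive Bayes with add-one (Laplace) smoothing.
        log_spam = math.log(self.n_spam / (self.n_spam + self.n_ham))
        log_ham = math.log(self.n_ham / (self.n_spam + self.n_ham))
        for tok in rule_tokens:
            log_spam += math.log((self.spam_count[tok] + 1) / (self.n_spam + 2))
            log_ham += math.log((self.ham_count[tok] + 1) / (self.n_ham + 2))
        return 1 / (1 + math.exp(log_ham - log_spam))

bayes = RuleTokenBayes()
# Balanced training: a spam-indicating rule *and* a ham-indicating rule,
# so a database of purely negative tokens doesn't call everything spam.
bayes.train(["RULE:FORGED_OE", "RULE:LATE_DELIVERY"], is_spam=True)
bayes.train(["RULE:GENUINE_OE_HEADERS"], is_spam=False)
score = bayes.spam_probability(["RULE:FORGED_OE"])
```

With both classes represented, a message hitting only the "genuine Outlook Express headers" rule scores below 0.5, which is the behaviour the balanced rule set is meant to buy.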
There's probably some relatively simple maths that can be done to decrease the Bayes score for a given token if a token with high co-occurrence is also present. The main difference between this and the ANN approach (and believe me, I love ANNs) is that the resulting information will be empirical and easily understood by mere mortals such as myself. The other possible benefit is that an approach like this can adapt and re-train online (as the existing Bayes engine does).

Phil.

-----Original Message-----
From: Henry Stern [mailto:[EMAIL PROTECTED]]
Sent: Tuesday, 13 January 2004 7:14 AM
To: [EMAIL PROTECTED]; [EMAIL PROTECTED]
Subject: RE: [Bug 2910] Fast SpamAssassin score learning tool.

Naïve Bayes will not work very well with the rules because there is far too much mutual information in the attributes. The reason why the neural network performs so well (with either training algorithm) is that it is able to learn lower weights for groups of rules that frequently co-occur.

If you'd like to learn more about machine learning, I would suggest taking a look at Data Mining by Witten and Frank or Machine Learning by Mitchell (the latter is much more technical). Both cover all of the common learning algorithms (neural networks, rule-based learning, decision trees, Bayesian networks, support vector machines, etc.). Witten and Frank's book focuses on running experiments using the "Weka" tool, an open-source machine learning toolkit. Like most of the tools that I've come across, it has a hard time dealing with large datasets, so your mileage may vary.

Henry

------- Additional Comments From [EMAIL PROTECTED] 2004-01-12 11:55 -------

Let me see if I understand what Phil is suggesting: the idea is to add to the Bayes db a magic token for each rule that matches, along with the other Bayes db information, and then use only Bayes for the scoring. This should be easy to test with a 10-fold cross-validation combining the output of mass-check with construction of the Bayes db.
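Phil's "relatively simple maths" for deflating co-occurring tokens could look something like the following sketch. Everything here is an assumption for illustration - the linear discount, the greedy first-token-wins ordering, and the token names are mine, not anything in SA - but it shows one way to estimate P(b | a) from a corpus and shrink a token's weight by how redundant it is with tokens already counted.

```python
from collections import defaultdict

def cooccurrence_rates(documents):
    """Estimate P(b present | a present) from a corpus of token sets."""
    appears = defaultdict(int)   # token -> documents containing it
    together = defaultdict(int)  # (a, b) -> documents containing both
    for tokens in documents:
        for a in tokens:
            appears[a] += 1
            for b in tokens:
                if a != b:
                    together[(a, b)] += 1
    return {pair: n / appears[pair[0]] for pair, n in together.items()}

def discounted_weights(tokens, base_weight, cooc):
    """Scale each token's weight by 1 minus its highest co-occurrence
    rate with any token already counted (greedy, in iteration order)."""
    weights = {}
    seen = []
    for tok in tokens:
        redundancy = max((cooc.get((prev, tok), 0.0) for prev in seen),
                         default=0.0)
        weights[tok] = base_weight[tok] * (1.0 - redundancy)
        seen.append(tok)
    return weights

# Hypothetical corpus: "A" and "B" fire together in 3 of A's 4 appearances.
corpus = [{"A", "B"}, {"A", "B"}, {"A", "B"}, {"A"}, {"C"}]
cooc = cooccurrence_rates(corpus)
w = discounted_weights(["A", "B"], {"A": 1.0, "B": 1.0}, cooc)
```

Here "B" keeps only a quarter of its weight once "A" has fired, which is the kind of shared-evidence deflation Henry says the neural network learns implicitly.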
Is anybody up to running the test?
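The proposed experiment - 10-fold cross-validation over rule-hit tokens with a Bayes db built per fold - could be harnessed roughly as below. The dataset here is synthetic and the naive-Bayes scoring is a stand-in; a real run would parse mass-check output into (rule hits, label) pairs instead.

```python
import math
from collections import defaultdict

def train(fold):
    """Count rule-token occurrences per class (True = spam)."""
    counts = {True: defaultdict(int), False: defaultdict(int)}
    totals = {True: 0, False: 0}
    for rules, is_spam in fold:
        totals[is_spam] += 1
        for r in rules:
            counts[is_spam][r] += 1
    return counts, totals

def classify(rules, counts, totals):
    """Return True (spam) if the smoothed log-posterior favours spam."""
    scores = {}
    for label in (True, False):
        s = math.log((totals[label] + 1) / (totals[True] + totals[False] + 2))
        for r in rules:
            s += math.log((counts[label][r] + 1) / (totals[label] + 2))
        scores[label] = s
    return scores[True] > scores[False]

def cross_validate(dataset, k=10):
    """k-fold CV: rebuild the Bayes counts for each fold, score the rest."""
    correct = 0
    for i in range(k):
        held_out = dataset[i::k]
        training = [d for j, d in enumerate(dataset) if j % k != i]
        counts, totals = train(training)
        correct += sum(classify(r, counts, totals) == y for r, y in held_out)
    return correct / len(dataset)

# Synthetic mass-check-style data: (rule hits, is_spam). Rule names invented.
data = [(["RULE_SPAMMY"], True)] * 50 + [(["RULE_HAMMY"], False)] * 50
accuracy = cross_validate(data, k=10)
```

On this trivially separable toy data the accuracy is perfect; the point of running it on real mass-check output would be to see how far the mutual information between rules drags that number down.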
