Hey all -- I've been considering possible improvements to how we figure out what rules are effective.
Currently we use the S/O ratio and hit-rate of each individual rule; in other words, if a rule hits a lot of spam and little nonspam, we detect that and consider it "good". However, that doesn't take into account the situation where multiple rules are hitting mostly the same mail. For example:

          S1 S2 S3 S4 S5 H1 H2 H3 H4 H5
  RULE1:   x  x  x  x
  RULE2:   x  x  x  x
  RULE3:            x  x  x
  RULE4:               x

(S1-S5 = 5 spam mails; H1-H5 = 5 ham/nonspam mails. "x" means a "hit" by a rule, " " means no hit -- our rules are boolean.)

Obviously, RULE1 and RULE2 overlap entirely, and therefore either (a) one should be removed, or (b) both should share half the score as equal contributors. (b) is what the perceptron currently does.

RULE3, by contrast, would be considered a lousy rule under our current scheme, because it hits ham 33% of the time; however, in this case it's actually quite informative to a certain extent, because it's hitting spam that the others cannot hit.

RULE4 is even better than RULE3, because it's hitting the mail that RULE1 and RULE2 miss, yet it doesn't appear that good because:

  - it has a hit-rate half that of RULE3
  - it has a hit-rate 4 times lower than RULE1 and RULE2

This is the kind of effect we do see now -- a lot of our rules are actually firing in combination, and some rules that hit e.g. 0.5% of spam are in effect more useful than some rules that hit 20%, because they're hitting the 0.5% of spam that *gets past* the other rules.

So, what I'm looking for is a statistical method to measure this effect, and report:

  (a) that RULE1 and RULE2 overlap almost entirely
  (b) that RULE3 is worthwhile, because it can hit that 20% of the messages the other rules cannot
  (c) that RULE4 is better than RULE3, because it has a lower false-positive rate

The perceptron rescoring system *does* do this already, but for rule QA and rule selection, being able to do this at a "human" level -- and quickly -- would be essential.
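To make the kind of measurement I mean a bit more concrete, here's a toy sketch in Python on the example hit-matrix above. The Jaccard overlap measure and the greedy marginal-coverage selection are just illustrative assumptions on my part -- one possible way to surface (a)-(c), not a proposal or anything we ship:

```python
# Toy sketch of the overlap effect, on the example hit-matrix above.
# ASSUMPTIONS: the Jaccard overlap measure and the greedy marginal-value
# selection are illustrative choices, not SpamAssassin code.

hits = {  # rule -> set of mails it hits (S* = spam, H* = ham)
    "RULE1": {"S1", "S2", "S3", "S4"},
    "RULE2": {"S1", "S2", "S3", "S4"},
    "RULE3": {"S4", "S5", "H1"},
    "RULE4": {"S5"},
}
spam = {"S1", "S2", "S3", "S4", "S5"}

def overlap(a, b):
    """Jaccard overlap of two rules' hit-sets; 1.0 means fully redundant."""
    return len(hits[a] & hits[b]) / len(hits[a] | hits[b])

def greedy_select(hits, spam):
    """Repeatedly pick the rule with the best *marginal* value:
    newly covered spam minus ham hit.  Stop when no rule adds spam."""
    covered, chosen = set(), []
    while True:
        best = max(hits, key=lambda r: len((hits[r] & spam) - covered)
                                       - len(hits[r] - spam))
        if not (hits[best] & spam) - covered:
            return chosen
        chosen.append(best)
        covered |= hits[best] & spam

print(overlap("RULE1", "RULE2"))   # 1.0 -> (a): entirely redundant
print(greedy_select(hits, spam))   # ['RULE1', 'RULE4'] -> (b)/(c): RULE4
                                   # beats RULE3 on marginal value, since
                                   # it covers S5 with no ham cost
```

A pass like this would report RULE2 as pure overlap of RULE1 and rank RULE4 ahead of RULE3 -- which is exactly the ordering I'd want a proper statistical method to justify.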
We also have an overlap-measurement tool, but that's only useful for measuring (a), and it's extremely RAM-hungry.

So -- statisticians? Any tips? ;)  (If anyone can fwd this on to their resident stats guy, that would be appreciated, too.)

(Henry, you may be too busy to respond if you're writing up, of course ;)

--j.