Hey all -- I've been considering possible improvements to how we figure out what rules are effective.
Currently we use the S/O ratio and hit-rate of each individual rule; in other words, if a rule hits a lot of spam and little nonspam, we detect that and consider it "good". However, that doesn't take into account the situation where multiple rules are hitting mostly the same mail. For example:

          S1 S2 S3 S4 S5 H1 H2 H3 H4 H5
  RULE1:   x  x  x  x
  RULE2:   x  x  x  x
  RULE3:            x  x  x
  RULE4:               x

(S1-S5 = 5 spam mails; H1-H5 = 5 ham/nonspam mails. "x" means a "hit" by a rule, " " means no hit -- our rules are boolean.)

Obviously, RULE1 and RULE2 overlap entirely, and therefore either (a) one should be removed, or (b) both should share half the score as equal contributors. (b) is what the perceptron currently does.

RULE3, by contrast, would be considered a lousy rule under our current scheme, because it hits ham 33% of the time; however, in this case it's actually quite informative to a certain extent, because it's hitting spam that the others cannot hit.

RULE4 is even better than RULE3, because it's hitting the mail that RULE1 and RULE2 miss, yet it doesn't appear that good because:

  - it has a hit-rate half that of RULE3
  - it has a hit-rate 4 times lower than RULE1 and RULE2

This is the kind of effect we do see now -- a lot of our rules are actually firing in combination, and some rules that hit e.g. 0.5% of spam are in effect more useful than some rules that hit 20%, because they're hitting the 0.5% of spam that *gets past* the other rules.

So, what I'm looking for is a statistical method to measure this effect, and report:

  (a) that RULE1 and RULE2 overlap almost entirely
  (b) that RULE3 is worthwhile, because it can hit that 20% of the messages the other rules cannot
  (c) that RULE4 is better than RULE3, because it has a lower false-positive rate

The perceptron rescoring system *does* do this already, but for rule QA and rule selection, being able to do this at a "human" level -- and quickly -- would be essential.
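To make the kind of measurement I mean a bit more concrete, here's a toy sketch in Python on the example hit-matrix above. The Jaccard overlap measure and the greedy marginal-coverage selection are just illustrative assumptions on my part -- one possible way to surface (a)-(c), not a proposal or anything we ship:

```python
# Toy sketch of the overlap effect, on the example hit-matrix above.
# ASSUMPTIONS: the Jaccard overlap measure and the greedy marginal-value
# selection are illustrative choices, not SpamAssassin code.

hits = {  # rule -> set of mails it hits (S* = spam, H* = ham)
    "RULE1": {"S1", "S2", "S3", "S4"},
    "RULE2": {"S1", "S2", "S3", "S4"},
    "RULE3": {"S4", "S5", "H1"},
    "RULE4": {"S5"},
}
spam = {"S1", "S2", "S3", "S4", "S5"}

def overlap(a, b):
    """Jaccard overlap of two rules' hit-sets; 1.0 means fully redundant."""
    return len(hits[a] & hits[b]) / len(hits[a] | hits[b])

def greedy_select(hits, spam):
    """Repeatedly pick the rule with the best *marginal* value:
    newly covered spam minus ham hit.  Stop when no rule adds spam."""
    covered, chosen = set(), []
    while True:
        best = max(hits, key=lambda r: len((hits[r] & spam) - covered)
                                       - len(hits[r] - spam))
        if not (hits[best] & spam) - covered:
            return chosen
        chosen.append(best)
        covered |= hits[best] & spam

print(overlap("RULE1", "RULE2"))   # 1.0 -> (a): entirely redundant
print(greedy_select(hits, spam))   # ['RULE1', 'RULE4'] -> (b)/(c): RULE4
                                   # beats RULE3 on marginal value, since
                                   # it covers S5 with no ham cost
```

A pass like this would report RULE2 as pure overlap of RULE1 and rank RULE4 ahead of RULE3 -- which is exactly the ordering I'd want a proper statistical method to justify.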
We also have an overlap-measurement tool, but that's only useful for measuring (a), and it's extremely RAM-hungry.

So -- statisticians? Any tips? ;)  (If anyone can fwd this on to their resident stats guy, that would be appreciated, too.)

(Henry, you may be too busy to respond if you're writing up, of course ;)

--j.