Hello Daniel,

Tuesday, August 24, 2004, 5:26:55 PM, you wrote:
DQ> Rule must pass the following parameters:
DQ> - hits > 1% of recent spam or > 5% of missed spam
DQ> - hits > 1% of missed spam

How do you measure missed spam? I have maybe 1 or 2 missed spam
(false negatives) each week. If my corpus contains three months' spam,
that's a generous 26 spam. Any rule that hits a single FN hits 4%
missed spam in this environment.

Are you maybe suggesting a 3-way corpus to be tested: a ham corpus
(1-2 year coverage), a spam corpus (3-4 month coverage), and a false
negative corpus (6 mo to 1 year coverage) or something like that?

DQ> - S/O ratio must be >= 0.999 (until we use perceptron)
DQ> - scores >= 0.5 and <= 2.5 (until we use perceptron)

These seem very reasonable and workable. Any normal SARE rule which
scores a 1% spam hit rate in SARE mass-checks, with zero ham hits, gets
a 1.666 score using our current methods (1/3 of 5), and those which are
obfu-like rules (testing specifically for things that should not ever
happen in ham, obfu, %RANDOM tags, and the like) get 2.500. We can
probably find a way to lower these to a more relativistic measure based
on actual spam counts...

DQ> I'm willing to tweak these, I just want something to get us off the
DQ> ground. I'm also hoping we use the perceptron very soon after we get
DQ> started.

I think these are good starting measures. And yes, the perceptron will
be very useful once part of this system.

Bob Menschel
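
P.S. To make the small-corpus worry concrete, here is a rough sketch of
the arithmetic in Python. The function names are made up for
illustration and are not part of any mass-check tool, and the 2-per-week
figure is only the generous estimate from above. It shows how coarse a
percentage threshold becomes when the false negative corpus holds only a
couple of dozen messages.

```python
import math


def hit_rate_percent(hits: int, corpus_size: int) -> float:
    """Percentage of messages in a corpus that a rule hits."""
    if corpus_size <= 0:
        raise ValueError("corpus_size must be positive")
    return 100.0 * hits / corpus_size


def min_corpus_for_resolution(threshold_percent: float) -> int:
    """Smallest corpus size at which a single hit falls strictly
    below the given percentage threshold."""
    if threshold_percent <= 0:
        raise ValueError("threshold_percent must be positive")
    return math.floor(100.0 / threshold_percent) + 1


# Assumption: about 2 false negatives a week over roughly 13 weeks.
fn_corpus_size = 2 * 13

# One hit in a 26-message corpus is already far above a 1% threshold.
single_hit = hit_rate_percent(1, fn_corpus_size)

# Corpus sizes needed before one hit alone stops clearing each threshold.
needed_for_1_percent = min_corpus_for_resolution(1.0)
needed_for_5_percent = min_corpus_for_resolution(5.0)

print(f"one FN hit in {fn_corpus_size} messages = {single_hit:.2f}%")
print(f"corpus needed for 1% resolution: {needed_for_1_percent}")
print(f"corpus needed for 5% resolution: {needed_for_5_percent}")
```

On these assumed numbers a single hit always clears both the 1% and the
5% missed-spam thresholds, so the test cannot tell a useful rule from a
lucky one until the false negative corpus is much larger.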