Hello Loren, Monday, July 25, 2005, 9:55:36 PM, you wrote:
LW> We could invent a class of rules that were 'test rules'.
LW> They would have nil score and wouldn't report on the mail summary
LW> if they hit. But they would show up in the report-home summary as
LW> to whether they hit, and whether it was ham or spam.

More thought ... what if SA systems were to accumulate daily statistics,
along the lines of one record for each rule, containing:

a) rule name
b) initial ham hits (neg score, before any human looks at it)
c) accumulated ham score for these hits (or some other useful stat)
d) initial spam hits
e) accumulated spam score
f) initial middle hits (pos score, below spam threshold)
g) initial middle score accum
h) sa-learn ham hits (some human called this ham)
i) sa-learn ham score accum
j) sa-learn spam hits
k) sa-learn spam score accum

plus one similar record covering all of the day's emails. The system
would package up the stats at the end of the day and email them to a
central collection system. It would not collect any information about
the hits themselves (no matched text) and no header information from
the emails, so there should be no confidentiality concerns -- the only
information being fed back to central would be statistical (how many
emails, which rules hit, how effective the rules are or would be).

Then, if we were to issue regular sa-update runs feeding these test
rules out to participating systems, we could take the statistical
results and feed them into some scoring algorithm to determine the
next month's scores (perhaps adjusting almost all scores within SA
monthly, to take changing spam patterns into account), and also to
determine which rules to add to the scoring mix and which to remove.

The ideas we've had concerning "this rule works well in language 'xx'"
(or doesn't) could also apply here.

It'd need a lot of thought and care to make something like this
practical, profitable, and feasible, but I think the idea has merit...

Bob Menschel
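
P.S. To make the per-rule record concrete, here is a rough sketch in
Python. All the field and function names here are my own invention for
illustration -- nothing like this exists in SpamAssassin today, and a
real implementation would presumably live in Perl alongside the rest of
SA. The point is just to show the shape of the data and the "counts and
scores only, no message content" packaging step:

```python
# Hypothetical per-rule daily statistics record, mirroring fields a)-k)
# above. Names are invented for illustration only.
from dataclasses import dataclass, asdict
import json

@dataclass
class RuleDayStats:
    rule_name: str                     # (a)
    initial_ham_hits: int = 0          # (b) neg score, before human review
    initial_ham_score: float = 0.0     # (c) accumulated ham score
    initial_spam_hits: int = 0         # (d)
    initial_spam_score: float = 0.0    # (e)
    initial_middle_hits: int = 0       # (f) pos score, below spam threshold
    initial_middle_score: float = 0.0  # (g)
    learned_ham_hits: int = 0          # (h) some human called this ham
    learned_ham_score: float = 0.0     # (i)
    learned_spam_hits: int = 0         # (j)
    learned_spam_score: float = 0.0    # (k)

def record_hit(stats, rule, score, verdict):
    """Accumulate one rule hit into the day's per-rule record.

    verdict is "ham", "spam", or "middle" (positive total score that
    stayed below the spam threshold)."""
    r = stats.setdefault(rule, RuleDayStats(rule))
    if verdict == "ham":
        r.initial_ham_hits += 1
        r.initial_ham_score += score
    elif verdict == "spam":
        r.initial_spam_hits += 1
        r.initial_spam_score += score
    else:
        r.initial_middle_hits += 1
        r.initial_middle_score += score

def package_day(stats):
    """Serialize the day's records for mailing to the central collector.

    Deliberately contains only counts and accumulated scores -- no
    matched text and no mail headers, per the confidentiality point
    above."""
    return json.dumps([asdict(r) for r in stats.values()])
```

A daily cron job on each participating system could then hand the
output of package_day() to the local mailer; the central collector only
ever sees aggregate numbers.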
