Hello Loren,

Monday, July 25, 2005, 9:55:36 PM, you wrote:

LW> We could invent a class of rules that were 'test rules'.
LW> They would have nil score and wouldn't report on the mail summary
LW> if they hit.  But they would show up in the report-home summary as
LW> to whether they hit, and whether it was ham or spam.

Thinking about it further ... what if SA systems were to accumulate
daily statistics, along the lines of one record per rule, containing:
a) rule name
b) initial ham hits (neg score, before any human looks at it)
c) accumulated ham score for these hits (or some other useful stat)
d) initial spam hits
e) accumulated spam score
f) initial middle hits (pos score, below spam threshold)
g) initial middle score accum
h) sa-learn ham hits (some human called this ham)
i) sa-learn ham score accum
j) sa-learn spam hits
k) sa-learn spam score accum
plus one similar record for all of the day's emails, and the system
would package up the stats at the end of the day and email it to a
central collection system.
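The per-rule record described above could be sketched as follows. This is a minimal illustration only; the field names, the JSON packaging, and the `package_daily_stats` helper are all hypothetical, not anything SpamAssassin actually implements.

```python
from dataclasses import dataclass, asdict
import json

@dataclass
class RuleStats:
    # One record per rule per day; names are invented for illustration.
    rule_name: str
    initial_ham_hits: int = 0          # b) neg score, before human review
    initial_ham_score: float = 0.0     # c) accumulated ham score
    initial_spam_hits: int = 0         # d)
    initial_spam_score: float = 0.0    # e)
    initial_middle_hits: int = 0       # f) pos score, below spam threshold
    initial_middle_score: float = 0.0  # g)
    learn_ham_hits: int = 0            # h) some human called this ham
    learn_ham_score: float = 0.0       # i)
    learn_spam_hits: int = 0           # j)
    learn_spam_score: float = 0.0      # k)

def package_daily_stats(rule_records, all_mail_record):
    """Bundle the day's per-rule records plus the one record covering
    all of the day's emails, ready to mail to a central collector."""
    payload = {
        "rules": [asdict(r) for r in rule_records],
        "all_mail": asdict(all_mail_record),
    }
    return json.dumps(payload)
```

Note the payload carries only counts and score accumulations, which is what makes the confidentiality argument below work.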

The system would not collect any information about the hits themselves
(no matching text) and no header information from the emails, so
there should be no confidentiality concerns -- the only information
being fed back to central would be statistical: how many emails, which
rules hit, and how effective the rules are (or would be).

Then if we were to issue regular sa-update runs, feeding these test
rules out to participating systems, we could take the statistical
results and feed them into some scoring algorithm to determine the
next month's scores (perhaps adjusting almost all scores within SA
monthly, to take changing spam patterns into account), and also to
determine which rules to add to the scoring mix, and which rules to
remove.
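As a toy illustration of the "which rules to keep" decision, here is one possible policy: keep a rule only if it fired often enough to be statistically meaningful and fired almost exclusively on spam. The thresholds and the function itself are assumptions for the sketch, not a proposed algorithm.

```python
def evaluate_rule(spam_hits, ham_hits, min_hits=100, min_precision=0.95):
    """Decide a rule's fate from aggregated central stats.
    min_hits / min_precision are arbitrary placeholder thresholds."""
    total = spam_hits + ham_hits
    if total < min_hits:
        return "insufficient data"   # not enough reports to judge
    if spam_hits / total >= min_precision:
        return "keep"                # fires almost only on spam
    return "remove"                  # too many ham hits
```

A real scoring pass would presumably assign a numeric score (e.g. via the perceptron/GA runs used for SA score generation) rather than a keep/remove verdict, but the same aggregated hit counts would drive it.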

The ideas we've had concerning "this rule works well in language 'xx'"
(or doesn't) could also apply here.

It'd need a lot of thought and care to make something like this
practical and profitable and feasible, but I think the idea has
merit...

Bob Menschel


