> What I miss most is a transparent dataset about every rule.
> I'd like to know
> - percentage of false positives
> - percentage of false negatives
> - percentage of true positives
> - percentage of true negatives
> - number of mails checked for the results above
> - standard deviation of the percentages above
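For what it's worth, everything in that list can be derived mechanically once
you have raw per-rule confusion counts. A minimal sketch, assuming you already
have TP/FP/TN/FN counts for a rule (the numbers below are invented):

```python
from math import sqrt

def rule_stats(tp, fp, tn, fn):
    """Return (mails checked, {label: (percentage, std dev of percentage)})."""
    n = tp + fp + tn + fn  # number of mails checked
    stats = {}
    for label, k in (("TP", tp), ("FP", fp), ("TN", tn), ("FN", fn)):
        p = k / n
        # binomial standard deviation of the estimated percentage
        stats[label] = (100 * p, 100 * sqrt(p * (1 - p) / n))
    return n, stats

# invented counts for a hypothetical rule over a 10,000-mail corpus
n, stats = rule_stats(tp=480, fp=5, tn=9400, fn=115)
```

The standard deviation shrinks with corpus size, which is exactly why the
corpus-size number matters as much as the percentages themselves.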
Some of this shows up in SARE rules, at least in Bob's rulesets. He has
fairly standard comment forms that note the spam/ham hits and corpus sizes,
as well as the date and corpus owner.
A big problem though is that most of our test corpora are English/American,
so we never really know what will happen in Germany, Sweden, or Nigeria.
(Not sure we care about the last. :-) We REALLY need to somehow get to the
point where we have test corpora from various other parts of the world.
> Detection of redundancy or linear independency.
> Is my new rule covered or disabled by another rule
> or does it affect existing rules?
> This could be detected with a MassCheck.
Masscheck has an interdependency option, although it increases the checking
time. We use it on rules once they seem useful, but not usually in early
one-off checking.
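Even without the interdependency option, a crude subsumption check can be done
from per-mail hit logs. A sketch, with invented rule names and message IDs:

```python
def covered_fraction(new_hits, old_hits):
    """Fraction of the new rule's hits already caught by an existing rule.
    1.0 means the new rule adds nothing over the old one on this corpus."""
    new, old = set(new_hits), set(old_hits)
    return len(new & old) / len(new) if new else 0.0

# message IDs hit by each rule in a masscheck run (made up)
new_rule_hits = ["m01", "m02", "m03"]
old_rule_hits = ["m01", "m02", "m03", "m07", "m09"]
frac = covered_fraction(new_rule_hits, old_rule_hits)
```

A fraction of 1.0 flags the new rule as fully covered; anything in between
hints at partial overlap worth a closer look before scoring.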
> My idea about this is to send a FN to a reference
> server, see which (even very new and little tested)
> rules match, look at the statistics, and decide
> whether to include it or not - or - if no rule
> matches, to provide one.
> For each rule a set of matching spam-mails should
> be stored by the author to cross check other rules
> for linear dependencies.
This is a very interesting idea that I think needs more exploring in the
future. Any SA server that has a Bayes database potentially has most of the
knowledge to be able to participate in Seti-like background processing for
determining rule hit ratios. For that matter, any SA server should be able
to collect logs of the rules that are hitting there, and send out that rule
hit information to some central server once a day. This won't necessarily
give fp/fn hit counts, but it can give total hits per rule, and that is
moderately valuable information in itself, while still being pretty
anonymous.
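The collection step could be little more than a daily log scrape. A sketch,
assuming a simplified spamd-style log line ending in a tests= list (the exact
format varies between setups, so treat this as illustrative only):

```python
from collections import Counter

def tally_rules(log_lines):
    """Count how often each rule fired; no message content is retained."""
    counts = Counter()
    for line in log_lines:
        if "tests=" in line:
            rules = line.rsplit("tests=", 1)[1].split(",")
            counts.update(r.strip() for r in rules if r.strip())
    return counts

logs = [  # simplified, invented log lines
    "spamd: result: Y 7 tests=HTML_IMAGE_ONLY,RAZOR2_CHECK",
    "spamd: result: . 1 tests=HTML_IMAGE_ONLY",
]
counts = tally_rules(logs)
```

Shipping only the rule-name counters upstream is what keeps the reporting
anonymous: nothing about the mails themselves ever leaves the site.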
> Sadly the actual used model of scoring is not helpful
> for this approach :( It would be much better to have
> a real statistical scoring where I just could multiply
> the probabilities of each used rule to get a result.
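The multiplicative scheme described there is essentially a naive-Bayes
combination: multiply the per-rule odds rather than the raw probabilities, so
the result stays a valid probability. A sketch with invented per-rule values:

```python
def combine(rule_probs):
    """Combine independent per-rule P(spam | rule hit) by multiplying odds."""
    odds = 1.0
    for p in rule_probs:
        odds *= p / (1.0 - p)
    return odds / (1.0 + odds)

# invented probabilities for the rules that hit one mail
p_spam = combine([0.9, 0.8, 0.6])
```

The independence assumption is of course the weak point, which is exactly why
redundancy detection between rules matters for a scheme like this.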
SARE has a rule scoring method that Bob developed that assigns a probable
score to the rule based on the masscheck results. Sometimes we modify this
manually based on other factors, but most of the time it goes into the rules
files directly. We know it isn't as good as a full SA scoring run. But on
the other hand, it doesn't require a full SA scoring run, and generally
produces pretty usable results. I would envision something like this being
used for initial rule introductions, and periodically the rules would be
rescored using a full scoring run.
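I won't reproduce Bob's actual formula here, but the general shape of such a
masscheck-based provisional scorer might look like this (the scaling constant
and cap are arbitrary placeholders, not SARE's real values):

```python
def provisional_score(spam_hits, spam_total, ham_hits, ham_total, cap=3.0):
    """Assign a provisional score from masscheck hit rates.
    Higher spam hit rate and lower ham hit rate -> higher score."""
    spam_rate = spam_hits / spam_total
    ham_rate = ham_hits / ham_total
    if spam_rate == 0.0:
        return 0.0
    if ham_rate == 0.0:
        return cap  # no ham hits at all: hand out the capped score
    # arbitrary scaling: score grows with the spam/ham rate ratio
    return min(cap, 0.1 * spam_rate / ham_rate)

score = provisional_score(spam_hits=500, spam_total=10000,
                          ham_hits=0, ham_total=10000)
```

The point is only that a cheap, deterministic function of the masscheck counts
gets you a usable starting score, to be replaced by a full scoring run later.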
Loren