> How would we determine ham/spam? At this point all we have is SA's
> first estimation, and no way of knowing whether this is accurate, FN,
> or FP.
All we could reasonably do is take SA's assessment of the message and assume
that statistically it will be correct to within one or two sigma. If the
reporting
site really does have a huge percentage of FPs or FNs, it will screw the
reporting up some. But most sites should be running fairly cleanly, so we
should be able to assume with about 95% accuracy that the assessment of what
kind of mail the rule hit was correct.
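To make that concrete, here's a rough back-of-the-envelope sketch (Python,
with made-up numbers) of how a site's own misclassification rate bleeds into
the per-rule ham/spam attribution:

    # Hypothetical numbers: how much a site's own misclassification
    # rate distorts the observed per-rule ham fraction.
    def observed_ham_fraction(true_ham_fraction, site_error_rate):
        # A hit counted as "ham" is either real ham labeled correctly,
        # or spam the site mislabeled as ham.
        return (true_ham_fraction * (1 - site_error_rate)
                + (1 - true_ham_fraction) * site_error_rate)

    # A rule that truly hits ham 2% of the time, seen through a site
    # running at 95% accuracy, appears to hit ham about 6.8% of the time:
    print(observed_ham_fraction(0.02, 0.05))   # -> about 0.068

So a fairly clean site shifts the numbers a bit, but not enough to hide a
rule that's grossly misbehaving.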
One of the main goals here would be to look for rules that we think are
supposed to hit spam, but which the reporting site claims hit ham 25% of the
time. That would be a clear indication that a) the site is terribly screwed
up, b) the rule is terribly screwed up, if the site reports that it speaks
English, or c) the rule doesn't work well in whatever language the site
reports it uses. Clearly the converse situation also applies.
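A minimal sketch of that check, assuming per-rule hit counts aggregated by
reporting site (all field names here are invented for illustration):

    # Flag "spam" rules whose reported ham-hit fraction crosses a
    # threshold at some site.  Report fields are hypothetical.
    HAM_HIT_THRESHOLD = 0.25

    def suspicious_rules(reports):
        # reports: iterable of dicts like
        #   {"rule": "FOO_RULE", "site": "example.org",
        #    "ham_hits": 250, "spam_hits": 750}
        flagged = []
        for r in reports:
            total = r["ham_hits"] + r["spam_hits"]
            if total and r["ham_hits"] / total >= HAM_HIT_THRESHOLD:
                flagged.append((r["rule"], r["site"]))
        return flagged

Which of a), b), or c) applies would then be sorted out by looking at the
language data for the flagged sites.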
This is why I also want the reports to contain an indication of the language
and/or geographical location of the site, so we can spot the foreign-language
problems. If the report also contained some indication of the percentage of
mails that were submitted for learning and showed signs of being
mis-classified (if that information is obtainable), it would give an
indication of the reliability of the classification data from the site.
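One possible shape for such a report record, with the relearn percentage used
to down-weight shaky sites (again, the field names and the weighting are just
a sketch, not a proposed format):

    from dataclasses import dataclass

    @dataclass
    class SiteRuleReport:
        rule: str
        site_language: str   # e.g. "en", "de"; the site's claimed language
        site_region: str     # coarse geographic location
        ham_hits: int
        spam_hits: int
        relearn_rate: float  # fraction of mail resubmitted for learning as
                             # apparently misclassified, if obtainable

    def reliability_weight(report, max_relearn=0.10):
        # Trust a site less the more of its own mail it has to relearn.
        return max(0.0, 1.0 - report.relearn_rate / max_relearn)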
The goal here isn't the accuracy of a masscheck with a hand-classified
corpus. The goal is much greater insight into how rules are actually working
out in the real world, especially the non-English parts of it. While not as
good as a hand-classified corpus, it should be a heck of a lot better than NO
corpus, which is the current alternative.
Loren