> How would we determine ham/spam? At this point all we have is SA's
> first estimation, and no way of knowing whether this is accurate, FN,
> or FP.
All we could reasonably do is take SA's assessment of the message and assume
that statistically it will be correct to within one or two sigma. If the
reporting
site really does have a huge percentage of FPs or FNs, it will screw the
reporting up some. But most sites should be running fairly cleanly, so we
should be able to assume with about 95% accuracy that the assessment of what
kind of mail the rule hit was correct.
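To make that concrete, here's a rough back-of-the-envelope sketch (Python,
with made-up numbers) of how a site's own misclassification rate bleeds into
the per-rule ham/spam attribution:

    # Hypothetical numbers: how much a site's own misclassification
    # rate distorts the observed per-rule ham fraction.
    def observed_ham_fraction(true_ham_fraction, site_error_rate):
        # A hit counted as "ham" is either real ham labeled correctly,
        # or spam the site mislabeled as ham.
        return (true_ham_fraction * (1 - site_error_rate)
                + (1 - true_ham_fraction) * site_error_rate)

    # A rule that truly hits ham 2% of the time, seen through a site
    # running at 95% accuracy, appears to hit ham about 6.8% of the time:
    print(observed_ham_fraction(0.02, 0.05))   # -> about 0.068

So a fairly clean site shifts the numbers a bit, but not enough to hide a
rule that's grossly misbehaving.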
One of the main goals here would be to look for rules that we think are
supposed to hit spam, but which the reporting site claims hit ham 25% of the
time. That would be a clear indication that a) the site is terribly screwed
up, b) the rule is terribly screwed up, if the site reports that it speaks
English, or c) the rule doesn't work well in whatever language the site
reports it uses. Clearly the converse situation also applies.
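A minimal sketch of that check, assuming per-rule hit counts aggregated by
reporting site (all field names here are invented for illustration):

    # Flag "spam" rules whose reported ham-hit fraction crosses a
    # threshold at some site.  Report fields are hypothetical.
    HAM_HIT_THRESHOLD = 0.25

    def suspicious_rules(reports):
        # reports: iterable of dicts like
        #   {"rule": "FOO_RULE", "site": "example.org",
        #    "ham_hits": 250, "spam_hits": 750}
        flagged = []
        for r in reports:
            total = r["ham_hits"] + r["spam_hits"]
            if total and r["ham_hits"] / total >= HAM_HIT_THRESHOLD:
                flagged.append((r["rule"], r["site"]))
        return flagged

Which of a), b), or c) applies would then be sorted out by looking at the
language data for the flagged sites.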
This is why I also want the reports to contain an indication of the language
and/or geographical location of the site, so we can spot the foreign-language
problems. If the report also contained some indication of the percentage of
mails that were submitted for learning and showed signs of being
mis-classified (if that information is obtainable), it would give an
indication of the reliability of the classification data from the site.
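One possible shape for such a report record, with the relearn percentage used
to down-weight shaky sites (again, the field names and the weighting are just
a sketch, not a proposed format):

    from dataclasses import dataclass

    @dataclass
    class SiteRuleReport:
        rule: str
        site_language: str   # e.g. "en", "de"; the site's claimed language
        site_region: str     # coarse geographic location
        ham_hits: int
        spam_hits: int
        relearn_rate: float  # fraction of mail resubmitted for learning as
                             # apparently misclassified, if obtainable

    def reliability_weight(report, max_relearn=0.10):
        # Trust a site less the more of its own mail it has to relearn.
        return max(0.0, 1.0 - report.relearn_rate / max_relearn)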
The goal here isn't the accuracy of a masscheck with a hand-classified
corpus. The goal is much greater insight into how rules are actually working
out in the real world, especially the non-English parts of it. While not as
good as a hand-classified corpus, it should be a heck of a lot better than NO
corpus, which is the current alternative.
Loren