I'm thinking of a new way to collect mail for the mass-checks. We already collect plenty of spam, so I don't think that's going to be a problem. What we could use is more ham, to help identify FPs.
One useful property of ham is that it's not time-sensitive; a mail that was ham in 2003 would still be ham today. So we can collect old ham mail archives, or submissions of relatively old mail, if necessary. I plan to ask (on users@, on my blog, etc.) for submissions of archives of ham. Submissions of _just_ false positives are OK, as long as they're labelled as such: they'll have a different profile from ordinary ham, and too many FPs in the corpus will cause trouble for the score generation step. I'll then have a quick go at hand-classifying the submitted corpora, spotting obvious FNs that slipped in, etc., and will then leave them on the zone for the nightly mass-checks to use as well. That means the corpora won't be private submissions. Thoughts? -- --j.
