On Thu, 2009-07-16 at 13:58 -0700, John Hardin wrote:
> On Thu, 16 Jul 2009, Justin Mason wrote:
> > I'll then have a quick go at hand-classifying the submitted corpora,
> > spotting obvious FNs that slipped in, etc., and will then leave them on
> > the zone for nightly mass-checks to use as well. So the corpora won't
> > be private submissions.
> >
> > Thoughts?
I guess that's bound to be heavily biased towards newsletters and the
like, rather than real ham. Why? Privacy. I, for one, am definitely not
going to publish my entire ham archive, let alone private conversations.
What will be left is non-private, non-confidential mail, which is most
likely heavily biased. But then, the current (distributed) corpora are
probably slightly(?) biased anyway, towards geeky...
Also, I fear some people from the target audience will dump too many
list posts there, including discussions about spam and virus filtering.
These should not be filtered by SA in the first place, so they must not
appear in the corpus.
> Liability? Someone who provides you with a corpus voluntarily is implying
> they don't care if it becomes public; you might want to require a
> liability release.
Good point! The recipient might not have a problem publishing it. But
what about the sender?
--
char *t="\10pse\0r\0dtu...@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}