-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Henry Stern writes:
> I'd expect that the 700k message corpus will be more prone to errors
> than the 2M message corpus.  It still might be good enough.
> 
> I'm not convinced that rescoring (as opposed to putting in new rules)
> will do much for 3.0.5's accuracy.  If people really want to go to the
> trouble of running the mass-checks, I won't say no to generating the
> scores.  However, I can't promise that they will be any good.

At this stage, I think we're back to discussing the rescore mass-checks in
general, rather than 3.0.5 in particular.  Theo noted the relative corpus
sizes in 3.1.0 as a potential issue (nearly 2/3rds is made up of my and
Theo's mail alone).
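
For reference, the per-corpus cap proposed in the quoted message below
could be sketched roughly like this.  It is only an illustration under
some assumptions: each log entry has already been reduced to a
(contributor, timestamp, line) tuple, the cap is measured against the
original total in a single pass, and the function name cap_corpora is
made up for this sketch.  It does not parse real mass-check logs.

```python
from collections import defaultdict


def cap_corpora(entries):
    """Limit each contributor to an equal share of the original total.

    `entries` is an iterable of (contributor, timestamp, line) tuples.
    This tuple layout is an assumption made for this sketch; it is not
    the real mass-check log format.
    """
    by_contributor = defaultdict(list)
    for entry in entries:
        by_contributor[entry[0]].append(entry)

    total = sum(len(msgs) for msgs in by_contributor.values())
    if total == 0:
        return []

    # (100 / num_contributors)% of the original total, as a message count.
    cap = total // len(by_contributor)

    kept = []
    for contributor in sorted(by_contributor):
        # Newest first, so any excess is dropped from the oldest mail.
        msgs = sorted(by_contributor[contributor],
                      key=lambda e: e[1], reverse=True)
        kept.extend(msgs[:cap])
    return kept
```

Corpora already under the cap are left untouched; only the oversized
ones lose their oldest messages.  Because the cap is computed once
against the original total, the big corpora can still hold more than an
equal share of what remains, so a real script might want to repeat the
pass or compute the cap differently.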

- --j.

> Cheers,
> Henry
> 
> Justin Mason wrote:
> > Actually, the problem that Theo is highlighting is not that we don't have
> > any contributors for rescoring mass-checks using smaller corpora; we do
> > (and more are definitely welcome!)
> >
> > The problem is that these small corpora become "background noise" compared
> > to the big, 700k-message corpora -- myself (34%), and Theo (31%).
> > What we need to do to fix this problem is come up with ways to avoid
> > letting big corpora "drown out" the little ones.
> >
> > I think if we limit each corpus to a certain max percentage of the total,
> > we could do this -- e.g. if a corpus makes up more than (100 /
> > num_contributors)%, then any excess above that percentage is dropped,
> > favouring recent mails over older ones.  (This post-processing step
> > is doable with mass-check logs btw, we can write a script to do this.)
> >
> > The downside would be that we would then have "only" a 700,000-message
> > corpus (or so) instead of a 2,000,000-message one.  Henry, is that OK?
> >
> > --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFDj4NSMJF5cimLx9ARAkBoAJ4/zU5LknHq6IFrLaQD2/adIZTSawCgopeQ
zs0k/mzjbMZ/dZc3IfX5zf8=
=0BKQ
-----END PGP SIGNATURE-----
