Henry Stern writes:
> I'd expect that the 700k message corpus will be more prone to errors
> than the 2M message corpus.  It still might be good enough.
>
> I'm not convinced that rescoring (as opposed to putting in new rules)
> will do much for 3.0.5's accuracy.  If people really want to go to the
> trouble of running the mass-checks, I won't say no to generating the
> scores.  However, I can't promise that they will be any good.

At this stage, I think we're back to discussing the rescore mass-checks
in general, rather than 3.0.5 in particular.  Theo noted the relative
corpus sizes in 3.1.0 as a potential issue (nearly two-thirds is made up
of my mail and Theo's alone).

--j.

> Cheers,
> Henry
>
> Justin Mason wrote:
> > Actually, the problem that Theo is highlighting is not that we don't have
> > any contributors for rescoring mass-checks using smaller corpora; we do
> > (and more are definitely welcome!)
> >
> > The problem is that these small corpora become "background noise" compared
> > to the big, 700k-message corpora -- mine (34%) and Theo's (31%).
> > What we need to do to fix this problem is come up with ways to avoid
> > letting big corpora "drown out" the little ones.
> >
> > I think if we limit each corpus to a certain max percentage of the total,
> > we could do this -- e.g. if a corpus makes up more than (100 /
> > num_contributors)%, then any excess above that percentage is dropped,
> > favouring recent mails over older ones.  (This post-processing step
> > is doable with mass-check logs btw, we can write a script to do this.)
> >
> > The downside would be that we would then have "only" a 700,000-message
> > corpus (or so) instead of a 2,000,000-message one.  Henry, is that OK?
> >
> > --j.
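
For reference, the capping step proposed in the quoted message could look
roughly like the sketch below.  It is not an existing SpamAssassin script.
It assumes each contributor's mail has already been parsed into hypothetical
(timestamp, message_id) pairs, and that the cap is taken as
(100 / num_contributors)% of the original total, computed once.  The name
cap_corpora and the record layout are made up for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical record: (unix_timestamp, message_id).  Real mass-check log
# lines carry more fields; parsing them is out of scope for this sketch.
Message = Tuple[int, str]


def cap_corpora(corpora: Dict[str, List[Message]]) -> Dict[str, List[Message]]:
    """Limit each contributor to an equal share of the original total."""
    if not corpora:
        return {}
    total = sum(len(msgs) for msgs in corpora.values())
    # Assumption: the cap is (100 / num_contributors)% of the *original*
    # total, computed once and not recomputed after trimming.
    cap = total // len(corpora)
    capped: Dict[str, List[Message]] = {}
    for name, msgs in corpora.items():
        # Sort newest first so that any excess dropped is the oldest mail.
        newest_first = sorted(msgs, key=lambda m: m[0], reverse=True)
        capped[name] = newest_first[:cap]
    return capped


if __name__ == "__main__":
    corpora = {
        "big": [(t, f"big-{t}") for t in range(8)],
        "small_a": [(t, f"a-{t}") for t in range(2)],
        "small_b": [(t, f"b-{t}") for t in range(2)],
    }
    for name, msgs in cap_corpora(corpora).items():
        print(name, len(msgs))
```

In this toy input the large corpus is trimmed to its four newest messages
and the small ones are left alone, so the combined set shrinks from 12 to 8.
That is the same trade-off raised in the thread: a more balanced corpus,
but a smaller one.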
