Michael Monnerie writes:
> On Tuesday, 22 November 2005 06:44 Theo Van Dinter wrote:
> > So basically Justin is 34%, I'm 31%, and everyone else combined is
> > 35%.
>
> I could send you my hand-sorted SPAM, if you like. It's only ~3000
> SPAMs, but maybe worth it - more and more German-language SPAM coming
> into my honeypots.
>
> For a more differentiated SPAM score, more people would have to commit
> their SPAM. If that process would be well documented, and people
> encouraged to do so, and the process is as easy as calling a script, I
> believe you could have a lot of reporters.

Actually, the problem that Theo is highlighting is not that we lack
contributors for rescoring mass-checks using smaller corpora; we have
them (and more are definitely welcome!). The problem is that these
small corpora become "background noise" compared to the big,
700k-message corpora -- mine (34%) and Theo's (31%).

What we need to do to fix this is come up with ways to avoid letting
big corpora "drown out" the little ones. I think we could do this by
limiting each corpus to a certain maximum percentage of the total --
e.g. if a corpus makes up more than (100 / num_contributors)% of the
total, then any excess above that percentage is dropped, favouring
recent mails over older ones. (This post-processing step is doable
with mass-check logs, btw; we can write a script to do it.)

The downside would be that we would then have "only" a 700,000-message
corpus (or so) instead of a 2,000,000-message one. Henry, is that OK?

--j.
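
For illustration, here is a rough sketch of the capping step described above, in Python. It is not an existing SpamAssassin script, and it does not parse real mass-check logs: the function name `cap_corpora` and the `(contributor, timestamp, line)` tuple layout are hypothetical, and how those fields would be extracted from the logs is left open. It also assumes one particular reading of the rule: the cap is computed once against the original total, so the big corpora still end up above an even share of the trimmed result.

```python
from collections import defaultdict


def cap_corpora(entries):
    """Limit each contributor to (total / num_contributors) messages.

    `entries` is an iterable of (contributor, timestamp, line) tuples;
    this layout is an assumption made for the sketch, not the actual
    mass-check log format. Returns the list of entries that are kept.
    """
    # Group messages by the contributor who submitted them.
    by_contributor = defaultdict(list)
    for contributor, timestamp, line in entries:
        by_contributor[contributor].append((timestamp, line))

    if not by_contributor:
        return []

    # Cap is an even share of the ORIGINAL total, i.e. the
    # (100 / num_contributors)% figure expressed as a message count.
    total = sum(len(msgs) for msgs in by_contributor.values())
    cap = total // len(by_contributor)

    kept = []
    for contributor, msgs in by_contributor.items():
        # Newest first, so any excess that gets dropped is the oldest mail.
        msgs.sort(key=lambda m: m[0], reverse=True)
        for timestamp, line in msgs[:cap]:
            kept.append((contributor, timestamp, line))
    return kept
```

Small corpora pass through untouched, since they are already under the cap; only the oversized ones lose their oldest messages. Applying the cap repeatedly until every share is at or below the limit would be a stricter variant, at the cost of a smaller final corpus.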
