-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Michael Monnerie writes:
> On Dienstag, 22. November 2005 06:44 Theo Van Dinter wrote:
> > So basically Justin is 34%, I'm 31%, and everyone else combined is
> > 35%.
> 
> I could send you my hand-sorted spam, if you like. It's only ~3000
> spams, but maybe worth it - more and more German-language spam is coming
> into my honeypots.
> 
> For a more differentiated spam score, more people would have to commit
> their spam. If that process were well documented, people were
> encouraged to do so, and the process were as easy as calling a script, I
> believe you could have a lot of reporters.

Actually, the problem that Theo is highlighting is not a lack of
contributors with smaller corpora for the rescoring mass-checks; we have
those (and more are definitely welcome!).

The problem is that these small corpora become "background noise" compared
to the big, 700k-message corpora -- mine (34%) and Theo's (31%).
What we need to do to fix this problem is come up with ways to avoid
letting big corpora "drown out" the little ones.

I think if we limit each corpus to a certain max percentage of the total,
we could do this -- e.g. if a corpus makes up more than (100 /
num_contributors)% of all messages, then any excess above that percentage
is dropped, favouring recent mails over older ones.  (This post-processing
step is doable with mass-check logs btw; we can write a script to do this.)
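
To make the capping idea concrete, here is a rough sketch of what such a
script might look like. It is only an illustration under assumptions: it
does not parse the real mass-check log format, it takes hypothetical
(contributor, timestamp, line) tuples instead, and cap_corpora is a made-up
name. The cap is computed once from the original total, so it is a
single-pass approximation of the (100 / num_contributors)% rule.

```python
from collections import defaultdict


def cap_corpora(messages):
    """Limit each contributor to an equal share of the original total.

    `messages` is an iterable of (contributor, timestamp, line) tuples.
    This record layout is an assumption made for the sketch; it is not
    the actual mass-check log format.
    """
    by_contributor = defaultdict(list)
    for contributor, timestamp, line in messages:
        by_contributor[contributor].append((timestamp, line))

    total = sum(len(items) for items in by_contributor.values())
    if total == 0:
        return []

    # (100 / num_contributors)% of the total, expressed as a message count.
    cap = total // len(by_contributor)

    kept = []
    for contributor, items in by_contributor.items():
        # Newest first, so that any excess dropped is the oldest mail.
        items.sort(key=lambda item: item[0], reverse=True)
        for timestamp, line in items[:cap]:
            kept.append((contributor, timestamp, line))
    return kept
```

Corpora already below the cap pass through untouched; only the oversized
ones lose their oldest messages.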

The downside would be that we would then have "only" a 700,000-message
corpus (or so) instead of a 2,000,000-message one.  Henry, is that OK?

- --j.
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.1 (GNU/Linux)
Comment: Exmh CVS

iD8DBQFDjz8aMJF5cimLx9ARAnzkAKCn6114fMkEqYby6QuDyd0V2x46gACdH8FN
p2axrU0h3iTd9evP8aUhFS4=
=iKC1
-----END PGP SIGNATURE-----
