On Thursday, 1 December 2005 19:21 Justin Mason wrote:
> I think if we limit each corpus to a certain max percentage of the
> total, we could do this -- e.g. if a corpus makes up more than (100 /
> num_contributors)%, then any excess above that percentage is dropped,

You're not telling me that your 700k messages are hand-sorted? How old
are you ;-)

Anyway, more contributors would help with the problem. Imagine you get
100 contributors with just 2000 messages each. And I believe there are
a lot of people out there who already have bigger corpora. Making it
easier to contribute (and encouraging people to report) could help.

If your two corpora are that big, I guess a time limit that takes only
the last 180 days or so of all spam could reduce your dominance in the
percentages.
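
A minimal sketch of the time-limit idea in Python, assuming each message
is available as a (received date, path) pair. The function names, the
tuple layout and the sample data are hypothetical and only illustrate
the filtering, they are not taken from any existing tool:

```python
from datetime import datetime, timedelta


def recent_only(messages, now, max_age_days=180):
    """Keep only the (received, path) pairs no older than max_age_days."""
    cutoff = now - timedelta(days=max_age_days)
    return [m for m in messages if m[0] >= cutoff]


def shares(corpus_sizes):
    """Return each corpus's percentage of the combined message count."""
    total = sum(corpus_sizes.values())
    return {name: 100.0 * n / total for name, n in corpus_sizes.items()}


# Hypothetical sample: one old and two recent spam messages.
now = datetime(2005, 12, 1)
spam = [
    (datetime(2005, 11, 1), "spam/a"),
    (datetime(2005, 1, 1), "spam/b"),   # older than 180 days, dropped
    (datetime(2005, 6, 10), "spam/c"),
]
kept = recent_only(spam, now)
```

Running shares() on the corpus sizes before and after the filter would
show how much the big corpora shrink relative to the small ones.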

Best regards, zmi
-- 
// Michael Monnerie, Ing.BSc  ---   it-management Michael Monnerie
// http://zmi.at           Tel: 0660/4156531          Linux 2.6.11
// PGP Key:   "lynx -source http://zmi.at/zmi2.asc | gpg --import"
// Fingerprint: EB93 ED8A 1DCD BB6C F952  F7F4 3911 B933 7054 5879
// Keyserver: www.keyserver.net                 Key-ID: 0x70545879
