On Thursday, 1 December 2005 19:21, Justin Mason wrote:
> I think if we limit each corpora to a certain max percentage of the
> total, we could do this -- e.g. if a corpus makes up more than (100 /
> num_contributors)%, then any excess above that percentage is dropped,

You're not telling me that your 700k messages are hand-sorted? How old
are you ;-)

Anyway, more contributors would help with the problem. Imagine you get
100 contributors with just 2000 messages each. And I believe there are
a lot of people out there who already have a bigger corpus than that.
Making it easier to contribute (and encouraging people to report) could
help.

If your two corpora are so big, I guess setting a time limit, so that
only the last 180 days or so of all spam is taken, could reduce your
dominance in the percentages.

Best regards,
zmi
-- 
// Michael Monnerie, Ing.BSc  ---  it-management Michael Monnerie
// http://zmi.at  Tel: 0660/4156531  Linux 2.6.11
// PGP Key: "lynx -source http://zmi.at/zmi2.asc | gpg --import"
// Fingerprint: EB93 ED8A 1DCD BB6C F952 F7F4 3911 B933 7054 5879
// Keyserver: www.keyserver.net  Key-ID: 0x70545879
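
A rough sketch of the two ideas discussed in this message, the per-corpus cap and the 180-day window, might look like this in Python. It is only one reading of the proposal: it assumes the cap is computed once against the original total (so the cap equals total / num_contributors messages), and the function names, the dictionary of corpus sizes, and the (id, received) tuples are all hypothetical, not anything from the actual mass-check tooling.

```python
from datetime import datetime, timedelta


def cap_corpora(sizes):
    """Trim each corpus to at most (100 / num_contributors)% of the total.

    `sizes` maps a (hypothetical) contributor name to a message count.
    Assumption: the cap is taken against the original, untrimmed total.
    """
    if not sizes:
        return {}
    total = sum(sizes.values())
    cap = total // len(sizes)  # total / num_contributors, rounded down
    # Anything above the cap is dropped; smaller corpora are untouched.
    return {name: min(count, cap) for name, count in sizes.items()}


def recent_only(messages, now, max_age_days=180):
    """Keep only messages received within the last `max_age_days` days.

    `messages` is a list of (message_id, received_datetime) tuples.
    """
    cutoff = now - timedelta(days=max_age_days)
    return [m for m in messages if m[1] >= cutoff]


if __name__ == "__main__":
    # Illustrative numbers only: one 700k corpus next to two small ones.
    print(cap_corpora({"big": 700000, "small1": 2000, "small2": 2000}))
    now = datetime(2005, 12, 1)
    msgs = [("old", datetime(2005, 1, 1)), ("new", datetime(2005, 11, 1))]
    print(recent_only(msgs, now))
```

With only three contributors the cap is still large, which is why a bigger pool of small contributors matters: the more contributors there are, the lower the cap falls and the less any single corpus can dominate.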
