Justin Mason wrote:

"Daryl C. W. O'Shea" writes:

>>Please let me know what you think!

>Sounds good, but I think the limited (and relatively static?) corpus may
>be an issue for rule development aimed at catching new spam signs.


Good point.

A static-ish ham corpus isn't a big problem, but we may need to supplement
the spam corpus with fresh feeds of new spam.  It should be possible to do
this either from trap feeds, or via submissions from the nightly corpus
submitters (rsync up bits of your corpus as you see fit).  Traps is
probably easier.

A couple of thoughts regarding corpus stuff with the current SARE masscheck method in mind:

- Ham is private to the individual masschecker. With a global corpus, this would necessarily no longer be the case. I would think twice about sending my corpus to a global corpus, even an access-controlled one.

- Individual corpus results vary dramatically. Sometimes it's useful to see how rules hit different corpora. In your proposed model, the masscheck could iterate over the corpora, check each one individually, and then consolidate the results (one weakness of our current method is that there is no consolidated view).

- Staleness of corpora. Sometimes a rule is developed for a brand new spam. Chris S sometimes cranks out several new versions of a rule in a week as the spam mutates. Often the users' corpora that aren't up to date (usually mine ;) ) will show no hits, but once the user refreshes the corpus the hits show up. This would be an issue for either type of system. For me it currently means checking my Maildirs for misclassified ham, running an IMAP purge, and running an exportcorpus script. In your proposed system it would simply mean adding an rsync as another step.

- Masscheck speed: a minor point, but a valid one I think. The proposed buildbot solution is centralized, so it doesn't scale as well when additional corpora are added. In the current SARE system each corpus is checked in parallel with the rest.

- Barrier to entry: the SARE system requires each user to set up a script to do the masscheck, integrate it with the local MTA, ensure serialization of requests, etc. Your proposed solution (uploading corpora) is easier to set up.
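
To make the "check each corpus individually, then consolidate" idea from the list above concrete, here is a rough sketch. It is not the SARE or buildbot implementation: the function name, the corpus names, and the input shape (a mapping of corpus name to per-rule hit counts) are all assumptions made for illustration, and it does not parse any real mass-check output.

```python
def consolidate(per_corpus_hits):
    """Merge per-corpus rule hit counts into one consolidated view.

    per_corpus_hits: hypothetical input, {corpus_name: {rule_name: hit_count}}.
    Returns {rule_name: {"total": int, "by_corpus": {corpus_name: int}}}.
    """
    corpora = sorted(per_corpus_hits)
    # Every rule seen in at least one corpus.
    rules = sorted({rule for hits in per_corpus_hits.values() for rule in hits})

    view = {}
    for rule in rules:
        # Record an explicit 0 for corpora with no hits, so a stale
        # corpus is visible next to the ones that do hit.
        by_corpus = {c: per_corpus_hits[c].get(rule, 0) for c in corpora}
        view[rule] = {"total": sum(by_corpus.values()), "by_corpus": by_corpus}
    return view


if __name__ == "__main__":
    # Made-up counts for two hypothetical corpora.
    results = {
        "corpus_a": {"RULE_X": 12, "RULE_Y": 3},
        "corpus_b": {"RULE_Y": 5},
    }
    for rule, entry in consolidate(results).items():
        print(rule, entry["total"], entry["by_corpus"])
```

Keeping the per-corpus breakdown next to the total preserves the ability to see how a rule hits different corpora, while still giving the single consolidated view the current method lacks.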

That's all for now. I may think of more stuff later.


Chris Thielen
