https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5845
--- Comment #6 from Justin Mason <[EMAIL PROTECTED]> 2008-04-21 02:18:41 PST ---

I've been thinking about this a little more. I think the hardest part is
distributing the semi-public ham/spam corpora to the mass-checking machines.
Currently we use mass-check's client/server mode, which distributes the
messages out to the clients, but this is complex, seems a little brittle,
and of course isn't Hadoop-compatible.

Here's an alternative:

- On the zone, we collect the uploaded mails to scan (we already do this).

- A zone-hosted process periodically takes the uploaded corpora and extracts
  them into "bundles" of mail from each submitter, organised by date. The
  easiest way to handle these bundles is to collect them as gzipped mboxes,
  one per day: "/home/corpus/jm/20080421.mbox.gz", for example.

  The advantage of organising by submitter: we can tell whose collection
  each mail came from. The advantage of organising by date: if we record
  enough metadata about where the original mails came from, we can cache
  the rebuilt files so that the incremental build process runs quickly.
  In other words, corpora of mails from 2007 are unlikely to change, apart
  from "expiring" off the mass-check list due to age. (A rough sketch of
  this bundling step is at the end of this comment.)

- We can then rsync out the .mbox.gz files, or even offer them on the zone
  for HTTP download at semi-public URLs (possibly similar to how I'm
  distributing the spamtrap data at the moment). (A sketch of the consumer
  side follows the bundling sketch below.)

This should scale better, since the bundles are static files that any number
of hosts can mirror. One advantage: I think we can achieve this with the
client/server stuff right now, without moving full-scale to Hadoop.
(Right, Daryl?)
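For concreteness, here's a rough sketch of that bundling step in Python. It's
a sketch under assumptions, not the actual zone tooling: the upload layout
(one Maildir per submitter under /home/corpus/uploads) and deriving each
message's day from its Date header are both guesses for illustration.

    #!/usr/bin/env python3
    # Hypothetical bundler: group each submitter's uploads by day and write
    # one gzipped mbox per day, e.g. /home/corpus/jm/20080421.mbox.gz.
    import gzip
    import mailbox
    import os
    import time
    from email.utils import parsedate

    UPLOAD_ROOT = "/home/corpus/uploads"  # assumed: one Maildir per submitter
    BUNDLE_ROOT = "/home/corpus"          # bundles go to /home/corpus/<user>/

    def message_day(msg):
        """Return the message's Date header as YYYYMMDD, or None."""
        parsed = parsedate(msg.get("Date", ""))
        return time.strftime("%Y%m%d", parsed) if parsed else None

    def bundle_submitter(user):
        src = mailbox.Maildir(os.path.join(UPLOAD_ROOT, user), factory=None)
        by_day = {}
        for msg in src:
            day = message_day(msg)
            if day:
                by_day.setdefault(day, []).append(msg)

        today = time.strftime("%Y%m%d")
        outdir = os.path.join(BUNDLE_ROOT, user)
        os.makedirs(outdir, exist_ok=True)
        for day, msgs in by_day.items():
            out = os.path.join(outdir, day + ".mbox.gz")
            # Incremental build: old days rarely change, so skip any bundle
            # that already exists, unless it's today's (still accumulating).
            # "Expiring" old mail is then just deleting old .mbox.gz files.
            if os.path.exists(out) and day != today:
                continue
            with gzip.open(out, "wt") as fh:
                for msg in msgs:
                    fh.write(msg.as_string(unixfrom=True) + "\n")

    if __name__ == "__main__":
        for user in sorted(os.listdir(UPLOAD_ROOT)):
            bundle_submitter(user)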

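And the consumer side on a mass-check host could be as simple as the
following; the base URL is made up, and the real scheme would depend on how
the zone exports the tree:

    # Hypothetical consumer: fetch one day's bundle over HTTP and iterate it.
    import gzip
    import mailbox
    import tempfile
    import urllib.request

    def fetch_bundle(user, day, base="https://zone.example.org/corpus"):
        """Download <base>/<user>/<day>.mbox.gz and yield its messages."""
        url = "%s/%s/%s.mbox.gz" % (base, user, day)
        with urllib.request.urlopen(url) as resp:
            raw = gzip.decompress(resp.read())
        # mailbox.mbox wants a path, so stage the data in a temp file.
        with tempfile.NamedTemporaryFile(suffix=".mbox") as tmp:
            tmp.write(raw)
            tmp.flush()
            for msg in mailbox.mbox(tmp.name):
                yield msg

    for msg in fetch_bundle("jm", "20080421"):
        print(msg.get("Subject", "(no subject)"))

Since old day files should rarely change, the hosts would only need to fetch
days they haven't seen before, which is what keeps the incremental runs cheap.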