https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5845

--- Comment #6 from Justin Mason <[EMAIL PROTECTED]>  2008-04-21 02:18:41 PST ---
I've been thinking about this a little more.

I think the hardest part is distributing the semi-public ham/spam corpora to
the mass-checking machines.  Currently we use mass-check's client/server mode,
which distributes the messages itself, but that's complex, seems a little
brittle, and of course isn't Hadoop-compatible.

Here's another alternative:

- on the zone, we collect the uploaded mails to scan (we currently do this)

- a zone-hosted process periodically takes the uploaded corpora and extracts
  them into "bundles" of mail from each submitter, organised by date.  The
  easiest way to deal with these "bundles" is just to collect them as gzipped
  mboxes, one per submitter per day, e.g. "/home/corpus/jm/20080421.mbox.gz".
  (There's a rough sketch of this after the list.)

  The advantage of organising them by submitter: we can tell whose collection
  the mail came from, of course.

  The advantage of organising by date: with sufficient recording of metadata
  about where the original mails came from, we can skip rebuilding bundles
  that haven't changed, so the incremental build process runs quickly.
  Corpora of mails from 2007 aren't likely to change, in other words, apart
  from "expiring" off the mass-check list due to age.

- we can then rsync out the .mbox.gz files, or even offer them up on the zone
  for HTTP download at semi-public URLs in some way (possibly similar to
  how I'm distributing the spamtrap data at the moment).

This should scale better.  One advantage is that we can achieve this with
the client-server stuff right now, I think, without moving full-scale to
Hadoop.  (Right, Daryl?)
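
And the consumer end on a mass-check box could be as trivial as this (the
URL is made up; the point is just that a gzipped mbox is easy to fetch and
iterate over):

  # Sketch of the fetch side; the URL scheme is an assumption.
  import gzip, mailbox, tempfile, urllib.request

  url = "http://example.zone/corpus/jm/20080421.mbox.gz"  # hypothetical
  data = gzip.decompress(urllib.request.urlopen(url).read())

  # drop the bundle into a local mbox file and walk it
  with tempfile.NamedTemporaryFile(suffix=".mbox", delete=False) as tmp:
      tmp.write(data)

  for msg in mailbox.mbox(tmp.name):
      print(msg["Message-ID"])  # stand-in for handing each mail to mass-check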

