Re: some guidance needed

Robert Burrell Donkin Thu, 19 May 2011 11:53:44 -0700

On Thu, May 19, 2011 at 12:04 PM, Ioan Eugen Stan <stan.ieu...@gmail.com> wrote:
> I have forwarded this discussion to my mentors so they are informed


(I've hopped onto this list so no need to remember to copy me into the
thread ;-)

<snip>

> Eric, one of my mentors, suggested I use Gora for
> this and after a quick look at Gora I saw that it is an ORM for HBase
> and Cassandra which will allow me to switch between them. The downside
> with this is that Gora is still incubating so a piece of advice about
> using it or not is welcome. I will also ask on the Gora mailing list
> to see how things are there.

(I suspect there will be a measure of experimentation required in this
project, so don't be afraid to try a spike or two)

>>> I would encourage you to look at a system like HBase for your mail
>>> backend. HDFS doesn't work well with lots of little files, and also
>>> doesn't support random update, so existing formats like Maildir
>>> wouldn't be a good fit.

(Apache James is closer to the Microsoft Exchange space than traditional
*nix mail user agents)

> I don't think I understand correctly what you mean by random updates.
> E-mails are immutable so once written they are not going to be
> updated. But if you are referring to the fact that lots of (small)
> files will be written in a directory and that this can be a problem
> then I get it. This will also mean that mailbox format (all emails in
> one file) will be more inappropriate than Maildir. But since e-mails
> are immutable and adding a mail to the mailbox means appending a small
> piece of data to the file this should not be a problem if Hadoop has
> append.

Essentially, there are two classes of data that mail storage requires:

1. read-only MIME documents (mail messages) embedding meta-data (headers)
2. read-write meta-data sets about each document, including flags for
each (virtual) mail directory containing the document

The documents are searched rarely. The meta-data sets are read often
but written rarely.
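
As a rough illustration of that split, here is a minimal Python sketch.
It is not James, HBase or Gora API: the class names, field names and the
doc_id storage key are all made up for this example. The MIME document
is frozen once written, while the per-mailbox entry carries the flags
that change.

```python
from dataclasses import dataclass, field
from email import message_from_bytes
from email.message import Message


@dataclass(frozen=True)
class StoredDocument:
    """Class 1: an immutable MIME document, written once and never updated."""
    doc_id: str   # hypothetical storage key, not the Message-ID header
    raw: bytes

    def headers(self) -> Message:
        # The embedded meta-data (headers) is parsed from the document itself.
        return message_from_bytes(self.raw)


@dataclass
class MailboxEntry:
    """Class 2: mutable meta-data about one document in one (virtual) mailbox."""
    mailbox: str
    doc_id: str
    flags: set = field(default_factory=set)


raw = b"From: a@example.org\r\nSubject: hello\r\n\r\nbody\r\n"
doc = StoredDocument("m1", raw)

# The same document can sit in several mailboxes with independent flags.
inbox = MailboxEntry("INBOX", doc.doc_id)
archive = MailboxEntry("Archive", doc.doc_id)
inbox.flags.add("\\Seen")
```

Only the MailboxEntry records ever need updating, which is why the two
classes can plausibly live in different stores.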

I suspect that emails are relatively small in Hadoop terms, and are
often numerous. Might be interesting to see how a tuned HDFS instance
performs when storing large numbers of small MIME documents. Should be
easy enough to set up an experiment to benchmark. (I wonder whether a
RESTful distributed storage solution might end up working better.)
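
A sketch of what such an experiment could look like, in Python. The
assumptions are loud ones: the store is just a write(name, data)
callable, and a local temporary directory stands in for HDFS here, so
the numbers it prints say nothing about Hadoop. The function names,
message size and count are arbitrary choices for illustration.

```python
import tempfile
import time
from email.message import EmailMessage
from pathlib import Path


def make_message(i: int) -> bytes:
    """Build a small synthetic MIME document."""
    msg = EmailMessage()
    msg["From"] = "sender@example.org"
    msg["To"] = "recipient@example.org"
    msg["Subject"] = f"test message {i}"
    msg.set_content("hello\n" * 30)
    return msg.as_bytes()


def benchmark(write, count: int) -> float:
    """Time `count` writes through the given write(name, data) callable."""
    # Generate the messages first so only the store is being timed.
    messages = [make_message(i) for i in range(count)]
    start = time.perf_counter()
    for i, data in enumerate(messages):
        write(f"{i}.eml", data)
    return time.perf_counter() - start


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # Local-filesystem stand-in; an HDFS or HTTP client would be swapped in here.
    elapsed = benchmark(lambda name, data: (root / name).write_bytes(data), 1000)
    written = len(list(root.iterdir()))

print(f"wrote {written} messages in {elapsed:.3f}s")
```

Keeping the store behind a single callable means the same harness could
be pointed at different back ends and the timings compared.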

I suspect that the read-write meta-data sets will need HBase (or
Cassandra). Would need to think carefully about design, I think.

> The presentation on Vimeo stated that HDFS 0.19 did not have append.
> I don't know yet what the status is on that, but things are a little
> brighter. You could have a mailbox file that could grow to a very
> large size. This will put all the user's emails into one big file
> that is easy to manage; the only thing that is missing is
> fetching the emails. Since emails are appended to the file (inbox) as
> they come, and you usually are interested in the latest emails
> received you could just read the tail of the file and do some indexing
> based on that.

I'm not hopeful about adopting an append-based approach. (It might be
made to work, but I suspect that the locking required for IMAP or POP3
is likely to kill performance.)
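
To show where the locking concern comes from, here is a toy Python
sketch of a single-file mailbox. Everything in it is hypothetical: an
in-memory buffer stands in for one large appendable file, and the class
and method names are invented for this example. It illustrates the
shape of the problem only and measures nothing.

```python
import io
import threading


class AppendOnlyMailbox:
    """Toy single-file mailbox: every message is appended to one buffer."""

    def __init__(self):
        self._data = io.BytesIO()       # stands in for one large appendable file
        self._index = []                # (offset, length) for each message
        self._lock = threading.Lock()   # one lock guards the whole mailbox

    def append(self, raw: bytes) -> int:
        with self._lock:
            offset = self._data.seek(0, io.SEEK_END)
            self._data.write(raw)
            self._index.append((offset, len(raw)))
            return len(self._index) - 1

    def fetch(self, n: int) -> bytes:
        # Readers take the same lock so they never see a half-written file.
        with self._lock:
            offset, length = self._index[n]
            self._data.seek(offset)
            return self._data.read(length)

    def expunge(self, n: int) -> None:
        # Removing one message rewrites the whole file while holding the lock.
        with self._lock:
            kept = []
            for i, (offset, length) in enumerate(self._index):
                if i != n:
                    self._data.seek(offset)
                    kept.append(self._data.read(length))
            self._data = io.BytesIO()
            self._index = []
            for raw in kept:
                offset = self._data.tell()
                self._data.write(raw)
                self._index.append((offset, len(raw)))


box = AppendOnlyMailbox()
first = box.append(b"message one\n")
second = box.append(b"message two\n")
box.expunge(first)
```

Every delivery, fetch and expunge queues behind the same lock, and an
expunge holds it for a full rewrite. Whether that actually kills
performance in practice would need measuring.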

Robert
