CC to gora-dev... Begin forwarded message:
> From: Robert Burrell Donkin <[email protected]>
> Date: May 19, 2011 11:53:16 AM PDT
> To: "[email protected]" <[email protected]>
> Subject: Re: some guidance needed
> Reply-To: "[email protected]" <[email protected]>
>
> On Thu, May 19, 2011 at 12:04 PM, Ioan Eugen Stan <[email protected]>
> wrote:
>> I have forwarded this discussion to my mentors so they are informed
>
> (I've hopped onto this list so no need to remember to copy me into the
> thread ;-)
>
> <snip>
>
>> Eric, one of my mentors, suggested I use Gora for this, and after a
>> quick look at Gora I saw that it is an ORM for HBase and Cassandra
>> which will allow me to switch between them. The downside is that
>> Gora is still incubating, so any advice about using it or not is
>> welcome. I will also ask on the Gora mailing list to see how things
>> are there.
>
> (I suspect there will be a measure of experimentation required in this
> project, so don't be afraid to try a spike or two.)
>
>>>> I would encourage you to look at a system like HBase for your mail
>>>> backend. HDFS doesn't work well with lots of little files, and also
>>>> doesn't support random update, so existing formats like Maildir
>>>> wouldn't be a good fit.
>
> (Apache James is closer to the Microsoft Exchange space than to
> traditional *nix mail user agents.)
>
>> I don't think I understand correctly what you mean by random updates.
>> E-mails are immutable, so once written they are not going to be
>> updated. But if you are referring to the fact that lots of (small)
>> files will be written in a directory, and that this can be a problem,
>> then I get it. This would also mean that the mbox format (all emails
>> in one file) is even less appropriate than Maildir. But since e-mails
>> are immutable, and adding a mail to the mailbox means appending a
>> small piece of data to the file, this should not be a problem if
>> Hadoop has append.
>
> Essentially, there are two classes of data that mail storage requires:
>
> 1. read-only MIME documents (mail messages) embedding meta-data
> (headers)
> 2. read-write meta-data sets about each document, including flags for
> each (virtual) mail directory containing the document
>
> The documents are searched rarely. The meta-data sets are read often
> but written rarely.
>
> I suspect that emails are relatively small in Hadoop terms, and often
> numerous. Might be interesting to see how a tuned HDFS instance
> performs when storing large numbers of small MIME documents. Should be
> easy enough to set up an experiment to benchmark. (I wonder whether a
> RESTful distributed storage solution might end up working better.)
>
> I suspect that the read-write meta-data sets will need HBase (or
> Cassandra). Would need to think carefully about the design, I think.
>
>> The presentation on Vimeo stated that HDFS 0.19 did not have append;
>> I don't know yet what the status is on that, but things are a little
>> brighter now. You could have a mailbox file that could grow to a very
>> large size. This would put all of a user's emails into one big file
>> that is easy to manage; the only thing missing is fetching the
>> emails. Since emails are appended to the file (inbox) as they come,
>> and you are usually interested in the latest emails received, you
>> could just read the tail of the file and do some indexing based on
>> that.
>
> I'm not hopeful about an append-based approach. (It might be made to
> work, but I suspect that the locking required for IMAP or POP3 is
> likely to kill performance.)
>
> Robert

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: [email protected]
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
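A few sketches to make the ideas above concrete. First, the Gora usage
Ioan mentions: a single DataStore abstraction that can sit on HBase or
Cassandra, with the backend chosen in configuration rather than code.
This is a minimal sketch, not a design: MailMessage is a hypothetical
Avro-compiled persistent bean, the row-key scheme is invented, and the
exact factory signature may differ across incubating Gora releases (the
real mapping would live in gora.properties and a gora-*-mapping.xml).

    import org.apache.gora.store.DataStore;
    import org.apache.gora.store.DataStoreFactory;
    import org.apache.hadoop.conf.Configuration;

    public class GoraMailSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // The concrete backend (HBaseStore, CassandraStore, ...) is
        // picked in gora.properties, so this code stays backend-neutral.
        DataStore<String, MailMessage> store =
            DataStoreFactory.getDataStore(String.class, MailMessage.class, conf);

        MailMessage msg = store.newPersistent();     // generated Avro bean
        msg.setSubject("Re: some guidance needed");  // hypothetical field
        store.put("user1/INBOX/0001", msg);          // key: mailbox path + uid
        store.flush();
        store.close();
      }
    }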
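Second, Robert's two classes of data, sketched on the plain HBase
client API of the day (HTable/Put). The table and column names are
invented for illustration; the point is the split: immutable message
bytes in one table, small mutable flag sets in another.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    public class MailSchemaSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        byte[] rawMime = Bytes.toBytes("From: ...\r\n\r\nbody");  // stand-in

        // 1. Read-only MIME documents: written once, never updated.
        HTable messages = new HTable(conf, "messages");
        Put doc = new Put(Bytes.toBytes("user1/0001"));
        doc.add(Bytes.toBytes("mime"), Bytes.toBytes("raw"), rawMime);
        messages.put(doc);
        messages.close();

        // 2. Read-write meta-data: per-mailbox flags for each message,
        //    read often, written rarely (e.g. when \Seen changes).
        HTable meta = new HTable(conf, "mailbox_meta");
        Put flags = new Put(Bytes.toBytes("user1/INBOX/0001"));
        flags.add(Bytes.toBytes("flags"), Bytes.toBytes("seen"),
                  Bytes.toBytes(true));
        meta.put(flags);
        meta.close();
      }
    }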
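Third, the experiment Robert suggests for the small-files question:
write a large number of small "messages" to HDFS as individual files
and time it. Paths, counts, and sizes here are arbitrary. Beyond raw
write time, note that every file also costs a NameNode entry, which is
the usual small-files objection.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SmallFileBench {
      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        byte[] fakeMessage = new byte[4 * 1024];  // ~4 KB: tiny by HDFS standards

        long start = System.nanoTime();
        for (int i = 0; i < 10000; i++) {
          FSDataOutputStream out = fs.create(new Path("/mail/bench/msg-" + i));
          out.write(fakeMessage);
          out.close();
        }
        long ms = (System.nanoTime() - start) / 1000000L;
        System.out.println("wrote 10000 small files in " + ms + " ms");
      }
    }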
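Finally, Ioan's append-based mailbox, which Robert is skeptical of. A
sketch assuming an HDFS build where append is actually enabled (it was
missing or unstable around 0.19/0.20): deliver by appending to one
per-user file, fetch recent mail by seeking to the tail. The locking
needed to make concurrent IMAP/POP3 sessions safe is exactly what this
sketch leaves out, and is Robert's performance worry.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class AppendMailboxSketch {
      static final int TAIL = 64 * 1024;  // how much of the file tail to scan

      public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(new Configuration());
        Path inbox = new Path("/mail/user1/inbox");

        // Deliver: append the new message to the single mailbox file
        // (assumes the file already exists).
        FSDataOutputStream out = fs.append(inbox);
        out.write("From ...\r\n...message...\r\n".getBytes("US-ASCII"));
        out.close();

        // Fetch recent mail: seek near the end and read only the tail,
        // then index message boundaries from there.
        long len = fs.getFileStatus(inbox).getLen();
        FSDataInputStream in = fs.open(inbox);
        in.seek(Math.max(0, len - TAIL));
        byte[] tail = new byte[(int) Math.min(len, TAIL)];
        in.readFully(tail);
        in.close();
      }
    }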
