Hi Eric,

comments inside...

On 24.05.2011 06:08, Eric Charles wrote:
Hi,

For the immutable mails:

1. If we store each mail in a file, we don't have to alter it, but we face a performance issue because reading small files in Hadoop seems expensive (not performant).
Seems like it, yeah..

2. If we store each folder in a file, we may have fewer performance issues on read (larger files), but we face the issue that we cannot alter the content (only append!). So it does not sound like an option.
Well, we could just keep some kind of info about which mails are deleted and skip them while reading from the file. This would still require cleaning up "deleted" messages later somehow. Not sure if it makes sense given the complexity it would introduce..
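To make the idea concrete, here is a minimal Java sketch of such a
tombstone filter (all names are hypothetical, nothing of this exists in
James today):

import java.util.HashSet;
import java.util.Set;

public class AppendOnlyFolderReader {

    // UIDs appended to a "deleted" side log; loaded once per read pass.
    private final Set<Long> deletedUids = new HashSet<Long>();

    public void markDeleted(long uid) {
        // On HDFS this would be an append to the folder's deletion log.
        deletedUids.add(uid);
    }

    // Skip tombstoned messages while scanning the folder file.
    public boolean isVisible(long uid) {
        return !deletedUids.contains(uid);
    }

    // A periodic compaction job would rewrite the folder file without
    // the deleted messages and then truncate the deletion log.
}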



For associated metadata, maildir offers this functionality by using the file name as a metadata container. On change, the file is renamed, adding some flags,... which is possible with Hadoop (see [4] for example operations on HDFS). Once again, at the price of performance for small files.
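For example, a maildir-style flag change could be just a rename via the
FileSystem API; a small sketch (the paths and flag encoding below are
made up for illustration):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class MaildirStyleFlags {

    // Mark a mail as seen by renaming it, maildir-style: the flags
    // live in the file name, so only NameNode metadata changes.
    public static void setSeen(Configuration conf, String mailbox, String uid)
            throws Exception {
        FileSystem fs = FileSystem.get(conf);
        Path current = new Path("/james/" + mailbox + "/" + uid + ":2,");
        Path seen = new Path("/james/" + mailbox + "/" + uid + ":2,S");
        fs.rename(current, seen);
    }
}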

As Robert suggested in [1], a benchmark could be set up, but we would need a realistic cluster (numerous hardware machines with a replication factor >= 3) and a large dataset (millions of mails) to get representative numbers.

On the possible file formats, we have limited options (Hadoop calls these Writables): Text or BytesWritable. There are also file-based data structures: SequenceFile and MapFile.
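As a sketch of what writing mails to a per-folder SequenceFile could
look like (the /james/INBOX.seq layout is just an assumption):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class MailSequenceFileWriter {

    // Append one mail, keyed by its UID, to a per-folder SequenceFile.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path folder = new Path("/james/INBOX.seq");

        SequenceFile.Writer writer = SequenceFile.createWriter(
                fs, conf, folder, Text.class, BytesWritable.class);
        try {
            byte[] rawMail = "From: user@example.org ...".getBytes("US-ASCII");
            writer.append(new Text("uid-1"), new BytesWritable(rawMail));
        } finally {
            writer.close();
        }
    }
}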

I also answered on [1] asking what Hadoop can offer in regard to the Avro format (see also [5] on Protocol Buffers, an Avro-like format, and their usage at Twitter). I don't know if the Avro file format changes anything about the considerations exposed above...

In this Hadoop approach, we also need to ask how we get/query the information. Do we directly read the Hadoop Writable/file via the io API, or use a map/reduce job? The map/reduce job result would be stored in an output file which must in turn be read again, which sounds like a bit too much to me...
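Reading directly via the io API would at least avoid the map/reduce
round-trip; a sketch continuing the hypothetical SequenceFile layout
from above:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class MailSequenceFileReader {

    // Scan the folder file directly with the io API, no map/reduce job.
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        SequenceFile.Reader reader =
                new SequenceFile.Reader(fs, new Path("/james/INBOX.seq"), conf);
        try {
            Text uid = new Text();
            BytesWritable mail = new BytesWritable();
            while (reader.next(uid, mail)) {
                System.out.println(uid + ": " + mail.getLength() + " bytes");
            }
        } finally {
            reader.close();
        }
    }
}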

Now, if we find all this too challenging and we are not sure we will get a performant solution, HBase, for example, is a proven solution and offers structured storage on top of Hadoop.

There are some ORMs around (like DataNucleus JDO,...), but the HBase native API is rich enough and should do the job for us without an additional layer.

+1, for no ORM ;)
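For illustration, storing a mail plus a mutable flag with the native
client API could look like this minimal sketch (table, row key, and
column names are invented):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseMailStore {

    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        HTable table = new HTable(conf, "mailbox");

        // Immutable body and mutable metadata in separate column families.
        byte[] row = Bytes.toBytes("user1/INBOX/uid-1");
        Put put = new Put(row);
        put.add(Bytes.toBytes("body"), Bytes.toBytes("raw"),
                Bytes.toBytes("From: ..."));
        put.add(Bytes.toBytes("meta"), Bytes.toBytes("seen"),
                Bytes.toBytes(false));
        table.put(put);

        // Read it back; flipping the flag later is just another Put.
        Result result = table.get(new Get(row));
        byte[] seen = result.getValue(Bytes.toBytes("meta"),
                Bytes.toBytes("seen"));
        System.out.println("seen: " + Bytes.toBoolean(seen));
        table.close();
    }
}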


I am following the Apache Gora incubating mailing list as it seems to have much to offer (persistence towards HBase, Cassandra,... indexing...), but lately the project has seemed quiet. This doesn't mean today's functionality is not usable for us.

Another question is about the potential usage of the existing Lucene index to help us with the queries (for IMAP, currently in the mailbox-store project). It would be a nice solution to use, but today the index is local (not distributed). It's a work in progress and can evolve towards distribution. I don't think we need to decide on this now, but the question will come up one day.
Unfortunately the Lucene index is not complete yet; it's still on my todo list ;)


Tks,

- Eric

[4] http://myjavanotebook.blogspot.com/2008/05/hadoop-file-system-tutorial.html
[5] http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter

On 24/05/2011 00:01, Ioan Eugen Stan wrote:
Hello,


I had some discussions with Eric about what will be the best way to
implement the mailbox over HDFS and we agreed that it's better to
inform the list about the situation.

The project idea that I applied for is to implement James mailbox
storage over Hadoop HDFS and one of the first steps was to find the
best way to interact with Hadoop. So I just did that. I have spent the
last week or so trying to figure out the best way to implement the
mailbox over Hadoop. I found the training videos from Cloudera to be
very helpful [2].
I also wrote on the Hadoop mailing list to ask them for an opinion
(before watching the videos). You can read the discussion here [1].

I have come to the conclusion that there is no easy way to implement
the mailbox directly over HDFS, and my opinion is to use HBase, either
directly or over Gora. I will support my statement with some of the
things I found out.

First, about email:
- emails are essentially immutable. Once created, they are not modified.
- meta information is read/write (like the status: read/unread);
maybe other stuff, I still have to get up to date.
- you can delete an email, but other than that you can't modify it.
- users usually access only the last 50-100 emails (my observation)

About HDFS:

- is designed to work well with large data, on the order of magnitude
of GB and beyond. It has a block size >= 64 MB. This means fewer disk
seeks when reading a file, because the file is less fragmented. Bulk
reads and writes enable HDFS to perform better: all the data is in one
place, and you have a small number of open file handles, which means
less overhead.
- does not provide random file alteration. HDFS only supports APPENDING
information at the end of an existing file. If you need to modify a
file, the only way to do it is to create a new file with the
modifications (see the sketch after this list).
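A rough Java sketch of that create-a-new-file pattern (a hypothetical
helper; the actual content filtering is left out):

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;

public class RewriteInsteadOfModify {

    // "Modify" a file on HDFS: stream the old content into a new file
    // (applying the changes on the way), then swap it in via rename.
    public static void rewrite(FileSystem fs, Path original) throws Exception {
        Path tmp = new Path(original + ".rewrite");
        FSDataInputStream in = fs.open(original);
        FSDataOutputStream out = fs.create(tmp);
        try {
            IOUtils.copyBytes(in, out, 4096, false); // insert changes here
        } finally {
            in.close();
            out.close();
        }
        fs.delete(original, false);
        fs.rename(tmp, original);
    }
}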


HBase:

- is a NoSQL implementation over Hadoop.
- provides the user with a way to store information and access it very
easily based on some keys.
- provides a way to modify data by keeping a log, similar to the
way journaling file systems work: it appends all the modifications to a
log file. When certain conditions are met, the log file is merged back
into the "database".


My conclusions:

Because emails are small and parts of them need to be changed, storing
them in a filesystem that was designed for large files and that does
not provide a way to modify those files is not a sensible thing to do.

I see a couple of choices:

1. we use HBase
2. we keep the meta information in a separate database, outside
Hadoop, but things will not scale very well.
3. we design it on top of HDFS, but essentially we (I) will end up
solving the same problems that HBase solved

The easiest and most straightforward solution is to use HBase. There is
a paper [3] that shows some results with an email store based on
Cassandra, so storing email in a NoSQL store is proven to work.
I am thinking of using Gora and avoiding the HBase API directly. This
would ensure that James could use any NoSQL storage that Gora can
access. What holds me back is that Gora does not seem to be very
active, and it's also incubating, so I may run into problems that are
not easy to get out of.


What do you think?


[1] http://comments.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/26022
[2] http://vimeo.com/search/videos/search:cloudera/st/48b36a32
[3] http://ewh.ieee.org/r6/scv/computer/nfic/2009/IBM-Jun-Rao.pdf


Bye,
Norman
