Hello,
I had some discussions with Eric about the best way to implement the mailbox over HDFS, and we agreed that it is better to inform the list about the situation.

The project idea that I applied for is to implement James mailbox storage over Hadoop HDFS, and one of the first steps was to find the best way to interact with Hadoop. I have spent the last week or so on exactly that. I found the training videos from Cloudera very helpful [2]. Before watching the videos, I also wrote to the Hadoop mailing list to ask for an opinion; you can read the discussion here [1].

I have come to the conclusion that there is no easy way to implement the mailbox directly over HDFS, and my opinion is to use HBase, either directly or through Gora. I will support my statement with some of the things I found out.

First, about email:
- emails are essentially immutable. Once created, they are not modified.
- meta information is read/write (like the status: read/unread). There may be other such fields; I still have to get up to date.
- you can delete an email, but other than that you can't modify it.
- users usually access only the last 50-100 emails (my observation).

About HDFS:
- is designed to work well with large data, on the order of gigabytes and beyond. It uses large blocks (64 MB or more), which means fewer disk seeks when reading a file, because the file is less fragmented. Bulk reads and writes let HDFS perform better: all the data is in one place and only a small number of file handles are open, which means less overhead.
- does not provide random file alteration. HDFS only supports appending data to the end of an existing file. If you need to modify a file, the only way to do it is to create a new file with the modifications.

HBase:
- is a NoSQL database built on top of Hadoop.
- provides the user a way to store information and access it very easily based on keys.
- provides a way to modify the files by keeping a log, similar to the way journaling file systems work: it appends all the modifications to a log file. When certain conditions are met, the log file is merged back into the "database".

My conclusions:

Because emails are small and part of each one needs to change, storing them in a filesystem that was designed for large files and does not provide a way to modify them is not a sensible thing to do. I see a couple of choices:

1. we use HBase
2. we keep the meta information in a separate database, outside Hadoop, but things will not scale very well.
3. we design it on top of HDFS, but essentially we (I) will end up solving the same problems that HBase has already solved.

The easiest and most straightforward solution is to use HBase. There is a paper [3] that shows some results with an email store based on Cassandra, so an email store on top of a NoSQL database has been shown to work.

I am thinking of using Gora rather than the HBase API directly. This would let James use any NoSQL store that Gora can access. What holds me back is that Gora does not seem to be very active and is still incubating, so I may run into problems that are not easy to get out of.

What do you think?

[1] http://comments.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/26022
[2] http://vimeo.com/search/videos/search:cloudera/st/48b36a32
[3] http://ewh.ieee.org/r6/scv/computer/nfic/2009/IBM-Jun-Rao.pdf

--
Ioan-Eugen Stan
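
P.S. To make the log-and-merge idea from the HBase notes above a bit more concrete, here is a toy sketch. It is written in Python only to keep it short, it is not how HBase is actually implemented, and every name in it (LogStructuredMailbox, merge_threshold, the "user1/0001" style keys) is made up for illustration. It only shows the principle: message bodies are written once, flag changes and deletes are appended to a log, and the log is merged into the main store when a condition is met (here, simply the log length).

```python
from dataclasses import dataclass, field


@dataclass
class LogStructuredMailbox:
    """Toy model: append-only change log merged periodically into a base store."""

    # Merged store: message key -> (immutable body, frozenset of flags).
    base: dict = field(default_factory=dict)
    # Append-only log of pending changes; entries are never edited in place.
    log: list = field(default_factory=list)
    # Hypothetical merge condition: merge once the log holds this many entries.
    merge_threshold: int = 4

    def _append(self, entry):
        self.log.append(entry)
        if len(self.log) >= self.merge_threshold:
            self.merge()

    def add_message(self, key, body):
        self._append(("add", key, body))

    def set_flag(self, key, flag, value):
        self._append(("flag", key, flag, value))

    def delete_message(self, key):
        self._append(("delete", key))

    @staticmethod
    def _apply(state, entry):
        """Apply one log entry to a key -> (body, flags) mapping."""
        op, key = entry[0], entry[1]
        if op == "add":
            state[key] = (entry[2], frozenset())
        elif op == "flag" and key in state:
            body, flags = state[key]
            flag, value = entry[2], entry[3]
            state[key] = (body, flags | {flag} if value else flags - {flag})
        elif op == "delete":
            state.pop(key, None)

    def merge(self):
        """Fold all pending log entries into the base store, then empty the log."""
        for entry in self.log:
            self._apply(self.base, entry)
        self.log.clear()

    def read(self, key):
        """Return (body, flags) as seen through the base store plus pending log."""
        state = {key: self.base[key]} if key in self.base else {}
        for entry in self.log:
            if entry[1] == key:
                self._apply(state, entry)
        return state.get(key)
```

For example, after add_message("user1/0001", "Hello") and set_flag("user1/0001", "seen", True), read("user1/0001") already reflects the flag even though nothing has been merged yet. The body is never rewritten; only small log entries are appended, which is the property that fits the append-only nature of HDFS.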