[jira] [Commented] (MAILBOX-44) [gsoc2011] Design and implement a distributed mailbox using Hadoop

Norman Maurer (JIRA) Tue, 14 Jun 2011 22:53:00 -0700

    [ 
https://issues.apache.org/jira/browse/MAILBOX-44?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13049618#comment-13049618
 ]


Norman Maurer commented on MAILBOX-44:
--------------------------------------

@stack:

First off, welcome :)

I wrote a few of the other mailbox implementations in JAMES, so maybe I can 
answer your questions (concerns) ;) I also wrote a prototype for a mailbox on 
top of Cassandra, which is not too different in terms of "limitations".

So here we go:

I think putting all the mail for a mailbox in one row will not work, as really 
big mailboxes are quite common these days. This would just limit the 
distribution a lot (as you already pointed out). So let me try to explain how I 
did the schema for Cassandra; maybe it also fits HBase (I have not had the time 
to dig deeper into it).

* one row for the mailbox meta data (mailboxId, uidvalidity, namespace, 
username ...). 
* one row for the message metadata ( mailboxId, uid, size, headers, flags, 
messagecontentId...). 
* one row per message content, where I split the message content into 1 MB 
parts and put each "raw" byte[] in a new column. This makes sure we don't get 
columns that are too big (not sure if this is also needed for HBase; in 
Cassandra big columns are a problem).
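
A minimal sketch of the 1 MB splitting described in the last bullet, in plain 
Python rather than against a real Cassandra or HBase client. The column names 
("part-00000", ...) and the helper names are assumptions made for illustration; 
they are not taken from the prototype.

```python
from typing import Dict

PART_SIZE = 1024 * 1024  # 1 MB per column, as in the schema described above


def split_content(raw: bytes, part_size: int = PART_SIZE) -> Dict[str, bytes]:
    """Map hypothetical column names to consecutive slices of the raw message."""
    return {
        # Zero-padded index so that sorting the names restores the part order.
        f"part-{offset // part_size:05d}": raw[offset:offset + part_size]
        for offset in range(0, len(raw), part_size)
    }


def join_content(columns: Dict[str, bytes]) -> bytes:
    """Reassemble the message content from the columns of one row."""
    return b"".join(columns[name] for name in sorted(columns))
```

Each dictionary entry stands for one column in the message content row, so no 
single column grows beyond `part_size` bytes.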

For queries there are the following:
* retrieve all messages which have the recent flag set
* retrieve all messages which have the sent flag set
* retrieve all messages with uid <, = or > X
* retrieve all messages with the deleted flag set
* retrieve all mailboxes with name like '%X%'
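
To make the access patterns concrete, here is a rough in-memory sketch of the 
queries listed above. It assumes the uid query is a comparison with <, = or >, 
and that the mailbox name match is a case-sensitive substring match. 
`MessageMeta` and the function names are hypothetical, and a real store would 
need an index or a suitable row key for each query instead of a full scan.

```python
import operator
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class MessageMeta:
    """Hypothetical subset of the message metadata row."""
    mailbox_id: str
    uid: int
    flags: FrozenSet[str] = frozenset()


# Supported uid comparisons (assumed reading of the uid query).
_COMPARATORS = {"<": operator.lt, "=": operator.eq, ">": operator.gt}


def with_flag(messages: Iterable[MessageMeta], flag: str) -> List[MessageMeta]:
    """Messages that have the given flag set (recent, deleted, ...)."""
    return [m for m in messages if flag in m.flags]


def by_uid(messages: Iterable[MessageMeta], op: str, x: int) -> List[MessageMeta]:
    """Messages whose uid compares to x with the given operator."""
    compare = _COMPARATORS[op]
    return [m for m in messages if compare(m.uid, x)]


def mailboxes_like(names: Iterable[str], fragment: str) -> List[str]:
    """Mailbox names containing fragment, i.e. name like '%fragment%'."""
    return [n for n in names if fragment in n]
```

Every function here scans all entries, which is exactly what becomes expensive 
in a distributed store once a mailbox gets large.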

Then IMAP also allows you to build your own search query, which is really 
problematic with NoSQL stores, or even with SQL stores. It basically allows the 
user to do any kind of filtering, which in fact just sucks when you don't have 
the indexes set. So we have a Lucene index for that atm. I plan to write one in 
Solr too.

Threading is not supported atm but is on my todo list.

Hope this helps, just ask if you need more info.

> [gsoc2011] Design and implement a distributed mailbox using Hadoop
> ------------------------------------------------------------------
>
>                 Key: MAILBOX-44
>                 URL: https://issues.apache.org/jira/browse/MAILBOX-44
>             Project: James Mailbox
>          Issue Type: New Feature
>            Reporter: Eric Charles
>            Assignee: Norman Maurer
>              Labels: gsoc2011
>             Fix For: 0.3
>
>
> Context: The mailbox subproject (http://james.apache.org/mailbox/) supports 
> maildir, SQL database (via JPA) and Java Content Repository (JCR) as 
> technology for mail storage. This flexibility is achieved thanks to an API 
> design that abstracts mail storage from the mail protocols.
> Task: We need to implement mailbox storage as a distributed system on top of 
> Hadoop HDFS. The James mailbox API will be used. A first step is to design 
> how to interact with Hadoop (native api, gora incubator at apache,...) and 
> deal with specific performance questions related to mail loading/parsing in a 
> distributed system (use map/reduce or not, use existing local lucene indexes 
> for search,...). The second step is to implement the HDFS mailbox (maildir 
> mailbox is similar because it stores mails as files and can be an 
> inspiration). A single James server will still be deployed because we don't 
> have any distributed UID generation.
> Mentor: eric at apache dot org
> Complexity: medium 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

