See my comments inline.
Tks,
- Eric

On 24/05/2011 07:44, Norman wrote:

<snip>

First, about email :
- emails are essentially immutable. Once created they are not modified.
- meta information is read/write (like the status - read/unread).
maybe other stuff, I still have to get up to date.
The only read-write you need to care about are the "FLAGS". Nothing else
is allowed to get changed once the mail is stored.
So you have:
- Append message + metadata
- Delete message + metadata
- Change FLAGS which is stored as metadata


Very good summary :)
I would also add the "mailbox" to the message metadata.
Maybe it is implicit when you say "message", but depending on the choices, the way we implement it may vary completely. The mailbox of the message is r/w because a user can move a message from one mailbox to another.
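
To make the model above concrete, here is a rough sketch in Python. It is purely illustrative: the names MailStore, Message and Metadata are made up for this mail and are not the James Mailbox API. It assumes an immutable message body plus mutable metadata (FLAGS and mailbox). A real IMAP store assigns UIDs per mailbox; a single global counter is used here only to keep the sketch short.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Iterable, Tuple


@dataclass(frozen=True)
class Message:
    """Immutable part: never changes once the mail is stored."""
    uid: int
    raw: bytes


@dataclass
class Metadata:
    """Mutable part: the mailbox the message lives in and its FLAGS."""
    mailbox: str
    flags: FrozenSet[str] = frozenset()


class MailStore:
    """Hypothetical in-memory store with the three operations from the thread."""

    def __init__(self) -> None:
        self._messages: Dict[int, Message] = {}
        self._metadata: Dict[int, Metadata] = {}
        self._next_uid = 1

    def append(self, mailbox: str, raw: bytes, flags: Iterable[str] = ()) -> int:
        """Append message + metadata, return the new UID."""
        uid = self._next_uid
        self._next_uid += 1
        self._messages[uid] = Message(uid, raw)
        self._metadata[uid] = Metadata(mailbox, frozenset(flags))
        return uid

    def delete(self, uid: int) -> None:
        """Delete message + metadata."""
        del self._messages[uid]
        del self._metadata[uid]

    def set_flags(self, uid: int, flags: Iterable[str]) -> None:
        """Change FLAGS; only the metadata record is touched."""
        self._metadata[uid].flags = frozenset(flags)

    def move(self, uid: int, mailbox: str) -> None:
        """Move to another mailbox; again only metadata changes."""
        self._metadata[uid].mailbox = mailbox

    def get(self, uid: int) -> Tuple[Message, Metadata]:
        return self._messages[uid], self._metadata[uid]
```

The point of splitting the two records is that a flag change or a move never rewrites the message bytes, which matters for whatever storage ends up holding the body.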

<snip>


- you can delete an email, but other than that you can't modify it.
- users usually access the last 50-100 emails (my observation)

Kind of... you often see an IMAP client do some "big" FETCH on the
first connect to see if there are changes in the mailbox. Like

a FETCH 1:* (FLAGS)


Yes, I regularly see that when I debug IMAP traffic with Wireshark. The full fetch can take some time for large mailboxes...

This will hopefully get improved when Apache James IMAP supports the
CONDSTORE[a] and QRESYNC[b] extensions. But that's on my todo list ;)
Unfortunately this will need a change to the API of the current mailbox
release (0.2), but that's not something you should care about atm. Just use
the 0.2 release for your development.
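
For what it is worth, the idea behind CONDSTORE as I read RFC 4551 can be sketched like this: the server keeps a modification sequence number per message, the client remembers the highest value it has seen, and on reconnect it asks only for what changed since then (something like FETCH 1:* (FLAGS) (CHANGEDSINCE 12345)) instead of fetching every flag. The FlagStore class below is a made-up in-memory illustration of that bookkeeping, not the James implementation.

```python
from typing import Dict, FrozenSet, Iterable, List, Tuple


class FlagStore:
    """Hypothetical flag store with a mailbox-wide modification counter."""

    def __init__(self) -> None:
        self.highest_modseq = 0
        self._flags: Dict[int, FrozenSet[str]] = {}
        self._modseq: Dict[int, int] = {}

    def set_flags(self, uid: int, flags: Iterable[str]) -> int:
        """Store flags and stamp the message with a fresh mod-sequence."""
        self.highest_modseq += 1
        self._flags[uid] = frozenset(flags)
        self._modseq[uid] = self.highest_modseq
        return self.highest_modseq

    def fetch_all(self) -> List[Tuple[int, FrozenSet[str]]]:
        """Equivalent of the 'big' FETCH 1:* (FLAGS): every message."""
        return [(uid, self._flags[uid]) for uid in sorted(self._flags)]

    def changed_since(self, modseq: int) -> List[Tuple[int, FrozenSet[str]]]:
        """Only messages whose flags changed after the given mod-sequence."""
        return [
            (uid, self._flags[uid])
            for uid in sorted(self._flags)
            if self._modseq[uid] > modseq
        ]
```

A client that stored highest_modseq at disconnect would call changed_since with that value on the next connect, so the cost depends on the number of changes and not on the mailbox size.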


Yes, let's stick to the 0.2 release so we are not impacted by upcoming changes in trunk.


About HDFS:

- is designed to work well with large data, on the order of magnitude
of GB and beyond. It uses large blocks (64 MB or more). This means fewer
disk seeks when reading a file, because the file is less fragmented.
Using bulk reads and writes enables HDFS to perform better: all the
data is in one place, and you have a small number of open file
handles, which means less overhead.
- does not provide random file alteration. HDFS only supports appending
information at the end of an existing file. If you need to modify a
file, the only way to do it is to create a new file with the
modifications.

I thought we could do something similar to maildir, which uses the
filename as a "meta-data" container.
See [c] and [d]. Not sure about the "small" file problem here ;)
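
As a rough illustration of the maildir trick: per [c], a file in cur/ is named "unique:2,FLAGS", where FLAGS is a string of single letters (for example S for seen, F for flagged) kept in ASCII order. Changing a flag is then a rename and the message content is never rewritten. The helper names below are made up for this sketch.

```python
from typing import FrozenSet, Iterable, Tuple

# Separator between the unique name and the flag letters (maildir "2," info).
INFO_SEPARATOR = ":2,"


def parse_name(filename: str) -> Tuple[str, FrozenSet[str]]:
    """Split a maildir filename into its unique part and its flag letters."""
    unique, sep, letters = filename.partition(INFO_SEPARATOR)
    if not sep:
        # No info part yet, e.g. a message that has not been flagged.
        return unique, frozenset()
    return unique, frozenset(letters)


def build_name(unique: str, flags: Iterable[str]) -> str:
    """Build the filename for a unique part and a set of flag letters."""
    return unique + INFO_SEPARATOR + "".join(sorted(flags))


def with_flag(filename: str, letter: str) -> str:
    """Return the new filename after adding one flag (i.e. the rename target)."""
    unique, flags = parse_name(filename)
    return build_name(unique, flags | {letter})
```

Whether this maps well onto HDFS is a separate question: a rename there should not touch the data blocks, but the many-small-files concern stays.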


Yes, I have no experience with many small files in Hadoop either, but let's trust what the Hadoop community says and writes :)

HBase:

- is a NoSQL implementation over Hadoop.
- provides the user a way to store information and access it very
easily based on some keys.
- provides a way to modify the files by keeping a log, similar to the
way journaling file systems work: it appends all the modifications to a
log file. When certain conditions are met the log file is merged back
into the "database".
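
The "append to a log, merge later" idea from the last bullet can be sketched as below. This is only the simplified picture described here, with a made-up class name and a made-up merge condition (a fixed number of log entries). HBase itself is more involved (write-ahead log, MemStore, HFiles, compactions), so please do not read this as its actual design.

```python
from typing import Dict, List, Optional, Tuple


class LogStructuredStore:
    """Hypothetical key/value store that never modifies data in place."""

    def __init__(self, merge_threshold: int = 4) -> None:
        self._base: Dict[str, bytes] = {}           # merged "database"
        self._log: List[Tuple[str, Optional[bytes]]] = []  # pending changes
        self._merge_threshold = merge_threshold

    def put(self, key: str, value: bytes) -> None:
        self._append(key, value)

    def delete(self, key: str) -> None:
        # A delete is just another log entry (value None marks removal).
        self._append(key, None)

    def get(self, key: str) -> Optional[bytes]:
        # The newest log entry wins; fall back to the merged data.
        for logged_key, value in reversed(self._log):
            if logged_key == key:
                return value
        return self._base.get(key)

    def merge(self) -> None:
        """Replay the log into a new copy of the base, then drop the log."""
        merged = dict(self._base)
        for key, value in self._log:
            if value is None:
                merged.pop(key, None)
            else:
                merged[key] = value
        self._base = merged
        self._log = []

    def _append(self, key: str, value: Optional[bytes]) -> None:
        self._log.append((key, value))
        if len(self._log) >= self._merge_threshold:
            self.merge()
```

Every write is an append, which is the only kind of write HDFS offers, and the merge step produces a new file instead of altering an old one.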


HBase sounds like a good fit ...


+1

HBase is not difficult to install, well documented and the client API is very well done. Facebook's mailing system is built upon it.

My conclusions:

Because emails are small and part of their data needs to be changed,
storing them directly in a filesystem that was designed for large
files and that does not provide a way to modify those files is not a
sensible thing to do.

I see a couple of choices:

1. we use HBase
2. we keep the meta information in a separate database, outside
Hadoop, but things will not scale very well.
3. we design it on top of HDFS, but essentially we (I) will end up
solving the same problems that HBase solved

Using a separate database for meta-information will only work if we can
store it in a distributed fashion. Otherwise it
just "kills" all the benefits of Hadoop. Maybe storing the "meta-data"
in a distributed SOLR index could do the trick, not sure.

The easiest and most straightforward solution is to use HBase. There is
a paper [3] that shows some results with an email store based on
Cassandra, so the approach has been shown to work.
I wrote a prototype which uses Cassandra for Apache James Mailbox, which
is not Open-Source (yet?). It works quite well but suffers from the lack
of any locking, so you need some distributed locking
service like Hazelcast [e]. So using NoSQL should work without problems,
you just need to keep in mind how the data is accessed.

I am thinking of using Gora and avoiding direct use of the HBase API.
This will ensure that James could use any NoSQL storage that Gora can
access. What holds me back is that Gora does not seem to be very
active and it's also incubating, so I may run into things that are not
easy to get out of.

Maybe it's just me, but I still think an ORM mapper just cannot work well
in the NoSQL world, as you need to design your "storage" around the way
you access the data. I would probably just use the HBase API.
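
As an example of designing the storage around the access pattern: HBase keeps rows sorted by the bytes of the row key, and users mostly read the newest 50-100 mails of a mailbox. One possible (untested, purely hypothetical) key layout is mailbox id followed by the inverted UID, so that a scan over one mailbox returns the newest messages first. The 64-bit mailbox id is an assumption of this sketch; IMAP UIDs are 32-bit.

```python
import struct
from typing import List, Tuple

MAX_UID = 0xFFFFFFFF  # IMAP UIDs are 32-bit unsigned values
KEY_FORMAT = ">QI"    # big-endian: 8-byte mailbox id, 4-byte inverted UID


def row_key(mailbox_id: int, uid: int) -> bytes:
    """Build a row key that sorts by mailbox, then by newest UID first."""
    return struct.pack(KEY_FORMAT, mailbox_id, MAX_UID - uid)


def parse_row_key(key: bytes) -> Tuple[int, int]:
    """Recover (mailbox_id, uid) from a row key."""
    mailbox_id, inverted = struct.unpack(KEY_FORMAT, key)
    return mailbox_id, MAX_UID - inverted


def newest(sorted_keys: List[bytes], mailbox_id: int, limit: int) -> List[int]:
    """Simulate a prefix scan with a row limit over byte-sorted keys."""
    prefix = struct.pack(">Q", mailbox_id)
    uids = []
    for key in sorted_keys:
        if key.startswith(prefix):
            uids.append(parse_row_key(key)[1])
            if len(uids) == limit:
                break
    return uids
```

A generic mapper would have a hard time expressing this kind of decision, which is the reason for preferring the native API here.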

What do you think?


[1] http://comments.gmane.org/gmane.comp.jakarta.lucene.hadoop.user/26022
[2] http://vimeo.com/search/videos/search:cloudera/st/48b36a32
[3] http://ewh.ieee.org/r6/scv/computer/nfic/2009/IBM-Jun-Rao.pdf
Hope it helps,
Norman

[a] http://tools.ietf.org/html/rfc4551
[b] http://tools.ietf.org/search/rfc5162
[c] http://cr.yp.to/proto/maildir.html
[d] http://www.courier-mta.org/imap/README.maildirquota.html
[e] http://www.hazelcast.com/

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


