Re: some guidance needed
Hi,

Yes, we need to store immutable mails and their associated r/w metadata.

I was wondering how a solution like the one presented in [1] could help. Twitter seems to use Protocol Buffers to store tweets.

Would a solution based on Avro be a better fit for our needs (mail storage)? With the Avro option, would each mail be an Avro file, or should we consider making each folder an Avro file and running some map/reduce jobs?

Tks,
- Eric

[1] http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter

On 19/05/2011 20:53, Robert Burrell Donkin wrote:

On Thu, May 19, 2011 at 12:04 PM, Ioan Eugen Stan stan.ieu...@gmail.com wrote:

I have forwarded this discussion to my mentors so they are informed

(I've hopped onto this list so no need to remember to copy me into the thread ;-)

snip

Eric, one of my mentors, suggested I use Gora for this and after a quick look at Gora I saw that it is an ORM for HBase and Cassandra which will allow me to switch between them. The downside with this is that Gora is still incubating so a piece of advice about using it or not is welcomed. I will also ask on the Gora mailing list to see how things are there.

(I suspect there will be a measure of experimentation required in this project, so don't be afraid to try a spike or two)

I would encourage you to look at a system like HBase for your mail backend. HDFS doesn't work well with lots of little files, and also doesn't support random update, so existing formats like Maildir wouldn't be a good fit.

(Apache James is closer to the Microsoft Exchange space than traditional *nix mail user agents)

I don't think I understand correctly what you mean by random updates. E-mails are immutable so once written they are not going to be updated. But if you are referring to the fact that lots of (small) files will be written in a directory and that this can be a problem then I get it. This will also mean that the mailbox format (all emails in one file) will be more inappropriate than Maildir. But since e-mails are immutable and adding a mail to the mailbox means appending a small piece of data to the file this should not be a problem if Hadoop has append.

Essentially, there are two classes of data that mail storage requires:

1. read only MIME documents (mail messages) embedding meta-data (headers)
2. read-write meta-data sets about each document including flags for each (virtual) mail directory containing the document

The documents are searched rarely. The meta-data sets are read often but written rarely.

I suspect that emails are relatively small in Hadoop terms, and are often numerous. Might be interesting to see how a tuned HDFS instance performs when storing large numbers of small MIME documents. Should be easy enough to set up an experiment to benchmark. (I wonder whether a RESTful distributed storage solution might end up working better.)

I suspect that the read-write meta-data sets will need HBase (or Cassandra). Would need to think carefully about design, I think.

The presentation on Vimeo stated that HDFS 0.19 did not have append. I don't know yet what the status on that is, but things are a little brighter. You could have a mailbox file that could grow to a very large size. This will lead to all the user's emails going into one big file that is easy to manage; the only thing missing is fetching the emails. Since emails are appended to the file (inbox) as they come, and you are usually interested in the latest emails received, you could just read the tail of the file and do some indexing based on that.

I'm not hopeful about adopting an append based approach. (Might be made to work but I suspect that the locking required for IMAP or POP3 is likely to kill performance.)

Robert
Re: some guidance needed
I have forwarded this discussion to my mentors so they are informed and I hope they will provide better input regarding email storage.

I second what Todd said, even with FuseHDFS, mounting HDFS as a regular file system, it won't give you the immediate response about the file status that you need. I believe Google implemented Gmail with HBase. Here is an example of implementing a mail store with Cassandra: http://ewh.ieee.org/r6/scv/computer/nfic/2009/IBM-Jun-Rao.pdf

Mark

Thanks Mark, I will look into that. I am currently watching the Cloudera Hadoop Training [1] to get a better view of how things work. I have one question: what is the defining difference between Cassandra and HBase?

Also, Eric, one of my mentors, suggested I use Gora for this and after a quick look at Gora I saw that it is an ORM for HBase and Cassandra which will allow me to switch between them. The downside with this is that Gora is still incubating so a piece of advice about using it or not is welcomed. I will also ask on the Gora mailing list to see how things are there.

I would encourage you to look at a system like HBase for your mail backend. HDFS doesn't work well with lots of little files, and also doesn't support random update, so existing formats like Maildir wouldn't be a good fit.

I don't think I understand correctly what you mean by random updates. E-mails are immutable so once written they are not going to be updated. But if you are referring to the fact that lots of (small) files will be written in a directory and that this can be a problem then I get it. This will also mean that the mailbox format (all emails in one file) will be more inappropriate than Maildir. But since e-mails are immutable and adding a mail to the mailbox means appending a small piece of data to the file this should not be a problem if Hadoop has append.

The presentation on Vimeo stated that HDFS 0.19 did not have append. I don't know yet what the status on that is, but things are a little brighter. You could have a mailbox file that could grow to a very large size. This will lead to all the user's emails going into one big file that is easy to manage; the only thing missing is fetching the emails. Since emails are appended to the file (inbox) as they come, and you are usually interested in the latest emails received, you could just read the tail of the file and do some indexing based on that.

Should I post this on the HDFS mailing-list also? I'm talking without real experience with Hadoop so shut me up if I'm wrong.

--
Todd Lipcon
Software Engineer, Cloudera

You are from Cloudera, nice. Answers straight from the source :).

[1] http://vimeo.com/3591321

Thanks,
--
Ioan-Eugen Stan
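
The append-and-read-the-tail idea from this message can be sketched as follows. This is only a minimal illustration, not James or HDFS code: an in-memory `io.BytesIO` stands in for an appendable file, and the length-prefixed record layout, the in-memory offset index and the function names (`append_message`, `read_message`, `latest_messages`) are all assumptions made for the example. It does not address the locking concern raised elsewhere in the thread.

```python
import io
import struct

# Hypothetical record layout: 4-byte big-endian length, then the raw mail bytes.
_LENGTH = struct.Struct(">I")


def append_message(mailbox, index, raw):
    """Append one immutable mail to the end of the mailbox and index its offset."""
    offset = mailbox.seek(0, io.SEEK_END)
    mailbox.write(_LENGTH.pack(len(raw)))
    mailbox.write(raw)
    index.append(offset)
    return offset


def read_message(mailbox, offset):
    """Read back the single mail that starts at the given offset."""
    mailbox.seek(offset)
    (length,) = _LENGTH.unpack(mailbox.read(_LENGTH.size))
    return mailbox.read(length)


def latest_messages(mailbox, index, count):
    """Return up to `count` mails, newest first, using the tail of the index."""
    if count <= 0:
        return []
    return [read_message(mailbox, offset) for offset in reversed(index[-count:])]
```

For example, after appending three mails, `latest_messages(mailbox, index, 2)` returns the last two in newest-first order. Existing records are never rewritten; only the end of the file and the index grow.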
Re: some guidance needed
On Thu, May 19, 2011 at 12:04 PM, Ioan Eugen Stan stan.ieu...@gmail.com wrote:

I have forwarded this discussion to my mentors so they are informed

(I've hopped onto this list so no need to remember to copy me into the thread ;-)

snip

Eric, one of my mentors, suggested I use Gora for this and after a quick look at Gora I saw that it is an ORM for HBase and Cassandra which will allow me to switch between them. The downside with this is that Gora is still incubating so a piece of advice about using it or not is welcomed. I will also ask on the Gora mailing list to see how things are there.

(I suspect there will be a measure of experimentation required in this project, so don't be afraid to try a spike or two)

I would encourage you to look at a system like HBase for your mail backend. HDFS doesn't work well with lots of little files, and also doesn't support random update, so existing formats like Maildir wouldn't be a good fit.

(Apache James is closer to the Microsoft Exchange space than traditional *nix mail user agents)

I don't think I understand correctly what you mean by random updates. E-mails are immutable so once written they are not going to be updated. But if you are referring to the fact that lots of (small) files will be written in a directory and that this can be a problem then I get it. This will also mean that the mailbox format (all emails in one file) will be more inappropriate than Maildir. But since e-mails are immutable and adding a mail to the mailbox means appending a small piece of data to the file this should not be a problem if Hadoop has append.

Essentially, there are two classes of data that mail storage requires:

1. read only MIME documents (mail messages) embedding meta-data (headers)
2. read-write meta-data sets about each document including flags for each (virtual) mail directory containing the document

The documents are searched rarely. The meta-data sets are read often but written rarely.

I suspect that emails are relatively small in Hadoop terms, and are often numerous. Might be interesting to see how a tuned HDFS instance performs when storing large numbers of small MIME documents. Should be easy enough to set up an experiment to benchmark. (I wonder whether a RESTful distributed storage solution might end up working better.)

I suspect that the read-write meta-data sets will need HBase (or Cassandra). Would need to think carefully about design, I think.

The presentation on Vimeo stated that HDFS 0.19 did not have append. I don't know yet what the status on that is, but things are a little brighter. You could have a mailbox file that could grow to a very large size. This will lead to all the user's emails going into one big file that is easy to manage; the only thing missing is fetching the emails. Since emails are appended to the file (inbox) as they come, and you are usually interested in the latest emails received, you could just read the tail of the file and do some indexing based on that.

I'm not hopeful about adopting an append based approach. (Might be made to work but I suspect that the locking required for IMAP or POP3 is likely to kill performance.)

Robert
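
The split into two classes of data described in this message could be modelled roughly as below. This is a hedged, in-memory sketch rather than an HBase or Cassandra schema: the class names (`StoredMessage`, `MessageMetadata`, `MailStore`), the choice of a SHA-256 digest as the document key, and the `(mailbox, message_id)` metadata key are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StoredMessage:
    """Class 1: an immutable MIME document, written once and never updated."""
    message_id: str  # hypothetical key: SHA-256 hex digest of the raw bytes
    raw_mime: bytes


@dataclass
class MessageMetadata:
    """Class 2: read-write metadata, one record per (mailbox, document) pair."""
    mailbox: str
    message_id: str
    flags: set = field(default_factory=set)


class MailStore:
    """Keeps the two classes of data in separate maps (stand-ins for two stores)."""

    def __init__(self):
        self._documents = {}  # message_id -> StoredMessage
        self._metadata = {}   # (mailbox, message_id) -> MessageMetadata

    def deliver(self, mailbox, raw_mime):
        """Store the document once and create metadata for this mailbox."""
        message_id = hashlib.sha256(raw_mime).hexdigest()
        self._documents.setdefault(message_id, StoredMessage(message_id, raw_mime))
        self._metadata.setdefault(
            (mailbox, message_id), MessageMetadata(mailbox, message_id)
        )
        return message_id

    def set_flag(self, mailbox, message_id, flag):
        """Update only the mutable metadata; the document itself is untouched."""
        self._metadata[(mailbox, message_id)].flags.add(flag)

    def flags(self, mailbox, message_id):
        return frozenset(self._metadata[(mailbox, message_id)].flags)

    def fetch(self, message_id):
        return self._documents[message_id].raw_mime
```

Keeping the flags per mailbox means the same document can appear in several (virtual) mail directories with different flags, while its bytes are stored only once.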
Re: some guidance needed
Hi Ioan,

I would encourage you to look at a system like HBase for your mail backend. HDFS doesn't work well with lots of little files, and also doesn't support random update, so existing formats like Maildir wouldn't be a good fit.

-Todd

On Wed, May 18, 2011 at 4:02 PM, Ioan Eugen Stan stan.ieu...@gmail.com wrote:

Hello everybody,

I'm a GSoC student for this year and I will be working on James [1]. My project is to implement email storage over HDFS. I am quite new to Hadoop and associates and I am looking for some hints on how to get started on the right track. I have installed a single node Hadoop instance on my machine and played around with it (ran some examples) but I am interested in what you (more experienced people) think is the best way to approach my problem.

I am a little puzzled by the fact that I read Hadoop is best used for large files and emails aren't that large from what I know. Another thing that crossed my mind is that since HDFS is a file system, wouldn't it be possible to set it as a back-end for the (existing) maildir and mailbox storage formats? (I think this question is more suited to the James mailing list, but if you have some ideas please speak your mind).

Also, any development resources to get me started are welcomed.

[1] http://james.apache.org/mailbox/
[2] https://issues.apache.org/jira/browse/MAILBOX-44

Regards,
--
Ioan Eugen Stan

--
Todd Lipcon
Software Engineer, Cloudera
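
One way to put a rough number on the "lots of little files" concern is a back-of-the-envelope estimate of NameNode memory. The sketch below assumes the commonly quoted rule of thumb of about 150 bytes of NameNode heap per namespace object (file or block); that figure is an assumption here, not something stated in this thread or measured, and the function name is hypothetical.

```python
# Assumed rule of thumb, not a measurement: ~150 bytes of NameNode heap
# per namespace object (each file and each block counts as one object).
BYTES_PER_OBJECT = 150


def namenode_heap_bytes(file_count, blocks_per_file=1):
    """Estimate NameNode heap used when every mail is stored as its own file."""
    objects_per_file = 1 + blocks_per_file  # the file entry plus its blocks
    return file_count * objects_per_file * BYTES_PER_OBJECT


# Ten million single-block mails, one file each.
estimate = namenode_heap_bytes(10_000_000)
print(f"{estimate / 1e9:.1f} GB of NameNode heap")
```

Under that assumption the cost scales with the number of files rather than with the total bytes stored, which is why many small mails are a worse fit than a few large files.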
Re: some guidance needed
Ioan,

I second what Todd said, even with FuseHDFS, mounting HDFS as a regular file system, it won't give you the immediate response about the file status that you need. I believe Google implemented Gmail with HBase. Here is an example of implementing a mail store with Cassandra: http://ewh.ieee.org/r6/scv/computer/nfic/2009/IBM-Jun-Rao.pdf

Mark

On Wed, May 18, 2011 at 5:05 PM, Todd Lipcon t...@cloudera.com wrote:

Hi Ioan,

I would encourage you to look at a system like HBase for your mail backend. HDFS doesn't work well with lots of little files, and also doesn't support random update, so existing formats like Maildir wouldn't be a good fit.

-Todd

On Wed, May 18, 2011 at 4:02 PM, Ioan Eugen Stan stan.ieu...@gmail.com wrote:

Hello everybody,

I'm a GSoC student for this year and I will be working on James [1]. My project is to implement email storage over HDFS. I am quite new to Hadoop and associates and I am looking for some hints on how to get started on the right track. I have installed a single node Hadoop instance on my machine and played around with it (ran some examples) but I am interested in what you (more experienced people) think is the best way to approach my problem.

I am a little puzzled by the fact that I read Hadoop is best used for large files and emails aren't that large from what I know. Another thing that crossed my mind is that since HDFS is a file system, wouldn't it be possible to set it as a back-end for the (existing) maildir and mailbox storage formats? (I think this question is more suited to the James mailing list, but if you have some ideas please speak your mind).

Also, any development resources to get me started are welcomed.

[1] http://james.apache.org/mailbox/
[2] https://issues.apache.org/jira/browse/MAILBOX-44

Regards,
--
Ioan Eugen Stan

--
Todd Lipcon
Software Engineer, Cloudera