Re: some guidance needed

2011-05-23 Thread Eric Charles

Hi,

Yes, we need to store immutable mails and their associated r/w metadata.

I was wondering how a solution like the one presented in [1]
could help. Twitter seems to use Protocol Buffers to store tweets.


Would a solution based on Avro be a better fit for our needs (mail storage)?

In this Avro option, would each mail be an Avro file, or should we
consider making the folder an Avro file and running some map/reduce jobs?


Thanks,

- Eric

[1] http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter



On 19/05/2011 20:53, Robert Burrell Donkin wrote:

On Thu, May 19, 2011 at 12:04 PM, Ioan Eugen Stan stan.ieu...@gmail.com wrote:

I have forwarded this discussion to my mentors so they are informed


(I've hopped onto this list so no need to remember to copy me into the
thread ;-)

snip


Eric, one of my mentors, suggested I use Gora for
this and after a quick look at Gora I saw that it is an ORM for HBase
and Cassandra which will allow me to switch between them. The downside
is that Gora is still incubating, so advice about whether to use it
would be welcome. I will also ask on the Gora mailing list
to see how things are there.


(I suspect there will be a measure of experimentation required in this
project, so don't be afraid to try a spike or two)


I would encourage you to look at a system like HBase for your mail
backend. HDFS doesn't work well with lots of little files, and also
doesn't support random update, so existing formats like Maildir
wouldn't be a good fit.


(Apache James is closer to the Microsoft Exchange space than traditional
*nix mail user agents)


I don't think I understand correctly what you mean by random updates.
E-mails are immutable so once written they are not going to be
updated. But if you are referring to the fact that lots of (small)
files will be written in a directory and that this can be a problem
then I get it. This will also mean that mailbox format (all emails in
one file) will be even less appropriate than Maildir. But since e-mails
are immutable and adding a mail to the mailbox means appending a small
piece of data to the file this should not be a problem if Hadoop has
append.


Essentially, there are two classes of data that mail storage requires:

1. read-only MIME documents (mail messages) embedding meta-data (headers)
2. read-write meta-data sets about each document, including flags for
each (virtual) mail directory containing the document

The documents are searched rarely. The meta-data sets are read often
but written rarely.

I suspect that emails are relatively small in Hadoop terms, and are
often numerous. Might be interesting to see how a tuned HDFS instance
performs when storing large numbers of small MIME documents. Should be
easy enough to set up an experiment to benchmark. (I wonder whether a
RESTful distributed storage solution might end up working better.)

I suspect that the read-write meta-data sets will need HBase (or
Cassandra). Would need to think carefully about design, I think.


The presentation on Vimeo stated that HDFS 0.19 did not have append.
I don't know yet what the status of that is, but things are a little
brighter. You could have a mailbox file that could grow to a very
large size. This would put all of a user's emails into one big file
that is easy to manage; the only thing missing is a way of
fetching the emails. Since emails are appended to the file (inbox) as
they come, and you are usually interested in the latest emails
received, you could just read the tail of the file and do some indexing
based on that.


I'm not hopeful about adopting an append based approach. (Might be
made to work but I suspect that the locking required for IMAP or POP3
is likely to kill performance.)

Robert




Re: some guidance needed

2011-05-19 Thread Ioan Eugen Stan
I have forwarded this discussion to my mentors so they are informed
and I hope they will provide better input regarding email storage.

 I second what Todd said. Even with FuseHDFS, which mounts HDFS as a regular
 file system, you won't get the immediate response about the file status that
 you need. I believe Google implemented Gmail with HBase. Here is an example
 of implementing a mail store with Cassandra:
 http://ewh.ieee.org/r6/scv/computer/nfic/2009/IBM-Jun-Rao.pdf

 Mark

Thanks Mark, I will look into that. I am currently watching the Cloudera
Hadoop Training [1] to get a better view of how things work.

I have one question: what is the defining difference between Cassandra
and HBase? Also, Eric, one of my mentors, suggested I use Gora for
this and after a quick look at Gora I saw that it is an ORM for HBase
and Cassandra which will allow me to switch between them. The downside
is that Gora is still incubating, so advice about whether to use it
would be welcome. I will also ask on the Gora mailing list
to see how things are there.

 I would encourage you to look at a system like HBase for your mail
 backend. HDFS doesn't work well with lots of little files, and also
 doesn't support random update, so existing formats like Maildir
 wouldn't be a good fit.

I don't think I understand correctly what you mean by random updates.
E-mails are immutable so once written they are not going to be
updated. But if you are referring to the fact that lots of (small)
files will be written in a directory and that this can be a problem
then I get it. This will also mean that mailbox format (all emails in
one file) will be even less appropriate than Maildir. But since e-mails
are immutable and adding a mail to the mailbox means appending a small
piece of data to the file this should not be a problem if Hadoop has
append.

The presentation on Vimeo stated that HDFS 0.19 did not have append.
I don't know yet what the status of that is, but things are a little
brighter. You could have a mailbox file that could grow to a very
large size. This would put all of a user's emails into one big file
that is easy to manage; the only thing missing is a way of
fetching the emails. Since emails are appended to the file (inbox) as
they come, and you are usually interested in the latest emails
received, you could just read the tail of the file and do some indexing
based on that. Should I post this on the HDFS mailing list also?
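
To make the append-and-read-the-tail idea concrete, here is a minimal sketch using an ordinary in-memory file rather than HDFS. The length-prefixed record layout, the in-memory offset index, and the function names are all assumptions made for illustration; nothing here is an HDFS or James API, and it does not address concurrent access from IMAP or POP3 sessions.

```python
import io
import struct

HEADER = struct.Struct(">I")  # assumed layout: 4-byte big-endian length prefix


def append_message(mailbox, index, raw_mime):
    """Append one message to the end of the mailbox file and record its offset."""
    mailbox.seek(0, io.SEEK_END)
    index.append(mailbox.tell())
    mailbox.write(HEADER.pack(len(raw_mime)))
    mailbox.write(raw_mime)


def read_latest(mailbox, index, count):
    """Return the newest `count` messages, newest first, using the offset index."""
    if count <= 0:
        return []
    messages = []
    for offset in reversed(index[-count:]):
        mailbox.seek(offset)
        (length,) = HEADER.unpack(mailbox.read(HEADER.size))
        messages.append(mailbox.read(length))
    return messages


# Usage: an in-memory buffer stands in for the single growing mailbox file.
mailbox = io.BytesIO()
index = []
for body in (b"first", b"second", b"third"):
    append_message(mailbox, index, body)
latest_two = read_latest(mailbox, index, 2)
```

The index is what makes "read the tail" cheap: without it, finding message boundaries means scanning the file from the start.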

I'm talking without real experience with Hadoop so shut me up if I'm wrong.

 --
 Todd Lipcon
 Software Engineer, Cloudera

You are from Cloudera, nice. Answers straight from the source :).

[1] http://vimeo.com/3591321

Thanks,

-- 
Ioan-Eugen Stan


Re: some guidance needed

2011-05-19 Thread Robert Burrell Donkin
On Thu, May 19, 2011 at 12:04 PM, Ioan Eugen Stan stan.ieu...@gmail.com wrote:
 I have forwarded this discussion to my mentors so they are informed

(I've hopped onto this list so no need to remember to copy me into the
thread ;-)

snip

 Eric, one of my mentors, suggested I use Gora for
 this and after a quick look at Gora I saw that it is an ORM for HBase
 and Cassandra which will allow me to switch between them. The downside
 is that Gora is still incubating, so advice about whether to use it
 would be welcome. I will also ask on the Gora mailing list
 to see how things are there.

(I suspect there will be a measure of experimentation required in this
project, so don't be afraid to try a spike or two)

 I would encourage you to look at a system like HBase for your mail
 backend. HDFS doesn't work well with lots of little files, and also
 doesn't support random update, so existing formats like Maildir
 wouldn't be a good fit.

(Apache James is closer to the Microsoft Exchange space than traditional
*nix mail user agents)

 I don't think I understand correctly what you mean by random updates.
 E-mails are immutable so once written they are not going to be
 updated. But if you are referring to the fact that lots of (small)
 files will be written in a directory and that this can be a problem
 then I get it. This will also mean that mailbox format (all emails in
 one file) will be even less appropriate than Maildir. But since e-mails
 are immutable and adding a mail to the mailbox means appending a small
 piece of data to the file this should not be a problem if Hadoop has
 append.

Essentially, there are two classes of data that mail storage requires:

1. read-only MIME documents (mail messages) embedding meta-data (headers)
2. read-write meta-data sets about each document, including flags for
each (virtual) mail directory containing the document

The documents are searched rarely. The meta-data sets are read often
but written rarely.
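
The split between the two classes of data can be sketched as follows. This is only an in-memory illustration under stated assumptions: the class names, the use of a SHA-256 digest as the message key, and the dictionary-backed store are all hypothetical and stand in for whatever HDFS, HBase, or Cassandra layout is eventually chosen.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass(frozen=True)
class StoredMessage:
    """Class 1: an immutable MIME document, written once and never updated."""
    message_id: str  # hypothetical key: SHA-256 hex digest of the raw bytes
    raw_mime: bytes


@dataclass
class MessageMetadata:
    """Class 2: mutable per-mailbox state that points at a stored message."""
    message_id: str
    mailbox: str
    flags: set = field(default_factory=set)


class MailStore:
    def __init__(self):
        self._messages = {}  # message_id -> StoredMessage
        self._metadata = {}  # (mailbox, message_id) -> MessageMetadata

    def append(self, mailbox, raw_mime):
        message_id = hashlib.sha256(raw_mime).hexdigest()
        # The document is stored once, even if several mailboxes contain it.
        self._messages.setdefault(message_id, StoredMessage(message_id, raw_mime))
        self._metadata.setdefault(
            (mailbox, message_id), MessageMetadata(message_id, mailbox)
        )
        return message_id

    def set_flag(self, mailbox, message_id, flag):
        # Only the metadata record changes; the MIME document is untouched.
        self._metadata[(mailbox, message_id)].flags.add(flag)

    def flags(self, mailbox, message_id):
        return frozenset(self._metadata[(mailbox, message_id)].flags)

    def fetch(self, message_id):
        return self._messages[message_id].raw_mime
```

Because flags live in a separate record per (mailbox, message) pair, marking a mail as seen in one mailbox never rewrites the document itself or affects another mailbox that holds the same document.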

I suspect that emails are relatively small in Hadoop terms, and are
often numerous. Might be interesting to see how a tuned HDFS instance
performs when storing large numbers of small MIME documents. Should be
easy enough to set up an experiment to benchmark. (I wonder whether a
RESTful distributed storage solution might end up working better.)

I suspect that the read-write meta-data sets will need HBase (or
Cassandra). Would need to think carefully about design, I think.

 The presentation on Vimeo stated that HDFS 0.19 did not have append.
 I don't know yet what the status of that is, but things are a little
 brighter. You could have a mailbox file that could grow to a very
 large size. This would put all of a user's emails into one big file
 that is easy to manage; the only thing missing is a way of
 fetching the emails. Since emails are appended to the file (inbox) as
 they come, and you are usually interested in the latest emails
 received, you could just read the tail of the file and do some indexing
 based on that.

I'm not hopeful about adopting an append based approach. (Might be
made to work but I suspect that the locking required for IMAP or POP3
is likely to kill performance.)

Robert


some guidance needed

2011-05-18 Thread Ioan Eugen Stan
Hello everybody,

I'm a GSoC student for this year and I will be working on James [1].
My project is to implement email storage over HDFS. I am quite new to
Hadoop and its related projects and I am looking for some hints on how
to get started on the right track.

I have installed a single-node Hadoop instance on my machine and
played around with it (ran some examples) but I am interested in
what you (more experienced people) think is the best way to approach
my problem.

I am a little puzzled by the fact that I read Hadoop is best
used for large files, and emails aren't that large from what I know.
Another thing that crossed my mind is that since HDFS is a file
system, wouldn't it be possible to set it as a back-end for the
(existing) maildir and mailbox storage formats? (I think this question
is better suited to the James mailing list, but if you have some ideas
please speak your mind).

Also, any development resources to get me started are welcome.


[1] http://james.apache.org/mailbox/
[2] https://issues.apache.org/jira/browse/MAILBOX-44

Regards,
-- 
Ioan Eugen Stan


Re: some guidance needed

2011-05-18 Thread Todd Lipcon
Hi Ioan,

I would encourage you to look at a system like HBase for your mail
backend. HDFS doesn't work well with lots of little files, and also
doesn't support random update, so existing formats like Maildir
wouldn't be a good fit.
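
A rough back-of-the-envelope sketch of why lots of little files hurt: the NameNode keeps every file and block as an object in memory. The calculation below uses the commonly quoted rule of thumb of about 150 bytes of heap per object. That figure, the 10 KB average mail size, and the 64 MB block size are assumptions for illustration, not measurements, and the packed case ignores any per-record overhead of a container format.

```python
BYTES_PER_OBJECT = 150         # rule-of-thumb NameNode heap per namespace object
BLOCK_SIZE = 64 * 1024 * 1024  # assumed HDFS block size in bytes
AVG_MAIL_BYTES = 10_000        # assumed average mail size in bytes


def namenode_heap_bytes(num_files, blocks_per_file=1):
    """Estimate heap for num_files files: one file object plus its block objects."""
    return num_files * (1 + blocks_per_file) * BYTES_PER_OBJECT


def packed_file_count(num_mails):
    """Number of block-sized container files needed to hold num_mails mails."""
    total_bytes = num_mails * AVG_MAIL_BYTES
    return (total_bytes + BLOCK_SIZE - 1) // BLOCK_SIZE  # ceiling division


mails = 100_000_000
heap_one_file_per_mail = namenode_heap_bytes(mails)
heap_packed = namenode_heap_bytes(packed_file_count(mails))
print(heap_one_file_per_mail, heap_packed)
```

Under these assumptions, one file per mail needs about 30 GB of NameNode heap for 100 million mails, while packing the same mails into block-sized files needs only a few megabytes.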

-Todd

On Wed, May 18, 2011 at 4:02 PM, Ioan Eugen Stan stan.ieu...@gmail.com wrote:
 Hello everybody,

 I'm a GSoC student for this year and I will be working on James [1].
 My project is to implement email storage over HDFS. I am quite new to
 Hadoop and its related projects and I am looking for some hints on how
 to get started on the right track.

 I have installed a single-node Hadoop instance on my machine and
 played around with it (ran some examples) but I am interested in
 what you (more experienced people) think is the best way to approach
 my problem.

 I am a little puzzled by the fact that I read Hadoop is best
 used for large files, and emails aren't that large from what I know.
 Another thing that crossed my mind is that since HDFS is a file
 system, wouldn't it be possible to set it as a back-end for the
 (existing) maildir and mailbox storage formats? (I think this question
 is better suited to the James mailing list, but if you have some ideas
 please speak your mind).

 Also, any development resources to get me started are welcome.


 [1] http://james.apache.org/mailbox/
 [2] https://issues.apache.org/jira/browse/MAILBOX-44

 Regards,
 --
 Ioan Eugen Stan




-- 
Todd Lipcon
Software Engineer, Cloudera


Re: some guidance needed

2011-05-18 Thread Mark Kerzner
Ioan,

I second what Todd said. Even with FuseHDFS, which mounts HDFS as a regular
file system, you won't get the immediate response about the file status that
you need. I believe Google implemented Gmail with HBase. Here is an example
of implementing a mail store with Cassandra:
http://ewh.ieee.org/r6/scv/computer/nfic/2009/IBM-Jun-Rao.pdf

Mark

On Wed, May 18, 2011 at 5:05 PM, Todd Lipcon t...@cloudera.com wrote:

 Hi Ioan,

 I would encourage you to look at a system like HBase for your mail
 backend. HDFS doesn't work well with lots of little files, and also
 doesn't support random update, so existing formats like Maildir
 wouldn't be a good fit.

 -Todd

 On Wed, May 18, 2011 at 4:02 PM, Ioan Eugen Stan stan.ieu...@gmail.com
 wrote:
  Hello everybody,
 
  I'm a GSoC student for this year and I will be working on James [1].
  My project is to implement email storage over HDFS. I am quite new to
  Hadoop and its related projects and I am looking for some hints on how
  to get started on the right track.
 
  I have installed a single-node Hadoop instance on my machine and
  played around with it (ran some examples) but I am interested in
  what you (more experienced people) think is the best way to approach
  my problem.
 
  I am a little puzzled by the fact that I read Hadoop is best
  used for large files, and emails aren't that large from what I know.
  Another thing that crossed my mind is that since HDFS is a file
  system, wouldn't it be possible to set it as a back-end for the
  (existing) maildir and mailbox storage formats? (I think this question
  is better suited to the James mailing list, but if you have some ideas
  please speak your mind).
 
  Also, any development resources to get me started are welcome.
 
 
  [1] http://james.apache.org/mailbox/
  [2] https://issues.apache.org/jira/browse/MAILBOX-44
 
  Regards,
  --
  Ioan Eugen Stan
 



 --
 Todd Lipcon
 Software Engineer, Cloudera