Re: some guidance needed

2011-05-23 Thread Eric Charles

Hi,

Yes, we need to store immutable mails and their associated r/w metadata.

I was wondering how a solution like the one presented in [1] 
could help. Twitter seems to use Protocol Buffers to store tweets.


Would a solution based on Avro be a better fit for our needs (mail storage)?

In this Avro option, would each mail be an Avro file, or should we 
consider making each folder an Avro file and running some map/reduce jobs over it?
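
Purely to make the second option concrete: one Avro record per mail, with 
many records in a single container file per folder. The record name and 
fields below are hypothetical, not a proposal:

  {
    "type": "record",
    "name": "Mail",
    "fields": [
      {"name": "messageId", "type": "string"},
      {"name": "receivedAt", "type": "long"},
      {"name": "raw", "type": "bytes"}
    ]
  }

Map/reduce jobs could then scan the per-folder container files record by record.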


Tks,

- Eric

[1] 
http://www.slideshare.net/kevinweil/protocol-buffers-and-hadoop-at-twitter



On 19/05/2011 20:53, Robert Burrell Donkin wrote:

On Thu, May 19, 2011 at 12:04 PM, Ioan Eugen Stan  wrote:

I have forwarded this discussion to my mentors so they are informed.


(I've hopped onto this list so no need to remember to copy me into the
thread ;-)




Eric, one of my mentors, suggested I use Gora for
this, and after a quick look at Gora I saw that it is an ORM for HBase
and Cassandra which would allow me to switch between them. The downside
is that Gora is still incubating, so any advice about
whether to use it is welcome. I will also ask on the Gora mailing list
to see how things are there.


(I suspect there will be a measure of experimentation required in this
project, so don't be afraid to try a spike or two)


I would encourage you to look at a system like HBase for your mail
backend. HDFS doesn't work well with lots of little files, and also
doesn't support random update, so existing formats like Maildir
wouldn't be a good fit.


(Apache James is closer to the Microsoft Exchange space than to
traditional *nix mail user agents)


I don't think I understand correctly what you mean by random updates.
E-mails are immutable, so once written they are not going to be
updated. But if you are referring to the fact that lots of (small)
files will be written in a directory, and that this can be a problem,
then I get it. This would also mean that the mbox format (all emails in
one file) is even less appropriate than Maildir. But since e-mails
are immutable, and adding a mail to the mailbox just means appending a
small piece of data to the file, this should not be a problem if Hadoop
has append.


Essentially, there are two classes of data that mail storage requires:

1. read-only MIME documents (mail messages) embedding meta-data (headers)
2. read-write meta-data sets about each document, including flags for
each (virtual) mail directory containing the document

The documents are searched rarely. The meta-data sets are read often
but written rarely.
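
Purely as a sketch of class 2 with the HBase client API (the table name,
column family, and row key design below are made up, not a settled design):

  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.client.HTable;
  import org.apache.hadoop.hbase.client.Put;
  import org.apache.hadoop.hbase.util.Bytes;

  // Hypothetical: record the flags of one message in one (virtual)
  // mail directory. One row per message; one column per directory.
  static void setFlags(Configuration conf, String messageId,
      String directory, String flags) throws IOException {
    HTable table = new HTable(conf, "mail_meta");
    Put put = new Put(Bytes.toBytes(messageId));
    put.add(Bytes.toBytes("flags"), Bytes.toBytes(directory),
        Bytes.toBytes(flags));
    table.put(put);
    table.close();
  }

The read-often/write-rarely pattern would then map to cheap column reads
and occasional puts.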

I suspect that emails are relatively small in Hadoop terms, and are
often numerous. Might be interesting to see how a tuned HDFS instance
performs when storing large numbers of small MIME documents. Should be
easy enough to set up an experiment to benchmark. (I wonder whether a
RESTful distributed storage solution might end up working better.)

I suspect that the read-write meta-data sets will need HBase (or
Cassandra). Would need to think carefully about design, I think.


The presentation on Vimeo stated that HDFS 0.19 did not have append.
I don't know yet what the status is on that, but things are a little
brighter. You could have a mailbox file that grows to a very
large size. This would put all of a user's emails into one big file
that is easy to manage; the only thing missing is fetching the emails.
Since emails are appended to the file (inbox) as they come, and you are
usually interested in the latest emails received, you could just read
the tail of the file and do some indexing based on that.
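
For concreteness, a rough sketch of that idea, assuming a Hadoop build
where append is available and enabled; the path, tail size, and message
framing below are hypothetical:

  import java.io.IOException;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Append one incoming message to a per-user mailbox file, then
  // read back only the tail, where the most recent messages live.
  static void appendAndTail(FileSystem fs, byte[] rawMessage)
      throws IOException {
    Path inbox = new Path("/mail/jdoe/inbox"); // hypothetical layout
    FSDataOutputStream out = fs.append(inbox);
    out.write(rawMessage); // how messages are delimited is left open
    out.close();

    long len = fs.getFileStatus(inbox).getLen();
    int tail = (int) Math.min(len, 64 * 1024); // arbitrary tail size
    FSDataInputStream in = fs.open(inbox);
    in.seek(len - tail);
    byte[] buf = new byte[tail];
    in.readFully(buf); // index the latest messages from this buffer
    in.close();
  }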


I'm not hopeful about adopting an append based approach. (Might be
made to work but I suspect that the locking required for IMAP or POP3
is likely to kill performance.)

Robert




Release or Snapshot from Maven Repositories?

2011-06-09 Thread Eric Charles

Hi,

I'm trying to define the needed Hadoop artifacts in my maven project.
I could go with the 0.20.203 release [1] or use some snapshots [2].

I'm still on the learning curve, and I build the hadoop projects (common, 
hdfs, mr) in eclipse (asking ant/ivy to generate the eclipse .classpath and 
.project, waiting on "HADOOP-6671 To use maven for hadoop common 
builds"), so depending on the current SVN trunk is fine for me, especially 
getting the published src/javadoc jars in my IDE.


My questions are:

1. Which of 0.21.0, 0.21.1, 0.22.0 or 0.23.0 corresponds to trunk? (I 
didn't find any branch/tag for these numbers.)


2. I like the split made on the snapshot maven artifacts (examples are 
separated, ...), but this distinction is not present in the src projects. 
Is it a goal to also split them into modules? (This last topic was 
discussed on HADOOP-6671; just tell me whether you want me to comment on 
the existing JIRA or open a new one.)


Tks,
- Eric


[1] release 0.20.203 http://search.maven.org/#browse|306912910
[2] snapshot 0.21.0 0.21.1 0.22.0 0.23.0 
https://repository.apache.org/content/groups/snapshots/org/apache/hadoop/hadoop-common/


Re: Release or Snapshot from Maven Repositories?

2011-06-12 Thread Eric Charles

On 09/06/11 12:11, Eric Charles wrote:

1. Which of 0.21.0, 0.21.1, 0.22.0 or 0.23.0 corresponds to trunk? (I
didn't find any branch/tag for these numbers.)


From the pom.xml in the hdfs project, it seems 0.23-SNAPSHOT is the latest one,
so I'll take 0.23.0-SNAPSHOT (still a bit confused by the additional .0...).
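
For reference, the fragment I ended up with in my pom.xml looks roughly
like this (the repository id is arbitrary; the coordinates follow from [2]):

  <repositories>
    <repository>
      <id>apache.snapshots</id>
      <url>https://repository.apache.org/content/groups/snapshots/</url>
    </repository>
  </repositories>

  <dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-common</artifactId>
    <version>0.23.0-SNAPSHOT</version>
  </dependency>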


2. I like the split made on the snapshot maven artifacts (examples are
separated, ...), but this distinction is not present in the src projects.
Is it a goal to also split them into modules? (This last topic was
discussed on HADOOP-6671; just tell me whether you want me to comment on
the existing JIRA or open a new one.)



I missed the fact that HADOOP-6671 is for common only, not for the rest.
I'm following this on JIRA now.

Tks,
- Eric








Common project tests in eclipse

2011-06-13 Thread Eric Charles

Hi,

I've imported the hadoop common trunk in eclipse (.classpath and 
.project created via ant eclipse).


ant test builds fine (0 failures).

When I run the junit tests from eclipse (right-click on the test folder, 
"Run as test"), there are many failures...


Is there some env (-D...) to give when running from eclipse?
I could hack the ant scripts, but someone here may know the answer :)

Many tks for your help,
- Eric


Re: Common project tests in eclipse

2011-06-20 Thread Eric Charles

Hi Harsh,

Tks for your quick reply, and sorry for my late response.

http://search-hadoop.com/m/gLWelrO8Mc helped me find the Launch 
configurations for hadoop-hdfs and hadoop-mapreduce, but there is no 
such Launch config for hadoop-common.


Never mind, I just replaced :hdfs with :common and got [1].

My test success % has made an impressive jump, but I still have a few 
failures, for example:
junit.framework.AssertionFailedError: -expunge failed expected:<0> but 
was:<1>

at org.apache.hadoop.fs.TestTrash.trashShell(TestTrash.java:328)
at org.apache.hadoop.fs.TestTrash.trashShell(TestTrash.java:102)
at org.apache.hadoop.fs.TestTrash.testTrash(TestTrash.java:447)


About the needed environment variables to run the tests:

1.- Should the Launch configs be documented on 
http://wiki.apache.org/hadoop/EclipseEnvironment?


2.- Should a Launch config be created for the hadoop-common project?

3.- The tests depend heavily on env variables. I wonder how to handle 
this with the upcoming maven structure 
(https://issues.apache.org/jira/browse/HADOOP-6671). Should we foresee 
some default values during the test setUp(), a bit as if we generalized 
https://issues.apache.org/jira/browse/HADOOP-5916 
(Standardize fall-back value of test.build.data for testing directories) 
to all the needed env variables? A rough sketch of what I mean follows below.
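
Something like this (the property names come from the Launch config in [1]; 
the class name and default values are hypothetical):

  import org.junit.Before;

  public class SomeHadoopTest {

    @Before
    public void setUp() {
      // Fall back to in-tree defaults when a property was not
      // provided by a Launch config or the ant build.
      setIfUnset("test.build.data", "build/test/data");
      setIfUnset("hadoop.log.dir", "build/test/log");
      setIfUnset("hadoop.policy.file", "hadoop-policy.xml");
    }

    private static void setIfUnset(String key, String def) {
      if (System.getProperty(key) == null) {
        System.setProperty(key, def);
      }
    }
  }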


Maybe this is already a known topic?
Tks.

[1] -Xms256m -Xmx512m 
-Dtest.build.data=${workspace_loc:common}/build/test 
-Dtest.cache.data=${workspace_loc:common}/build/test/cache 
-Dtest.debug.data=${workspace_loc:common}/build/test/debug 
-Dhadoop.log.dir=${workspace_loc:common}/build/test/log 
-Dtest.src.dir=${workspace_loc:common}/build/test/src 
-Dtest.build.extraconf=${workspace_loc:common}/build/test/extraconf 
-Dhadoop.policy.file=hadoop-policy.xml


On 13/06/11 11:57, Harsh J wrote:

Eric,

Is the problem reported as something like "webapps not found on
CLASSPATH"? I see this on mapreduce and hdfs projects at times, but
common tests usually run fine out of the box for me.

If it is indeed that, this conversation may help solve/provide an
answer: http://search-hadoop.com/m/gLWelrO8Mc

On Mon, Jun 13, 2011 at 12:49 PM, Eric Charles wrote:

Hi,

I've imported the hadoop common trunk in eclipse (.classpath and .project
created via ant eclipse).

ant test builds fine (0 failures).

When I run the junit tests from eclipse (right-click on the test folder, "Run as
test"), there are many failures...

Is there some env (-D...) to give when running from eclipse?
I could hack the ant scripts, but someone here may know the answer :)

Many tks for your help,
- Eric







--
Eric


Re: Append to Existing File

2011-06-21 Thread Eric Charles
When you say "bugs pending", are you referring to HDFS-265 (which links 
to HDFS-1060, HADOOP-6239 and HDFS-744)?


Are there other issues related to append than the ones above?

Tks, Eric

https://issues.apache.org/jira/browse/HDFS-265


On 21/06/11 12:36, madhu phatak wrote:

It's not stable. There are some bugs pending. According to one of the
discussions, to date append is not ready for production.

On Tue, Jun 14, 2011 at 12:19 AM, jagaran das wrote:


I am using the hadoop-0.20.203.0 version.
I have set dfs.support.append to true and then used the append method.

It is working, but I need to know how stable it is to deploy and use in
production clusters.

Regards,
Jagaran




From: jagaran das
To: common-user@hadoop.apache.org
Sent: Mon, 13 June, 2011 11:07:57 AM
Subject: Append to Existing File

Hi All,

Is append to an existing file now supported in Hadoop for production
clusters?
If yes, please let me know which version and how.

Thanks
Jagaran





--
Eric


Re: Append to Existing File

2011-06-21 Thread Eric Charles

Hi Madhu,

Tks for the pointer. Even after reading the section on 0.21/22/23 
written by Tsz-Wo, I still remain in the fog...


Will HDFS-265 (and the Jiras it mentions) provide a solution for append 
(whatever release it ends up in)?

Another way of asking: "Are there today other Jiras, besides the ones 
mentioned on HDFS-265, to take into consideration to get a working hadoop 
append?"


Tks, Eric


On 21/06/11 12:58, madhu phatak wrote:

Please refer to this discussion
http://search-hadoop.com/m/rnG0h1zCZcL1/Re%253A+HDFS+File+Appending+URGENT&subj=Fw+HDFS+File+Appending+URGENT

On Tue, Jun 21, 2011 at 4:23 PM, Eric Charles wrote:


When you say "bugs pending", are you referring to HDFS-265 (which links to
HDFS-1060, HADOOP-6239 and HDFS-744)?

Are there other issues related to append than the one above?

Tks, Eric

https://issues.apache.org/jira/browse/HDFS-265



On 21/06/11 12:36, madhu phatak wrote:


It's not stable. There are some bugs pending. According to one of the
discussions, to date append is not ready for production.

On Tue, Jun 14, 2011 at 12:19 AM, jagaran das wrote:

I am using the hadoop-0.20.203.0 version.
I have set dfs.support.append to true and then used the append method.

It is working, but I need to know how stable it is to deploy and use in
production clusters.

Regards,
Jagaran



From: jagaran das
To: common-user@hadoop.apache.org
Sent: Mon, 13 June, 2011 11:07:57 AM
Subject: Append to Existing File

Hi All,

Is append to an existing file now supported in Hadoop for production
clusters?
If yes, please let me know which version and how.

Thanks
Jagaran





--
Eric





--
Eric


Re: conferences

2011-06-29 Thread Eric Charles

On 29/06/11 13:33, Keren Ouaknine wrote:

Hello,

I would like to find a list of the prestigious conferences related to Hadoop.
Where can I find such a list? Thanks!

Keren



Hi,

You can try http://wiki.apache.org/hadoop/Conferences

I was just surfing this morning on:
http://developer.yahoo.com/events/hadoopsummit2011/
http://www.cloudera.com/company/events/hadoop-world-2011/

Thx
--
Eric