Hi Milinda,

> > 1. There are multiple Hadoop versions out there including 0.20.x,
> > 0.22.x, 0.23.x, 1.0.x, 2.0.0-alpha. 0.23.x (now renamed to 2.0.0) is a
> > complete overhaul of the previous MapReduce implementation, and a stable
> > release will most probably be available by the end of this year, while
> > 0.22.x (1.0.x) is (I'm guessing) the most widely used version. We need
> > to decide which version we are going to support.

Hadoop introduced a new MapReduce API (mapreduce) in 0.20.x and
deprecated the old API (mapred). Unfortunately the new API lacked
certain features, and the old API was un-deprecated somewhere around
0.21.x. Still, the chances are high that the old API will be dropped at
some point. Supporting the new API (mapreduce) and the 1.0.x branch
would be the best bet at the moment, unless we have to support legacy
applications implemented using the old mapred API.
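
For concreteness, a word-count style skeleton written against the new
(mapreduce) API, as it looks on the 1.0.x branch, would be roughly the
following; the class names and paths are just placeholders, nothing
from GFac/XBaya:

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Skeleton job written against the new (org.apache.hadoop.mapreduce) API.
public class WordCount {

  // Mapper: tokenizes each input line and emits (word, 1) pairs.
  public static class TokenizerMapper
      extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word.
  public static class IntSumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values,
        Context context) throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "word count");  // new Job(...) is the 1.0.x style
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

The old (mapred) API version of the same program would implement the
Mapper/Reducer interfaces and drive the job through JobConf and
JobClient.runJob(), so supporting both APIs would mean two code paths
in the provider.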

0.23.x (2.0) is a rewrite of Hadoop. AFAIK it's not totally stable yet,
and the other projects in the Hadoop community (HBase, Mahout) are yet
to support it. IMHO it'll take some time for 2.0.x to become stable and
for people to start using it. By that time the Airavata Hadoop support
will (hopefully :) ) have become very popular, and chances are high
that we would have to support 1.0.x as well as 2.0.x.

thanks,
Thilina

> Sorry, but I am not familiar with Hadoop versions; I tried to understand
> [1] but could not figure out much. The only other criterion I would have
> looked at is XBaya's current integration with Amazon Elastic MapReduce,
> but EMR seems to support most of the versions you mentioned [2]. You seem
> to have a better understanding of Hadoop versions, so unless others on the
> list have a recommendation, I will defer the decision to your judgement.
>
> > 2. There are two ways of submitting jobs to Hadoop. The first is to use
> > the jar command of the 'hadoop' command-line tool, and this assumes that
> > you have Hadoop configured (local or remote cluster) on your system.
> > The other method is to use the Hadoop Job API. This way we can do the
> > job submission programmatically, but we need to specify the Hadoop
> > configuration files and other related things such as the jar file and
> > the Mapper and Reducer class names.
>
> The API option seems to be better. Maintaining a properties file might be
> an issue, but that seems easier to support than assuming Hadoop clients
> are properly installed on local systems.
>
> > 3. We can use Whirr to set up Hadoop on a local cluster as well. But the
> > local cluster configuration doesn't support certificate-based
> > authentication; we need to specify user name/password pairs and the root
> > passwords of the machines in the local cluster. I think this is enough
> > for the initial implementation. WDYT?
>
> I agree, the initial implementation can probably work with this limitation,
> which is outweighed by the advantages of having a local cluster.
>
> > In addition to the above, I have several questions specific to the Hadoop
> > provider configuration. We may need to add additional configuration
> > parameters to the current GFac configuration, but these will change based
> > on the decisions above, so I'll send a separate mail on that later.
> >
> > Please feel free to comment on the above and let me know if you need more
> > details.
>
> Please feel free to suggest any changes to the GFac configurations and
> schemas incorporating the Hadoop extensions. And as always, patches are
> welcome.
>
> Cheers,
> Suresh
>
> [1] - http://www.cloudera.com/blog/2012/04/apache-hadoop-versions-looking-ahead-3/
> [2] - http://aws.amazon.com/elasticmapreduce/faqs/#dev-12
>
> > Thanks,
> > Milinda
> >
> > --
> > Milinda Pathirage
> > PhD Student, Indiana University, Bloomington
> > E-mail: [email protected]
> > Web: http://mpathirage.com
> > Blog: http://blog.mpathirage.com

--
https://www.cs.indiana.edu/~tgunarat/
http://www.linkedin.com/in/thilina
http://thilina.gunarathne.org
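
Regarding point 2 above (submitting through the Hadoop Job API rather
than the 'hadoop' CLI), a rough sketch of what the programmatic
submission path could look like is below. It assumes the client side
has copies of the target cluster's configuration files and reuses the
TokenizerMapper/IntSumReducer classes from the earlier sketch; all
paths, host names and the job name are placeholders.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class RemoteSubmissionSketch {

  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Point the client at the target cluster by loading its config files,
    // instead of assuming a locally installed/configured 'hadoop' client.
    // (Placeholder paths.)
    conf.addResource(new Path("/path/to/cluster-conf/core-site.xml"));
    conf.addResource(new Path("/path/to/cluster-conf/hdfs-site.xml"));
    conf.addResource(new Path("/path/to/cluster-conf/mapred-site.xml"));

    // Alternatively, the two key endpoints could be set directly using the
    // 1.0.x property names (placeholder host names):
    // conf.set("fs.default.name", "hdfs://namenode.example.org:9000");
    // conf.set("mapred.job.tracker", "jobtracker.example.org:9001");

    Job job = new Job(conf, "gfac-hadoop-job");

    // The jar holding the Mapper/Reducer classes has to reach the cluster;
    // it can be inferred from a class on the client classpath ...
    job.setJarByClass(WordCount.class);
    // ... or named explicitly via the 1.0.x "mapred.jar" property:
    // job.getConfiguration().set("mapred.jar", "/path/to/wordcount-job.jar");

    job.setMapperClass(WordCount.TokenizerMapper.class);
    job.setReducerClass(WordCount.IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path("/input"));
    FileOutputFormat.setOutputPath(job, new Path("/output"));

    // Fire-and-forget submission; job.waitForCompletion(true) would block
    // and report progress instead, and job.isComplete()/job.isSuccessful()
    // can be polled later.
    job.submit();
  }
}

With something along these lines the GFac provider would only need the
cluster's *-site.xml files (or the handful of properties inside them)
plus the application jar, which is the properties-file maintenance
trade-off mentioned above.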
