Hi Devs,

I did an initial implementation of the Hadoop provider based on the OGCE
implementation and also tried Apache Whirr to deploy Hadoop on EC2. But
there are certain issues to solve and decisions to make before finishing
the first phase of the implementation.
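
In case it's useful for anyone who wants to reproduce the EC2 deployment,
the Whirr side is basically a small properties file plus the Whirr CLI. A
minimal recipe looks roughly like the following (the cluster name, node
counts and key paths are placeholders I picked, nothing fixed in our code):

  whirr.cluster-name=airavata-hadoop
  whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,2 hadoop-datanode+hadoop-tasktracker
  whirr.provider=aws-ec2
  whirr.identity=${env:AWS_ACCESS_KEY_ID}
  whirr.credential=${env:AWS_SECRET_ACCESS_KEY}
  whirr.private-key-file=${sys:user.home}/.ssh/id_rsa
  whirr.public-key-file=${sys:user.home}/.ssh/id_rsa.pub

and then:

  bin/whirr launch-cluster --config hadoop-ec2.properties
  bin/whirr destroy-cluster --config hadoop-ec2.properties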

1. There are multiple Hadoop versions out there, including 0.20.x,
0.22.x, 0.23.x, 1.0.x and 2.0.0-alpha. 0.23.x (now renamed to 2.0.0) is a
complete overhaul of the previous MapReduce implementation, and a stable
release will most probably be available towards the end of this year,
while 0.22.x (1.0.x) is (I'm guessing) the most widely used version. We
need to decide which version we are going to support.

2. There are two ways of submitting jobs to Hadoop. The first is to use
the 'jar' command of the 'hadoop' command-line tool, which assumes that
Hadoop is already configured (for a local or remote cluster) on the
machine doing the submission. The other method is to use the Hadoop Job
API. That way we can do the job submission programmatically, but we need
to supply the Hadoop configuration files and other related details such
as the job jar file and the Mapper and Reducer class names (there is a
rough sketch of this below, after point 3).

3. We can use Whirr to set up Hadoop on a local cluster as well. But the
local cluster configuration doesn't support certificate-based
authentication; we need to specify username/password pairs and the root
passwords of the machines in the local cluster (see the config sketch
after this list). I think this is enough for the initial implementation.
WDYT?
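
To make point 2 a bit more concrete: the CLI route is essentially running
'bin/hadoop jar <job-jar> <main-class> <args>' against an existing Hadoop
installation, while the Job API route would look roughly like the sketch
below. This is written against the new org.apache.hadoop.mapreduce API as
it is in 0.20.x/1.0.x; the config file paths and the word-count
Mapper/Reducer are only placeholders for whatever GFac would actually get
from the application description.

  import java.io.IOException;
  import java.util.StringTokenizer;

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Job;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;
  import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
  import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

  public class JobApiSketch {

      // Placeholder word-count Mapper; in the provider this would come
      // from the user-supplied job jar instead.
      public static class TokenMapper
              extends Mapper<LongWritable, Text, Text, IntWritable> {
          private static final IntWritable ONE = new IntWritable(1);
          private final Text word = new Text();

          @Override
          protected void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              StringTokenizer it = new StringTokenizer(value.toString());
              while (it.hasMoreTokens()) {
                  word.set(it.nextToken());
                  context.write(word, ONE);
              }
          }
      }

      // Placeholder Reducer that sums the counts per word.
      public static class SumReducer
              extends Reducer<Text, IntWritable, Text, IntWritable> {
          @Override
          protected void reduce(Text key, Iterable<IntWritable> values,
                  Context context) throws IOException, InterruptedException {
              int sum = 0;
              for (IntWritable v : values) {
                  sum += v.get();
              }
              context.write(key, new IntWritable(sum));
          }
      }

      public static void main(String[] args) throws Exception {
          // Cluster endpoints come from the usual Hadoop config files;
          // these paths are placeholders for wherever GFac keeps them.
          Configuration conf = new Configuration();
          conf.addResource(new Path("/etc/hadoop/conf/core-site.xml"));
          conf.addResource(new Path("/etc/hadoop/conf/mapred-site.xml"));

          Job job = new Job(conf, "gfac-submitted-job");
          job.setJarByClass(JobApiSketch.class);
          job.setMapperClass(TokenMapper.class);
          job.setReducerClass(SumReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);

          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));

          // Blocks until the job completes and prints progress to stdout.
          System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
  }

One thing I like about this route is that waitForCompletion(true) blocks
until the job finishes and reports progress, so job monitoring in the
provider could probably hook in around that call.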
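
For point 3, as far as I remember, the local cluster case goes through
Whirr's BYON (bring-your-own-nodes) support, so the recipe would point at
a nodes file that carries exactly those username/password credentials.
The property names and YAML keys below are from memory, so please treat
them as an assumption rather than a tested config:

  whirr.cluster-name=local-hadoop
  whirr.instance-templates=1 hadoop-namenode+hadoop-jobtracker,2 hadoop-datanode+hadoop-tasktracker
  whirr.provider=byon
  jclouds.byon.endpoint=file:///path/to/nodes.yaml

with nodes.yaml listing the machines roughly like:

  nodes:
      - id: node1
        hostname: 192.168.1.10
        os_arch: x86_64
        os_family: ubuntu
        os_version: 10.04
        group: hadoop
        username: hadoop
        credential: <login password>
        sudo_password: <root password>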

In addition to the above, I have several questions specific to the Hadoop
provider configuration. We may need to add more configuration parameters
to the current GFac configuration, but those will depend on the decisions
above, so I'll send a separate mail about them later.

Please feel free to comment on the above, and let me know if you need more details.

Thanks
Milinda

-- 
Milinda Pathirage
PhD Student, Indiana University, Bloomington;
E-mail: [email protected]
Web: http://mpathirage.com
Blog: http://blog.mpathirage.com
