Arijit Mukherjee wrote:
Thanks, Enis.
By workflow, I meant something like a chain of MapReduce jobs: the first
extracts a certain amount of data from the original set and does some
computation resulting in a smaller summary, which then becomes the input
to a further MR job, and so on, somewhat similar to a workflow in the
SOA world.
Yes, you can always chain jobs together to form a final summary.
o.a.h.mapred.jobcontrol.JobControl might be interesting for you.
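A toy, in-memory sketch of that chaining idea (plain Python standing in for two Hadoop jobs; the `run_mapreduce` helper and the call-record layout are invented for illustration — on a real cluster you would wire the jobs together with JobControl instead):

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Minimal in-memory stand-in for one MapReduce job."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]

# Job 1: extract (region, duration) pairs and summarize total minutes per region.
calls = [("east", 10), ("west", 5), ("east", 7), ("west", 3)]
job1_out = run_mapreduce(
    calls,
    mapper=lambda rec: [(rec[0], rec[1])],
    reducer=lambda region, durations: (region, sum(durations)),
)

# Job 2: consume job 1's much smaller summary and pick the busiest region.
job2_out = run_mapreduce(
    job1_out,
    mapper=lambda rec: [("busiest", rec)],
    reducer=lambda _key, pairs: max(pairs, key=lambda p: p[1]),
)
print(job2_out)  # [('east', 17)]
```

The point is only that each job's (smaller) output is a perfectly ordinary input for the next job, which is all a workflow of this kind needs.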
Is it possible to use statistical analysis tools such as R (or say PL/R)
within MapReduce on Hadoop? As far as I've heard, Greenplum is working
on a custom MapReduce engine over their Greenplum database which will
also support PL/R procedures.
Using R on Hadoop might require some custom coding. If you are looking
for an ad-hoc tool for data mining, then check out Pig and Hive.
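One common route for plugging an external tool such as R into Hadoop is Hadoop Streaming, which runs any executable that reads records on stdin and writes tab-separated key/value lines on stdout. The sketch below uses Python stand-ins for such scripts (the `subscriber,call_minutes` CSV layout is invented for illustration); an R script honoring the same stdin/stdout contract could take the place of either function:

```python
def streaming_mapper(lines):
    """Emit 'key<TAB>value' lines, as a Hadoop Streaming mapper does.
    Input lines are assumed to be 'subscriber,call_minutes' CSV records."""
    for line in lines:
        subscriber, minutes = line.strip().split(",")
        yield f"{subscriber}\t{minutes}"

def streaming_reducer(lines):
    """Average the values for each key; Streaming hands the reducer
    its input already sorted by key."""
    current, values = None, []
    for line in lines:
        key, value = line.strip().split("\t")
        if current is not None and key != current:
            yield f"{current}\t{sum(values) / len(values)}"
            values = []
        current = key
        values.append(float(value))
    if current is not None:
        yield f"{current}\t{sum(values) / len(values)}"

# Simulate one job locally: map, shuffle (sort by key), reduce.
mapped = sorted(streaming_mapper(["a,2", "b,4", "a,4"]))
print(list(streaming_reducer(mapped)))  # ['a\t3.0', 'b\t4.0']
```

On a cluster, the streaming jar's -mapper and -reducer options would point at the scripts, so the "custom coding" largely reduces to keeping R's I/O on this line-oriented contract.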
Enis
Arijit
Dr. Arijit Mukherjee
Principal Member of Technical Staff, Level-II
Connectiva Systems (I) Pvt. Ltd.
J-2, Block GP, Sector V, Salt Lake
Kolkata 700 091, India
Phone: +91 (0)33 23577531/32 x 107
http://www.connectivasystems.com
-----Original Message-----
From: Enis Soztutar [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 24, 2008 2:57 PM
To: core-user@hadoop.apache.org
Subject: Re: Questions about Hadoop
Hi,
Arijit Mukherjee wrote:
Hi
We've been thinking of using Hadoop for a decision-making system that
will analyze telecom-related data from various sources to make certain
decisions. The data can be huge, of the order of terabytes, and can be
stored as CSV files, which I understand will fit into Hadoop well; Tom
White mentions in the Rough Cuts guide that Hadoop is well suited for
records. The question I want to ask is whether it is possible to
perform statistical analysis on the data using Hadoop and MapReduce.
If anyone has done such a thing, we'd be very interested to know about
it. Is it also possible to create workflow-like functionality with
MapReduce?
Hadoop can handle TB data sizes, and statistical data analysis is one of
the things that fits the mapreduce computation model perfectly. You can
check what people are doing with Hadoop at
http://wiki.apache.org/hadoop/PoweredBy.
I think the best way to see whether your requirements can be met by
Hadoop/mapreduce is to read the MapReduce paper by Dean et al. You might
also be interested in checking out Mahout, which is a subproject of
Lucene; they are doing machine learning on top of Hadoop.
Hadoop is mostly suitable for batch jobs; however, these jobs can be
chained together to form a workflow. I can be more helpful if you
explain what you mean by workflow.
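As a concrete sketch of why summary statistics fit the model: a mapper can emit a mergeable (count, sum, sum-of-squares) triple per key, and the reducer merges the triples, so mean and variance fall out of a single pass over the data. Plain Python stand-ins below, with invented region/value records:

```python
from collections import defaultdict

def mapper(record):
    """Emit a mergeable (count, sum, sum_of_squares) triple for the key."""
    key, value = record
    return key, (1, value, value * value)

def reducer(key, triples):
    """Merge triples, then derive the mean and (population) variance."""
    n = sum(t[0] for t in triples)
    s = sum(t[1] for t in triples)
    ss = sum(t[2] for t in triples)
    mean = s / n
    return key, mean, ss / n - mean * mean

# Group mapper output by key, then reduce each group, as the framework would.
records = [("east", 2.0), ("east", 4.0), ("west", 6.0)]
groups = defaultdict(list)
for rec in records:
    key, triple = mapper(rec)
    groups[key].append(triple)
stats = [reducer(key, triples) for key, triples in sorted(groups.items())]
print(stats)  # [('east', 3.0, 1.0), ('west', 6.0, 0.0)]
```

Because the triples merge associatively, the same reducer also works as a combiner, which is what keeps such jobs cheap at TB scale.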
Enis Soztutar
Regards
Arijit