Thanks Enis. By workflow, I mean something like a chain of MapReduce jobs: the first one extracts a certain amount of data from the original set and does some computation resulting in a smaller summary, which then becomes the input to a further MR job, and so on - somewhat similar to a workflow in the SOA world.
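To make the chaining idea concrete, here is a tiny in-memory stand-in for a chain of two jobs (plain Python, purely illustrative - not Hadoop API code; the record format and field names are made up): the first pass aggregates per key, and its output records are fed directly into a second, summarising pass.

```python
from itertools import groupby
from operator import itemgetter

def map_reduce(records, mapper, reducer):
    """A tiny in-memory stand-in for one MapReduce job."""
    mapped = [kv for rec in records for kv in mapper(rec)]
    mapped.sort(key=itemgetter(0))  # stands in for the shuffle/sort phase
    return [reducer(key, [v for _, v in group])
            for key, group in groupby(mapped, key=itemgetter(0))]

# Job 1: extract per-subscriber call durations from CSV-like records.
def extract_mapper(line):
    subscriber, duration = line.split(",")
    yield subscriber, int(duration)

def sum_reducer(key, values):
    return key, sum(values)

# Job 2: summarise job 1's output into one overall total.
def total_mapper(kv):
    yield "total", kv[1]

calls = ["alice,120", "bob,30", "alice,45"]
per_subscriber = map_reduce(calls, extract_mapper, sum_reducer)
overall = map_reduce(per_subscriber, total_mapper, sum_reducer)
print(per_subscriber)  # [('alice', 165), ('bob', 30)]
print(overall)         # [('total', 195)]
```

On a real cluster each stage would be a separate job whose output directory is the next job's input directory; the in-memory helper above only mimics that data flow.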
Is it possible to use statistical analysis tools such as R (or, say, PL/R) within MapReduce on Hadoop? As far as I've heard, Greenplum is working on a custom MapReduce engine over their Greenplum database, which will also support PL/R procedures.

Arijit

Dr. Arijit Mukherjee
Principal Member of Technical Staff, Level-II
Connectiva Systems (I) Pvt. Ltd.
J-2, Block GP, Sector V, Salt Lake
Kolkata 700 091, India
Phone: +91 (0)33 23577531/32 x 107
http://www.connectivasystems.com

-----Original Message-----
From: Enis Soztutar [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 24, 2008 2:57 PM
To: core-user@hadoop.apache.org
Subject: Re: Questions about Hadoop

Hi,

Arijit Mukherjee wrote:
> Hi
>
> We've been thinking of using Hadoop for a decision-making system which
> will analyze telecom-related data from various sources to take certain
> decisions. The data can be huge, of the order of terabytes, and can be
> stored as CSV files, which I understand will fit into Hadoop, as Tom
> White mentions in the Rough Cuts guide that Hadoop is well suited for
> records. The question I want to ask is whether it is possible to
> perform statistical analysis on the data using Hadoop and MapReduce.
> If anyone has done such a thing, we'd be very interested to know about
> it. Is it also possible to create workflow-like functionality with
> MapReduce?

Hadoop can handle data sizes in the terabytes, and statistical data analysis is one of the things that fit the MapReduce computation model perfectly. You can see what people are doing with Hadoop at http://wiki.apache.org/hadoop/PoweredBy. I think the best way to see whether Hadoop/MapReduce can meet your requirements is to read the MapReduce paper by Dean et al. You might also be interested in checking out Mahout, a subproject of Lucene that is doing machine learning on top of Hadoop. Hadoop is mostly suited to batch jobs; however, these jobs can be chained together to form a workflow.
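On the R question: one common route (my own suggestion - not something mentioned in the thread) is Hadoop Streaming, which runs any executable, an R script included, as the mapper or reducer: records arrive as lines on stdin, and tab-separated key/value lines go out on stdout. A minimal Python stand-in for such a streaming mapper, with a made-up CSV record layout:

```python
import sys

def streaming_mapper(stdin_lines):
    """Behaves like a Hadoop Streaming mapper: one input line in,
    one tab-separated key<TAB>value line out per record."""
    for line in stdin_lines:
        subscriber, duration = line.strip().split(",")
        yield f"{subscriber}\t{duration}"

if __name__ == "__main__":
    for out in streaming_mapper(sys.stdin):
        print(out)
```

Under Hadoop this would be launched via the streaming contrib jar with something roughly like `-mapper mapper.py`; an R script that reads stdin and writes the same tab-separated format drops into exactly the same slot.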
I will try to be more helpful if you could expand on what you mean by workflow.

Enis Soztutar

> Regards
> Arijit
>
> Dr. Arijit Mukherjee
> Principal Member of Technical Staff, Level-II
> Connectiva Systems (I) Pvt. Ltd.
> J-2, Block GP, Sector V, Salt Lake
> Kolkata 700 091, India
> Phone: +91 (0)33 23577531/32 x 107
> http://www.connectivasystems.com
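On the statistical-analysis point in the quoted message: many descriptive statistics decompose into sums, which is what makes them such a good fit for MapReduce. For example, the mean and variance of a terabyte-scale column can be computed by having each mapper emit a partial (count, sum, sum-of-squares) triple for its split, with a single reducer combining the triples. A small sketch of that decomposition (plain Python, illustrative only):

```python
def partial_stats(values):
    """Mapper/combiner side: collapse one input split into
    a (count, sum, sum of squares) triple."""
    n = s = ss = 0
    for v in values:
        n += 1
        s += v
        ss += v * v
    return n, s, ss

def merge_stats(partials):
    """Reducer side: combine per-split triples into mean and
    population variance without seeing the raw data again."""
    n = sum(p[0] for p in partials)
    s = sum(p[1] for p in partials)
    ss = sum(p[2] for p in partials)
    mean = s / n
    variance = ss / n - mean * mean
    return mean, variance

splits = [[1.0, 2.0], [3.0, 4.0]]  # stands in for two HDFS splits
print(merge_stats([partial_stats(s) for s in splits]))  # (2.5, 1.25)
```

The same pattern extends to covariances, histograms, and counts by group, which covers a fair amount of what one would otherwise reach for R to do.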