Re: Questions about Hadoop

2008-09-26 Thread Paco NATHAN
Edward,

Can you describe more about Hama, with respect to Hadoop?
I've read through the Incubator proposal and your blog -- it's a great approach.

Are there any benchmarks available?  E.g., size of data sets used,
kinds of operations performed, etc.

Will this project be able to make use of existing libraries?

Best,
Paco


On Thu, Sep 25, 2008 at 9:31 PM, Edward J. Yoon [EMAIL PROTECTED] wrote:
 The decision making system seems interesting to me. :)

 The question I want to ask is whether it is possible to perform statistical 
 analysis on the data using Hadoop and MapReduce.

 I'm sure Hadoop could do it. FYI, the Hama project is an easy-to-use
 framework for matrix algebra and its use in statistical analysis on
 Hadoop and HBase. (It is still at an early stage.)

 /Edward


Re: Questions about Hadoop

2008-09-25 Thread Edward J. Yoon
The decision making system seems interesting to me. :)

 The question I want to ask is whether it is possible to perform statistical 
 analysis on the data using Hadoop and MapReduce.

I'm sure Hadoop could do it. FYI, the Hama project is an easy-to-use
framework for matrix algebra and its use in statistical analysis on
Hadoop and HBase. (It is still at an early stage.)

/Edward

On Wed, Sep 24, 2008 at 5:33 PM, Arijit Mukherjee
[EMAIL PROTECTED] wrote:
 Hi

 We've been thinking of using Hadoop for a decision making system which
 will analyze telecom-related data from various sources to take certain
 decisions. The data can be huge, of the order of terabytes, and can be
 stored as CSV files, which I understand will fit into Hadoop as Tom
 White mentions in the Rough Cut Guide that Hadoop is well suited for
 records. The question I want to ask is whether it is possible to perform
 statistical analysis on the data using Hadoop and MapReduce. If anyone
 has done such a thing, we'd be very interested to know about it. Is it
 also possible to create workflow-like functionality with MapReduce?

 Regards
 Arijit

 Dr. Arijit Mukherjee
 Principal Member of Technical Staff, Level-II
 Connectiva Systems (I) Pvt. Ltd.
 J-2, Block GP, Sector V, Salt Lake
 Kolkata 700 091, India
 Phone: +91 (0)33 23577531/32 x 107
 http://www.connectivasystems.com





-- 
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org


Re: Questions about Hadoop

2008-09-24 Thread Enis Soztutar

Hi,

Arijit Mukherjee wrote:

Hi

We've been thinking of using Hadoop for a decision making system which
will analyze telecom-related data from various sources to take certain
decisions. The data can be huge, of the order of terabytes, and can be
stored as CSV files, which I understand will fit into Hadoop as Tom
White mentions in the Rough Cut Guide that Hadoop is well suited for
records. The question I want to ask is whether it is possible to perform
statistical analysis on the data using Hadoop and MapReduce. If anyone
has done such a thing, we'd be very interested to know about it. Is it
also possible to create workflow-like functionality with MapReduce?
  
Hadoop can handle TB data sizes, and statistical data analysis is one of
the perfect things that fit into the MapReduce computation model. You can
check what people are doing with Hadoop at
http://wiki.apache.org/hadoop/PoweredBy.
I think the best way to see if your requirements can be met by
Hadoop/MapReduce is to read the MapReduce paper by Dean et al. Also you
might be interested in checking out Mahout, which is a subproject of
Lucene. They are doing ML on top of Hadoop.


Hadoop is mostly suitable for batch jobs; however, these jobs can be
chained together to form a workflow. I will try to be more helpful if
you could expand on what you mean by workflow.


Enis Soztutar


Regards
Arijit

Dr. Arijit Mukherjee
Principal Member of Technical Staff, Level-II
Connectiva Systems (I) Pvt. Ltd.
J-2, Block GP, Sector V, Salt Lake
Kolkata 700 091, India
Phone: +91 (0)33 23577531/32 x 107
http://www.connectivasystems.com


  




RE: Questions about Hadoop

2008-09-24 Thread Arijit Mukherjee
Thanx Enis.

By workflow, I was trying to mean something like a chain of MapReduce
jobs - the first one will extract a certain amount of data from the
original set and do some computation resulting in a smaller summary,
which will then be the input to a further MR job, and so on...somewhat
similar to a workflow as in the SOA world.

Is it possible to use statistical analysis tools such as R (or say PL/R)
within MapReduce on Hadoop? As far as I've heard, Greenplum is working
on a custom MapReduce engine over their Greenplum database which will
also support PL/R procedures.

Arijit

Dr. Arijit Mukherjee
Principal Member of Technical Staff, Level-II
Connectiva Systems (I) Pvt. Ltd.
J-2, Block GP, Sector V, Salt Lake
Kolkata 700 091, India
Phone: +91 (0)33 23577531/32 x 107
http://www.connectivasystems.com


-Original Message-
From: Enis Soztutar [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 24, 2008 2:57 PM
To: core-user@hadoop.apache.org
Subject: Re: Questions about Hadoop


Hi,

Arijit Mukherjee wrote:
 Hi

 We've been thinking of using Hadoop for a decision making system which
 will analyze telecom-related data from various sources to take certain
 decisions. The data can be huge, of the order of terabytes, and can be
 stored as CSV files, which I understand will fit into Hadoop as Tom
 White mentions in the Rough Cut Guide that Hadoop is well suited for
 records. The question I want to ask is whether it is possible to
 perform statistical analysis on the data using Hadoop and MapReduce.
 If anyone has done such a thing, we'd be very interested to know about
 it. Is it also possible to create workflow-like functionality with
 MapReduce?
   
Hadoop can handle TB data sizes, and statistical data analysis is one of
the perfect things that fit into the MapReduce computation model. You can
check what people are doing with Hadoop at
http://wiki.apache.org/hadoop/PoweredBy.
I think the best way to see if your requirements can be met by
Hadoop/MapReduce is to read the MapReduce paper by Dean et al. Also you
might be interested in checking out Mahout, which is a subproject of
Lucene. They are doing ML on top of Hadoop.

Hadoop is mostly suitable for batch jobs; however, these jobs can be
chained together to form a workflow. I will try to be more helpful if
you could expand on what you mean by workflow.

Enis Soztutar

 Regards
 Arijit

 Dr. Arijit Mukherjee
 Principal Member of Technical Staff, Level-II
 Connectiva Systems (I) Pvt. Ltd.
 J-2, Block GP, Sector V, Salt Lake
 Kolkata 700 091, India
 Phone: +91 (0)33 23577531/32 x 107 http://www.connectivasystems.com


   
No virus found in this incoming message.
Checked by AVG - http://www.avg.com 
Version: 8.0.169 / Virus Database: 270.7.1/1687 - Release Date:
9/23/2008 6:32 PM




RE: Questions about Hadoop

2008-09-24 Thread Arijit Mukherjee
Thanx again Enis. I'll have a look at Pig and Hive.

Regards
Arijit

Dr. Arijit Mukherjee
Principal Member of Technical Staff, Level-II
Connectiva Systems (I) Pvt. Ltd.
J-2, Block GP, Sector V, Salt Lake
Kolkata 700 091, India
Phone: +91 (0)33 23577531/32 x 107
http://www.connectivasystems.com


-Original Message-
From: Enis Soztutar [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 24, 2008 4:53 PM
To: core-user@hadoop.apache.org
Subject: Re: Questions about Hadoop




Arijit Mukherjee wrote:
 Thanx Enis.

 By workflow, I was trying to mean something like a chain of MapReduce 
 jobs - the first one will extract a certain amount of data from the 
 original set and do some computation resulting in a smaller summary, 
 which will then be the input to a further MR job, and so on...somewhat
 similar to a workflow as in the SOA world.

   
Yes, you can always chain jobs together to form a final summary.
o.a.h.mapred.jobcontrol.JobControl might be interesting for you.
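To make the chaining idea concrete, here is a tiny in-process sketch (plain Python rather than Hadoop's Java API) of what such a two-job workflow amounts to: the summary emitted by the first map/reduce pass becomes the input of the second. The CSV field layout and the "heavy user" threshold are invented for illustration.

```python
from itertools import groupby
from operator import itemgetter

def run_mapreduce(records, mapper, reducer):
    """Tiny in-process stand-in for one Hadoop job: map, shuffle/sort, reduce."""
    pairs = [kv for rec in records for kv in mapper(rec)]
    pairs.sort(key=itemgetter(0))                  # the "shuffle/sort" phase
    out = []
    for key, group in groupby(pairs, key=itemgetter(0)):
        out.extend(reducer(key, [v for _, v in group]))
    return out

# Job 1: extract (subscriber, duration) pairs from raw CSV call records.
def extract_map(line):
    subscriber, _date, duration = line.split(",")
    yield subscriber, int(duration)

def sum_reduce(key, values):
    yield key, sum(values)

# Job 2: bucket subscribers by total usage -- its input is Job 1's summary.
def bucket_map(kv):
    _subscriber, total = kv
    yield ("heavy" if total >= 100 else "light"), 1

raw = ["alice,2008-09-24,90", "alice,2008-09-25,30", "bob,2008-09-24,20"]
summary = run_mapreduce(raw, extract_map, sum_reduce)
buckets = dict(run_mapreduce(summary, bucket_map, sum_reduce))
```

In real Hadoop the same dependency would be declared with JobControl (each downstream job registered as depending on the one that produces its input), but the data flow is exactly this shape.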
 Is it possible to use statistical analysis tools such as R (or say 
 PL/R) within MapReduce on Hadoop? As far as I've heard, Greenplum is 
 working on a custom MapReduce engine over their Greenplum database 
 which will also support PL/R procedures.
   
Using R on Hadoop might involve some level of custom coding. If you are
looking for an ad-hoc tool for data mining, then check out Pig and Hive.

Enis
 Arijit

 Dr. Arijit Mukherjee
 Principal Member of Technical Staff, Level-II
 Connectiva Systems (I) Pvt. Ltd.
 J-2, Block GP, Sector V, Salt Lake
 Kolkata 700 091, India
 Phone: +91 (0)33 23577531/32 x 107 http://www.connectivasystems.com


 -Original Message-
 From: Enis Soztutar [mailto:[EMAIL PROTECTED] 
 Sent: Wednesday, September 24, 2008 2:57 PM
 To: core-user@hadoop.apache.org
 Subject: Re: Questions about Hadoop


 Hi,

 Arijit Mukherjee wrote:
  Hi

  We've been thinking of using Hadoop for a decision making system which
  will analyze telecom-related data from various sources to take certain
  decisions. The data can be huge, of the order of terabytes, and can be
  stored as CSV files, which I understand will fit into Hadoop as Tom
  White mentions in the Rough Cut Guide that Hadoop is well suited for
  records. The question I want to ask is whether it is possible to
  perform statistical analysis on the data using Hadoop and MapReduce.
  If anyone has done such a thing, we'd be very interested to know about
  it. Is it also possible to create workflow-like functionality with
  MapReduce?

 Hadoop can handle TB data sizes, and statistical data analysis is one of
 the perfect things that fit into the MapReduce computation model. You can
 check what people are doing with Hadoop at
 http://wiki.apache.org/hadoop/PoweredBy.
 I think the best way to see if your requirements can be met by
 Hadoop/MapReduce is to read the MapReduce paper by Dean et al. Also you
 might be interested in checking out Mahout, which is a subproject of
 Lucene. They are doing ML on top of Hadoop.

 Hadoop is mostly suitable for batch jobs; however, these jobs can be
 chained together to form a workflow. I will try to be more helpful if
 you could expand on what you mean by workflow.

 Enis Soztutar

   
 Regards
 Arijit

 Dr. Arijit Mukherjee
 Principal Member of Technical Staff, Level-II
 Connectiva Systems (I) Pvt. Ltd.
 J-2, Block GP, Sector V, Salt Lake
 Kolkata 700 091, India
 Phone: +91 (0)33 23577531/32 x 107 http://www.connectivasystems.com


   
 




Re: Questions about Hadoop

2008-09-24 Thread Paco NATHAN
Arijit,

For workflow, check out http://cascading.org  -- that works quite well
and fits what you described.

Greenplum and Aster Data have announced support for running MR within
the context of their relational databases, e.g.,
http://www.greenplum.com/resources/mapreduce/

In terms of Pig, Hive, these RDBMS vendors, etc., they seem to be
quite good for situations where there are lots of ad hoc queries,
short-term business intelligence needs, or less-technical staff involved.
However, if there are large, repeated batch jobs which require
significant analytics work, then I'm not so convinced that SQL is the
right mind-set for representing the math required for algorithms or
for maintaining complex code throughout the software lifecycle.


I run an analytics group where our statisticians use R, while our
developers use Hadoop, Cascading, etc., at scale on terabytes.  One
approach is simply to sample data, analyze it in R, then use the
analysis to articulate requirements for developers to use at scale.
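One single-pass way to draw such a sample without ever loading the full data set, for example inside a streaming mapper, is reservoir sampling. A small Python sketch (the record format is invented for illustration):

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Keep a uniform random sample of k items from a stream of unknown length."""
    rng = rng or random.Random(42)
    sample = []
    for i, item in enumerate(stream):
        if i < k:
            sample.append(item)        # fill the reservoir first
        else:
            j = rng.randint(0, i)      # item i survives with probability k/(i+1)
            if j < k:
                sample[j] = item
    return sample

# Sample 1,000 rows out of a million, then hand the small result to R.
rows = (f"record-{i}" for i in range(1_000_000))
subset = reservoir_sample(rows, 1000)
```

The memory footprint stays at k items regardless of input size, which is what makes the "sample first, analyze in R" workflow practical on terabyte inputs.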

In terms of running R on large data, one issue is that -- in contrast
to SAS, where data is handled line-by-line -- R is limited by how much
data can be loaded into memory.

Another issue is that while some areas of statistical data analysis
are suitable for MapReduce, others clearly are not. Mahout or similar
projects may go far, but do not expect them to be capable of
displacing R, SAS, etc.  For example, you can accomplish much by
scanning a data set to determine N, sum X, sum X^2, etc., to produce
descriptive stats, quantiles, C.I., plots for p.d.f., c.d.f., etc.
Quite useful. However, MapReduce requires data independence, so it
will not serve well for tasks such as inverting a matrix.
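That single-scan pattern boils down to each mapper emitting a partial (N, sum X, sum X^2) triple for its split and the reducer adding the triples component-wise; mean and standard deviation fall out at the end. A minimal Python sketch of the idea (no Hadoop required to see it; the sample splits are made up):

```python
import math

def map_partition(values):
    """Each mapper emits one partial triple for its input split."""
    n = len(values)
    return (n, sum(values), sum(v * v for v in values))

def reduce_triples(triples):
    """The reducer adds the triples component-wise, then derives the stats."""
    n = sum(t[0] for t in triples)
    sx = sum(t[1] for t in triples)
    sxx = sum(t[2] for t in triples)
    mean = sx / n
    variance = sxx / n - mean * mean   # population variance
    return n, mean, math.sqrt(variance)

# Each inner list stands in for one HDFS split handled by one mapper.
splits = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
n, mean, stddev = reduce_triples([map_partition(s) for s in splits])
```

Because the triples are associative and commutative, this works no matter how the data is partitioned -- which is exactly the data-independence property MapReduce requires, and exactly what matrix inversion lacks.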

You might want to look into Parallel R, and talk with
http://www.revolution-computing.com/

Our team has a project which runs Hadoop workflows underneath R.  It
is at an early stage, and there's no plan yet about a public release.
It's not a simple thing to implement by any stretch of the
imagination!

Best,
Paco



On Wed, Sep 24, 2008 at 4:39 AM, Arijit Mukherjee
[EMAIL PROTECTED] wrote:
 Thanx Enis.

 By workflow, I was trying to mean something like a chain of MapReduce
 jobs - the first one will extract a certain amount of data from the
 original set and do some computation resulting in a smaller summary,
 which will then be the input to a further MR job, and so on...somewhat
 similar to a workflow as in the SOA world.

 Is it possible to use statistical analysis tools such as R (or say PL/R)
 within MapReduce on Hadoop? As far as I've heard, Greenplum is working
 on a custom MapReduce engine over their Greenplum database which will
 also support PL/R procedures.

 Arijit

 Dr. Arijit Mukherjee
 Principal Member of Technical Staff, Level-II
 Connectiva Systems (I) Pvt. Ltd.
 J-2, Block GP, Sector V, Salt Lake
 Kolkata 700 091, India
 Phone: +91 (0)33 23577531/32 x 107
 http://www.connectivasystems.com



RE: Questions about Hadoop

2008-09-24 Thread Arijit Mukherjee
That's a very good overview Paco - thanx for that. I might get back to
you with more queries about Cascading etc. at some time - hope you
wouldn't mind.

Regards
Arijit

Dr. Arijit Mukherjee
Principal Member of Technical Staff, Level-II
Connectiva Systems (I) Pvt. Ltd.
J-2, Block GP, Sector V, Salt Lake
Kolkata 700 091, India
Phone: +91 (0)33 23577531/32 x 107
http://www.connectivasystems.com


-Original Message-
From: Paco NATHAN [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, September 24, 2008 6:10 PM
To: core-user@hadoop.apache.org; [EMAIL PROTECTED]
Subject: Re: Questions about Hadoop


Arijit,

For workflow, check out http://cascading.org  -- that works quite well
and fits what you described.

Greenplum and Aster Data have announced support for running MR within
the context of their relational databases, e.g.,
http://www.greenplum.com/resources/mapreduce/

In terms of Pig, Hive, these RDBMS vendors, etc., they seem to be quite
good for situations where there are lots of ad hoc queries, short-term
business intelligence needs, or less-technical staff involved. However,
if there are large, repeated batch jobs which require significant
analytics work, then I'm not so convinced that SQL is the right mind-set
for representing the math required for algorithms or for maintaining
complex code throughout the software lifecycle.


I run an analytics group where our statisticians use R, while our
developers use Hadoop, Cascading, etc., at scale on terabytes.  One
approach is simply to sample data, analyze it in R, then use the
analysis to articulate requirements for developers to use at scale.

In terms of running R on large data, one issue is that -- in contrast to
SAS, where data is handled line-by-line -- R is limited by how much data
can be loaded into memory.

Another issue is that while some areas of statistical data analysis are
suitable for MapReduce, others clearly are not. Mahout or similar
projects may go far, but do not expect them to be capable of displacing
R, SAS, etc.  For example, you can accomplish much by scanning a data
set to determine N, sum X, sum X^2, etc., to produce descriptive stats,
quantiles, C.I., plots for p.d.f., c.d.f., etc. Quite useful. However,
MapReduce requires data independence, so it will not serve well for
tasks such as inverting a matrix.

You might want to look into Parallel R, and talk with
http://www.revolution-computing.com/

Our team has a project which runs Hadoop workflows underneath R.  It is
at an early stage, and there's no plan yet about a public release. It's
not a simple thing to implement by any stretch of the imagination!

Best,
Paco



On Wed, Sep 24, 2008 at 4:39 AM, Arijit Mukherjee
[EMAIL PROTECTED] wrote:
 Thanx Enis.

 By workflow, I was trying to mean something like a chain of MapReduce 
 jobs - the first one will extract a certain amount of data from the 
 original set and do some computation resulting in a smaller summary, 
 which will then be the input to a further MR job, and so on...somewhat
 similar to a workflow as in the SOA world.

 Is it possible to use statistical analysis tools such as R (or say 
 PL/R) within MapReduce on Hadoop? As far as I've heard, Greenplum is 
 working on a custom MapReduce engine over their Greenplum database 
 which will also support PL/R procedures.

 Arijit

 Dr. Arijit Mukherjee
 Principal Member of Technical Staff, Level-II
 Connectiva Systems (I) Pvt. Ltd.
 J-2, Block GP, Sector V, Salt Lake
 Kolkata 700 091, India
 Phone: +91 (0)33 23577531/32 x 107 http://www.connectivasystems.com





Re: Questions about Hadoop

2008-09-24 Thread Paco NATHAN
Certainly. It'd be great to talk with others working in analytics and
statistical computing, who have been evaluating MapReduce as well.

Paco


On Wed, Sep 24, 2008 at 7:45 AM, Arijit Mukherjee
[EMAIL PROTECTED] wrote:
 That's a very good overview Paco - thanx for that. I might get back to
 you with more queries about Cascading etc. at some time - hope you
 wouldn't mind.

 Regards
 Arijit

 Dr. Arijit Mukherjee
 Principal Member of Technical Staff, Level-II
 Connectiva Systems (I) Pvt. Ltd.
 J-2, Block GP, Sector V, Salt Lake
 Kolkata 700 091, India
 Phone: +91 (0)33 23577531/32 x 107
 http://www.connectivasystems.com


 -Original Message-
 From: Paco NATHAN [mailto:[EMAIL PROTECTED]
 Sent: Wednesday, September 24, 2008 6:10 PM
 To: core-user@hadoop.apache.org; [EMAIL PROTECTED]
 Subject: Re: Questions about Hadoop


 Arijit,

 For workflow, check out http://cascading.org  -- that works quite well
 and fits what you described.

 Greenplum and Aster Data have announced support for running MR within
 the context of their relational databases, e.g.,
 http://www.greenplum.com/resources/mapreduce/

 In terms of Pig, Hive, these RDBMS vendors, etc., they seem to be quite
 good for situations where there are lots of ad hoc queries, short-term
 business intelligence needs, or less-technical staff involved. However,
 if there are large, repeated batch jobs which require significant
 analytics work, then I'm not so convinced that SQL is the right mind-set
 for representing the math required for algorithms or for maintaining
 complex code throughout the software lifecycle.


 I run an analytics group where our statisticians use R, while our
 developers use Hadoop, Cascading, etc., at scale on terabytes.  One
 approach is simply to sample data, analyze it in R, then use the
 analysis to articulate requirements for developers to use at scale.

 In terms of running R on large data, one issue is that -- in contrast to
 SAS, where data is handled line-by-line -- R is limited by how much data
 can be loaded into memory.

 Another issue is that while some areas of statistical data analysis are
 suitable for MapReduce, others clearly are not. Mahout or similar
 projects may go far, but do not expect them to be capable of displacing
 R, SAS, etc.  For example, you can accomplish much by scanning a data
 set to determine N, sum X, sum X^2, etc., to produce descriptive stats,
 quantiles, C.I., plots for p.d.f., c.d.f., etc. Quite useful. However,
 MapReduce requires data independence, so it will not serve well for
 tasks such as inverting a matrix.

 You might want to look into Parallel R, and talk with
 http://www.revolution-computing.com/

 Our team has a project which runs Hadoop workflows underneath R.  It is
 at an early stage, and there's no plan yet about a public release. It's
 not a simple thing to implement by any stretch of the imagination!

 Best,
 Paco



 On Wed, Sep 24, 2008 at 4:39 AM, Arijit Mukherjee
 [EMAIL PROTECTED] wrote:
 Thanx Enis.

 By workflow, I was trying to mean something like a chain of MapReduce
 jobs - the first one will extract a certain amount of data from the
 original set and do some computation resulting in a smaller summary,
 which will then be the input to a further MR job, and so on...somewhat
 similar to a workflow as in the SOA world.

 Is it possible to use statistical analysis tools such as R (or say
 PL/R) within MapReduce on Hadoop? As far as I've heard, Greenplum is
 working on a custom MapReduce engine over their Greenplum database
 which will also support PL/R procedures.

 Arijit

 Dr. Arijit Mukherjee
 Principal Member of Technical Staff, Level-II
 Connectiva Systems (I) Pvt. Ltd.
 J-2, Block GP, Sector V, Salt Lake
 Kolkata 700 091, India
 Phone: +91 (0)33 23577531/32 x 107 http://www.connectivasystems.com
