Re: Questions about Hadoop
> Are there any benchmarks available? E.g., size of data sets used,
> kinds of operations performed, etc.
>
> Will this project be able to make use of existing libraries?

Sure, the source code is now available in the SVN repository
(https://svn.apache.org/repos/asf/incubator/hama/trunk hama-trunk).
I'm also trying to research and benchmark its performance and
parallelism, but it will take some time to release that.

/Edward

On Fri, Sep 26, 2008 at 10:05 PM, Paco NATHAN <[EMAIL PROTECTED]> wrote:
> Edward,
>
> Can you describe more about Hama, with respect to Hadoop?
> I've read through the Incubator proposal and your blog -- it's a great
> approach.
>
> Are there any benchmarks available? E.g., size of data sets used,
> kinds of operations performed, etc.
>
> Will this project be able to make use of existing libraries?
>
> Best,
> Paco
>
> On Thu, Sep 25, 2008 at 9:31 PM, Edward J. Yoon <[EMAIL PROTECTED]> wrote:
>> The decision making system seems interesting to me. :)
>>
>>> The question I want to ask is whether it is possible to perform statistical
>>> analysis on the data using Hadoop and MapReduce.
>>
>> I'm sure Hadoop could do it. FYI, the Hama project is an easy-to-use
>> package for matrix algebra and its uses in statistical analysis on
>> Hadoop and HBase. (It is still at an early stage.)
>>
>> /Edward

--
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org
Re: Questions about Hadoop
Edward,

Can you describe more about Hama, with respect to Hadoop?
I've read through the Incubator proposal and your blog -- it's a great
approach.

Are there any benchmarks available? E.g., size of data sets used,
kinds of operations performed, etc.

Will this project be able to make use of existing libraries?

Best,
Paco

On Thu, Sep 25, 2008 at 9:31 PM, Edward J. Yoon <[EMAIL PROTECTED]> wrote:
> The decision making system seems interesting to me. :)
>
>> The question I want to ask is whether it is possible to perform statistical
>> analysis on the data using Hadoop and MapReduce.
>
> I'm sure Hadoop could do it. FYI, the Hama project is an easy-to-use
> package for matrix algebra and its uses in statistical analysis on
> Hadoop and HBase. (It is still at an early stage.)
>
> /Edward
Re: Questions about Hadoop
The decision making system seems interesting to me. :)

> The question I want to ask is whether it is possible to perform statistical
> analysis on the data using Hadoop and MapReduce.

I'm sure Hadoop could do it. FYI, the Hama project is an easy-to-use
package for matrix algebra and its uses in statistical analysis on
Hadoop and HBase. (It is still at an early stage.)

/Edward

On Wed, Sep 24, 2008 at 5:33 PM, Arijit Mukherjee <[EMAIL PROTECTED]> wrote:
> Hi
>
> We've been thinking of using Hadoop for a decision making system which
> will analyze telecom-related data from various sources to take certain
> decisions. The data can be huge, of the order of terabytes, and can be
> stored as CSV files, which I understand will fit into Hadoop as Tom
> White mentions in the Rough Cut Guide that Hadoop is well suited for
> records. The question I want to ask is whether it is possible to perform
> statistical analysis on the data using Hadoop and MapReduce. If anyone
> has done such a thing, we'd be very interested to know about it. Is it
> also possible to create a workflow-like functionality with MapReduce?
>
> Regards
> Arijit
>
> Dr. Arijit Mukherjee
> Principal Member of Technical Staff, Level-II
> Connectiva Systems (I) Pvt. Ltd.
> J-2, Block GP, Sector V, Salt Lake
> Kolkata 700 091, India
> Phone: +91 (0)33 23577531/32 x 107
> http://www.connectivasystems.com

--
Best regards, Edward J. Yoon
[EMAIL PROTECTED]
http://blog.udanax.org
Re: Questions about Hadoop
Certainly. It'd be great to talk with others working in analytics and
statistical computing, who have been evaluating MapReduce as well.

Paco

On Wed, Sep 24, 2008 at 7:45 AM, Arijit Mukherjee <[EMAIL PROTECTED]> wrote:
> That's a very good overview Paco - thanx for that. I might get back to
> you with more queries about Cascading etc. at some time - hope you
> wouldn't mind.
>
> Regards
> Arijit
>
> Dr. Arijit Mukherjee
> Principal Member of Technical Staff, Level-II
> Connectiva Systems (I) Pvt. Ltd.
> J-2, Block GP, Sector V, Salt Lake
> Kolkata 700 091, India
> Phone: +91 (0)33 23577531/32 x 107
> http://www.connectivasystems.com
>
> -Original Message-
> From: Paco NATHAN [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 24, 2008 6:10 PM
> To: core-user@hadoop.apache.org; [EMAIL PROTECTED]
> Subject: Re: Questions about Hadoop
>
> Arijit,
>
> For workflow, check out http://cascading.org -- that works quite well
> and fits what you described.
>
> Greenplum and Aster Data have announced support for running MR within
> the context of their relational databases, e.g.,
> http://www.greenplum.com/resources/mapreduce/
>
> In terms of Pig, Hive, these RDBMS vendors, etc., they seem to be quite
> good for situations where there are lots of ad hoc queries, short-term
> business intelligence needs, and less-technical staff involved. However,
> if there are large, repeated batch jobs which require significant
> analytics work, then I'm not so convinced that SQL is the right mind-set
> for representing the math required for algorithms or for maintaining
> complex code throughout the software lifecycle.
>
> I run an analytics group where our statisticians use R, while our
> developers use Hadoop, Cascading, etc., at scale on terabytes. One
> approach is simply to sample data, analyze it in R, then use the
> analysis to articulate requirements for developers to use at scale.
>
> In terms of running R on large data, one issue is that -- in contrast to
> SAS, where data is handled line-by-line -- R is limited by how much data
> can be loaded into memory.
>
> Another issue is that while some areas of statistical data analysis are
> suitable for MapReduce, others clearly are not. Mahout or similar
> projects may go far, but do not expect them to be capable of displacing
> R, SAS, etc. For example, you can accomplish much by scanning a data
> set to determine N, sum X, sum X^2, etc., to produce descriptive stats,
> quantiles, C.I., plots for p.d.f., c.d.f., etc. Quite useful. However,
> MapReduce requires data independence, so it will not serve well for
> tasks such as inverting a matrix.
>
> You might want to look into Parallel R, and talk with
> http://www.revolution-computing.com/
>
> Our team has a project which runs Hadoop workflows underneath R. It is
> at an early stage, and there's no plan yet about a public release. It's
> not a simple thing to implement by any stretch of the imagination!
>
> Best,
> Paco
>
> On Wed, Sep 24, 2008 at 4:39 AM, Arijit Mukherjee
> <[EMAIL PROTECTED]> wrote:
>> Thanx Enis.
>>
>> By workflow, I meant something like a chain of MapReduce
>> jobs - the first one will extract a certain amount of data from the
>> original set and do some computation resulting in a smaller summary,
>> which will then be the input to a further MR job, and so on...somewhat
>> similar to a workflow as in the SOA world.
>>
>> Is it possible to use statistical analysis tools such as R (or say
>> PL/R) within MapReduce on Hadoop? As far as I've heard, Greenplum is
>> working on a custom MapReduce engine over their Greenplum database
>> which will also support PL/R procedures.
>>
>> Arijit
>>
>> Dr. Arijit Mukherjee
>> Principal Member of Technical Staff, Level-II
>> Connectiva Systems (I) Pvt. Ltd.
>> J-2, Block GP, Sector V, Salt Lake
>> Kolkata 700 091, India
>> Phone: +91 (0)33 23577531/32 x 107 http://www.connectivasystems.com
RE: Questions about Hadoop
That's a very good overview Paco - thanx for that. I might get back to
you with more queries about Cascading etc. at some time - hope you
wouldn't mind.

Regards
Arijit

Dr. Arijit Mukherjee
Principal Member of Technical Staff, Level-II
Connectiva Systems (I) Pvt. Ltd.
J-2, Block GP, Sector V, Salt Lake
Kolkata 700 091, India
Phone: +91 (0)33 23577531/32 x 107
http://www.connectivasystems.com

-Original Message-
From: Paco NATHAN [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 24, 2008 6:10 PM
To: core-user@hadoop.apache.org; [EMAIL PROTECTED]
Subject: Re: Questions about Hadoop

Arijit,

For workflow, check out http://cascading.org -- that works quite well
and fits what you described.

Greenplum and Aster Data have announced support for running MR within
the context of their relational databases, e.g.,
http://www.greenplum.com/resources/mapreduce/

In terms of Pig, Hive, these RDBMS vendors, etc., they seem to be quite
good for situations where there are lots of ad hoc queries, short-term
business intelligence needs, and less-technical staff involved. However,
if there are large, repeated batch jobs which require significant
analytics work, then I'm not so convinced that SQL is the right mind-set
for representing the math required for algorithms or for maintaining
complex code throughout the software lifecycle.

I run an analytics group where our statisticians use R, while our
developers use Hadoop, Cascading, etc., at scale on terabytes. One
approach is simply to sample data, analyze it in R, then use the
analysis to articulate requirements for developers to use at scale.

In terms of running R on large data, one issue is that -- in contrast to
SAS, where data is handled line-by-line -- R is limited by how much data
can be loaded into memory.

Another issue is that while some areas of statistical data analysis are
suitable for MapReduce, others clearly are not. Mahout or similar
projects may go far, but do not expect them to be capable of displacing
R, SAS, etc. For example, you can accomplish much by scanning a data
set to determine N, sum X, sum X^2, etc., to produce descriptive stats,
quantiles, C.I., plots for p.d.f., c.d.f., etc. Quite useful. However,
MapReduce requires data independence, so it will not serve well for
tasks such as inverting a matrix.

You might want to look into Parallel R, and talk with
http://www.revolution-computing.com/

Our team has a project which runs Hadoop workflows underneath R. It is
at an early stage, and there's no plan yet about a public release. It's
not a simple thing to implement by any stretch of the imagination!

Best,
Paco

On Wed, Sep 24, 2008 at 4:39 AM, Arijit Mukherjee <[EMAIL PROTECTED]> wrote:
> Thanx Enis.
>
> By workflow, I meant something like a chain of MapReduce
> jobs - the first one will extract a certain amount of data from the
> original set and do some computation resulting in a smaller summary,
> which will then be the input to a further MR job, and so on...somewhat
> similar to a workflow as in the SOA world.
>
> Is it possible to use statistical analysis tools such as R (or say
> PL/R) within MapReduce on Hadoop? As far as I've heard, Greenplum is
> working on a custom MapReduce engine over their Greenplum database
> which will also support PL/R procedures.
>
> Arijit
>
> Dr. Arijit Mukherjee
> Principal Member of Technical Staff, Level-II
> Connectiva Systems (I) Pvt. Ltd.
> J-2, Block GP, Sector V, Salt Lake
> Kolkata 700 091, India
> Phone: +91 (0)33 23577531/32 x 107 http://www.connectivasystems.com
Re: Questions about Hadoop
Arijit,

For workflow, check out http://cascading.org -- that works quite well
and fits what you described.

Greenplum and Aster Data have announced support for running MR within
the context of their relational databases, e.g.,
http://www.greenplum.com/resources/mapreduce/

In terms of Pig, Hive, these RDBMS vendors, etc., they seem to be quite
good for situations where there are lots of ad hoc queries, short-term
business intelligence needs, and less-technical staff involved. However,
if there are large, repeated batch jobs which require significant
analytics work, then I'm not so convinced that SQL is the right mind-set
for representing the math required for algorithms or for maintaining
complex code throughout the software lifecycle.

I run an analytics group where our statisticians use R, while our
developers use Hadoop, Cascading, etc., at scale on terabytes. One
approach is simply to sample data, analyze it in R, then use the
analysis to articulate requirements for developers to use at scale.

In terms of running R on large data, one issue is that -- in contrast to
SAS, where data is handled line-by-line -- R is limited by how much data
can be loaded into memory.

Another issue is that while some areas of statistical data analysis are
suitable for MapReduce, others clearly are not. Mahout or similar
projects may go far, but do not expect them to be capable of displacing
R, SAS, etc. For example, you can accomplish much by scanning a data
set to determine N, sum X, sum X^2, etc., to produce descriptive stats,
quantiles, C.I., plots for p.d.f., c.d.f., etc. Quite useful. However,
MapReduce requires data independence, so it will not serve well for
tasks such as inverting a matrix.

You might want to look into Parallel R, and talk with
http://www.revolution-computing.com/

Our team has a project which runs Hadoop workflows underneath R. It is
at an early stage, and there's no plan yet about a public release. It's
not a simple thing to implement by any stretch of the imagination!

Best,
Paco

On Wed, Sep 24, 2008 at 4:39 AM, Arijit Mukherjee <[EMAIL PROTECTED]> wrote:
> Thanx Enis.
>
> By workflow, I meant something like a chain of MapReduce
> jobs - the first one will extract a certain amount of data from the
> original set and do some computation resulting in a smaller summary,
> which will then be the input to a further MR job, and so on...somewhat
> similar to a workflow as in the SOA world.
>
> Is it possible to use statistical analysis tools such as R (or say PL/R)
> within MapReduce on Hadoop? As far as I've heard, Greenplum is working
> on a custom MapReduce engine over their Greenplum database which will
> also support PL/R procedures.
>
> Arijit
>
> Dr. Arijit Mukherjee
> Principal Member of Technical Staff, Level-II
> Connectiva Systems (I) Pvt. Ltd.
> J-2, Block GP, Sector V, Salt Lake
> Kolkata 700 091, India
> Phone: +91 (0)33 23577531/32 x 107
> http://www.connectivasystems.com
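
A minimal sketch of the single-scan approach Paco describes in the message above: each partition of the data is reduced to the triple (N, sum X, sum of squared X), and the triples add component-wise, which is what lets the work be split across independent mappers. This is plain Python rather than Hadoop code; the function names and the sample data are assumptions made for illustration only.

```python
from functools import reduce
from math import sqrt


def summarize(partition):
    """Map step: reduce one partition of numbers to (N, sum X, sum X^2)."""
    n, sx, sxx = 0, 0.0, 0.0
    for x in partition:
        n += 1
        sx += x
        sxx += x * x
    return n, sx, sxx


def combine(a, b):
    """Reduce step: partial summaries add component-wise, in any order."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def describe(partitions):
    """Return (N, mean, sample standard deviation) from partition summaries."""
    n, sx, sxx = reduce(combine, map(summarize, partitions), (0, 0.0, 0.0))
    if n < 2:
        raise ValueError("need at least two observations")
    mean = sx / n
    variance = (sxx - n * mean * mean) / (n - 1)
    # Guard against a tiny negative value caused by floating-point rounding.
    return n, mean, sqrt(max(variance, 0.0))


if __name__ == "__main__":
    # Hypothetical data, split the way input splits might reach mappers.
    parts = [[2.0, 4.0, 4.0], [4.0, 5.0], [5.0, 7.0, 9.0]]
    print(describe(parts))
```

The sum-of-squares formula is convenient because it needs only one pass, but it can lose precision when the mean is large relative to the spread of the data, so it is best treated as a starting point.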
RE: Questions about Hadoop
Thanx again Enis. I'll have a look at Pig and Hive.

Regards
Arijit

Dr. Arijit Mukherjee
Principal Member of Technical Staff, Level-II
Connectiva Systems (I) Pvt. Ltd.
J-2, Block GP, Sector V, Salt Lake
Kolkata 700 091, India
Phone: +91 (0)33 23577531/32 x 107
http://www.connectivasystems.com

-Original Message-
From: Enis Soztutar [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 24, 2008 4:53 PM
To: core-user@hadoop.apache.org
Subject: Re: Questions about Hadoop

Arijit Mukherjee wrote:
> Thanx Enis.
>
> By workflow, I meant something like a chain of MapReduce
> jobs - the first one will extract a certain amount of data from the
> original set and do some computation resulting in a smaller summary,
> which will then be the input to a further MR job, and so on...somewhat
> similar to a workflow as in the SOA world.

Yes, you can always chain jobs together to form a final summary.
o.a.h.mapred.jobcontrol.JobControl might be interesting for you.

> Is it possible to use statistical analysis tools such as R (or say
> PL/R) within MapReduce on Hadoop? As far as I've heard, Greenplum is
> working on a custom MapReduce engine over their Greenplum database
> which will also support PL/R procedures.

Using R on Hadoop might require some level of custom coding. If you are
looking for an ad-hoc tool for data mining, then check Pig and Hive.

Enis

> Arijit
>
> Dr. Arijit Mukherjee
> Principal Member of Technical Staff, Level-II
> Connectiva Systems (I) Pvt. Ltd.
> J-2, Block GP, Sector V, Salt Lake
> Kolkata 700 091, India
> Phone: +91 (0)33 23577531/32 x 107 http://www.connectivasystems.com
>
> -Original Message-
> From: Enis Soztutar [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 24, 2008 2:57 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Questions about Hadoop
>
> Hi,
>
> Arijit Mukherjee wrote:
>> Hi
>>
>> We've been thinking of using Hadoop for a decision making system which
>> will analyze telecom-related data from various sources to take certain
>> decisions. The data can be huge, of the order of terabytes, and can be
>> stored as CSV files, which I understand will fit into Hadoop as Tom
>> White mentions in the Rough Cut Guide that Hadoop is well suited for
>> records. The question I want to ask is whether it is possible to
>> perform statistical analysis on the data using Hadoop and MapReduce.
>> If anyone has done such a thing, we'd be very interested to know about
>> it. Is it also possible to create a workflow-like functionality with
>> MapReduce?
>
> Hadoop can handle TB data sizes, and statistical data analysis is one of
> the things that fit the MapReduce computation model well. You can
> check what people are doing with Hadoop at
> http://wiki.apache.org/hadoop/PoweredBy.
> I think the best way to see if your requirements can be met by
> Hadoop/MapReduce is to read the MapReduce paper by Dean et al. Also you
> might be interested in checking out Mahout, which is a subproject of
> Lucene. They are doing ML on top of Hadoop.
>
> Hadoop is mostly suitable for batch jobs; however, these jobs can be
> chained together to form a workflow. I will try to be more helpful if
> you could expand on what you mean by workflow.
>
> Enis Soztutar
>
>> Regards
>> Arijit
>>
>> Dr. Arijit Mukherjee
>> Principal Member of Technical Staff, Level-II
>> Connectiva Systems (I) Pvt. Ltd.
>> J-2, Block GP, Sector V, Salt Lake
>> Kolkata 700 091, India
>> Phone: +91 (0)33 23577531/32 x 107 http://www.connectivasystems.com
Re: Questions about Hadoop
Arijit Mukherjee wrote:
> Thanx Enis.
>
> By workflow, I meant something like a chain of MapReduce
> jobs - the first one will extract a certain amount of data from the
> original set and do some computation resulting in a smaller summary,
> which will then be the input to a further MR job, and so on...somewhat
> similar to a workflow as in the SOA world.

Yes, you can always chain jobs together to form a final summary.
o.a.h.mapred.jobcontrol.JobControl might be interesting for you.

> Is it possible to use statistical analysis tools such as R (or say
> PL/R) within MapReduce on Hadoop? As far as I've heard, Greenplum is
> working on a custom MapReduce engine over their Greenplum database
> which will also support PL/R procedures.

Using R on Hadoop might require some level of custom coding. If you are
looking for an ad-hoc tool for data mining, then check Pig and Hive.

Enis

> Arijit
>
> Dr. Arijit Mukherjee
> Principal Member of Technical Staff, Level-II
> Connectiva Systems (I) Pvt. Ltd.
> J-2, Block GP, Sector V, Salt Lake
> Kolkata 700 091, India
> Phone: +91 (0)33 23577531/32 x 107 http://www.connectivasystems.com
>
> -Original Message-
> From: Enis Soztutar [mailto:[EMAIL PROTECTED]
> Sent: Wednesday, September 24, 2008 2:57 PM
> To: core-user@hadoop.apache.org
> Subject: Re: Questions about Hadoop
>
> Hi,
>
> Arijit Mukherjee wrote:
>> Hi
>>
>> We've been thinking of using Hadoop for a decision making system which
>> will analyze telecom-related data from various sources to take certain
>> decisions. The data can be huge, of the order of terabytes, and can be
>> stored as CSV files, which I understand will fit into Hadoop as Tom
>> White mentions in the Rough Cut Guide that Hadoop is well suited for
>> records. The question I want to ask is whether it is possible to
>> perform statistical analysis on the data using Hadoop and MapReduce.
>> If anyone has done such a thing, we'd be very interested to know about
>> it. Is it also possible to create a workflow-like functionality with
>> MapReduce?
>
> Hadoop can handle TB data sizes, and statistical data analysis is one of
> the things that fit the MapReduce computation model well. You can
> check what people are doing with Hadoop at
> http://wiki.apache.org/hadoop/PoweredBy.
> I think the best way to see if your requirements can be met by
> Hadoop/MapReduce is to read the MapReduce paper by Dean et al. Also you
> might be interested in checking out Mahout, which is a subproject of
> Lucene. They are doing ML on top of Hadoop.
>
> Hadoop is mostly suitable for batch jobs; however, these jobs can be
> chained together to form a workflow. I will try to be more helpful if
> you could expand on what you mean by workflow.
>
> Enis Soztutar
>
>> Regards
>> Arijit
>>
>> Dr. Arijit Mukherjee
>> Principal Member of Technical Staff, Level-II
>> Connectiva Systems (I) Pvt. Ltd.
>> J-2, Block GP, Sector V, Salt Lake
>> Kolkata 700 091, India
>> Phone: +91 (0)33 23577531/32 x 107 http://www.connectivasystems.com
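
The job chaining discussed in the message above, where the first job's smaller summary becomes the input to a further job, can be sketched in a few lines. This is an in-memory Python stand-in for the idea, not the Hadoop JobControl API; the `caller,seconds` CSV layout, the 300-second threshold, and all function names are hypothetical.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Run one in-memory MapReduce-style pass: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [reducer(key, values) for key, values in sorted(groups.items())]


# Job 1: CSV lines "caller,seconds" -> total seconds per caller.
def parse_call(line):
    caller, seconds = line.split(",")
    yield caller, int(seconds)


def total_seconds(caller, seconds):
    return caller, sum(seconds)


# Job 2: per-caller totals -> number of callers in each usage bucket.
def bucket(total):
    caller, seconds = total
    yield ("heavy" if seconds >= 300 else "light"), 1


def count(bucket_name, ones):
    return bucket_name, sum(ones)


def run_chain(lines):
    """Feed the output of job 1 straight into job 2."""
    per_caller = run_job(lines, parse_call, total_seconds)
    return run_job(per_caller, bucket, count)


if __name__ == "__main__":
    calls = ["a,120", "b,30", "a,200", "c,400", "b,15"]
    print(run_chain(calls))
```

On a real cluster each stage would write its output to the distributed file system and the next job would read it from there; the sketch only shows the data dependency between the two stages.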
RE: Questions about Hadoop
Thanx Enis.

By workflow, I meant something like a chain of MapReduce
jobs - the first one will extract a certain amount of data from the
original set and do some computation resulting in a smaller summary,
which will then be the input to a further MR job, and so on...somewhat
similar to a workflow as in the SOA world.

Is it possible to use statistical analysis tools such as R (or say
PL/R) within MapReduce on Hadoop? As far as I've heard, Greenplum is
working on a custom MapReduce engine over their Greenplum database
which will also support PL/R procedures.

Arijit

Dr. Arijit Mukherjee
Principal Member of Technical Staff, Level-II
Connectiva Systems (I) Pvt. Ltd.
J-2, Block GP, Sector V, Salt Lake
Kolkata 700 091, India
Phone: +91 (0)33 23577531/32 x 107 http://www.connectivasystems.com

-Original Message-
From: Enis Soztutar [mailto:[EMAIL PROTECTED]
Sent: Wednesday, September 24, 2008 2:57 PM
To: core-user@hadoop.apache.org
Subject: Re: Questions about Hadoop

Hi,

Arijit Mukherjee wrote:
> Hi
>
> We've been thinking of using Hadoop for a decision making system which
> will analyze telecom-related data from various sources to take certain
> decisions. The data can be huge, of the order of terabytes, and can be
> stored as CSV files, which I understand will fit into Hadoop as Tom
> White mentions in the Rough Cut Guide that Hadoop is well suited for
> records. The question I want to ask is whether it is possible to
> perform statistical analysis on the data using Hadoop and MapReduce.
> If anyone has done such a thing, we'd be very interested to know about
> it. Is it also possible to create a workflow-like functionality with
> MapReduce?

Hadoop can handle TB data sizes, and statistical data analysis is one of
the things that fit the MapReduce computation model well. You can
check what people are doing with Hadoop at
http://wiki.apache.org/hadoop/PoweredBy.
I think the best way to see if your requirements can be met by
Hadoop/MapReduce is to read the MapReduce paper by Dean et al. Also you
might be interested in checking out Mahout, which is a subproject of
Lucene. They are doing ML on top of Hadoop.

Hadoop is mostly suitable for batch jobs; however, these jobs can be
chained together to form a workflow. I will try to be more helpful if
you could expand on what you mean by workflow.

Enis Soztutar

> Regards
> Arijit
>
> Dr. Arijit Mukherjee
> Principal Member of Technical Staff, Level-II
> Connectiva Systems (I) Pvt. Ltd.
> J-2, Block GP, Sector V, Salt Lake
> Kolkata 700 091, India
> Phone: +91 (0)33 23577531/32 x 107 http://www.connectivasystems.com
Re: Questions about Hadoop
Hi,

Arijit Mukherjee wrote:
> Hi
>
> We've been thinking of using Hadoop for a decision making system which
> will analyze telecom-related data from various sources to take certain
> decisions. The data can be huge, of the order of terabytes, and can be
> stored as CSV files, which I understand will fit into Hadoop as Tom
> White mentions in the Rough Cut Guide that Hadoop is well suited for
> records. The question I want to ask is whether it is possible to
> perform statistical analysis on the data using Hadoop and MapReduce.
> If anyone has done such a thing, we'd be very interested to know about
> it. Is it also possible to create a workflow-like functionality with
> MapReduce?

Hadoop can handle TB data sizes, and statistical data analysis is one of
the things that fit the MapReduce computation model well. You can
check what people are doing with Hadoop at
http://wiki.apache.org/hadoop/PoweredBy.
I think the best way to see if your requirements can be met by
Hadoop/MapReduce is to read the MapReduce paper by Dean et al. Also you
might be interested in checking out Mahout, which is a subproject of
Lucene. They are doing ML on top of Hadoop.

Hadoop is mostly suitable for batch jobs; however, these jobs can be
chained together to form a workflow. I will try to be more helpful if
you could expand on what you mean by workflow.

Enis Soztutar

> Regards
> Arijit
>
> Dr. Arijit Mukherjee
> Principal Member of Technical Staff, Level-II
> Connectiva Systems (I) Pvt. Ltd.
> J-2, Block GP, Sector V, Salt Lake
> Kolkata 700 091, India
> Phone: +91 (0)33 23577531/32 x 107 http://www.connectivasystems.com