Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R
Rules of thumb, IMO:

You should use Pig in place of MR jobs whenever performance isn't absolutely crucial. Writing unnecessary MR is needless technical debt that you will regret as people are replaced and your organization scales. Pig gets it done in much less time.

If you need faster jobs, optimize your Pig first, and if that doesn't work, put a single MAPREDUCE (http://pig.apache.org/docs/r0.9.2/basic.html#mapreduce) job at the bottleneck. Also, realize that it can be hard to actually beat Pig's performance without experience. Check that your MR job is actually faster than Pig at the same load before assuming you can do better than Pig.

Streaming is good if your data doesn't easily map to tuples, if you really like using the abstractions of your favorite language's MR library, or if you are doing something weird like simulations/pure batch jobs (no MR).

If you're doing a lot of joins and performance is a problem, consider doing fewer joins. I would strongly suggest that you prioritize de-normalizing and duplicating data over switching to raw MR jobs because Hive joins are slow. MapReduce is slow at joins.

Programmer time is more valuable than machine time. If you're having to write tons of raw MR, then get more machines.

On Fri, Mar 2, 2012 at 6:21 AM, Subir S subir.sasiku...@gmail.com wrote:
> On Fri, Mar 2, 2012 at 12:38 PM, Harsh J ha...@cloudera.com wrote:
>> On Fri, Mar 2, 2012 at 10:18 AM, Subir S subir.sasiku...@gmail.com wrote:
>>> Hello Folks,
>>> Are there any pointers to such comparisons between Apache Pig and Hadoop Streaming Map Reduce jobs?
>>
>> I do not see why you seek to compare these two. Pig offers a language that lets you write data-flow operations and runs these statements as a series of MR jobs for you automatically (making it a great tool for getting data processing done really quickly, without bothering with code), while streaming is something you use to write simple, non-Java MR jobs. Both have their own purposes.
>
> Basically we are comparing these two to see the benefits and how much they help in improving productive coding time, without jeopardizing the performance of MR jobs.
>
>>> Also there was a claim in our company that Pig performs better than Map Reduce jobs. Is this true? Are there any such benchmarks available?
>>
>> Pig _runs_ MR jobs. It does do job design (and some data) optimizations based on your queries, which is what may give it an edge over designing elaborate flows of plain MR jobs with tools like Oozie/JobControl (which takes more time to do). But regardless, Pig only makes it easy to do the same thing with Pig Latin statements.
>
> I knew that Pig runs MR jobs, as Hive runs MR jobs. But Hive jobs become pretty slow with a lot of joins, which we can make faster by writing raw MR jobs. So with that context I was trying to see how Pig runs MR jobs. For example, what kind of projects should consider Pig? Say, when we have a lot of joins, which take time to write as plain MR jobs. Thoughts?
>
> Thank you Harsh for your comments. They are helpful!
>
>> --
>> Harsh J

--
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
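
To make the point about joins in raw MR concrete, here is a minimal Python sketch of a reduce-side join. It is only an illustration under stated assumptions: the `users` and `orders` inputs, the "U"/"O" tags, and all function names are hypothetical, and `sorted()` stands in for Hadoop's shuffle/sort. It is not taken from anyone's job in this thread. In Pig the same operation is a single JOIN statement, which is roughly the productivity gap being discussed.

```python
from itertools import groupby, product


def tag_records(users, orders):
    """Map phase: emit (join_key, source_tag, payload) for both inputs."""
    for user_id, name in users:
        yield (user_id, "U", name)
    for user_id, item in orders:
        yield (user_id, "O", item)


def join_reducer(sorted_records):
    """Reduce phase: for each key, pair every user row with every order row."""
    for key, group in groupby(sorted_records, key=lambda record: record[0]):
        rows = list(group)
        names = [payload for _, tag, payload in rows if tag == "U"]
        items = [payload for _, tag, payload in rows if tag == "O"]
        # Inner join: keys missing from either side produce no output.
        for name, item in product(names, items):
            yield (key, name, item)


def reduce_side_join(users, orders):
    """Run map, a simulated shuffle/sort, and reduce in one local process."""
    # sorted() stands in for the shuffle/sort Hadoop performs between phases.
    return list(join_reducer(sorted(tag_records(users, orders))))
```

Every additional join means another round of tagging, shuffling, and per-key buffering like this, which is one reason de-normalizing can be cheaper than hand-writing more MR.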
Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R
Streaming is good for simulation: long-running, map-only processes, where Pig doesn't really help and it is simple to fire off a streaming process. You do have to set some options so that tasks can take a long time to return (or return counters).

Russell Jurney http://datasyndrome.com

On Mar 5, 2012, at 12:38 PM, Eli Finkelshteyn iefin...@gmail.com wrote:
> I'm really interested in this as well. I have trouble seeing a really good use case for streaming map-reduce. Is there something I can do in streaming that I can't do in Pig? If I want to re-use previously written Python functions from my code base, I can do that in Pig as much as in Streaming, and from what I've experienced thus far, Python streaming seems to run slower than or at the same speed as Pig. So why would I want to write a whole lot of more-difficult-to-read mappers and reducers when I can write equally fast, shorter, and clearer code in Pig?
>
> Maybe it's obvious, but currently I just can't think of the right use case.
>
> Eli
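
As a rough illustration of the map-only simulation case, the Python sketch below shows a streaming mapper that reports status and counters by writing `reporter:status:` and `reporter:counter:` lines to stderr, which is how Hadoop Streaming tasks pass that information back. The `simulate` function, the counter names, and the one-seed-per-line input format are hypothetical placeholders. The job options themselves depend on your Hadoop version; on Hadoop 1.x-era clusters, setting `mapred.reduce.tasks=0` and raising `mapred.task.timeout` are the kind of options meant, but treat that as an assumption to check against your own cluster.

```python
import sys


def report_status(message, stream=sys.stderr):
    """Write a status line in the form Hadoop Streaming reads from stderr."""
    stream.write(f"reporter:status:{message}\n")
    stream.flush()


def bump_counter(group, name, amount=1, stream=sys.stderr):
    """Increment a job counter through the streaming stderr protocol."""
    stream.write(f"reporter:counter:{group},{name},{amount}\n")
    stream.flush()


def simulate(seed, steps):
    """Hypothetical stand-in for a long simulation: iterate a simple recurrence."""
    state = seed
    for _ in range(steps):
        state = (1103515245 * state + 12345) % 2**31
    return state


def run_mapper(lines, out=sys.stdout, err=sys.stderr, steps=1000):
    """Map-only task: run one simulation per input line and emit seed<TAB>result."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank input lines
        seed = int(line)
        report_status(f"simulating seed {seed}", err)
        out.write(f"{seed}\t{simulate(seed, steps)}\n")
        bump_counter("Simulation", "RunsCompleted", 1, err)
```

In a real job the script would call `run_mapper(sys.stdin)`. A simulation that runs for a long time on a single input line would also need to call `report_status` periodically from inside its loop, not only once per line.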
Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R
Thank you Jie! I have downloaded the Pig Experience paper and will read it.

On Fri, Mar 2, 2012 at 12:36 PM, Jie Li ji...@cs.duke.edu wrote:
> Considering Pig essentially translates scripts into Map Reduce jobs, one can always write Map Reduce jobs as good as the ones Pig produces. You can refer to the Pig Experience paper to see the overhead Pig introduces, but it has been improving all the time.
>
> Btw, if you really care about performance, how you configure Hadoop and Pig can also play an important role.
>
> Thanks,
> Jie
Comparison of Apache Pig Vs. Hadoop Streaming M/R
Hello Folks,

Are there any pointers to comparisons between Apache Pig and Hadoop Streaming Map Reduce jobs?

Also, there was a claim in our company that Pig performs better than Map Reduce jobs. Is this true? Are there any such benchmarks available?

Thanks,
Subir
Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R
Considering Pig essentially translates scripts into Map Reduce jobs, one can always write Map Reduce jobs as good as the ones Pig produces. You can refer to the Pig Experience paper to see the overhead Pig introduces, but it has been improving all the time.

Btw, if you really care about performance, how you configure Hadoop and Pig can also play an important role.

Thanks,
Jie

--
Starfish is an intelligent performance tuning tool for Hadoop.
Homepage: www.cs.duke.edu/starfish/
Mailing list: http://groups.google.com/group/hadoop-starfish

On Thu, Mar 1, 2012 at 11:48 PM, Subir S subir.sasiku...@gmail.com wrote:
> Hello Folks,
>
> Are there any pointers to such comparisons between Apache Pig and Hadoop Streaming Map Reduce jobs?
>
> Also there was a claim in our company that Pig performs better than Map Reduce jobs. Is this true? Are there any such benchmarks available?
>
> Thanks,
> Subir
Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R
On Fri, Mar 2, 2012 at 10:18 AM, Subir S subir.sasiku...@gmail.com wrote:
> Hello Folks,
> Are there any pointers to such comparisons between Apache Pig and Hadoop Streaming Map Reduce jobs?

I do not see why you seek to compare these two. Pig offers a language that lets you write data-flow operations and runs these statements as a series of MR jobs for you automatically (making it a great tool for getting data processing done really quickly, without bothering with code), while streaming is something you use to write simple, non-Java MR jobs. Both have their own purposes.

> Also there was a claim in our company that Pig performs better than Map Reduce jobs. Is this true? Are there any such benchmarks available?

Pig _runs_ MR jobs. It does do job design (and some data) optimizations based on your queries, which is what may give it an edge over designing elaborate flows of plain MR jobs with tools like Oozie/JobControl (which takes more time to do). But regardless, Pig only makes it easy to do the same thing with Pig Latin statements.

--
Harsh J
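
For readers who have not written a streaming job, the following Python sketch shows the kind of "simple, non-Java MR job" described above, using word count as the example. It assumes the usual streaming contract of tab-separated key/value lines, with reducer input sorted by key. The function names and the local `run_local` driver are illustrative only; in a real job the mapper and reducer would be separate scripts reading stdin and writing stdout.

```python
from itertools import groupby


def mapper(lines):
    """Emit one tab-separated (word, 1) record per word, as a streaming mapper would."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum counts per word; relies on input being sorted by key, as Hadoop provides."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Simulate map, shuffle/sort, and reduce in one process, for local testing only."""
    return list(reducer(sorted(mapper(lines))))
```

Calling `run_local(["Pig runs MR", "pig runs"])` groups the lowercased words and sums their counts. Pig expresses the same data flow in a handful of Pig Latin statements and plans the underlying MR jobs itself.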