Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-06 Thread Russell Jurney
Rules of thumb IMO:

You should be using Pig in place of MR jobs whenever performance isn't
absolutely crucial.  Writing unnecessary MR is needless technical debt that
you will regret as people are replaced and your organization scales.  Pig
gets it done in much less time.  If you need faster jobs, then optimize your
Pig, and if that doesn't work, put a single MAPREDUCE
(http://pig.apache.org/docs/r0.9.2/basic.html#mapreduce) job at the
bottleneck.  Also, realize that it can be hard to actually beat Pig's
performance without experience.  Check that your MR job is actually faster
than Pig at the same load before assuming you can do better than Pig.
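
One way to act on the "check before assuming" advice is to time both jobs on the same input. The following is only a minimal sketch in Python, assuming both jobs can be launched from the command line; the function names and the commands in the usage note are placeholders, not something taken from this thread.

```python
import subprocess
import time


def time_command(argv, runs=1):
    """Run a command `runs` times and return the best wall-clock time in seconds."""
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        # check=True raises CalledProcessError if the job fails,
        # so a failed run is never counted as a "fast" one.
        subprocess.run(argv, check=True)
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare(pig_argv, mr_argv, runs=1):
    """Time both jobs on the same load and report which one was faster."""
    pig_s = time_command(pig_argv, runs)
    mr_s = time_command(mr_argv, runs)
    faster = "pig" if pig_s <= mr_s else "mr"
    return {"pig_s": pig_s, "mr_s": mr_s, "faster": faster}
```

A call might look like `compare(["pig", "-f", "report.pig"], ["hadoop", "jar", "report-mr.jar", "in", "out"])`, where the script, jar, and paths are hypothetical. With `runs` greater than 1, each run needs a fresh output location, since Hadoop jobs normally refuse to overwrite an existing output directory.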

Streaming is good if your data doesn't easily map to tuples, you really
like using the abstractions of your favorite language's MR library, or you
are doing something weird like simulations/pure batch jobs (no MR).

If you're doing a lot of joins and performance is a problem, consider
doing fewer joins.  I would strongly suggest that you prioritize
de-normalizing and duplicating data over switching to raw MR jobs just
because Hive joins are slow.  MapReduce itself is slow at joins.  Programmer
time is more valuable than machine time.  If you're having to write tons of
raw MR, then get more machines.

On Fri, Mar 2, 2012 at 6:21 AM, Subir S subir.sasiku...@gmail.com wrote:

 On Fri, Mar 2, 2012 at 12:38 PM, Harsh J ha...@cloudera.com wrote:

  On Fri, Mar 2, 2012 at 10:18 AM, Subir S subir.sasiku...@gmail.com
  wrote:
   Hello Folks,
  
   Are there any pointers to such comparisons between Apache Pig and
   Hadoop Streaming Map Reduce jobs?
 
  I do not see why you seek to compare these two. Pig offers a language
  that lets you write data-flow operations and runs these statements as
  a series of MR jobs for you automatically (Making it a great tool to
  use to get data processing done really quick, without bothering with
  code), while streaming is something you use to write non-Java, simple
  MR jobs. Both have their own purposes.
 

 Basically we are comparing these two to see the benefits and how much they
 help in improving the productive coding time, without jeopardizing the
 performance of MR jobs.


   Also, there was a claim in our company that Pig performs better than Map
   Reduce jobs. Is this true? Are there any such benchmarks available?
 
  Pig _runs_ MR jobs. It does do job design (and some data)
  optimizations based on your queries, which is what may give it an edge
  over designing elaborate flows of plain MR jobs with tools like
  Oozie/JobControl (which takes more time to do). But regardless, Pig
  just makes it easier for you to do the same thing with Pig Latin
  statements.
 

 I knew that Pig runs MR jobs, as Hive does. But Hive jobs become pretty
 slow with a lot of joins, which we can make faster by writing raw MR jobs.
 So, in that context, I was trying to see how Pig runs MR jobs, and what
 kind of projects should consider Pig, say, when there are a lot of joins
 that take time to write as plain MR jobs. Thoughts?

 Thank you Harsh for your comments. They are helpful!


 
  --
  Harsh J
 




-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com


Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-05 Thread Russell Jurney
Streaming is good for simulation: long-running, map-only processes where Pig
doesn't really help and it is simple to fire off a streaming process.  You do
have to set some options so that tasks can take a long time to return, or so
that they return counters.
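
As an illustration of the map-only simulation case, here is a minimal sketch of a streaming mapper in Python. It is built on assumptions: the `simulate` function is a hypothetical stand-in for real simulation code, each input line is assumed to hold one integer seed, and the counter group and name are made up. The `reporter:counter:` and `reporter:status:` lines written to stderr are the mechanism Hadoop Streaming provides for updating counters and status from a script.

```python
import random


def simulate(seed, steps=1000):
    """Hypothetical stand-in for a long-running simulation: a seeded 1-D random walk."""
    rng = random.Random(seed)
    position = 0
    for _ in range(steps):
        position += rng.choice((-1, 1))
    return position


def run_mapper(lines, out, err, steps=1000):
    """Map-only task: run one simulation per input line (each line holds a seed)."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        seed = int(line)
        result = simulate(seed, steps)
        out.write("%d\t%d\n" % (seed, result))
        # Hadoop Streaming parses these stderr lines as counter and status
        # updates; the group and counter names here are made up.
        err.write("reporter:counter:Simulation,RunsCompleted,1\n")
        err.write("reporter:status:finished seed %d\n" % seed)
        err.flush()
```

In an actual streaming script the last line would call `run_mapper(sys.stdin, sys.stdout, sys.stderr)` after an `import sys`. The options alluded to above are presumably along the lines of `-D mapred.reduce.tasks=0` (map-only) and a larger `-D mapred.task.timeout` (in milliseconds); those property names come from Hadoop releases of that era and are an assumption about what was meant.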

Russell Jurney http://datasyndrome.com

On Mar 5, 2012, at 12:38 PM, Eli Finkelshteyn iefin...@gmail.com wrote:

 I'm really interested in this as well. I have trouble seeing a really good
 use case for streaming map-reduce. Is there something I can do in streaming
 that I can't do in Pig? If I want to re-use previously written Python
 functions from my code base, I can do that in Pig as much as in streaming.
 From what I've experienced thus far, Python streaming runs slower than or at
 the same speed as Pig. So why would I want to write a whole lot of
 more-difficult-to-read mappers and reducers when I can write shorter,
 clearer, equally fast code in Pig? Maybe it's obvious, but currently I just
 can't think of the right use case.
 
 Eli
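
To make the point about re-using Python functions in Pig concrete, here is a minimal sketch of a Jython UDF module. It is an assumption-laden illustration: `normalize_host` stands in for an existing helper from a hypothetical code base, and the file name used in the note below is made up. The `outputSchema` decorator is supplied by Pig when the file is registered as a Jython UDF; the fallback definition only exists so the module also imports under plain Python.

```python
try:
    outputSchema  # supplied by Pig when the file is registered as a Jython UDF
except NameError:
    def outputSchema(schema):
        """Fallback so the module also imports (and can be tested) outside Pig."""
        def decorator(func):
            func.outputSchema = schema
            return func
        return decorator


def normalize_host(url):
    """Hypothetical existing helper: extract the lowercase host from a URL."""
    if url is None:
        return None
    host = url.split("://", 1)[-1].split("/", 1)[0]
    return host.lower()


@outputSchema("host:chararray")
def host_of(url):
    """Thin UDF wrapper so Pig can call the existing helper unchanged."""
    return normalize_host(url)
```

Saved as, say, `udfs.py`, this would be registered in a Pig script with `REGISTER 'udfs.py' USING jython AS udfs;` and called as `udfs.host_of(url)` inside a `FOREACH ... GENERATE`.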
 


Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-02 Thread Subir S
Thank you Jie!

I have downloaded the Pig Experience paper and will read it.

On Fri, Mar 2, 2012 at 12:36 PM, Jie Li ji...@cs.duke.edu wrote:

 Since Pig essentially translates scripts into Map Reduce jobs, one can
 always write Map Reduce jobs as good as the ones Pig generates. You can
 refer to the Pig Experience paper to see the overhead Pig introduces,
 though it keeps being improved.

 Btw if you really care about the performance, how you configure Hadoop and
 Pig can also play an important role.

 Thanks,
 Jie
 --
 Starfish is an intelligent performance tuning tool for Hadoop.
 Homepage: www.cs.duke.edu/starfish/
 Mailing list: http://groups.google.com/group/hadoop-starfish

 On Thu, Mar 1, 2012 at 11:48 PM, Subir S subir.sasiku...@gmail.com
 wrote:

  Hello Folks,
 
  Are there any pointers to such comparisons between Apache Pig and Hadoop
  Streaming Map Reduce jobs?
 
  Also, there was a claim in our company that Pig performs better than Map
  Reduce jobs. Is this true? Are there any such benchmarks available?
 
  Thanks, Subir
 



Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-01 Thread Subir S
Hello Folks,

Are there any pointers to such comparisons between Apache Pig and Hadoop
Streaming Map Reduce jobs?

Also, there was a claim in our company that Pig performs better than Map
Reduce jobs. Is this true? Are there any such benchmarks available?

Thanks, Subir


Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-01 Thread Jie Li
Since Pig essentially translates scripts into Map Reduce jobs, one can
always write Map Reduce jobs as good as the ones Pig generates. You can
refer to the Pig Experience paper to see the overhead Pig introduces,
though it keeps being improved.

Btw if you really care about the performance, how you configure Hadoop and
Pig can also play an important role.

Thanks,
Jie
--
Starfish is an intelligent performance tuning tool for Hadoop.
Homepage: www.cs.duke.edu/starfish/
Mailing list: http://groups.google.com/group/hadoop-starfish

On Thu, Mar 1, 2012 at 11:48 PM, Subir S subir.sasiku...@gmail.com wrote:

 Hello Folks,

 Are there any pointers to such comparisons between Apache Pig and Hadoop
 Streaming Map Reduce jobs?

 Also, there was a claim in our company that Pig performs better than Map
 Reduce jobs. Is this true? Are there any such benchmarks available?

 Thanks, Subir



Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

2012-03-01 Thread Harsh J
On Fri, Mar 2, 2012 at 10:18 AM, Subir S subir.sasiku...@gmail.com wrote:
 Hello Folks,

 Are there any pointers to such comparisons between Apache Pig and Hadoop
 Streaming Map Reduce jobs?

I do not see why you seek to compare these two. Pig offers a language
that lets you write data-flow operations and runs these statements as
a series of MR jobs for you automatically (Making it a great tool to
use to get data processing done really quick, without bothering with
code), while streaming is something you use to write non-Java, simple
MR jobs. Both have their own purposes.

 Also, there was a claim in our company that Pig performs better than Map
 Reduce jobs. Is this true? Are there any such benchmarks available?

Pig _runs_ MR jobs. It does do job design (and some data)
optimizations based on your queries, which is what may give it an edge
over designing elaborate flows of plain MR jobs with tools like
Oozie/JobControl (which takes more time to do). But regardless, Pig
just makes it easier for you to do the same thing with Pig Latin
statements.
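
For a sense of what a "non-Java, simple MR job" looks like under streaming, here is a minimal word-count sketch in Python. It is an illustration rather than anything posted in this thread; the function names are made up, and it relies on the streaming behavior that the reducer receives its input lines sorted by key.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit 'word<TAB>1' for every whitespace-separated word."""
    for line in lines:
        for word in line.split():
            yield "%s\t1" % word


def reduce_lines(sorted_lines):
    """Reducer: sum the counts per key; input lines must arrive sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield "%s\t%d" % (word, sum(int(count) for _, count in group))
```

In practice each function would live in its own script that reads `sys.stdin` and prints to stdout, and the two scripts would be passed to the streaming jar through its `-mapper` and `-reducer` options. The same aggregation in Pig is a GROUP followed by a COUNT, which is the productivity difference being discussed here.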

-- 
Harsh J