Re: Comparison of Apache Pig Vs. Hadoop Streaming M/R

Russell Jurney Tue, 06 Mar 2012 12:17:45 -0800

Rules of thumb IMO:

You should use Pig in place of MR jobs whenever performance isn't
absolutely crucial.  Writing unnecessary MR is needless technical debt
that you will regret as people are replaced and your organization
scales.  Pig gets it done in much less development time.  If you need
faster jobs, first optimize your Pig, and if that doesn't work, put a single
MAPREDUCE<http://pig.apache.org/docs/r0.9.2/basic.html#mapreduce> job
at the bottleneck.  Also, realize that it can be hard to actually beat
Pig's performance without experience.  Check that your MR job is actually
faster than Pig at the same load before assuming you can do better than Pig.
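
One way to run that check is a small timing harness.  This is only a rough sketch: the function names are made up for illustration, and the two command lines you pass in are whatever launches your own Pig script and MR job.  It measures wall-clock time only and assumes both jobs read the same input on the same cluster.

```python
import subprocess
import time


def time_command(argv, runs=1):
    """Return the best wall-clock time in seconds over `runs` runs of argv.

    Keep runs=1 unless each run writes to a fresh output path, since a
    Hadoop job normally fails if its output directory already exists.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, check=True)  # raises CalledProcessError on failure
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare(pig_argv, mr_argv, runs=1):
    """Time both jobs and report which one finished faster."""
    pig_s = time_command(pig_argv, runs)
    mr_s = time_command(mr_argv, runs)
    faster = "pig" if pig_s <= mr_s else "mr"
    return {"pig_s": pig_s, "mr_s": mr_s, "faster": faster}
```

For example, `compare(["pig", "-f", "job.pig"], ["hadoop", "jar", "job.jar"])`, where `job.pig` and `job.jar` are placeholders for your own artifacts.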


Streaming is good if your data doesn't easily map to tuples, you really
like using the abstractions of your favorite language's MR library, or you
are doing something weird like simulations/pure batch jobs (no MR).
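
For reference, a Streaming job is just a pair of programs that read lines on stdin and write tab-separated key/value lines on stdout, with the reducer's input arriving sorted by key.  A minimal word-count sketch in Python follows; the single-file layout and the `map`/`reduce` argument are my own choices for illustration, not anything Streaming requires.

```python
#!/usr/bin/env python3
"""Word count for Hadoop Streaming: one file used as mapper or reducer."""
import sys
from itertools import groupby


def mapper(lines):
    """Emit one 'word<TAB>1' record per word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reducer(lines):
    """Sum the counts per word; relies on the input being sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Run as "wc.py map" or "wc.py reduce"; does nothing without an argument.
if __name__ == "__main__" and len(sys.argv) == 2:
    step = {"map": mapper, "reduce": reducer}[sys.argv[1]]
    for record in step(sys.stdin):
        print(record)
```

You can try it without a cluster with `cat input.txt | python3 wc.py map | sort | python3 wc.py reduce` (assuming the file is saved as `wc.py`); on a cluster the same two commands would be passed to the streaming jar as `-mapper` and `-reducer`.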

If you're doing a lot of joins and performance is a problem, consider
doing fewer joins.  If Hive joins are slow, I would strongly suggest that
you prioritize de-normalizing and duplicating data over switching to raw
MR jobs.  MapReduce itself is slow at joins.  Programmer time is more
valuable than machine time.  If you're having to write tons of raw MR, then
get more machines.
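
To make the join cost concrete, here is a toy, in-memory sketch of a reduce-side join next to its de-normalized equivalent.  The data and function name are invented for illustration, and this shows only one join strategy, not what Pig or Hive necessarily generate.  The point is that the join has to move every record from both inputs through the shuffle, while the de-normalized records already carry the joined field.

```python
from collections import defaultdict

# Hypothetical sample data, not from this thread.
users = [(1, "ann"), (2, "bob")]
clicks = [(1, "/home"), (2, "/cart"), (1, "/buy")]


def reduce_side_join(users, clicks):
    """Simulate a reduce-side join: tag by source, group by key, pair up."""
    shuffled = defaultdict(lambda: {"user": [], "click": []})
    for user_id, name in users:        # map phase over the users input
        shuffled[user_id]["user"].append(name)
    for user_id, url in clicks:        # map phase over the clicks input
        shuffled[user_id]["click"].append(url)
    joined = []
    for user_id in sorted(shuffled):   # reduce phase, one key at a time
        group = shuffled[user_id]
        for name in group["user"]:
            for url in group["click"]:
                joined.append((user_id, name, url))
    return joined


# De-normalized alternative: the name was copied into each click when it
# was written, so reading these rows needs no join and no shuffle.
denormalized_clicks = [
    (1, "ann", "/home"),
    (2, "bob", "/cart"),
    (1, "ann", "/buy"),
]
```

Both paths yield the same rows; the de-normalized one pays with duplicated names in storage instead of paying with a shuffle on every query.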

On Fri, Mar 2, 2012 at 6:21 AM, Subir S <subir.sasiku...@gmail.com> wrote:

> On Fri, Mar 2, 2012 at 12:38 PM, Harsh J <ha...@cloudera.com> wrote:
>
> > On Fri, Mar 2, 2012 at 10:18 AM, Subir S <subir.sasiku...@gmail.com>
> > wrote:
> > > Hello Folks,
> > >
> > > Are there any pointers to such comparisons between Apache Pig and
> > > Hadoop Streaming Map Reduce jobs?
> >
> > I do not see why you seek to compare these two. Pig offers a language
> > that lets you write data-flow operations and runs these statements as
> > a series of MR jobs for you automatically (Making it a great tool to
> > use to get data processing done really quick, without bothering with
> > code), while streaming is something you use to write non-Java, simple
> > MR jobs. Both have their own purposes.
> >
>
> Basically we are comparing these two to see the benefits and how much they
> help in improving the productive coding time, without jeopardizing the
> performance of MR jobs.
>
>
> > > Also there was a claim in our company that Pig performs better than Map
> > > Reduce jobs? Is this true? Are there any such benchmarks available
> >
> > Pig _runs_ MR jobs. It does do job design (and some data)
> > optimizations based on your queries, which is what may give it an edge
> > over designing elaborate flows of plain MR jobs with tools like
> > Oozie/JobControl (Which takes more time to do). But regardless, Pig
> > only makes it easy doing the same thing with Pig Latin statements for
> > you.
> >
>
> I knew that Pig runs MR jobs, as Hive runs MR jobs. But Hive jobs become
> pretty slow with lot of joins, which we can achieve faster with writing raw
> MR jobs. So with that context was trying to see how Pig runs MR jobs. Like
> for example what kind of projects should consider Pig. Say when we have a
> lot of Joins, which writing with plain MR jobs takes time. Thoughts?
>
> Thank you Harsh for your comments. They are helpful!
>
>
> >
> > --
> > Harsh J
> >
>



-- 
Russell Jurney twitter.com/rjurney russell.jur...@gmail.com datasyndrome.com
