Re: Hive Performance

Edward Capriolo Sat, 07 Nov 2009 11:19:46 -0800

A friend and I were discussing Pig vs. Hive in general yesterday. On
the surface, Hive is an SQL-like language, while Pig has its own language,
'Pig Latin'. However, I think they both end up doing column
projections, joins, etc. In the end it is a similar operation happening
on the same cluster, so I expect performance will
eventually be similar. That said, Pig offering more SQL support is a large
undertaking.


While Pig looks very versatile, I recently emulated the example on
Cloudera's blog for GeoIP-locating traffic in Pig. I did this in Hive
with an external perl script using map/transform. (It did not take a
page-long Pig program.) I also think the Hive UDF framework can be used
in place of many piggybank functions. Also, unless I am missing
something, a UDF is native Java. It seems like piggybank functions are
going to be piping/streaming output, and I can't see that performing
better.
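
For illustration, a streaming script of the kind described above might
look roughly like the sketch below. It is written in Python rather than
perl, and the lookup table, the column layout (IP address in the first
tab-separated field), and the names GEO_TABLE, lookup_country and
transform are all assumptions made for the example, not the script that
was actually used. Hive passes rows to such a script on stdin as
tab-separated text and reads rows back from stdout.

```python
import ipaddress
import sys

# Hypothetical lookup table of (network, country code) pairs.
# A real job would load a proper GeoIP database instead.
GEO_TABLE = [
    (ipaddress.ip_network("192.0.2.0/24"), "US"),
    (ipaddress.ip_network("198.51.100.0/24"), "DE"),
]


def lookup_country(ip_text):
    """Return the country code for an IP string, or UNKNOWN."""
    try:
        addr = ipaddress.ip_address(ip_text)
    except ValueError:
        return "UNKNOWN"
    for network, country in GEO_TABLE:
        if addr in network:
            return country
    return "UNKNOWN"


def transform(lines):
    """Append a country column to each tab-separated input row.

    Assumes the IP address is the first field of every row.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        yield "\t".join(fields + [lookup_country(fields[0])])


if __name__ == "__main__":
    # Hive would invoke this with something along the lines of:
    #   SELECT TRANSFORM(ip, url) USING 'python geoip.py'
    #   AS (ip, url, country) FROM weblogs;
    # (the table and column names here are made up for the example)
    for row in transform(sys.stdin):
        print(row)
```

The point of the sketch is only that the per-row logic lives in a small
external program, while the projection and any later grouping stay in
the Hive query itself.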

To backtrack: if Pig adds SQL, will we need Hive? If Hive adds
something like T-SQL, will we need Pig?

On 11/7/09, Rob Stewart <robstewar...@googlemail.com> wrote:
> Hi there. I'm in the process of writing a paper, and part of it I aim to
> write (yet another) comparative study on various interfaces with Hadoop.
>
> This will almost certainly include Pig and Hive, probably MapReduce, and
> maybe JAQL.
>
> I have read the papers published on the Hive JIRA (pig vs hive vs MapReduce
> for 2 queries, an aggregation, and a join). I am, however, wanting to know a
> bit from the Hive community.
>
> 1. Do you guys (the Hive developers) have a standardized benchmarking tool
> to use prior to each Hive release? I am thinking of something similar to
> PigMix, used by the Pig developers. In case you don't know, PigMix is a set
> of 12 designed queries, implemented in Pig and Java Hadoop, and comparisons
> are made on execution time. Does the Hive community have something similar?
>
> 2. The Pig wiki points out some unique features of Pig that allow optimal
> execution performance. For instance, they have methods to optimize queries
> on skewed data (by taking samples of the data for reduce key allocations). Is
> there something about the implementation of Hive that gives it some
> functionality not found in other interfaces? And better still, would there
> be some Hive implementation that could work as a proof of concept to show any
> optimized features of Hive?
>
> 3. One section suggested for investigation within the Pig development team
> is to create a SQL-like language that could be compiled down through Pig to
> MR jobs. If such a project were to achieve parity with Hive's SQL-like
> interface, where would the distinction be between Pig and Hive?
> Certainly, from a user's perspective, there would be very little difference.
> If the only difference turns out to be the execution performance achieved by
> one interface over another, where would this put the inferior interface (be
> that either Pig or Hive) in terms of its relevance in the Hadoop software
> stack?
>
>
> Many thanks,
>
>
> Rob Stewart
>
