Hive Performance

Rob Stewart Sat, 07 Nov 2009 10:04:17 -0800

Hi there. I'm in the process of writing a paper, and as part of it I aim to
write (yet another) comparative study of various interfaces to Hadoop.


This will almost certainly include Pig and Hive, probably MapReduce, and
maybe JAQL.

I have read the papers published on the Hive JIRA (Pig vs Hive vs MapReduce
for 2 queries: an aggregation and a join). I would, however, like to hear a
bit more from the Hive community.

1. Do you guys (the Hive developers) have a standardized benchmarking tool
to use prior to each Hive release? I am thinking of something similar to
PigMix, used by the Pig developers. In case you don't know, PigMix is a set
of 12 designed queries, implemented in Pig and Java Hadoop, and comparisons
are made on execution time. Does the Hive community have something similar?

2. The Pig wiki points out some unique features of Pig that allow optimal
execution performance. For instance, they have a method to optimize queries
on skewed data (by taking samples of the data for reduce key allocations). Is
there something about the implementation of Hive that gives it some
functionality not found in other interfaces? And better still, would there
be some Hive implementation that could work as a proof of concept to show
any optimized features of Hive?

3. One area suggested for investigation within the Pig development team
is to create a SQL-like language that could be compiled down through Pig to
MR jobs. If such a project were to achieve parity with Hive's SQL-like
interface, where would the distinction be between Pig and Hive?
Certainly, from a user's perspective, there would be very little difference.
If the only difference turns out to be the execution performance achieved by
one interface over another, where would this put the inferior interface (be
that either Pig or Hive) in terms of its relevance in the Hadoop software
stack?


Many thanks,


Rob Stewart
