There are a bunch of optimizations that deal with skewed data in Hive as well. The optimizer is rule-based, and the user has to hint the query, similar to what is done in an RDBMS. We have mostly done our performance work on the benchmark published in the SIGMOD paper.

Ashish

-----Original Message-----
From: Edward Capriolo [mailto:edlinuxg...@gmail.com]
Sent: Saturday, November 07, 2009 11:19 AM
To: hive-dev@hadoop.apache.org
Subject: Re: Hive Performance

A friend and I were discussing Pig vs. Hive in general yesterday. On the surface, Hive is an SQL-like language; Pig is its own language, 'Pig Latin'. However, in the end I think they both end up doing column projections, joins, etc. It is a similar operation happening on the same cluster, so performance-wise I expect they will eventually be similar.

Now, Pig offering more SQL support is a large undertaking. While Pig looks very versatile, I recently emulated the example on Cloudera's blog for GeoIP-locating traffic in Pig. I did this in Hive with an external Perl script using map/transform. (It did not take a page-long Pig program.)

I also think the Hive UDF framework can be used in place of many Piggybank functions. Also, unless I am missing something, a UDF is native Java. It seems like Piggybank functions are going to be piping/streaming output; I can't see that performing better.

To backtrack: if Pig adds SQL, will we need Hive? If Hive adds something like TSQL, will we need Pig?

On 11/7/09, Rob Stewart <robstewar...@googlemail.com> wrote:
> Hi there. I'm in the process of writing a paper, and as part of it I aim
> to write (yet another) comparative study on various interfaces with Hadoop.
>
> This will almost certainly include Pig and Hive, probably MapReduce,
> and maybe JAQL.
>
> I have read the papers published on the Hive JIRA (Pig vs Hive vs
> MapReduce for 2 queries, an aggregation, and a join). I am, however,
> wanting to know a bit from the Hive community.
>
> 1. Do you guys (the Hive developers) have a standardized benchmarking
> tool to use prior to each Hive release? I am thinking of something
> similar to PigMix, used by the Pig developers.
> In case you don't know, PigMix is a set of 12 designed queries,
> implemented in Pig and Java Hadoop, and comparisons are made on
> execution time. Does the Hive community have something similar?
>
> 2. The Pig wiki points out some unique features of Pig that allow
> optimal execution performance. For instance, they have methods to
> optimize queries on skewed data (by taking samples of the data for
> reduce key allocations). Is there something about the implementation of
> Hive that gives it some functionality not found in other interfaces?
> And better still, would there be some Hive implementation that could work
> as a proof of concept to show any optimized features of Hive?
>
> 3. One section suggested for investigation within the Pig development
> team is to create a SQL-like language that could be compiled down
> through Pig to MR jobs. If such a project were to achieve parity with
> Hive's SQL-like interface, where would the distinction be between Pig and
> Hive?
> Certainly, from a user's perspective, there would be very little difference.
> If the only difference turns out to be the execution performance
> achieved by one interface over another, where would this put the
> inferior interface (be that either Pig or Hive) in terms of its
> relevance in the Hadoop software stack?
>
>
> Many thanks,
>
>
> Rob Stewart
>