A friend and I were discussing Pig vs. Hive in general yesterday. On the surface, Hive is an SQL-like language, while Pig has its own language, Pig Latin. However, I think they both end up doing column projections, joins, etc. In the end it is a similar operation happening on the same cluster, so I expect performance will eventually be similar. That said, Pig offering more SQL support is a large undertaking.
While Pig looks very versatile, I recently emulated the example on Cloudera's blog for GeoIP-locating traffic in Pig. I did this in Hive with an external Perl script using MAP/TRANSFORM. (It did not take a page-long Pig program.) I also think the Hive UDF framework can be used in place of many Piggybank functions. Also, unless I am missing something, a Hive UDF is native Java. It seems like Piggybank functions are going to be piping/streaming output, and I can't see that performing better.

To backtrack: if Pig adds SQL, will we need Hive? If Hive adds something like T-SQL, will we need Pig?

On 11/7/09, Rob Stewart <robstewar...@googlemail.com> wrote:
> Hi there. I'm in the process of writing a paper, and part of it I aim to
> write (yet another) comparative study on various interfaces with Hadoop.
>
> This will almost certainly include Pig and Hive, probably MapReduce, and
> maybe JAQL.
>
> I have read the papers published on the Hive JIRA (Pig vs Hive vs MapReduce
> for 2 queries, an aggregation, and a join). I am, however, wanting to know a
> bit from the Hive community.
>
> 1. Do you guys (the Hive developers) have a standardized benchmarking tool
> to use prior to each Hive release? I am thinking of something similar to
> PigMix, used by the Pig developers. In case you don't know, PigMix is a set
> of 12 designed queries, implemented in Pig and Java Hadoop, and comparisons
> are made on execution time. Does the Hive community have something similar?
>
> 2. The Pig wiki points out some unique features of Pig that allow optimal
> execution performance. For instance, they have a method to optimize queries
> on skewed data (by taking samples of the data for reduce key allocations). Is
> there something about the implementation of Hive that gives it some
> functionality not found in other interfaces? And better still, would there
> be some Hive implementation that could work as a proof of concept to show any
> optimized features of Hive?
>
> 3. One section suggested for investigation within the Pig development team
> is to create a SQL-like language that could be compiled down through Pig to
> MR jobs. If such a project was to achieve parity with Hive's SQL-like
> interface, where would the distinction be between Pig and Hive?
> Certainly, from a user's perspective, there would be very little difference.
> If the only difference turns out to be the execution performance achieved by
> one interface over another, where would this put the inferior interface (be
> that either Pig or Hive) in terms of its relevance in the Hadoop software
> stack?
>
> Many thanks,
>
> Rob Stewart
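
For anyone curious what the MAP/TRANSFORM approach mentioned at the top looks like, here is a rough sketch. My script was Perl; this is a Python stand-in, and the table name (weblogs), the script name (geoip_transform.py), and the tiny lookup table are all made up for illustration. A real job would load an actual GeoIP database instead of the hard-coded ranges.

```python
#!/usr/bin/env python
# Sketch of a streaming script for Hive's TRANSFORM clause.
# Hive feeds rows on stdin as tab-separated text and reads rows back on stdout.
#
# Hypothetical Hive side (table and file names are assumptions):
#   ADD FILE geoip_transform.py;
#   SELECT TRANSFORM (ip) USING 'python geoip_transform.py' AS (ip, country)
#   FROM weblogs;
import ipaddress
import sys

# Hypothetical lookup table standing in for a real GeoIP database.
RANGES = [
    (ipaddress.ip_network("10.0.0.0/8"), "PRIVATE"),
    (ipaddress.ip_network("192.0.2.0/24"), "TEST-NET"),
]


def locate(ip_text):
    """Return the label of the first range containing ip_text, else UNKNOWN."""
    try:
        addr = ipaddress.ip_address(ip_text)
    except ValueError:
        # Malformed input should not kill the whole job.
        return "UNKNOWN"
    for network, label in RANGES:
        if addr in network:
            return label
    return "UNKNOWN"


def transform(lines):
    """Map each input row (first column = IP) to 'ip<TAB>label'."""
    for line in lines:
        ip_text = line.rstrip("\n").split("\t")[0]
        yield ip_text + "\t" + locate(ip_text)


if __name__ == "__main__":
    for row in transform(sys.stdin):
        print(row)
```

The point is only that the per-row logic lives in a small external script and the query itself stays a one-liner.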