Hi there. I'm in the process of writing a paper, and as part of it I aim to write (yet another) comparative study of various interfaces to Hadoop.
This will almost certainly include Pig and Hive, probably MapReduce, and maybe JAQL. I have read the papers published on the Hive JIRA (Pig vs Hive vs MapReduce for 2 queries, an aggregation and a join). I would, however, like to hear from the Hive community on a few points.

1. Do you guys (the Hive developers) have a standardized benchmarking tool that you run prior to each Hive release? I am thinking of something similar to PigMix, used by the Pig developers. In case you don't know, PigMix is a set of 12 designed queries, implemented in both Pig and Java Hadoop, and the two are compared on execution time. Does the Hive community have something similar?

2. The Pig wiki points out some unique features of Pig that allow optimal execution performance. For instance, they have a method to optimize queries on skewed data (by taking samples of the data for reduce key allocation). Is there something about the implementation of Hive that gives it functionality not found in other interfaces? And better still, is there some Hive implementation that could work as a proof of concept to show any optimized features of Hive?

3. One area suggested for investigation within the Pig development team is to create a SQL-like language that could be compiled down through Pig to MR jobs. If such a project were to achieve parity with Hive's SQL-like interface, where would the distinction be between Pig and Hive? Certainly, from a user's perspective, there would be very little difference. If the only difference turns out to be the execution performance achieved by one interface over the other, where would this put the inferior interface (be that Pig or Hive) in terms of its relevance in the Hadoop software stack?

Many thanks,
Rob Stewart