RE: Hive Performance

Ashish Thusoo Mon, 09 Nov 2009 11:54:13 -0800

There are a bunch of optimizations that deal with skewed data in Hive as well. 
The optimizer is rule based and the user has to hint the query - similar to 
what is done in RDBMS. We have mostly done our performance work on the 
benchmark published in the SIGMOD paper.
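
To make the skew point concrete, here is a minimal Python sketch of one common way to handle a skewed group-by: spread rows randomly over reducers for partial aggregation, then merge the partials by key. This is an illustration only, not Hive code; the function name, reducer count, and seed are made up for the example.

```python
import random
from collections import Counter

def two_stage_count(keys, num_reducers=4, seed=0):
    """Count rows per key in two stages so a hot key does not land on one reducer.

    Hypothetical helper for illustration; it only simulates the idea in memory.
    """
    rng = random.Random(seed)
    # Stage 1: send each row to a random reducer, ignoring its key,
    # and let every reducer build partial counts.
    partials = [Counter() for _ in range(num_reducers)]
    for key in keys:
        partials[rng.randrange(num_reducers)][key] += 1
    # Stage 2: merge the (small) partial counts by key.
    totals = Counter()
    for partial in partials:
        totals.update(partial)
    # Also return the number of rows each stage-1 reducer handled.
    loads = [sum(p.values()) for p in partials]
    return totals, loads
```

With a plain hash partition on the key, every row for the hot key would go to a single reducer; here the stage-1 load is spread across all of them, at the cost of a second pass.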

Ashish

-----Original Message-----
From: Edward Capriolo [mailto:edlinuxg...@gmail.com] 
Sent: Saturday, November 07, 2009 11:19 AM
To: hive-dev@hadoop.apache.org
Subject: Re: Hive Performance

A friend and I were discussing Pig vs. Hive in general yesterday. On the surface, 
Hive is an SQL-like language; Pig is its own language, 'Pig Latin'. However, in the 
end I think they both end up doing column projections, joins, etc. In the end it 
is a similar operation happening on the same cluster. So performance-wise, I 
expect the two will eventually be similar. Now, Pig offering more SQL 
support is a large undertaking.

While Pig looks very versatile, I recently emulated the example on Cloudera's 
blog for GeoIP-locating traffic in Pig. I did this in Hive with an external 
Perl script using map/transform. (It did not take a page-long Pig program.) I 
also think the Hive UDF framework can be used in place of many Piggybank 
functions. Also, unless I am missing something, a UDF is native Java. It seems like 
Piggybank functions are going to be piping/streaming output; I can't see that 
performing better.
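
For anyone who has not used map/transform, the external script is just a filter over tab-separated rows: rows go in on stdin, rows come out on stdout. Below is a rough Python sketch of that shape (my actual script was Perl and used a real GeoIP database; the lookup table and function names here are toy stand-ins invented for the example).

```python
import sys

# Toy stand-in for a real GeoIP database: IP prefix -> label (hypothetical data).
TOY_GEO = {"10.": "LAN", "192.0.2.": "TEST"}

def lookup_country(ip, table=TOY_GEO):
    """Return the label of the first matching prefix, or UNKNOWN."""
    for prefix, country in table.items():
        if ip.startswith(prefix):
            return country
    return "UNKNOWN"

def transform(lines):
    """Yield one tab-separated output row (ip, country) per input row.

    The IP is assumed to be the first tab-separated field of each row.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        ip = fields[0]
        yield "\t".join([ip, lookup_country(ip)])

def main():
    # When run as a streaming script: read rows from stdin, write rows to stdout.
    for row in transform(sys.stdin):
        sys.stdout.write(row + "\n")
```

To use something like this as the transform script, call main() at the bottom of the file and point the TRANSFORM ... USING clause of the query at it.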

To backtrack: if Pig adds SQL, will we need Hive? If Hive adds something like 
T-SQL, will we need Pig?

On 11/7/09, Rob Stewart <robstewar...@googlemail.com> wrote:
> Hi there. I'm in the process of writing a paper, and as part of it I aim 
> to write (yet another) comparative study of various interfaces to Hadoop.
>
> This will almost certainly include Pig and Hive, probably MapReduce, 
> and maybe JAQL.
>
> I have read the papers published on the Hive JIRA (pig vs hive vs 
> MapReduce for 2 queries, an aggregation, and a join). I am, however, 
> wanting to know a bit from the Hive community.
>
> 1. Do you guys (the Hive developers) have a standardized benchmarking 
> tool to use prior to each Hive release? I am thinking of something 
> similar to PigMix, used by the Pig developers. In case you don't know, 
> PigMix is a set of 12 designed queries, implemented in Pig and Java 
> Hadoop, and comparisons are made on execution time. Does the Hive community 
> have something similar?
>
> 2. The Pig wiki points out some unique features of Pig that allow 
> optimal execution performance. For instance, they have methods to 
> optimize queries on skewed data (by taking samples of the data for 
> reduce key allocations). Is there something about the implementation of 
> Hive that gives it some functionality not found in other interfaces? 
> And better still, would there be some Hive implementation that could work 
> as a proof of concept to show any optimized features of Hive?
>
> 3. One section suggested for investigation within the Pig development 
> team is to create a SQL-like language that could be compiled down 
> through Pig to MR jobs. If such a project were to achieve parity with 
> Hive's SQL-like interface, where would the distinction be between Pig and 
> Hive?
> Certainly, from a user's perspective, there would be very little difference.
> If the only difference turns out to be the execution performance 
> achieved by one interface over another, where would this put the 
> inferior interface (be that either Pig or Hive) in terms of its 
> relevance in the Hadoop software stack?
>
>
> Many thanks,
>
>
> Rob Stewart
>
