RE: PIG and Hive

Ashish Thusoo Thu, 07 May 2009 11:19:16 -0700

Ok that explains a lot of that. When we started off Hive our immediate use case 
was to do group bys on data with a lot of skew on the grouping keys. In that 
scenario it is better to do this in 2 map/reduce jobs, using the first one to 
randomly distribute the data and generate the partial sums, followed by another 
one that does the complete sums. This was originally the default plan in Hive. 
Since then we have moved the default to just using a single map/reduce job, with 
hive.exec.skeweddata = true as a parameter to trigger the older behavior. 
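
The two-job plan described above can be sketched roughly as follows. This is only an in-memory Python illustration, not Hive's actual implementation; the function name two_stage_count and the num_reducers and seed parameters are made up for the example.

```python
import random
from collections import Counter, defaultdict

def two_stage_count(keys, num_reducers=4, seed=0):
    """Count rows per key in two passes, mimicking the skew-tolerant plan."""
    rng = random.Random(seed)

    # Job 1: send each row to a random reducer, so that a hot key is spread
    # over all reducers; each reducer keeps partial counts per key.
    partitions = defaultdict(Counter)
    for key in keys:
        partitions[rng.randrange(num_reducers)][key] += 1

    # Job 2: combine the (much smaller) partial counts by key into totals.
    totals = Counter()
    for partial in partitions.values():
        totals.update(partial)
    return dict(totals)

# A heavily skewed input: one key dominates.
rows = ["hot"] * 1000 + ["cold"] * 3
print(two_stage_count(rows))
```

With a single job partitioned by key, all 1000 "hot" rows would land on one reducer; here the first pass caps what any one reducer sees at roughly 1/num_reducers of the rows, at the price of a second job.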

We already collapse subselects. We already do predicate pushdown and column 
pruning. We don't yet do subexpression elimination but that will happen soon. 
Implicit detection of an inner join is possible though we never had a JIRA 
asking for it. Will open one soon...

I am sure you will not be disappointed by the capabilities of the system when 
you try it again. Feel free to mail hive-us...@hadoop.apache.org for any 
clarifications/help/optimization questions.

Cheers,
Ashish

-----Original Message-----
From: Scott Carey [mailto:sc...@richrelevance.com] 
Sent: Thursday, May 07, 2009 11:08 AM
To: core-user@hadoop.apache.org
Subject: Re: PIG and Hive

The work was done 3 months ago, and the exact query I used may not have been 
the below - it was functionally the same - two sources,  arithmetic aggregation 
on each inner-joined by a small set of values.  We wrote a hand-coded map 
reduce, a Pig script, and Hive against the same data and performance tested.

At that time, even "SELECT count(a.z) FROM a group by a.z" took 3 phases (not 
sure how many were fetch versus M/R).  Since then, we abandoned Hive for 
reassessment at a later date.  All releases of Hive since then 
(http://hadoop.apache.org/hive/docs/r0.3.0/changes.html) don't have anything 
under "optimizations" and few of the enhancements listed suggest that there has 
been much change on the performance front (yet).

Can Hive not yet detect an implicit inner join in a WHERE clause?

Our use case would have less optimization-savvy people querying data ad-hoc, so 
being able to detect implicit joins and collapse subselects, etc is a 
requirement.  I'm not going to go sitting over the shoulder of everyone who 
wants to do some ad-hoc data analysis and tell them how to re-write their 
queries to perform better.
That is a big weakness of SQL that affects everything that uses it - there are 
so many equivalent or near-equivalent forms of expression that often lead to 
implementation specific performance preferences.

I'm sure Hive will get over that hump but it takes time.  I'm certainly 
interested in it and will have a deeper look again in the second half of this 
year.

On 5/7/09 10:12 AM, "Namit Jain" <nj...@facebook.com> wrote:

SELECT count(a.z), count(b.z), x, y from a, b where a.x = b.x and a.y = b.y 
group by x, y.

If you do an explain on the above query, you will see that you are performing a 
Cartesian product followed by the filter.

It would be better to rewrite the query as:

SELECT count(a.z), count(b.z), a.x, a.y from a JOIN b ON( a.x = b.x and a.y = 
b.y) group by a.x, a.y;
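
To illustrate why the two forms differ in cost, here is a small Python sketch. It is an assumption-laden toy, not how Hive executes either query: cross_then_filter and equi_join are hypothetical names, and the in-memory lookup table merely stands in for grouping rows by join key.

```python
from collections import defaultdict

def cross_then_filter(a, b):
    """Implicit join planned naively: build every pair, then filter."""
    return [(ra, rb) for ra in a for rb in b
            if ra["x"] == rb["x"] and ra["y"] == rb["y"]]

def equi_join(a, b):
    """Explicit equi-join: group b by the join key, then probe once per row of a."""
    by_key = defaultdict(list)
    for rb in b:
        by_key[(rb["x"], rb["y"])].append(rb)
    return [(ra, rb) for ra in a for rb in by_key.get((ra["x"], ra["y"]), [])]

a = [{"x": 1, "y": 1, "z": 10}, {"x": 2, "y": 1, "z": 20}]
b = [{"x": 1, "y": 1, "z": 7}, {"x": 1, "y": 1, "z": 8}, {"x": 3, "y": 0, "z": 9}]

# Both forms return the same pairs; only the amount of work differs.
assert cross_then_filter(a, b) == equi_join(a, b)
```

The first function examines len(a) * len(b) pairs, while the second touches each row of b once and then only the matching rows, which is the saving the JOIN ... ON form makes available to the planner.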

The explain should have 2 map-reduce jobs and a fetch task (which is not a 
map-reduce job).
Can you send me the exact Hive query that you are trying, along with the schema 
of tables 'a' and 'b'?

In order to see the plan, you can do:

Explain
<QUERY>

Thanks,
-namit

------ Forwarded Message
From: Ricky Ho <r...@adobe.com>
Reply-To: <core-user@hadoop.apache.org>
Date: Wed, 6 May 2009 21:11:43 -0700
To: <core-user@hadoop.apache.org>
Subject: RE: PIG and Hive

Thanks for Olga's example and Scott's comment.

My goal is to pick a higher level parallel programming language (as an algorithm 
design / prototyping tool) to express my parallel algorithms in a concise way.  
The deeper I look into these, the stronger my feeling that PIG and HIVE are 
competitors rather than complements.  I think a large set of 
problems can be done either way, without much difference in terms of 
skillset requirements.

At this moment, I am focused on the richness of the language model rather than 
the implementation optimization.  Supporting "collection" as well as the 
flatten operation in the language model seems to make PIG more powerful.  Yes, 
you can achieve the same thing in Hive but then it starts to look odd.  Am I 
missing something, Hive folks ?

Rgds,
Ricky

-----Original Message-----
From: Scott Carey [mailto:sc...@richrelevance.com]
Sent: Wednesday, May 06, 2009 7:48 PM
To: core-user@hadoop.apache.org
Subject: Re: PIG and Hive

Pig currently also compiles similar operations (like the below) into many fewer 
map reduce passes and is several times faster in general.

This will change as the optimizer and available optimizations converge and in 
the future they won't differ much.  But for now, Pig optimizes much better.

I ran a test that boiled down to SQL like this:

SELECT count(a.z), count(b.z), x, y from a, b where a.x = b.x and a.y = b.y 
group by x, y.

(and equivalent, but more verbose Pig)

Pig did it in one map reduce pass in about 2 minutes and Hive did it in 5 map 
reduce passes in 10 minutes.

There is nothing keeping Hive from applying the optimizations necessary to make 
that one pass, but those sorts of performance optimizations aren't there yet.  
That is expected; it is a younger project.

It would be useful if more of these higher level tools shared work on the 
various optimizations.  Pig and Hive (and perhaps CloudBase and Cascading?) 
could benefit from a shared map-reduce compiler.

On 5/6/09 5:32 PM, "Olga Natkovich" <ol...@yahoo-inc.com> wrote:

> Hi Ricky,
>
> This is how the code will look in Pig.
>
> A = load 'textdoc' using TextLoader() as (sentence: chararray);
> B = foreach A generate flatten(TOKENIZE(sentence)) as word;
> C = group B by word;
> D = foreach C generate group, COUNT(B);
> store D into 'wordcount';
>
> Pig training (http://www.cloudera.com/hadoop-training-pig-tutorial)
> explains how the example above works.
>
> Let me know if you have further questions.
>
> Olga
>
>
>> -----Original Message-----
>> From: Ricky Ho [mailto:r...@adobe.com]
>> Sent: Wednesday, May 06, 2009 3:56 PM
>> To: core-user@hadoop.apache.org
>> Subject: RE: PIG and Hive
>>
>> Thanks Amr,
>>
>> Without knowing the details of Hive, one constraint of the SQL model is 
>> that you can never generate more than one record from a single record.  I 
>> don't know how this is done in Hive.
>> Another question is whether the Hive script can take in user-defined 
>> functions ?
>>
>> Using the following word count as an example, can you show me what 
>> the Pig script and Hive script look like ?
>>
>> Map:
>>   Input: a line (a collection of words)
>>   Output: multiple [word, 1]
>>
>> Reduce:
>>   Input: [word, [1, 1, 1, ...]]
>>   Output: [word, count]
>>
>> Rgds,
>> Ricky
>>
>> -----Original Message-----
>> From: Amr Awadallah [mailto:a...@cloudera.com]
>> Sent: Wednesday, May 06, 2009 3:14 PM
>> To: core-user@hadoop.apache.org
>> Subject: Re: PIG and Hive
>>
>>> The difference between PIG and Hive seems to be pretty insignificant.
>>
>> Difference between Pig and Hive is significant, specifically:
>>
>> (1) Pig doesn't require underlying structure to the data, Hive does 
>> imply structure via a metastore. This has its pros and cons. It allows 
>> Pig to be more suitable for ETL-type tasks where the input data is 
>> still a mish-mash and you want to convert it to be structured. On the 
>> other hand, Hive's metastore provides a dictionary that lets you 
>> easily see what columns exist in which tables, which can be very 
>> handy.
>>
>> (2) Pig is a new language, easy to learn if you know languages 
>> similar to Perl. Hive is a sub-set of SQL with very simple variations 
>> to enable map-reduce like computation. So, if you come from a SQL 
>> background you will find Hive QL extremely easy to pick up (many of 
>> your SQL queries will run as is), while if you come from a procedural 
>> programming background (w/o SQL knowledge) then Pig will be much more 
>> suitable for you. Furthermore, Hive is a bit easier to integrate with 
>> other systems and tools since it speaks the language they already 
>> speak (i.e. SQL).
>>
>> You're right that HBase is a completely different game, HBase is not 
>> about being a high level language that compiles to map-reduce, HBase 
>> is about allowing Hadoop to support lookups/transactions on key/value 
>> pairs. HBase allows you to
>> (1) do quick random lookups, versus scan all of data sequentially, 
>> (2) do insert/update/delete from middle, not just add/append.
>>
>> -- amr
>>
>> Ricky Ho wrote:
>>> Jeff,
>>>
>>> Thanks for the pointer.
>>> It is pretty clear that Hive and PIG are the same kind and HBase is a different kind.
>>> The difference between PIG and Hive seems to be pretty insignificant.  Layering a tool on top of them can completely hide their difference.
>>>
>>> I am viewing your PIG and Hive tutorial and hopefully can extract some technical details there.
>>>
>>> Rgds,
>>> Ricky
>>> -----Original Message-----
>>> From: Jeff Hammerbacher [mailto:ham...@cloudera.com]
>>> Sent: Wednesday, May 06, 2009 1:38 PM
>>> To: core-user@hadoop.apache.org
>>> Subject: Re: PIG and Hive
>>>
>>> Here's a permalink for the thread on MarkMail:
>>> http://markmail.org/thread/ee4hpcji74higqvk
>>>
>>> On Wed, May 6, 2009 at 4:55 AM, Sharad Agarwal <shara...@yahoo-inc.com> wrote:
>>>
>>>
>>>> see core-user mail thread with subject "HBase, Hive, Pig and other 
>>>> Hadoop based technologies"
>>>>
>>>> - Sharad
>>>>
>>>> Ricky Ho wrote:
>>>>
>>>>> Are they competing technologies of providing a higher level language
>>>>> for Map/Reduce programming ?
>>>>>
>>>>> Or are they complementary ?
>>>>>
>>>>> Any comparison between them ?
>>>>>
>>>>> Rgds,
>>>>> Ricky
>>>>>
>>>>
>>
>

------ End of Forwarded Message

---
h...@lists.facebook.com
http://lists.facebook.com/mailman/listinfo/hive
