RE: PIG and Hive

Ashish Thusoo Wed, 06 May 2009 16:21:40 -0700

Ricky,

For your particular example, Hive allows you to plug in user-defined map and 
reduce scripts (in the language of your choice) within Hive QL (there are some 
minor extensions to SQL to support such a use case). So for your case you could 
do the following:

FROM (FROM lines
      MAP line USING 'map_script'  AS word, cnt
      DISTRIBUTE BY word) a
REDUCE a.word, a.cnt USING 'reduce_script';

The map_script and reduce_script contain the map and reduce logic (these can be 
simple shell scripts, Python scripts, PHP, Java - you name it).
And they CAN generate multiple records for each input record. In the RDBMS 
world there is a concept of Table functions that achieves the same effect, 
except that those are plugged into the FROM clause of a usual SQL statement.
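
As a rough illustration of what map_script and reduce_script might contain for 
word count, here is a minimal Python sketch. It assumes Hive hands each row to 
the script as a tab-separated line on stdin and reads tab-separated rows back 
from stdout. The function names map_lines and reduce_rows are hypothetical, and 
the sample lines are made up for the demo. Since DISTRIBUTE BY sends all rows 
for a word to the same reducer but does not sort them, the reduce side keeps 
totals in a dict instead of relying on adjacent rows.

```python
from collections import defaultdict


def map_lines(lines):
    """map_script logic: emit one 'word<TAB>1' row per word of each line."""
    for line in lines:
        for word in line.split():
            yield f"{word}\t1"


def reduce_rows(rows):
    """reduce_script logic: sum the counts seen for each word.

    Totals live in a dict, so rows for a word need not arrive adjacent.
    """
    totals = defaultdict(int)
    for row in rows:
        word, cnt = row.rstrip("\n").split("\t")
        totals[word] += int(cnt)
    for word in sorted(totals):
        yield f"{word}\t{totals[word]}"


# Demo on in-memory lines (hypothetical sample data).
lines = ["the quick fox", "the lazy dog"]
mapped = list(map_lines(lines))
reduced = list(reduce_rows(mapped))
for row in reduced:
    print(row)
```

In a real script each function would be fed sys.stdin and its output printed 
line by line; note that one input line yields several output rows.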

Also, SQL does actually have a workaround that you can use to generate more 
than one record from a single record - provided the explosion factor is fixed. 
Suppose you want to generate x records for each input record: you can do a 
cartesian join with a dummy table that has x rows.
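
The cartesian-join workaround can be tried out with any SQL engine. The sketch 
below uses Python's built-in sqlite3 module purely for convenience (SQLite, not 
Hive, is an assumption here), with hypothetical tables input_rows and dummy and 
an explosion factor of x = 3.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical input table with two records.
conn.execute("CREATE TABLE input_rows (id INTEGER, val TEXT)")
conn.executemany("INSERT INTO input_rows VALUES (?, ?)", [(1, "a"), (2, "b")])

# Dummy table with x = 3 rows; its contents only serve to multiply rows.
conn.execute("CREATE TABLE dummy (n INTEGER)")
conn.executemany("INSERT INTO dummy VALUES (?)", [(1,), (2,), (3,)])

# The cartesian join pairs every input record with every dummy row.
exploded = conn.execute(
    "SELECT i.id, i.val, d.n "
    "FROM input_rows i CROSS JOIN dummy d "
    "ORDER BY i.id, d.n"
).fetchall()

for row in exploded:
    print(row)
conn.close()
```

Each of the two input records comes out three times, once per dummy row. The 
factor is fixed by the size of the dummy table, which is why this does not help 
when the number of output records depends on the input (as in word count).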

Ashish

-----Original Message-----
From: Ricky Ho [mailto:r...@adobe.com] 
Sent: Wednesday, May 06, 2009 3:56 PM
To: core-user@hadoop.apache.org
Subject: RE: PIG and Hive

Thanks Amr,

Without knowing the details of Hive, one constraint of the SQL model is that you 
can never generate more than one record from a single record.  I don't know how 
this is done in Hive.  Another question is whether a Hive script can take in 
user-defined functions ?

Using the following word count as an example, can you show me what the Pig 
script and Hive script look like ?

Map:
  Input: a line (a collection of words)
  Output: multiple [word, 1]

Reduce:
  Input: [word, [1, 1, 1, ...]]
  Output: [word, count] 
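
The map/reduce contract above can be sketched in plain Python, with no Hadoop 
involved, to make the intermediate [word, [1, 1, 1, ...]] grouping explicit. 
The names map_fn, shuffle and reduce_fn and the sample lines are hypothetical; 
shuffle stands in for the grouping the framework would do between the phases.

```python
from collections import defaultdict


def map_fn(line):
    """Map: one input line -> multiple (word, 1) pairs."""
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group values by key, giving word -> [1, 1, 1, ...]."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_fn(word, ones):
    """Reduce: (word, [1, 1, ...]) -> (word, count)."""
    return (word, sum(ones))


lines = ["a b a", "b a"]
pairs = [pair for line in lines for pair in map_fn(line)]
grouped = shuffle(pairs)
counts = dict(reduce_fn(word, ones) for word, ones in grouped.items())
print(counts)
```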

Rgds,
Ricky

-----Original Message-----
From: Amr Awadallah [mailto:a...@cloudera.com]
Sent: Wednesday, May 06, 2009 3:14 PM
To: core-user@hadoop.apache.org
Subject: Re: PIG and Hive

> The difference between PIG and Hive seems to be pretty insignificant. 

Difference between Pig and Hive is significant, specifically:

(1) Pig doesn't require underlying structure to the data, whereas Hive does imply 
structure via a metastore. This has its pros and cons. It makes Pig more 
suitable for ETL-type tasks where the input data is still a mish-mash and you 
want to convert it to be structured. On the other hand, Hive's metastore 
provides a dictionary that lets you easily see what columns exist in which 
tables, which can be very handy.

(2) Pig is a new language, easy to learn if you know languages similar to Perl. 
Hive is a subset of SQL with very simple variations to enable map-reduce like 
computation. So, if you come from a SQL background you will find Hive QL 
extremely easy to pick up (many of your SQL queries will run as is), while if 
you come from a procedural programming background (w/o SQL knowledge) then Pig 
will be much more suitable for you. Furthermore, Hive is a bit easier to 
integrate with other systems and tools since it speaks the language they 
already speak (i.e. SQL).

You're right that HBase is a completely different game. HBase is not about 
being a high-level language that compiles to map-reduce; HBase is about 
allowing Hadoop to support lookups/transactions on key/value pairs. HBase 
allows you to (1) do quick random lookups, versus scanning all of the data 
sequentially, (2) do insert/update/delete in the middle, not just add/append.

-- amr

Ricky Ho wrote:
> Jeff,
>
> Thanks for the pointer.
> It is pretty clear that Hive and PIG are the same kind and HBase is a 
> different kind.
> The difference between PIG and Hive seems to be pretty insignificant.  Layering 
> a tool on top of them can completely hide their difference.
>
> I am viewing your PIG and Hive tutorial and hopefully can extract some 
> technical details there.
>
> Rgds,
> Ricky
> -----Original Message-----
> From: Jeff Hammerbacher [mailto:ham...@cloudera.com]
> Sent: Wednesday, May 06, 2009 1:38 PM
> To: core-user@hadoop.apache.org
> Subject: Re: PIG and Hive
>
> Here's a permalink for the thread on MarkMail:
> http://markmail.org/thread/ee4hpcji74higqvk
>
> On Wed, May 6, 2009 at 4:55 AM, Sharad Agarwal <shara...@yahoo-inc.com> wrote:
>
>   
>> see core-user mail thread with subject "HBase, Hive, Pig and other 
>> Hadoop based technologies"
>>
>> - Sharad
>>
>> Ricky Ho wrote:
>>     
>>> Are they competing technologies of providing a higher level language for
>>> Map/Reduce programming ?
>>> Or are they complementary ?
>>>
>>> Any comparison between them ?
>>>
>>> Rgds,
>>> Ricky
>>>       
>>
