Re: PIG and Hive

Alan Gates Thu, 07 May 2009 07:15:08 -0700

SQL has been on Pig's roadmap for some time, see 
http://wiki.apache.org/pig/ProposedRoadMap

We would like to add SQL support to Pig sometime this year. We don't have an ETA or a JIRA for it yet.


Alan.

On May 6, 2009, at 11:20 PM, Amr Awadallah wrote:

Yiping,

(1) Any ETA for when that will become available?

(2) Where can we read more about the SQL functionality it will support?


(3) Where is the JIRA for this?

Thanks,

-- amr

Luc Hunt wrote:

Ricky,

One thing to mention is, SQL support is on the Pig roadmap this year.


--Yiping

On Wed, May 6, 2009 at 9:11 PM, Ricky Ho <r...@adobe.com> wrote:

Thanks for Olga's example and Scott's comment.

My goal is to pick a higher-level parallel programming language (as an algorithm design / prototyping tool) to express my parallel algorithms in a concise way. The deeper I look into these, the stronger my feeling that PIG and HIVE are competitors rather than complements to each other. I think a large set of problems can be done either way, without much difference in terms of skillset requirements.

At this moment, I am focused on the richness of the language model rather than the implementation optimization. Supporting "collection" as well as the flatten operation in the language model seems to make PIG more powerful. Yes, you can achieve the same thing in Hive, but then it starts to look odd.

Am I missing something, Hive folks?

Rgds,
Ricky

-----Original Message-----
From: Scott Carey [mailto:sc...@richrelevance.com]
Sent: Wednesday, May 06, 2009 7:48 PM
To: core-user@hadoop.apache.org
Subject: Re: PIG and Hive

Pig currently also compiles similar operations (like the below) into many fewer map reduce passes and is several times faster in general.

This will change as the optimizer and available optimizations converge, and in the future they won't differ much. But for now, Pig optimizes much better.

I ran a test that boiled down to SQL like this:

SELECT count(a.z), count(b.z), x, y from a, b where a.x = b.x and a.y = b.y group by x, y.

(and equivalent, but more verbose Pig)

Pig did it in one map reduce pass in about 2 minutes and Hive did it in 5 map reduce passes in 10 minutes.

There is nothing keeping Hive from applying the optimizations necessary to make that one pass, but those sorts of performance optimizations aren't there yet. That is expected; it is a younger project.
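
To illustrate why a single pass is possible for the query above, here is a minimal Python sketch. It is not how Pig or Hive actually compile the query; it only shows that, because the join key and the group-by key are both (x, y), one shuffle keyed on (x, y) is enough to compute both counts. The function name `join_group_count` and the (x, y, z) tuple layout are assumptions made for illustration.

```python
from collections import defaultdict


def join_group_count(a_rows, b_rows):
    """Inner-join a and b on (x, y), group by (x, y), and return
    {(x, y): (count(a.z), count(b.z))}. Rows are (x, y, z) tuples."""
    # "Shuffle": collect the z values from both inputs under the same key.
    groups = defaultdict(lambda: ([], []))
    for x, y, z in a_rows:
        groups[(x, y)][0].append(z)
    for x, y, z in b_rows:
        groups[(x, y)][1].append(z)

    # "Reduce": each key is handled once, with both sides available.
    result = {}
    for key, (a_zs, b_zs) in groups.items():
        if not a_zs or not b_zs:
            continue  # inner join: the key must appear on both sides
        a_nonnull = sum(1 for z in a_zs if z is not None)
        b_nonnull = sum(1 for z in b_zs if z is not None)
        # The join yields len(a_zs) * len(b_zs) rows for this key, so each
        # non-null a.z is counted once per matching b row, and vice versa.
        result[key] = (a_nonnull * len(b_zs), b_nonnull * len(a_zs))
    return result
```

The counts follow SQL semantics, where count(col) skips NULL values (represented here as None).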

It would be useful if more of these higher-level tools shared work on the various optimizations. Pig and Hive (and perhaps CloudBase and Cascading?) could benefit from a shared map-reduce compiler.


On 5/6/09 5:32 PM, "Olga Natkovich" <ol...@yahoo-inc.com> wrote:

Hi Ricky,

This is how the code will look in Pig.

A = load 'textdoc' using TextLoader() as (sentence: chararray);
B = foreach A generate flatten(TOKENIZE(sentence)) as word;
C = group B by word;
D = foreach C generate group, COUNT(B);
store D into 'wordcount';

Pig training (http://www.cloudera.com/hadoop-training-pig-tutorial)
explains how the example above works.
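
For readers who do not know Pig, the following Python sketch mirrors the relations A through D in the script above, step by step. It is only an approximation: `tokenize` here splits on whitespace as a stand-in for Pig's TOKENIZE, and the function name `pig_wordcount` and the in-memory lists are assumptions made for illustration.

```python
from collections import defaultdict


def tokenize(sentence):
    # Stand-in for TOKENIZE (assumption: whitespace splitting only).
    return sentence.split()


def pig_wordcount(sentences):
    # A = load ... as (sentence): one single-field tuple per input line.
    A = [(s,) for s in sentences]
    # B = foreach A generate flatten(TOKENIZE(sentence)) as word:
    # flatten turns the bag of tokens into one output row per token,
    # so a single input record can yield many records.
    B = [(w,) for (s,) in A for w in tokenize(s)]
    # C = group B by word: each group key is paired with a bag of B rows.
    C = defaultdict(list)
    for (word,) in B:
        C[word].append((word,))
    # D = foreach C generate group, COUNT(B): count the rows in each bag.
    D = [(group, len(bag)) for group, bag in C.items()]
    return D
```

The B step is where one record becomes many, which is the flatten behaviour discussed elsewhere in this thread.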

Let me know if you have further questions.

Olga

-----Original Message-----
From: Ricky Ho [mailto:r...@adobe.com]
Sent: Wednesday, May 06, 2009 3:56 PM
To: core-user@hadoop.apache.org
Subject: RE: PIG and Hive

Thanks Amr,

Without knowing the details of Hive, one constraint of the SQL
model is that you can never generate more than one record from
a single record.  I don't know how this is done in Hive.
Another question is whether a Hive script can take in
user-defined functions?

Using the following word count as an example, can you show
me what the Pig script and Hive script look like?

Map:
 Input: a line (a collection of words)
 Output: multiple [word, 1]

Reduce:
 Input: [word, [1, 1, 1, ...]]
 Output: [word, count]
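
The Map and Reduce steps outlined above can be sketched in plain Python as follows. This is a single-process illustration, not Hadoop code: the function names (`map_line`, `shuffle`, `reduce_word`, `word_count`) are hypothetical, words are assumed to be whitespace-separated, and the `shuffle` helper stands in for the grouping that the framework performs between the two phases.

```python
from collections import defaultdict


def map_line(line):
    # Map: one input line yields multiple [word, 1] pairs.
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    # Group the values by key, producing [word, [1, 1, 1, ...]].
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    return grouped


def reduce_word(word, ones):
    # Reduce: collapse the list of ones into [word, count].
    return (word, sum(ones))


def word_count(lines):
    pairs = [pair for line in lines for pair in map_line(line)]
    return dict(reduce_word(w, ones) for w, ones in shuffle(pairs).items())
```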

Rgds,
Ricky

-----Original Message-----
From: Amr Awadallah [mailto:a...@cloudera.com]
Sent: Wednesday, May 06, 2009 3:14 PM
To: core-user@hadoop.apache.org
Subject: Re: PIG and Hive

The difference between PIG and Hive seems to be pretty insignificant.

The difference between Pig and Hive is significant, specifically:

(1) Pig doesn't require underlying structure to the data;
Hive does imply structure via a metastore. This has its pros
and cons. It allows Pig to be more suitable for ETL-type
tasks where the input data is still a mish-mash and you want
to convert it to be structured. On the other hand, Hive's
metastore provides a dictionary that lets you easily see what
columns exist in which tables, which can be very handy.

(2) Pig is a new language, easy to learn if you know
languages similar to Perl. Hive is a sub-set of SQL with very
simple variations to enable map-reduce like computation. So,
if you come from a SQL background you will find Hive QL
extremely easy to pick up (many of your SQL queries will run
as is), while if you come from a procedural programming
background (w/o SQL knowledge) then Pig will be much more
suitable for you. Furthermore, Hive is a bit easier to
integrate with other systems and tools since it speaks the
language they already speak (i.e. SQL).

You're right that HBase is a completely different game. HBase
is not about being a high-level language that compiles to
map-reduce; HBase is about allowing Hadoop to support
lookups/transactions on key/value pairs. HBase allows you to
(1) do quick random lookups, versus scanning all of the data
sequentially, and (2) do insert/update/delete in the middle,
not just add/append.

-- amr

Ricky Ho wrote:

Jeff,

Thanks for the pointer.
It is pretty clear that Hive and PIG are the same kind and HBase is a different kind.

The difference between PIG and Hive seems to be pretty insignificant.  Layering a tool on top of them can completely hide their difference.

I am viewing your PIG and Hive tutorial and hopefully can extract some technical details there.

Rgds,
Ricky
-----Original Message-----
From: Jeff Hammerbacher [mailto:ham...@cloudera.com]
Sent: Wednesday, May 06, 2009 1:38 PM
To: core-user@hadoop.apache.org
Subject: Re: PIG and Hive

Here's a permalink for the thread on MarkMail:
http://markmail.org/thread/ee4hpcji74higqvk

On Wed, May 6, 2009 at 4:55 AM, Sharad Agarwal <shara...@yahoo-inc.com> wrote:

see core-user mail thread with subject "HBase, Hive, Pig and other
Hadoop based technologies"

- Sharad

Ricky Ho wrote:
Are they competing technologies for providing a higher-level language for Map/Reduce programming?

Or are they complementary ?

Any comparison between them ?

Rgds,
Ricky
