Re: PIG and Hive

Luc Hunt Wed, 06 May 2009 22:47:31 -0700

Ricky,

One thing worth mentioning: SQL support is on the Pig roadmap this year.



--Yiping

On Wed, May 6, 2009 at 9:11 PM, Ricky Ho <r...@adobe.com> wrote:

> Thanks for Olga's example and Scott's comment.
>
> My goal is to pick a higher-level parallel programming language (as an
> algorithm design / prototyping tool) to express my parallel algorithms in a
> concise way.  The deeper I look into these, the stronger my feeling that
> PIG and HIVE are competitors rather than complements.  I think
> a large set of problems can be done either way, without much difference
> in terms of skillset requirements.
>
> At this moment, I am focused on the richness of the language model rather
> than the implementation optimization.  Supporting "collection" as well as
> the flatten operation in the language model seems to make PIG more powerful.
> Yes, you can achieve the same thing in Hive, but then it starts to look odd.
> Am I missing something, Hive folks?
>
> Rgds,
> Ricky
>
> -----Original Message-----
> From: Scott Carey [mailto:sc...@richrelevance.com]
> Sent: Wednesday, May 06, 2009 7:48 PM
> To: core-user@hadoop.apache.org
> Subject: Re: PIG and Hive
>
> Pig currently also compiles similar operations (like the below) into many
> fewer map reduce passes and is several times faster in general.
>
> This will change as the optimizer and available optimizations converge and
> in the future they won't differ much.  But for now, Pig optimizes much
> better.
>
> I ran a test that boiled down to SQL like this:
>
> SELECT count(a.z), count(b.z), a.x, a.y FROM a, b
> WHERE a.x = b.x AND a.y = b.y
> GROUP BY a.x, a.y;
>
> (and equivalent, but more verbose Pig)
>
> Pig did it in one map reduce pass in about 2 minutes and Hive did it in 5
> map reduce passes in 10 minutes.
>
> There is nothing keeping Hive from applying the optimizations necessary to
> make that one pass, but those sorts of performance optimizations aren't
> there yet.  That is expected; it is a younger project.
>
> It would be useful if more of these higher level tools shared work on the
> various optimizations.  Pig and Hive (and perhaps CloudBase and Cascading?)
> could benefit from a shared map-reduce compiler.
>
>
> On 5/6/09 5:32 PM, "Olga Natkovich" <ol...@yahoo-inc.com> wrote:
>
> > Hi Ricky,
> >
> > This is how the code will look in Pig.
> >
> > A = load 'textdoc' using TextLoader() as (sentence: chararray);
> > B = foreach A generate flatten(TOKENIZE(sentence)) as word;
> > C = group B by word;
> > D = foreach C generate group, COUNT(B);
> > store D into 'wordcount';
> >
> > Pig training (http://www.cloudera.com/hadoop-training-pig-tutorial)
> > explains how the example above works.
> >
> > Let me know if you have further questions.
> >
> > Olga
> >
> >
> >> -----Original Message-----
> >> From: Ricky Ho [mailto:r...@adobe.com]
> >> Sent: Wednesday, May 06, 2009 3:56 PM
> >> To: core-user@hadoop.apache.org
> >> Subject: RE: PIG and Hive
> >>
> >> Thanks Amr,
> >>
> >> Without knowing the details of Hive, one constraint of the SQL
> >> model is that you can never generate more than one record from a
> >> single record.  I don't know how this is done in Hive.
> >> Another question is whether a Hive script can take in
> >> user-defined functions?
> >>
> >> Using the following word count as an example, can you show
> >> me what the Pig script and Hive script look like?
> >>
> >> Map:
> >>   Input: a line (a collection of words)
> >>   Output: multiple [word, 1]
> >>
> >> Reduce:
> >>   Input: [word, [1, 1, 1, ...]]
> >>   Output: [word, count]
> >>
> >> Rgds,
> >> Ricky
> >>
> >> -----Original Message-----
> >> From: Amr Awadallah [mailto:a...@cloudera.com]
> >> Sent: Wednesday, May 06, 2009 3:14 PM
> >> To: core-user@hadoop.apache.org
> >> Subject: Re: PIG and Hive
> >>
> >>> The difference between PIG and Hive seems to be pretty insignificant.
> >>
> >> The difference between Pig and Hive is significant, specifically:
> >>
> >> (1) Pig doesn't require underlying structure to the data;
> >> Hive does imply structure via a metastore. This has its pros
> >> and cons. It allows Pig to be more suitable for ETL-type
> >> tasks where the input data is still a mish-mash and you want
> >> to convert it to be structured. On the other hand, Hive's
> >> metastore provides a dictionary that lets you easily see what
> >> columns exist in which tables, which can be very handy.
> >>
> >> (2) Pig is a new language, easy to learn if you know
> >> languages similar to Perl. Hive is a subset of SQL with very
> >> simple variations to enable map-reduce-like computation. So,
> >> if you come from a SQL background you will find Hive QL
> >> extremely easy to pick up (many of your SQL queries will run
> >> as is), while if you come from a procedural programming
> >> background (w/o SQL knowledge) then Pig will be much more
> >> suitable for you. Furthermore, Hive is a bit easier to
> >> integrate with other systems and tools since it speaks the
> >> language they already speak (i.e. SQL).
> >>
> >> You're right that HBase is a completely different game. HBase
> >> is not about being a high-level language that compiles to
> >> map-reduce; HBase is about allowing Hadoop to support
> >> lookups/transactions on key/value pairs. HBase allows you to
> >> (1) do quick random lookups, versus scanning all of the data
> >> sequentially, and (2) do insert/update/delete in the middle, not
> >> just add/append.
> >>
> >> -- amr
> >>
> >> Ricky Ho wrote:
> >>> Jeff,
> >>>
> >>> Thanks for the pointer.
> >>> It is pretty clear that Hive and PIG are the same kind and
> >>> HBase is a different kind.
> >>> The difference between PIG and Hive seems to be pretty
> >>> insignificant.  Layering a tool on top of them can completely
> >>> hide their differences.
> >>>
> >>> I am viewing your PIG and Hive tutorial and hopefully can
> >>> extract some technical details there.
> >>>
> >>> Rgds,
> >>> Ricky
> >>> -----Original Message-----
> >>> From: Jeff Hammerbacher [mailto:ham...@cloudera.com]
> >>> Sent: Wednesday, May 06, 2009 1:38 PM
> >>> To: core-user@hadoop.apache.org
> >>> Subject: Re: PIG and Hive
> >>>
> >>> Here's a permalink for the thread on MarkMail:
> >>> http://markmail.org/thread/ee4hpcji74higqvk
> >>>
> >>> On Wed, May 6, 2009 at 4:55 AM, Sharad Agarwal
> >>> <shara...@yahoo-inc.com> wrote:
> >>>
> >>>
> >>>> see core-user mail thread with subject "HBase, Hive, Pig and other
> >>>> Hadoop based technologies"
> >>>>
> >>>> - Sharad
> >>>>
> >>>> Ricky Ho wrote:
> >>>>
> >>>>> Are they competing technologies for providing a higher-level
> >>>>> language for Map/Reduce programming?
> >>>>>
> >>>>> Or are they complementary?
> >>>>>
> >>>>> Any comparison between them?
> >>>>>
> >>>>> Rgds,
> >>>>> Ricky
> >>>>>
> >>>>
> >>
> >
>
>
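
For readers following the word count example quoted above, here is a minimal plain-Python sketch of the Map and Reduce steps as Ricky describes them (a line in, multiple [word, 1] out, then [word, [1, 1, ...]] to [word, count]). It is only an illustration of the model, not how Pig or Hive execute it. The function names (map_line, shuffle, reduce_word, word_count) are made up for this sketch, and splitting on whitespace is an assumption standing in for Pig's TOKENIZE.

```python
from collections import defaultdict


def map_line(line):
    """Map: one input line -> multiple (word, 1) pairs."""
    # Assumption: whitespace splitting approximates Pig's TOKENIZE.
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group the emitted values by key, as the framework would
    between the map and reduce phases."""
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    return grouped


def reduce_word(word, ones):
    """Reduce: (word, [1, 1, 1, ...]) -> (word, count)."""
    return word, sum(ones)


def word_count(lines):
    """Run map, shuffle and reduce over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_line(line)]
    return dict(reduce_word(w, ones) for w, ones in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["the quick brown fox", "the lazy dog"]))
```

The map step is where one record becomes many, which is the "flatten" behaviour discussed in the thread; the grouping and counting correspond to the group and COUNT lines in Olga's Pig script.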
