Ricky,

One thing to mention is, SQL support is on the Pig roadmap this year.
--Yiping

On Wed, May 6, 2009 at 9:11 PM, Ricky Ho <r...@adobe.com> wrote:

> Thanks for Olga example and Scott's comment.
>
> My goal is to pick a higher level parallel programming language (as a
> algorithm design / prototyping tool) to express my parallel algorithms in a
> concise way. The deeper I look into these, I have a stronger feeling that
> PIG and HIVE are competitors rather than complementing each other. I think
> a large set of problems can be done in either way, without much difference
> in terms of skillset requirements.
>
> At this moment, I am focus in the richness of the language model rather
> than the implementation optimization. Supporting "collection" as well as
> the flatten operation in the language model seems to make PIG more powerful.
> Yes, you can achieve the same thing in Hive but then it starts to look odd.
> Am I missing something Hive folks ?
>
> Rgds,
> Ricky
>
> -----Original Message-----
> From: Scott Carey [mailto:sc...@richrelevance.com]
> Sent: Wednesday, May 06, 2009 7:48 PM
> To: core-user@hadoop.apache.org
> Subject: Re: PIG and Hive
>
> Pig currently also compiles similar operations (like the below) into many
> fewer map reduce passes and is several times faster in general.
>
> This will change as the optimizer and available optimizations converge and
> in the future they won't differ much. But for now, Pig optimizes much
> better.
>
> I ran a test that boiled down to SQL like this:
>
> SELECT count(a.z), count(b.z), x, y from a, b where a.x = b.x and a.y = b.y
> group by x, y.
>
> (and equivalent, but more verbose Pig)
>
> Pig did it in one map reduce pass in about 2 minutes and Hive did it in 5
> map reduce passes in 10 minutes.
>
> There is nothing keeping Hive from applying the optimizations necessary to
> make that one pass, but those sort of performance optimizations aren't
> there yet. That is expected, it is a younger project.
>
> It would be useful if more of these higher level tools shared work on the
> various optimizations. Pig and Hive (and perhaps CloudBase and Cascading?)
> could benefit from a shared map-reduce compiler.
>
>
> On 5/6/09 5:32 PM, "Olga Natkovich" <ol...@yahoo-inc.com> wrote:
>
> > Hi Ricky,
> >
> > This is how the code will look in Pig.
> >
> > A = load 'textdoc' using TextLoader() as (sentence: chararray);
> > B = foreach A generate flatten(TOKENIZE(sentence)) as word;
> > C = group B by word;
> > D = foreach C generate group, COUNT(B);
> > store D into 'wordcount';
> >
> > Pig training (http://www.cloudera.com/hadoop-training-pig-tutorial)
> > explains how the example above works.
> >
> > Let me know if you have further questions.
> >
> > Olga
> >
> >
> >> -----Original Message-----
> >> From: Ricky Ho [mailto:r...@adobe.com]
> >> Sent: Wednesday, May 06, 2009 3:56 PM
> >> To: core-user@hadoop.apache.org
> >> Subject: RE: PIG and Hive
> >>
> >> Thanks Amr,
> >>
> >> Without knowing the details of Hive, one constraint of SQL
> >> model is you can never generate more than one records from a
> >> single record. I don't know how this is done in Hive.
> >> Another question is whether the Hive script can take in
> >> user-defined functions ?
> >>
> >> Using the following word count as an example. Can you show
> >> me how the Pig script and Hive script looks like ?
> >>
> >> Map:
> >> Input: a line (a collection of words)
> >> Output: multiple [word, 1]
> >>
> >> Reduce:
> >> Input: [word, [1, 1, 1, ...]]
> >> Output: [word, count]
> >>
> >> Rgds,
> >> Ricky
> >>
> >> -----Original Message-----
> >> From: Amr Awadallah [mailto:a...@cloudera.com]
> >> Sent: Wednesday, May 06, 2009 3:14 PM
> >> To: core-user@hadoop.apache.org
> >> Subject: Re: PIG and Hive
> >>
> >>> The difference between PIG and Hive seems to be pretty insignificant.
> >>
> >> Difference between Pig and Hive is significant, specifically:
> >>
> >> (1) Pig doesn't require underlying structure to the data,
> >> Hive does imply structure via a metastore. This has it pros
> >> and cons. It allows Pig to be more suitable for ETL kind
> >> tasks where the input data is still a mish-mash and you want
> >> to convert it to be structured. On the other hand, Hive's
> >> metastore provides a dictionary that lets you easily see what
> >> columns exist in which tables which can be very handy.
> >>
> >> (2) Pig is a new language, easy to learn if you know
> >> languages similar to Perl. Hive is a sub-set of SQL with very
> >> simple variations to enable map-reduce like computation. So,
> >> if you come from a SQL background you will find Hive QL
> >> extremely easy to pickup (many of your SQL queries will run
> >> as is), while if you come from a procedural programming
> >> background (w/o SQL knowledge) then Pig will be much more
> >> suitable for you. Furthermore, Hive is a bit easier to
> >> integrate with other systems and tools since it speaks the
> >> language they already speak (i.e. SQL).
> >>
> >> You're right that HBase is a completely different game, HBase
> >> is not about being a high level language that compiles to
> >> map-reduce, HBase is about allowing Hadoop to support
> >> lookups/transactions on key/value pairs. HBase allows you to
> >> (1) do quick random lookups, versus scan all of data
> >> sequentially, (2) do insert/update/delete from middle, not
> >> just add/append.
> >>
> >> -- amr
> >>
> >> Ricky Ho wrote:
> >>> Jeff,
> >>>
> >>> Thanks for the pointer.
> >>> It is pretty clear that Hive and PIG are the same kind and
> >>> HBase is a different kind.
> >>> The difference between PIG and Hive seems to be pretty
> >>> insignificant. Layer a tool on top of them can completely
> >>> hide their difference.
> >>>
> >>> I am viewing your PIG and Hive tutorial and hopefully can
> >>> extract some technical details there.
> >>>
> >>> Rgds,
> >>> Ricky
> >>> -----Original Message-----
> >>> From: Jeff Hammerbacher [mailto:ham...@cloudera.com]
> >>> Sent: Wednesday, May 06, 2009 1:38 PM
> >>> To: core-user@hadoop.apache.org
> >>> Subject: Re: PIG and Hive
> >>>
> >>> Here's a permalink for the thread on MarkMail:
> >>> http://markmail.org/thread/ee4hpcji74higqvk
> >>>
> >>> On Wed, May 6, 2009 at 4:55 AM, Sharad Agarwal
> >>> <shara...@yahoo-inc.com> wrote:
> >>>
> >>>> see core-user mail thread with subject "HBase, Hive, Pig and other
> >>>> Hadoop based technologies"
> >>>>
> >>>> - Sharad
> >>>>
> >>>> Ricky Ho wrote:
> >>>>
> >>>>> Are they competing technologies of providing a higher level language
> >>>>> for Map/Reduce programming ?
> >>>>>
> >>>>> Or are they complementary ?
> >>>>>
> >>>>> Any comparison between them ?
> >>>>>
> >>>>> Rgds,
> >>>>> Ricky
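
For readers following the word count example in this thread, here is a rough single-process sketch in Python of the Map and Reduce steps Ricky describes (a line in, multiple [word, 1] out; then [word, [1, 1, 1, ...]] in, [word, count] out). It is only an illustration, not how Pig or Hive execute anything. The function names (map_line, shuffle, reduce_word, word_count) are made up for this sketch, and plain whitespace splitting stands in for Pig's TOKENIZE, which may split on other delimiters as well.

```python
from collections import defaultdict


def map_line(line):
    """Map: one input line in, many (word, 1) pairs out.

    Assumption: words are separated by whitespace only.
    """
    return [(word, 1) for word in line.split()]


def shuffle(pairs):
    """Group the 1s by word, standing in for what the framework
    does between the map and reduce phases."""
    grouped = defaultdict(list)
    for word, one in pairs:
        grouped[word].append(one)
    return grouped


def reduce_word(word, ones):
    """Reduce: (word, [1, 1, 1, ...]) in, (word, count) out."""
    return (word, sum(ones))


def word_count(lines):
    """Run map, shuffle and reduce in sequence over an iterable of lines."""
    pairs = [pair for line in lines for pair in map_line(line)]
    return dict(reduce_word(word, ones) for word, ones in shuffle(pairs).items())


if __name__ == "__main__":
    for word, count in sorted(word_count(["a b a", "b c"]).items()):
        print(word, count)
```

The one-line-to-many-pairs step in map_line is the part that flatten(TOKENIZE(sentence)) expresses in Olga's script, and it is the step Ricky notes does not fit a strict one-record-in, one-record-out SQL model.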