Hi Dan/Lorand,

Thanks for sharing these great resources and knowledge; I will definitely go through them and let you know if I encounter any issues. To check my understanding, I've also put a rough sketch below the quoted thread of how the JOIN + WHERE + GROUP BY case might map to Pig.
Thanks!

On Tue, Oct 28, 2014 at 7:49 PM, Dan DeCapria, CivicScience <[email protected]> wrote:

> Hi Vineet,
>
> Expanding upon Lorand's resources, please note this all really depends on
> your actual use case. When blocking out code to transform from SQL to Pig
> Latin, it's usually a good idea to flow-chart the logical process of what
> you want to do - just like you would for SQL queries. Then it's just a
> matter of optimizing those queries - again, just like you would with SQL
> queries at the DBA layer. The 'under-the-hood' optimizations to MR are
> done by Pig.
>
> Generically, this follows a simple paradigm, i.e.:
>
> -- optional runner: nohup pig -p REDUCERS=180 -f /home/hadoop/my_file.pig 2>&1 > /tmp/my_file.out &
>
> -- some example configurations, e.g. gzip-compress the output
> SET output.compression.enabled true;
> SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
> --SET default_parallel $REDUCERS;
>
> A0 = LOAD '/path/to/hdfs/data.dat' USING some.load.func() AS (the typed schema); -- loader, data source A
> A1 = FOREACH A0 GENERATE stuff; -- projection steps
> A = FILTER A1 BY (stuff); -- filter prior to JOIN
>
> B0 = LOAD '/path/to/hdfs/data.dat' USING some.load.func() AS (the typed schema); -- loader, data source B
> B1 = FOREACH B0 GENERATE stuff; -- projection steps
> B = FILTER B1 BY (stuff); -- filter prior to JOIN
>
> C0 = JOIN A BY (pk), B BY (pk) PARALLEL $REDUCERS; -- where size(A) > size(B), PARALLEL to force use of all MR capacity
> C = FOREACH C0 GENERATE stuff; -- re-alias the JOIN step fields to what you want, projection
>
> D0 = GROUP C BY (cks); -- perform your grouping operation
> D = FOREACH D0 GENERATE FLATTEN(group) AS (cks), (int)COUNT(C) AS example_count:int; -- whatever aggregation stats you want to perform wrt the GROUP BY operation
>
> STORE D INTO '/path/to/hdfs/storage/file' USING PigStorage(); -- flat, tab-delimited file output of the typed schema fields from [D]; here I used the PigStorage() store.func
>
> Hope this helps, -Dan
>
>
> On Tue, Oct 28, 2014 at 10:09 AM, Lorand Bendig <[email protected]> wrote:
>
> > Hi Vineet,
> >
> > I'd recommend you have a look at these excellent resources:
> >
> > http://hortonworks.com/blog/pig-eye-for-the-sql-guy/
> > http://mortar-public-site-content.s3-website-us-east-1.amazonaws.com/Mortar-Pig-Cheat-Sheet.pdf
> > http://www.slideshare.net/trihug/practical-pig/11
> >
> > --Lorand
> >
> >
> > On 28/10/14 14:34, Vineet Mishra wrote:
> >
> >> Hi,
> >>
> >> I was looking to transform a SQL statement consisting of multiple
> >> clauses in the same query - specifically, a JOIN followed by a
> >> condition (WHERE) and finally a grouping on some fields (GROUP BY).
> >> Can I have a link or some briefing that can guide me on how to
> >> implement this kind of complex SQL statement in Pig?
> >>
> >> Thanks!
> >>
> >
>
> --
> Dan DeCapria
> CivicScience, Inc.
> Back-End Data IS/BI/DM/ML Specialist
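
For concreteness, here is the minimal sketch I had in mind, following Dan's pattern of filtering before the JOIN. All table, field, and path names below (orders, customers, /data/..., /output/...) are placeholders I made up, not anything from the thread; the SQL being translated is shown in the leading comment.

-- Hypothetical SQL being translated (all names are placeholders):
--   SELECT o.customer_id, COUNT(*) AS order_count
--   FROM orders o
--   JOIN customers c ON o.customer_id = c.customer_id
--   WHERE c.country = 'US'
--   GROUP BY o.customer_id;

orders    = LOAD '/data/orders.tsv'    USING PigStorage('\t') AS (order_id:long, customer_id:long, amount:double);
customers = LOAD '/data/customers.tsv' USING PigStorage('\t') AS (customer_id:long, country:chararray);

-- WHERE clause: filter as early as possible, before the JOIN
us_customers = FILTER customers BY country == 'US';

-- JOIN on the shared key
joined = JOIN orders BY customer_id, us_customers BY customer_id;

-- GROUP BY plus the aggregate
grouped = GROUP joined BY orders::customer_id;
counts  = FOREACH grouped GENERATE group AS customer_id, COUNT(joined) AS order_count;

STORE counts INTO '/output/order_counts' USING PigStorage('\t');

If I've understood the pattern correctly, the FILTER step plays the role of the WHERE clause, and the GROUP/FOREACH pair plays the role of GROUP BY with COUNT(*). Please correct me if this is off.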
