Hi Dan/Lorand,

Thanks for sharing these great resources and knowledge; I will definitely go through them and let you know if I encounter any issues. To check my understanding, I've also put a rough sketch below the quoted thread of how the JOIN + WHERE + GROUP BY case might map to Pig.
Thanks!

On Tue, Oct 28, 2014 at 7:49 PM, Dan DeCapria, CivicScience <[email protected]> wrote:

> Hi Vineet,
>
> Expanding upon Lorand's resources, please note this all really depends on
> your actual use case. When blocking out code to transform from SQL to Pig
> Latin, it's usually a good idea to flow-chart the logical process of what
> you want to do - just like you would for SQL queries. Then it's just a
> matter of optimizing those queries - again, just like you would with SQL
> queries at the DBA layer. The 'under-the-hood' optimizations to MR are
> done by Pig.
>
> Generically, this follows a simple paradigm, i.e.:
>
> -- optional runner: nohup pig -p REDUCERS=180 -f /home/hadoop/my_file.pig 2>&1 > /tmp/my_file.out &
>
> -- some example configurations, e.g. gzip-compress the output
> SET output.compression.enabled true;
> SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
> --SET default_parallel $REDUCERS;
>
> A0 = LOAD '/path/to/hdfs/data.dat' USING some.load.func() AS (the typed schema); -- loader, data source A
> A1 = FOREACH A0 GENERATE stuff; -- projection steps
> A = FILTER A1 BY (stuff); -- filter prior to JOIN
>
> B0 = LOAD '/path/to/hdfs/data.dat' USING some.load.func() AS (the typed schema); -- loader, data source B
> B1 = FOREACH B0 GENERATE stuff; -- projection steps
> B = FILTER B1 BY (stuff); -- filter prior to JOIN
>
> C0 = JOIN A BY (pk), B BY (pk) PARALLEL $REDUCERS; -- where size(A) > size(B), PARALLEL to force use of all MR capacity
> C = FOREACH C0 GENERATE stuff; -- re-alias the JOIN step fields to what you want, projection
>
> D0 = GROUP C BY (cks); -- perform your grouping operation
> D = FOREACH D0 GENERATE FLATTEN(group) AS (cks), (int)COUNT(C) AS example_count:int; -- whatever aggregation stats you want to perform wrt the GROUP BY operation
>
> STORE D INTO '/path/to/hdfs/storage/file' USING PigStorage(); -- flat, tab-delimited file output of the typed schema fields from [D]; here I used the PigStorage() store.func
>
> Hope this helps, -Dan
>
>
> On Tue, Oct 28, 2014 at 10:09 AM, Lorand Bendig <[email protected]> wrote:
>
> > Hi Vineet,
> >
> > I'd recommend you have a look at these excellent resources:
> >
> > http://hortonworks.com/blog/pig-eye-for-the-sql-guy/
> > http://mortar-public-site-content.s3-website-us-east-1.amazonaws.com/Mortar-Pig-Cheat-Sheet.pdf
> > http://www.slideshare.net/trihug/practical-pig/11
> >
> > --Lorand
> >
> >
> > On 28/10/14 14:34, Vineet Mishra wrote:
> >
> >> Hi,
> >>
> >> I was looking to transform a SQL statement consisting of multiple
> >> clauses in the same query - specifically, a JOIN followed by a
> >> condition (WHERE) and finally a grouping on some fields (GROUP BY).
> >> Can I have a link or some briefing that can guide me on how to
> >> implement this kind of complex SQL statement in Pig?
> >>
> >> Thanks!
> >>
> >
>
> --
> Dan DeCapria
> CivicScience, Inc.
> Back-End Data IS/BI/DM/ML Specialist
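
For concreteness, here is the minimal sketch I had in mind, following Dan's pattern of filtering before the JOIN. All table, field, and path names below (orders, customers, /data/..., /output/...) are placeholders I made up, not anything from the thread; the SQL being translated is shown in the leading comment.

-- Hypothetical SQL being translated (all names are placeholders):
--   SELECT o.customer_id, COUNT(*) AS order_count
--   FROM orders o
--   JOIN customers c ON o.customer_id = c.customer_id
--   WHERE c.country = 'US'
--   GROUP BY o.customer_id;

orders    = LOAD '/data/orders.tsv'    USING PigStorage('\t') AS (order_id:long, customer_id:long, amount:double);
customers = LOAD '/data/customers.tsv' USING PigStorage('\t') AS (customer_id:long, country:chararray);

-- WHERE clause: filter as early as possible, before the JOIN
us_customers = FILTER customers BY country == 'US';

-- JOIN on the shared key
joined = JOIN orders BY customer_id, us_customers BY customer_id;

-- GROUP BY plus the aggregate
grouped = GROUP joined BY orders::customer_id;
counts  = FOREACH grouped GENERATE group AS customer_id, COUNT(joined) AS order_count;

STORE counts INTO '/output/order_counts' USING PigStorage('\t');

If I've understood the pattern correctly, the FILTER step plays the role of the WHERE clause, and the GROUP/FOREACH pair plays the role of GROUP BY with COUNT(*). Please correct me if this is off.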
