Dear All,
I recently finished my Big Data training and I have been offered one more course for free.
Could someone advise me on which of the three below I should go for?
My experience lies in ETL and Reporting.
• Cassandra
• Cloud Computing with AWS
• Apache Storm
Your advice at the earliest would be appreciated so I can start on it.
Rgds, Anil
> On 29-Oct-2014, at 4:19 am, Vineet Mishra <[email protected]> wrote:
>
> Hi Dan/Lorand,
>
> Thanks for sharing this beautiful resource and knowledge, I will definitely
> go through it and let you know should I encounter any issues.
>
> Thanks!
>
> On Tue, Oct 28, 2014 at 7:49 PM, Dan DeCapria, CivicScience <
> [email protected]> wrote:
>
>> Hi Vineet,
>>
>> Expanding upon Lorand's resources, please note this all really depends on
>> your actual use case. When blocking out code to transform from SQL to Pig
>> Latin, it's usually a good idea to flow-chart the logical process of what
>> you want to do - just like you would for SQL queries. Then it's just a
>> matter of optimizing said queries - again, just like you would with SQL
>> queries on the DBA layer. The 'under-the-hood' optimizations to MapReduce
>> are done by Pig.
>>
>> Generically, this follows a simple paradigm, e.g.:
>>
>> -- optional runner: nohup pig -p REDUCERS=180 -f /home/hadoop/my_file.pig
>> > /tmp/my_file.out 2>&1 &
>>
>> -- some example configurations, e.g. gzip-compress the output
>> SET output.compression.enabled true;
>> SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
>> --SET default_parallel $REDUCERS;
>>
>> A0 = LOAD '/path/to/hdfs/data.dat' USING some.load.func() AS (the typed
>> schema); -- loader data source A
>> A1 = FOREACH A0 GENERATE stuff; -- projection steps
>> A = FILTER A1 BY (stuff); -- filter prior to JOIN
>>
>> B0 = LOAD '/path/to/hdfs/data.dat' USING some.load.func() AS (the typed
>> schema); -- loader data source B
>> B1 = FOREACH B0 GENERATE stuff; -- projection steps
>> B = FILTER B1 BY (stuff); -- filter prior to JOIN
>>
>> C0 = JOIN A BY (pk), B BY (pk) PARALLEL $REDUCERS; -- for a default join,
>> list the larger relation last; PARALLEL to force use of all MR capacity
>> C = FOREACH C0 GENERATE stuff; -- re-alias the JOIN step fields to what you
>> want, projection
>>
>> D0 = GROUP C BY (cks); -- perform your grouping operation
>> D = FOREACH D0 GENERATE FLATTEN(group) AS (cks), (int)COUNT(C) AS
>> example_count; -- whatever aggregation stats you wanted to perform wrt
>> the GROUP BY operation
>>
>> STORE D INTO '/path/to/hdfs/storage/file' USING PigStorage(); -- flat,
>> tab-delimited file output of typed schema fields from [D]; here I used
>> PigStorage() store.func
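>>
>> As a concrete sketch of the same pattern (all table, field and path names
>> below are made up for illustration), a SQL query like
>>
>> SELECT u.country, COUNT(*) AS order_count
>> FROM users u JOIN orders o ON (u.user_id = o.user_id)
>> WHERE o.amount > 100
>> GROUP BY u.country;
>>
>> would roughly become:
>>
>> U = LOAD '/path/to/users' USING PigStorage('\t') AS (user_id:long, country:chararray);
>> O0 = LOAD '/path/to/orders' USING PigStorage('\t') AS (user_id:long, amount:double);
>> O = FILTER O0 BY (amount > 100.0); -- the WHERE clause, pushed before the JOIN
>> J = JOIN U BY (user_id), O BY (user_id); -- the SQL JOIN
>> G = GROUP J BY U::country; -- the GROUP BY
>> R = FOREACH G GENERATE FLATTEN(group) AS (country), COUNT(J) AS order_count;
>> STORE R INTO '/path/to/out' USING PigStorage();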
>>
>> Hope this helps, -Dan
>>
>>
>> On Tue, Oct 28, 2014 at 10:09 AM, Lorand Bendig <[email protected]> wrote:
>>
>>> Hi Vineet,
>>>
>>> I'd recommend you have a look at these excellent resources:
>>>
>>> http://hortonworks.com/blog/pig-eye-for-the-sql-guy/
>>> http://mortar-public-site-content.s3-website-us-east-1.amazonaws.com/Mortar-Pig-Cheat-Sheet.pdf
>>> http://www.slideshare.net/trihug/practical-pig/11
>>>
>>> --Lorand
>>>
>>>
>>> On 28/10/14 14:34, Vineet Mishra wrote:
>>>
>>>> Hi,
>>>>
>>>> I was looking to transform a SQL statement that consists of multiple
>>>> clauses in the same query - specifically, a JOIN followed by a
>>>> condition (WHERE) and finally a grouping on some fields (GROUP BY).
>>>> Could you share a link or a brief guide on how I can implement this
>>>> kind of complex SQL statement in Pig?
>>>>
>>>> Thanks!
>>>>
>>>>
>>>
>>
>>
>> --
>> Dan DeCapria
>> CivicScience, Inc.
>> Back-End Data IS/BI/DM/ML Specialist
>>