Hi Vineet,

Check the UI to see what resources each of the jobs from your full Pig
script is using (i.e., lynx localhost:9100). Sometimes a job from one
module of your Pig script may not use the entire cluster; for example, a
job may get $REDUCERS mappers but only one reducer, which hurts
performance. Setting something like:

A0 = GROUP X BY (stuff) PARALLEL $REDUCERS;
A1 = FOREACH ...

to force parallelism might help you further. If that is not the case, you
may want to consider grouping on parts of your primary key to get
partial-summation counts, then performing further GROUP BY statements on
those sub-part aliases to SUM up to what you want for each case.
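A rough, untested sketch of that partial-summation idea (the names X,
pk_a, pk_b, and val are placeholders, not from your script):

-- stage 1: group on the full composite key and pre-aggregate
S0 = GROUP X BY (pk_a, pk_b) PARALLEL $REDUCERS;
S1 = FOREACH S0 GENERATE FLATTEN(group) AS (pk_a, pk_b),
     (long)SUM(X.val) AS part_sum:long;

-- stage 2: roll the partial sums up to the coarser key
T0 = GROUP S1 BY pk_a PARALLEL $REDUCERS;
T = FOREACH T0 GENERATE FLATTEN(group) AS pk_a,
    (long)SUM(S1.part_sum) AS total_sum:long;

Since stage 2 only sees one row per (pk_a, pk_b), the roll-up is cheap
compared to re-grouping the raw data.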
Another idea would be to include the filter logic as a classifier field in
your pk, such that you can GROUP BY on (pk, classifier_field) and then
partition your output accordingly.
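For instance (again an untested sketch; pk, val, and the (cond == 1) test
stand in for your actual key, value, and filter logic):

-- project the filter logic into a classifier field via a bincond
X1 = FOREACH X GENERATE pk, ((cond == 1) ? 1 : 0) AS classifier:int, val;

-- one grouping job on (pk, classifier) instead of one job per filter
G0 = GROUP X1 BY (pk, classifier) PARALLEL $REDUCERS;
G = FOREACH G0 GENERATE FLATTEN(group) AS (pk, classifier),
    (long)SUM(X1.val) AS sum_val:long, (int)COUNT(X1) AS cnt:int;

-- partition the output by classifier value
SPLIT G INTO g_in IF classifier == 1, g_out IF classifier == 0;

This computes the conditional counts and sums in a single pass over the
data, which should avoid the blow-up you get from FILTERs nested inside a
FOREACH.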
Hope this helps, -Dan

On Wed, Nov 5, 2014 at 7:07 AM, Vineet Mishra <[email protected]> wrote:

> Hi Dan,
>
> In your above-mentioned snippet of script, at the last line
>
> D = FOREACH D1 GENERATE FLATTEN(group) AS (category_id, category_name),
> (int)COUNT(D0) AS cat_id_1_count:int;
>
> I wanted to perform some group-level operation, such as a condition-based
> additive operation over each group's rows. How can I achieve this while
> preserving a constant run time?
>
> Currently I have applied some filters inside the FOREACH of the group and
> am processing accordingly, but that is making the job very slow, taking
> almost 25 times the expected run time. Looking for any workaround or
> alternate approach on an urgent basis. The Pig version used is 0.11.
>
> Thanks!
>
> On Thu, Oct 30, 2014 at 7:16 PM, Dan DeCapria, CivicScience <
> [email protected]> wrote:
>
> > Hi Vineet,
> >
> > Not entirely sure I'm understanding the problem correctly, but perhaps
> > the error you are getting can be fixed by:
> > sum_up_count = FILTER cat_ids BY (category_id == 1);
> >
> > I think that a clearer description of your use case and input data
> > sets, along with your current Pig script in development, will help us
> > debug this better.
> > As such, this might not help, but here's a crack at reverse-engineering
> > your problem space:
> >
> > A0 = LOAD '/input_01' USING PigStorage('\t','-noschema') AS (id:int,
> > category_id:int, category_name:chararray, meta:chararray);
> > A1 = FOREACH A0 GENERATE id, category_id, category_name;
> > A = DISTINCT A1;
> >
> > B0 = LOAD '/input_02' USING PigStorage('\t','-noschema') AS
> > (category_id:int, more_meta:chararray);
> > B1 = FOREACH B0 GENERATE category_id;
> > B = DISTINCT B1;
> >
> > C0 = JOIN A BY (category_id), B BY (category_id); -- 0,1,2,3
> > C1 = FOREACH C0 GENERATE $0 AS id, $1 AS category_id, $2 AS category_name;
> > C = DISTINCT C1;
> >
> > D0 = FILTER C BY (category_id == 1);
> > D1 = GROUP D0 BY (category_id, category_name);
> > D = FOREACH D1 GENERATE FLATTEN(group) AS (category_id, category_name),
> > (int)COUNT(D0) AS cat_id_1_count:int;
> >
> > Hope this helps, -Dan
> >
> > On Thu, Oct 30, 2014 at 7:10 AM, Vineet Mishra <[email protected]>
> > wrote:
> >
> > > Hi Dan,
> > >
> > > I am trying to put a FILTER inside a FOREACH; the description of the
> > > group (on which the FOREACH iteration is happening) is mentioned
> > > below. I am trying to get counts of all the rows passing the filter.
> > >
> > > Describe grp:
> > > grp: {group: (a::category_id: int,a::category_name: chararray),joind:
> > > {(a::id: int,a::category_id: int, b::category_id: int)}}
> > >
> > > Script:
> > > purified = FOREACH grp {
> > >     cat_ids_bag = FOREACH joind GENERATE b::category_id;
> > >     cat_ids = FOREACH cat_ids_bag GENERATE FLATTEN(category_id);
> > >     sum_up_count = FILTER cat_ids BY category_id IN (1);
> > > }
> > >
> > > It's throwing an error:
> > > Syntax error, unexpected symbol at or near 'IN'
> > >
> > > Looking out for an urgent response.
> > > Thanks!
> > >
> > > On Tue, Oct 28, 2014 at 7:49 PM, Dan DeCapria, CivicScience <
> > > [email protected]> wrote:
> > >
> > > > Hi Vineet,
> > > >
> > > > Expanding upon Lorand's resources, please note this all really
> > > > depends on your actual use case. When blocking out code to transform
> > > > from SQL to Pig Latin, it's usually a good idea to flow-chart the
> > > > logical process of what you want to do - just like you would for SQL
> > > > queries. Then it's just a matter of optimizing said queries - again,
> > > > just like you would with SQL queries at the DBA layer. The
> > > > 'under-the-hood' optimization to MR is done by Pig.
> > > >
> > > > Generically, this follows a simple paradigm, i.e.:
> > > >
> > > > -- optional runner: nohup pig -p REDUCERS=180 -f
> > > > /home/hadoop/my_file.pig 2>&1 > /tmp/my_file.out &
> > > >
> > > > -- some example configurations, i.e. gzip-compress the output
> > > > SET output.compression.enabled true;
> > > > SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
> > > > --SET default_parallel $REDUCERS;
> > > >
> > > > A0 = LOAD '/path/to/hdfs/data.dat' USING some.load.func() AS (the
> > > > typed schema); -- load data source A
> > > > A1 = FOREACH A0 GENERATE stuff; -- projection steps
> > > > A = FILTER A1 BY (stuff); -- filter prior to JOIN
> > > >
> > > > B0 = LOAD '/path/to/hdfs/data.dat' USING some.load.func() AS (the
> > > > typed schema); -- load data source B
> > > > B1 = FOREACH B0 GENERATE stuff; -- projection steps
> > > > B = FILTER B1 BY (stuff); -- filter prior to JOIN
> > > >
> > > > C0 = JOIN A BY (pk), B BY (pk) PARALLEL $REDUCERS; -- where size(A) >
> > > > size(B); PARALLEL to force use of all MR capacity
> > > > C = FOREACH C0 GENERATE stuff; -- re-alias the JOIN step fields to
> > > > what you want, projection
> > > >
> > > > D0 = GROUP C BY (cks); -- perform your grouping operation
> > > > D = FOREACH D0 GENERATE FLATTEN(group) AS (cks), (int)COUNT(C) AS
> > > > example_count:int; -- whatever aggregation stats you wanted to
> > > > perform wrt the GROUP BY operation
> > > >
> > > > STORE D INTO '/path/to/hdfs/storage/file' USING PigStorage(); --
> > > > flat, tab-delimited file output of typed schema fields from [D];
> > > > here I used the PigStorage() store.func
> > > >
> > > > Hope this helps, -Dan
> > > >
> > > > On Tue, Oct 28, 2014 at 10:09 AM, Lorand Bendig <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi Vineet,
> > > > >
> > > > > I'd recommend you have a look at these excellent resources:
> > > > >
> > > > > http://hortonworks.com/blog/pig-eye-for-the-sql-guy/
> > > > > http://mortar-public-site-content.s3-website-us-east-1.amazonaws.com/Mortar-Pig-Cheat-Sheet.pdf
> > > > > http://www.slideshare.net/trihug/practical-pig/11
> > > > >
> > > > > --Lorand
> > > > >
> > > > > On 28/10/14 14:34, Vineet Mishra wrote:
> > > > >
> > > > >> Hi,
> > > > >>
> > > > >> I was looking to transform a SQL statement consisting of multiple
> > > > >> clauses in the same query - specifically, a JOIN followed by some
> > > > >> condition (WHERE) and finally a grouping on some fields (GROUP BY).
> > > > >> Can I have a link or some briefing to guide me on how to implement
> > > > >> this kind of complex SQL statement in Pig?
> > > > >>
> > > > >> Thanks!
> > > > >>
> > > >
> > > > --
> > > > Dan DeCapria
> > > > CivicScience, Inc.
> > > > Back-End Data IS/BI/DM/ML Specialist
> > >
> >
> > --
> > Dan DeCapria
> > CivicScience, Inc.
> > Back-End Data IS/BI/DM/ML Specialist
> >
>

--
Dan DeCapria
CivicScience, Inc.
Back-End Data IS/BI/DM/ML Specialist
