Hi Vineet,

Check the UI to see what resources each of the jobs from your full Pig
script is using (i.e., lynx localhost:9100). Sometimes a job from one
module of your Pig script may not use the entire cluster; for example, a
job may get $REDUCERS mappers but only one reducer, which hurts
performance. Setting something like:

A0 = GROUP X BY (stuff) PARALLEL $REDUCERS;
A1 = FOREACH ...

to force parallelism might help you further. If that is not the case, you
may want to consider grouping on parts of your primary key to get
partial-summation counts, then performing further GROUP BY statements on
those sub-part aliases to SUM up to what you want for each case.
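A rough, untested sketch of that partial-summation idea (the names X,
pk_a, pk_b, and val are placeholders, not from your script):

-- stage 1: group on the full composite key and pre-aggregate
S0 = GROUP X BY (pk_a, pk_b) PARALLEL $REDUCERS;
S1 = FOREACH S0 GENERATE FLATTEN(group) AS (pk_a, pk_b),
     (long)SUM(X.val) AS part_sum:long;

-- stage 2: roll the partial sums up to the coarser key
T0 = GROUP S1 BY pk_a PARALLEL $REDUCERS;
T = FOREACH T0 GENERATE FLATTEN(group) AS pk_a,
    (long)SUM(S1.part_sum) AS total_sum:long;

Since stage 2 only sees one row per (pk_a, pk_b), the roll-up is cheap
compared to re-grouping the raw data.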
Another idea would be to include the filter logic as a classifier field in
your pk, such that you can GROUP BY on (pk, classifier_field) and then
partition your output accordingly.
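For instance (again an untested sketch; pk, val, and the (cond == 1) test
stand in for your actual key, value, and filter logic):

-- project the filter logic into a classifier field via a bincond
X1 = FOREACH X GENERATE pk, ((cond == 1) ? 1 : 0) AS classifier:int, val;

-- one grouping job on (pk, classifier) instead of one job per filter
G0 = GROUP X1 BY (pk, classifier) PARALLEL $REDUCERS;
G = FOREACH G0 GENERATE FLATTEN(group) AS (pk, classifier),
    (long)SUM(X1.val) AS sum_val:long, (int)COUNT(X1) AS cnt:int;

-- partition the output by classifier value
SPLIT G INTO g_in IF classifier == 1, g_out IF classifier == 0;

This computes the conditional counts and sums in a single pass over the
data, which should avoid the blow-up you get from FILTERs nested inside a
FOREACH.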
Hope this helps, -Dan

On Wed, Nov 5, 2014 at 7:07 AM, Vineet Mishra <[email protected]> wrote:

> Hi Dan,
>
> In your above-mentioned snippet of script, at the last line
>
> D = FOREACH D1 GENERATE FLATTEN(group) AS (category_id, category_name),
> (int)COUNT(D0) AS cat_id_1_count:int;
>
> I wanted to perform some group-level operation, such as a condition-based
> additive operation over each group's rows. How can I achieve this while
> preserving a constant run time?
>
> Currently I have applied some filters inside the FOREACH of the group and
> am processing accordingly, but that is making the job very slow, taking
> almost 25 times the expected run time. Looking for any workaround or
> alternate approach on an urgent basis. The Pig version used is 0.11.
>
> Thanks!
>
> On Thu, Oct 30, 2014 at 7:16 PM, Dan DeCapria, CivicScience <
> [email protected]> wrote:
>
> > Hi Vineet,
> >
> > Not entirely sure I'm understanding the problem correctly, but perhaps
> > the error you are getting can be fixed by:
> > sum_up_count = FILTER cat_ids BY (category_id == 1);
> >
> > I think that a clearer description of your use case and input data
> > sets, along with your current Pig script in development, will help us
> > debug this better.
> > As such, this might not help, but here's a crack at reverse-engineering
> > your problem space:
> >
> > A0 = LOAD '/input_01' USING PigStorage('\t','-noschema') AS (id:int,
> > category_id:int, category_name:chararray, meta:chararray);
> > A1 = FOREACH A0 GENERATE id, category_id, category_name;
> > A = DISTINCT A1;
> >
> > B0 = LOAD '/input_02' USING PigStorage('\t','-noschema') AS
> > (category_id:int, more_meta:chararray);
> > B1 = FOREACH B0 GENERATE category_id;
> > B = DISTINCT B1;
> >
> > C0 = JOIN A BY (category_id), B BY (category_id); -- 0,1,2,3
> > C1 = FOREACH C0 GENERATE $0 AS id, $1 AS category_id, $2 AS category_name;
> > C = DISTINCT C1;
> >
> > D0 = FILTER C BY (category_id == 1);
> > D1 = GROUP D0 BY (category_id, category_name);
> > D = FOREACH D1 GENERATE FLATTEN(group) AS (category_id, category_name),
> > (int)COUNT(D0) AS cat_id_1_count:int;
> >
> > Hope this helps, -Dan
> >
> > On Thu, Oct 30, 2014 at 7:10 AM, Vineet Mishra <[email protected]>
> > wrote:
> >
> > > Hi Dan,
> > >
> > > I am trying to put a FILTER inside a FOREACH; the description of the
> > > group (on which the FOREACH iteration is happening) is mentioned
> > > below. I am trying to get counts of all the rows passing the filter.
> > >
> > > Describe grp:
> > > grp: {group: (a::category_id: int,a::category_name: chararray),joind:
> > > {(a::id: int,a::category_id: int, b::category_id: int)}}
> > >
> > > Script:
> > > purified = FOREACH grp {
> > >     cat_ids_bag = FOREACH joind GENERATE b::category_id;
> > >     cat_ids = FOREACH cat_ids_bag GENERATE FLATTEN(category_id);
> > >     sum_up_count = FILTER cat_ids BY category_id IN (1);
> > > }
> > >
> > > It's throwing an error:
> > > Syntax error, unexpected symbol at or near 'IN'
> > >
> > > Looking out for an urgent response.
> > > Thanks!
> > >
> > > On Tue, Oct 28, 2014 at 7:49 PM, Dan DeCapria, CivicScience <
> > > [email protected]> wrote:
> > >
> > > > Hi Vineet,
> > > >
> > > > Expanding upon Lorand's resources, please note this all really
> > > > depends on your actual use case. When blocking out code to transform
> > > > from SQL to Pig Latin, it's usually a good idea to flow-chart the
> > > > logical process of what you want to do - just like you would for SQL
> > > > queries. Then it's just a matter of optimizing said queries - again,
> > > > just like you would with SQL queries at the DBA layer. The
> > > > 'under-the-hood' optimization to MR is done by Pig.
> > > >
> > > > Generically, this follows a simple paradigm, i.e.:
> > > >
> > > > -- optional runner: nohup pig -p REDUCERS=180 -f
> > > > /home/hadoop/my_file.pig 2>&1 > /tmp/my_file.out &
> > > >
> > > > -- some example configurations, i.e. gzip-compress the output
> > > > SET output.compression.enabled true;
> > > > SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
> > > > --SET default_parallel $REDUCERS;
> > > >
> > > > A0 = LOAD '/path/to/hdfs/data.dat' USING some.load.func() AS (the
> > > > typed schema); -- load data source A
> > > > A1 = FOREACH A0 GENERATE stuff; -- projection steps
> > > > A = FILTER A1 BY (stuff); -- filter prior to JOIN
> > > >
> > > > B0 = LOAD '/path/to/hdfs/data.dat' USING some.load.func() AS (the
> > > > typed schema); -- load data source B
> > > > B1 = FOREACH B0 GENERATE stuff; -- projection steps
> > > > B = FILTER B1 BY (stuff); -- filter prior to JOIN
> > > >
> > > > C0 = JOIN A BY (pk), B BY (pk) PARALLEL $REDUCERS; -- where size(A) >
> > > > size(B); PARALLEL to force use of all MR capacity
> > > > C = FOREACH C0 GENERATE stuff; -- re-alias the JOIN step fields to
> > > > what you want, projection
> > > >
> > > > D0 = GROUP C BY (cks); -- perform your grouping operation
> > > > D = FOREACH D0 GENERATE FLATTEN(group) AS (cks), (int)COUNT(C) AS
> > > > example_count:int; -- whatever aggregation stats you wanted to
> > > > perform wrt the GROUP BY operation
> > > >
> > > > STORE D INTO '/path/to/hdfs/storage/file' USING PigStorage(); --
> > > > flat, tab-delimited file output of typed schema fields from [D];
> > > > here I used the PigStorage() store.func
> > > >
> > > > Hope this helps, -Dan
> > > >
> > > > On Tue, Oct 28, 2014 at 10:09 AM, Lorand Bendig <[email protected]>
> > > > wrote:
> > > >
> > > > > Hi Vineet,
> > > > >
> > > > > I'd recommend you have a look at these excellent resources:
> > > > >
> > > > > http://hortonworks.com/blog/pig-eye-for-the-sql-guy/
> > > > > http://mortar-public-site-content.s3-website-us-east-1.amazonaws.com/Mortar-Pig-Cheat-Sheet.pdf
> > > > > http://www.slideshare.net/trihug/practical-pig/11
> > > > >
> > > > > --Lorand
> > > > >
> > > > > On 28/10/14 14:34, Vineet Mishra wrote:
> > > > >
> > > > >> Hi,
> > > > >>
> > > > >> I was looking to transform a SQL statement consisting of multiple
> > > > >> clauses in the same query - specifically, a JOIN followed by some
> > > > >> condition (WHERE) and finally a grouping on some fields (GROUP BY).
> > > > >> Can I have a link or some briefing to guide me on how to implement
> > > > >> this kind of complex SQL statement in Pig?
> > > > >>
> > > > >> Thanks!
> > > > >>
> > > >
> > > > --
> > > > Dan DeCapria
> > > > CivicScience, Inc.
> > > > Back-End Data IS/BI/DM/ML Specialist
> > >
> >
> > --
> > Dan DeCapria
> > CivicScience, Inc.
> > Back-End Data IS/BI/DM/ML Specialist
> >
>

--
Dan DeCapria
CivicScience, Inc.
Back-End Data IS/BI/DM/ML Specialist
