Hi,

Below is a list of tuples generated by a UDF:

  ( ( [stdout#{ (day, age, name, address, ['k1#v1','k2#v2'] ) } ] ) )
  ( ( [stdout#{ (12/2,22,deepak,newyork, ['k1#v2','k2#v2'] ) } ] ) )
  ( ( [stdout#{ (12/3,22,deepak,newyork, ['k1#v1','k2#v2'] ) } ] ) )

The grouping I am after:

  group a -- ( v1 , { (day, age, name, address, ['k1#v1','k2#v2'] ), (12/3,22, deepak,newjersy, ['k1#v1','k2#v2']) } )
  group b -- ( v2 , { (12/2,22,deepak,newyork, ['k1#v2','k2#v2'])} )

I need to group by k1 so that I end up with two groups.

*Approach #1*

  grped = group inputTuples by $0.$0.#'stdout'.$0.$0.$5#'k1'

Error:

  2011-03-29 15:16:44,589 [main] WARN org.apache.pig.PigServer - Encountered Warning IMPLICIT_CAST_TO_MAP 1 time(s).
  2011-03-29 15:16:44,589 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: GROUP_BY
  2011-03-29 15:16:44,589 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - pig.usenewlogicalplan is set to true. New logical plan will be used.
  2011-03-29 15:16:44,593 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2042: Error in new logical plan. Try -Dpig.usenewlogicalplan=false.
  Details at logfile: /home/deepakkv/pigtemp/testworkflow/pig_1301391996435.log

*Approach #2*

Since that failed, I flattened inputTuples as follows:

  flat = foreach inputTuples generate flatten($0.$0#'stdout');

which gives:

  (day, age, name, address, ['k1#v1','k2#v2'] )
  (12/2,22,deepak,newyork, ['k1#v2','k2#v2'] )
  (12/3,22,deepak,newyork, ['k1#v1','k2#v2'] )

k1 lives in the map that is the 5th item (index 4) of each tuple, so I tried to group on it:

  grped = group flat by $4#'k1';

Error:

  2011-03-29 15:25:28,459 [main] INFO org.apache.pig.Main - Logging error messages to: /home/deepakkv/pigtemp/testworkflow/pig_1301392528456.log
  2011-03-29 15:25:28,554 [main] INFO org.apache.pig.backend.hadoop.executionengine.HExecutionEngine - Connecting to hadoop file system at: file:///
  2011-03-29 15:25:28,750 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1000: Error during parsing. Out of bound access. Trying to access non-existent column: 4. Schema {bytearray} has 1 column(s).
  Details at logfile: /home/deepakkv/pigtemp/testworkflow/pig_1301392528456.log

*Approach #3*

Since that failed as well, I tried:

  grped = group flat by $0.$4#'k1';

Error:

  2011-03-29 15:27:18,081 [Thread-13] WARN org.apache.hadoop.mapred.LocalJobRunner - job_local_0001
  java.lang.ClassCastException: java.lang.String cannot be cast to org.apache.pig.data.Tuple
      at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:392)
      at org.apache.pig.backend.hadoop.executionengine.physicalLayer.PhysicalOperator.processInput(PhysicalOperator.java:276)
      at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:138)
      at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POProject.getNext(POProject.java:276)
      at org.apache.pig.backend.hadoop.executionengine.physicalLayer.expressionOperators.POCast.getNext(POCast.java:916)

How can I group tuples on a group id that is nested inside Tuple -> Bag -> Map -> Tuple (given key) -> item at index 4 (a Map again) -> Key?

Regards,
Deepak