I simplified the test case some more and filed a bug report. https://issues.apache.org/jira/browse/PIG-2465

On Mon, Jan 9, 2012 at 1:48 PM, David Wahler wrote:
> Hi,
>
> I've just hit a bug that's present in all versions of Pig that I've
> tested. If I generate multiple relations from different projections of
> the same grouped input, then union them together and do another group
> with a composite key, the local rearrange step chooses the wrong
> fields to group by. Versions 0.8.1 and 0.9.1 generate incorrect
> output; trunk crashes with a "duplicate uid in schema" error. I
> encountered the problem in a fairly complex script, but managed to
> boil it down to the following test case:
>
> ---- bug.pig
>
> a = LOAD 'bug.in' AS (x:int, y:chararray, z:chararray);
>
> SPLIT a INTO a1 IF x==1, a2 IF x==2, a3 IF x==3;
>
> grouped = COGROUP a1 BY y, a2 BY y, a3 BY y;
> projected = FOREACH grouped GENERATE a1.z AS z1, a2.z AS z2, a3.z AS z3;
>
> b1 = FOREACH projected GENERATE FLATTEN(z1) AS first, FLATTEN(z2) AS second;
> b2 = FOREACH projected GENERATE FLATTEN(z2) AS first, FLATTEN(z3) AS second;
>
> c = UNION b1, b2;
> -- results are as expected until this point
> d = GROUP c BY (first,second);
> STORE d INTO 'bug.out';
>
> ---- Input:
>
> 1 foo line1
> 2 foo line2
> 3 foo line3
> 3 foo line4
>
> ---- Expected output:
>
> (line1,line2) {(line1,line2)}
> (line2,line3) {(line2,line3)}
> (line2,line4) {(line2,line4)}
>
> ---- Actual output from 0.8/0.9
> ---- notice that the group is being done on (first,first) instead of
> (first,second):
>
> (line1,line1) {(line1,line2)}
> (line2,line2) {(line2,line3),(line2,line4)}
>
> ---- Stack trace from trunk:
>
> 2012-01-09 13:25:55,230 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: COGROUP,GROUP_BY,UNION
> 2012-01-09 13:25:55,258 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2270: Logical plan invalid state: duplicate uid in schema : first#298:chararray,second#298:chararray
> 2012-01-09 13:25:55,258 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2000: Error processing rule LoadTypeCastInserter
>     at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:122)
>     at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:287)
>     at org.apache.pig.PigServer.compilePp(PigServer.java:1317)
>     at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1254)
>     at org.apache.pig.PigServer.execute(PigServer.java:1246)
>     at org.apache.pig.PigServer.executeBatch(PigServer.java:362)
>     at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:131)
>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:192)
>     at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
>     at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
>     at org.apache.pig.Main.run(Main.java:589)
>     at org.apache.pig.Main.main(Main.java:148)
>     at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>     at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>     at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>     at java.lang.reflect.Method.invoke(Method.java:597)
>     at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 2270: Logical plan invalid state: duplicate uid in schema : first#298:chararray,second#298:chararray
>     at org.apache.pig.newplan.logical.optimizer.SchemaResetter.validate(SchemaResetter.java:225)
>     at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:160)
>     at org.apache.pig.newplan.logical.relational.LOUnion.accept(LOUnion.java:182)
>     at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>     at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
>     at org.apache.pig.newplan.logical.optimizer.SchemaPatcher.transformed(SchemaPatcher.java:43)
>     at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:113)
>     ... 16 more
>
> It's possible to work around the problem by performing multiple JOINs
> instead of a single COGROUP and multiple FLATTENs, but the resulting
> plan uses more map-reduce jobs and does a lot of redundant work.
>
> Is this a known issue or limitation? (I searched JIRA and the list
> archives, but didn't see anything that looked relevant.) If not, I'll
> open an issue.
>
> Thanks,
> -- David
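
For anyone who wants to check the expected result without a Pig install, the test case above can be modelled roughly in plain Python. This is only a sketch of the script's intended semantics, not Pig code and not how Pig evaluates the plan. The helper names (`cogroup_z`, `build_c`, `group_by`) are made up for illustration, and the `observed` grouping simply assumes the key is (first,first), as the 0.8/0.9 output suggests.

```python
from collections import defaultdict
from itertools import product

# Rows of bug.in as (x, y, z).
rows = [
    (1, "foo", "line1"),
    (2, "foo", "line2"),
    (3, "foo", "line3"),
    (3, "foo", "line4"),
]

def cogroup_z(rows):
    """Model SPLIT + COGROUP a1 BY y, a2 BY y, a3 BY y, keeping only z.

    Returns one (z1, z2, z3) triple of lists per distinct y.
    """
    bags = defaultdict(lambda: ([], [], []))
    for x, y, z in rows:
        if x in (1, 2, 3):
            bags[y][x - 1].append(z)
    return list(bags.values())

def build_c(rows):
    """Model b1, b2 and c = UNION b1, b2 as a list of (first, second)."""
    projected = cogroup_z(rows)
    # Two FLATTENs in one GENERATE give the cross product of the two bags.
    b1 = [t for z1, z2, _ in projected for t in product(z1, z2)]
    b2 = [t for _, z2, z3 in projected for t in product(z2, z3)]
    return b1 + b2

def group_by(tuples, key):
    """Model GROUP ... BY: map each key to the list of tuples sharing it."""
    groups = defaultdict(list)
    for t in tuples:
        groups[key(t)].append(t)
    return dict(groups)

c = build_c(rows)
expected = group_by(c, lambda t: (t[0], t[1]))  # GROUP c BY (first,second)
observed = group_by(c, lambda t: (t[0], t[0]))  # assumed key: (first,first)

for name, result in (("expected", expected), ("observed", observed)):
    print(name)
    for k, bag in sorted(result.items()):
        print("  ", k, bag)
```

Under this model, `expected` has three groups of one tuple each and `observed` has two groups, which lines up with the "Expected output" and "Actual output from 0.8/0.9" listings in the quoted message.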
