Re: Group key uses wrong fields (duplicate uids)

David Wahler Tue, 10 Jan 2012 15:12:03 -0800

I simplified the test case some more and filed a bug report.

https://issues.apache.org/jira/browse/PIG-2465


On Mon, Jan 9, 2012 at 1:48 PM, David Wahler <[email protected]> wrote:
> Hi,
>
> I've just hit a bug that's present in all versions of Pig that I've
> tested. If I generate multiple relations from different projections of
> the same grouped input, then union them together and do another group
> with a composite key, the local rearrange step chooses the wrong
> fields to group by. Versions 0.8.1 and 0.9.1 generate incorrect
> output; trunk crashes with a "duplicate uid in schema" error. I
> encountered the problem in a fairly complex script, but managed to
> boil it down to the following test case:
>
> ---- bug.pig
>
> a = LOAD 'bug.in' AS (x:int, y:chararray, z:chararray);
>
> SPLIT a INTO a1 IF x==1, a2 IF x==2, a3 IF x==3;
>
> grouped = COGROUP a1 BY y, a2 BY y, a3 BY y;
> projected = FOREACH grouped GENERATE a1.z AS z1, a2.z AS z2, a3.z AS z3;
>
> b1 = FOREACH projected GENERATE FLATTEN(z1) AS first, FLATTEN(z2) AS second;
> b2 = FOREACH projected GENERATE FLATTEN(z2) AS first, FLATTEN(z3) AS second;
>
> c = UNION b1, b2;
> -- results are as expected until this point
> d = GROUP c BY (first,second);
> STORE d INTO 'bug.out';
>
> ---- Input:
>
> 1       foo     line1
> 2       foo     line2
> 3       foo     line3
> 3       foo     line4
>
> ---- Expected output:
>
> (line1,line2)   {(line1,line2)}
> (line2,line3)   {(line2,line3)}
> (line2,line4)   {(line2,line4)}
>
> ---- Actual output from 0.8/0.9
> ---- notice that the group is being done on (first,first) instead of
> (first,second):
>
> (line1,line1)   {(line1,line2)}
> (line2,line2)   {(line2,line3),(line2,line4)}
>
> ---- Stack trace from trunk:
>
> 2012-01-09 13:25:55,230 [main] INFO org.apache.pig.tools.pigstats.ScriptState - Pig features used in the script: COGROUP,GROUP_BY,UNION
> 2012-01-09 13:25:55,258 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 2270: Logical plan invalid state: duplicate uid in schema : first#298:chararray,second#298:chararray
> 2012-01-09 13:25:55,258 [main] ERROR org.apache.pig.tools.grunt.Grunt - org.apache.pig.impl.logicalLayer.FrontendException: ERROR 2000: Error processing rule LoadTypeCastInserter
>        at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:122)
>        at org.apache.pig.backend.hadoop.executionengine.HExecutionEngine.compile(HExecutionEngine.java:287)
>        at org.apache.pig.PigServer.compilePp(PigServer.java:1317)
>        at org.apache.pig.PigServer.executeCompiledLogicalPlan(PigServer.java:1254)
>        at org.apache.pig.PigServer.execute(PigServer.java:1246)
>        at org.apache.pig.PigServer.executeBatch(PigServer.java:362)
>        at org.apache.pig.tools.grunt.GruntParser.executeBatch(GruntParser.java:131)
>        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:192)
>        at org.apache.pig.tools.grunt.GruntParser.parseStopOnError(GruntParser.java:164)
>        at org.apache.pig.tools.grunt.Grunt.exec(Grunt.java:84)
>        at org.apache.pig.Main.run(Main.java:589)
>        at org.apache.pig.Main.main(Main.java:148)
>        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
>        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
>        at java.lang.reflect.Method.invoke(Method.java:597)
>        at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
> Caused by: org.apache.pig.impl.plan.PlanValidationException: ERROR 2270: Logical plan invalid state: duplicate uid in schema : first#298:chararray,second#298:chararray
>        at org.apache.pig.newplan.logical.optimizer.SchemaResetter.validate(SchemaResetter.java:225)
>        at org.apache.pig.newplan.logical.optimizer.SchemaResetter.visit(SchemaResetter.java:160)
>        at org.apache.pig.newplan.logical.relational.LOUnion.accept(LOUnion.java:182)
>        at org.apache.pig.newplan.DependencyOrderWalker.walk(DependencyOrderWalker.java:75)
>        at org.apache.pig.newplan.PlanVisitor.visit(PlanVisitor.java:50)
>        at org.apache.pig.newplan.logical.optimizer.SchemaPatcher.transformed(SchemaPatcher.java:43)
>        at org.apache.pig.newplan.optimizer.PlanOptimizer.optimize(PlanOptimizer.java:113)
>        ... 16 more
>
> It's possible to work around the problem by performing multiple JOINs
> instead of a single COGROUP and multiple FLATTENs, but the resulting
> plan uses more map-reduce jobs and does a lot of redundant work.
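>
> A rough sketch of what that JOIN-based workaround might look like,
> assuming the same input and aliases as bug.pig above. The names j12
> and j23 are made up for this illustration, and the script has not been
> verified against every affected version:
>
> ---- workaround.pig (sketch)
>
> a = LOAD 'bug.in' AS (x:int, y:chararray, z:chararray);
>
> SPLIT a INTO a1 IF x==1, a2 IF x==2, a3 IF x==3;
>
> -- one inner join per output relation, replacing COGROUP + FLATTEN
> j12 = JOIN a1 BY y, a2 BY y;
> j23 = JOIN a2 BY y, a3 BY y;
>
> -- pick z from each side; '::' disambiguates the joined fields
> b1 = FOREACH j12 GENERATE a1::z AS first, a2::z AS second;
> b2 = FOREACH j23 GENERATE a2::z AS first, a3::z AS second;
>
> c = UNION b1, b2;
> d = GROUP c BY (first,second);
> STORE d INTO 'bug.out';
>
> ----
>
> An inner JOIN on y should yield the same pairs as flattening two bags
> from the COGROUP, since a FLATTEN of an empty bag drops the row. Each
> JOIN typically compiles to its own map-reduce job, which is where the
> extra cost comes from.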
>
> Is this a known issue or limitation? (I searched JIRA and the list
> archives, but didn't see anything that looked relevant.) If not, I'll
> open an issue.
>
> Thanks,
> -- David
