[
https://issues.apache.org/jira/browse/PIG-2563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13224076#comment-13224076
]
Daniel Dai commented on PIG-2563:
---------------------------------
bq. Cheap code style comments
Sure, will change.
bq. More expensive code content comments
I'm not sure I completely understand your point, so let me explain the design
of the foreach nested plan and why I made the change. Let me know if you need
further explanation. Uid and schema inference are core to the logical plan:
anyone who changes any part of that process needs to make sure existing
functionality is not broken. In the patch, I change the way a project infers
its uid, because previously it did not generate a new uid for the new bag
produced by a nested foreach. Here is how uids work in the foreach inner plan:
# The inner plan of every foreach statement starts with LOInnerLoad and ends
with LOGenerate.
# A simple foreach should keep the uid, e.g. for foreach a generate $1, $2 we
keep the uid of $1 and $2, even if one of them is a bag column. A couple of
places rely on this assumption.
# If the input column is a bag, LOInnerLoad takes the bag's inner schema as its
own schema, e.g. if $1 is bag#2{t#3(x#4, y#5)}, LOInnerLoad has the schema
(x#4, y#5). It can be followed by nested operators.
# LOGenerate regenerates the bag after the inner operator pipeline, in this
case bag#2{t#3(x#4, y#5)}, and we need to keep its uid.
# Currently no nested operator changes the uid except ForEach, so that is the
approach I took in the patch: reuse the uid unless we see a ForEach.
Here are complete examples:
{code}
b = foreach a generate a1, a2;
(a0:xxxx, a1:chararray#1, a2:bag#2{t#3(x#4, y#5)})

LOInnerLoad(a1:chararray)    LOInnerLoad(x#4, y#5)
            \                   /
   LOGenerate(a1:chararray#1, a2:bag#2{t#3(x#4, y#5)})
{code}
{code}
b = foreach a { c = filter a2 by x==1; generate a1, c; };
(a0:xxxx, a1:chararray#1, a2:bag#2{t#3(x#4, y#5)})

LOInnerLoad(a1:chararray)    LOInnerLoad(x#4, y#5)
            \                   /
             \          LOFilter(x#4, y#5)
              \           /
   LOGenerate(a1:chararray#1, c:bag#2{t#3(x#4, y#5)})
{code}
{code}
b = foreach a { c = a2.x; generate a1, c; };
(a0:xxxx, a1:chararray#1, a2:bag#2{t#3(x#4, y#5)})

LOInnerLoad(a1:chararray)    LOInnerLoad(x#4, y#5)
            \                   /
             \          LOForEach(x#4)
              \           /
   LOGenerate(a1:chararray#1, c:bag#7{t#6(x#4)})
{code}
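The uid rule illustrated by the three plans above can also be written as a small toy model. This is only a sketch of the rule as described in this comment, not Pig's actual code: the names Field, Bag and regenerate_bag are hypothetical, and the starting uid of 6 is an assumption chosen so the numbers match the examples.

```python
from dataclasses import dataclass
from itertools import count


# Hypothetical toy types; these are NOT Pig's real classes.
@dataclass(frozen=True)
class Field:
    name: str
    uid: int


@dataclass(frozen=True)
class Bag:
    alias: str
    uid: int        # uid of the bag column itself
    tuple_uid: int  # uid of the inner tuple
    fields: tuple   # inner fields, each with its own uid


def regenerate_bag(alias, source, nested_ops, out_fields, uids):
    """Model LOGenerate rebuilding a bag after the nested operator pipeline.

    The bag and tuple uids of the source are reused unless the pipeline
    contains a nested ForEach, in which case fresh uids are drawn from `uids`.
    Inner field uids are passed through unchanged in both cases.
    """
    if "LOForEach" in nested_ops:
        tuple_uid = next(uids)
        bag_uid = next(uids)
        return Bag(alias, bag_uid, tuple_uid, tuple(out_fields))
    return Bag(alias, source.uid, source.tuple_uid, tuple(out_fields))


# a2:bag#2{t#3(x#4, y#5)} from the examples above
a2 = Bag("a2", 2, 3, (Field("x", 4), Field("y", 5)))
uids = count(6)  # assumption: uids 1-5 are already taken by the input schema

kept = regenerate_bag("c", a2, ["LOFilter"], a2.fields, uids)
fresh = regenerate_bag("c", a2, ["LOForEach"], [Field("x", 4)], uids)
```

Here kept models the filter case, where c keeps bag#2 and t#3, while fresh models the c = a2.x case, where the bag and its tuple get new uids but x keeps #4.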
> IndexOutOfBoundsException: while projecting fields from a bag
> -------------------------------------------------------------
>
> Key: PIG-2563
> URL: https://issues.apache.org/jira/browse/PIG-2563
> Project: Pig
> Issue Type: Bug
> Affects Versions: 0.9.1, 0.10
> Reporter: Vivek Padmanabhan
> Assignee: Daniel Dai
> Fix For: 0.10, 0.11
>
> Attachments: PIG-2563-1.patch
>
>
> The below script fails with Pig 0.9 / Pig 0.10 but works fine for Pig 0.8.
> {code}
> A = load 'i1' as (a,b,c:chararray);
> B = load 'i2' as (d,e,f:chararray);
> C = cogroup A by a, B by d;
> D = foreach C {
> tmp = B.d;
> tmp_dis = distinct tmp;
> generate A,B,tmp_dis ; } ;
> E = foreach D generate B.(d,e) as v;
> dump E;
> {code}
> The script fails with the below exception. Looks like DereferenceExpression
> is using wrong schema to build inner schema.
> java.lang.IndexOutOfBoundsException: Index: 1, Size: 1
> at java.util.ArrayList.RangeCheck(ArrayList.java:547)
> at java.util.ArrayList.get(ArrayList.java:322)
> at org.apache.pig.newplan.logical.relational.LogicalSchema.getField(LogicalSchema.java:653)
> at org.apache.pig.newplan.logical.expression.DereferenceExpression.getFieldSchema(DereferenceExpression.java:167)
> at org.apache.pig.newplan.logical.relational.LOGenerate.getSchema(LOGenerate.java:88)
> at org.apache.pig.newplan.logical.visitor.TypeCheckingRelVisitor.visit(TypeCheckingRelVisitor.java:160)
> at org.apache.pig.newplan.logical.relational.LOGenerate.accept(LOGenerate.java:242)