[ https://issues.apache.org/jira/browse/PIG-2970?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13487965#comment-13487965 ]
Rohini Palaniswamy commented on PIG-2970:
-----------------------------------------
Gianmarco,
Thanks for the pointer. They are related. The plan for your example in
PIG-2219 gets generated as
{noformat}
a = load 'b.txt' AS (id:chararray, num:int);
b = group a by id;
c = foreach b {
d = order a by num DESC;
n = COUNT(a);
e = limit d 1;
generate n;
}
|---c: (Name: LOForEach Schema: id#1:chararray,num#2:int)
| |
| e: (Name: LOLimit Schema: id#1:chararray,num#2:int)
| |
| |---d: (Name: LOSort Schema: id#1:chararray,num#2:int)
| | |
| | num:(Name: Project Type: int Uid: 2 Input: 0 Column: 1)
| |
| |---a: (Name: LOInnerLoad[1] Schema: id#1:chararray,num#2:int)
|
| (Name: LOGenerate[false] Schema: #6:long)
| | |
| | (Name: UserFunc(org.apache.pig.builtin.COUNT) Type: long Uid: 6)
| | |
| | |---a:(Name: Project Type: bag Uid: 3 Input: 0 Column: (*))
| |
| |---a: (Name: LOInnerLoad[1] Schema: id#1:chararray,num#2:int)
For the query in this example:
|---c: (Name: LOForEach Schema: v1#2:bytearray)
| |
| e: (Name: LODistinct Schema: v1#2:bytearray)
| |
| |---d: (Name: LOForEach Schema: v1#2:bytearray)
| | |
| | (Name: LOGenerate[false] Schema: v1#2:bytearray)
| | | |
| | | v1:(Name: Project Type: bytearray Uid: 2 Input: 0 Column: (*))
| | |
| | |---(Name: LOInnerLoad[1] Schema: v1#2:bytearray)
| |
| |---a: (Name: LOInnerLoad[1] Schema: id#1:bytearray,v1#2:bytearray)
|
| (Name: LOGenerate[false,false] Schema: group#1:bytearray,a#3:bag{#5:tuple(id#1:bytearray,v1#2:bytearray)})
| | |
| | group:(Name: Project Type: bytearray Uid: 1 Input: 0 Column: (*))
| | |
| | a:(Name: Project Type: bag Uid: 3 Input: 1 Column: (*))
| |
| |---(Name: LOInnerLoad[0] Schema: group#1:bytearray)
| |
| |---a: (Name: LOInnerLoad[1] Schema: id#1:bytearray,v1#2:bytearray)
{noformat}
The problem is that the schema for the ForEach gets set based on the first
leaf of its inner plan. In your case, the schemas of both leaves contained the
same required fields, so there was no error. In Koji's case the schemas of the
two leaves differ, hence the error.
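
To make the first-leaf behaviour concrete, here is a toy model in Python. It is only a sketch and does not use Pig's real classes: `Leaf` and `foreach_schema_first_leaf` are made-up names, and the field names are taken from Koji's script in the issue description.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    """Hypothetical stand-in for a leaf operator of the nested plan."""
    name: str
    schema: tuple


def foreach_schema_first_leaf(leaves):
    # Mimics the behaviour described above: the ForEach schema is
    # simply copied from whichever leaf comes first.
    return leaves[0].schema


# Leaves for Koji's script: the unused distinct, then the real generate.
koji_leaves = [
    Leaf("LODistinct", ("symbol",)),
    Leaf("LOGenerate", ("group", "daily")),
]

schema = foreach_schema_first_leaf(koji_leaves)
# Projecting 'group' from this schema fails, as in the reported error.
group_is_projectable = "group" in schema
```

In this model the ForEach ends up with the schema of the distinct leaf, so a later projection of `group` cannot be resolved.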
Koji,
Connecting the two separate nodes just to get the schema correct changes
the schema in such a way that the node is not dangling anymore. My take on
this is that we should move the DanglingNestedNodeRemover (which was written to
handle this scenario) from HExecutionEngine to
LogicalPlanBuilder.buildForeachOp(), before the SchemaResetter is called
through expandAndResetVisitor, so that the dangling node is removed during
construction itself and the correct schema is set by SchemaResetter. Thoughts?
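
The effect of the proposed ordering can be sketched with a toy model in Python. This is not Pig code: `Leaf`, `remove_dangling` and `reset_schema` are hypothetical names, and the sketch assumes that a leaf counts as dangling when it is not the generate operator.

```python
from dataclasses import dataclass


@dataclass
class Leaf:
    """Hypothetical stand-in for a leaf operator of the nested plan."""
    name: str
    schema: tuple


def remove_dangling(leaves):
    # Assumption: any leaf other than the generate operator is dangling.
    return [leaf for leaf in leaves if leaf.name == "LOGenerate"]


def reset_schema(leaves):
    # Toy version of the first-leaf rule: take the schema of the first leaf.
    return leaves[0].schema


# Leaves for Koji's script: the unused distinct, then the real generate.
leaves = [
    Leaf("LODistinct", ("symbol",)),
    Leaf("LOGenerate", ("group", "daily")),
]

schema_before = reset_schema(leaves)                   # schema of the dangling leaf
schema_after = reset_schema(remove_dangling(leaves))   # schema of the generate
```

With the dangling leaf removed first, the only remaining leaf is the generate, so the schema reset picks up `group` and `daily`.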
> Nested foreach getting incorrect schema when having unrelated inner query
> -------------------------------------------------------------------------
>
> Key: PIG-2970
> URL: https://issues.apache.org/jira/browse/PIG-2970
> Project: Pig
> Issue Type: Bug
> Components: parser
> Affects Versions: 0.10.0
> Reporter: Koji Noguchi
> Assignee: Koji Noguchi
> Priority: Minor
> Fix For: 0.11, 0.12
>
> Attachments: pig-2970-trunk-v01.txt
>
>
> While looking at PIG-2968, hit a weird error message.
> {noformat}
> $ cat -n test/foreach2.pig
> 1 daily = load 'nyse' as (exchange, symbol);
> 2 grpd = group daily by exchange;
> 3 unique = foreach grpd {
> 4 sym = daily.symbol;
> 5 uniq_sym = distinct sym;
> 6 --ignoring uniq_sym result
> 7 generate group, daily;
> 8 };
> 9 describe unique;
> 10 zzz = foreach unique generate group;
> 11 explain zzz;
> % pig -x local -t ColumnMapKeyPrune test/foreach2.pig
> ...
> unique: {symbol: bytearray}
> 2012-10-12 16:55:44,226 [main] ERROR org.apache.pig.tools.grunt.Grunt - ERROR 1025:
> <file test/foreach2.pig, line 10, column 30> Invalid field projection. Projected field [group] does not exist in schema: symbol:bytearray.
> ...
> {noformat}