[ https://issues.apache.org/jira/browse/PIG-3310?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Clément Stenac updated PIG-3310:
--------------------------------
    Attachment: generate-uid-for-nested-fields.patch

> ImplicitSplitInserter does not generate new uids for nested schema fields, leading to miscomputations
> -----------------------------------------------------------------------------------------------------
>
> Key: PIG-3310
> URL: https://issues.apache.org/jira/browse/PIG-3310
> Project: Pig
> Issue Type: Bug
> Components: impl
> Affects Versions: 0.11.1
> Environment: Reproduced on 0.10.1, 0.11.1 and trunk
> Reporter: Clément Stenac
> Attachments: generate-uid-for-nested-fields.patch
>
>
> Hi,
> Consider the following example:
> {code}
> inp = LOAD '$INPUT' AS (memberId:long, shopId:long, score:int);
> tuplified = FOREACH inp GENERATE (memberId, shopId) AS tuplify, score;
> D1 = FOREACH tuplified GENERATE tuplify.memberId AS memberId, tuplify.shopId AS shopId, score AS score;
> D2 = FOREACH tuplified GENERATE tuplify.memberId AS memberId, tuplify.shopId AS shopId, score AS score;
> J = JOIN D1 BY shopId, D2 BY shopId;
> K = FOREACH J GENERATE D1::memberId AS member_id1, D2::memberId AS member_id2, D1::shopId AS shop;
> EXPLAIN K;
> DUMP K;
> {code}
> It looks a bit odd written like that, but it is a minimal reproduction
> case (in the real case, the "tuplified" step came from a multi-key grouping).
> On input data:
> {code}
> 1 1001 101
> 1 1002 103
> 1 1003 102
> 1 1004 102
> 2 1005 101
> 2 1003 101
> 2 1002 123
> 3 1042 101
> 3 1005 101
> 3 1002 133
> {code}
> This produces incorrect output such as:
> {code}
> (1,1001,1001)
> (1,1002,1002)
> (1,1002,1002)
> (1,1002,1002)
> {code}
> The second column should be a member id (a small value such as 1, 2 or 3),
> not a shop id.
> In the original case, there was a FILTER (member_id1 < member_id2) after K,
> and the computation failed because the PushUpFilter optimization mistakenly
> moved the LOFilter operation before the join, to a place where it tried to
> operate on a tuple and failed.
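For comparison, the result that K should contain can be computed outside Pig. The following is only a sketch in Python, not Pig code: it assumes the three input columns are memberId, shopId and score as declared in the LOAD statement, and the helper name `expected_k` is made up for this illustration.

```python
# Sample rows from the report: (memberId, shopId, score)
rows = [
    (1, 1001, 101), (1, 1002, 103), (1, 1003, 102), (1, 1004, 102),
    (2, 1005, 101), (2, 1003, 101), (2, 1002, 123),
    (3, 1042, 101), (3, 1005, 101), (3, 1002, 133),
]


def expected_k(rows):
    """Self-join on shopId, projecting (member_id1, member_id2, shop).

    Hypothetical helper: mirrors what the script's J and K relations
    are meant to compute, using a plain nested loop.
    """
    return sorted(
        (m1, m2, s1)
        for (m1, s1, _score1) in rows
        for (m2, s2, _score2) in rows
        if s1 == s2
    )


result = expected_k(rows)
for row in result:
    print(row)
```

For the sample data this yields 20 rows, and the second column only ever holds the member ids 1, 2 or 3, never a shop id.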
> My understanding of the issue is that when the ImplicitSplitInserter creates
> the LOSplitOutputs, it correctly resets the schema, and the LOSplitOutput
> regenerates uids for the top-level fields feeding D1 and D2, but does not
> do so for the tuple members.
> The logical plan after the ImplicitSplitInserter looks like this (simplified):
> {code}
> |---D1: (Name: LOForEach Schema: memberId#124:long,shopId#125:long)ColumnPrune:InputUids=[127]ColumnPrune:OutputUids=[125, 124]
>     |---tuplified: (Name: LOSplitOutput Schema: tuplify#127:tuple(memberId#124:long,shopId#125:long))ColumnPrune:InputUids=[123]ColumnPrune:OutputUids=[127]
>         |---tuplified: (Name: LOSplit Schema: tuplify#123:tuple(memberId#124:long,shopId#125:long))ColumnPrune:InputUids=[123]ColumnPrune:OutputUids=[123]
> |---D2: (Name: LOForEach Schema: memberId#124:long,shopId#125:long)ColumnPrune:InputUids=[130]ColumnPrune:OutputUids=[125, 124]
>     |---tuplified: (Name: LOSplitOutput Schema: tuplify#130:tuple(memberId#124:long,shopId#125:long))ColumnPrune:InputUids=[123]ColumnPrune:OutputUids=[130]
>         |---tuplified: (Name: LOSplit Schema: tuplify#123:tuple(memberId#124:long,shopId#125:long))ColumnPrune:InputUids=[123]ColumnPrune:OutputUids=[123]
> {code}
> The tuplify field correctly gets a new uid in each branch (127 and 130), but
> the members of the tuple do not. When they are reprojected, both branches
> carry the same uids and the join looks like:
> {code}
> |---J: (Name: LOJoin(HASH) Schema: D1::memberId#124:long,D1::shopId#125:long,D2::memberId#139:long,D2::shopId#132:long)ColumnPrune:InputUids=[125, 124, 132]ColumnPrune:OutputUids=[125, 124, 132]
> |   |
> |   shopId:(Name: Project Type: long Uid: 125 Input: 0 Column: 1)
> |   |
> |   shopId:(Name: Project Type: long Uid: 125 Input: 1 Column: 1)
> {code}
> If, for example, we project "memberId+0" instead of reprojecting "memberId",
> a new node is created, and ultimately the two branches of the join correctly
> get separate uids.
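The uid collision described above can be illustrated with a toy model. This is a simplified sketch in Python, not Pig's actual Java classes: `SchemaField`, `reset_top_level` and `reset_recursive` are hypothetical names, and the counter start value is arbitrary. It only shows why regenerating the top-level uid alone leaves both split branches sharing nested uids, and how recursing into nested fields separates them.

```python
import itertools
from dataclasses import dataclass, field
from typing import List

# Arbitrary starting point for fresh uids (assumption for this sketch).
_uid_counter = itertools.count(200)


@dataclass
class SchemaField:
    """Toy stand-in for a schema field with an optional nested schema."""
    name: str
    uid: int
    children: List["SchemaField"] = field(default_factory=list)


def reset_top_level(f: SchemaField) -> SchemaField:
    """New uid for the field itself only; nested fields keep their uids."""
    return SchemaField(f.name, next(_uid_counter), f.children)


def reset_recursive(f: SchemaField) -> SchemaField:
    """New uid for the field and, recursively, for every nested field."""
    return SchemaField(
        f.name,
        next(_uid_counter),
        [reset_recursive(child) for child in f.children],
    )


def nested_uids(f: SchemaField) -> List[int]:
    """Uids of the direct members of a tuple-typed field."""
    return [child.uid for child in f.children]


# Schema of the split input, using the uids from the plan in the report.
tuplify = SchemaField(
    "tuplify", 123,
    [SchemaField("memberId", 124), SchemaField("shopId", 125)],
)

# Behaviour described in the report: one split output per branch.
shallow_d1 = reset_top_level(tuplify)
shallow_d2 = reset_top_level(tuplify)

# Behaviour with recursion into the nested schema.
deep_d1 = reset_recursive(tuplify)
deep_d2 = reset_recursive(tuplify)

print(nested_uids(shallow_d1), nested_uids(shallow_d2))  # identical lists
print(nested_uids(deep_d1), nested_uids(deep_d2))        # disjoint lists
```

In the shallow case both branches still expose memberId as 124 and shopId as 125, which matches the duplicated `Uid: 125` on both join inputs in the plan above.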
> My understanding is that LOSplitOutput.getSchema() should recurse into nested
> schema fields. However, I only have a limited understanding of the logical
> plan handling, so I may be completely wrong.
> Attached is a draft patch and a test reproducing the issue. Unfortunately,
> I haven't been able to run all unit tests with the "fix" (I get some
> unexplained hangs).
> I'd be grateful if you could tell me whether this looks like completely the
> wrong way to fix the issue.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira