[jira] [Commented] (PIG-3347) Store invocation brings side effect
[ https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895148#comment-13895148 ] Koji Noguchi commented on PIG-3347: --- bq. It could but introduce a lot of complications. Currently only LOForEach/LOSplitOutput is dealing with dup-uid, otherwise it will sprawl to all operators and all optimizer rules. Thanks Daniel. This helps me understand why I always get confused on this. Maybe someday I can separate the two. As for your latest patch(PIG-3347-5.patch), it passed the unit tests(including mine) and e2e was fine also. I'm +1. > Store invocation brings side effect > --- > > Key: PIG-3347 > URL: https://issues.apache.org/jira/browse/PIG-3347 > Project: Pig > Issue Type: Bug > Components: grunt >Affects Versions: 0.11 > Environment: local mode >Reporter: Sergey >Assignee: Daniel Dai >Priority: Critical > Fix For: 0.12.1 > > Attachments: PIG-3347-1.patch, PIG-3347-2-testonly.patch, > PIG-3347-3.patch, PIG-3347-4-testonly.patch, PIG-3347-5.patch > > > The problem is that intermediate 'store' invocation "changes" the final store > output. Looks like it brings some kind of side effect. We did use 'local' > mode to run script > here is the input data: > 1 > 1 > Here is the script: > {code} > a = load 'test'; > a_group = group a by $0; > b = foreach a_group { > a_distinct = distinct a.$0; > generate group, a_distinct; > } > --store b into 'b'; > c = filter b by SIZE(a_distinct) == 1; > store c into 'out'; > {code} > We expect output to be: > 1 1 > The output is empty file. > Uncomment {code}--store b into 'b';{code} line and see the diffrence. > Yuo would get expected output. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3347) Store invocation brings side effect
[ https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13891178#comment-13891178 ] Daniel Dai commented on PIG-3347: - It could but introduce a lot of complications. Currently only LOForEach/LOSplitOutput is dealing with dup-uid, otherwise it will sprawl to all operators and all optimizer rules. > Store invocation brings side effect > --- > > Key: PIG-3347 > URL: https://issues.apache.org/jira/browse/PIG-3347 > Project: Pig > Issue Type: Bug > Components: grunt >Affects Versions: 0.11 > Environment: local mode >Reporter: Sergey >Assignee: Daniel Dai >Priority: Critical > Fix For: 0.12.1 > > Attachments: PIG-3347-1.patch, PIG-3347-2-testonly.patch, > PIG-3347-3.patch, PIG-3347-4-testonly.patch, PIG-3347-5.patch > > > The problem is that intermediate 'store' invocation "changes" the final store > output. Looks like it brings some kind of side effect. We did use 'local' > mode to run script > here is the input data: > 1 > 1 > Here is the script: > {code} > a = load 'test'; > a_group = group a by $0; > b = foreach a_group { > a_distinct = distinct a.$0; > generate group, a_distinct; > } > --store b into 'b'; > c = filter b by SIZE(a_distinct) == 1; > store c into 'out'; > {code} > We expect output to be: > 1 1 > The output is empty file. > Uncomment {code}--store b into 'b';{code} line and see the diffrence. > Yuo would get expected output. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3347) Store invocation brings side effect
[ https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13891127#comment-13891127 ] Koji Noguchi commented on PIG-3347: --- bq. we will need to generate a new uid for col2 to avoid uid conflict (using a UDF IdentityColumn) Daniel, I think I understand how it is being used, but my confusion is: for the pure purpose of tracking column lineage, shouldn't the redundant uid inside the relation be allowed? Isn't the requirement of no-conflict-uid coming from using the same uid for ProjectionPatcher which serves a different purpose than the lineage tracking? > Store invocation brings side effect > --- > > Key: PIG-3347 > URL: https://issues.apache.org/jira/browse/PIG-3347 > Project: Pig > Issue Type: Bug > Components: grunt >Affects Versions: 0.11 > Environment: local mode >Reporter: Sergey >Assignee: Daniel Dai >Priority: Critical > Fix For: 0.12.1 > > Attachments: PIG-3347-1.patch, PIG-3347-2-testonly.patch, > PIG-3347-3.patch, PIG-3347-4-testonly.patch, PIG-3347-5.patch > > > The problem is that intermediate 'store' invocation "changes" the final store > output. Looks like it brings some kind of side effect. We did use 'local' > mode to run script > here is the input data: > 1 > 1 > Here is the script: > {code} > a = load 'test'; > a_group = group a by $0; > b = foreach a_group { > a_distinct = distinct a.$0; > generate group, a_distinct; > } > --store b into 'b'; > c = filter b by SIZE(a_distinct) == 1; > store c into 'out'; > {code} > We expect output to be: > 1 1 > The output is empty file. > Uncomment {code}--store b into 'b';{code} line and see the diffrence. > Yuo would get expected output. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3347) Store invocation brings side effect
[ https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13891032#comment-13891032 ] Daniel Dai commented on PIG-3347: - [~knoguchi], in the "B = foreach A generate a as col1, a as col2; ", we will need to generate a new uid for col2 to avoid uid conflict (using a UDF IdentityColumn). The downside is this will break the lineage chain. The uid is mostly used in optimizer, there several holes when we use it for pure lineage. Optimizer rules is expected to live with these holes by skip optimize (eg, PushUpFilter is skip the foreach with UDF, which include IdentityColumn aiming to fix the uid conflict) > Store invocation brings side effect > --- > > Key: PIG-3347 > URL: https://issues.apache.org/jira/browse/PIG-3347 > Project: Pig > Issue Type: Bug > Components: grunt >Affects Versions: 0.11 > Environment: local mode >Reporter: Sergey >Assignee: Daniel Dai >Priority: Critical > Fix For: 0.12.1 > > Attachments: PIG-3347-1.patch, PIG-3347-2-testonly.patch, > PIG-3347-3.patch, PIG-3347-4-testonly.patch, PIG-3347-5.patch > > > The problem is that intermediate 'store' invocation "changes" the final store > output. Looks like it brings some kind of side effect. We did use 'local' > mode to run script > here is the input data: > 1 > 1 > Here is the script: > {code} > a = load 'test'; > a_group = group a by $0; > b = foreach a_group { > a_distinct = distinct a.$0; > generate group, a_distinct; > } > --store b into 'b'; > c = filter b by SIZE(a_distinct) == 1; > store c into 'out'; > {code} > We expect output to be: > 1 1 > The output is empty file. > Uncomment {code}--store b into 'b';{code} line and see the diffrence. > Yuo would get expected output. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3347) Store invocation brings side effect
[ https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890942#comment-13890942 ] Koji Noguchi commented on PIG-3347: --- bq. UID is to track column lineage so in logical optimizer, so that we can freely move operate up and down, ProjectionPatcher will reposition the column according to uid I think part of my confusion comes from these two. UID is used for (1) tracking column lineage. (2) UID is also used for ProjectionPatcher to reposition therefore requiring UID to be unique within each relation. Because of (2), we're seeing new uid being created whenever column is referenced multiple times. Like A = load 'a.txt' as (a:int); B = foreach A generate a as col1, a as col2; This would create a schema like {noformat} 1-2: (Name: LOStore Schema: col1#1:int,col2#2:int) ... |---A: (Name: LOLoad Schema: a#1:int)RequiredFields:null {noformat} So without traversing the lineage, I cannot connect 'col2' to original 'a'. However, optimizer like PushUpFilter&FilterAboveForeach seems to be using just UID to determine the field usages... But this is outside of this jira. I need to spend more time learning how the pig compiler works. > Store invocation brings side effect > --- > > Key: PIG-3347 > URL: https://issues.apache.org/jira/browse/PIG-3347 > Project: Pig > Issue Type: Bug > Components: grunt >Affects Versions: 0.11 > Environment: local mode >Reporter: Sergey >Assignee: Daniel Dai >Priority: Critical > Fix For: 0.12.1 > > Attachments: PIG-3347-1.patch, PIG-3347-2-testonly.patch, > PIG-3347-3.patch, PIG-3347-4-testonly.patch > > > The problem is that intermediate 'store' invocation "changes" the final store > output. Looks like it brings some kind of side effect. We did use 'local' > mode to run script > here is the input data: > 1 > 1 > Here is the script: > {code} > a = load 'test'; > a_group = group a by $0; > b = foreach a_group { > a_distinct = distinct a.$0; > generate group, a_distinct; > } > --store b into 'b'; > c = filter b by SIZE(a_distinct) == 1; > store c into 'out'; > {code} > We expect output to be: > 1 1 > The output is empty file. > Uncomment {code}--store b into 'b';{code} line and see the diffrence. > Yuo would get expected output. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3347) Store invocation brings side effect
[ https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890903#comment-13890903 ] Daniel Dai commented on PIG-3347: - All unit tests pass with the patch. > Store invocation brings side effect > --- > > Key: PIG-3347 > URL: https://issues.apache.org/jira/browse/PIG-3347 > Project: Pig > Issue Type: Bug > Components: grunt >Affects Versions: 0.11 > Environment: local mode >Reporter: Sergey >Assignee: Daniel Dai >Priority: Critical > Fix For: 0.12.1 > > Attachments: PIG-3347-1.patch, PIG-3347-2-testonly.patch, > PIG-3347-3.patch > > > The problem is that intermediate 'store' invocation "changes" the final store > output. Looks like it brings some kind of side effect. We did use 'local' > mode to run script > here is the input data: > 1 > 1 > Here is the script: > {code} > a = load 'test'; > a_group = group a by $0; > b = foreach a_group { > a_distinct = distinct a.$0; > generate group, a_distinct; > } > --store b into 'b'; > c = filter b by SIZE(a_distinct) == 1; > store c into 'out'; > {code} > We expect output to be: > 1 1 > The output is empty file. > Uncomment {code}--store b into 'b';{code} line and see the diffrence. > Yuo would get expected output. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3347) Store invocation brings side effect
[ https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887236#comment-13887236 ] Daniel Dai commented on PIG-3347: - UID is to track column lineage so in logical optimizer, so that we can freely move operate up and down, ProjectionPatcher will reposition the column according to uid, even if the column get reordered. A new source of data should have a new UID, that's the case for nested LOForEach/LODistinct, since they are not directly derived from the previous operator, instead, it is a new field generated by the foreach. > Store invocation brings side effect > --- > > Key: PIG-3347 > URL: https://issues.apache.org/jira/browse/PIG-3347 > Project: Pig > Issue Type: Bug > Components: grunt >Affects Versions: 0.11 > Environment: local mode >Reporter: Sergey >Assignee: Daniel Dai >Priority: Critical > Fix For: 0.12.1 > > Attachments: PIG-3347-1.patch > > > The problem is that intermediate 'store' invocation "changes" the final store > output. Looks like it brings some kind of side effect. We did use 'local' > mode to run script > here is the input data: > 1 > 1 > Here is the script: > {code} > a = load 'test'; > a_group = group a by $0; > b = foreach a_group { > a_distinct = distinct a.$0; > generate group, a_distinct; > } > --store b into 'b'; > c = filter b by SIZE(a_distinct) == 1; > store c into 'out'; > {code} > We expect output to be: > 1 1 > The output is empty file. > Uncomment {code}--store b into 'b';{code} line and see the diffrence. > Yuo would get expected output. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3347) Store invocation brings side effect
[ https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887203#comment-13887203 ] Julien Le Dem commented on PIG-3347: I thought that the field UIDs were used to track lineage across the plan. [~aniket486] correct me if I'm wrong but it is used to determine which fields are reads for projection push down. In the case of self join (directly or indirectly) we end up with duplicate ids in the same relation because the same field is derived to 2 different fields. Otherwise I'm as lost as [~knoguchi] regarding the actual mechanisms around the UID. I tried to fix some of these in the past (PIG-3020) but it appears they created more problems (PIG-3492) [~daijy] maybe you can enlighten us? > Store invocation brings side effect > --- > > Key: PIG-3347 > URL: https://issues.apache.org/jira/browse/PIG-3347 > Project: Pig > Issue Type: Bug > Components: grunt >Affects Versions: 0.11 > Environment: local mode >Reporter: Sergey >Assignee: Daniel Dai >Priority: Critical > Fix For: 0.12.1 > > Attachments: PIG-3347-1.patch > > > The problem is that intermediate 'store' invocation "changes" the final store > output. Looks like it brings some kind of side effect. We did use 'local' > mode to run script > here is the input data: > 1 > 1 > Here is the script: > {code} > a = load 'test'; > a_group = group a by $0; > b = foreach a_group { > a_distinct = distinct a.$0; > generate group, a_distinct; > } > --store b into 'b'; > c = filter b by SIZE(a_distinct) == 1; > store c into 'out'; > {code} > We expect output to be: > 1 1 > The output is empty file. > Uncomment {code}--store b into 'b';{code} line and see the diffrence. > Yuo would get expected output. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3347) Store invocation brings side effect
[ https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887114#comment-13887114 ] Dmitriy V. Ryaboy commented on PIG-3347: Yikes. [~aniket486] & [~julienledem] this seems like a critical bug to look at. Julien, you investigated this UID situation before, right? > Store invocation brings side effect > --- > > Key: PIG-3347 > URL: https://issues.apache.org/jira/browse/PIG-3347 > Project: Pig > Issue Type: Bug > Components: grunt >Affects Versions: 0.11 > Environment: local mode >Reporter: Sergey >Assignee: Daniel Dai >Priority: Critical > Fix For: 0.12.1 > > Attachments: PIG-3347-1.patch > > > The problem is that intermediate 'store' invocation "changes" the final store > output. Looks like it brings some kind of side effect. We did use 'local' > mode to run script > here is the input data: > 1 > 1 > Here is the script: > {code} > a = load 'test'; > a_group = group a by $0; > b = foreach a_group { > a_distinct = distinct a.$0; > generate group, a_distinct; > } > --store b into 'b'; > c = filter b by SIZE(a_distinct) == 1; > store c into 'out'; > {code} > We expect output to be: > 1 1 > The output is empty file. > Uncomment {code}--store b into 'b';{code} line and see the diffrence. > Yuo would get expected output. -- This message was sent by Atlassian JIRA (v6.1.5#6160)
[jira] [Commented] (PIG-3347) Store invocation brings side effect
[ https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817385#comment-13817385 ] Koji Noguchi commented on PIG-3347: --- I wish there was a wiki/document describing how UID should be assigned. Even after going through PIG-3492, I'm still lost on when exactly we should assign new UIDs. My current understanding(or guess) is, UID represents an uniqueness within a record. Just by looking at the UIDs from the two separate relations(bags), we can tell if the fields were altered or not. (Although we cannot tell if the tuples were filtered or not.) FilterAboveForeach(PushUpFilter) is using this property to determine if FILTER can be moved before the foreach. Bug here is, nested distinct is not assigning a new UID for the bag it creates so FilterAboveForeach mistakenly thinks that no fields were altered within the foreach and decides to move this filter upfront. Following show the schema BEFORE calling PushUpFilter/FilterAboveForeach without and with Daniel's patch. We can see that after applying the patch, relation 'c' and 'a_group' contain different UIDs for the bag. {noformat} (without the patch) |---c: (Name: LOFilter Schema: group#11:bytearray,a_distinct#12:bag{#13:tuple(#14:bytearray)}) |---b: (Name: LOForEach Schema: group#11:bytearray,a_distinct#12:bag{#13:tuple(#14:bytearray)}) |---a_group: (Name: LOCogroup Schema: group#11:bytearray,a#12:bag{#13:tuple()}) (with the patch) |---c: (Name: LOFilter Schema: group#15:bytearray,a_distinct#20:bag{#19:tuple(#18:bytearray)}) |---b: (Name: LOForEach Schema: group#15:bytearray,a_distinct#20:bag{#19:tuple(#18:bytearray)}) |---a_group: (Name: LOCogroup Schema: group#15:bytearray,a#16:bag{#17:tuple()}) {noformat} So I think the patch fixes the bug described on the jira nicely. However, question remains for other nested operations. I believe the same bug can appear for nested LIMIT and nested FILTER. For example, {noformat} a = load 'test.txt'; a_group = group a by $0; b = foreach a_group { a_limit = limit a.$0 5; generate group, a_limit; } c = filter b by SIZE(a_limit) == 5; store c into 'out'; {noformat} {noformat} a = load 'test3.txt' as (a0, a1); a_group = group a by a0; b = foreach a_group { newA = filter a by a1 == 2; generate group, newA; } c = filter b by SIZE(newA) == 5; store c into 'out'; {noformat} I confirmed these two examples also mistakenly push the filter before foreach and produce empty results. Former case, nested LIMIT, is actually covered with the current patch since nested LIMIT uses LOLIMIT+LOForeach. So the patch {noformat} + 98 // If it is nested foreach or nested distinct, generate new uid + 99 if (op instanceof LOForEach || op instanceof LODistinct) { + 100 needNewUid = true; + 101 } {noformat} takes care of nested limit although comment doesn't mention it. Nested filter is not the case here and the bug still exists after the current patch. Can we cover this case as well? > Store invocation brings side effect > --- > > Key: PIG-3347 > URL: https://issues.apache.org/jira/browse/PIG-3347 > Project: Pig > Issue Type: Bug > Components: grunt >Affects Versions: 0.11 > Environment: local mode >Reporter: Sergey >Assignee: Daniel Dai > Fix For: 0.12.1 > > Attachments: PIG-3347-1.patch > > > The problem is that intermediate 'store' invocation "changes" the final store > output. Looks like it brings some kind of side effect. We did use 'local' > mode to run script > here is the input data: > 1 > 1 > Here is the script: > {code} > a = load 'test'; > a_group = group a by $0; > b = foreach a_group { > a_distinct = distinct a.$0; > generate group, a_distinct; > } > --store b into 'b'; > c = filter b by SIZE(a_distinct) == 1; > store c into 'out'; > {code} > We expect output to be: > 1 1 > The output is empty file. > Uncomment {code}--store b into 'b';{code} line and see the diffrence. > Yuo would get expected output. -- This message was sent by Atlassian JIRA (v6.1#6144)