[jira] [Commented] (PIG-3347) Store invocation brings side effect

2014-02-07 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13895148#comment-13895148
 ] 

Koji Noguchi commented on PIG-3347:
---

bq. It could but introduce a lot of complications. Currently only 
LOForEach/LOSplitOutput is dealing with dup-uid, otherwise it will sprawl to 
all operators and all optimizer rules.

Thanks Daniel.  This helps me understand why I always get confused on this.  
Maybe someday I can separate the two.

As for your latest patch(PIG-3347-5.patch), it passed the unit tests(including 
mine) and e2e was fine also.  I'm +1.

> Store invocation brings side effect
> ---
>
> Key: PIG-3347
> URL: https://issues.apache.org/jira/browse/PIG-3347
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.11
> Environment: local mode
>Reporter: Sergey
>Assignee: Daniel Dai
>Priority: Critical
> Fix For: 0.12.1
>
> Attachments: PIG-3347-1.patch, PIG-3347-2-testonly.patch, 
> PIG-3347-3.patch, PIG-3347-4-testonly.patch, PIG-3347-5.patch
>
>
> The problem is that intermediate 'store' invocation "changes" the final store 
> output. Looks like it brings some kind of side effect. We did use 'local' 
> mode to run script
> here is the input data:
> 1
> 1
> Here is the script:
> {code}
> a = load 'test';
> a_group = group a by $0;
> b = foreach a_group {
>   a_distinct = distinct a.$0;
>   generate group, a_distinct;
> }
> --store b into 'b';
> c = filter b by SIZE(a_distinct) == 1;
> store c into 'out';
> {code}
> We expect output to be:
> 1 1
> The output is empty file.
> Uncomment {code}--store b into 'b';{code} line and see the diffrence.
> Yuo would get expected output.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (PIG-3347) Store invocation brings side effect

2014-02-04 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13891178#comment-13891178
 ] 

Daniel Dai commented on PIG-3347:
-

It could but introduce a lot of complications. Currently only 
LOForEach/LOSplitOutput is dealing with dup-uid, otherwise it will sprawl to 
all operators and all optimizer rules.

> Store invocation brings side effect
> ---
>
> Key: PIG-3347
> URL: https://issues.apache.org/jira/browse/PIG-3347
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.11
> Environment: local mode
>Reporter: Sergey
>Assignee: Daniel Dai
>Priority: Critical
> Fix For: 0.12.1
>
> Attachments: PIG-3347-1.patch, PIG-3347-2-testonly.patch, 
> PIG-3347-3.patch, PIG-3347-4-testonly.patch, PIG-3347-5.patch
>
>
> The problem is that intermediate 'store' invocation "changes" the final store 
> output. Looks like it brings some kind of side effect. We did use 'local' 
> mode to run script
> here is the input data:
> 1
> 1
> Here is the script:
> {code}
> a = load 'test';
> a_group = group a by $0;
> b = foreach a_group {
>   a_distinct = distinct a.$0;
>   generate group, a_distinct;
> }
> --store b into 'b';
> c = filter b by SIZE(a_distinct) == 1;
> store c into 'out';
> {code}
> We expect output to be:
> 1 1
> The output is empty file.
> Uncomment {code}--store b into 'b';{code} line and see the diffrence.
> Yuo would get expected output.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (PIG-3347) Store invocation brings side effect

2014-02-04 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13891127#comment-13891127
 ] 

Koji Noguchi commented on PIG-3347:
---

bq. we will need to generate a new uid for col2 to avoid uid conflict (using a 
UDF IdentityColumn)

Daniel, I think I understand how it is being used, but my confusion is: for the 
pure purpose of tracking column lineage, shouldn't the redundant uid inside the 
relation be allowed?  Isn't the requirement of no-conflict-uid coming from 
using the same uid for ProjectionPatcher which serves a different purpose than 
the lineage tracking?

> Store invocation brings side effect
> ---
>
> Key: PIG-3347
> URL: https://issues.apache.org/jira/browse/PIG-3347
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.11
> Environment: local mode
>Reporter: Sergey
>Assignee: Daniel Dai
>Priority: Critical
> Fix For: 0.12.1
>
> Attachments: PIG-3347-1.patch, PIG-3347-2-testonly.patch, 
> PIG-3347-3.patch, PIG-3347-4-testonly.patch, PIG-3347-5.patch
>
>
> The problem is that intermediate 'store' invocation "changes" the final store 
> output. Looks like it brings some kind of side effect. We did use 'local' 
> mode to run script
> here is the input data:
> 1
> 1
> Here is the script:
> {code}
> a = load 'test';
> a_group = group a by $0;
> b = foreach a_group {
>   a_distinct = distinct a.$0;
>   generate group, a_distinct;
> }
> --store b into 'b';
> c = filter b by SIZE(a_distinct) == 1;
> store c into 'out';
> {code}
> We expect output to be:
> 1 1
> The output is empty file.
> Uncomment {code}--store b into 'b';{code} line and see the diffrence.
> Yuo would get expected output.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (PIG-3347) Store invocation brings side effect

2014-02-04 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13891032#comment-13891032
 ] 

Daniel Dai commented on PIG-3347:
-

[~knoguchi], in the "B = foreach A generate a as col1, a as col2; ", we will 
need to generate a new uid for col2 to avoid uid conflict (using a UDF 
IdentityColumn). The downside is this will break the lineage chain. The uid is 
mostly used in optimizer, there several holes when we use it for pure lineage. 
Optimizer rules is expected to live with these holes by skip optimize (eg, 
PushUpFilter is skip the foreach with UDF, which include IdentityColumn aiming 
to fix the uid conflict)

> Store invocation brings side effect
> ---
>
> Key: PIG-3347
> URL: https://issues.apache.org/jira/browse/PIG-3347
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.11
> Environment: local mode
>Reporter: Sergey
>Assignee: Daniel Dai
>Priority: Critical
> Fix For: 0.12.1
>
> Attachments: PIG-3347-1.patch, PIG-3347-2-testonly.patch, 
> PIG-3347-3.patch, PIG-3347-4-testonly.patch, PIG-3347-5.patch
>
>
> The problem is that intermediate 'store' invocation "changes" the final store 
> output. Looks like it brings some kind of side effect. We did use 'local' 
> mode to run script
> here is the input data:
> 1
> 1
> Here is the script:
> {code}
> a = load 'test';
> a_group = group a by $0;
> b = foreach a_group {
>   a_distinct = distinct a.$0;
>   generate group, a_distinct;
> }
> --store b into 'b';
> c = filter b by SIZE(a_distinct) == 1;
> store c into 'out';
> {code}
> We expect output to be:
> 1 1
> The output is empty file.
> Uncomment {code}--store b into 'b';{code} line and see the diffrence.
> Yuo would get expected output.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (PIG-3347) Store invocation brings side effect

2014-02-04 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890942#comment-13890942
 ] 

Koji Noguchi commented on PIG-3347:
---

bq. UID is to track column lineage so in logical optimizer, so that we can 
freely move operate up and down,  ProjectionPatcher will reposition the column 
according to uid

I think part of my confusion comes from these two.  UID is used for (1) 
tracking column lineage.  (2) UID is also used for ProjectionPatcher to 
reposition therefore requiring UID to be unique within each relation.

Because of (2), we're seeing new uid being created whenever column is 
referenced multiple times.
Like 
A = load 'a.txt' as (a:int);
B = foreach A generate a as col1, a as col2; 

This would create a schema like 
{noformat}
1-2: (Name: LOStore Schema: col1#1:int,col2#2:int)
...
|---A: (Name: LOLoad Schema: a#1:int)RequiredFields:null
{noformat}

So without traversing the lineage, I cannot connect 'col2' to original 'a'.
However, optimizer like PushUpFilter&FilterAboveForeach seems to be using just 
UID to determine the field usages...

But this is outside of this jira.  I need to spend more time learning how the 
pig compiler works.

> Store invocation brings side effect
> ---
>
> Key: PIG-3347
> URL: https://issues.apache.org/jira/browse/PIG-3347
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.11
> Environment: local mode
>Reporter: Sergey
>Assignee: Daniel Dai
>Priority: Critical
> Fix For: 0.12.1
>
> Attachments: PIG-3347-1.patch, PIG-3347-2-testonly.patch, 
> PIG-3347-3.patch, PIG-3347-4-testonly.patch
>
>
> The problem is that intermediate 'store' invocation "changes" the final store 
> output. Looks like it brings some kind of side effect. We did use 'local' 
> mode to run script
> here is the input data:
> 1
> 1
> Here is the script:
> {code}
> a = load 'test';
> a_group = group a by $0;
> b = foreach a_group {
>   a_distinct = distinct a.$0;
>   generate group, a_distinct;
> }
> --store b into 'b';
> c = filter b by SIZE(a_distinct) == 1;
> store c into 'out';
> {code}
> We expect output to be:
> 1 1
> The output is empty file.
> Uncomment {code}--store b into 'b';{code} line and see the diffrence.
> Yuo would get expected output.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (PIG-3347) Store invocation brings side effect

2014-02-04 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13890903#comment-13890903
 ] 

Daniel Dai commented on PIG-3347:
-

All unit tests pass with the patch.

> Store invocation brings side effect
> ---
>
> Key: PIG-3347
> URL: https://issues.apache.org/jira/browse/PIG-3347
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.11
> Environment: local mode
>Reporter: Sergey
>Assignee: Daniel Dai
>Priority: Critical
> Fix For: 0.12.1
>
> Attachments: PIG-3347-1.patch, PIG-3347-2-testonly.patch, 
> PIG-3347-3.patch
>
>
> The problem is that intermediate 'store' invocation "changes" the final store 
> output. Looks like it brings some kind of side effect. We did use 'local' 
> mode to run script
> here is the input data:
> 1
> 1
> Here is the script:
> {code}
> a = load 'test';
> a_group = group a by $0;
> b = foreach a_group {
>   a_distinct = distinct a.$0;
>   generate group, a_distinct;
> }
> --store b into 'b';
> c = filter b by SIZE(a_distinct) == 1;
> store c into 'out';
> {code}
> We expect output to be:
> 1 1
> The output is empty file.
> Uncomment {code}--store b into 'b';{code} line and see the diffrence.
> Yuo would get expected output.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (PIG-3347) Store invocation brings side effect

2014-01-30 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887236#comment-13887236
 ] 

Daniel Dai commented on PIG-3347:
-

UID is to track column lineage so in logical optimizer, so that we can freely 
move operate up and down, ProjectionPatcher will reposition the column 
according to uid, even if the column get reordered. A new source of data should 
have a new UID, that's the case for nested LOForEach/LODistinct, since they are 
not directly derived from the previous operator, instead, it is a new field 
generated by the foreach.

> Store invocation brings side effect
> ---
>
> Key: PIG-3347
> URL: https://issues.apache.org/jira/browse/PIG-3347
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.11
> Environment: local mode
>Reporter: Sergey
>Assignee: Daniel Dai
>Priority: Critical
> Fix For: 0.12.1
>
> Attachments: PIG-3347-1.patch
>
>
> The problem is that intermediate 'store' invocation "changes" the final store 
> output. Looks like it brings some kind of side effect. We did use 'local' 
> mode to run script
> here is the input data:
> 1
> 1
> Here is the script:
> {code}
> a = load 'test';
> a_group = group a by $0;
> b = foreach a_group {
>   a_distinct = distinct a.$0;
>   generate group, a_distinct;
> }
> --store b into 'b';
> c = filter b by SIZE(a_distinct) == 1;
> store c into 'out';
> {code}
> We expect output to be:
> 1 1
> The output is empty file.
> Uncomment {code}--store b into 'b';{code} line and see the diffrence.
> Yuo would get expected output.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (PIG-3347) Store invocation brings side effect

2014-01-30 Thread Julien Le Dem (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887203#comment-13887203
 ] 

Julien Le Dem commented on PIG-3347:


I thought that the field UIDs were used to track lineage across the plan.
[~aniket486] correct me if I'm wrong but it is used to determine which fields 
are reads for projection push down.
In the case of self join (directly or indirectly) we end up with duplicate ids 
in the same relation because the same field is derived to 2 different fields.
Otherwise I'm as lost as [~knoguchi] regarding the actual mechanisms around the 
UID.
I tried to fix some of these in the past (PIG-3020) but it appears they created 
more problems (PIG-3492)
[~daijy] maybe you can enlighten us?

> Store invocation brings side effect
> ---
>
> Key: PIG-3347
> URL: https://issues.apache.org/jira/browse/PIG-3347
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.11
> Environment: local mode
>Reporter: Sergey
>Assignee: Daniel Dai
>Priority: Critical
> Fix For: 0.12.1
>
> Attachments: PIG-3347-1.patch
>
>
> The problem is that intermediate 'store' invocation "changes" the final store 
> output. Looks like it brings some kind of side effect. We did use 'local' 
> mode to run script
> here is the input data:
> 1
> 1
> Here is the script:
> {code}
> a = load 'test';
> a_group = group a by $0;
> b = foreach a_group {
>   a_distinct = distinct a.$0;
>   generate group, a_distinct;
> }
> --store b into 'b';
> c = filter b by SIZE(a_distinct) == 1;
> store c into 'out';
> {code}
> We expect output to be:
> 1 1
> The output is empty file.
> Uncomment {code}--store b into 'b';{code} line and see the diffrence.
> Yuo would get expected output.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (PIG-3347) Store invocation brings side effect

2014-01-30 Thread Dmitriy V. Ryaboy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13887114#comment-13887114
 ] 

Dmitriy V. Ryaboy commented on PIG-3347:


Yikes.

[~aniket486] & [~julienledem] this seems like a critical bug to look at. 
Julien, you investigated this UID situation before, right?

> Store invocation brings side effect
> ---
>
> Key: PIG-3347
> URL: https://issues.apache.org/jira/browse/PIG-3347
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.11
> Environment: local mode
>Reporter: Sergey
>Assignee: Daniel Dai
>Priority: Critical
> Fix For: 0.12.1
>
> Attachments: PIG-3347-1.patch
>
>
> The problem is that intermediate 'store' invocation "changes" the final store 
> output. Looks like it brings some kind of side effect. We did use 'local' 
> mode to run script
> here is the input data:
> 1
> 1
> Here is the script:
> {code}
> a = load 'test';
> a_group = group a by $0;
> b = foreach a_group {
>   a_distinct = distinct a.$0;
>   generate group, a_distinct;
> }
> --store b into 'b';
> c = filter b by SIZE(a_distinct) == 1;
> store c into 'out';
> {code}
> We expect output to be:
> 1 1
> The output is empty file.
> Uncomment {code}--store b into 'b';{code} line and see the diffrence.
> Yuo would get expected output.



--
This message was sent by Atlassian JIRA
(v6.1.5#6160)


[jira] [Commented] (PIG-3347) Store invocation brings side effect

2013-11-08 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13817385#comment-13817385
 ] 

Koji Noguchi commented on PIG-3347:
---

I wish there was a wiki/document describing how UID should be assigned.  Even 
after going through PIG-3492, I'm still lost on when exactly we should assign 
new UIDs.  
My current understanding(or guess) is, UID represents an uniqueness within a 
record.  
Just by looking at the UIDs from the two separate relations(bags), we can tell 
if the fields were altered or not.  (Although we cannot tell if the tuples were 
filtered or not.)

FilterAboveForeach(PushUpFilter) is using this property to determine if FILTER 
can be moved before the foreach.  Bug here is, nested distinct is not assigning 
a new UID for the bag it creates so FilterAboveForeach mistakenly thinks that 
no fields were altered within the foreach and decides to move this filter 
upfront.
Following show the schema BEFORE calling PushUpFilter/FilterAboveForeach 
without and with Daniel's patch.  We can see that after applying the patch, 
relation 'c' and 'a_group' contain different UIDs for the bag.
{noformat}
(without the patch)
|---c: (Name: LOFilter Schema: 
group#11:bytearray,a_distinct#12:bag{#13:tuple(#14:bytearray)})
|---b: (Name: LOForEach Schema: 
group#11:bytearray,a_distinct#12:bag{#13:tuple(#14:bytearray)})
|---a_group: (Name: LOCogroup Schema: 
group#11:bytearray,a#12:bag{#13:tuple()})
(with the patch)
|---c: (Name: LOFilter Schema: 
group#15:bytearray,a_distinct#20:bag{#19:tuple(#18:bytearray)})
|---b: (Name: LOForEach Schema: 
group#15:bytearray,a_distinct#20:bag{#19:tuple(#18:bytearray)})
|---a_group: (Name: LOCogroup Schema: 
group#15:bytearray,a#16:bag{#17:tuple()})
{noformat}

So I think the patch fixes the bug described on the jira nicely.  However, 
question remains for other nested operations. I believe the same bug can appear 
for nested LIMIT and nested FILTER.  For example,

{noformat}
a = load 'test.txt';
a_group = group a by $0;
b = foreach a_group {
  a_limit = limit a.$0 5;
  generate group, a_limit;
}
c = filter b by SIZE(a_limit) == 5;
store c into 'out';
{noformat}

{noformat}
a = load 'test3.txt' as (a0, a1); 
a_group = group a by a0;
b = foreach a_group {
  newA = filter a by a1 == 2;
  generate group, newA;
}
c = filter b by SIZE(newA) == 5;
store c into 'out';
{noformat}

I confirmed these two examples also mistakenly push the filter before foreach 
and produce empty results.  Former case, nested LIMIT, is actually covered with 
the current patch since nested LIMIT uses LOLIMIT+LOForeach.  So the patch 
{noformat}
+  98  // If it is nested foreach or nested distinct, generate new uid
+  99  if (op instanceof LOForEach || op instanceof LODistinct) {
+ 100 needNewUid = true;
+ 101 }
{noformat} 
takes care of nested limit although comment doesn't mention it.   Nested filter 
is not the case here and the bug still exists after the current patch.  Can we 
cover this case as well? 


> Store invocation brings side effect
> ---
>
> Key: PIG-3347
> URL: https://issues.apache.org/jira/browse/PIG-3347
> Project: Pig
>  Issue Type: Bug
>  Components: grunt
>Affects Versions: 0.11
> Environment: local mode
>Reporter: Sergey
>Assignee: Daniel Dai
> Fix For: 0.12.1
>
> Attachments: PIG-3347-1.patch
>
>
> The problem is that intermediate 'store' invocation "changes" the final store 
> output. Looks like it brings some kind of side effect. We did use 'local' 
> mode to run script
> here is the input data:
> 1
> 1
> Here is the script:
> {code}
> a = load 'test';
> a_group = group a by $0;
> b = foreach a_group {
>   a_distinct = distinct a.$0;
>   generate group, a_distinct;
> }
> --store b into 'b';
> c = filter b by SIZE(a_distinct) == 1;
> store c into 'out';
> {code}
> We expect output to be:
> 1 1
> The output is empty file.
> Uncomment {code}--store b into 'b';{code} line and see the diffrence.
> Yuo would get expected output.



--
This message was sent by Atlassian JIRA
(v6.1#6144)