[jira] [Commented] (PIG-4819) RANDOM() udf can lead to missing or redundant records
[ https://issues.apache.org/jira/browse/PIG-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15187597#comment-15187597 ]

Koji Noguchi commented on PIG-4819:
---

bq. TestBuiltin.testURIWithCurlyBrace is failing after addition of testRandomJob with -Dhadoopversion=23 -Dexectype=tez.
Fixed in PIG-4833.

bq. Also would be good to put this in Pig 0.15.1 as well.
Not sure. I do like my change but still afraid of how it'll perform for our users. For now, I prefer to keep it only in trunk.

> RANDOM() udf can lead to missing or redundant records
> -
>
>          Key: PIG-4819
>          URL: https://issues.apache.org/jira/browse/PIG-4819
>      Project: Pig
>   Issue Type: Bug
>     Reporter: Koji Noguchi
>     Assignee: Koji Noguchi
>      Fix For: 0.16.0
>
>  Attachments: pig-4819-v01.patch, pig-4819-v02.patch,
>               pig-4819-v02_fix_v01.patch, pig-4819-v02_fix_v02.patch,
>               pig-4819-v02_fix_v03.patch, pig-4819-v02_fix_v04.patch,
>               pig-4819-v02_fix_v05.patch, pig-4819-v02_fix_v06.patch
>
>
> When RANDOM() value is used for grouping/distinct/etc, it breaks the
> mapreduce rule and can lead to redundant or missing records.
> Some discussion can be found in
> https://issues.apache.org/jira/browse/PIG-3257?focusedCommentId=13669195#comment-13669195
> We should make RANDOM less random so that it'll produce the same sequence of
> random values from the task retries.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (PIG-4819) RANDOM() udf can lead to missing or redundant records
[ https://issues.apache.org/jira/browse/PIG-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15179448#comment-15179448 ]

Rohini Palaniswamy commented on PIG-4819:
-

[~knoguchi],
TestBuiltin.testURIWithCurlyBrace is failing after addition of testRandomJob with -Dhadoopversion=23 -Dexectype=tez. Possible to take a look at it?

Also would be good to put this in Pig 0.15.1 as well.
[jira] [Commented] (PIG-4819) RANDOM() udf can lead to missing or redundant records
[ https://issues.apache.org/jira/browse/PIG-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15178597#comment-15178597 ]

Rohini Palaniswamy commented on PIG-4819:
-

+1
[jira] [Commented] (PIG-4819) RANDOM() udf can lead to missing or redundant records
[ https://issues.apache.org/jira/browse/PIG-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15176100#comment-15176100 ]

Rohini Palaniswamy commented on PIG-4819:
-

+1

bq. But should I simply extend org.apache.pig.builtin.RANDOM from org.apache.pig.piggybank.evaluation.math.RANDOM
Would be ideal, but if they use a newer piggybank jar with an older version of Pig, it will break. So I think duplicating the code is better for now.

> RANDOM() udf can lead to missing or redundant records
> -
>
>          Key: PIG-4819
>          URL: https://issues.apache.org/jira/browse/PIG-4819
>      Project: Pig
>   Issue Type: Bug
>     Reporter: Koji Noguchi
>     Assignee: Koji Noguchi
>  Attachments: pig-4819-v01.patch, pig-4819-v02.patch
>
>
> When RANDOM() value is used for grouping/distinct/etc, it breaks the
> mapreduce rule and can lead to redundant or missing records.
> Some discussion can be found in
> https://issues.apache.org/jira/browse/PIG-3257?focusedCommentId=13669195#comment-13669195
> We should make RANDOM less random so that it'll produce the same sequence of
> random values from the task retries.
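
For readers of this thread, the idea in the issue description (make RANDOM produce the same sequence of values across task retries) can be sketched as below. This is only an illustration and is not taken from the attached patches: the class name RetryStableRandom, the sequence helper, and the assumption that a task identifier (the same string for every attempt of a task) is available to seed from are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration class; NOT the code from the PIG-4819 patches.
final class RetryStableRandom {
    private final Random rng;

    // Assumption: taskId names the task, not the attempt, so every retry
    // of the same task passes the same string and gets the same seed.
    RetryStableRandom(String taskId) {
        // String.hashCode() is specified by the Java API, so the seed is
        // the same on every JVM that runs a retry.
        this.rng = new Random(taskId.hashCode());
    }

    // Next value in [0.0, 1.0), drawn from the seeded generator.
    double next() {
        return rng.nextDouble();
    }

    // Convenience helper: the first n values generated for a given task.
    static List<Double> sequence(String taskId, int n) {
        RetryStableRandom r = new RetryStableRandom(taskId);
        List<Double> out = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            out.add(r.next());
        }
        return out;
    }
}
```

With a seed tied to the task rather than the attempt, a re-executed task assigns the same value to each record, provided it reads its input in the same order, so a group or distinct key built from that value stays the same on retry.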