[ https://issues.apache.org/jira/browse/PIG-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Koji Noguchi updated PIG-4819:
------------------------------
    Attachment: pig-4819-v02_fix_v02.patch

After discussing with Rohini, I simplified the call by creating a long that joins two jobid hashes and XOR-ing it with the task number.
I also added logic so that if RANDOM is called more than once in the script, each call returns a different value.
{code}
B = FOREACH A generate RANDOM(), RANDOM();
{code}

bq. To add more randomness across jobs, adding submit time with XOR.
This didn't work with Tez. It wasn't transferring "pig.job.submitted.timestamp". For now, I'm taking it out, but it would be nice to have this. (Even better with nanoseconds.)

{quote}
bq. But should I simply extend org.apache.pig.builtin.RANDOM from org.apache.pig.piggybank.evaluation.math.RANDOM
Would be ideal, but if they use newer piggybank jar with older version of pig it will break. So I think duplicating code is better for now.
{quote}
Given the not-so-obvious changes I've made to the original RANDOM, I wasn't comfortable with copying and pasting, so I simply went with the extending option. My understanding is that the worst case would be piggybank.RANDOM referencing the original builtin.RANDOM without my changes, but it won't fail.

> RANDOM() udf can lead to missing or redundant records
> -----------------------------------------------------
>
>                 Key: PIG-4819
>                 URL: https://issues.apache.org/jira/browse/PIG-4819
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>             Fix For: 0.16.0
>
>         Attachments: pig-4819-v01.patch, pig-4819-v02.patch,
> pig-4819-v02_fix_v01.patch, pig-4819-v02_fix_v02.patch
>
>
> When RANDOM() value is used for grouping/distinct/etc, it breaks the
> mapreduce rule and can lead to redundant or missing records.
> Some discussion can be found in
> https://issues.apache.org/jira/browse/PIG-3257?focusedCommentId=13669195#comment-13669195
> We should make RANDOM less random so that it'll produce the same sequence of
> random values from the task retries.
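
The seeding scheme described in the comment above (a long built from two jobid hashes, XOR-ed with the task number, plus something that separates multiple RANDOM() calls in one script) could be sketched roughly as follows. This is not the code from the attached patch: the class name, the `seedFor` and `firstValues` helpers, the `callIndex` parameter, and the choice of putting the same 32-bit hash in both halves of the long are all assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Random;

// Illustrative sketch only; every name here is hypothetical, not from the patch.
final class SeededRandomSketch {

    /** Builds a 64-bit seed from a job id, a task number, and a per-call index. */
    static long seedFor(String jobId, int taskNumber, int callIndex) {
        long hash = jobId.hashCode();
        // Assumption: "connecting two jobid-hash" is read as placing the
        // 32-bit job id hash in both the upper and lower half of a long.
        long joined = (hash << 32) | (hash & 0xFFFFFFFFL);
        // XOR with the task number so each task gets its own sequence.
        // The call index is shifted above the task-number bits so that two
        // RANDOM() calls in the same task never share a seed.
        return joined ^ taskNumber ^ ((long) callIndex << 32);
    }

    /** Returns the first n values a task would draw for a given RANDOM() call. */
    static double[] firstValues(String jobId, int taskNumber, int callIndex, int n) {
        Random r = new Random(seedFor(jobId, taskNumber, callIndex));
        double[] out = new double[n];
        for (int i = 0; i < n; i++) {
            out[i] = r.nextDouble();
        }
        return out;
    }

    public static void main(String[] args) {
        // A retried task attempt shares the job id and task number,
        // so it replays exactly the same sequence.
        double[] firstAttempt = firstValues("job_201601010000_0001", 7, 0, 3);
        double[] retry = firstValues("job_201601010000_0001", 7, 0, 3);
        System.out.println(Arrays.equals(firstAttempt, retry)); // prints true
    }
}
```

Because the seed depends only on values that stay the same across task retries, a retried attempt regenerates the same values, which is the property the issue description asks for. Note that `java.util.Random` uses only the low 48 bits of its seed, so in this sketch the call index is kept within bits 32 to 47.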