[ https://issues.apache.org/jira/browse/PIG-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Koji Noguchi updated PIG-4819:
------------------------------
    Attachment: pig-4819-v02_fix_v01.patch

Wasn't sure whether I should roll back the change and attach a complete new patch, or attach a diff from the last commit. This one does the latter.

Two major changes:
* Given that two consecutive seeds seem to produce close random values (at least for some initial ones), the seed (type: long) is now built by concatenating the task number and job_id.hashCode() rather than combining them in other ways. Since only the first 48 bits of the passed seed are used, that leaves 28 bits for the task id.
* To add more randomness across jobs, the submit time is mixed in with XOR. (Ideally this would be nanoseconds, but I think milliseconds are good enough.)

I'll do some more testing tomorrow.

> RANDOM() udf can lead to missing or redundant records
> -----------------------------------------------------
>
>                 Key: PIG-4819
>                 URL: https://issues.apache.org/jira/browse/PIG-4819
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>             Fix For: 0.16.0
>
>         Attachments: pig-4819-v01.patch, pig-4819-v02.patch, pig-4819-v02_fix_v01.patch
>
>
> When a RANDOM() value is used for grouping/distinct/etc., it breaks the mapreduce contract and can lead to redundant or missing records.
> Some discussion can be found in https://issues.apache.org/jira/browse/PIG-3257?focusedCommentId=13669195#comment-13669195
> We should make RANDOM less random so that it produces the same sequence of random values across task retries.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
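The seeding scheme described in the comment above can be sketched roughly as follows. This is a hypothetical illustration, not the actual patch: the class and method names are invented, and the 28/20-bit split is inferred from "48 bits used, 28 bits for the task id" (java.util.Random scrambles the seed and keeps only its low 48 bits).

```java
import java.util.Random;

public class SeededRandomSketch {

    // Hypothetical sketch of the described scheme: pack the task index into
    // the upper 28 of the 48 seed bits that java.util.Random actually uses,
    // put the job id's hashCode() in the remaining 20 bits, then XOR in the
    // job submit time (milliseconds) so different jobs get different seeds.
    static long buildSeed(int taskIndex, int jobIdHash, long submitTimeMs) {
        long packed = ((long) taskIndex << 20) | (jobIdHash & 0xFFFFFL);
        return packed ^ submitTimeMs;
    }

    public static void main(String[] args) {
        // Same task index, job id, and submit time => same seed, so a task
        // retry reproduces the exact sequence and grouping on RANDOM() stays
        // consistent across re-execution. (Job id string is made up.)
        long seed = buildSeed(7, "job_1458765432_0001".hashCode(), 1458765432000L);
        Random firstAttempt = new Random(seed);
        Random retryAttempt = new Random(seed);
        System.out.println(firstAttempt.nextDouble() == retryAttempt.nextDouble());
        // prints: true
    }
}
```

The point of placing the task index in the high bits is that neighboring task numbers differ in well-separated seed bits rather than in the lowest bits, avoiding the observed problem of consecutive seeds yielding close initial random values.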