[ 
https://issues.apache.org/jira/browse/PIG-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Noguchi updated PIG-4819:
------------------------------
    Attachment: pig-4819-v02_fix_v01.patch

Wasn't sure whether I should roll back the change and attach a complete new 
patch, or attach a diff against the last commit. 
This one does the latter.

Two major changes.
* Since two consecutive seeds seem to result in close random values (at least 
for some initial ones), the seed (type: long) is now created by concatenating 
task_number and job_id.hashCode() instead of combining them in other ways.  
Since only the first 48 bits of the passed seed are used, this leaves 28 bits 
for the task id.

* To add more randomness across jobs, the submit time is mixed in with XOR.  
(Ideally this would be in nanoseconds, but I think milliseconds is good 
enough.)
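
For illustration only, here is a rough sketch of how such a seed could be put 
together. This is not the code in the patch: the class name SeedSketch, the 
method buildSeed, the sample job id, and the exact bit layout (task index in 
the low 28 bits, job id hash shifted above it) are all assumptions made for 
this example. It only shows that a retry of the same task rebuilds the same 
seed, while a different task gets a different one.

```java
import java.util.Random;

// Hypothetical sketch; names and bit layout are assumptions, not the patch.
final class SeedSketch {
    // Assumed layout: the low 28 bits of the seed carry the task index.
    static final int TASK_BITS = 28;
    static final long TASK_MASK = (1L << TASK_BITS) - 1;

    // Concatenate the job id hash and the task index, then XOR in the
    // job submit time so that different jobs start from different seeds.
    public static long buildSeed(String jobId, int taskIndex, long submitTimeMillis) {
        long jobHash = jobId.hashCode();  // 32-bit hash widened to long
        long combined = (jobHash << TASK_BITS) | (taskIndex & TASK_MASK);
        return combined ^ submitTimeMillis;
    }

    public static void main(String[] args) {
        long submit = 1457000000000L;     // made-up submit time in milliseconds
        String jobId = "job_example_0001"; // made-up job id

        // A task retry sees the same job id, task index, and submit time,
        // so it rebuilds the same seed and replays the same sequence.
        Random firstAttempt = new Random(buildSeed(jobId, 3, submit));
        Random retry = new Random(buildSeed(jobId, 3, submit));
        System.out.println(firstAttempt.nextDouble() == retry.nextDouble()); // true

        // A different task index changes the low bits of the seed.
        System.out.println(buildSeed(jobId, 3, submit) != buildSeed(jobId, 4, submit)); // true
    }
}
```

All three inputs are fixed for the lifetime of a task, which is what makes the 
sequence repeatable across retries.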

I'll do some more testing tomorrow.

> RANDOM() udf can lead to missing or redundant records
> -----------------------------------------------------
>
>                 Key: PIG-4819
>                 URL: https://issues.apache.org/jira/browse/PIG-4819
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>             Fix For: 0.16.0
>
>         Attachments: pig-4819-v01.patch, pig-4819-v02.patch, 
> pig-4819-v02_fix_v01.patch
>
>
> When a RANDOM() value is used for grouping/distinct/etc., it breaks the 
> MapReduce rule and can lead to redundant or missing records. 
> Some discussion can be found in 
> https://issues.apache.org/jira/browse/PIG-3257?focusedCommentId=13669195#comment-13669195
> We should make RANDOM less random so that it produces the same sequence of 
> random values across task retries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
