[jira] [Updated] (PIG-4819) RANDOM() udf can lead to missing or redundant records

Koji Noguchi (JIRA) Thu, 03 Mar 2016 10:33:42 -0800

     [ https://issues.apache.org/jira/browse/PIG-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Koji Noguchi updated PIG-4819:
------------------------------
    Attachment: pig-4819-v02_fix_v02.patch

After discussing with Rohini, I simplified the seeding by building a long from 
two job-id hashes joined together and XOR-ing it with the task number.  I also 
added logic so that if RANDOM is called more than once in the script, each 
call returns a different value.

{code}
B = FOREACH A generate RANDOM(), RANDOM();
{code}
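
To make the idea concrete, here is a minimal sketch of one reading of the seeding described above. It is not taken from the patch: the class name RetryStableSeed, the method seedFor, the callIndex parameter, and the 48-bit shift are all hypothetical, and the patch may combine the values differently. The point is only that the seed depends on the job id, the task number, and the position of the RANDOM() call, and on nothing that changes between task attempts.

```java
import java.util.Random;

// Hypothetical sketch; names are illustrative and do not come from the patch.
class RetryStableSeed {

    // Build a 64-bit value by placing the 32-bit job-id hash in both halves,
    // then XOR in the task number and the index of the RANDOM() call site.
    static long seedFor(String jobId, int taskNumber, int callIndex) {
        long h = jobId.hashCode() & 0xFFFFFFFFL;   // treat the hash as unsigned 32 bits
        long joined = (h << 32) | h;               // "connect" two copies of the hash
        // Assumption: shift the call index into high bits so that it cannot
        // cancel against a small task number.
        return joined ^ taskNumber ^ ((long) callIndex << 48);
    }

    public static void main(String[] args) {
        String jobId = "job_201603031033_0001";    // made-up job id for the demo
        // A retried task sees the same job id and task number,
        // so it replays the same sequence of values.
        Random firstAttempt = new Random(seedFor(jobId, 7, 0));
        Random retry = new Random(seedFor(jobId, 7, 0));
        System.out.println(firstAttempt.nextDouble() == retry.nextDouble());
        // Two RANDOM() calls in the same FOREACH get different seeds.
        System.out.println(seedFor(jobId, 7, 0) != seedFor(jobId, 7, 1));
    }
}
```

Because nothing attempt-specific (attempt id, wall-clock time) goes into the seed, a re-run task regenerates the same values, which is what grouping or distinct on a RANDOM() column needs.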

bq. To add more randomness across jobs, adding submit time with XOR.

This didn't work with Tez, which wasn't transferring 
"pig.job.submitted.timestamp".  I'm taking it out for now, but it would be 
nice to have (even better with nanosecond resolution).

{quote}
bq. But should I simply extend org.apache.pig.builtin.RANDOM from 
org.apache.pig.piggybank.evaluation.math.RANDOM
Would be ideal, but if they use newer piggybank jar with older version of pig 
it will break. So I think duplicating code is better for now.
{quote}
Given the not-so-obvious changes I've made to the original RANDOM, I wasn't 
comfortable with copying and pasting, so I simply went with the extending 
option.  My understanding is that in the worst case piggybank.RANDOM would 
reference the original builtin.RANDOM without my changes, but it won't fail.

> RANDOM() udf can lead to missing or redundant records
> -----------------------------------------------------
>
>                 Key: PIG-4819
>                 URL: https://issues.apache.org/jira/browse/PIG-4819
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>             Fix For: 0.16.0
>
>         Attachments: pig-4819-v01.patch, pig-4819-v02.patch, 
> pig-4819-v02_fix_v01.patch, pig-4819-v02_fix_v02.patch
>
>
> When a RANDOM() value is used for grouping/distinct/etc., it breaks the 
> mapreduce rule and can lead to redundant or missing records.
> Some discussion can be found in 
> https://issues.apache.org/jira/browse/PIG-3257?focusedCommentId=13669195#comment-13669195
> We should make RANDOM less random so that it produces the same sequence of 
> random values across task retries.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
