[jira] [Updated] (PIG-4819) RANDOM() udf can lead to missing or redundant records

Koji Noguchi (JIRA) Wed, 02 Mar 2016 07:23:38 -0800

     [ https://issues.apache.org/jira/browse/PIG-4819?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Koji Noguchi updated PIG-4819:
------------------------------
    Attachment: pig-4819-v02.patch

{quote}
bq. Let me know if I should make the same change.
Yes. That would be good.
{quote}
Made the same change there.
But should I simply extend {{org.apache.pig.builtin.RANDOM}} from 
{{org.apache.pig.piggybank.evaluation.math.RANDOM}}?

bq. Tab spacing should be 4 spaces and not two in exec() method.

I did use 4 spaces for the lines I touched.  Assuming you mean the actual tab 
characters in the existing code, I replaced them with 4 spaces.

bq. Can we remove System.err.println(tmpresult[i]); or use debug logging?
Forgot to delete that.  Thanks for catching it.
Since the output reassured me that I am getting random values, I replaced it 
with info-level logging.


> RANDOM() udf can lead to missing or redundant records
> -----------------------------------------------------
>
>                 Key: PIG-4819
>                 URL: https://issues.apache.org/jira/browse/PIG-4819
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>         Attachments: pig-4819-v01.patch, pig-4819-v02.patch
>
>
> When a RANDOM() value is used for grouping, distinct, etc., it breaks the 
> MapReduce rule that a retried task must reproduce the same output, and can 
> lead to redundant or missing records. 
> Some discussion can be found in 
> https://issues.apache.org/jira/browse/PIG-3257?focusedCommentId=13669195#comment-13669195
> We should make RANDOM less random so that task retries produce the same 
> sequence of random values.
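
To illustrate the idea in the description above, here is a minimal sketch of 
a generator whose seed depends only on values that stay the same when a task 
is retried. This is not the code in the attached patches. The class name, the 
seedFor() helper, and the job id / task index inputs are all hypothetical and 
chosen only for illustration; the only library call assumed is 
java.util.Random with an explicit seed.

```java
import java.util.Arrays;
import java.util.Random;

public class SeededRandomSketch {

    // Hypothetical helper: build a seed from identifiers that do not change
    // between attempts of the same task (assumed here: job id + task index).
    static long seedFor(String jobId, int taskIndex) {
        return 31L * jobId.hashCode() + taskIndex;
    }

    // Returns the first n values a task would draw. Because the seed is
    // fixed per task, a retried attempt draws exactly the same sequence.
    static double[] firstValues(String jobId, int taskIndex, int n) {
        Random random = new Random(seedFor(jobId, taskIndex));
        double[] values = new double[n];
        for (int i = 0; i < n; i++) {
            values[i] = random.nextDouble();
        }
        return values;
    }

    public static void main(String[] args) {
        // Simulate the original attempt and a retry of the same task.
        double[] firstAttempt = firstValues("job_0001", 3, 5);
        double[] retry = firstValues("job_0001", 3, 5);
        System.out.println(Arrays.equals(firstAttempt, retry)); // prints "true"
    }
}
```

With an unseeded generator the two attempts would almost certainly draw 
different values, so a record could be sent to a different group or partition 
on retry, which is how records end up missing or duplicated.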



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
