[
https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669195#comment-13669195
]
Koji Noguchi commented on PIG-3257:
-----------------------------------
With your first example, say you have _n_ input records, 1 mapper, and 2 reducers.
{noformat}
A = load ...
B = group A by UUID();
STORE B ...
{noformat}
This job could successfully finish with output ranging from 0 to 2n records.
For example, the sequence of events could be:
# mapper0_attempt0 finishes with _n_ outputs, and say all _n_ UUID keys are assigned to reducer0.
# reducer0_attempt0 pulls the map outputs and produces _n_ outputs.
# reducer1_attempt0 tries to pull mapper0_attempt0's output and fails (could be a fetch failure or a node failure).
# mapper0_attempt1 reruns, and this time all _n_ UUID keys are assigned to reducer1.
# reducer1_attempt0 pulls mapper0_attempt1's output and produces _n_ outputs.
# The job finishes successfully with 2_n_ outputs.
This is certainly unexpected to users.
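The sequence above can be illustrated with a toy Python sketch (this is not Pig or Hadoop internals, just a simulation of the failure scenario): because UUID() is non-deterministic, a re-run of the same map task produces different keys, so each reducer may see a different partitioning of the "same" map output.

```python
import uuid

NUM_REDUCERS = 2

def map_task(records):
    # Each map attempt generates fresh UUIDs for the same input records.
    return [(str(uuid.uuid4()), rec) for rec in records]

def partition(key):
    # Hash-partition the UUID keys across reducers, as the shuffle would.
    return hash(key) % NUM_REDUCERS

records = ["r%d" % i for i in range(100)]  # n = 100 input records

attempt0 = map_task(records)  # reducer0 happens to fetch this attempt
attempt1 = map_task(records)  # reducer1 fetches the re-run instead

# Each reducer keeps only the records partitioned to it, but the two
# reducers read from *different* map attempts, so the totals need not
# add up to n.
reducer0_out = [kv for kv in attempt0 if partition(kv[0]) == 0]
reducer1_out = [kv for kv in attempt1 if partition(kv[0]) == 1]

total = len(reducer0_out) + len(reducer1_out)
# total can be anywhere from 0 to 2n; it equals n only by luck.
print(total)
```

With a deterministic key, both attempts would partition identically and the total would always be exactly _n_.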
Now, with your second example:
{noformat}
A = load 'over100k' using org.apache.hcatalog.pig.HCatLoader();
B = foreach A generate *, UUID();
C = group B by s;
D = foreach C generate flatten(B), SUM(B.i) as sum_b;
E = group B by si;
F = foreach E generate flatten(B), SUM(B.f) as sum_f;
G = join D by uuid, F by uuid;
H = foreach G generate D::B::s, sum_b, sum_f;
store H into 'output';
{noformat}
Let's say Pig decides to implement the two group-bys (C and E) within one map-reduce job. For simplicity, let's use 1 mapper and 2 reducers again, and assume Pig partitions all group-by keys for _C_ to reducer0 and all for _E_ to reducer1. Now, using the same story as above, there could be a case where reducer0 (group-by C) gets one set of UUIDs from mapper0_attempt0 while reducer1 (group-by E) gets a completely different set of UUIDs from mapper0_attempt1.
When this happens, the join _G_ would produce 0 results, which is unexpected to users.
Of course this depends on how Pig executes the above query, but I hope it demonstrates how tricky things get when you introduce a purely random ID into Hadoop.
What's worst about all this is that it is a corner case which won't get caught in users' QE phase; it would only manifest in the production pipeline. Users would then yell at me about corrupted output from successful jobs. Hence my previous comment about a "support nightmare".
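A toy Python sketch of that second scenario (again a simulation, not Pig internals): if the two group-by branches end up reading UUID-tagged records from different map attempts, the UUIDs attached to the same logical record no longer match, and the join on the uuid column is empty.

```python
import uuid

rows = [("alice", 1, 2.0), ("bob", 3, 4.0)]  # (s, i, f) input records

def tag_with_uuid(rows):
    # One map attempt of: B = foreach A generate *, UUID();
    return [(s, i, f, str(uuid.uuid4())) for (s, i, f) in rows]

b_seen_by_D = tag_with_uuid(rows)  # branch D reads mapper0_attempt0 output
b_seen_by_F = tag_with_uuid(rows)  # branch F reads mapper0_attempt1 (re-run)

# G = join D by uuid, F by uuid;  -- match on the UUID column
f_uuids = {row[3] for row in b_seen_by_F}
g = [row for row in b_seen_by_D if row[3] in f_uuids]

print(len(g))  # 0: the same logical records got fresh UUIDs on the re-run
```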
> Add unique identifier UDF
> -------------------------
>
> Key: PIG-3257
> URL: https://issues.apache.org/jira/browse/PIG-3257
> Project: Pig
> Issue Type: Improvement
> Components: internal-udfs
> Reporter: Alan Gates
> Assignee: Alan Gates
> Fix For: 0.12
>
> Attachments: PIG-3257.patch
>
>
> It would be good to have a Pig function to generate unique identifiers.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira