[
https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669195#comment-13669195
]
Koji Noguchi commented on PIG-3257:
-----------------------------------
With your first example, say you have _n_ input records, 1 mapper, and 2 reducers.
{noformat}
A = load ...
B = group A by UUID();
STORE B ...
{noformat}
This job could successfully finish with output ranging from 0 to 2n records.
For example, the sequence of events could be:
# mapper0_attempt0 finishes with _n_ outputs, and say all _n_ UUID keys are assigned to reducer0.
# reducer0_attempt0 pulls the map outputs and produces _n_ outputs.
# reducer1_attempt0 tries to pull mapper0_attempt0's output and fails (could be a fetch failure or a node failure).
# mapper0_attempt1 reruns, and this time all _n_ UUID keys are assigned to reducer1.
# reducer1_attempt0 pulls mapper0_attempt1's output and produces _n_ outputs.
# The job finishes successfully with 2_n_ outputs.
This is certainly unexpected to users.
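The sequence above can be illustrated with a toy Python sketch (this is not Pig or Hadoop internals, just a simulation of the failure scenario): because UUID() is non-deterministic, a re-run of the same map task produces different keys, so each reducer may see a different partitioning of the "same" map output.

```python
import uuid

NUM_REDUCERS = 2

def map_task(records):
    # Each map attempt generates fresh UUIDs for the same input records.
    return [(str(uuid.uuid4()), rec) for rec in records]

def partition(key):
    # Hash-partition the UUID keys across reducers, as the shuffle would.
    return hash(key) % NUM_REDUCERS

records = ["r%d" % i for i in range(100)]  # n = 100 input records

attempt0 = map_task(records)  # reducer0 happens to fetch this attempt
attempt1 = map_task(records)  # reducer1 fetches the re-run instead

# Each reducer keeps only the records partitioned to it, but the two
# reducers read from *different* map attempts, so the totals need not
# add up to n.
reducer0_out = [kv for kv in attempt0 if partition(kv[0]) == 0]
reducer1_out = [kv for kv in attempt1 if partition(kv[0]) == 1]

total = len(reducer0_out) + len(reducer1_out)
# total can be anywhere from 0 to 2n; it equals n only by luck.
print(total)
```

With a deterministic key, both attempts would partition identically and the total would always be exactly _n_.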
Now, with your second example:
{noformat}
A = load 'over100k' using org.apache.hcatalog.pig.HCatLoader();
B = foreach A generate *, UUID();
C = group B by s;
D = foreach C generate flatten(B), SUM(B.i) as sum_b;
E = group B by si;
F = foreach E generate flatten(B), SUM(B.f) as sum_f;
G = join D by uuid, F by uuid;
H = foreach G generate D::B::s, sum_b, sum_f;
store H into 'output';
{noformat}
Let's say Pig decides to implement the two group-bys (C and E) within one map-reduce job. For simplicity, let's use 1 mapper and 2 reducers again, and assume Pig partitions all group-by keys for _C_ to reducer0 and all for _E_ to reducer1. Now, using the same story as above, there could be a case where reducer0 (group-by C) gets one set of UUIDs from mapper0_attempt0 while reducer1 (group-by E) gets a completely different set of UUIDs from mapper0_attempt1.
When this happens, the join _G_ would produce 0 results, which is unexpected to users.
Of course this depends on how Pig executes the above query, but I hope it demonstrates how tricky things get when you introduce a purely random ID into Hadoop.
What's worst about all this is that it is a corner case which won't get caught in users' QE phase; it would only manifest in the production pipeline. Users would then yell at me about corrupted output from successful jobs. Hence my previous comment about a "support nightmare".
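A toy Python sketch of that second scenario (again a simulation, not Pig internals): if the two group-by branches end up reading UUID-tagged records from different map attempts, the UUIDs attached to the same logical record no longer match, and the join on the uuid column is empty.

```python
import uuid

rows = [("alice", 1, 2.0), ("bob", 3, 4.0)]  # (s, i, f) input records

def tag_with_uuid(rows):
    # One map attempt of: B = foreach A generate *, UUID();
    return [(s, i, f, str(uuid.uuid4())) for (s, i, f) in rows]

b_seen_by_D = tag_with_uuid(rows)  # branch D reads mapper0_attempt0 output
b_seen_by_F = tag_with_uuid(rows)  # branch F reads mapper0_attempt1 (re-run)

# G = join D by uuid, F by uuid;  -- match on the UUID column
f_uuids = {row[3] for row in b_seen_by_F}
g = [row for row in b_seen_by_D if row[3] in f_uuids]

print(len(g))  # 0: the same logical records got fresh UUIDs on the re-run
```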
> Add unique identifier UDF
> -------------------------
>
> Key: PIG-3257
> URL: https://issues.apache.org/jira/browse/PIG-3257
> Project: Pig
> Issue Type: Improvement
> Components: internal-udfs
> Reporter: Alan Gates
> Assignee: Alan Gates
> Fix For: 0.12
>
> Attachments: PIG-3257.patch
>
>
> It would be good to have a Pig function to generate unique identifiers.
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira