[ 
https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668748#comment-13668748
 ] 

Alan Gates edited comment on PIG-3257 at 5/28/13 10:32 PM:
-----------------------------------------------------------

I don't see how records can be missing or redundant.  Take the following query:

{code}
A = load ...
B = group A by UUID();
C = foreach B...
{code}

This won't reduce at all: every group holds exactly one record.  It is totally 
irrelevant what particular value a record's key is, because it's guaranteed to 
be unique for each record.  So 1) this is a totally meaningless thing to do; 
2) if a particular map does get rerun or is used in speculative execution it 
doesn't matter, because which particular key UUID generates is irrelevant.  The 
way this is intended to be used is something like this:

{code}
A = load 'over100k' using org.apache.hcatalog.pig.HCatLoader();
B = foreach A generate *, UUID() as uuid;
C = group B by s;
D = foreach C generate flatten(B), SUM(B.i) as sum_b;
E = group B by si;
F = foreach E generate flatten(B), SUM(B.f) as sum_f;
G = join D by uuid, F by uuid;
H = foreach G generate D::B::s, sum_b, sum_f;
store H into 'output';
{code}
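
To make the intent concrete, here is a rough Python sketch of the same pattern. 
It is not Pig and says nothing about how the UDF is implemented; the sample 
rows and the helper names (tag, group_sum, pipeline) are made up for 
illustration.  Each record is tagged once, two independent group-and-sum 
passes run over the tagged records, and the results are joined back on the tag:

```python
import uuid
from collections import defaultdict

# Hypothetical stand-in rows for the 'over100k' table (fields s, si, i, f).
rows = [
    {"s": "a", "si": 1, "i": 10, "f": 0.5},
    {"s": "a", "si": 2, "i": 20, "f": 1.5},
    {"s": "b", "si": 1, "i": 30, "f": 2.5},
]

def tag(records):
    # Mirrors B: append a unique identifier to every record.
    return [dict(r, uuid=str(uuid.uuid4())) for r in records]

def group_sum(records, key, field):
    # Mirrors "group ... by key" followed by flatten + SUM:
    # returns a mapping from each record's uuid to the sum for its group.
    totals = defaultdict(float)
    for r in records:
        totals[r[key]] += r[field]
    return {r["uuid"]: totals[r[key]] for r in records}

def pipeline(records):
    b = tag(records)
    sum_b = group_sum(b, "s", "i")   # mirrors C and D
    sum_f = group_sum(b, "si", "f")  # mirrors E and F
    # Mirrors G and H: join the two results on uuid, keep s and both sums.
    return [(r["s"], sum_b[r["uuid"]], sum_f[r["uuid"]]) for r in b]

if __name__ == "__main__":
    for row in pipeline(rows):
        print(row)
```

In this sketch the identifiers differ on every run but the joined output does 
not, since the identifier is only a join handle that is generated once and 
then carried along with the record.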

                
> Add unique identifier UDF
> -------------------------
>
>                 Key: PIG-3257
>                 URL: https://issues.apache.org/jira/browse/PIG-3257
>             Project: Pig
>          Issue Type: Improvement
>          Components: internal-udfs
>            Reporter: Alan Gates
>            Assignee: Alan Gates
>             Fix For: 0.12
>
>         Attachments: PIG-3257.patch
>
>
> It would be good to have a Pig function to generate unique identifiers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
