[jira] [Commented] (PIG-3257) Add unique identifier UDF

2014-10-29 Thread Daniel Dai (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189670#comment-14189670
 ] 

Daniel Dai commented on PIG-3257:
-

Since we cannot reach consensus, I will close this issue and provide a 
SequenceID in PIG-4253 instead.

> Add unique identifier UDF
> -
>
> Key: PIG-3257
> URL: https://issues.apache.org/jira/browse/PIG-3257
> Project: Pig
> Issue Type: Improvement
> Components: internal-udfs
> Reporter: Alan Gates
> Assignee: Alan Gates
> Fix For: 0.14.0
>
> Attachments: PIG-3257.patch
>
>
> It would be good to have a Pig function to generate unique identifiers.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (PIG-3257) Add unique identifier UDF

2013-05-29 Thread Rohini Palaniswamy (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669660#comment-13669660
 ] 

Rohini Palaniswamy commented on PIG-3257:
-

Alan,
   Why don't we do it as a sequence instead of generating random numbers? We 
could do something like mapid- or reduceid- followed by a per-task sequence 
number, i.e. the first mapper will produce 0-0, 0-1, ... and the second mapper 
will produce 1-0, 1-1, ... Just an idea, and we can think of a better 
implementation. It will not be in sequence across the job anyway, but it will 
be in sequence within each map and can be used as a UUID across the job, one 
that is repeatable if the job is run with the same number of mappers/reducers. 
This would avoid all the problems of using random numbers, and avoid the human 
mistake of writing a script without understanding the internals of how UUID is 
going to work, which I don't think a user should be bothered with. 
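
A minimal Python sketch of the sequence idea described above, for illustration only. The class name `SequenceId` and the method `next_id` are hypothetical and are not taken from the PIG-3257 patch or from PIG-4253. The sketch assumes that a retried task reads its input records in the same order as the original attempt.

```python
import itertools


class SequenceId:
    """Hypothetical generator of IDs of the form taskid-sequence."""

    def __init__(self, task_id):
        # task_id stands in for the map or reduce task index (assumption).
        self.task_id = task_id
        self._counter = itertools.count()

    def next_id(self):
        # Each call returns the next ID within this task: 0-0, 0-1, ...
        return f"{self.task_id}-{next(self._counter)}"


first_attempt = SequenceId(0)
ids_first = [first_attempt.next_id() for _ in range(3)]

# A retry of the same task starts a fresh counter and reproduces the same IDs.
retry_attempt = SequenceId(0)
ids_retry = [retry_attempt.next_id() for _ in range(3)]

print(ids_first == ids_retry)
```

The IDs are unique across the job because the task index differs per task, and they are repeatable across attempts because nothing in them is random.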

> Add unique identifier UDF
> -
>
> Key: PIG-3257
> URL: https://issues.apache.org/jira/browse/PIG-3257
> Project: Pig
> Issue Type: Improvement
> Components: internal-udfs
> Reporter: Alan Gates
> Assignee: Alan Gates
> Fix For: 0.12
>
> Attachments: PIG-3257.patch
>
>
> It would be good to have a Pig function to generate unique identifiers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Commented] (PIG-3257) Add unique identifier UDF

2013-05-29 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669593#comment-13669593
 ] 

Alan Gates commented on PIG-3257:
-

Would it make you happy if we added to the javadoc comments on this function 
not to use it as a key in the same job it's generated in?



[jira] [Commented] (PIG-3257) Add unique identifier UDF

2013-05-29 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669195#comment-13669195
 ] 

Koji Noguchi commented on PIG-3257:
---

With your first example, say you have _n_ input records, 1 mapper, and 2 reducers.
{noformat}
A = load ...
B = group A by UUID();
STORE B ...
{noformat}
This job could successfully finish with output ranging from 0 to 2n records.
For example, the sequence of events could be:
   # mapper0_attempt0 finishes with n outputs, and say all n uuid keys are 
assigned to reducer0.
   # reducer0_attempt0 pulls the map outputs and produces _n_ outputs.
   # reducer1_attempt0 tries to pull the mapper0_attempt0 output and fails 
(this could be a fetch failure or a node failure).
   # mapper0_attempt1 is rerun, and this time all n uuid keys are assigned to 
reducer1.
   # reducer1_attempt0 pulls the mapper0_attempt1 output and produces n outputs.
   # The job finishes successfully with 2n outputs.

This is certainly unexpected to users.
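
The sequence of events above can be simulated with a small Python sketch. This is not Hadoop code: `partition` is a stand-in for a hash partitioner (using CRC32 is an assumption chosen only because it is stable between runs), and all function names are hypothetical. It counts the output when reducer0 reads the first map attempt and reducer1 reads the rerun.

```python
import uuid
import zlib

NUM_REDUCERS = 2


def partition(key):
    """Stable stand-in for a hash partitioner (CRC32 is an assumption)."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS


def job_output_count(attempt0_keys, attempt1_keys):
    """Count output records: reducer0 reads attempt 0, reducer1 reads attempt 1."""
    from_reducer0 = sum(1 for key in attempt0_keys if partition(key) == 0)
    from_reducer1 = sum(1 for key in attempt1_keys if partition(key) == 1)
    return from_reducer0 + from_reducer1


def random_keys(n):
    """Keys from a purely random generator: different on every attempt."""
    return [str(uuid.uuid4()) for _ in range(n)]


def sequence_keys(n, task_id=0):
    """Keys of the form taskid-sequence: identical on every attempt."""
    return [f"{task_id}-{i}" for i in range(n)]


n = 1000
random_total = job_output_count(random_keys(n), random_keys(n))
stable_total = job_output_count(sequence_keys(n), sequence_keys(n))
print(random_total, stable_total)
```

With repeatable keys the total is always n, because every record lands in the same partition in both attempts. With random keys each record's partition is drawn independently per attempt, so a record can appear zero, one, or two times and the total can land anywhere between 0 and 2n.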

Now, with your second example
{noformat}
A = load 'over100k' using org.apache.hcatalog.pig.HCatLoader();
B = foreach A generate *, UUID() as uuid;
C = group B by s;
D = foreach C generate flatten(B), SUM(B.i) as sum_b;
E = group B by si;
F = foreach E generate flatten(B), SUM(B.f) as sum_f;
G = join D by uuid, F by uuid;
H = foreach G generate D::B::s, sum_b, sum_f;
store H into 'output';
{noformat}

Let's say Pig decides to implement the two group-bys (C and E) with one 
map-reduce job. For simplicity, let's use 1 mapper and 2 reducers again and 
assume Pig decides to partition all of the group-by keys for _C_ to reducer0 
and all of those for _E_ to reducer1.  Now, following the same story as above, 
there could be a case where reducer0 (group-by-C) gets one set of UUIDs from 
mapper0_attempt0 and reducer1 (group-by-E) gets a completely different set of 
UUIDs from mapper0_attempt1.

When this happens, join _G_ would produce 0 results, which is unexpected to users.
Of course this depends on how Pig executes the above query, but I hope it 
demonstrates how tricky it gets when introducing a purely random ID in Hadoop.
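
The empty join can be illustrated with a short Python sketch. It does not model Pig's actual execution plan; it only assumes, as in the scenario above, that the two sides of the join were fed by two different attempts of the same mapper. All function names are hypothetical.

```python
import uuid


def tag_records(records, make_id):
    """Simulate one map attempt: attach an ID to every input record."""
    return [(make_id(index), record) for index, record in enumerate(records)]


def join_on_id(left, right):
    """Inner join of two lists of (id, record) pairs on the ID."""
    right_by_id = dict(right)
    return [(rid, lrec, right_by_id[rid]) for rid, lrec in left if rid in right_by_id]


def random_id(_index):
    """Purely random ID: a new value on every call and every attempt."""
    return str(uuid.uuid4())


def sequence_id(index, task_id=0):
    """Repeatable ID of the form taskid-sequence."""
    return f"{task_id}-{index}"


records = ["r0", "r1", "r2"]

# Side D sees attempt 0 and side F sees attempt 1 of the same mapper.
random_join = join_on_id(tag_records(records, random_id),
                         tag_records(records, random_id))
sequence_join = join_on_id(tag_records(records, sequence_id),
                           tag_records(records, sequence_id))

print(len(random_join), len(sequence_join))
```

With random IDs the two attempts share no keys (barring an astronomically unlikely collision), so the join comes back empty even though the job reports success. With repeatable IDs every record finds its match.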

What's worst about all this is that it is a corner case which won't get 
caught in users' QE phases and would only manifest in the production 
pipeline.  Users would then yell at me for corrupted output from successful 
jobs.  Hence my previous comment about a "support nightmare".








[jira] [Commented] (PIG-3257) Add unique identifier UDF

2013-05-28 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668748#comment-13668748
 ] 

Alan Gates commented on PIG-3257:
-

I don't see how records can be missing or redundant.  Take the following query:

{code}
A = load ...
B = group A by UUID();
C = foreach B...
{code}

This won't reduce at all.  For every record it is totally irrelevant what 
particular value its key is, because it's guaranteed to be unique for each 
record.  So 1) this is a totally meaningless thing to do; 2) if a particular 
map does get rerun or is used in speculative execution it doesn't matter, 
because which particular key is generated by UUID is irrelevant.  The way this 
is intended to be used is something like this:

{code}
A = load 'over100k' using org.apache.hcatalog.pig.HCatLoader();
B = foreach A generate *, UUID() as uuid;
C = group B by s;
D = foreach C generate flatten(B), SUM(B.i) as sum_b;
E = group B by si;
F = foreach E generate flatten(B), SUM(B.f) as sum_f;
G = join D by uuid, F by uuid;
H = foreach G generate D::B::s, sum_b, sum_f;
store H into 'output';
{code}




[jira] [Commented] (PIG-3257) Add unique identifier UDF

2013-05-28 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668717#comment-13668717
 ] 

Koji Noguchi commented on PIG-3257:
---

bq. incomplete/incorrect output 
I mean, this can result in missing records or redundant records.  (support 
nightmare for me.)



[jira] [Commented] (PIG-3257) Add unique identifier UDF

2013-05-28 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668705#comment-13668705
 ] 

Koji Noguchi commented on PIG-3257:
---

bq. I can't see how it would matter whether it produced random key X1 vs random 
key X2 for any given record.

If used as a MapReduce key, this can lead to incomplete/incorrect output when 
mappers are retried.



[jira] [Commented] (PIG-3257) Add unique identifier UDF

2013-05-28 Thread Alan Gates (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668691#comment-13668691
 ] 

Alan Gates commented on PIG-3257:
-

No, it would not, but it would be very weird to use this as a key anyway, since 
it would produce a different random key for each record.  I can't see how it 
would matter whether it produced random key X1 vs random key X2 for any given 
record.



[jira] [Commented] (PIG-3257) Add unique identifier UDF

2013-05-28 Thread Koji Noguchi (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668630#comment-13668630
 ] 

Koji Noguchi commented on PIG-3257:
---

Would this ensure that the same unique identifier is reproduced when a (map) 
task attempt is retried?  Otherwise, I'm afraid it would lead to random Pig 
behavior when we use this id as the map-reduce key.



[jira] [Commented] (PIG-3257) Add unique identifier UDF

2013-05-28 Thread Cheolsoo Park (JIRA)

[ 
https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668605#comment-13668605
 ] 

Cheolsoo Park commented on PIG-3257:


+1.
