[jira] [Commented] (PIG-3257) Add unique identifier UDF
[ https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14189670#comment-14189670 ]

Daniel Dai commented on PIG-3257:
---------------------------------

Since we cannot reach consensus, I will close this issue and provide a SequenceID in PIG-4253 instead.

> Add unique identifier UDF
> -------------------------
>
> Key: PIG-3257
> URL: https://issues.apache.org/jira/browse/PIG-3257
> Project: Pig
> Issue Type: Improvement
> Components: internal-udfs
> Reporter: Alan Gates
> Assignee: Alan Gates
> Fix For: 0.14.0
>
> Attachments: PIG-3257.patch
>
>
> It would be good to have a Pig function to generate unique identifiers.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (PIG-3257) Add unique identifier UDF
[ https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669660#comment-13669660 ]

Rohini Palaniswamy commented on PIG-3257:
-----------------------------------------

Alan,

Why don't we do it as a sequence instead of generating random numbers? Something like mapid- or reduceid-, i.e. the first mapper will do 0-0, 0-1..0-1 and the 2nd mapper will do 1-0, 1-1,...1-1. Just an idea, and we can think of a better implementation. It will not be in sequence across the job anyway, but it will be in sequence within the map, and it can be used as a UUID across the job that is repeatable if run with the same number of mappers/reducers. This would avoid all the problems of using random numbers. It would also avoid the human mistake of writing a script without understanding the internals of how UUID is going to work, which I don't think a user should have to be bothered with.

> Add unique identifier UDF
> -------------------------
>
> Key: PIG-3257
> URL: https://issues.apache.org/jira/browse/PIG-3257
> Project: Pig
> Issue Type: Improvement
> Components: internal-udfs
> Reporter: Alan Gates
> Assignee: Alan Gates
> Fix For: 0.12
>
> Attachments: PIG-3257.patch
>
>
> It would be good to have a Pig function to generate unique identifiers.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
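The sequence idea in Rohini's comment can be illustrated with a small sketch. It is not the PIG-3257 patch nor the SequenceID from PIG-4253; the function names are hypothetical, and the ID layout `<task index>-<record number>` is an assumed reading of the "mapid-" proposal. It also assumes a task reads its input in the same order on every attempt.

```python
from itertools import count


def sequence_ids(task_index):
    """Yield IDs of the form '<task index>-<record number>' for one task.

    Hypothetical helper: the ID depends only on the task index and the
    position of the record within that task, never on a random source.
    """
    for record_number in count():
        yield f"{task_index}-{record_number}"


def tag_records(task_index, records):
    """Pair every record handled by one task with its sequence ID."""
    return list(zip(sequence_ids(task_index), records))


# A retried attempt of task 0 yields exactly the same IDs as the first attempt.
first_attempt = tag_records(0, ["a", "b", "c"])
retry_attempt = tag_records(0, ["a", "b", "c"])
```

Because the IDs are a pure function of the task index and the record position, a rerun or speculative attempt regenerates identical keys, which is the repeatability property the comment asks for. IDs from different tasks never collide as long as task indexes are distinct.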
[jira] [Commented] (PIG-3257) Add unique identifier UDF
[ https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669593#comment-13669593 ]

Alan Gates commented on PIG-3257:
---------------------------------

Would it make you happy if we added a note to the javadoc comments on this function saying not to use it as a key in the same job it's generated in?
[jira] [Commented] (PIG-3257) Add unique identifier UDF
[ https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13669195#comment-13669195 ]

Koji Noguchi commented on PIG-3257:
-----------------------------------

With your first example, say you have _n_ input records, 1 mapper and 2 reducers.

{noformat}
A = load ...
B = group A by UUID();
STORE B ...
{noformat}

This job could finish successfully with anywhere from 0 to 2n output records. For example, the sequence of events can be:

# mapper0_attempt0 finishes with n outputs and, say, all n uuid keys are assigned to reducer0.
# reducer0_attempt0 pulls the map outputs and produces _n_ outputs.
# reducer1_attempt0 tries to pull mapper0_attempt0 output and fails (could be a fetch failure or a node failure).
# mapper0_attempt1 is rerun. This time, all n uuid keys are assigned to reducer1.
# reducer1_attempt0 pulls mapper0_attempt1 output and produces n outputs.
# The job finishes successfully with 2n outputs.

This is certainly unexpected to users.

Now, with your second example:

{noformat}
A = load 'over100k' using org.apache.hcatalog.pig.HCatLoader();
B = foreach A generate *, UUID();
C = group B by s;
D = foreach C generate flatten(B), SUM(B.i) as sum_b;
E = group B by si;
F = foreach E generate flatten(B), SUM(B.f) as sum_f;
G = join D by uuid, F by uuid;
H = foreach G generate D::B::s, sum_b, sum_f;
store H into 'output';
{noformat}

Let's say Pig decides to implement the two group-bys (C and E) with one map-reduce job. For simplicity, let's use 1 mapper and 2 reducers again, and assume Pig decides to partition all of group-by _C_ to reducer0 and all of group-by _E_ to reducer1. Now, following the same story as above, there could be a case where reducer0 (group-by C) gets one set of UUIDs from mapper0_attempt0 and reducer1 (group-by E) gets a completely different set of UUIDs from mapper0_attempt1. When this happens, join _G_ would produce 0 results, which is unexpected to users.

Of course this depends on how Pig performs the above query, but I hope it demonstrates how tricky it gets when introducing a purely random id in Hadoop. What's worse about all this is that it is a corner case that won't get caught in users' QE phases and would only manifest in the production pipeline. Users would then yell at me for corrupted output from successful jobs. Thus my previous comment on "support nightmare".
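The retry scenario Koji describes can be reproduced with a toy simulation. This is a sketch under stated assumptions, not Hadoop or Pig code: `partition`, `run_map_attempt` and `run_job` are hypothetical names, the CRC32-based partitioner merely stands in for a hash partitioner, and the "fetch failure" is modelled by having reducer1 read the output of a second map attempt.

```python
import uuid
import zlib

NUM_REDUCERS = 2


def partition(key):
    """Toy stand-in for a hash partitioner: map a key to a reducer index."""
    return zlib.crc32(key.encode("utf-8")) % NUM_REDUCERS


def run_map_attempt(records, make_key):
    """One map attempt: assign a key to each record and partition the output."""
    outputs = {reducer: [] for reducer in range(NUM_REDUCERS)}
    for index, record in enumerate(records):
        key = make_key(index)
        outputs[partition(key)].append((key, record))
    return outputs


def run_job(records, make_key):
    """Simulate 1 mapper and 2 reducers where the mapper is run twice.

    reducer0 fetches its partition from attempt 0; reducer1 hits a fetch
    failure and fetches its partition from the rerun (attempt 1).
    """
    attempt0 = run_map_attempt(records, make_key)
    attempt1 = run_map_attempt(records, make_key)
    return attempt0[0] + attempt1[1]


def random_key(index):
    """A fresh random key on every call, so reruns produce different keys."""
    return str(uuid.uuid4())


def sequence_key(index):
    """A key derived only from the record position, so reruns repeat it."""
    return f"0-{index}"
```

Calling `run_job(records, random_key)` on n records can return anywhere between 0 and 2n rows, with some records duplicated and others missing, while `run_job(records, sequence_key)` returns each record exactly once because both attempts partition the records identically.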
[jira] [Commented] (PIG-3257) Add unique identifier UDF
[ https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668748#comment-13668748 ]

Alan Gates commented on PIG-3257:
---------------------------------

I don't see how records can be missing or redundant. Take the following query:

{code}
A = load ...
B = group A by UUID();
C = foreach B...
{code}

This won't reduce at all. For every record it is totally irrelevant what particular value its key has, because the key is guaranteed to be unique for each record. So 1) this is a totally meaningless thing to do; 2) if a particular map does get rerun or is used in speculative execution, it doesn't matter, because which particular key is generated by UUID is irrelevant.

The way this is intended to be used is something like this:

{code}
A = load 'over100k' using org.apache.hcatalog.pig.HCatLoader();
B = foreach A generate *, UUID();
C = group B by s;
D = foreach C generate flatten(B), SUM(B.i) as sum_b;
E = group B by si;
F = foreach E generate flatten(B), SUM(B.f) as sum_f;
G = join D by uuid, F by uuid;
H = foreach G generate D::B::s, sum_b, sum_f;
store H into 'output';
{code}
[jira] [Commented] (PIG-3257) Add unique identifier UDF
[ https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668717#comment-13668717 ]

Koji Noguchi commented on PIG-3257:
-----------------------------------

bq. incomplete/incorrect output

I mean, this can result in missing records or redundant records. (Support nightmare for me.)
[jira] [Commented] (PIG-3257) Add unique identifier UDF
[ https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668705#comment-13668705 ]

Koji Noguchi commented on PIG-3257:
-----------------------------------

bq. I can't see how it would matter whether it produced random key X1 vs random key X2 for any given record.

If used as a mapreduce key, this can lead to incomplete/incorrect output when mappers are retried.
[jira] [Commented] (PIG-3257) Add unique identifier UDF
[ https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668691#comment-13668691 ]

Alan Gates commented on PIG-3257:
---------------------------------

No, it would not, but it would be very weird to use this as a key anyway, since it would produce a different random key for each record. I can't see how it would matter whether it produced random key X1 vs random key X2 for any given record.
[jira] [Commented] (PIG-3257) Add unique identifier UDF
[ https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668630#comment-13668630 ]

Koji Noguchi commented on PIG-3257:
-----------------------------------

Would this ensure that the same unique identifier is reproduced when a (map) task attempt is retried? Otherwise, I'm afraid it would lead to random Pig behavior when we use this id as the map-reduce key.
[jira] [Commented] (PIG-3257) Add unique identifier UDF
[ https://issues.apache.org/jira/browse/PIG-3257?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13668605#comment-13668605 ]

Cheolsoo Park commented on PIG-3257:

+1.