[ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13173405#comment-13173405 ]
Gianmarco De Francisci Morales commented on PIG-2353:
-----------------------------------------------------
My idea is a distributed implementation of RANK that works in the following
manner:
Run a map-only job with n mappers. Each mapper just counts the records in its
input split and accumulates the count in an internal variable (or,
alternatively, uses dynamic counters).
At the end, we have a map(partition_id => number_of_records).
This map is small enough to be put in the distributed cache.
Compute the cumulative sum of the record counts, so that for each partition we
know how many records precede it.
Then launch a second map-only job with exactly n mappers. Each mapper reads its
input split and the cumulative number of records preceding it, initializes its
counter with this value, and finally RANKs the records as they come in.
This should scale well: both jobs are map-only, and the only shared state is
the small per-partition count map.
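
To make the two passes concrete, here is a minimal single-process sketch in
Python. It is not Pig or Hadoop code: in-memory lists stand in for input
splits, a plain dict stands in for the distributed cache, and the function
names (count_records, starting_offsets, rank_split, distributed_rank) are
hypothetical. It also assumes that ranks start at 1 and that both passes see
the same splits in the same partition order.

```python
from typing import Dict, Iterator, List, Tuple


def count_records(splits: List[List[str]]) -> Dict[int, int]:
    """Pass 1: each 'mapper' only counts the records in its own split."""
    return {partition_id: len(split) for partition_id, split in enumerate(splits)}


def starting_offsets(counts: Dict[int, int]) -> Dict[int, int]:
    """Cumulative sum: number of records that precede each partition."""
    offsets: Dict[int, int] = {}
    running_total = 0
    for partition_id in sorted(counts):
        offsets[partition_id] = running_total
        running_total += counts[partition_id]
    return offsets


def rank_split(split: List[str], offset: int) -> Iterator[Tuple[int, str]]:
    """Pass 2: a 'mapper' starts its counter at the offset and ranks in order."""
    # Assumption: ranks are 1-based, so the first record overall gets rank 1.
    for rank, record in enumerate(split, start=offset + 1):
        yield rank, record


def distributed_rank(splits: List[List[str]]) -> List[Tuple[int, str]]:
    """Run both passes sequentially; a real job would run the mappers in parallel."""
    counts = count_records(splits)          # map(partition_id => number_of_records)
    offsets = starting_offsets(counts)      # small map, shared with every mapper
    ranked: List[Tuple[int, str]] = []
    for partition_id, split in enumerate(splits):
        ranked.extend(rank_split(split, offsets[partition_id]))
    return ranked
```

For example, distributed_rank([["a", "b"], ["c"]]) yields
[(1, "a"), (2, "b"), (3, "c")]. Each call to rank_split depends only on its own
split and one integer from the offset map, which is why the second pass needs
no communication between mappers.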
I haven't figured out how to integrate it into Pig yet.
> RANK function like in SQL
> -------------------------
>
> Key: PIG-2353
> URL: https://issues.apache.org/jira/browse/PIG-2353
> Project: Pig
> Issue Type: New Feature
> Reporter: Gianmarco De Francisci Morales
> Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique,
> increasing identifier without gaps, like what RANK does for SQL.