[ https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194966#comment-13194966 ]
David Ciemiewicz commented on PIG-2353:
---------------------------------------
There is a much more efficient way to compute RANK, DENSE_RANK, CUMULATIVE_SUM
and more if you have billions of rows of data, especially if the data follows a
power-law/Zipf distribution (as queries do). It involves using Map-Reduce to
compute a histogram of the frequencies/counts and then serializing and sorting
the histogram, which is only something like 20,000 rows for 1B queries.
https://issues.apache.org/jira/browse/PIG-821
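
The histogram idea above can be sketched in a few lines of Python. This is
only an illustration under assumptions, not the method from PIG-821 or the
attached patch: the function name rank_tables, the sample query_counts data,
and the choice to rank by descending count are all hypothetical, and
"serializing" is read here as "scan the small sorted histogram on one
machine". In a real job the Counter step would be the Map-Reduce pass.

```python
from collections import Counter

def rank_tables(counts):
    """Map each distinct count value to its rank, dense rank and cumulative sum.

    counts: iterable with one frequency per row (e.g. one per query).
    """
    # Step 1 (the distributed part): histogram of the count values,
    # i.e. count value -> number of rows that have that value.
    histogram = Counter(counts)

    # Step 2: the histogram is small, so sort it and scan it serially.
    table = {}
    rows_seen = 0      # rows with a strictly larger count
    running_sum = 0    # sum of counts, including all rows tied at this value
    for dense_rank, value in enumerate(sorted(histogram, reverse=True), start=1):
        n = histogram[value]
        running_sum += value * n
        table[value] = {
            "rank": rows_seen + 1,       # ties share a rank, gaps follow
            "dense_rank": dense_rank,    # ties share a rank, no gaps
            "cumulative_sum": running_sum,
        }
        rows_seen += n
    return table

# Hypothetical sample data: query -> number of times it was issued.
query_counts = {"a": 50, "b": 20, "c": 20, "d": 1, "e": 1, "f": 1}
table = rank_tables(query_counts.values())

# Step 3: join the small table back onto the rows by count value.
ranked = {query: table[count]["rank"] for query, count in query_counts.items()}
```

The lookup table has one entry per distinct count value rather than one per
row, so the final join can be done map-side without a global sort of the
full data set.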
> RANK function like in SQL
> -------------------------
>
> Key: PIG-2353
> URL: https://issues.apache.org/jira/browse/PIG-2353
> Project: Pig
> Issue Type: New Feature
> Reporter: Gianmarco De Francisci Morales
> Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique,
> increasing identifier without gaps, like what RANK does for SQL.