[ 
https://issues.apache.org/jira/browse/PIG-2353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194966#comment-13194966
 ] 

David Ciemiewicz commented on PIG-2353:
---------------------------------------

There is a much more efficient way to compute RANK, DENSE_RANK, CUMULATIVE_SUM,
and more if you have billions of rows of data, especially if the data follows a
power-law/Zipf distribution (as queries do). Use Map-Reduce to compute a
histogram of the frequencies/counts, then serialize and sort the histogram,
which is only something like 20,000 rows for 1B queries.

https://issues.apache.org/jira/browse/PIG-821
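
To make the histogram idea concrete, here is a minimal single-process sketch in
Python. It is not the code from PIG-821 or from the attached patch; the function
name rank_tables and the sample data are hypothetical. It assumes the value being
ranked is a per-query count, that ranks run from the highest count down, and that
the cumulative sum includes all rows tied at a value. In a real job the Counter
step would be the Map-Reduce pass, and only the small histogram is sorted.

```python
from collections import Counter


def rank_tables(values):
    """Build value -> RANK, DENSE_RANK, and cumulative-sum lookup tables.

    Hypothetical helper for illustration only.
    """
    # Step 1: histogram of value -> number of rows carrying that value.
    # (This is the part a Map-Reduce job would compute over the full data.)
    histogram = Counter(values)

    rank, dense_rank, cumulative_sum = {}, {}, {}
    rows_seen = 0       # rows with a strictly higher value than the current one
    running_total = 0   # sum of all values down to and including the current one

    # Step 2: sort only the distinct values, highest first.
    ordered = sorted(histogram.items(), reverse=True)
    for position, (value, freq) in enumerate(ordered, start=1):
        rank[value] = rows_seen + 1      # SQL RANK: gaps after ties
        dense_rank[value] = position     # SQL DENSE_RANK: no gaps
        rows_seen += freq
        running_total += value * freq
        cumulative_sum[value] = running_total  # assumption: ties included
    return rank, dense_rank, cumulative_sum


# Sample per-query counts (made up for the example).
counts = [50, 7, 7, 3, 1, 1, 1]
rank, dense_rank, cumulative_sum = rank_tables(counts)

# Step 3: attach the results to each row with a lookup (a join in Pig terms).
ranked_rows = [(c, rank[c], dense_rank[c], cumulative_sum[c]) for c in counts]
```

The full data set is never sorted; each row only needs a lookup against the
sorted histogram, which stays small when the distribution is heavily skewed.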
                
> RANK function like in SQL
> -------------------------
>
>                 Key: PIG-2353
>                 URL: https://issues.apache.org/jira/browse/PIG-2353
>             Project: Pig
>          Issue Type: New Feature
>            Reporter: Gianmarco De Francisci Morales
>         Attachments: PIG2353.patch
>
>
> Implement a function that given a (sorted) bag adds to each tuple a unique, 
> increasing identifier without gaps, like what RANK does for SQL.


        
