An efficient implementation of top K without full histogramming would still be very, very useful.
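
As a rough illustration of the map-side pruning described in the quoted issue below, here is a minimal Python sketch. It is not Pig's Algebraic interface and not the sample implementation promised in the issue; the function names `top_k_partial` and `top_k_merge` and the sample splits are hypothetical. Each split keeps only its own K largest values, and the final step merges those small candidate lists and prunes to K again, so no histogram of the full data is needed.

```python
import heapq
from typing import Iterable, List


def top_k_partial(values: Iterable[int], k: int) -> List[int]:
    """Map/combiner step (hypothetical name): keep the k largest values of one split."""
    return heapq.nlargest(k, values)


def top_k_merge(partials: Iterable[List[int]], k: int) -> List[int]:
    """Reduce step (hypothetical name): merge per-split candidates, prune to k again."""
    merged = (v for partial in partials for v in partial)
    return heapq.nlargest(k, merged)


# Three made-up input splits; each is pruned independently before any data moves.
splits = [[5, 1, 9, 3], [7, 8, 2], [4, 10, 6]]
partials = [top_k_partial(s, 3) for s in splits]
result = top_k_merge(partials, 3)
print(result)
```

At most K values leave each split, which is the source of the bandwidth saving. The result is exact because any value in the global top K must also be in the top K of its own split.
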

On Sat, Jun 7, 2008 at 5:35 PM, Pi Song (JIRA) <[EMAIL PROTECTED]> wrote:

>
> [ https://issues.apache.org/jira/browse/PIG-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12603350#action_12603350 ]
>
> Pi Song commented on PIG-171:
> -----------------------------
>
> Seems like all efficient histogram generation algorithms are probabilistic,
> so my optimization idea wouldn't work.
>
> > Top K
> > -----
> >
> >                 Key: PIG-171
> >                 URL: https://issues.apache.org/jira/browse/PIG-171
> >             Project: Pig
> >          Issue Type: New Feature
> >            Reporter: Amir Youssefi
> >            Assignee: Amir Youssefi
> >
> > Frequently, users are interested in top results (especially Top K rows).
> > This can be implemented efficiently in Pig/MapReduce settings to deliver
> > rapid results and low network bandwidth/memory usage.
> >
> > The key point is to prune all data on the map side and keep only a small
> > set of rows that meet the Top criteria. We can do it in an Algebraic
> > function (combiner) with multiple-value output. Only a small data set
> > gets out of the mapper node.
> >
> > The same idea is applicable to variants of this problem:
> >   - An Algebraic Function for 'Top K Rows'
> >   - An Algebraic Function for 'Top K' values ('Top Rank K' and
> >     'Top Dense Rank K')
> >   - TOP K ORDER BY.
> >
> > In other words, the implementation is similar to combiners for aggregate
> > functions, but instead of one value we get multiple ones.
> >
> > I will add a sample implementation for Top K Rows and possibly TOP K
> > ORDER BY to clarify details.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>

--
ted
