If I want to do a sample, I will typically filter by a uniform random number, not take the top k.
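A minimal sketch of that kind of filter, in Python purely for illustration. The name `sample_by_filter` and its parameters are hypothetical and not part of Pig; each row is kept independently with probability `fraction`, so the sample size is only approximately `fraction` times the input size.

```python
import random


def sample_by_filter(rows, fraction, seed=None):
    """Keep each row independently with probability `fraction`.

    `seed` is an optional assumption added here so runs can be repeated.
    """
    rng = random.Random(seed)
    for row in rows:
        # rng.random() is uniform on [0.0, 1.0); rows whose draw falls
        # below the threshold pass the filter.
        if rng.random() < fraction:
            yield row


# Roughly 10% of the rows survive; the exact count varies with the seed.
sample = list(sample_by_filter(range(10_000), 0.10, seed=42))
```

Because every row is tested on its own, the filter needs no global state and can run on each mapper without coordination.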
And if I do take the top K, K is usually fairly small so sorting it by conventional mechanisms later is fine by me.

On Sun, Jun 8, 2008 at 4:05 AM, Pi Song (JIRA) <[EMAIL PROTECTED]> wrote:

>
> [ https://issues.apache.org/jira/browse/PIG-171?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12603362#action_12603362]
>
> Pi Song commented on PIG-171:
> -----------------------------
>
> Ted (From mailing-list):
> bq. An efficient implementation of top K without full histogramming would
> still be very, very useful.
>
> Logically (not by experience) I still concern about TOP K without order.
> Does this thing really have a good use? The formal definition of TOP K
> always goes with scoring function. Naturally, we also say we want TOP K
> order by something.
>
> The only use case that I would think people might be doing TOP K without
> order is just to work with sample data. But then doing TOP K is not gonna
> give a statistically good representation. My idea is that it should be
> better if we design the language by not allowing people to do the wrong
> thing.
>
> If people want to do approximate queries I think we'd better provide a
> proper way like adding:-
>
> {code}
> X = SAMPLE 10% OF A ;
> Y = SAMPLE 100 OF B ;
> {code}
>
> What do you think?
>
> > Top K
> > -----
> >
> >                 Key: PIG-171
> >                 URL: https://issues.apache.org/jira/browse/PIG-171
> >             Project: Pig
> >          Issue Type: New Feature
> >            Reporter: Amir Youssefi
> >            Assignee: Amir Youssefi
> >
> > Frequently, users are interested on Top results (especially Top K rows) .
> > This can be implemented efficiently in Pig /Map Reduce settings to deliver
> > rapid results and low Network Bandwidth/Memory usage.
> >
> > Key point is to prune all data on the map side and keep only small set
> > of rows with Top criteria . We can do it in Algebraic function (combiner)
> > with multiple value output. Only a small data-set gets out of mapper node.
> > The same idea is applicable to solve variants of this problem:
> >   - An Algebraic Function for 'Top K Rows'
> >   - An Algebraic Function for 'Top K' values ('Top Rank K' and 'Top Dense
> >     Rank K')
> >   - TOP K ORDER BY.
> > Another words implementation is similar to combiners for aggregate
> > functions but instead of one value we get multiple ones.
> > I will add a sample implementation for Top K Rows and possibly TOP K
> > ORDER BY to clarify details.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>

--
ted
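
For reference, the map-side pruning described in the quoted issue can be sketched roughly as follows. This is an illustration in Python, not Pig's actual Algebraic interface: the names `top_k_partial` and `top_k_merge` are hypothetical, and the partitions are plain in-memory lists standing in for mapper inputs. Each partition is reduced to at most K rows, and the small partial lists are then merged and pruned again.

```python
import heapq


def top_k_partial(rows, k, key=None):
    """Map/combiner step (hypothetical name): keep at most k rows of one partition."""
    # heapq.nlargest returns the k largest items in descending order.
    return heapq.nlargest(k, rows, key=key)


def top_k_merge(partials, k, key=None):
    """Reduce step (hypothetical name): merge the partial lists and prune to k."""
    merged = (row for part in partials for row in part)
    return heapq.nlargest(k, merged, key=key)


# Three partitions standing in for three mappers.
partitions = [[5, 1, 9, 3], [7, 2, 8], [4, 10, 6]]
partials = [top_k_partial(p, 3) for p in partitions]  # at most 3 rows leave each "mapper"
result = top_k_merge(partials, 3)  # [10, 9, 8]
```

Applying `top_k_partial` to its own output leaves it unchanged, which is the property that makes the combiner safe to run any number of times. With K small, the final list is cheap to sort or re-sort by conventional means.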
