[jira] [Commented] (FLINK-2549) Add topK operator for DataSet

Stephan Ewen (JIRA) Thu, 20 Aug 2015 01:24:07 -0700

    [ https://issues.apache.org/jira/browse/FLINK-2549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704498#comment-14704498 ]


Stephan Ewen commented on FLINK-2549:
-------------------------------------

You can implement topK on top of sort()/first().

It will be much less efficient than it could be, though. With that strategy, you 
need to sort the whole input, which is computationally more intensive and may 
need to spill to disk for large data.

Using a heap bounded to k elements, you can always keep just the lowest k 
elements seen so far. That way, you avoid sorting most elements (they can be 
discarded immediately) and need little memory (only k elements), most likely 
never spilling.
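
To illustrate the heap idea, here is a minimal sketch in plain Java. It is not 
Flink code and not a proposed operator API: the class name TopKSketch and the 
method lowestK are hypothetical, and it assumes the input fits the simple 
"iterate once over the elements" model. It keeps a max-heap of at most k 
elements, so the head of the heap is always the worst element currently kept.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical illustration only; none of these names exist in Flink.
public class TopKSketch {

    /** Returns the k lowest elements of input under cmp, in ascending order. */
    public static <T> List<T> lowestK(Iterable<T> input, int k, Comparator<T> cmp) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        // Max-heap under cmp: the head is the largest of the elements kept so far.
        PriorityQueue<T> heap = new PriorityQueue<>(k, cmp.reversed());
        for (T value : input) {
            if (heap.size() < k) {
                // Heap not full yet: keep the element unconditionally.
                heap.add(value);
            } else if (cmp.compare(value, heap.peek()) < 0) {
                // value is lower than the worst kept element: replace it.
                heap.poll();
                heap.add(value);
            }
            // Otherwise value is discarded immediately, without any sorting.
        }
        // Only the k survivors are sorted, not the whole input.
        List<T> result = new ArrayList<>(heap);
        result.sort(cmp);
        return result;
    }

    public static void main(String[] args) {
        List<Integer> data = Arrays.asList(7, 3, 9, 1, 8, 2, 6);
        List<Integer> lowest = lowestK(data, 3, Comparator.<Integer>naturalOrder());
        System.out.println(lowest); // prints [1, 2, 3]
    }
}
```

Each element costs at most O(log k) heap work and memory stays at k elements, 
compared with O(n log n) time and memory for the full input when sorting 
everything first. Passing a reversed comparator would give the highest k 
elements instead.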

> Add topK operator for DataSet
> -----------------------------
>
>                 Key: FLINK-2549
>                 URL: https://issues.apache.org/jira/browse/FLINK-2549
>             Project: Flink
>          Issue Type: New Feature
>          Components: Core, Java API, Scala API
>            Reporter: Chengxiang Li
>            Assignee: Chengxiang Li
>            Priority: Minor
>
> topK is a common operation for users; it would be great to have it in Flink.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
