[ 
https://issues.apache.org/jira/browse/TEZ-1608?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14146038#comment-14146038
 ] 

Krisztian Horvath commented on TEZ-1608:
----------------------------------------

I received a few suggested improvements, which I'm going to apply:
For the top K, the sum task could maintain a local top K and output only 
those K entries; the writer would then pick the global top K from the local 
top Ks. That would reduce the data transfer quite a bit. We may then be able 
to use an UnorderedKVEdge instead of an OrderedPartitionedKVEdge, which would 
avoid the need to sort at the output and merge sort at the input.
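
A rough sketch of the local/global top K idea in plain Java, just to 
illustrate the suggestion. It is not taken from the patch and uses no Tez API. 
The class and method names (TopKSketch, topK, globalTopK) are made up for 
illustration. It assumes each key is fully summed in exactly one sum task, so 
the local lists never hold partial counts for the same key.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;

// Hypothetical helper, not part of Tez or of the attached patch.
final class TopKSketch {

    // Order by count, then by key so that ties are resolved deterministically.
    static final Comparator<Map.Entry<String, Long>> BY_COUNT =
        Map.Entry.<String, Long>comparingByValue()
            .thenComparing(Map.Entry.<String, Long>comparingByKey());

    // Keeps at most k entries in a min-heap; the smallest is evicted first.
    // Returns the survivors sorted from largest to smallest count.
    static List<Map.Entry<String, Long>> topK(
            Iterable<Map.Entry<String, Long>> entries, int k) {
        if (k <= 0) {
            return new ArrayList<>();
        }
        PriorityQueue<Map.Entry<String, Long>> heap = new PriorityQueue<>(BY_COUNT);
        for (Map.Entry<String, Long> e : entries) {
            heap.offer(e);
            if (heap.size() > k) {
                heap.poll(); // drop the current minimum
            }
        }
        List<Map.Entry<String, Long>> result = new ArrayList<>(heap);
        result.sort(BY_COUNT.reversed());
        return result;
    }

    // Writer side: merge the local top K lists and take the top K again.
    // Input order does not matter, so no sorted edge is required for this step.
    static List<Map.Entry<String, Long>> globalTopK(
            List<List<Map.Entry<String, Long>>> localTopKs, int k) {
        List<Map.Entry<String, Long>> all = new ArrayList<>();
        for (List<Map.Entry<String, Long>> local : localTopKs) {
            all.addAll(local);
        }
        return topK(all, k);
    }

    public static void main(String[] args) {
        // Two pretend sum tasks, each holding complete counts for its own keys.
        Map<String, Long> taskA = new HashMap<>();
        taskA.put("u1", 5L);
        taskA.put("u2", 9L);
        taskA.put("u3", 1L);
        Map<String, Long> taskB = new HashMap<>();
        taskB.put("u4", 7L);
        taskB.put("u5", 2L);
        taskB.put("u6", 8L);

        int k = 2;
        List<List<Map.Entry<String, Long>>> locals = new ArrayList<>();
        locals.add(topK(taskA.entrySet(), k)); // each task emits only k entries
        locals.add(topK(taskB.entrySet(), k));
        System.out.println(globalTopK(locals, k));
    }
}
```

Each sum task then sends at most K records to the writer instead of its whole 
output, and the writer only has to scan (number of sum tasks) x K records in 
whatever order they arrive.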

> TopK example
> ------------
>
>                 Key: TEZ-1608
>                 URL: https://issues.apache.org/jira/browse/TEZ-1608
>             Project: Apache Tez
>          Issue Type: Sub-task
>    Affects Versions: 0.5.0
>            Reporter: Janos Matyas
>         Attachments: TEZ-1608-1.patch
>
>
> The goal of this sample is to find the top K elements of a dataset, while 
> guiding the reader through the basics of Tez (DAG creation, tokenizers, 
> custom comparators and parallelism). 
> An example use case for top K:
>   Given a large data set in CSV format of user comments on a site, listed as 
> userid,postid,commentid,comment,timestamp, we are looking for the top K 
> commenters or the posts with the most comments. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
