[ 
https://issues.apache.org/jira/browse/TEZ-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055251#comment-14055251
 ] 

Siddharth Seth commented on TEZ-1265:
-------------------------------------

I'd go with the approach of having the partitioner vertex generate the stats, 
rather than having downstream tasks provide this information. The partitioner 
can tell us earlier which downstream tasks need to be started (although this 
info may not be enough on its own). Some speculative task launches with 
absolute limits are possible as well: that ends up pulling and processing a 
little extra data, but tasks are launched earlier.
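
As a rough illustration of the idea above, the sketch below assumes the 
partitioner reports an estimated record count per downstream task and uses 
that to decide how many tasks to start for a given limit, plus a capped number 
of speculative extras. The class, method, and parameter names are hypothetical 
and are not part of the Tez API.

```java
public class LaunchEstimateSketch {
    /**
     * Returns how many downstream tasks (task0, task1, ...) to launch.
     *
     * recordsPerTask: hypothetical per-task record estimates from the partitioner.
     * limit: the LIMIT m value.
     * speculativeExtra: absolute cap on extra tasks launched speculatively,
     *                   in case the estimates turn out to be too low.
     */
    static int tasksToLaunch(long[] recordsPerTask, long limit, int speculativeExtra) {
        long seen = 0;
        int needed = 0;
        // Walk tasks in order until the estimated records cover the limit.
        while (needed < recordsPerTask.length && seen < limit) {
            seen += recordsPerTask[needed];
            needed++;
        }
        // Add the speculative tasks, never exceeding the total task count.
        return Math.min(recordsPerTask.length, needed + speculativeExtra);
    }

    public static void main(String[] args) {
        long[] estimates = {40, 0, 70, 500, 300};
        System.out.println(tasksToLaunch(estimates, 100, 0)); // 3
        System.out.println(tasksToLaunch(estimates, 100, 1)); // 4
    }
}
```

The speculative tasks may pull and process data that is never needed, which is 
the trade-off mentioned above for launching earlier.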

> Custom input to fetch source task inputs in order
> -------------------------------------------------
>
>                 Key: TEZ-1265
>                 URL: https://issues.apache.org/jira/browse/TEZ-1265
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>
> Consider the case of having to LIMIT m records after an ORDER BY. A 
> distributed order-by vertex produces data in sorted order across 
> task0, task1, ..., taskn. Each task limits its output to m records (the 
> output count can also be less than m). The limit vertex (parallelism 1) 
> following the order-by vertex has to fetch the output of all n tasks, 
> shuffle-merge its inputs (to maintain the order), and then limit to m 
> records again. So we need an input that fetches from source tasks in order 
> and reads them in order. Since the data produced is ordered across 
> task0, task1, ..., taskn, it can be consumed without a shuffle and sort. 
> If the limit is hit early, it can skip fetching more task inputs. 
> More details in 
> https://issues.apache.org/jira/browse/PIG-4049?focusedCommentId=14053217&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14053217
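
The ordered read described in the issue can be sketched roughly as follows. 
This is a minimal illustration, not the Tez input implementation: the 
`fetcher` function is a hypothetical stand-in for fetching one source task's 
already-sorted output, and all names here are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.IntFunction;

public class OrderedLimitSketch {
    /**
     * Reads source task outputs in task-index order and stops as soon as
     * `limit` records have been collected. No merge is needed because the
     * data is assumed to be globally ordered across task0, task1, ...
     */
    static <T> List<T> readInOrder(int numTasks, int limit, IntFunction<List<T>> fetcher) {
        List<T> result = new ArrayList<>();
        // The loop condition skips fetching later tasks once the limit is hit.
        for (int task = 0; task < numTasks && result.size() < limit; task++) {
            for (T record : fetcher.apply(task)) {
                if (result.size() >= limit) {
                    break;
                }
                result.add(record);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Hypothetical sorted outputs of three source tasks.
        List<List<Integer>> outputs = Arrays.asList(
                Arrays.asList(1, 2), Arrays.asList(3, 4, 5), Arrays.asList(6, 7));
        int[] fetches = {0};
        List<Integer> top = readInOrder(outputs.size(), 4, task -> {
            fetches[0]++;
            return outputs.get(task);
        });
        // Prints: [1, 2, 3, 4] fetched from 2 tasks
        System.out.println(top + " fetched from " + fetches[0] + " tasks");
    }
}
```

With a limit of 4, the third task's output is never fetched, which is the 
saving the issue is after.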



--
This message was sent by Atlassian JIRA
(v6.2#6252)
