[ 
https://issues.apache.org/jira/browse/TEZ-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055204#comment-14055204
 ] 

Bikas Saha edited comment on TEZ-1265 at 7/8/14 5:54 PM:
---------------------------------------------------------

The above insight is correct. The further up we can push the limit, the better
it is. We have discussed something similar for limit processing in Hive. A
custom vertex manager can be used for the order-by vertex. This vertex manager
starts the order-by tasks in a restrained fashion. E.g. it could start only
task0. Once task0 knows its number of records, it can send a
VertexManagerEvent to the vertex manager. The event tells the vertex manager
whether enough records have been processed or more need to be processed. If
enough have been processed, the vertex manager could stop the vertex right
there (support for that will need to be added in Tez). If not, it could launch
more tasks. Launching tasks one at a time ensures that we never produce more
records than required, but may slow things down. Alternatively, the partition
vertex tasks could send events to the vertex manager with stats on the number
of records generated (similar to the data size stats generated for auto
reduce). Given those stats, the vertex manager could figure out exactly how
many tasks to start, and could also configure the last task to generate only X
records so that the total == desired. (cc [~gopalv])
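
To make the stats-based alternative concrete, here is a minimal sketch of the
arithmetic the vertex manager would need. It is plain Java and does not use
the Tez API. The class LimitPlanner, its plan method, and the assumption that
the stats arrive as one record count per order-by task (in task order) are all
hypothetical and for illustration only.

```java
/** Hypothetical helper for illustration; not part of the Tez API. */
final class LimitPlanner {

    /** How many order-by tasks to start, and the record cap for the last one. */
    static final class Plan {
        final int tasksToStart;
        final long lastTaskLimit;

        Plan(int tasksToStart, long lastTaskLimit) {
            this.tasksToStart = tasksToStart;
            this.lastTaskLimit = lastTaskLimit;
        }
    }

    /**
     * recordsPerTask[i] is the (assumed) number of records that order-by
     * task i would produce, as reported by upstream stats.
     */
    static Plan plan(long[] recordsPerTask, long limit) {
        if (limit <= 0 || recordsPerTask.length == 0) {
            return new Plan(0, 0);
        }
        long total = 0;
        for (int i = 0; i < recordsPerTask.length; i++) {
            long remaining = limit - total;
            if (recordsPerTask[i] >= remaining) {
                // Task i is the last one needed; cap it so total == limit.
                return new Plan(i + 1, remaining);
            }
            total += recordsPerTask[i];
        }
        // Fewer records than the limit overall: start every task, no real cap.
        int last = recordsPerTask.length - 1;
        return new Plan(recordsPerTask.length, recordsPerTask[last]);
    }

    public static void main(String[] args) {
        Plan p = plan(new long[] {40, 30, 50}, 100);
        System.out.println(p.tasksToStart + " tasks, last task capped at "
                + p.lastTaskLimit + " records");
    }
}
```

The one-task-at-a-time variant would apply the same running total, but update
it as each task reports its actual count instead of relying on stats up front.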


> Custom input to fetch source task inputs in order
> -------------------------------------------------
>
>                 Key: TEZ-1265
>                 URL: https://issues.apache.org/jira/browse/TEZ-1265
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>
> Consider the case of having to LIMIT m records after an Order by. A 
> distributed orderby vertex produces data in sorted order from 
> task0,task1...taskn. Each task limits its output to m records (the output 
> count can be <m also). The limit vertex (parallelism 1) following the order 
> by vertex has to fetch output of all n tasks, shuffle merge its inputs (to 
> maintain the order) and then limit m records again. So we need an input that 
> fetches from source tasks in order and reads them in order. Since data 
> produced is ordered from task0,task1...taskn it can be consumed without 
> shuffle and sort. If the limit is hit early it can skip fetching more task 
> inputs. 
> More details in 
> https://issues.apache.org/jira/browse/PIG-4049?focusedCommentId=14053217&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14053217
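
The ordered-fetch behaviour requested in the description can be sketched as
follows. This is only an illustration in plain Java, not a Tez input
implementation. The class OrderedLimitReader is hypothetical, and each source
task's output is modelled as a Supplier so that a fetch only happens when it
is actually needed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch; a real input would fetch over the shuffle transport. */
final class OrderedLimitReader {

    /**
     * Reads task outputs in task order (task0, task1, ...) and stops as soon
     * as limit records have been collected. Outputs of later tasks are never
     * fetched, and no merge or sort is performed because the data is assumed
     * to be globally ordered across tasks already.
     */
    static <T> List<T> readInOrder(List<Supplier<List<T>>> taskOutputs, int limit) {
        List<T> result = new ArrayList<>();
        for (Supplier<List<T>> fetch : taskOutputs) {
            if (result.size() >= limit) {
                break; // limit hit early: skip fetching remaining task outputs
            }
            for (T record : fetch.get()) {
                if (result.size() >= limit) {
                    break;
                }
                result.add(record);
            }
        }
        return result;
    }
}
```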



--
This message was sent by Atlassian JIRA
(v6.2#6252)
