[ 
https://issues.apache.org/jira/browse/TEZ-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055220#comment-14055220
 ] 

Rohini Palaniswamy commented on TEZ-1265:
-----------------------------------------

I assume Hive is trying to do this for all limit operations, since it does not 
have distributed order by yet. I was mostly thinking of it for order by + limit: 
with order by there is always a partitioner vertex, which can be used to 
determine the number of records. The only catch is that all of those tasks need 
to complete before the order by tasks can start, which should not be that bad. 
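
A rough sketch of that idea, assuming the partitioner vertex can report a 
record count per partition (the function name and its inputs are hypothetical, 
not Tez API): the order by tasks that matter for LIMIT m are the shortest 
prefix of partitions whose counts add up to at least m.

```python
from itertools import accumulate

def tasks_needed(partition_counts, m):
    """Return how many leading order by tasks cover the first m records.

    partition_counts[i] is the (assumed) record count of partition i, in
    sort order. If all partitions together hold fewer than m records,
    every task is needed.
    """
    if m <= 0:
        return 0
    # Running totals over partitions in task order: task0, task1, ...
    for task_count, total in enumerate(accumulate(partition_counts), start=1):
        if total >= m:
            return task_count
    return len(partition_counts)
```

For example, with counts [3, 0, 5, 10] and m = 4, the first three tasks are 
enough and the fourth would not need to be fetched.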

bq. If enough records have not been processed then it could launch more tasks. 
Doing tasks 1 by 1 ensures that we never produce more records than required but 
may slow things down.
Yes, that would be the case for limit without sort. We might have to go via the 
alternative route you suggested.

> Custom input to fetch source task inputs in order
> -------------------------------------------------
>
>                 Key: TEZ-1265
>                 URL: https://issues.apache.org/jira/browse/TEZ-1265
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>
> Consider the case of having to LIMIT m records after an order by. A 
> distributed order by vertex produces data in sorted order across 
> task0, task1, ... taskn. Each task limits its output to m records (the 
> output count can also be less than m). The limit vertex (parallelism 1) 
> following the order by vertex has to fetch the output of all n tasks, 
> shuffle-merge its inputs (to maintain the order), and then limit to m 
> records again. So we need an input that fetches from source tasks in order 
> and reads them in order. Since the data produced is already ordered across 
> task0, task1, ... taskn, it can be consumed without shuffle and sort. If the 
> limit is hit early, the input can skip fetching the remaining task inputs. 
> More details in 
> https://issues.apache.org/jira/browse/PIG-4049?focusedCommentId=14053217&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14053217
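
The ordered-fetch behaviour described in the issue can be sketched roughly as 
follows. This is only an illustration under assumptions, not the Tez input 
API: each source task is modelled as a hypothetical zero-argument callable 
that returns that task's already-sorted records when invoked.

```python
def fetch_in_order(task_fetchers, m):
    """Read up to m records from source tasks in task-index order.

    task_fetchers[i] is a hypothetical callable standing in for "fetch the
    output of task i". Because task outputs are globally ordered across
    tasks, no merge or sort is needed; records are simply concatenated.
    """
    result = []
    for fetch in task_fetchers:          # task0, task1, ... taskn
        if len(result) >= m:
            break                        # limit hit: skip remaining fetches
        for record in fetch():
            result.append(record)
            if len(result) >= m:
                break                    # stop reading mid-task once at m
    return result


# Usage: record which tasks were actually fetched.
fetched = []

def make_fetcher(index, records):
    def fetch():
        fetched.append(index)
        return records
    return fetch

fetchers = [make_fetcher(0, [1, 2]),
            make_fetcher(1, [3, 4, 5]),
            make_fetcher(2, [6])]
first_four = fetch_in_order(fetchers, 4)
```

In the usage above the limit is reached while reading the second task, so the 
third task's output is never fetched.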



--
This message was sent by Atlassian JIRA
(v6.2#6252)
