[
https://issues.apache.org/jira/browse/TEZ-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055250#comment-14055250
]
Bikas Saha commented on TEZ-1265:
---------------------------------
Which route are you talking about?
Route 1) Orderby tasks report the number of generated records to the manager,
and the manager starts them one by one.
Route 2) Partition tasks report record stats per partition to the manager, and
the manager starts the correct number of orderby tasks. (By potentially
changing the parallelism and routing only the required partitions 1-N?) This
could actually be an extension of the newly added vertex manager for orderby in
Pig. The sampler currently determines the histograms and the number of
partitions. It could additionally determine the early number of orderby tasks.
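
As a rough illustration of Route 2 (a sketch only, not Tez or Pig code): assume the manager receives the per-partition record counts in sort order and knows the limit m. The function name and the `partition_counts` argument are hypothetical. It returns how many leading partitions, and so how many orderby tasks, are needed to cover the limit.

```python
from itertools import accumulate


def partitions_needed(partition_counts, m):
    """Return the smallest k such that partitions 1..k hold at least m records.

    partition_counts: record counts per partition, in sort order (assumed input).
    If all partitions together hold fewer than m records, every partition
    is needed, so len(partition_counts) is returned.
    """
    if m <= 0:
        return 0
    # Walk the running total of records and stop at the first partition
    # where the limit is covered.
    for k, running_total in enumerate(accumulate(partition_counts), start=1):
        if running_total >= m:
            return k
    return len(partition_counts)
```

For example, with counts [3, 5, 10, 2] and m = 7, only the first two partitions are required, so the manager could start two orderby tasks instead of four.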
> Custom input to fetch source task inputs in order
> -------------------------------------------------
>
> Key: TEZ-1265
> URL: https://issues.apache.org/jira/browse/TEZ-1265
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Rohini Palaniswamy
>
> Consider the case of having to LIMIT m records after an Order by. A
> distributed orderby vertex produces data in sorted order from
> task0, task1 ... taskn. Each task limits its output to m records (the output
> count can also be < m). The limit vertex (parallelism 1) following the order
> by vertex has to fetch the output of all n tasks, shuffle merge its inputs (to
> maintain the order) and then limit to m records again. So we need an input
> that fetches from source tasks in order and reads them in order. Since the
> data produced is ordered from task0, task1 ... taskn, it can be consumed
> without shuffle and sort. If the limit is hit early, it can skip fetching more
> task inputs.
> More details in
> https://issues.apache.org/jira/browse/PIG-4049?focusedCommentId=14053217&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14053217
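
The ordered fetch described in the issue could look roughly like the sketch below. It is not the Tez input API: `fetch_in_order` and the idea of modelling each source task as a zero-argument callable are assumptions made for illustration. Outputs are read in task order with no merge, and later tasks are never fetched once m records have been read.

```python
def fetch_in_order(task_outputs, m):
    """Read sorted task outputs in task order, stopping after m records.

    task_outputs: sequence of zero-argument callables, one per source task
    (task0, task1, ... taskn). Calling one stands in for fetching that
    task's output and returns an iterable of records (hypothetical model).
    Returns (records, number_of_tasks_fetched).
    """
    records = []
    fetched = 0
    if m <= 0:
        return records, fetched
    for fetch in task_outputs:
        fetched += 1  # this task's output is actually fetched
        for record in fetch():
            records.append(record)
            if len(records) == m:
                # Limit hit: skip the remaining task inputs entirely.
                return records, fetched
    return records, fetched
```

With three tasks producing [1, 2], [3, 4, 5] and [6], a limit of 3 returns [1, 2, 3] after fetching only the first two tasks.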