Rohini Palaniswamy created TEZ-1265:
---------------------------------------
Summary: Custom input to fetch source task inputs in order
Key: TEZ-1265
URL: https://issues.apache.org/jira/browse/TEZ-1265
Project: Apache Tez
Issue Type: Improvement
Reporter: Rohini Palaniswamy
Consider the case of having to LIMIT m records after an ORDER BY. A distributed
order-by vertex produces data in sorted order across task0, task1, ..., taskn.
Each task limits its output to m records (a task may also produce fewer than
m). Currently, the limit vertex (parallelism 1) that follows the order-by
vertex has to fetch the output of all n tasks, shuffle-merge its inputs (to
maintain the order), and then limit to m records again. We need an input that
fetches from the source tasks in task order and reads them in that order. Since
the data is already ordered from task0 through taskn, it can be consumed
without a shuffle and sort. If the limit is hit early, the input can skip
fetching the remaining task inputs.
More details in
https://issues.apache.org/jira/browse/PIG-4049?focusedCommentId=14053217&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14053217
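
The requested behaviour can be illustrated with a minimal sketch. This is not
Tez code and uses no Tez API; `ordered_limit`, `make_fetcher`, and the sample
data are hypothetical names chosen for illustration. It assumes each source
task's output is already sorted and that every record of task i sorts before
every record of task i+1, as described above.

```python
from typing import Callable, Iterator, List, Sequence


def ordered_limit(
    fetchers: Sequence[Callable[[], Iterator[int]]],
    limit: int,
) -> List[int]:
    """Read task outputs in task-index order, stopping at `limit` records."""
    result: List[int] = []
    for fetch in fetchers:
        if len(result) >= limit:
            break  # limit already reached: skip fetching the remaining tasks
        for record in fetch():  # the fetch happens only when we get here
            result.append(record)
            if len(result) >= limit:
                break
    return result


# Hypothetical stand-in for three order-by tasks, each already limited to m = 3.
fetched_tasks: List[int] = []  # records which task outputs were actually fetched


def make_fetcher(task_index: int, records: List[int]) -> Callable[[], Iterator[int]]:
    def fetch() -> Iterator[int]:
        fetched_tasks.append(task_index)
        return iter(records)
    return fetch


task_outputs = [[1, 2], [3, 5, 8], [9, 10, 11]]
fetchers = [make_fetcher(i, recs) for i, recs in enumerate(task_outputs)]
top = ordered_limit(fetchers, limit=3)
print(top, fetched_tasks)
```

Here the limit is satisfied after reading from task0 and task1, so the output
of task2 is never fetched, and no merge or sort is performed on the consumer
side.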