[
https://issues.apache.org/jira/browse/TEZ-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14054287#comment-14054287
]
Rohini Palaniswamy commented on TEZ-1265:
-----------------------------------------
[~sseth] had a better suggestion that makes this more generic and controllable
than fetching in order:
https://issues.apache.org/jira/browse/PIG-4049?focusedCommentId=14054266&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14054266
The idea is a custom version of the Input which, instead of providing a unified
view of the data, gives access to individual chunks along with meta-information
(taskId, etc.). In addition, the user could fully control which chunks are
fetched.
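
As a rough illustration of that suggestion, here is a minimal Java sketch. The
names Chunk, ChunkSource and ChunkedInputSketch are hypothetical and are not
part of the Tez API. The sketch only shows the shape of an input that keeps
per-task chunks separate, attaches the task index as meta-information, and lets
the caller decide which chunks are fetched.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntPredicate;

// Hypothetical types for illustration only; none of these exist in Tez.
final class Chunk {
    final int taskIndex;         // meta-information: which source task produced the chunk
    final List<String> records;  // the chunk's data, in the order the task wrote it

    Chunk(int taskIndex, List<String> records) {
        this.taskIndex = taskIndex;
        this.records = records;
    }
}

// Stand-in for the real fetcher: returns the output of one source task.
interface ChunkSource {
    int taskCount();
    List<String> fetch(int taskIndex);
}

final class ChunkedInputSketch {
    // Fetches only the chunks the caller asks for and keeps them separate,
    // instead of merging them into one unified view.
    static List<Chunk> fetchSelected(ChunkSource source, IntPredicate wanted) {
        List<Chunk> chunks = new ArrayList<>();
        for (int i = 0; i < source.taskCount(); i++) {
            if (wanted.test(i)) {
                chunks.add(new Chunk(i, source.fetch(i)));
            }
        }
        return chunks;
    }
}
```

A caller that only wants the first two task outputs could pass `i -> i < 2` as
the predicate; the remaining tasks are never fetched.
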
> Custom input to fetch source task inputs in order
> -------------------------------------------------
>
> Key: TEZ-1265
> URL: https://issues.apache.org/jira/browse/TEZ-1265
> Project: Apache Tez
> Issue Type: Improvement
> Reporter: Rohini Palaniswamy
>
> Consider the case of having to LIMIT m records after an ORDER BY. A
> distributed order-by vertex produces data in sorted order across task0,
> task1, ..., taskn. Each task limits its output to m records (the output
> count can also be less than m). The limit vertex (parallelism 1) following
> the order-by vertex has to fetch the output of all n tasks, shuffle-merge
> its inputs (to maintain the order), and then limit to m records again. So
> we need an input that fetches from the source tasks in order and reads them
> in order. Since the data produced is ordered across task0, task1, ...,
> taskn, it can be consumed without a shuffle and sort. If the limit is hit
> early, the input can skip fetching the remaining task inputs.
> More details in
> https://issues.apache.org/jira/browse/PIG-4049?focusedCommentId=14053217&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14053217
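
The ordered fetch with early exit described in the issue can be sketched in
Java as follows. This is only an illustration under stated assumptions:
OrderedLimitSketch, LimitResult and the fetchTask callback are hypothetical
names, not Tez or Pig APIs, and fetchTask stands in for whatever actually
pulls one source task's output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.IntFunction;

// Hypothetical result holder for illustration only.
final class LimitResult {
    final List<Integer> records;  // at most m records, in global sorted order
    final int tasksFetched;       // how many source task outputs were pulled

    LimitResult(List<Integer> records, int tasksFetched) {
        this.records = records;
        this.tasksFetched = tasksFetched;
    }
}

final class OrderedLimitSketch {
    // Assumes the data is already sorted across task0..taskN-1, so reading
    // the tasks in index order preserves the order without a merge.
    static LimitResult readWithLimit(int taskCount,
                                     IntFunction<List<Integer>> fetchTask,
                                     int m) {
        List<Integer> out = new ArrayList<>();
        int fetched = 0;
        // Stop fetching as soon as m records have been collected.
        for (int task = 0; task < taskCount && out.size() < m; task++) {
            List<Integer> taskOutput = fetchTask.apply(task);
            fetched++;
            for (Integer record : taskOutput) {
                if (out.size() == m) {
                    break;
                }
                out.add(record);
            }
        }
        return new LimitResult(out, fetched);
    }
}
```

For example, with task outputs [1, 2], [3, 4, 5], [6, 7] and m = 4, the sketch
returns 1, 2, 3, 4 after fetching only the first two tasks.
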
--
This message was sent by Atlassian JIRA
(v6.2#6252)