[jira] [Commented] (TEZ-1265) Custom input to fetch source task inputs in order

Rohini Palaniswamy (JIRA) Tue, 08 Jul 2014 14:41:22 -0700

    [ 
https://issues.apache.org/jira/browse/TEZ-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14055576#comment-14055576
 ]


Rohini Palaniswamy commented on TEZ-1265:
-----------------------------------------

Route 2. Yes, you are right that we have to make changes to the OrderBy vertex 
manager, as we already have that set for the orderby vertex.  We will give it a 
shot and see whether any Tez changes are actually needed to accomplish that.

For the case of limit without orderby - I think it makes sense to just add a 
feature to ShuffleScheduler so that it fetches only as many completed part 
files as the record limit requires, instead of having another input.  Since the 
input is always UnorderedPartitionedInput, there is no sort-merge involved, but 
it would otherwise pull more files than necessary before the Processor 
terminates when the limit is reached. Is Hive trying to do the limit for that 
case in the same phase itself instead of in an additional limit job? In the 
case of LOAD, FILTER and LIMIT you will always require an extra limit job. You 
can avoid the extra limit job only if there is at least one map phase before 
it.

> Custom input to fetch source task inputs in order
> -------------------------------------------------
>
>                 Key: TEZ-1265
>                 URL: https://issues.apache.org/jira/browse/TEZ-1265
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rohini Palaniswamy
>
> Consider the case of having to LIMIT m records after an Order by. A 
> distributed orderby vertex produces data in sorted order from 
> task0,task1...taskn. Each task limits its output to m records (the output 
> count can also be less than m). The limit vertex (parallelism 1) following 
> the order by vertex has to fetch the output of all n tasks, shuffle merge its 
> inputs (to maintain the order) and then limit m records again.  So we need an 
> input that fetches from source tasks in order and reads them in order. Since 
> the data produced is ordered from task0,task1...taskn, it can be consumed 
> without shuffle and sort. If the limit is hit early, it can skip fetching 
> more task inputs. 
> More details in 
> https://issues.apache.org/jira/browse/PIG-4049?focusedCommentId=14053217&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14053217
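
The ordered fetch described above can be illustrated with a minimal sketch. 
This is not Tez code and does not use any Tez API: the class 
OrderedLimitFetch, the method fetchInOrder and the Supplier-based stand-in for 
"fetch the output of task i" are all hypothetical names chosen for 
illustration. It only assumes that each task's output is already sorted and 
that task i's records sort before task i+1's.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

public class OrderedLimitFetch {

    /** Records kept, plus how many task outputs were actually fetched. */
    public static final class Result {
        public final List<String> records;
        public final int tasksFetched;

        Result(List<String> records, int tasksFetched) {
            this.records = records;
            this.tasksFetched = tasksFetched;
        }
    }

    /**
     * Hypothetical helper: taskOutputs.get(i) stands in for fetching the
     * already-sorted output of source task i. Outputs are consumed strictly
     * in task-index order, so no merge step is needed.
     */
    public static Result fetchInOrder(List<Supplier<List<String>>> taskOutputs, int limit) {
        List<String> records = new ArrayList<>();
        int tasksFetched = 0;
        for (Supplier<List<String>> fetch : taskOutputs) {
            if (records.size() >= limit) {
                break; // limit reached: skip fetching the remaining task outputs
            }
            tasksFetched++;
            for (String record : fetch.get()) {
                if (records.size() >= limit) {
                    break; // stop reading this task's output mid-way
                }
                records.add(record);
            }
        }
        return new Result(records, tasksFetched);
    }

    public static void main(String[] args) {
        List<Supplier<List<String>>> outputs = new ArrayList<>();
        outputs.add(() -> Arrays.asList("a", "b")); // task0
        outputs.add(() -> Arrays.asList("c", "d")); // task1
        outputs.add(() -> Arrays.asList("e", "f")); // task2, never fetched for limit 3
        Result r = fetchInOrder(outputs, 3);
        System.out.println(r.records + " from " + r.tasksFetched + " task(s)");
        // prints "[a, b, c] from 2 task(s)"
    }
}
```

With a limit of 3 the sketch stops after task1 and never asks for task2's 
output, which is the saving the description is after. A real input would also 
have to wait for source tasks to complete in index order, which this sketch 
does not model.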



--
This message was sent by Atlassian JIRA
(v6.2#6252)
