[ 
https://issues.apache.org/jira/browse/TEZ-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Rajesh Balamohan updated TEZ-2001:
----------------------------------
    Attachment: TEZ-2001.1.patch

Similar to the approach listed in TEZ-1094, but specific to ordered use cases.
- As of now, pipelined shuffle is enabled only when PipelinedSorter is used.
PipelinedSorter uses multiple threads and churns out sorted files.
- It can be enabled by setting "tez.runtime.pipelined-shuffle.enabled=true".
- Spills will be stored in 
{code}${appDir}/output/${uniqueId}_${spillNumber}/file.out{code}.  This makes 
it easier to use the existing ShuffleHandler to serve the output without 
issues.
- Whenever a spill happens, a DataMovementEvent (DME) is sent out with the 
spill id. If 3 spills are done, 3 events are sent out.
- On the consumer side, this data is collated before the fetcher threads 
complete.
- maxTaskAttempts is set to 1 when pipelined shuffle is enabled.  Additional 
JIRAs need to be created to enhance error handling.
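
The spill layout and the one-event-per-spill behaviour described above can be 
sketched roughly as follows. This is only an illustration, not code from the 
patch: the class PipelinedSpillSketch, the SpillEvent type, and the lastSpill 
flag are hypothetical names, and numbering spills from 0 is an assumption.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; none of these names come from the Tez code base.
class PipelinedSpillSketch {

    // Hypothetical stand-in for the event sent out for each spill.
    static final class SpillEvent {
        final String uniqueId;
        final int spillId;
        // Assumption: the producer marks its final spill so the consumer
        // knows how many spills to expect.
        final boolean lastSpill;

        SpillEvent(String uniqueId, int spillId, boolean lastSpill) {
            this.uniqueId = uniqueId;
            this.spillId = spillId;
            this.lastSpill = lastSpill;
        }
    }

    // Builds ${appDir}/output/${uniqueId}_${spillNumber}/file.out
    static String spillPath(String appDir, String uniqueId, int spillNumber) {
        return appDir + "/output/" + uniqueId + "_" + spillNumber + "/file.out";
    }

    // Producer side: one event per spill, so 3 spills yield 3 events.
    static List<SpillEvent> eventsFor(String uniqueId, int spillCount) {
        List<SpillEvent> events = new ArrayList<>();
        for (int i = 0; i < spillCount; i++) {
            events.add(new SpillEvent(uniqueId, i, i == spillCount - 1));
        }
        return events;
    }

    // Consumer side: the input is complete only once the final spill has
    // been announced and every spill up to it has been received.
    static boolean isComplete(List<SpillEvent> received) {
        int expected = -1;
        Set<Integer> seen = new HashSet<>();
        for (SpillEvent e : received) {
            seen.add(e.spillId);
            if (e.lastSpill) {
                expected = e.spillId + 1;
            }
        }
        return expected > 0 && seen.size() == expected;
    }

    public static void main(String[] args) {
        // Print the spill file each event would point at.
        for (SpillEvent e : eventsFor("attempt_1", 3)) {
            System.out.println(spillPath("/tmp/app", e.uniqueId, e.spillId));
        }
    }
}
```

Keeping each spill in its own directory with the usual file.out name is what 
lets a spill be addressed like a regular task output.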

Overall, this would be beneficial in cases where map-side spills are causing 
the job runtime to suffer; pipelining helps by overlapping network transfer 
with CPU work.

Attaching the initial patch with this.

> Support pipelined data transfer for ordered output
> --------------------------------------------------
>
>                 Key: TEZ-2001
>                 URL: https://issues.apache.org/jira/browse/TEZ-2001
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-2001.1.patch
>
>




