[ https://issues.apache.org/jira/browse/TEZ-2001?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Rajesh Balamohan updated TEZ-2001:
----------------------------------
    Attachment: TEZ-2001.1.patch

Similar to the approach listed in TEZ-1094, but specific to ordered use cases.
- As of now, pipelined shuffle is enabled only when PipelinedSorter is used. PipelinedSorter uses multiple threads and churns out sorted files.
- Can be enabled by setting "tez.runtime.pipelined-shuffle.enabled=true".
- Spills will be stored in {code}${appDir}/output/${uniqueId}_${spillNumber}/file.out{code}. This makes it easier to use the existing ShuffleHandler to serve the output without issues.
- Whenever a spill happens, a DME (DataMovementEvent) is sent out with the spill id. If 3 spills are done, 3 events are sent out.
- On the consumer side, this data is collated before completing the fetcher threads.
- maxTaskAttempts is set to 1 when pipelined shuffle is enabled. Need to create additional jiras to enhance error handling.

Overall, this would be beneficial in cases where map-side spills are causing the job runtime to suffer; pipelining helps by overlapping network transfer with CPU work.

Attaching the initial patch with this.

> Support pipelined data transfer for ordered output
> --------------------------------------------------
>
>                 Key: TEZ-2001
>                 URL: https://issues.apache.org/jira/browse/TEZ-2001
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-2001.1.patch
>
>

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
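
The spill layout and the one-event-per-spill behaviour described in the comment above can be sketched roughly as follows. This is only an illustration, not code from the patch: the class `PipelinedSpillSketch` and its methods are hypothetical, a plain `java.util.Properties` stands in for the real job configuration, and spill numbers are assumed to start at 0. Only the property name and the path pattern are taken from the comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Properties;

// Hypothetical helper for illustration; not part of Tez or of TEZ-2001.1.patch.
final class PipelinedSpillSketch {
    // Property name as quoted in the comment.
    static final String PIPELINED_SHUFFLE_ENABLED = "tez.runtime.pipelined-shuffle.enabled";

    // Builds ${appDir}/output/${uniqueId}_${spillNumber}/file.out.
    static String spillPath(String appDir, String uniqueId, int spillNumber) {
        return appDir + "/output/" + uniqueId + "_" + spillNumber + "/file.out";
    }

    // Returns one path per spill, mirroring "3 spills, 3 events".
    // Assumption: spill numbers start at 0.
    static List<String> spillPaths(String appDir, String uniqueId, int spillCount) {
        List<String> paths = new ArrayList<>();
        for (int spill = 0; spill < spillCount; spill++) {
            paths.add(spillPath(appDir, uniqueId, spill));
        }
        return paths;
    }

    public static void main(String[] args) {
        // Stand-in for the job configuration; example values are made up.
        Properties conf = new Properties();
        conf.setProperty(PIPELINED_SHUFFLE_ENABLED, "true");

        if (Boolean.parseBoolean(conf.getProperty(PIPELINED_SHUFFLE_ENABLED))) {
            for (String path : spillPaths("/tmp/app_0001", "attempt_0", 3)) {
                System.out.println(path);
            }
        }
    }
}
```

Because each spill gets its own `${uniqueId}_${spillNumber}` directory containing a `file.out`, every spill looks like an independent output to whatever serves it, which is presumably why the existing ShuffleHandler can be reused unchanged.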