[ https://issues.apache.org/jira/browse/TEZ-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14364314#comment-14364314 ]
Hadoop QA commented on TEZ-2198:
--------------------------------

{color:green}+1 overall{color}. Here are the results of testing the latest attachment
  http://issues.apache.org/jira/secure/attachment/12704927/TEZ-2198.2.patch
  against master revision b18552b.

    {color:green}+1 @author{color}. The patch does not contain any @author tags.

    {color:green}+1 tests included{color}. The patch appears to include 2 new or modified test files.

    {color:green}+1 javac{color}. The applied patch does not increase the total number of javac compiler warnings.

    {color:green}+1 javadoc{color}. There were no new javadoc warning messages.

    {color:green}+1 findbugs{color}. The patch does not introduce any new Findbugs (version 2.0.3) warnings.

    {color:green}+1 release audit{color}. The applied patch does not increase the total number of release audit warnings.

    {color:green}+1 core tests{color}. The patch passed unit tests in .

Test results: https://builds.apache.org/job/PreCommit-TEZ-Build/307//testReport/
Console output: https://builds.apache.org/job/PreCommit-TEZ-Build/307//console

This message is automatically generated.

> Fix sorter spill counts
> -----------------------
>
>                 Key: TEZ-2198
>                 URL: https://issues.apache.org/jira/browse/TEZ-2198
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Rajesh Balamohan
>            Assignee: Rajesh Balamohan
>         Attachments: TEZ-2198.1.patch, TEZ-2198.2.patch
>
>
> Prior to pipelined shuffle, Tez merged all spilled data into a single file.
> This ended up creating one index file and one output file. In this context,
> TaskCounter.ADDITIONAL_SPILL_COUNT was referred to as the number of additional
> spills, and no counter was needed to track the number of merges.
> With pipelined shuffle, there is no final merge and ADDITIONAL_SPILL_COUNT
> would be misleading, as these spills are direct output files which are
> consumed by the consumers.
> It would be good to have the following:
> - ADDITIONAL_SPILL_COUNT: represents the spills that are needed by the task
> to generate the final merged output
> - TOTAL_SPILLS: represents the total number of shuffle directories (index +
> output files) that got created at the end of processing.
> For example, assume the sorter generated 5 spills in an attempt.
> Without pipelining:
> ===================
> ADDITIONAL_SPILL_COUNT = 5 <-- Additional spills involved in sorting
> TOTAL_SPILLS = 1 <-- Final merged output
> With pipelining:
> ================
> ADDITIONAL_SPILL_COUNT = 0 <-- Additional spills involved in sorting
> TOTAL_SPILLS = 5 <-- All spills are final output

--
This message was sent by Atlassian JIRA (v6.3.4#6332)
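
The counter semantics proposed in the issue description above can be sketched in a few lines of Java. This is only an illustration of the worked example (5 spills, with and without pipelining), not the code from the attached patches: the class SpillCounterSketch, its fields, and the compute method are hypothetical names, and the behaviour for inputs other than the example is an assumption.

```java
/**
 * Hypothetical helper, not part of Tez. It mirrors the worked example in
 * TEZ-2198: how ADDITIONAL_SPILL_COUNT and TOTAL_SPILLS would relate to the
 * number of spills the sorter produced in one attempt.
 */
public final class SpillCounterSketch {
    /** Spills that had to be merged to produce the final output. */
    public final int additionalSpillCount;
    /** Index + output file sets that exist when processing ends. */
    public final int totalSpills;

    private SpillCounterSketch(int additionalSpillCount, int totalSpills) {
        this.additionalSpillCount = additionalSpillCount;
        this.totalSpills = totalSpills;
    }

    /**
     * @param spills    number of spills generated by the sorter (assumed >= 1)
     * @param pipelined whether pipelined shuffle is enabled
     */
    public static SpillCounterSketch compute(int spills, boolean pipelined) {
        if (spills < 1) {
            // Assumption: the issue only discusses attempts that spilled.
            throw new IllegalArgumentException("spills must be >= 1");
        }
        if (pipelined) {
            // No final merge: every spill is shipped as final output.
            return new SpillCounterSketch(0, spills);
        }
        // Final merge: all spills collapse into one merged output.
        return new SpillCounterSketch(spills, 1);
    }

    public static void main(String[] args) {
        SpillCounterSketch merged = compute(5, false);
        SpillCounterSketch piped = compute(5, true);
        System.out.println("Without pipelining: ADDITIONAL_SPILL_COUNT="
            + merged.additionalSpillCount + " TOTAL_SPILLS=" + merged.totalSpills);
        System.out.println("With pipelining: ADDITIONAL_SPILL_COUNT="
            + piped.additionalSpillCount + " TOTAL_SPILLS=" + piped.totalSpills);
    }
}
```

Keeping the two values separate reflects the point of the issue: one counter describes merge work done by the task, the other describes how many outputs consumers will see.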