[ https://issues.apache.org/jira/browse/TEZ-2198?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514727#comment-14514727 ]
Hitesh Shah commented on TEZ-2198: ---------------------------------- [~gopalv] [~sseth] review please. > Fix sorter spill counts > ----------------------- > > Key: TEZ-2198 > URL: https://issues.apache.org/jira/browse/TEZ-2198 > Project: Apache Tez > Issue Type: Improvement > Reporter: Rajesh Balamohan > Assignee: Rajesh Balamohan > Attachments: TEZ-2198.1.patch, TEZ-2198.2.patch, > no_additional_spills_eg_pipelined_shuffle.png, with_additional_spills.png > > > Prior to pipelined shuffle, tez merged all spilled data into a single file. > This ended up creating one index file and one output file. In this context, > TaskCounter.ADDITIONAL_SPILL_COUNT was referred as the number of additional > spills and there was no counter needed to track the number of merges. > With pipelined shuffle, there is no final merge and ADDITIONAL_SPILL_COUNT > would be misleading, as these spills are direct output files which are > consumed by the consumers. > It would be good to have the following > - ADDITIONAL_SPILL_COUNT: represents the spills that are needed by the task > to generate the final merged output > - TOTAL_SPILLS: represents the total number of shuffle directories (index + > output files) that got created at the end of processing. > For e.g, Assume sorter generated 5 spills in an attempt > Without pipelining: > ============== > ADDITIONAL_SPILL_COUNT = 5 <-- Additional spills involved in sorting > TOTAL_SPILLS = 1 <-- Final merged output > With pipelining: > ============ > ADDITIONAL_SPILL_COUNT = 0 <-- Additional spills involved in sorting > TOTAL_SPILLS = 5 <--- all spills are final output -- This message was sent by Atlassian JIRA (v6.3.4#6332)