[ https://issues.apache.org/jira/browse/TEZ-3605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15948010#comment-15948010 ]
Siddharth Seth commented on TEZ-3605:
-------------------------------------

Took a while to get to this, and to recollect what is done in the UnorderedWriter / DefaultWriter.

If I'm not mistaken, the patch is trying to avoid writing out the default 4 bytes(?) that an IFile.Writer generates when the partition does not have data (TEZ-941)? The changes to track numRecordsPerPartition are required for this.

The Sorters already know how to generate the empty partition bitset by making use of TezSpillRecord and TezIndexRecord.hasData. The current changes to track numRecordsPerPartition also break PipelinedShuffle / AvoidFinalMerge, since the partition stats are cumulative and not per partition. Synchronization will also need to be looked at (I suspect there may be some issues with the size stats as well).

The unordered case does not respect "sendEmptyPartitionsViaEvents" as a configuration parameter, and always sends empty partition information. IIRC this is why it is able to avoid the Writer for an empty partition: the reader will never access it. In the ordered case, if sendEmptyPartitionsViaEvents is disabled, the reader may try interpreting the contents of TezIndexRecord, which was not written, and fail (need to check how this will behave).

I think the changes to track the number of records should be removed. Instead, the main changes should be in DefaultSorter (and maybe the same changes in PipelinedSorter). These changes should skip creating the writer only if sendEmptyPartitionsViaEvents is enabled.

Also, in the current changes to DefaultSorter, is it possible to move the (if (writer == null)) check outside of the while loop?

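To make the suggestion above concrete, here is a rough sketch of the decision being proposed: skip the writer for an empty partition only when sendEmptyPartitionsViaEvents is enabled, and still record an index entry per partition. This is not Tez code. EmptyPartitionSketch, IndexEntry, SpillResult and spill are hypothetical stand-ins, and the boolean array simply stands for "the spill has at least one record for this partition", not for a per-partition record counter.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

public class EmptyPartitionSketch {

    /** Hypothetical stand-in for a per-partition index entry (not the real TezIndexRecord). */
    public static final class IndexEntry {
        public final int partition;
        public final boolean hasData;

        IndexEntry(int partition, boolean hasData) {
            this.partition = partition;
            this.hasData = hasData;
        }
    }

    /** Hypothetical summary of what a single spill produced. */
    public static final class SpillResult {
        public final List<IndexEntry> index = new ArrayList<>();
        public final BitSet emptyPartitions = new BitSet();
        public int writersCreated = 0;
    }

    /**
     * Walks the partitions of one spill. A writer is skipped only when the
     * partition is empty AND empty-partition information travels via events;
     * otherwise a reader could try to read a segment that was never written.
     */
    public static SpillResult spill(boolean[] partitionHasRecords,
                                    boolean sendEmptyPartitionsViaEvents) {
        SpillResult result = new SpillResult();
        for (int p = 0; p < partitionHasRecords.length; p++) {
            boolean hasData = partitionHasRecords[p];
            if (!hasData) {
                result.emptyPartitions.set(p);
            }
            boolean skipWriter = !hasData && sendEmptyPartitionsViaEvents;
            if (!skipWriter) {
                // Stand-in for creating a writer and closing it, which emits
                // at least the small default header/footer bytes.
                result.writersCreated++;
            }
            // An index entry is recorded either way, so the empty-partition
            // bitset can be derived from hasData without counting records.
            result.index.add(new IndexEntry(p, hasData));
        }
        return result;
    }

    public static void main(String[] args) {
        boolean[] partitions = {true, false, true};
        System.out.println("events on:  " + spill(partitions, true).writersCreated + " writers");
        System.out.println("events off: " + spill(partitions, false).writersCreated + " writers");
    }
}
```

With the flag enabled the empty partition costs no writer; with it disabled every partition still gets one, which is the behaviour the ordered case would need to keep for readers that do not receive empty-partition events.
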
> Detect and prune empty partitions for the Ordered case
> ------------------------------------------------------
>
>                 Key: TEZ-3605
>                 URL: https://issues.apache.org/jira/browse/TEZ-3605
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>         Attachments: TEZ-3605.001.patch, TEZ-3605.002.patch,
>                      TEZ-3605.003.patch, TEZ-3605.004.patch,
>                      TEZ-3605.005.patch, TEZ-3605.006.patch
>
>
> Analogous to the Unordered case we should not have empty partition
> entries/segments in the Ordered/DefaultSorter case. This will save writing
> unnecessary data.

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)