[ 
https://issues.apache.org/jira/browse/TEZ-3605?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15948010#comment-15948010
 ] 

Siddharth Seth commented on TEZ-3605:
-------------------------------------

It took me a while to get to this, and to recollect what is done in the 
UnorderedWriter / DefaultWriter.
If I'm not mistaken, the patch is trying to avoid writing out the default 
4 bytes(?) that an IFile.Writer generates when the partition does not 
have data? (TEZ-941)

The changes to track numRecordsPerPartition are required for this. The Sorters 
already know how to generate the empty partition bitset by making use of 
TezSpillRecord and TezIndexRecord.hasData.
The current changes to track numRecordsPerPartition also break 
PipelinedShuffle / AvoidFinalMerge, since the partition stats are cumulative 
and not per partition. Synchronization will also need to be looked at (I suspect 
there may be some issues with the size stats as well).
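
To illustrate the point about the sorters already having what they need, here is a minimal, self-contained sketch of deriving an empty-partition bitset from per-partition index entries instead of a separate record counter. IndexEntry and emptyPartitions are hypothetical stand-ins written for this example; they are not the real TezSpillRecord / TezIndexRecord API, and only the "has data" idea is taken from the comment above.

```java
import java.util.BitSet;
import java.util.List;

class EmptyPartitionSketch {

    /** Hypothetical stand-in for one per-partition index entry of a spill. */
    static final class IndexEntry {
        private final boolean nonEmpty;

        IndexEntry(boolean nonEmpty) {
            this.nonEmpty = nonEmpty;
        }

        /** Mirrors the idea of a hasData() check; the real criterion is not shown here. */
        boolean hasData() {
            return nonEmpty;
        }
    }

    /**
     * Builds a bitset in which bit p is set when partition p has no data.
     * The input is the index of a single spill, one entry per partition.
     */
    static BitSet emptyPartitions(List<IndexEntry> spillIndex) {
        BitSet empty = new BitSet(spillIndex.size());
        for (int p = 0; p < spillIndex.size(); p++) {
            if (!spillIndex.get(p).hasData()) {
                empty.set(p);
            }
        }
        return empty;
    }
}
```

Because the bitset is computed from the index of the spill being reported, no additional per-partition counter has to be kept in sync with it.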

The unordered case does not respect "sendEmptyPartitionsViaEvents" as a 
configuration parameter, and always sends empty partition information. IIRC 
this is why it is able to avoid the Writer for an empty partition - the reader 
will never access it.
In the ordered case, if sendEmptyPartitionsViaEvents is disabled, the reader 
may try interpreting the contents of TezIndexRecord, which was not written, and 
fail (need to check how this will behave).

I think the changes to track number of records should be removed. Instead, the 
main changes should be in DefaultSorter (and maybe the same changes in 
PipelinedSorter). These changes should skip creating the writer only if 
sendEmptyPartitionsViaEvents is enabled.
Also, in the current changes to DefaultSorter, is it possible to move the 
(writer == null) check outside of the while loop?
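
A rough sketch of the shape I have in mind is below. It is not the actual DefaultSorter code: SegmentWriter, spill and the list-based partition layout are hypothetical names made up for this example. The only things taken from the discussion are that the writer is skipped for an empty partition only when sendEmptyPartitionsViaEvents is enabled, and that the decision is made once per partition rather than inside the record loop.

```java
import java.util.ArrayList;
import java.util.List;

class SpillSketch {

    /** Hypothetical stand-in for a per-partition segment writer. */
    static final class SegmentWriter {
        final int partition;
        int records;

        SegmentWriter(int partition) {
            this.partition = partition;
        }

        void append(String key, String value) {
            records++;
        }
    }

    /**
     * Writes one segment per partition and returns the writers that were created.
     * An empty partition gets no writer only when the empty-partition information
     * is sent via events; otherwise a (record-less) segment is still produced so
     * that a reader looking at the index finds something valid.
     */
    static List<SegmentWriter> spill(List<List<String[]>> partitions,
                                     boolean sendEmptyPartitionsViaEvents) {
        List<SegmentWriter> writers = new ArrayList<>();
        for (int p = 0; p < partitions.size(); p++) {
            List<String[]> records = partitions.get(p);

            // Decided once per partition, before the record loop starts.
            if (records.isEmpty() && sendEmptyPartitionsViaEvents) {
                continue;
            }

            SegmentWriter writer = new SegmentWriter(p);
            for (String[] kv : records) {
                writer.append(kv[0], kv[1]);
            }
            writers.add(writer);
        }
        return writers;
    }
}
```

With the flag disabled, every partition still gets a writer, which is the behaviour the ordered reader may be relying on.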

> Detect and prune empty partitions for the Ordered case
> ------------------------------------------------------
>
>                 Key: TEZ-3605
>                 URL: https://issues.apache.org/jira/browse/TEZ-3605
>             Project: Apache Tez
>          Issue Type: Bug
>            Reporter: Kuhu Shukla
>            Assignee: Kuhu Shukla
>         Attachments: TEZ-3605.001.patch, TEZ-3605.002.patch, 
> TEZ-3605.003.patch, TEZ-3605.004.patch, TEZ-3605.005.patch, TEZ-3605.006.patch
>
>
> Analogous to the Unordered case we should not have empty partition 
> entries/segments in the Ordered/DefaultSorter case. This will save writing 
> unnecessary data.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
