[ 
https://issues.apache.org/jira/browse/MAPREDUCE-2910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13096818#comment-13096818
 ] 

Binglin Chang commented on MAPREDUCE-2910:
------------------------------------------

With map output compression enabled, an empty segment grows as follows:
LzoCodec: 2-byte EOF marker + 4-byte checksum -> 14 bytes of compressed data + 4-byte checksum
GzipCodec: 2-byte EOF marker + 4-byte checksum -> 26 bytes of compressed data + 4-byte checksum
Empty segments don't contain any bytes, so the seek & read in MapOutputServlet 
can also be skipped.
This optimization only matters in extreme cases, but I often see a large 
proportion (90%) of empty segments in very large jobs (particularly those with 
a map-side filter) in our cluster. This is partly due to bad configuration or a 
bad partitioner, but tuning a partitioner or key distribution is sometimes 
non-trivial for users.
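
To get a rough feel for the fixed overhead, here is a minimal Java sketch that gzips a 2-byte payload with the JDK's java.util.zip.GZIPOutputStream. It assumes the body of an empty segment is just the two-byte EOF marker (modeled here as 0xFF 0xFF), and it uses the JDK gzip stream rather than Hadoop's GzipCodec, so the exact byte count may differ from the numbers above. The class and method names are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.util.zip.GZIPOutputStream;

public class EmptySegmentSize {
    // Assumption: an empty segment body is only the EOF marker,
    // modeled here as two 0xFF bytes.
    static final byte[] EOF_MARKER = {(byte) 0xFF, (byte) 0xFF};

    // Returns the number of bytes produced by gzipping `raw` with the JDK stream.
    static int gzipSize(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
            gz.write(raw);
        }
        return out.size();
    }

    public static void main(String[] args) throws IOException {
        System.out.println("raw bytes:  " + EOF_MARKER.length);
        System.out.println("gzip bytes: " + gzipSize(EOF_MARKER));
    }
}
```

The gzip container alone has a 10-byte header and an 8-byte trailer, so even a tiny payload cannot shrink below that. This is why a segment with no records gets larger, not smaller, once compression is turned on.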


> Allow empty MapOutputFile segments
> ----------------------------------
>
>                 Key: MAPREDUCE-2910
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2910
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: task, tasktracker
>    Affects Versions: 0.20.2, 0.23.0
>            Reporter: Binglin Chang
>            Priority: Minor
>             Fix For: 0.23.0
>
>
> As clusters and jobs get larger, we see a lot of empty partitions 
> in MapOutputFile due to large numbers of reducers or partition skew. When map 
> output compression is enabled, empty map output partitions get larger and incur 
> additional compressor/decompressor initialization overhead. 
> This can be optimized by allowing empty MapOutputFile segments, where the 
> rawLength and partLength of the IndexRecord are both 0. Corresponding support 
> needs to be added to the IFile reader and writer and to the reduce shuffle copier.
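
A rough sketch of the check described above. The IndexRecord class below is a simplified, hypothetical stand-in and not Hadoop's actual class; only the field names rawLength and partLength come from the issue text, and the helper method names are invented for illustration. The idea is that a record with both lengths equal to 0 marks an empty segment, so nothing needs to be read or decompressed for it.

```java
public class EmptySegmentCheck {
    // Hypothetical, simplified index entry for one partition of a map output file.
    static final class IndexRecord {
        final long startOffset; // where the segment begins in the file
        final long rawLength;   // uncompressed length in bytes
        final long partLength;  // on-disk (possibly compressed) length in bytes

        IndexRecord(long startOffset, long rawLength, long partLength) {
            this.startOffset = startOffset;
            this.rawLength = rawLength;
            this.partLength = partLength;
        }
    }

    // An empty segment is one where both lengths are 0.
    static boolean isEmptySegment(IndexRecord rec) {
        return rec.rawLength == 0 && rec.partLength == 0;
    }

    // Bytes that would have to be read and sent for this segment.
    static long bytesToTransfer(IndexRecord rec) {
        return isEmptySegment(rec) ? 0L : rec.partLength;
    }

    public static void main(String[] args) {
        IndexRecord empty = new IndexRecord(128, 0, 0);
        IndexRecord normal = new IndexRecord(128, 4096, 1024);
        System.out.println("empty?  " + isEmptySegment(empty));
        System.out.println("to send " + bytesToTransfer(normal));
    }
}
```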

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
