Emilio Coppa created MAPREDUCE-5958:
---------------------------------------
Summary: Wrong reduce task progress if map output is compressed
Key: MAPREDUCE-5958
URL: https://issues.apache.org/jira/browse/MAPREDUCE-5958
Project: Hadoop Map/Reduce
Issue Type: Bug
Affects Versions: 2.4.1, 2.2.1, 2.3.0, 2.2.0, 2.4.0
Reporter: Emilio Coppa
Priority: Minor
If the map output is compressed (_mapreduce.map.output.compress_ set to _true_),
then the reduce task progress may be severely underestimated.
In the reduce phase (but also in the merge phase), the progress of a reduce
task is computed as the ratio between the number of bytes processed so far and
the total number of bytes. However:
- the total number of bytes is computed by summing up the uncompressed segment
sizes (_Merger.Segment.getRawDataLength()_);
- the number of processed bytes is derived from the position of the current
_IFile.Reader_ (via _IFile.Reader.getPosition()_), but this position may refer
to the underlying on-disk file, which may be compressed.
Thus, if the map outputs are compressed, the progress may be underestimated.
For example, with a single on-disk map output file whose compressed size is 25%
of its original size, the reduce task progress during the reduce phase will
range between 0 and 0.25 and then artificially jump to 1.0.
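To make the mismatch concrete, below is a minimal, self-contained sketch (not actual Hadoop code; the class name and byte counts are hypothetical) that mimics the computation described above, dividing the reader position in the compressed file by the uncompressed segment length:

{code:java}
// Hypothetical illustration of the buggy progress computation: one on-disk
// map output whose compressed size is 25% of its raw (uncompressed) size.
public class ReduceProgressIllustration {
    public static void main(String[] args) {
        long rawDataLength = 400L * 1024 * 1024;    // what getRawDataLength() would report (uncompressed)
        long compressedLength = 100L * 1024 * 1024; // size of the compressed on-disk IFile

        // The reader position advances through the *compressed* file, so it can
        // never exceed compressedLength, while the denominator is the raw size.
        for (long position = 0; position <= compressedLength; position += 25L * 1024 * 1024) {
            float progress = (float) position / rawDataLength;
            System.out.printf("position=%d MB -> reported progress=%.2f%n",
                    position / (1024 * 1024), progress);
        }
        // The reported progress tops out at 0.25 and then jumps to 1.0 when the task finishes.
    }
}
{code}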
A patch is attached: the number of processed bytes is now computed from
_IFile.Reader.bytesRead_ (if the reader is in memory, _getPosition()_ already
returns exactly this field).
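For clarity, here is a simplified sketch of the direction the patch takes (not the actual patch; the class and method names below are made up for illustration): the numerator becomes the number of uncompressed bytes read, which is commensurable with _getRawDataLength()_.

{code:java}
// Simplified sketch, assuming a counter of uncompressed bytes read
// (a stand-in for IFile.Reader.bytesRead); not the actual Hadoop code.
public class PatchedProgressSketch {
    private final long rawDataLength; // uncompressed segment size
    private long bytesRead;           // uncompressed bytes read so far

    PatchedProgressSketch(long rawDataLength) {
        this.rawDataLength = rawDataLength;
    }

    void recordRead(long uncompressedBytes) {
        bytesRead += uncompressedBytes;
    }

    float progress() {
        // Numerator and denominator are now both uncompressed sizes,
        // so progress moves smoothly from 0.0 to 1.0.
        return rawDataLength == 0 ? 1.0f : (float) bytesRead / rawDataLength;
    }

    public static void main(String[] args) {
        PatchedProgressSketch p = new PatchedProgressSketch(400L * 1024 * 1024);
        p.recordRead(200L * 1024 * 1024);
        System.out.printf("progress=%.2f%n", p.progress()); // prints 0.50
    }
}
{code}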