[ https://issues.apache.org/jira/browse/MAPREDUCE-5958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14201035#comment-14201035 ]
Hudson commented on MAPREDUCE-5958:
-----------------------------------

SUCCESS: Integrated in Hadoop-trunk-Commit #6470 (See [https://builds.apache.org/job/Hadoop-trunk-Commit/6470/])
MAPREDUCE-5958. Wrong reduce task progress if map output is compressed. Contributed by Emilio Coppa and Jason Lowe. (kihwal: rev 8f701ae07a0b1dc70b8e1eb8d4a5c35c0a1e76da)
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapred/Merger.java
* hadoop-mapreduce-project/CHANGES.txt
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestMerger.java


> Wrong reduce task progress if map output is compressed
> ------------------------------------------------------
>
>                 Key: MAPREDUCE-5958
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5958
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.2.0, 2.3.0, 2.2.1, 2.4.0, 2.4.1
>            Reporter: Emilio Coppa
>            Assignee: Emilio Coppa
>            Priority: Minor
>              Labels: progress, reduce
>             Fix For: 2.6.0
>
>         Attachments: HADOOP-5958-v2.patch, MAPREDUCE-5958v3.patch
>
>
> If the map output is compressed (_mapreduce.map.output.compress_ set to
> _true_) then the reduce task progress may be highly underestimated.
> In the reduce phase (but also in the merge phase), the progress of a reduce
> task is computed as the ratio between the number of processed bytes and the
> number of total bytes.
> But:
> - the number of total bytes is computed by summing up the uncompressed
> segment sizes (_Merger.Segment.getRawDataLength()_)
> - the number of processed bytes is computed from the position of the
> current _IFile.Reader_ (using _IFile.Reader.getPosition()_), but this may
> refer to the position in the underlying on-disk file (which may be compressed)
> Thus, if the map outputs are compressed, the progress may be
> underestimated (e.g., with only one on-disk map output file whose compressed
> size is 25% of its original size, the reduce task progress during the reduce
> phase will range between 0 and 0.25 and then artificially jump to 1.0).
> The attached patch computes the number of processed bytes from
> _IFile.Reader.bytesRead_ instead (if the reader is in memory,
> _getPosition()_ already returns exactly this field).
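
The arithmetic in the description above can be illustrated with a minimal standalone Java sketch. This is not the code from Merger.java or from the attached patch: the class name ReduceProgressSketch, the method progress, and the byte counts are hypothetical and chosen only to reproduce the 25% example.

```java
public class ReduceProgressSketch {

    /**
     * Progress as the ratio of processed bytes to total bytes, capped at 1.0.
     * Hypothetical helper for illustration only, not a Hadoop API.
     */
    public static float progress(long processedBytes, long totalBytes) {
        if (totalBytes <= 0) {
            return 1.0f; // nothing to process, treat as complete
        }
        return Math.min(1.0f, (float) processedBytes / totalBytes);
    }

    public static void main(String[] args) {
        // Assumed sizes: one on-disk map output compressed to 25% of its raw size.
        long rawLength = 1000L;        // uncompressed size, used as the denominator
        long compressedLength = 250L;  // size of the compressed file on disk

        // Numerator taken from the position in the compressed file:
        // it can never exceed compressedLength, so progress stalls at 0.25.
        float fromDiskPosition = progress(compressedLength, rawLength);

        // Numerator taken from the count of uncompressed bytes read:
        // it reaches rawLength, so progress reaches 1.0.
        float fromBytesRead = progress(rawLength, rawLength);

        System.out.println("max progress using on-disk position: " + fromDiskPosition);
        System.out.println("max progress using uncompressed bytes read: " + fromBytesRead);
    }
}
```

With these assumed sizes, the first ratio tops out at 0.25 while the second reaches 1.0, matching the jump described in the report.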