Ted, thank you. I filed MAPREDUCE-1571 to cover this issue. I might have some time to write a patch later this week.
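For anyone following along, the arithmetic behind the mismatch Ted describes can be sketched as below. The constants mirror the values quoted from ReduceTask.java (5 parallel copiers, a 0.25f per-segment fraction); the class and method names here are illustrative only, not Hadoop's actual API:

```java
// Sketch of the shuffle-memory mismatch: with the default of 5 parallel
// copiers each allowed up to 25% of the shuffle buffer, the worst case
// reserves 1.25x the buffer that was budgeted.
public class ShuffleMemoryMismatch {
    // Default of mapred.reduce.parallel.copies in ReduceTask.java
    static final int DEFAULT_PARALLEL_COPIES = 5;
    // MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION in ReduceTask.java
    static final float MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION = 0.25f;

    // Worst case: every copier simultaneously holds an in-memory segment
    // at the per-segment cap (maxSingleShuffleLimit).
    static long worstCaseInMemBytes(long maxSize) {
        long perSegmentLimit = (long) (maxSize * MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
        return DEFAULT_PARALLEL_COPIES * perSegmentLimit;
    }

    public static void main(String[] args) {
        long maxSize = 100L * 1024 * 1024; // e.g. a 100 MB shuffle buffer
        long worstCase = worstCaseInMemBytes(maxSize);
        System.out.println("buffer = " + maxSize + " bytes, worst case = "
                + worstCase + " bytes");
        // 5 copiers * 0.25 = 1.25x the buffer, i.e. more than was budgeted
    }
}
```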
Jacob Rideout

On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> I think there is a mismatch (in ReduceTask.java) between:
>     this.numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);
> and:
>     maxSingleShuffleLimit = (long)(maxSize * MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
> where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f, because
>     copiers = new ArrayList<MapOutputCopier>(numCopiers);
> so the total memory allocated for the in-mem shuffle is 1.25 * maxSize.
>
> A JIRA should be filed to correlate the constant 5 above with
> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
>
> Cheers
>
> On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <apa...@jacobrideout.net> wrote:
>> Hi all,
>>
>> We are seeing the following error in the reducers of a particular job:
>>
>> Error: java.lang.OutOfMemoryError: Java heap space
>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>>
>> After enough reducers fail, the entire job fails. This error occurs
>> regardless of whether mapred.compress.map.output is true. We were able
>> to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
>> to 20%. Shouldn't the framework, via ShuffleRamManager.canFitInMemory
>> and ShuffleRamManager.reserve, correctly detect the memory available
>> for allocation? I would think that with poor configuration settings
>> (and default settings in particular) the job might be less efficient,
>> but it shouldn't die.
>>
>> Here is some more context from the logs; I have attached the full
>> reducer log here: http://gist.github.com/323746
>>
>> 2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
>> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
>> attempt_201003060739_0002_m_000061_0
>> 2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
>> attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
>> attempt_201003060739_0002_m_000202_0
>> 2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
>> attempt_201003060739_0002_r_000000_0 adding host
>> hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
>> 2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
>> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
>> failures
>> 2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
>> attempt_201003060739_0002_r_000000_0 : Map output copy failure :
>> java.lang.OutOfMemoryError: Java heap space
>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>>
>> We tried this in both 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
>> would address the issue in 0.20.2, but it did not. Does anyone have
>> any comments or suggestions? Is this a bug I should file a JIRA for?
>>
>> Jacob Rideout
>> Return Path
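For anyone hitting the same OOM before a fix lands: the workaround described in the quoted message was lowering mapred.job.shuffle.input.buffer.percent to 20% of reducer heap. A sketch of the corresponding mapred-site.xml entry (the value is what worked for this job; tune it for your own heap sizes):

```xml
<property>
  <name>mapred.job.shuffle.input.buffer.percent</name>
  <!-- default is 0.70; reduced to 0.20 to avoid the shuffle OOM -->
  <value>0.20</value>
</property>
```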