Hi all,

We are seeing the following error in the reducers of one particular job:

Error: java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

After enough reducers fail, the entire job fails. The error occurs regardless of whether mapred.compress.map.output is true. We were able to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent to 20%.

Shouldn't the framework, via ShuffleRamManager.canFitInMemory and ShuffleRamManager.reserve, correctly detect the memory available for allocation? I would think that with poor configuration settings (and default settings in particular) the job might be less efficient, but would not die.

Here is some more context from the logs. I have attached the full reducer log here: http://gist.github.com/323746

2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 4191933 bytes (435311 raw bytes) into RAM from attempt_201003060739_0002_m_000061_0
2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201003060739_0002_r_000000_0: Failed fetch #1 from attempt_201003060739_0002_m_000202_0
2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201003060739_0002_r_000000_0 adding host hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous failures
2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner: attempt_201003060739_0002_r_000000_0 : Map output copy failure : java.lang.OutOfMemoryError: Java heap space
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
    at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

We tried this on both 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182 would address the issue in 0.20.2, but it did not.

Does anyone have any comments or suggestions? Is this a bug I should file a JIRA for?

Jacob Rideout
Return Path
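
P.S. To make the workaround concrete, here is a rough sketch of the arithmetic behind mapred.job.shuffle.input.buffer.percent. It is not Hadoop code: the class and method names are made up for illustration, the 200 MB heap is an assumed value (the stock -Xmx200m child setting, not necessarily what this job runs with), and the model is only "budget = max heap x percent". It says nothing about how much of the heap is actually free at copy time, which is the open question above.

```java
// Illustrative only: a simplified model of the in-memory shuffle budget.
// ShuffleBudget and its methods are hypothetical names, not Hadoop APIs.
public class ShuffleBudget {

    // Bytes of reducer heap set aside for in-memory map outputs,
    // modelled as maxHeapBytes * percent / 100 (integer arithmetic).
    static long shuffleBudgetBytes(long maxHeapBytes, int percent) {
        if (percent < 0 || percent > 100) {
            throw new IllegalArgumentException("percent must be in [0, 100]");
        }
        return maxHeapBytes * percent / 100;
    }

    // How many segments of a given decompressed size fit in that budget.
    static long segmentsThatFit(long budgetBytes, long segmentBytes) {
        return budgetBytes / segmentBytes;
    }

    public static void main(String[] args) {
        long heap = 200L * 1024 * 1024;   // assumed heap size (-Xmx200m)
        long segment = 4191933L;          // decompressed size from the log above

        for (int percent : new int[] {70, 20}) {  // default vs. the workaround
            long budget = shuffleBudgetBytes(heap, percent);
            System.out.println(percent + "% -> " + budget + " bytes, about "
                + segmentsThatFit(budget, segment) + " segments of this size");
        }
    }
}
```

Under these assumptions, dropping from the default 70% to 20% leaves far more of the heap outside the shuffle budget, which would be consistent with the OutOfMemoryError going away, though it does not explain why the reservation check let the allocation through in the first place.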