I think there is a mismatch (in ReduceTask.java) between:

    this.numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);

and:

    maxSingleShuffleLimit = (long)(maxSize * MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);

where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f, because:

    copiers = new ArrayList<MapOutputCopier>(numCopiers);

starts five copiers, each of which may hold an in-memory segment of up to
0.25 * maxSize at once, so the total memory allocated for the in-memory
shuffle can reach 5 * 0.25 * maxSize = 1.25 * maxSize.
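To make the arithmetic concrete, here is a minimal standalone sketch (not
Hadoop source; the 100 MB maxSize is an assumed example value) of the
worst-case reservation:

    // Worst-case in-memory shuffle reservation, using the 0.20.x
    // defaults quoted above. Not Hadoop code; maxSize is an example.
    public class ShuffleMathSketch {
        public static void main(String[] args) {
            int numCopiers = 5;                // mapred.reduce.parallel.copies default
            float segmentFraction = 0.25f;     // MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION
            long maxSize = 100L * 1024 * 1024; // assumed in-memory shuffle budget: 100 MB

            long maxSingleShuffleLimit = (long) (maxSize * segmentFraction);
            long worstCase = numCopiers * maxSingleShuffleLimit;

            // Prints 1.25x: five copiers can each have a 0.25 * maxSize
            // segment in flight, overcommitting the budget by 25%.
            System.out.printf("per-copier limit: %d bytes%n", maxSingleShuffleLimit);
            System.out.printf("worst case: %d bytes (%.2fx maxSize)%n",
                    worstCase, (double) worstCase / maxSize);
        }
    }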
A JIRA should be filed to correlate the default of 5 above with
MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.

Cheers

On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <apa...@jacobrideout.net> wrote:
> Hi all,
>
> We are seeing the following error in our reducers of a particular job:
>
> Error: java.lang.OutOfMemoryError: Java heap space
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>
> After enough reducers fail, the entire job fails. This error occurs
> regardless of whether mapred.compress.map.output is true. We were able
> to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
> to 20%. Shouldn't the framework, via ShuffleRamManager.canFitInMemory
> and ShuffleRamManager.reserve, correctly detect the memory available
> for allocation? I would think that with poor configuration settings
> (and default settings in particular) the job might be less efficient,
> but it shouldn't die.
>
> Here is some more context from the logs; I have attached the full
> reducer log here: http://gist.github.com/323746
>
> 2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
> attempt_201003060739_0002_m_000061_0
> 2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
> attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
> attempt_201003060739_0002_m_000202_0
> 2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
> attempt_201003060739_0002_r_000000_0 adding host
> hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
> 2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
> failures
> 2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
> attempt_201003060739_0002_r_000000_0 : Map output copy failure :
> java.lang.OutOfMemoryError: Java heap space
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>         at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>
> We tried this in both 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
> would address the issue in 0.20.2, but it did not. Does anyone have
> any comments or suggestions? Is this a bug I should file a JIRA for?
>
> Jacob Rideout
> Return Path
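P.S. For anyone else hitting this: the workaround Jacob describes,
lowering mapred.job.shuffle.input.buffer.percent from its 0.70 default
to 0.20, can be set from driver code. A minimal sketch, assuming a
hypothetical ShuffleTuning job class:

    import org.apache.hadoop.mapred.JobConf;

    public class ShuffleTuning {
        public static void main(String[] args) {
            JobConf conf = new JobConf(ShuffleTuning.class);
            // Default is 0.70; lowering it shrinks the in-memory shuffle
            // budget, and with it each copier's 0.25-of-budget segment
            // limit, leaving more heap headroom.
            conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.20f);
            // ... set mapper/reducer classes and submit as usual ...
        }
    }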