Ted,

Thank you. I filed MAPREDUCE-1571 to cover this issue. I might have
some time to write a patch later this week.
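
For what it's worth, the direction I have in mind (just a sketch, not
the actual patch, reusing the variable names from ReduceTask.java) is
to cap a single in-memory segment against the number of parallel
copiers as well, so that numCopiers concurrent reservations can never
exceed the overall in-memory budget:

       // Sketch only: bound a single segment by both the existing
       // fraction and an equal share of the budget across copiers, so
       // numCopiers concurrent reservations cannot exceed maxSize.
       maxSingleShuffleLimit = Math.min(
           (long) (maxSize * MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION),
           maxSize / numCopiers);

I'll see whether that holds up once I look more closely at
ShuffleRamManager.reserve.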

Jacob Rideout

On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yuzhih...@gmail.com> wrote:
> I think there is a mismatch (in ReduceTask.java) between:
>      this.numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);
> and:
>        maxSingleShuffleLimit = (long)(maxSize *
> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
> where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f
>
> because each of the numCopiers copiers can hold one segment in memory
> at the same time:
>      copiers = new ArrayList<MapOutputCopier>(numCopiers);
> so with 5 copiers each reserving up to 0.25 * maxSize, the total memory
> allocated for the in-mem shuffle can reach 1.25 * maxSize in the worst case.
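>
> For example (illustrative numbers only, not anyone's actual heap settings):
>
>      long maxSize = 1024L << 20;                  // say a 1 GB in-mem shuffle budget
>      long perSegment = (long) (maxSize * 0.25f);  // maxSingleShuffleLimit = 256 MB
>      long worstCase = 5 * perSegment;             // 5 copiers in flight = 1.25 GB > maxSize
>
> i.e. five concurrent reservations can overshoot the budget by 25%.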
>
> A JIRA should be filed to correlate the default of 5 above with
> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
>
> Cheers
>
> On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout
> <apa...@jacobrideout.net> wrote:
>
>> Hi all,
>>
>> We are seeing the following error in our reducers of a particular job:
>>
>> Error: java.lang.OutOfMemoryError: Java heap space
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>>
>>
>> After enough reducers fail, the entire job fails. This error occurs
>> regardless of whether mapred.compress.map.output is true. We were able
>> to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent
>> to 20%. Shouldn't the framework, via ShuffleRamManager.canFitInMemory
>> and ShuffleRamManager.reserve, correctly detect the memory available
>> for allocation? I would think that with poor configuration settings
>> (and the defaults in particular) the job might be less efficient, but
>> it shouldn't die.
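>>
>> For reference, a minimal sketch of what we changed, via the job conf
>> (the equivalent mapred-site.xml property works too):
>>
>>        // shrink the in-memory shuffle buffer from the default 0.70 of
>>        // the reducer heap down to 0.20 so concurrent fetches fit
>>        conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.20f);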
>>
>> Here is some more context from the logs; I have attached the full
>> reducer log here: http://gist.github.com/323746
>>
>>
>> 2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
>> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
>> attempt_201003060739_0002_m_000061_0
>> 2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
>> attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
>> attempt_201003060739_0002_m_000202_0
>> 2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
>> attempt_201003060739_0002_r_000000_0 adding host
>> hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
>> 2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
>> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous
>> failures
>> 2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner:
>> attempt_201003060739_0002_r_000000_0 : Map output copy failure :
>> java.lang.OutOfMemoryError: Java heap space
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>        at
>> org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>>
>>
>> We tried this in both 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182
>> would address the issue in 0.20.2, but it did not. Does anyone have
>> any comments or suggestions? Is this a bug I should file a JIRA for?
>>
>> Jacob Rideout
>> Return Path
>>
>
