I verified that size and maxSize are long. This means MR-1182 didn't resolve
Andy's issue.

According to Andy:
At the beginning of the job there are 209,754 pending map tasks and 32
pending reduce tasks.

My guess is that GC wasn't reclaiming memory fast enough, leading to the OOME
because of the large number of in-memory shuffle candidates.

My suggestion for Andy would be to:
1. add -verbose:gc as a JVM parameter
2. modify reserve() slightly to track the maximum outstanding
numPendingRequests and print that maximum (rough sketch below)
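
For item 2, something along these lines (an untested sketch;
maxObservedPendingRequests is a name I'm making up here):

    // New field in ShuffleRamManager, for instrumentation only:
    private int maxObservedPendingRequests = 0;

    // Inside reserve(), right after numPendingRequests is incremented:
    if (numPendingRequests > maxObservedPendingRequests) {
      maxObservedPendingRequests = numPendingRequests;
      LOG.info("New max outstanding shuffle reservations: "
               + maxObservedPendingRequests);
    }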

Based on the output from the above two items, we can discuss a solution.
My intuition is to place an upper bound on numPendingRequests beyond which
canFitInMemory() returns false.
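
Something like this, purely as a sketch (MAX_PENDING_REQUESTS would be a new
constant we'd have to pick based on the output above):

    boolean canFitInMemory(long requestedSize) {
      return (requestedSize < Integer.MAX_VALUE &&
              requestedSize < maxSingleShuffleLimit &&
              numPendingRequests < MAX_PENDING_REQUESTS);  // new guard
    }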
My two cents.

On Tue, Mar 9, 2010 at 11:51 PM, Christopher Douglas <chri...@yahoo-inc.com> wrote:

> That section of code is unmodified in MR-1182. See the patches/svn log. -C
>
> Sent from my iPhone
>
>
> On Mar 9, 2010, at 7:44 PM, "Ted Yu" <yuzhih...@gmail.com> wrote:
>
>> I just downloaded the hadoop-0.20.2 tarball from the Cloudera mirror.
>> This is what I see in ReduceTask (line 999):
>>     public synchronized boolean reserve(int requestedSize, InputStream in)
>>         throws InterruptedException {
>>       // Wait till the request can be fulfilled...
>>       while ((size + requestedSize) > maxSize) {
>>
>> I don't see the fix from MR-1182.
>>
>> That's why I suggested to Andy that he manually apply MR-1182.
>>
>> Cheers
>>
>> On Tue, Mar 9, 2010 at 5:01 PM, Andy Sautins <andy.saut...@returnpath.net> wrote:
>>
>>
>>> Thanks Christopher.
>>>
>>> The heap size for reduce tasks is configured to be 640M
>>> (mapred.child.java.opts set to -Xmx640m).
>>>
>>> Andy
>>>
>>> -----Original Message-----
>>> From: Christopher Douglas [mailto:chri...@yahoo-inc.com]
>>> Sent: Tuesday, March 09, 2010 5:19 PM
>>> To: common-user@hadoop.apache.org
>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>
>>> No, MR-1182 is included in 0.20.2
>>>
>>> What heap size have you set for your reduce tasks? -C
>>>
>>> Sent from my iPhone
>>>
>>> On Mar 9, 2010, at 2:34 PM, "Ted Yu" <yuzhih...@gmail.com> wrote:
>>>
>>>> Andy:
>>>> You need to manually apply the patch.
>>>>
>>>> Cheers
>>>>
>>>> On Tue, Mar 9, 2010 at 2:23 PM, Andy Sautins <andy.saut...@returnpath.net> wrote:
>>>>>
>>>>
>>>>> Thanks Ted.  My understanding is that MAPREDUCE-1182 is included in the
>>>>> 0.20.2 release.  We upgraded our cluster to 0.20.2 this weekend, re-ran
>>>>> the same job scenarios, and continue to hit the same Java heap space
>>>>> error even with mapred.reduce.parallel.copies set to 1.
>>>>>
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Ted Yu [mailto:yuzhih...@gmail.com]
>>>>> Sent: Tuesday, March 09, 2010 12:56 PM
>>>>> To: common-user@hadoop.apache.org
>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>>>
>>>>> This issue has been resolved in
>>>>> http://issues.apache.org/jira/browse/MAPREDUCE-1182
>>>>>
>>>>> Please apply the patch M1182-1v20.patch
>>>>> <http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch>
>>>>>
>>>>> On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <andy.saut...@returnpath.net> wrote:
>>>>>>
>>>>>> Thanks Ted.  Very helpful.  You are correct that I misunderstood the
>>>>>> code at ReduceTask.java:1535.  I missed the fact that it's in an
>>>>>> IOException catch block.  My mistake.  That's what I get for being in
>>>>>> a rush.
>>>>>>
>>>>>> For what it's worth, I did re-run the job with
>>>>>> mapred.reduce.parallel.copies set to values from 5 all the way down
>>>>>> to 1.  All runs failed with the same error:
>>>>>>
>>>>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>>>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>>>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>>>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>>>>>>
>>>>>>
>>>>>> So from that it does seem like something else might be going on, yes?
>>>>>> I need to do some more research.
>>>>>>
>>>>>> I appreciate your insights.
>>>>>>
>>>>>> Andy
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Ted Yu [mailto:yuzhih...@gmail.com]
>>>>>> Sent: Sunday, March 07, 2010 3:38 PM
>>>>>> To: common-user@hadoop.apache.org
>>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>>>>
>>>>>> My observation is based on this call chain:
>>>>>> MapOutputCopier.run() calling copyOutput() calling getMapOutput()
>>>>>> calling ramManager.canFitInMemory(decompressedLength)
>>>>>>
>>>>>> Basically ramManager.canFitInMemory() makes its decision without
>>>>>> considering the number of MapOutputCopiers that are running.  Thus
>>>>>> 1.25 * 0.7 of the total heap may be used in shuffling if the default
>>>>>> parameters are used.
>>>>>> Of course, you should check the value of mapred.reduce.parallel.copies
>>>>>> to see if it is 5.  If it is 4 or lower, my reasoning wouldn't apply.
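>>>>>>
>>>>>> To put rough numbers on that (using Andy's 640M heap; 0.70 is the
>>>>>> default mapred.job.shuffle.input.buffer.percent and 0.25 is
>>>>>> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION; the arithmetic is mine):
>>>>>>
>>>>>>     maxSize          = 0.70 * 640M = 448M   // in-memory shuffle budget
>>>>>>     per-copier limit = 0.25 * 448M = 112M   // maxSingleShuffleLimit
>>>>>>     5 copiers        = 5 * 112M    = 560M   // 1.25 * 0.7 of the heap
>>>>>>
>>>>>> 560M of in-flight shuffle data leaves very little of a 640M heap for
>>>>>> everything else the reduce task holds.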
>>>>>>
>>>>>> About the ramManager.unreserve() call: ReduceTask.java from hadoop
>>>>>> 0.20.2 only has 2731 lines, so I have to guess the location of the
>>>>>> code snippet you provided.
>>>>>> I found this around line 1535:
>>>>>>     } catch (IOException ioe) {
>>>>>>       LOG.info("Failed to shuffle from " +
>>>>>>                mapOutputLoc.getTaskAttemptId(), ioe);
>>>>>>
>>>>>>       // Inform the ram-manager
>>>>>>       ramManager.closeInMemoryFile(mapOutputLength);
>>>>>>       ramManager.unreserve(mapOutputLength);
>>>>>>
>>>>>>       // Discard the map-output
>>>>>>       try {
>>>>>>         mapOutput.discard();
>>>>>>       } catch (IOException ignored) {
>>>>>>         LOG.info("Failed to discard map-output from " +
>>>>>>                  mapOutputLoc.getTaskAttemptId(), ignored);
>>>>>>       }
>>>>>> Please confirm the line number.
>>>>>>
>>>>>> If we're looking at the same code, I am afraid I don't see how we can
>>>>>> improve it.  First, I assume the IOException shouldn't happen that
>>>>>> often.  Second, mapOutput.discard() just sets:
>>>>>>       data = null;
>>>>>> for the in-memory case.  Even if we called mapOutput.discard() before
>>>>>> ramManager.unreserve(), we wouldn't know when GC would kick in and
>>>>>> make more memory available.
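>>>>>>
>>>>>> Keep in mind that the ram manager only does logical bookkeeping.
>>>>>> Roughly (my paraphrase of the idea, not the actual Hadoop source):
>>>>>>
>>>>>>     // unreserve() just adjusts a byte counter and wakes copiers
>>>>>>     // blocked in reserve(); the actual bytes are only reclaimed
>>>>>>     // whenever GC gets around to them.
>>>>>>     public synchronized void unreserve(int requestedSize) {
>>>>>>       size -= requestedSize;
>>>>>>       notifyAll();
>>>>>>     }
>>>>>>
>>>>>> So reordering the two calls changes the accounting, not when the JVM
>>>>>> actually reclaims the bytes.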
>>>>>> Of course, given the large number of map outputs in your system, the
>>>>>> root cause I described above becomes all the more likely to make the
>>>>>> OOME happen sooner.
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>> On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <andy.saut...@returnpath.net> wrote:
>>>>>>>
>>>>>>> Ted,
>>>>>>>
>>>>>>> I'm trying to follow the logic in your mail and I'm not sure I'm
>>>>>>> following.  If you wouldn't mind helping me understand, I would
>>>>>>> appreciate it.
>>>>>>>
>>>>>>> Looking at the code, maxSingleShuffleLimit is only used in
>>>>>>> determining if the copy _can_ fit into memory:
>>>>>>>
>>>>>>>   boolean canFitInMemory(long requestedSize) {
>>>>>>>     return (requestedSize < Integer.MAX_VALUE &&
>>>>>>>             requestedSize < maxSingleShuffleLimit);
>>>>>>>   }
>>>>>>>
>>>>>>> It also looks like RamManager.reserve should wait until memory is
>>>>>>> available, so it shouldn't hit a memory limit for that reason.
>>>>>>>
>>>>>>> What does seem a little strange to me is the following
>>>>>>> (ReduceTask.java starting at line 2730):
>>>>>>>
>>>>>>>       // Inform the ram-manager
>>>>>>>       ramManager.closeInMemoryFile(mapOutputLength);
>>>>>>>       ramManager.unreserve(mapOutputLength);
>>>>>>>
>>>>>>>       // Discard the map-output
>>>>>>>       try {
>>>>>>>         mapOutput.discard();
>>>>>>>       } catch (IOException ignored) {
>>>>>>>         LOG.info("Failed to discard map-output from " +
>>>>>>>                  mapOutputLoc.getTaskAttemptId(), ignored);
>>>>>>>       }
>>>>>>>       mapOutput = null;
>>>>>>>
>>>>>>> So to me that looks like the ramManager unreserves the memory before
>>>>>>> the mapOutput is discarded.  Shouldn't the mapOutput be discarded
>>>>>>> _before_ the ramManager unreserves the memory?  If the memory is
>>>>>>> unreserved before the actual underlying data references are removed,
>>>>>>> then it seems like another thread could try to allocate memory
>>>>>>> (ReduceTask.java:2730) before the previous memory is disposed of
>>>>>>> (mapOutput.discard()).
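>>>>>>>
>>>>>>> In other words, something like this ordering (just a sketch of what
>>>>>>> I mean, not a tested patch):
>>>>>>>
>>>>>>>       // Discard the map-output first, dropping the data reference
>>>>>>>       // before the memory is handed back to other copiers
>>>>>>>       try {
>>>>>>>         mapOutput.discard();
>>>>>>>       } catch (IOException ignored) {
>>>>>>>         LOG.info("Failed to discard map-output from " +
>>>>>>>                  mapOutputLoc.getTaskAttemptId(), ignored);
>>>>>>>       }
>>>>>>>       mapOutput = null;
>>>>>>>
>>>>>>>       // Only now inform the ram-manager
>>>>>>>       ramManager.closeInMemoryFile(mapOutputLength);
>>>>>>>       ramManager.unreserve(mapOutputLength);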
>>>>>>>
>>>>>>> Not sure that makes sense.  One thing to note is that the particular
>>>>>>> job that is failing does have a good number (200k+) of map outputs.
>>>>>>> The large number of small map outputs may be why we are triggering a
>>>>>>> problem.
>>>>>>>
>>>>>>> Thanks again for your thoughts.
>>>>>>>
>>>>>>> Andy
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: Jacob R Rideout [mailto:apa...@jacobrideout.net]
>>>>>>> Sent: Sunday, March 07, 2010 1:21 PM
>>>>>>> To: common-user@hadoop.apache.org
>>>>>>> Cc: Andy Sautins; Ted Yu
>>>>>>> Subject: Re: Shuffle In Memory OutOfMemoryError
>>>>>>>
>>>>>>> Ted,
>>>>>>>
>>>>>>> Thank you. I filed MAPREDUCE-1571 to cover this issue. I might have
>>>>>>> some time to write a patch later this week.
>>>>>>>
>>>>>>> Jacob Rideout
>>>>>>>
>>>>>>> On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yuzhih...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I think there is a mismatch (in ReduceTask.java) between:
>>>>>>>>
>>>>>>>>   this.numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);
>>>>>>>>
>>>>>>>> and:
>>>>>>>>
>>>>>>>>   maxSingleShuffleLimit = (long)(maxSize *
>>>>>>>>       MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);
>>>>>>>>
>>>>>>>> where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f.
>>>>>>>>
>>>>>>>> Because
>>>>>>>>
>>>>>>>>   copiers = new ArrayList<MapOutputCopier>(numCopiers);
>>>>>>>>
>>>>>>>> the total memory allocated for the in-memory shuffle can reach
>>>>>>>> 1.25 * maxSize.
>>>>>>>>
>>>>>>>> A JIRA should be filed to correlate the constant 5 above with
>>>>>>>> MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.
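>>>>>>>>
>>>>>>>> One possible way to correlate them (strictly a sketch for the JIRA
>>>>>>>> discussion, not a tested fix) is to derive the per-copier limit
>>>>>>>> from the copier count:
>>>>>>>>
>>>>>>>>   // Hypothetical: keep numCopiers concurrent segments within the
>>>>>>>>   // overall in-memory shuffle budget (maxSize).
>>>>>>>>   maxSingleShuffleLimit =
>>>>>>>>       Math.min((long)(maxSize * MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION),
>>>>>>>>                maxSize / numCopiers);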
>>>>>>>>
>>>>>>>> Cheers
>>>>>>>>
>>>>>>>> On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <apa...@jacobrideout.net> wrote:
>>>>>>>>
>>>>>>>>> Hi all,
>>>>>>>>>
>>>>>>>>> We are seeing the following error in our reducers of a particular job:
>>>>>>>>>
>>>>>>>>> Error: java.lang.OutOfMemoryError: Java heap space
>>>>>>>>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>>>>>>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>>>>>>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>>>>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>>>>>>>>>
>>>>>>>>> After enough reducers fail, the entire job fails.  This error
>>>>>>>>> occurs regardless of whether mapred.compress.map.output is true.
>>>>>>>>> We were able to avoid the issue by reducing
>>>>>>>>> mapred.job.shuffle.input.buffer.percent to 20%.  Shouldn't the
>>>>>>>>> framework, via ShuffleRamManager.canFitInMemory and
>>>>>>>>> ShuffleRamManager.reserve, correctly detect the memory available
>>>>>>>>> for allocation?  I would think that with poor configuration
>>>>>>>>> settings (and default settings in particular) the job may not be
>>>>>>>>> as efficient, but it wouldn't die.
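>>>>>>>>>
>>>>>>>>> For reference, the workaround amounts to something like this in
>>>>>>>>> the job setup (a sketch; MyJob is a placeholder):
>>>>>>>>>
>>>>>>>>>     JobConf conf = new JobConf(MyJob.class);
>>>>>>>>>     // shrink the in-memory shuffle buffer from the default 70%
>>>>>>>>>     // of the heap down to 20%
>>>>>>>>>     conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.20f);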
>>>>>>>>>
>>>>>>>>> Here is some more context from the logs; I have attached the full
>>>>>>>>> reducer log here: http://gist.github.com/323746
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> 2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>>>>>> Shuffling 4191933 bytes (435311 raw bytes) into RAM from
>>>>>>>>> attempt_201003060739_0002_m_000061_0
>>>>>>>>> 2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task
>>>>>>>>> attempt_201003060739_0002_r_000000_0: Failed fetch #1 from
>>>>>>>>> attempt_201003060739_0002_m_000202_0
>>>>>>>>> 2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask:
>>>>>>>>> attempt_201003060739_0002_r_000000_0 adding host
>>>>>>>>> hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
>>>>>>>>> 2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask:
>>>>>>>>> attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from
>>>>>>>>> previous failures
>>>>>>>>> 2010-03-06 07:54:50,223 FATAL
>>>>>>>>> org.apache.hadoop.mapred.TaskRunner:
>>>>>>>>> attempt_201003060739_0002_r_000000_0 : Map output copy failure :
>>>>>>>>> java.lang.OutOfMemoryError: Java heap space
>>>>>>>>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
>>>>>>>>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
>>>>>>>>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
>>>>>>>>>     at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)
>>>>>>>>>
>>>>>>>>> We tried this in both 0.20.1 and 0.20.2.  We had hoped
>>>>>>>>> MAPREDUCE-1182 would address the issue in 0.20.2, but it did not.
>>>>>>>>> Does anyone have any comments or suggestions?  Is this a bug I
>>>>>>>>> should file a JIRA for?
>>>>>>>>>
>>>>>>>>> Jacob Rideout
>>>>>>>>> Return Path
>>>>>>>>>