I pressed the send key a bit early. I will have to dig a bit deeper. Hopefully someone can find a reader.close() call, after which I will look for another possible root cause :-)
On Wed, Mar 10, 2010 at 7:48 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Thanks to Andy for the log he provided.

You can see from the log below that size increased steadily from 341535057 to 408181692, approaching maxSize. Then OOME:

2010-03-10 18:38:32,936 INFO org.apache.hadoop.mapred.ReduceTask: reserve: pos=start requestedSize=3893000 size=341535057 numPendingRequests=0 maxSize=417601952
2010-03-10 18:38:32,936 INFO org.apache.hadoop.mapred.ReduceTask: reserve: pos=end requestedSize=3893000 size=345428057 numPendingRequests=0 maxSize=417601952
...
2010-03-10 18:38:35,950 INFO org.apache.hadoop.mapred.ReduceTask: reserve: pos=end requestedSize=635753 size=408181692 numPendingRequests=0 maxSize=417601952
2010-03-10 18:38:36,603 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201003101826_0001_r_000004_0: Failed fetch #1 from attempt_201003101826_0001_m_000875_0
2010-03-10 18:38:36,603 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201003101826_0001_r_000004_0 adding host hd17.dfs.returnpath.net to penalty box, next contact in 4 seconds
2010-03-10 18:38:36,604 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201003101826_0001_r_000004_0: Got 1 map-outputs from previous failures
2010-03-10 18:38:36,605 FATAL org.apache.hadoop.mapred.TaskRunner: attempt_201003101826_0001_r_000004_0 : Map output copy failure : java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1513)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1413)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1266)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1200)

Looking at the calls to unreserve() in ReduceTask, two are in IOException handlers and the other is a sanity check (line 1557), meaning they wouldn't be reached on the normal execution path.

I see one call in IFile.InMemoryReader's close() method:

    // Inform the RamManager
    ramManager.unreserve(bufferSize);

And InMemoryReader is used in createInMemorySegments():

    Reader<K, V> reader =
        new InMemoryReader<K, V>(ramManager, mo.mapAttemptId,
                                 mo.data, 0, mo.data.length);

But I don't see a reader.close() call anywhere in the ReduceTask file.
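For concreteness, here is the shape of the fix Ted is hinting at: some owner of each InMemoryReader has to close it, because per the snippet above, InMemoryReader.close() is the only normal-path call to ramManager.unreserve(). A minimal sketch, assuming the 0.20 structure quoted above; the surrounding merge step is illustrative, not the shipped code:

    // Sketch only: ensure the reader is closed once its in-memory
    // segment has been consumed, so the reservation is returned.
    Reader<K, V> reader =
        new InMemoryReader<K, V>(ramManager, mo.mapAttemptId,
                                 mo.data, 0, mo.data.length);
    try {
      // ... feed the segment to the in-memory merge ...
    } finally {
      reader.close();   // InMemoryReader.close() calls
                        // ramManager.unreserve(bufferSize)
    }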
On Wed, Mar 10, 2010 at 3:34 PM, Chris Douglas <chri...@yahoo-inc.com> wrote:

I don't think this OOM is a framework bug per se, and given the rewrite/refactoring of the shuffle in MAPREDUCE-318 (in 0.21), tuning the 0.20 shuffle semantics is likely not worthwhile (though data informing improvements to trunk would be excellent). Most likely (and tautologically), ReduceTask simply requires more memory than is available, and the job failure can be avoided by either 0) increasing the heap size or 1) lowering mapred.job.shuffle.input.buffer.percent. Most of the tasks we run have a heap of 1GB. For a reduce fetching >200k map outputs, that's a reasonable, even stingy amount of space. -C

On Mar 10, 2010, at 5:26 AM, Ted Yu wrote:

I verified that size and maxSize are long. This means MR-1182 didn't resolve Andy's issue.

According to Andy, at the beginning of the job there are 209,754 pending map tasks and 32 pending reduce tasks.

My guess is that GC wasn't reclaiming memory fast enough, leading to OOME because of the large number of in-memory shuffle candidates.

My suggestion for Andy would be to:
1. add -verbose:gc as a JVM parameter
2. modify reserve() slightly to track the maximum outstanding numPendingRequests, and print that maximum

Based on the output from the above two items, we can discuss a solution. My intuition is to place an upper bound on numPendingRequests beyond which canFitInMemory() returns false.

My two cents.
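A sketch of what Ted's two-part proposal might look like against the 0.20 ShuffleRamManager, built only from the method shapes quoted later in this thread; the counter bookkeeping, field names, and threshold are hypothetical:

    // Hypothetical instrumentation: track how many reserve() callers
    // are outstanding, and remember the high-water mark.
    private int numPendingRequests = 0;
    private int maxPendingSeen = 0;
    private static final int MAX_PENDING_REQUESTS = 10;  // made-up bound

    public synchronized boolean reserve(int requestedSize, InputStream in)
        throws InterruptedException {
      numPendingRequests++;
      maxPendingSeen = Math.max(maxPendingSeen, numPendingRequests);
      LOG.info("reserve: numPendingRequests=" + numPendingRequests +
               " maxPendingSeen=" + maxPendingSeen);
      try {
        // ... existing logic, i.e. the wait loop quoted below:
        // while ((size + requestedSize) > maxSize) { ... }
        return true;  // placeholder for the existing return value
      } finally {
        numPendingRequests--;
      }
    }

    // And the proposed upper bound: refuse the in-memory path once too
    // many fetches are already waiting on memory.
    boolean canFitInMemory(long requestedSize) {
      return (requestedSize < Integer.MAX_VALUE &&
              requestedSize < maxSingleShuffleLimit &&
              numPendingRequests < MAX_PENDING_REQUESTS);
    }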
On Tue, Mar 9, 2010 at 11:51 PM, Christopher Douglas <chri...@yahoo-inc.com> wrote:

That section of code is unmodified in MR-1182. See the patches/svn log. -C

Sent from my iPhone

On Mar 9, 2010, at 7:44 PM, "Ted Yu" <yuzhih...@gmail.com> wrote:

I just downloaded the hadoop-0.20.2 tar ball from the cloudera mirror. This is what I see in ReduceTask (line 999):

    public synchronized boolean reserve(int requestedSize, InputStream in)
        throws InterruptedException {
      // Wait till the request can be fulfilled...
      while ((size + requestedSize) > maxSize) {

I don't see the fix from MR-1182. That's why I suggested to Andy that he manually apply MR-1182.

Cheers

On Tue, Mar 9, 2010 at 5:01 PM, Andy Sautins <andy.saut...@returnpath.net> wrote:

Thanks Christopher.

The heap size for reduce tasks is configured to be 640M (mapred.child.java.opts set to -Xmx640m).

Andy

-----Original Message-----
From: Christopher Douglas [mailto:chri...@yahoo-inc.com]
Sent: Tuesday, March 09, 2010 5:19 PM
To: common-user@hadoop.apache.org
Subject: Re: Shuffle In Memory OutOfMemoryError

No, MR-1182 is included in 0.20.2.

What heap size have you set for your reduce tasks? -C

Sent from my iPhone

On Mar 9, 2010, at 2:34 PM, "Ted Yu" <yuzhih...@gmail.com> wrote:

Andy:
You need to manually apply the patch.

Cheers

On Tue, Mar 9, 2010 at 2:23 PM, Andy Sautins <andy.saut...@returnpath.net> wrote:

Thanks Ted. My understanding is that MAPREDUCE-1182 is included in the 0.20.2 release. We upgraded our cluster to 0.20.2 this weekend, re-ran the same job scenarios with mapred.reduce.parallel.copies set to 1, and continue to get the same Java heap space error.

-----Original Message-----
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Tuesday, March 09, 2010 12:56 PM
To: common-user@hadoop.apache.org
Subject: Re: Shuffle In Memory OutOfMemoryError

This issue has been resolved in http://issues.apache.org/jira/browse/MAPREDUCE-1182

Please apply the patch M1182-1v20.patch:
http://issues.apache.org/jira/secure/attachment/12424116/M1182-1v20.patch

On Sun, Mar 7, 2010 at 3:57 PM, Andy Sautins <andy.saut...@returnpath.net> wrote:

Thanks Ted. Very helpful. You are correct that I misunderstood the code at ReduceTask.java:1535. I missed the fact that it's in an IOException catch block. My mistake. That's what I get for being in a rush.

For what it's worth, I did re-run the job with mapred.reduce.parallel.copies set to values from 5 all the way down to 1. All failed with the same error:

Error: java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

So from that it does seem like something else might be going on, yes? I need to do some more research.

I appreciate your insights.

Andy

-----Original Message-----
From: Ted Yu [mailto:yuzhih...@gmail.com]
Sent: Sunday, March 07, 2010 3:38 PM
To: common-user@hadoop.apache.org
Subject: Re: Shuffle In Memory OutOfMemoryError

My observation is based on this call chain:
MapOutputCopier.run() calling copyOutput() calling getMapOutput() calling ramManager.canFitInMemory(decompressedLength)

Basically, ramManager.canFitInMemory() makes its decision without considering the number of MapOutputCopiers that are running, so 1.25 * 0.7 of the total heap may be used in shuffling if the default parameters are in effect. Of course, you should check the value of mapred.reduce.parallel.copies to see if it is 5. If it is 4 or lower, my reasoning wouldn't apply.

About the ramManager.unreserve() call: ReduceTask.java from hadoop 0.20.2 only has 2731 lines, so I have to guess the location of the code snippet you provided. I found this around line 1535:

    } catch (IOException ioe) {
      LOG.info("Failed to shuffle from " +
               mapOutputLoc.getTaskAttemptId(), ioe);

      // Inform the ram-manager
      ramManager.closeInMemoryFile(mapOutputLength);
      ramManager.unreserve(mapOutputLength);

      // Discard the map-output
      try {
        mapOutput.discard();
      } catch (IOException ignored) {
        LOG.info("Failed to discard map-output from " +
                 mapOutputLoc.getTaskAttemptId(), ignored);
      }

Please confirm the line number.

If we're looking at the same code, I am afraid I don't see how we can improve it. First, I assume IOException shouldn't happen that often. Second, mapOutput.discard() just sets:

    data = null;

for the in-memory case. Even if we call mapOutput.discard() before ramManager.unreserve(), we don't know when GC would kick in and make more memory available. Of course, given the large number of map outputs in your system, it became more likely that the root cause from my reasoning made the OOME happen sooner.

Thanks
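For readers following along, the call chain Ted describes reduces to roughly this shape (paraphrased and simplified from 0.20's getMapOutput(); a sketch, not the shipped code):

    // Each of the (default 5) MapOutputCopier threads runs this
    // independently; canFitInMemory() only checks the single request
    // against maxSingleShuffleLimit, not what the other copiers hold.
    if (ramManager.canFitInMemory(decompressedLength)) {
      // In-memory path: shuffleInMemory() calls ramManager.reserve(),
      // which blocks while (size + requestedSize) > maxSize.
      mapOutput = shuffleInMemory(mapOutputLoc, connection, input,
                                  (int) decompressedLength,
                                  (int) compressedLength);
    } else {
      // Otherwise the map output is shuffled straight to disk.
      mapOutput = shuffleToDisk(mapOutputLoc, input, filename,
                                compressedLength);
    }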
On Sun, Mar 7, 2010 at 1:03 PM, Andy Sautins <andy.saut...@returnpath.net> wrote:

Ted,

I'm trying to follow the logic in your mail and I'm not sure I'm following. If you wouldn't mind helping me understand, I would appreciate it.

Looking at the code, maxSingleShuffleLimit is only used in determining if the copy _can_ fit into memory:

    boolean canFitInMemory(long requestedSize) {
      return (requestedSize < Integer.MAX_VALUE &&
              requestedSize < maxSingleShuffleLimit);
    }

It also looks like RamManager.reserve should wait until memory is available, so it shouldn't hit a memory limit for that reason.

What does seem a little strange to me is the following (ReduceTask.java starting at 2730):

      // Inform the ram-manager
      ramManager.closeInMemoryFile(mapOutputLength);
      ramManager.unreserve(mapOutputLength);

      // Discard the map-output
      try {
        mapOutput.discard();
      } catch (IOException ignored) {
        LOG.info("Failed to discard map-output from " +
                 mapOutputLoc.getTaskAttemptId(), ignored);
      }
      mapOutput = null;

So to me that looks like the ramManager unreserves the memory before the mapOutput is discarded. Shouldn't the mapOutput be discarded _before_ the ramManager unreserves the memory? If the memory is unreserved before the actual underlying data references are removed, then it seems like another thread can try to allocate memory (ReduceTask.java:2730) before the previous memory is disposed (mapOutput.discard()).

Not sure that makes sense. One thing to note is that the particular job that is failing does have a good number (200k+) of map outputs. The large number of small map outputs may be why we are triggering a problem.

Thanks again for your thoughts.

Andy
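The reordering Andy is asking about would look like the following, rearranging the snippet he quotes (whether it helps in practice is exactly what Ted questions above, since the bytes are only reclaimed whenever GC runs):

    // Reordered sketch: drop the data reference first, then release the
    // reservation, so another copier isn't admitted while mapOutput.data
    // is still reachable.
    try {
      mapOutput.discard();          // in-memory case: sets data = null
    } catch (IOException ignored) {
      LOG.info("Failed to discard map-output from " +
               mapOutputLoc.getTaskAttemptId(), ignored);
    }
    mapOutput = null;

    // Inform the ram-manager only after the reference is gone.
    ramManager.closeInMemoryFile(mapOutputLength);
    ramManager.unreserve(mapOutputLength);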
-----Original Message-----
From: Jacob R Rideout [mailto:apa...@jacobrideout.net]
Sent: Sunday, March 07, 2010 1:21 PM
To: common-user@hadoop.apache.org
Cc: Andy Sautins; Ted Yu
Subject: Re: Shuffle In Memory OutOfMemoryError

Ted,

Thank you. I filed MAPREDUCE-1571 to cover this issue. I might have some time to write a patch later this week.

Jacob Rideout

On Sat, Mar 6, 2010 at 11:37 PM, Ted Yu <yuzhih...@gmail.com> wrote:

I think there is a mismatch (in ReduceTask.java) between:

    this.numCopiers = conf.getInt("mapred.reduce.parallel.copies", 5);

and:

    maxSingleShuffleLimit = (long)(maxSize * MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION);

where MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION is 0.25f, because

    copiers = new ArrayList<MapOutputCopier>(numCopiers);

so the total memory allocated for in-mem shuffle is 1.25 * maxSize.

A JIRA should be filed to correlate the constant 5 above and MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION.

Cheers
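Plugging in the numbers from this thread makes the mismatch concrete. A back-of-the-envelope check, using the maxSize from Andy's instrumented logs and his 640M reduce heap:

    // Back-of-the-envelope check of Ted's 1.25 * maxSize claim.
    public class ShuffleMath {
      public static void main(String[] args) {
        long maxSize = 417601952L;       // from Andy's reserve() log lines
        int copiers = 5;                 // mapred.reduce.parallel.copies default
        double fraction = 0.25;          // MAX_SINGLE_SHUFFLE_SEGMENT_FRACTION
        long heap = 640L * 1024 * 1024;  // -Xmx640m

        // Worst case: every copier holds a segment just under the
        // single-shuffle limit at the same time.
        double worstCase = copiers * fraction * maxSize;
        System.out.printf("worst case in-mem shuffle: %.0f MB (%.0f%% of heap)%n",
                          worstCase / (1 << 20), 100.0 * worstCase / heap);
        // Prints roughly: worst case in-mem shuffle: 498 MB (78% of heap)
      }
    }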
On Sat, Mar 6, 2010 at 8:31 AM, Jacob R Rideout <apa...@jacobrideout.net> wrote:

Hi all,

We are seeing the following error in our reducers of a particular job:

Error: java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

After enough reducers fail, the entire job fails. This error occurs regardless of whether mapred.compress.map.output is true. We were able to avoid the issue by reducing mapred.job.shuffle.input.buffer.percent to 20%. Shouldn't the framework, via ShuffleRamManager.canFitInMemory and ShuffleRamManager.reserve, correctly detect the memory available for allocation? I would think that with poor configuration settings (and default settings in particular) the job may not be as efficient, but wouldn't die.

Here is some more context in the logs; I have attached the full reducer log here: http://gist.github.com/323746

2010-03-06 07:54:49,621 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 4191933 bytes (435311 raw bytes) into RAM from attempt_201003060739_0002_m_000061_0
2010-03-06 07:54:50,222 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_201003060739_0002_r_000000_0: Failed fetch #1 from attempt_201003060739_0002_m_000202_0
2010-03-06 07:54:50,223 WARN org.apache.hadoop.mapred.ReduceTask: attempt_201003060739_0002_r_000000_0 adding host hd37.dfs.returnpath.net to penalty box, next contact in 4 seconds
2010-03-06 07:54:50,223 INFO org.apache.hadoop.mapred.ReduceTask: attempt_201003060739_0002_r_000000_0: Got 1 map-outputs from previous failures
2010-03-06 07:54:50,223 FATAL org.apache.hadoop.mapred.TaskRunner: attempt_201003060739_0002_r_000000_0 : Map output copy failure : java.lang.OutOfMemoryError: Java heap space
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1508)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1408)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1261)
        at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1195)

We tried this both in 0.20.1 and 0.20.2. We had hoped MAPREDUCE-1182 would address the issue in 0.20.2, but it did not. Does anyone have any comments or suggestions? Is this a bug I should file a JIRA for?

Jacob Rideout
Return Path
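For reference, the workaround Jacob describes (the property defaults to 0.70 of the reducer heap) can be set per job; a minimal sketch using the 0.20 API:

    // Shrink the fraction of reducer heap used to buffer map outputs
    // during the shuffle; Jacob reports the OOME went away at 0.20.
    org.apache.hadoop.mapred.JobConf conf =
        new org.apache.hadoop.mapred.JobConf();
    conf.setFloat("mapred.job.shuffle.input.buffer.percent", 0.20f);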