We were getting this exact same problem in a really simple MR job, on input produced from a known-working MR job.
It seemed to happen intermittently, and we couldn't figure out what was up. In the end we solved the problem by increasing the number of maps (80 to 200; this is a 6-node, 12-core cluster). Apparently, QuickSort can have problems with big chunks of pre-sorted data. Too much recursion, I believe. This might not be what's going on with you, maybe you're on a cluster of some other scale, but this worked for us (and in a setup with Hadoop 0.17).

Good luck!

-Colin

On Mon, Jun 2, 2008 at 3:18 PM, Devaraj Das <[EMAIL PROTECTED]> wrote:
> Hi, do you have a testcase that we can run to reproduce this? Thanks!
>
> > -----Original Message-----
> > From: jkupferman [mailto:[EMAIL PROTECTED]]
> > Sent: Monday, June 02, 2008 9:22 AM
> > To: core-user@hadoop.apache.org
> > Subject: Stack Overflow When Running Job
> >
> > Hi everyone,
> > I have a job running that keeps failing with Stack Overflows
> > and I really don't see how that is happening.
> > The job runs for about 20-30 minutes before one task errors,
> > then a few more error out and it fails.
> > I am running hadoop-17 and I've tried lowering these settings
> > to no avail:
> > io.sort.factor 50
> > io.seqfile.sorter.recordlimit 500000
> >
> > java.io.IOException: Spill failed
> >   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:594)
> >   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:576)
> >   at java.io.DataOutputStream.writeInt(DataOutputStream.java:180)
> >   at Group.write(Group.java:68)
> >   at GroupPair.write(GroupPair.java:67)
> >   at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
> >   at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
> >   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:434)
> >   at MyMapper.map(MyMapper.java:27)
> >   at MyMapper.map(MyMapper.java:10)
> >   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
> >   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
> >   at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
> > Caused by: java.lang.StackOverflowError
> >   at java.io.DataInputStream.readInt(DataInputStream.java:370)
> >   at Group.readFields(Group.java:62)
> >   at GroupPair.readFields(GroupPair.java:60)
> >   at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:91)
> >   at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:494)
> >   at org.apache.hadoop.util.QuickSort.fix(QuickSort.java:29)
> >   at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:58)
> >   at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:58)
> > ....the above line repeated 200x
> >
> > I defined a WritableComparable called GroupPair which simply
> > holds two Group objects, each of which contains two integers.
> > I fail to see how QuickSort could recurse 200+ times, since
> > that would require an insanely large number of entries, far
> > more than the 500 million that had been output at that point.
> >
> > How is this even possible? And what can be done to fix this?
> > --
> > View this message in context:
> > http://www.nabble.com/Stack-Overflow-When-Running-Job-tp17593594p17593594.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
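Colin's point about QuickSort and pre-sorted data can be demonstrated with a toy sort. This is not Hadoop's actual QuickSort implementation, and the class and method names are invented for illustration; it just shows that with a naive pivot choice (last element), already-sorted input produces maximally unbalanced partitions, so recursion depth grows linearly with n rather than logarithmically. That is how a sort can blow the stack long before the input count looks "insanely large."

```java
// Sketch only: naive last-element-pivot quicksort, instrumented to
// record recursion depth. On sorted input every partition puts all
// n-1 remaining elements on one side, so depth reaches n-1.
public class QuickSortDepth {
    static int maxDepth = 0;

    static void sort(int[] a, int lo, int hi, int depth) {
        if (lo >= hi) return;
        if (depth > maxDepth) maxDepth = depth;
        int pivot = a[hi], i = lo;
        for (int j = lo; j < hi; j++) {
            if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
        }
        int t = a[i]; a[i] = a[hi]; a[hi] = t;
        sort(a, lo, i - 1, depth + 1); // on sorted input this side holds n-1 elements
        sort(a, i + 1, hi, depth + 1); // ...and this side is empty
    }

    public static void main(String[] args) {
        int n = 2000;
        int[] sorted = new int[n];
        for (int i = 0; i < n; i++) sorted[i] = i;
        sort(sorted, 0, n - 1, 1);
        // Linear depth (n-1), versus roughly log2(n) for a balanced sort.
        System.out.println("max recursion depth on sorted input: " + maxDepth);
    }
}
```

Splitting the same data across more maps (as Colin did) shrinks each per-map sort buffer, which bounds the worst-case recursion depth per task.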