We were getting this exact same problem in a really simple MR job, on input
produced from a known-working MR job.

It seemed to happen intermittently, and we couldn't figure out what was up.
In the end we solved the problem by increasing the number of maps (from 80 to
200; this is a 6-node, 12-core cluster).  Apparently, QuickSort can have
problems with big chunks of pre-sorted data.  Too much recursion, I believe.
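
Roughly what I mean (this is just a generic textbook quicksort for
illustration, not Hadoop's actual org.apache.hadoop.util.QuickSort): with a
last-element pivot, already-sorted input gives maximally lopsided partitions,
so the recursion depth grows with the size of the sorted run:

    class ToySort {
        // Toy quicksort with a last-element (Lomuto) pivot: on already-sorted
        // input every element is less than the pivot, so one side of each
        // partition is empty and recursion depth is linear in the run length.
        static void quickSort(int[] a, int lo, int hi) {
            if (lo >= hi) return;
            int pivot = a[hi], i = lo;
            for (int j = lo; j < hi; j++) {
                if (a[j] < pivot) { int t = a[i]; a[i] = a[j]; a[j] = t; i++; }
            }
            int t = a[i]; a[i] = a[hi]; a[hi] = t;
            quickSort(a, lo, i - 1);   // on sorted input: the whole range minus one
            quickSort(a, i + 1, hi);   // on sorted input: always empty
        }
    }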

This might not be what's going on for you (maybe your cluster is a different
scale), but this worked for us (and in a setup with Hadoop 0.17).
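
For reference, bumping the map count looks roughly like the sketch below on
the old (0.17-era) mapred API.  The driver class name is made up, and keep in
mind setNumMapTasks() is only a hint; the InputFormat's splits decide how many
maps actually run:

    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class MyJobDriver {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(MyJobDriver.class);
            conf.setNumMapTasks(200);   // we went from 80 to 200
            // ... set mapper, reducer, input/output paths, etc. ...
            // equivalently: -D mapred.map.tasks=200, or set it in hadoop-site.xml
            JobClient.runJob(conf);
        }
    }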

Good luck!

-Colin

On Mon, Jun 2, 2008 at 3:18 PM, Devaraj Das <[EMAIL PROTECTED]> wrote:

> Hi, do you have a testcase that we can run to reproduce this? Thanks!
>
> > -----Original Message-----
> > From: jkupferman [mailto:[EMAIL PROTECTED]
> > Sent: Monday, June 02, 2008 9:22 AM
> > To: core-user@hadoop.apache.org
> > Subject: Stack Overflow When Running Job
> >
> >
> > Hi everyone,
> > I have a job running that keeps failing with Stack Overflows
> > and I really don't see how that is happening.
> > The job runs for about 20-30 minutes before one task errors,
> > then a few more error out and it fails.
> > I am running Hadoop 0.17 and I've tried lowering these settings
> > to no avail:
> > io.sort.factor        50
> > io.seqfile.sorter.recordlimit 500000
> >
> > java.io.IOException: Spill failed
> >       at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:594)
> >       at org.apache.hadoop.mapred.MapTask$MapOutputBuffer$Buffer.write(MapTask.java:576)
> >       at java.io.DataOutputStream.writeInt(DataOutputStream.java:180)
> >       at Group.write(Group.java:68)
> >       at GroupPair.write(GroupPair.java:67)
> >       at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:90)
> >       at org.apache.hadoop.io.serializer.WritableSerialization$WritableSerializer.serialize(WritableSerialization.java:77)
> >       at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.collect(MapTask.java:434)
> >       at MyMapper.map(MyMapper.java:27)
> >       at MyMapper.map(MyMapper.java:10)
> >       at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:47)
> >       at org.apache.hadoop.mapred.MapTask.run(MapTask.java:219)
> >       at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:2124)
> > Caused by: java.lang.StackOverflowError
> >       at java.io.DataInputStream.readInt(DataInputStream.java:370)
> >       at Group.readFields(Group.java:62)
> >       at GroupPair.readFields(GroupPair.java:60)
> >       at org.apache.hadoop.io.WritableComparator.compare(WritableComparator.java:91)
> >       at org.apache.hadoop.mapred.MapTask$MapOutputBuffer.compare(MapTask.java:494)
> >       at org.apache.hadoop.util.QuickSort.fix(QuickSort.java:29)
> >       at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:58)
> >       at org.apache.hadoop.util.QuickSort.sort(QuickSort.java:58)
> > ....the above line repeated 200x
> >
> > I defined a WritableComparable called GroupPair which simply
> > holds two Group objects, each of which contains two integers.
> > I fail to see how QuickSort could recurse 200+ times since
> > that would require an insanely large number of entries, far
> > more than the 500 million that had been output at that point.
> >
> > How is this even possible? And what can be done to fix this?
> > --
> > View this message in context:
> > http://www.nabble.com/Stack-Overflow-When-Running-Job-tp17593594p17593594.html
> > Sent from the Hadoop core-user mailing list archive at Nabble.com.
> >
> >
>
>
