If you're doing a lot of gzip compression/decompression, you *might* be
hitting this Sun JVM bug, now more than six years old:
"Instantiating Inflater/Deflater causes OutOfMemoryError; finalizers not
called promptly enough"
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4797189
A workaround is listed in the issue: make sure you call close() or end()
on each Deflater when you're done with it; something similar likely
applies to Inflater.
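A minimal sketch of that workaround (the class and method names here are my own, not from the bug report): calling end() in a finally block releases the native zlib memory immediately instead of waiting for the finalizer to run.

```java
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ZlibEndDemo {
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] buf = new byte[input.length + 64]; // plenty for small inputs
            int n = deflater.deflate(buf);
            byte[] out = new byte[n];
            System.arraycopy(buf, 0, out, 0, n);
            return out;
        } finally {
            deflater.end(); // release native memory now, not at finalization
        }
    }

    static byte[] decompress(byte[] compressed, int originalLength) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] out = new byte[originalLength];
            inflater.inflate(out);
            return out;
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        } finally {
            inflater.end(); // same workaround for the Inflater side
        }
    }

    public static void main(String[] args) {
        byte[] data = "hello hello hello hello".getBytes();
        byte[] roundTrip = decompress(compress(data), data.length);
        System.out.println(new String(roundTrip).equals(new String(data)));
    }
}
```

If the Deflater/Inflater is buried inside a library (e.g. a GZIPInputStream), closing the stream promptly has the same effect, since close() ends the underlying native object.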
(This is one of those fun JVM situations where having *more* heap space
can make OOMEs *more* likely: lower heap pressure means fewer GC runs,
which leaves more un-GC'd or un-finalized objects around, each of which
is holding a bit of native memory.)
- Gordon @ IA
bzheng wrote:
I have about 24k gz files (about 550GB total) on HDFS and a really simple
Java program to convert them into sequence files. If the job's
setInputPaths takes a Path[] of all 24k files, it gets an OutOfMemoryError
at about 35% map completion. If I make the job process 2k files at a time
and run 12 jobs consecutively, it goes through all the files fine. The
cluster I'm using has about 67 nodes; each node has 16GB of memory, with a
maximum of 7 map tasks and 2 reduce tasks.
The map task is really simple: it takes a LongWritable key and a Text
value, generates a Text newKey, and calls output.collect(newKey, value).
It doesn't have any code that could plausibly leak memory.
There's no stack trace for the vast majority of the OutOfMemoryErrors;
usually there's just a single line in the log like this:
2009-02-23 14:27:50,902 INFO org.apache.hadoop.mapred.TaskTracker:
java.lang.OutOfMemoryError: Java heap space
I can't find the stack trace right now, but in rare cases the
OutOfMemoryError originates from some Hadoop config array-copy operation.
There's no special config for the job.