If you're doing a lot of gzip compression/decompression, you *might* be
hitting this Sun JVM bug, now more than six years old:
"Instantiating Inflater/Deflater causes OutOfMemoryError; finalizers not
called promptly enough"
http://bugs.sun.com/bugdatabase/view_bug.do?bug_id=4797189
A workaround is listed in the issue: make sure you call close() or end()
on each Deflater when you're done with it; something similar likely
applies to Inflater.
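A minimal sketch of that workaround (the class and method names here are my own, not from the bug report): calling end() in a finally block releases the native zlib memory immediately instead of waiting for the finalizer to run.

```java
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

public class ZlibEndDemo {
    static byte[] compress(byte[] input) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(input);
            deflater.finish();
            byte[] buf = new byte[input.length + 64]; // plenty for small inputs
            int n = deflater.deflate(buf);
            byte[] out = new byte[n];
            System.arraycopy(buf, 0, out, 0, n);
            return out;
        } finally {
            deflater.end(); // release native memory now, not at finalization
        }
    }

    static byte[] decompress(byte[] compressed, int originalLength) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            byte[] out = new byte[originalLength];
            inflater.inflate(out);
            return out;
        } catch (DataFormatException e) {
            throw new RuntimeException(e);
        } finally {
            inflater.end(); // same workaround for the Inflater side
        }
    }

    public static void main(String[] args) {
        byte[] data = "hello hello hello hello".getBytes();
        byte[] roundTrip = decompress(compress(data), data.length);
        System.out.println(new String(roundTrip).equals(new String(data)));
    }
}
```

If the Deflater/Inflater is buried inside a library (e.g. a GZIPInputStream), closing the stream promptly has the same effect, since close() ends the underlying native object.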
(This is one of those fun JVM situations where having *more* heap space
can make OOMEs *more* likely: lower heap pressure means fewer GC runs,
which leaves more un-GC'd or un-finalized objects around, each of which
is holding a bit of native memory.)
- Gordon @ IA
bzheng wrote:
I have about 24k gz files (about 550GB total) on HDFS and a really simple
Java program to convert them into sequence files. If the job's
setInputPaths takes a Path[] of all 24k files, it gets an OutOfMemoryError
at about 35% map completion. If I make the job process 2k files at a time
and run 12 jobs consecutively, it goes through all the files fine. The
cluster I'm using has about 67 nodes; each node has 16GB of memory, with a
maximum of 7 map tasks and 2 reduce tasks.
The map task is really simple: it takes a LongWritable key and a Text
value, generates a Text newKey, and calls output.collect(newKey, value).
It doesn't have any code that could plausibly leak memory.
There's no stack trace for the vast majority of the OutOfMemoryErrors;
usually there's just a single line in the log like this:
2009-02-23 14:27:50,902 INFO org.apache.hadoop.mapred.TaskTracker:
java.lang.OutOfMemoryError: Java heap space
I can't find the stack trace right now, but in rare cases the
OutOfMemoryError originates from some Hadoop config array-copy operation.
There's no special config for the job.