I am constantly needing to restart my cluster now, even when running region
servers with 3GB of heap. The production cluster is running Hadoop 0.18.1 and
HBase 0.18.1.
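For reference, the 3GB is set the usual way via HBASE_HEAPSIZE in
conf/hbase-env.sh, which takes a value in megabytes, so ours looks roughly
like:

export HBASE_HEAPSIZE=3000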
I will see mapred tasks fail with (copied by hand, please forgive):
java.io.IOException: java.lang.OutOfMemoryError: Java heap space
at java.io.DataInputStream.readFully(DataInputStream.java:175)
at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:64)
at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:102)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1933)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1833)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:516)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.getNext(StoreFileScanner.java:312)
...
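To illustrate why a single fat cell hurts here: per the trace, the read path
copies each record into an in-memory DataOutputBuffer before the scanner ever
sees it, so one of my 20MB values needs a contiguous buffer at least that
large at read time. A minimal standalone sketch of my own (not HBase code),
just to show the allocation pattern:

import java.io.ByteArrayInputStream;
import java.io.DataInput;
import java.io.DataInputStream;

import org.apache.hadoop.io.DataOutputBuffer;

public class RecordBufferDemo {
  public static void main(String[] args) throws Exception {
    int recordLength = 20 * 1024 * 1024;  // one 20MB value, my worst case
    DataInput in = new DataInputStream(
        new ByteArrayInputStream(new byte[recordLength]));
    // Same call as in the trace: grows the backing byte[] until the
    // whole record fits, before anything downstream can consume it.
    DataOutputBuffer buf = new DataOutputBuffer();
    buf.write(in, recordLength);
    System.out.println("buffered " + buf.getLength() + " bytes");
  }
}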
This problem is really killing us. When the OOMEs happen, the cluster does not
recover without manual intervention. The regionservers sometimes go down
afterward; other times they stay up but in a sick state for a while. Regions go
offline and remain unavailable, causing indefinite stalls all over the place.
Even so, my workload is modest: continuous writes, maybe up to 100/sec, of
objects typically < 4K in size, though they can be as large as 20MB. Writes go
to both a 'urls' table and a 'content' table. The 'content' table gets the raw
content and uses RECORD compression; the 'urls' table gets metadata only.
Concurrent with this are two mapred jobs, one running over the 'urls' table
and one over the 'content' table. Each runs for a few minutes at a time, with
the interval between executions currently at 5 minutes.
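For concreteness, each write does roughly the following (paraphrased from
memory of the 0.18 client API; the column names are illustrative, not my exact
schema):

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;

public class CrawlWriter {
  private final HTable urls;
  private final HTable content;

  public CrawlWriter() throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    urls = new HTable(conf, "urls");
    content = new HTable(conf, "content");
  }

  // called for each fetched object, up to ~100 times/sec
  public void store(String rowKey, byte[] metadata, byte[] rawContent)
      throws Exception {
    BatchUpdate meta = new BatchUpdate(rowKey);
    meta.put("info:meta", metadata);       // metadata only
    urls.commit(meta);

    BatchUpdate raw = new BatchUpdate(rowKey);
    raw.put("raw:data", rawContent);       // typically < 4K, up to ~20MB
    content.commit(raw);
  }
}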
Along with jgray's import problems, I wonder if there is some issue with writes
in general, or at least in my case some interaction between the write side of
things and the read side (caching, etc.). One thing I notice every so often is
that if I stop the write load on the cluster, then a few moments later a number
of compactions, and sometimes also splits, start running as if they had been
deferred.
For a while I was doing funky things with store files, but I have since
reinitialized and am running with defaults for everything except the block
cache (I use blocks of 8192).
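For what it's worth, the block cache bit is in my hbase-site.xml. If memory
serves, the property is hbase.hstore.blockCache.blockSize (please correct me
if I've misremembered the name):

<property>
  <name>hbase.hstore.blockCache.blockSize</name>
  <value>8192</value>
</property>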
Any thoughts as to what I can do to help the situation?
- Andy