Andrew Purtell wrote:
I am constantly needing to restart my cluster now, even running region servers 
with 3GB of heap. The production cluster is running Hadoop 0.18.1 and HBase 
0.18.1.

I will see mapred tasks fail with (copied by hand, please forgive):

java.io.IOException: java.lang.OutOfMemoryError: Java heap space
at java.io.DataInputStream.readFully(DataInputStream.java:175)
at org.apache.hadoop.io.DataOutputBuffer$Buffer.write(DataOutputBuffer.java:64)
at org.apache.hadoop.io.DataOutputBuffer.write(DataOutputBuffer.java:102)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1933)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1833)
at org.apache.hadoop.io.SequenceFile$Reader.next(SequenceFile.java:1879)
at org.apache.hadoop.io.MapFile$Reader.next(MapFile.java:516)
at org.apache.hadoop.hbase.regionserver.StoreFileScanner.getNext(StoreFileScanner.java:312)


Can you see which store file this is happening against? Does it always OOME against the same store file, and in the same place? Do you think these cells are wholesome, i.e. not extremely large? (The thought is that there might be a corrupted record that manifests itself as a very large record, and we OOME trying to read it into memory to shuttle it across to the client.) I can make a mapfile checker for you if you'd like -- just say. A rough sketch of what such a checker might look like is below.
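For illustration, here is a minimal sketch of the kind of checker meant above, using Hadoop's stock MapFile.Reader. The class name and usage are assumptions, not existing HBase code. It walks a store mapfile, printing each key and the serialized size of its value, so the last key printed before a failure (or an obviously outsized length) points at the suspect record. Run it with the HBase jar on the classpath so the store's key/value classes resolve, and note that a truly corrupt record length can still OOME the checker itself, so give it a generous heap.

// Hypothetical MapFileChecker.java -- a sketch, not part of HBase.
// Usage (assumed): java -cp <hadoop+hbase jars> MapFileChecker <path-to-store-mapfile-directory>
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.util.ReflectionUtils;

public class MapFileChecker {
  public static void main(String[] args) throws IOException {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    MapFile.Reader reader = new MapFile.Reader(fs, args[0], conf);
    try {
      WritableComparable key =
        (WritableComparable) ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Writable value =
        (Writable) ReflectionUtils.newInstance(reader.getValueClass(), conf);
      DataOutputBuffer buf = new DataOutputBuffer();
      long count = 0;
      while (reader.next(key, value)) {
        // Re-serialize the value just to measure its size.
        buf.reset();
        value.write(buf);
        System.out.println(count + "\t" + key + "\t" + buf.getLength() + " bytes");
        count++;
      }
      System.out.println("OK: read " + count + " records");
    } finally {
      reader.close();
    }
  }
}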

...

This problem is really killing us. When the OOMEs happen, the cluster does not recover without manual intervention. The regionservers sometimes go down afterwards, or sometimes stay up in a sick state for a while. Regions go offline and remain unavailable, causing indefinite stalls all over the place.

Is this because the OOMEs are bubbling up in a place that doesn't release the reservoir memory and trigger a proper node shutdown? Should we backport HBASE-1020/HBASE-1006?
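For context, the general pattern behind that "reservoir" is roughly the following. This is a sketch of the technique only, not the actual HBase code, and the class and field names are made up: a block of heap is reserved at startup and dropped when an OutOfMemoryError is caught, so there is enough headroom left to log the error and run an orderly shutdown.

// Sketch of the memory-reservoir pattern; names are hypothetical,
// this is not the HBase implementation.
public class ReservoirExample {
  // Reserve a chunk of heap up front (size here is arbitrary).
  private static byte[] reservoir = new byte[5 * 1024 * 1024];

  public static void runGuarded(Runnable work) {
    try {
      work.run();
    } catch (OutOfMemoryError e) {
      // Release the reservation so the JVM has headroom again,
      // then do an orderly shutdown instead of limping along.
      reservoir = null;
      System.err.println("OOME caught, shutting down: " + e);
      // ... trigger clean server shutdown here ...
    }
  }
}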

Even so, my workload is modest: continuous write operations, maybe up to 100/sec, of objects typically < 4K in size but which can be as large as 20MB. Writes go to both a 'urls' table and a 'content' table. The 'content' table gets the raw content and uses RECORD compression.
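For reference, the write path described would look roughly like this with the 0.18-era BatchUpdate client API. This is a sketch from memory: the column names "meta:fetched" and "raw:data" are made up, and exact constructor signatures may differ slightly in that release.

// Sketch of the described write workload; column names are hypothetical.
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.client.HTable;
import org.apache.hadoop.hbase.io.BatchUpdate;

public class WriteSketch {
  public static void main(String[] args) throws Exception {
    HBaseConfiguration conf = new HBaseConfiguration();
    HTable urls = new HTable(conf, "urls");
    HTable content = new HTable(conf, "content");

    String row = "http://example.org/some/page";
    byte[] rawBytes = new byte[4096];  // typically < 4K, occasionally up to 20MB

    // Metadata goes to the 'urls' table.
    BatchUpdate metaUpdate = new BatchUpdate(row);
    metaUpdate.put("meta:fetched", Long.toString(System.currentTimeMillis()).getBytes());
    urls.commit(metaUpdate);

    // Raw content goes to the 'content' table (the RECORD-compressed family).
    BatchUpdate contentUpdate = new BatchUpdate(row);
    contentUpdate.put("raw:data", rawBytes);
    content.commit(contentUpdate);
  }
}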

I have no experience using compression in HStoreFiles. Running compression buffers may introduce a new uncertainty as regards memory management (just guessing -- I have not looked). Have you tried with compression disabled? Or is it that you cannot disable compression once it is enabled?

The 'urls' table gets metadata only. Concurrent with this are two mapred tasks, one running on the 'urls' table and one on the 'content' table. The mapred tasks run every few minutes for a few minutes at a time, with the interval between executions currently at 5 minutes.

Along with jgray's import problems, these might be something other than OOME issues. I spent some time studying the jgray cluster last Wednesday: the whole cluster went into swap and nothing was working -- GCs couldn't complete because of the swapping, so the OOME was a symptom. Are you seeing any instances of HBASE-616 in your logs, Andrew?

I wonder if there is some issue with writes in general, or at least in my case some interaction between the write side of things and the read side (caching, etc.). One thing I notice every so often is that if I stop the write load on the cluster, then a few moments later a number of compactions and sometimes also splits start running, as if they had been deferred.
There could be an issue here. I can look at log files if you put them in a place I can pull.
For a while I was doing funky things with store files, but I have since reinitialized and am running with defaults for everything but the blockcache (I use blocks of 8192).

Do you need the blockcache? The blockcache uses soft references: it will fill until there is memory pressure and only then will it dump items. It might help if you disable it. A small illustration of that behavior is below.
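To illustrate the behavior described, a minimal soft-reference cache looks something like this. This is a generic Java sketch, not the HBase blockcache code: entries stay reachable until the garbage collector is under heap pressure, at which point the JVM clears the SoftReferences and get() starts returning null.

// Generic sketch of a soft-reference cache; not HBase code.
import java.lang.ref.SoftReference;
import java.util.HashMap;
import java.util.Map;

public class SoftCache<K, V> {
  private final Map<K, SoftReference<V>> map = new HashMap<K, SoftReference<V>>();

  public synchronized void put(K key, V value) {
    // The value is only softly reachable; the GC may clear it under pressure.
    map.put(key, new SoftReference<V>(value));
  }

  public synchronized V get(K key) {
    SoftReference<V> ref = map.get(key);
    if (ref == null) {
      return null;          // never cached
    }
    V value = ref.get();
    if (value == null) {
      map.remove(key);      // cleared by the GC; drop the stale entry
      return null;          // caller must re-read the block from disk
    }
    return value;
  }
}

The upshot is that under load such a cache will happily grow until the heap is nearly full, which can make GC pauses and OOMEs more likely on a busy regionserver.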

What version of the JVM are you using?

St.Ack
