We are seeing very frequent cluster-wide pauses that stop all reads and
writes for seconds at a time. We are constantly loading data into this
10-node cluster. The pauses can occur as often as once a minute, but
sometimes none are seen for 15+ minutes. Watching the region server list
with its request counts is basically the only evidence of what is going on.
All reads and writes stop completely, and if there is any activity at all it
is on the node hosting the .META. table, with a request count of region
count + 1. The problem seems to be worse with a larger region size. We tried
a 1 GB region size and saw pauses more often than we saw actual activity, so
we stopped using it. We went back to the default region size and it was
better, but we then had too many regions, so we are now up to a 512 MB
region size and are seeing the pauses more often again.
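
To make the symptom concrete, here is a minimal sketch of how such stalls could be flagged from request counts sampled off the region server list. It does not call any HBase API: the sample format, the function name find_stalls, and the server names are all hypothetical, and the idea of ignoring the .META. host is only an assumption based on the behaviour described above.

```python
from typing import Dict, FrozenSet, List, Optional, Tuple

# Hypothetical sample format: (unix time in seconds, {server name: request count}).
Sample = Tuple[float, Dict[str, int]]


def find_stalls(
    samples: List[Sample],
    min_seconds: float = 2.0,
    ignore: FrozenSet[str] = frozenset(),
) -> List[Tuple[float, float]]:
    """Return (start, end) spans in which every non-ignored server saw 0 requests.

    `ignore` can hold the server hosting .META., since it may still show
    a small request count while everything else is stalled.
    """
    stalls: List[Tuple[float, float]] = []
    start: Optional[float] = None
    last: Optional[float] = None

    def close_span() -> None:
        # Record the current idle span only if it lasted long enough.
        if start is not None and last is not None and last - start >= min_seconds:
            stalls.append((start, last))

    for ts, counts in samples:
        idle = all(n == 0 for name, n in counts.items() if name not in ignore)
        if idle:
            if start is None:
                start = ts
            last = ts
        else:
            close_span()
            start = None
            last = None
    close_span()  # handle a stall that runs to the end of the samples
    return stalls


if __name__ == "__main__":
    demo = [
        (0.0, {"rs1": 120, "rs2": 90}),
        (1.0, {"rs1": 0, "rs2": 0}),
        (2.0, {"rs1": 0, "rs2": 0}),
        (4.0, {"rs1": 0, "rs2": 0}),
        (5.0, {"rs1": 80, "rs2": 0}),
    ]
    print(find_stalls(demo))
```

Logging the resulting spans with timestamps would at least allow the pauses to be lined up against the region server and master logs.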

Does anyone know what this is? We have dug through all of the logs looking
for some sign of a pause but have not found anything. Is this a WAL (HLog)
roll? Is it a region split or a compaction? Of course our biggest fear is a
GC pause on the master, but we do not have JVM GC logging turned on for the
master, so we cannot tell. What could possibly stop the entire cluster from
working for seconds at a time, this frequently?

Thanks in advance for any ideas of what could be causing this.
