We have very frequent cluster wide pauses that stop all reads and writes for seconds. We are constantly loading data to this cluster of 10 nodes. These pauses can happen as frequently as every minute but sometimes are not seen for 15+ minutes. Basically watching the Region server list with request counts is the only evidence of what is going on. All reads and writes totally stop and if there is ever any activity it is on the node hosting the .META. table with a request count of region count + 1. This problem seems to be worse with a larger region size. We tried a 1GB region size and saw this more than we saw actual activity (and stopped using a larger region size because of it). We went back to the default region size and it was better, but we had too many regions so now we are up to 512M for a region size and we are seeing it more again.
Does anyone know what this is? We have dug into all of the logs to find some sort of pause but are not able to find anything. Is this an wal hlog roll? Is this a region split or compaction? Of course our biggest fear is a GC pause on the master but we do not have java logging turned on with the master to tell. What could possibly stop the entire cluster from working for seconds at a time very frequently? Thanks in advance for any ideas of what could be causing this.