If it were related to maxClientCnxns, you would see sessions being torn down and recreated in HBase on that node, as well as a clear message in the ZK server log that it is denying requests because the number of outstanding connections from that host exceeds the limit.
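
If you want to rule that out quickly from the ZK side, the "cons" and "stat" four-letter words show how many connections each client host currently holds. Something along these lines should do it, assuming nc is available and four-letter words are enabled on your servers (the host and port below are placeholders):

  echo cons | nc zk1.example.com 2181
  echo stat | nc zk1.example.com 2181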

ConnectionLoss is a transient ZooKeeper state; more often than not, I see it manifest as a result of unplanned pauses in HBase itself. Most often these are JVM garbage-collection pauses; other times they come from Linux kernel/OS-level pauses. You can diagnose the former via the standard JVM GC logging mechanisms, and the latter usually via syslog or dmesg.
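
If GC logging isn't already enabled on the HBase side, something along these lines in hbase-env.sh will turn it on (these are the pre-Java-9 flags; the log path is just a placeholder for your deployment):

  export HBASE_OPTS="$HBASE_OPTS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -XX:+PrintGCApplicationStoppedTime -Xloggc:/var/log/hbase/gc.log"

Then look for stop-the-world pauses in that log which line up with the ConnectionLoss timestamps and get anywhere near your ZK session timeout.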

When looking for unexpected pauses, remember that you also need to look at what was happening in ZK. A JVM GC pause in ZK would exhibit the same kind of symptoms in HBase.
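
The same flags work for the ZK server itself; on a stock install I believe they go into SERVER_JVMFLAGS (e.g. via conf/java.env), though your packaging may put them elsewhere:

  export SERVER_JVMFLAGS="$SERVER_JVMFLAGS -verbose:gc -XX:+PrintGCDetails -XX:+PrintGCDateStamps -Xloggc:/var/log/zookeeper/gc.log"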

One final suggestion is to correlate the errors against other workloads (e.g. YARN containers, Spark jobs) that may be running on the same node. It's possible that the node isn't experiencing any explicit problem, but some transient workload happens to run at the same time and slows things down.

Have fun digging!

On 8/31/18 3:19 PM, Srinidhi Muppalla wrote:
Hello all,

Our production application has recently experienced a very high spike in the following exception, along with very large read times against our HBase cluster.

org.apache.hadoop.hbase.shaded.org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/meta-region-server
    at org.apache.hadoop.hbase.shaded.org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
    at org.apache.hadoop.hbase.shaded.org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
    at org.apache.hadoop.hbase.shaded.org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
    at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
    at org.apache.hadoop.hbase.zookeeper.ZKUtil.getData(ZKUtil.java:623)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.getMetaRegionState(MetaTableLocator.java:487)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.getMetaRegionLocation(MetaTableLocator.java:168)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:605)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:585)
    at org.apache.hadoop.hbase.zookeeper.MetaTableLocator.blockUntilAvailable(MetaTableLocator.java:564)
    at org.apache.hadoop.hbase.client.ZooKeeperRegistry.getMetaRegionLocation(ZooKeeperRegistry.java:61)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateMeta(ConnectionManager.java:1211)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1178)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.relocateRegion(ConnectionManager.java:1152)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegionInMeta(ConnectionManager.java:1357)
    at org.apache.hadoop.hbase.client.ConnectionManager$HConnectionImplementation.locateRegion(ConnectionManager.java:1181)
    at org.apache.hadoop.hbase.client.RpcRetryingCallerWithReadReplicas.getRegionLocations(RpcRetryingCallerWithReadReplicas.java:305)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:156)
    at org.apache.hadoop.hbase.client.ScannerCallableWithReplicas.call(ScannerCallableWithReplicas.java:60)
    at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithoutRetries(RpcRetryingCaller.java:200)
    at org.apache.hadoop.hbase.client.ClientScanner.call(ClientScanner.java:326)
    at org.apache.hadoop.hbase.client.ClientScanner.nextScanner(ClientScanner.java:301)
    at org.apache.hadoop.hbase.client.ClientScanner.initializeScannerInConstruction(ClientScanner.java:166)
    at org.apache.hadoop.hbase.client.ClientScanner.<init>(ClientScanner.java:161)
    at org.apache.hadoop.hbase.client.HTable.getScanner(HTable.java:794)
    at ...

This error is not happening consistently: some reads to our table are succeeding, so I have not been able to narrow the issue down to a single configuration or connectivity failure.

Things I’ve tried so far:

1. Using hbase zkcli to connect to our ZooKeeper server from the master instance (roughly the commands shown below). It connects successfully, and when running ‘ls’ the “/hbase/meta-region-server” znode is present.
2. Checking the number of connections to our ZooKeeper instance using the HBase web UI. The number of connections is currently 162, and I double-checked our HBase config: the value for ‘hbase.zookeeper.property.maxClientCnxns’ is 300.
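
For reference, the zkcli check was roughly along these lines (run from the master instance; output omitted):

  hbase zkcli
  ls /hbase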

Any insight into the cause or other steps that I could take to debug this issue 
would be greatly appreciated.

Thank you,
Srinidhi
