liwei created HBASE-7259:
----------------------------
Summary: Deadlock in HBaseClient when KeeperException occurred
Key: HBASE-7259
URL: https://issues.apache.org/jira/browse/HBASE-7259
Project: HBase
Issue Type: Bug
Components: Zookeeper
Affects Versions: 0.94.2, 0.94.1, 0.94.0
Reporter: liwei
Priority: Critical
After the HBase client had been running for some time, all get operations
became very slow.
From the client logs I could see the following:
1. Unable to get data of znode /hbase/root-region-server
java.lang.InterruptedException
at java.lang.Object.wait(Native Method)
at java.lang.Object.wait(Object.java:485)
at org.apache.zookeeper.ClientCnxn.submitRequest(ClientCnxn.java:1253)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1129)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:264)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataInternal(ZKUtil.java:522)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataAndWatch(ZKUtil.java:498)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.getData(ZooKeeperNodeTracker.java:156)
at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.getRootRegionLocation(RootRegionTracker.java:62)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:821)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:933)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:832)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:234)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:174)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:150)
at org.apache.hadoop.hbase.client.MetaScanner.access$000(MetaScanner.java:48)
at org.apache.hadoop.hbase.client.MetaScanner$1.connect(MetaScanner.java:126)
at org.apache.hadoop.hbase.client.MetaScanner$1.connect(MetaScanner.java:123)
at org.apache.hadoop.hbase.client.HConnectionManager.execute(HConnectionManager.java:359)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:123)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:99)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:894)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:948)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725)
at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82)
at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
2. jstack traces found one Java-level deadlock:
=============================
"catalina-exec-800":
waiting to lock monitor 0x000000005f1f6530 (object 0x0000000731902200, a java.lang.Object),
which is held by "catalina-exec-710"
"catalina-exec-710":
waiting to lock monitor 0x00002aaab9a05bd0 (object 0x00000007321f8708, a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation),
which is held by "catalina-exec-29-EventThread"
"catalina-exec-29-EventThread":
waiting to lock monitor 0x000000005f9f0af0 (object 0x0000000732a9c7e0, a org.apache.hadoop.hbase.zookeeper.RootRegionTracker),
which is held by "catalina-exec-710"
Java stack information for the threads listed above:
===================================================
"catalina-exec-800":
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:943)
- waiting to lock <0x0000000731902200> (a java.lang.Object)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725)
at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82)
at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
"catalina-exec-710":
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:599)
- waiting to lock <0x00000007321f8708> (a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1660)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.getData(ZooKeeperNodeTracker.java:158)
- locked <0x0000000732a9c7e0> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
at org.apache.hadoop.hbase.zookeeper.RootRegionTracker.getRootRegionLocation(RootRegionTracker.java:62)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:821)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:933)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:832)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:801)
at org.apache.hadoop.hbase.client.HTable.finishSetup(HTable.java:234)
at org.apache.hadoop.hbase.client.HTable.<init>(HTable.java:174)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:150)
at org.apache.hadoop.hbase.client.MetaScanner.access$000(MetaScanner.java:48)
at org.apache.hadoop.hbase.client.MetaScanner$1.connect(MetaScanner.java:126)
at org.apache.hadoop.hbase.client.MetaScanner$1.connect(MetaScanner.java:123)
at org.apache.hadoop.hbase.client.HConnectionManager.execute(HConnectionManager.java:359)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:123)
at org.apache.hadoop.hbase.client.MetaScanner.metaScan(MetaScanner.java:99)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.prefetchRegionCache(HConnectionManager.java:894)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegionInMeta(HConnectionManager.java:948)
- locked <0x0000000731902200> (a java.lang.Object)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.locateRegion(HConnectionManager.java:836)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.relocateRegion(HConnectionManager.java:807)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.getRegionLocation(HConnectionManager.java:725)
at org.apache.hadoop.hbase.client.ServerCallable.connect(ServerCallable.java:82)
at org.apache.hadoop.hbase.client.ServerCallable.withRetries(ServerCallable.java:162)
at org.apache.hadoop.hbase.client.HTable.get(HTable.java:685)
at org.apache.hadoop.hbase.client.HTablePool$PooledHTable.get(HTablePool.java:366)
"catalina-exec-29-EventThread":
at org.apache.hadoop.hbase.zookeeper.ZooKeeperNodeTracker.stop(ZooKeeperNodeTracker.java:98)
- waiting to lock <0x0000000732a9c7e0> (a org.apache.hadoop.hbase.zookeeper.RootRegionTracker)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.resetZooKeeperTrackers(HConnectionManager.java:604)
- locked <0x00000007321f8708> (a org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation)
at org.apache.hadoop.hbase.client.HConnectionManager$HConnectionImplementation.abort(HConnectionManager.java:1660)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.connectionEvent(ZooKeeperWatcher.java:374)
at org.apache.hadoop.hbase.zookeeper.ZooKeeperWatcher.process(ZooKeeperWatcher.java:271)
at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:521)
at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:497)
Found 1 deadlock.
From the source code, the cause appears to be the following. A KeeperException
occurs inside ZooKeeperNodeTracker.getData, which then calls abort and tries to
run resetZooKeeperTrackers. At the same time, ClientCnxn.EventThread is also
running resetZooKeeperTrackers. Because getData already holds the lock on the
ZooKeeperNodeTracker, the two threads acquire the tracker lock and the
connection lock in opposite orders, and a deadlock results.
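The lock-order inversion described above can be reproduced in miniature. The
following is only a sketch, not HBase code: the class name, the two
ReentrantLock fields and the helper methods are hypothetical stand-ins for the
tracker monitor and the connection monitor seen in the jstack output. It uses
timed tryLock instead of synchronized so that the program terminates instead
of hanging.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

final class LockOrderInversion {
    // Hypothetical stand-ins for the RootRegionTracker and HConnectionImplementation monitors.
    static final ReentrantLock trackerLock = new ReentrantLock();
    static final ReentrantLock connectionLock = new ReentrantLock();

    /** Takes `first`, then tries `second` with a timeout; returns true if `second` was acquired. */
    static boolean tryInOrder(ReentrantLock first, ReentrantLock second,
                              CountDownLatch bothHoldFirst, CountDownLatch bothTried) {
        first.lock();
        try {
            bothHoldFirst.countDown();
            bothHoldFirst.await();   // both threads now hold their first lock
            boolean got = second.tryLock(100, TimeUnit.MILLISECONDS);
            if (got) {
                second.unlock();
            }
            bothTried.countDown();
            bothTried.await();       // keep `first` until the other thread has finished trying
            return got;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        } finally {
            first.unlock();
        }
    }

    /** Runs the two threads; element i is true if thread i obtained its second lock. */
    static boolean[] simulate() {
        CountDownLatch bothHoldFirst = new CountDownLatch(2);
        CountDownLatch bothTried = new CountDownLatch(2);
        boolean[] acquired = new boolean[2];
        // Like getData: tracker lock first, then the connection lock (via abort).
        Thread getData = new Thread(() ->
            acquired[0] = tryInOrder(trackerLock, connectionLock, bothHoldFirst, bothTried));
        // Like the EventThread: connection lock first, then the tracker lock (via stop).
        Thread event = new Thread(() ->
            acquired[1] = tryInOrder(connectionLock, trackerLock, bothHoldFirst, bothTried));
        getData.start();
        event.start();
        try {
            getData.join();
            event.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return acquired;
    }

    public static void main(String[] args) {
        boolean[] acquired = simulate();
        System.out.println("getData-like thread got connection lock: " + acquired[0]);
        System.out.println("event-like thread got tracker lock: " + acquired[1]);
    }
}
```

Neither thread obtains its second lock while the other still holds it. With
plain synchronized blocks, as in the traces above, there is no timeout, so the
same ordering blocks both threads forever.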
To avoid the problem, we can narrow the scope of the lock held in getData.
See the patch for 0.94.0.
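One way to narrow the lock scope in getData is sketched below. This is an
illustration under stated assumptions and not the contents of the patch: the
NarrowedLockTracker class and its Abortable and DataSource interfaces are
hypothetical simplifications. The point is only that the read and the abort
callback run without the tracker monitor held, so abort is free to take other
locks.

```java
final class NarrowedLockTracker {
    /** Hypothetical callback, standing in for the connection's abort(). */
    interface Abortable {
        void abort(String why, Throwable cause);
    }

    /** Hypothetical reader, standing in for the ZooKeeper read. */
    interface DataSource {
        byte[] read() throws Exception;
    }

    private final DataSource source;
    private final Abortable abortable;
    private byte[] data; // guarded by this

    NarrowedLockTracker(DataSource source, Abortable abortable) {
        this.source = source;
        this.abortable = abortable;
    }

    byte[] getData() {
        try {
            byte[] fresh = source.read();   // no monitor held during the read
            synchronized (this) {
                data = fresh;               // lock only around the cached field
            }
        } catch (Exception e) {
            // Still no monitor held here, so abort() cannot invert the lock order.
            abortable.abort("Unable to get data", e);
        }
        synchronized (this) {
            return data;
        }
    }

    /** Returns whether the abort callback ran while the tracker monitor was held. */
    static boolean abortSeesTrackerLockHeld() {
        boolean[] held = {true};
        NarrowedLockTracker[] ref = new NarrowedLockTracker[1];
        ref[0] = new NarrowedLockTracker(
            () -> { throw new IllegalStateException("simulated read failure"); },
            (why, cause) -> held[0] = Thread.holdsLock(ref[0]));
        ref[0].getData();
        return held[0];
    }

    public static void main(String[] args) {
        System.out.println("tracker lock held during abort: " + abortSeesTrackerLockHeld());
    }
}
```

The trade-off in this sketch is that two concurrent getData calls may finish
their reads in either order, so the cached value is whichever write lands
last. Whether that is acceptable depends on how callers use the tracker.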
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira