[ https://issues.apache.org/jira/browse/HBASE-5153?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13194445#comment-13194445 ]
Jieshan Bean commented on HBASE-5153:
-------------------------------------

"The endless loop happens when ZK is actually down."

If ZK is actually down, the code below will throw an exception:

    this.zooKeeper = getZooKeeperWatcher();

The exception is then caught by the code below:

{noformat}
try {
  LOG.info("This client just lost it's session with ZooKeeper, trying" +
      " to reconnect.");
  resetZooKeeperTrackersWithRetries();
  LOG.info("Reconnected successfully. This disconnect could have been" +
      " caused by a network partition or a long-running GC pause," +
      " either way it's recommended that you verify your environment.");
  return;
} catch (ZooKeeperConnectionException e) {
  LOG.error("Could not reconnect to ZooKeeper after session" +
      " expiration, aborting");
  t = e;
}
if (t != null) LOG.fatal(msg, t);
else LOG.fatal(msg);
HConnectionManager.deleteStaleConnection(this);
{noformat}

So it should not be an endless loop. Does that make sense?

> Add retry logic in HConnectionImplementation#resetZooKeeperTrackers
> -------------------------------------------------------------------
>
>                 Key: HBASE-5153
>                 URL: https://issues.apache.org/jira/browse/HBASE-5153
>             Project: HBase
>          Issue Type: Bug
>          Components: client
>   Affects Versions: 0.90.4
>            Reporter: Jieshan Bean
>            Assignee: Jieshan Bean
>             Fix For: 0.94.0, 0.90.6, 0.92.1
>
>         Attachments: 5153-92.txt, 5153-trunk-v2.txt, 5153-trunk.txt,
> 5153-trunk.txt, HBASE-5153-V2.patch, HBASE-5153-V3.patch,
> HBASE-5153-V4-90.patch, HBASE-5153-V5-90.patch,
> HBASE-5153-V6-90-minorchange.patch, HBASE-5153-V6-90.txt,
> HBASE-5153-trunk-v2.patch, HBASE-5153-trunk.patch, HBASE-5153.patch,
> TestResults-hbase5153.out
>
>
> HBASE-4893 is related to this issue. From that issue we know that if multiple
> threads share the same connection, once the connection is aborted in one
> thread, the other threads will get a
> "HConnectionManager$HConnectionImplementation@18fb1f7 closed" exception.
> It solves the problem of "stale connection can't be removed".
> But the original HTable instance can't continue to be used. The connection in
> HTable should be recreated.
> Actually, there are two approaches to solve this:
> 1. In user code, once an IOE is caught, close the connection and re-create the
> HTable instance. We can use this as a workaround.
> 2. On the HBase client side, catch this exception and re-create the connection.
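
The point made in the comment above is that a reconnect with retries ends either in success or in an exception the caller can act on, so it cannot loop forever. A minimal sketch of that bounded-retry pattern follows. It is not the HBase implementation: the class `BoundedRetry`, the interface `IoAction`, and the parameters `maxAttempts` and `pauseMillis` are all hypothetical names introduced for illustration, and the code uses only the Java standard library.

```java
import java.io.IOException;

/** Hypothetical helper (not part of HBase): bounded retry around a reconnect step. */
final class BoundedRetry {

    /** Hypothetical callback type: one attempt that may fail with an IOException. */
    @FunctionalInterface
    interface IoAction<T> {
        T run() throws IOException;
    }

    /**
     * Runs the action up to maxAttempts times, sleeping pauseMillis between
     * failed attempts. When every attempt fails, the last IOException is
     * rethrown so the caller can abort instead of retrying indefinitely.
     */
    static <T> T callWithRetries(IoAction<T> action, int maxAttempts, long pauseMillis)
            throws IOException, InterruptedException {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        IOException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return action.run();
            } catch (IOException e) {
                last = e;
                // Pause only if another attempt will follow.
                if (attempt < maxAttempts) {
                    Thread.sleep(pauseMillis);
                }
            }
        }
        throw last;
    }
}
```

In user code following approach 1 from the description, the action passed to `callWithRetries` would be whatever step closes the stale connection and re-creates the table handle; that step is left to the caller here because it depends on the client version in use.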