[ https://issues.apache.org/jira/browse/HBASE-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Sigoure updated HBASE-2849:
----------------------------------

    Attachment: 0001-HBASE-2849-Have-HBase-clients-recover-from-ZooKeeper.patch

Patch that fixes the issue.

There was already some logic in {{HConnectionManager}}, which I didn't notice earlier, that attempts to deal with ZK failures and reconnect when needed. That code wasn't doing the right thing, though, and didn't work when the HBase client was disconnected from the ZK quorum. So the patch is rather simple and consists of fixing the existing logic in {{HConnectionManager.ClientZKWatcher}}.

I tested this by starting a long-running HBase application, killing the whole ZooKeeper ensemble, and restarting it. The application experiences a hiccup while ZK is unavailable and recovers automatically soon after the ZK quorum is back online. Someone else is more than welcome to write a unit test that simulates this scenario if they feel like it.

> HBase clients cannot recover when their ZooKeeper session becomes invalid
> -------------------------------------------------------------------------
>
>                 Key: HBASE-2849
>                 URL: https://issues.apache.org/jira/browse/HBASE-2849
>             Project: HBase
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 0.89.20100621
>            Reporter: stack
>            Assignee: Benoit Sigoure
>            Priority: Critical
>             Fix For: 0.90.0
>
>         Attachments: 0001-HBASE-2849-Have-HBase-clients-recover-from-ZooKeeper.patch
>
>
> Someone made mention of this loop last week but I don't think I filed an
> issue. Here is another instance, again from a secret hbase admirer:
> "It seems that when Zookeeper dies and restarts, all client applications need
> to be restarted too. I just restarted HBase in non-distributed mode (which
> includes a ZK) and now my application can't reconnect to ZK unless I restart
> it too.
> I'm stuck in this loop:
> {code}
> 2010-07-19 00:13:05,725 INFO org.apache.zookeeper.server.NIOServerCnxn:
> Closed socket connection for client /127.0.0.1:55153 (no session
> established for client)
> 2010-07-19 00:13:07,052 INFO org.apache.zookeeper.server.NIOServerCnxn:
> Accepted socket connection from /127.0.0.1:55154
> 2010-07-19 00:13:07,053 INFO org.apache.zookeeper.server.NIOServerCnxn:
> Refusing session request for client /127.0.0.1:55154 as it has seen zxid
> 0xf5 our last zxid is 0xd7
> client must try another server
> {code}
> "
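
For readers who want a picture of the general idea behind the fix (drop a stale session handle when the connection is lost, then open a fresh one on next use), here is a minimal sketch. It is not the code from the attached patch and does not use the real ZooKeeper or HBase APIs: {{Session}}, {{SessionEvent}}, {{ReconnectingClient}} and the {{/example}} path are all hypothetical names invented for illustration. Treating both a disconnect and an expiry as a reason to discard the handle is likewise an assumption of this sketch.

```java
import java.util.function.Supplier;

/** Hypothetical stand-in for a ZooKeeper session handle; not the real client API. */
interface Session {
    String read(String path);
}

/** Hypothetical event kinds, loosely modeled on connection state changes. */
enum SessionEvent { CONNECTED, DISCONNECTED, EXPIRED }

/**
 * Holds at most one live session and replaces it after a failure event.
 * Illustrative only: this is not the logic from the attached patch.
 */
final class ReconnectingClient {
    private final Supplier<Session> factory; // opens a brand-new session
    private Session session;                 // null means "reconnect on next use"
    private int sessionsOpened;

    ReconnectingClient(Supplier<Session> factory) {
        this.factory = factory;
    }

    /** Watcher-style callback: forget the stale handle when the connection is lost. */
    synchronized void process(SessionEvent event) {
        if (event == SessionEvent.DISCONNECTED || event == SessionEvent.EXPIRED) {
            session = null;
        }
    }

    /** Returns the current session, opening a fresh one if the old one was dropped. */
    synchronized Session get() {
        if (session == null) {
            session = factory.get();
            sessionsOpened++;
        }
        return session;
    }

    /** Number of sessions opened so far, exposed for the demo. */
    synchronized int sessionsOpened() {
        return sessionsOpened;
    }
}

final class ReconnectDemo {
    public static void main(String[] args) {
        // The factory here fabricates a fake session; a real client would connect to the quorum.
        ReconnectingClient client = new ReconnectingClient(() -> path -> "data@" + path);
        client.get().read("/example");
        client.process(SessionEvent.EXPIRED); // simulate the quorum being restarted
        client.get().read("/example");        // a new session is opened transparently
        System.out.println("sessions opened: " + client.sessionsOpened());
    }
}
```

The point of the lazy {{get()}} is that callers never hold on to a dead handle: the application sees a hiccup while the quorum is down, and the next call after it comes back opens a new session instead of retrying the old one forever.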