[ https://issues.apache.org/jira/browse/HBASE-2849?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benoit Sigoure updated HBASE-2849:
----------------------------------

    Attachment: 0001-HBASE-2849-Have-HBase-clients-recover-from-ZooKeeper.patch

Patch that fixes the issue.  There was already some logic in
{{HConnectionManager}}, which I didn't notice earlier, that attempts to deal
with ZK failures and reconnect when needed.  However, that code wasn't doing
the right thing and didn't work when the HBase client was disconnected from
the ZK quorum.  So the patch is rather simple: it fixes the existing logic in
{{HConnectionManager.ClientZKWatcher}}.
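
For readers unfamiliar with the pattern, here is a minimal, self-contained
sketch of the general idea: a watcher that discards its handle when the
session is lost, so that the next use creates a fresh one.  This is NOT the
code from the attached patch and it does not use the ZooKeeper API; the names
{{ReconnectSketch}}, {{SessionEvent}} and {{ReconnectingHolder}} are
hypothetical, and the choice to reset only on expiry is an assumption made
for illustration.

```java
import java.util.function.Supplier;

public class ReconnectSketch {
    /** Hypothetical stand-in for the connection states a watcher is told about. */
    enum SessionEvent { CONNECTED, DISCONNECTED, EXPIRED }

    /** Hypothetical holder that recreates its handle after the session is lost. */
    static class ReconnectingHolder<H> {
        private final Supplier<H> factory; // creates a new handle on demand
        private H handle;                  // null means "no usable handle"
        private int created;               // number of handles created so far

        ReconnectingHolder(Supplier<H> factory) {
            this.factory = factory;
        }

        /** Returns the current handle, creating one lazily if none exists. */
        synchronized H get() {
            if (handle == null) {
                handle = factory.get();
                created++;
            }
            return handle;
        }

        /** Watcher callback: drop the handle once the session is gone for good. */
        synchronized void process(SessionEvent event) {
            // Assumption for this sketch: only EXPIRED forces a new handle.
            if (event == SessionEvent.EXPIRED) {
                handle = null;
            }
        }

        synchronized int timesCreated() {
            return created;
        }
    }

    public static void main(String[] args) {
        ReconnectingHolder<Object> holder = new ReconnectingHolder<>(Object::new);
        Object first = holder.get();

        holder.process(SessionEvent.DISCONNECTED);
        System.out.println("same handle after disconnect: " + (first == holder.get()));

        holder.process(SessionEvent.EXPIRED);
        System.out.println("same handle after expiry: " + (first == holder.get()));
        System.out.println("handles created: " + holder.timesCreated());
    }
}
```

Creating the replacement lazily in {{get()}} keeps the callback cheap; whether
a real client should also reset on other events depends on the failure it
needs to survive.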

I tested this by starting a long-running HBase application, killing the whole 
ZooKeeper ensemble and restarting it.  The application experiences a hiccup 
while ZK is unavailable and is able to recover automatically soon after the ZK 
quorum is back online.  Someone else is more than welcome to write a unit test 
that simulates this scenario if they feel like it.

> HBase clients cannot recover when their ZooKeeper session becomes invalid
> -------------------------------------------------------------------------
>
>                 Key: HBASE-2849
>                 URL: https://issues.apache.org/jira/browse/HBASE-2849
>             Project: HBase
>          Issue Type: Bug
>          Components: client
>    Affects Versions: 0.89.20100621
>            Reporter: stack
>            Assignee: Benoit Sigoure
>            Priority: Critical
>             Fix For: 0.90.0
>
>         Attachments: 
> 0001-HBASE-2849-Have-HBase-clients-recover-from-ZooKeeper.patch
>
>
> Someone made mention of this loop last week but I don't think I filed an 
> issue.  Here is another instance, again from a secret hbase admirer:
> "It seems that when Zookeeper dies and restarts, all client applications need 
> to be restarted too. I just restarted HBase in non-distributed mode (which 
> includes a ZK) and now my application can't reconnect to ZK unless I restart 
> it too.  I'm stuck in this loop:
> {code}
> 2010-07-19 00:13:05,725 INFO org.apache.zookeeper.server.NIOServerCnxn:
>   Closed socket connection for client /127.0.0.1:55153 (no session established for client)
> 2010-07-19 00:13:07,052 INFO org.apache.zookeeper.server.NIOServerCnxn:
>   Accepted socket connection from /127.0.0.1:55154
> 2010-07-19 00:13:07,053 INFO org.apache.zookeeper.server.NIOServerCnxn:
>   Refusing session request for client /127.0.0.1:55154 as it has seen zxid 0xf5 our last zxid is 0xd7
>   client must try another server
> {code}
> "

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
