[ 
https://issues.apache.org/jira/browse/HBASE-18058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16013479#comment-16013479
 ] 

Allan Yang edited comment on HBASE-18058 at 5/17/17 4:12 AM:
-------------------------------------------------------------

{quote}
Normally in this case the RegionServer will crash due to ZooKeeper session 
timeout, similar to when the RS hits a full GC, right? Mind sharing the case in 
your scenario? How do you keep the RS alive while ZooKeeper is down for a 
while? Thanks. Allan Yang
{quote}
Yes, it is a very interesting case, and it really happened. If the disk on the 
server hosting ZooKeeper is full, the ZooKeeper quorum won't actually go down, 
but it will reject all write requests. So on the HBase side, new ZK write 
requests will hit exceptions and be retried. But the connection remains, so the 
session won't time out. Once the disk-full situation has been resolved, the 
ZooKeeper quorum can work normally again. But because the retry sleep time has 
grown very high by then, some modules of the RegionServer (in our case, the 
balancer) will still sleep for a long time before resuming work.


was (Author: allan163):
{quote}
Normally in this case the RegionServer will crash due to ZooKeeper session 
timeout, similar to when the RS hits a full GC, right? Mind sharing the case in 
your scenario? How do you keep the RS alive while ZooKeeper is down for a 
while? Thanks. Allan Yang
{quote}
Yes, it is a very interesting case, and it really happened. If the disk on the 
server hosting ZooKeeper is full, the ZooKeeper quorum won't actually go down, 
but it will reject all connections and requests. So on the HBase side, it will 
suffer from connection loss and retry. Once the disk-full situation has been 
resolved, the ZooKeeper quorum can work normally again and no session will time 
out. So the HBase server won't crash due to session timeout, but because the 
retry sleep time has grown very high by then, some modules of the RegionServer 
(in our case, the balancer) will still sleep for a long time before resuming 
work.

> Zookeeper retry sleep time should have a up limit
> -------------------------------------------------
>
>                 Key: HBASE-18058
>                 URL: https://issues.apache.org/jira/browse/HBASE-18058
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.0, 1.4.0
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>         Attachments: HBASE-18058-branch-1.patch, 
> HBASE-18058-branch-1.v2.patch, HBASE-18058.patch
>
>
> Currently, in {{RecoverableZooKeeper}}, the retry backoff sleep time grows 
> exponentially, but it has no upper limit. This directly leads to a very long 
> recovery time after ZooKeeper goes down for a while and then comes back.
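
To illustrate the problem described above, here is a minimal sketch of 
exponential backoff with and without an upper limit. It is not the code from 
{{RecoverableZooKeeper}} or from the attached patches; the class name, method 
names, the 1000 ms base interval and the 60000 ms cap are all hypothetical 
values chosen for illustration.

```java
// Hypothetical sketch; not taken from HBase or from the HBASE-18058 patches.
public class CappedBackoffSketch {

    /**
     * Uncapped exponential backoff: baseMs * 2^attempt.
     * Assumes baseMs > 0 and attempt >= 0; saturates at Long.MAX_VALUE
     * instead of overflowing.
     */
    public static long uncappedSleepMs(long baseMs, int attempt) {
        if (attempt >= 62 || baseMs > (Long.MAX_VALUE >> attempt)) {
            return Long.MAX_VALUE;
        }
        return baseMs << attempt;
    }

    /** Same backoff, but never longer than maxMs (the assumed upper limit). */
    public static long cappedSleepMs(long baseMs, int attempt, long maxMs) {
        return Math.min(uncappedSleepMs(baseMs, attempt), maxMs);
    }

    public static void main(String[] args) {
        long baseMs = 1000L;   // assumed base retry interval
        long maxMs = 60000L;   // assumed upper limit on a single sleep
        for (int attempt = 0; attempt <= 12; attempt++) {
            System.out.printf("attempt=%d uncapped=%d ms capped=%d ms%n",
                attempt,
                uncappedSleepMs(baseMs, attempt),
                cappedSleepMs(baseMs, attempt, maxMs));
        }
    }
}
```

With these assumed values, the uncapped sleep after 10 failed attempts is 
already 1,024,000 ms (about 17 minutes), so a caller that entered that sleep 
just before ZooKeeper recovered would stay idle for that long. With the cap, 
no single sleep exceeds 60 seconds.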



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
