[ 
https://issues.apache.org/jira/browse/HBASE-21744?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16751719#comment-16751719
 ] 

Sergey Shelukhin commented on HBASE-21744:
------------------------------------------

Updated the patch to base the refresh on the timeout and heartbeat configs. It 
looks like none of this code is covered by unit tests. RefreshRunnable would be 
relatively easy to test in isolation with some refactoring, so I may add a test 
later.
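
For illustration only, here is a rough sketch of the "refresh if it hasn't 
happened for a while" idea from the issue description, written so that it can be 
tested in isolation. This is not the code from the patch: the class name 
StaleRefreshGuard, the maxIdleMs parameter, and the injected clock and 
refreshAction are all hypothetical stand-ins, and the real RefreshRunnable may 
be structured quite differently.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.function.LongSupplier;

/**
 * Hypothetical sketch (not the HBASE-21744 patch): forces a server-list
 * rescan when no refresh has completed for at least maxIdleMs.
 */
final class StaleRefreshGuard implements Runnable {
  private final long maxIdleMs;          // assumed config value, e.g. 60_000 for 1 minute
  private final LongSupplier clock;      // injected so a test can control time
  private final Runnable refreshAction;  // stands in for the real server-list rescan
  private final AtomicLong lastRefreshMs;

  StaleRefreshGuard(long maxIdleMs, LongSupplier clock, Runnable refreshAction) {
    this.maxIdleMs = maxIdleMs;
    this.clock = clock;
    this.refreshAction = refreshAction;
    this.lastRefreshMs = new AtomicLong(clock.getAsLong());
  }

  /** Call whenever a refresh completes normally, e.g. after a ZK notification. */
  void markRefreshed() {
    lastRefreshMs.set(clock.getAsLong());
  }

  /** Meant to be invoked periodically; rescans only if the list looks stale. */
  @Override
  public void run() {
    long now = clock.getAsLong();
    if (now - lastRefreshMs.get() >= maxIdleMs) {
      refreshAction.run();
      lastRefreshMs.set(now);
    }
  }
}
```

Something like this could be driven by a ScheduledExecutorService with a period 
shorter than maxIdleMs. Injecting the clock is what makes it straightforward to 
unit test without sleeping.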

> timeout for server list refresh calls 
> --------------------------------------
>
>                 Key: HBASE-21744
>                 URL: https://issues.apache.org/jira/browse/HBASE-21744
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 3.0.0
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>            Priority: Major
>         Attachments: HBASE-21744.01.patch, HBASE-21744.patch
>
>
> Not sure why yet, but when the cluster is in an overall bad state we are 
> seeing cases where, after an RS dies and deletes its znode, the notification 
> appears to be lost, so the master doesn't detect the failure. ZK itself 
> appears to be healthy and doesn't report anything unusual.
> After some other change is made to the server list, the master rescans the 
> list and picks up the stale change. It might make sense to add a config that 
> would trigger the refresh if it hasn't happened for a while (e.g. 1 minute).



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
