[jira] [Commented] (HBASE-19144) [RSgroups] Retry assignments in FAILED_OPEN state when servers (re)join the cluster

Andrew Purtell (JIRA) Thu, 02 Nov 2017 11:39:12 -0700

    [ https://issues.apache.org/jira/browse/HBASE-19144?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16236355#comment-16236355 ]


Andrew Purtell commented on HBASE-19144:
----------------------------------------

bq. while (!hasChanged) ?

Good point. I suppose this should be changed everywhere. The Javadoc of 
Object#wait says spurious wakeups are possible. Let me make this change. 

This code also punts on interrupt handling. On interrupt we should fall 
through and check the state of master.isAborted and master.isStopped. If 
'hasChanged' is still false we can just go back to waiting. The threads are 
daemon threads, so they won't block a shutdown. Will change this too.
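
For illustration only, a minimal sketch of the wait loop described above. 
This is not the code from the patch: the class ChangeWaiter, its methods, and 
the 'stopped' flag (standing in for master.isAborted / master.isStopped) are 
hypothetical names. Only 'hasChanged' and the use of Object#wait come from 
the discussion.

```java
/** Hypothetical sketch of a guarded wait; names are illustrative, not HBase code. */
class ChangeWaiter {
  private final Object lock = new Object();
  private boolean hasChanged = false; // guarded by lock
  private boolean stopped = false;    // guarded by lock; stand-in for master.isStopped()/isAborted()

  /** Record a change and wake any waiting thread. */
  void markChanged() {
    synchronized (lock) {
      hasChanged = true;
      lock.notifyAll();
    }
  }

  /** Request shutdown and wake any waiting thread so it can observe the flag. */
  void stop() {
    synchronized (lock) {
      stopped = true;
      lock.notifyAll();
    }
  }

  /**
   * Blocks until a change is signalled or a stop is requested.
   * Returns true if a change was consumed, false if the waiter was stopped.
   */
  boolean awaitChange() {
    synchronized (lock) {
      // A loop rather than 'if': Object#wait may return spuriously, so the
      // condition has to be re-checked after every wakeup.
      while (!hasChanged) {
        if (stopped) {
          return false;
        }
        try {
          lock.wait();
        } catch (InterruptedException e) {
          // Fall through: the loop re-checks 'stopped' and 'hasChanged',
          // and goes back to waiting if neither is set.
        }
      }
      hasChanged = false;
      return true;
    }
  }
}
```

A worker thread would call awaitChange() in its run loop and exit when it 
returns false. Swallowing the interrupt here mirrors the "go back to waiting" 
behaviour described above. Code that needs to propagate interruption would 
handle it differently.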

> [RSgroups] Retry assignments in FAILED_OPEN state when servers (re)join the 
> cluster
> -----------------------------------------------------------------------------------
>
>                 Key: HBASE-19144
>                 URL: https://issues.apache.org/jira/browse/HBASE-19144
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Andrew Purtell
>            Assignee: Andrew Purtell
>            Priority: Major
>             Fix For: 2.0.0, 3.0.0, 1.4.0, 1.5.0
>
>         Attachments: HBASE-19144-branch-1.patch, HBASE-19144.patch
>
>
> After all servers in the RSgroup are down the regions cannot be opened 
> anywhere and transition rapidly into FAILED_OPEN state.
>  
> 2017-10-31 21:06:25,449 INFO [ProcedureExecutor-13] master.RegionStates: 
> Transition {c6c8150c9f4b8df25ba32073f25a5143 state=OFFLINE, ts=1509483985448, 
> server=node-5.cluster,16020,1509482700768} to 
> {c6c8150c9f4b8df25ba32073f25a5143 state=FAILED_OPEN, ts=1509483985449, 
> server=node-5.cluster,16020,1509482700768}
> 2017-10-31 21:06:25,449 WARN [ProcedureExecutor-13] master.RegionStates: 
> Failed to open/close d4e2f173e31ffad6aac140f4bd7b02bc on 
> node-5.cluster,16020,1509482700768, set to FAILED_OPEN
>  
> Any region in FAILED_OPEN state has to be manually reassigned, or the master 
> can be restarted, which also causes assignment to be reattempted for any 
> regions in FAILED_OPEN state. This is not unexpected but is an operational 
> headache. It would be better if the RSGroupInfoManager could automatically 
> kick off reassignment of regions in FAILED_OPEN state when servers rejoin 
> the cluster. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
