[ 
https://issues.apache.org/jira/browse/HBASE-20976?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16576210#comment-16576210
 ] 

Allan Yang commented on HBASE-20976:
------------------------------------

{quote}
I think we'd better do this a bit more cleanly, without adding too many 
checks...

I think we need to make sure that the deadServers check works and prevents 
scheduling redundant SCPs. We can do the SCP check at restart because we have 
not started the PE yet, so it is safe; during execution, though, this is not a 
good idea as there is no fencing...
{quote}
Yes, there is no fence here... But the worst case is that a race condition 
still lets us schedule redundant SCPs, which I think is still better than the 
current behavior. Making deadServers work is indeed a better idea, but I can't 
think of a better way to do it for now.  IIRC, the dead servers are removed so 
that the master Web UI won't show a dead server forever...
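
To make the trade-off concrete, here is a minimal sketch of the two kinds of 
check. It is not HBase code: the class ScpSchedulerSketch, its methods, and the 
server name string are all hypothetical stand-ins. The plain check removes the 
common duplicates but leaves a window between the check and the submit; the 
atomic variant only illustrates what closing that window could look like.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the master's bookkeeping; not an HBase class.
class ScpSchedulerSketch {
    // Servers that already have a crash procedure in flight (assumed state).
    private final Set<String> serversWithScp = ConcurrentHashMap.newKeySet();
    private final AtomicInteger submitted = new AtomicInteger();

    // Check-then-act without fencing: two threads can both pass the
    // contains() check before either one records the server.
    boolean scheduleWithPlainCheck(String serverName) {
        if (serversWithScp.contains(serverName)) {
            return false; // an SCP is already known for this server
        }
        // Race window: another caller may submit for the same server here.
        serversWithScp.add(serverName);
        submitted.incrementAndGet();
        return true;
    }

    // Atomic variant: Set.add returns true only for the caller that
    // actually inserted the element, so at most one submit happens.
    boolean scheduleWithAtomicCheck(String serverName) {
        if (!serversWithScp.add(serverName)) {
            return false;
        }
        submitted.incrementAndGet();
        return true;
    }

    int submittedCount() {
        return submitted.get();
    }

    public static void main(String[] args) {
        ScpSchedulerSketch scheduler = new ScpSchedulerSketch();
        String rs = "rs1.example.org,16020,1"; // made-up server name
        System.out.println(scheduler.scheduleWithAtomicCheck(rs));
        System.out.println(scheduler.scheduleWithAtomicCheck(rs));
        System.out.println(scheduler.submittedCount());
    }
}
```

In a single-threaded replay both variants behave the same; they differ only 
when two callers race on the same server name.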

> SCP can be scheduled multiple times for the same RS
> ---------------------------------------------------
>
>                 Key: HBASE-20976
>                 URL: https://issues.apache.org/jira/browse/HBASE-20976
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.1.0, 2.0.1
>            Reporter: Allan Yang
>            Assignee: Allan Yang
>            Priority: Major
>             Fix For: 2.0.2
>
>         Attachments: HBASE-20976.branch-2.0.001.patch, 
> HBASE-20976.branch-2.0.002.patch
>
>
> An SCP can be scheduled multiple times for the same RS:
> 1. An RS crashed and an SCP was submitted for it
> 2. Before this SCP finished, the Master crashed
> 3. The new master scans the meta table and finds that some region is still 
> open on a dead server
> 4. The new master submits an SCP for the dead server again
> The two SCPs for the same RS can even execute concurrently without 
> HBASE-20846…
> A test case that reproduces this issue and a fix are provided in the patch.
> Another case in which an SCP might be scheduled multiple times for the same 
> RS (with HBASE-20708):
> 1. An RS crashed and an SCP was submitted for it
> 2. A new RS started on the same host, and the old RS's ServerName was removed 
> from DeadServer.deadServers
> 3. After the SCP passed the Handle_RIT state, an UnassignProcedure needed to 
> send a close region operation to the crashed RS
> 4. The UnassignProcedure's dispatch failed because of 
> 'NoServerDispatchException'
> 5. Expiring the RS then begins, but the RS is found to be neither online nor 
> in the deadServer list, so an SCP is submitted for the same RS again
>  
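
The second scenario in the description above can be replayed with a toy model. 
This is only a sketch under stated assumptions: DeadServerReplaySketch, its 
fields, and its methods are hypothetical and do not mirror the real 
ServerManager or DeadServer code, and server names are assumed to have the 
form host,port,startcode.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Toy model of the master's view of region servers; not HBase code.
class DeadServerReplaySketch {
    final Set<String> onlineServers = new HashSet<>();
    final Set<String> deadServers = new HashSet<>();
    final List<String> submittedScps = new ArrayList<>();

    // Assumed expire logic: skip only if the server is already listed as dead.
    void expireServer(String serverName) {
        if (deadServers.contains(serverName)) {
            return; // crash handling is assumed to be in progress already
        }
        onlineServers.remove(serverName);
        deadServers.add(serverName);
        submittedScps.add(serverName);
    }

    // Assumed startup logic: a new RS on a host clears dead entries for it.
    void regionServerStartup(String newServerName, String host) {
        deadServers.removeIf(name -> name.startsWith(host + ","));
        onlineServers.add(newServerName);
    }

    // Replays steps 1 to 5 and returns the servers an SCP was submitted for.
    static List<String> replay() {
        DeadServerReplaySketch master = new DeadServerReplaySketch();
        String oldRs = "host1,16020,100"; // made-up names
        String newRs = "host1,16020,200";
        master.onlineServers.add(oldRs);

        master.expireServer(oldRs);                // step 1: crash, first SCP
        master.regionServerStartup(newRs, "host1"); // step 2: dead entry removed
        // Steps 3-4: the close-region dispatch to oldRs fails, which leads
        // to another expire call for the same server.
        master.expireServer(oldRs);                // step 5: second SCP
        return master.submittedScps;
    }

    public static void main(String[] args) {
        System.out.println(replay());
    }
}
```

In this model the second expire call passes the guard only because step 2 
cleared the dead entry, which is the gap the description points at.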



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
