[ https://issues.apache.org/jira/browse/HBASE-21440?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack updated HBASE-21440:
--------------------------
    Resolution: Fixed
  Hadoop Flags: Reviewed
        Status: Resolved  (was: Patch Available)

Pushed on branch-2.0 and branch-2.1. Resolving. Thanks [~an...@apache.org]

> Assign procedure on the crashed server is not properly interrupted
> ------------------------------------------------------------------
>
>                 Key: HBASE-21440
>                 URL: https://issues.apache.org/jira/browse/HBASE-21440
>             Project: HBase
>          Issue Type: Bug
>    Affects Versions: 2.0.2
>            Reporter: Ankit Singhal
>            Assignee: Ankit Singhal
>            Priority: Major
>             Fix For: 2.0.3, 2.1.2
>
>         Attachments: HBASE-21440.branch-2.0.001.patch,
> HBASE-21440.branch-2.0.002.patch, HBASE-21440.branch-2.0.003.patch,
> HBASE-21440.branch-2.0.004.patch, HBASE-21440.branch-2.0.005.patch,
> HBASE-21440.branch-2.1.005.patch
>
>
> When a server crashes, its SCP checks whether there is already a procedure
> assigning the region on the crashed server. If it finds one, the SCP just
> interrupts the already running AssignProcedure by calling remoteCallFailed,
> which internally just changes the region node state to OFFLINE and sends the
> procedure back to the transition queue state for assignment with a new plan.
> But, due to a race between the call to remoteCallFailed and the current
> state of the already running assign procedure (REGION_TRANSITION_FINISH,
> where the region is already opened), it is possible that the assign
> procedure goes ahead and updates the regionStateNode to OPEN on a crashed
> server.
> As the SCP has already skipped this region for assignment, relying on the
> existing assign procedure to do the right thing, the region is left in an
> inaccessible state.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
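
The ordering problem described in the report can be replayed deterministically in a small sketch. This is only an illustration under stated assumptions, not HBase code and not the attached patch: RegionNode, deadServers, finishAssign and the guard flag are hypothetical names invented here, and remoteCallFailed is modelled only as far as the report describes it (it moves the region node to OFFLINE).

```java
import java.util.HashSet;
import java.util.Set;

public class AssignRaceSketch {
    enum State { OFFLINE, OPENING, OPEN }

    // Hypothetical stand-in for a region state node; not the HBase class.
    static class RegionNode {
        State state = State.OPENING;
        final String server;
        RegionNode(String server) { this.server = server; }
    }

    // Hypothetical registry of crashed servers.
    static final Set<String> deadServers = new HashSet<>();

    // Per the report: the interrupt just moves the node to OFFLINE.
    static void remoteCallFailed(RegionNode node) {
        node.state = State.OFFLINE;
    }

    // Finish step of the assign procedure. With guard == false it marks the
    // region OPEN unconditionally, which is the behaviour the report describes.
    // The guard is an illustrative assumption, not the fix that was committed.
    static boolean finishAssign(RegionNode node, boolean guard) {
        if (guard && deadServers.contains(node.server)) {
            return false; // leave the node OFFLINE so it can be reassigned
        }
        node.state = State.OPEN;
        return true;
    }

    // Replays the problematic interleaving: the SCP interrupts first, then
    // the assign procedure, already in its finish step, runs to completion.
    static State simulate(boolean guard) {
        deadServers.clear();
        RegionNode node = new RegionNode("rs1");
        deadServers.add("rs1");     // server crashes
        remoteCallFailed(node);     // SCP interrupts the assign procedure
        finishAssign(node, guard);  // finish step runs afterwards
        return node.state;
    }

    public static void main(String[] args) {
        System.out.println("unguarded: " + simulate(false));
        System.out.println("guarded:   " + simulate(true));
    }
}
```

Without the guard the replay ends with the region marked OPEN on the crashed server, and with it the region stays OFFLINE. The point is only that the finish step needs some check against the crashed server, since the SCP has already skipped the region.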