[ https://issues.apache.org/jira/browse/HBASE-20634?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
stack reassigned HBASE-20634:
-----------------------------

    Assignee: stack

> Reopen region while server crash can cause the procedure to be stuck
> ---------------------------------------------------------------------
>
>                 Key: HBASE-20634
>                 URL: https://issues.apache.org/jira/browse/HBASE-20634
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Duo Zhang
>            Assignee: stack
>            Priority: Critical
>             Fix For: 3.0.0, 2.1.0, 2.0.1
>
>         Attachments: HBASE-20634-UT.patch
>
>
> Found this while implementing HBASE-20424, where we transit the peer sync replication state while there is a server crash.
> The problem is that in ServerCrashAssign we do not hold the region lock, so after we call handleRIT to clear the existing assign/unassign procedures related to this rs, and before we schedule the new assign procedures, it is possible that an unassign procedure is scheduled for a region on the crashed rs. This procedure will not receive a ServerCrashException; instead, in addToRemoteDispatcher it will find that it cannot dispatch the remote call, and a FailedRemoteDispatchException will be raised. But we do not treat this exception the same as ServerCrashException; instead we try to expire the rs. Since the rs has already been marked as expired, this is almost a no-op, and the procedure is stuck forever.
> A possible fix is to treat FailedRemoteDispatchException the same as ServerCrashException, since it is only created in addToRemoteDispatcher, and the only reason a remote call cannot be dispatched is that the rs is already dead. The nodeMap is a ConcurrentMap, so I think we could use it as a guard.
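A minimal standalone Java sketch of the proposed handling, for illustration only: every class and method name below is a hypothetical stand-in rather than the real HBase API (only ServerCrashException, FailedRemoteDispatchException and addToRemoteDispatcher are names taken from the description above). It simply shows both exceptions being routed into the same recovery path in a procedure's remote-call failure handler.

{code:java}
import java.io.IOException;

// Hypothetical stand-ins for the real HBase exception classes.
class ServerCrashException extends IOException {}
class FailedRemoteDispatchException extends IOException {}

class UnassignProcedureSketch {

  /** Called when the remote call to the regionserver fails. */
  boolean remoteCallFailed(IOException e) {
    if (e instanceof ServerCrashException || e instanceof FailedRemoteDispatchException) {
      // FailedRemoteDispatchException can only come from addToRemoteDispatcher,
      // and the only way the dispatch can fail is that the target rs is already
      // dead, so handle it exactly like ServerCrashException: hand the region
      // back to server-crash recovery instead of retrying the dispatch.
      handOffToServerCrashRecovery();
      return true;
    }
    // Old behaviour for other failures: expire the rs and retry. For an
    // already-expired rs this is almost a no-op, which is how the procedure
    // ends up stuck forever in the scenario described above.
    expireServerAndRetry();
    return false;
  }

  void handOffToServerCrashRecovery() { /* wake the procedure; let crash recovery reassign */ }
  void expireServerAndRetry() { /* old path: almost a no-op for a dead rs */ }
}
{code}

The actual fix may look different; the point of the sketch is only that both exceptions signal the same condition (the target rs is dead), so they should take the same recovery path.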