[ https://issues.apache.org/jira/browse/HBASE-19929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16351722#comment-16351722 ]
Duo Zhang commented on HBASE-19929:
-----------------------------------

AsyncDFSClient is not the problem. The problem is AsyncFSWAL. By design it never fails a request; it always tries to open a new writer to write the pending requests. When rolling fails, the log roller aborts the RS, and during the abort we close the WAL, which notifies the pending syncs.

The problem here is that we enter the shutdown processing before setting abortRequested to true, so we first try to flush all the regions and wait for them to be closed. Only then do we find that the WAL is broken and that there is an abort request from the log roller. But that does not help: the WAL is closed only after the wait for the regions to be closed, so it is something like a deadlock here...

So I think a possible solution is to close the WAL directly when the log roller wants to abort an RS. Let me prepare a patch.

Thanks.

> Call RS.stop on a session expired RS may hang
> ---------------------------------------------
>
>                 Key: HBASE-19929
>                 URL: https://issues.apache.org/jira/browse/HBASE-19929
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Duo Zhang
>            Priority: Major
>
> See the discussion in HBASE-19927. The problem is that, for a normal stop, we
> will try to close all the regions and wait until they are all closed. But if
> the RS's session has already expired, the master will start the failover work,
> which will move the WAL directory, and then we will be stuck writing the flush
> marker.
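
To make the ordering issue in the comment concrete, here is a minimal, self-contained sketch. It is not HBase code: ToyWal, await, stopWaitingForRegionsFirst and abortClosingWalFirst are hypothetical names, and the toy WAL only imitates the described behaviour (a sync stays pending until the WAL is closed). It compares waiting for the region close before closing the WAL with closing the WAL first.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class WalAbortSketch {

    /** Toy stand-in for a WAL with a broken writer (hypothetical, not AsyncFSWAL). */
    static final class ToyWal {
        private final List<CompletableFuture<Void>> pendingSyncs = new ArrayList<>();
        private boolean closed;

        /** Never fails a request by itself: the sync stays pending until close(). */
        synchronized CompletableFuture<Void> sync() {
            CompletableFuture<Void> f = new CompletableFuture<>();
            if (closed) {
                f.completeExceptionally(new IllegalStateException("WAL closed"));
            } else {
                pendingSyncs.add(f);
            }
            return f;
        }

        /** Closing the WAL is the only thing that notifies the pending syncs. */
        synchronized void close() {
            closed = true;
            for (CompletableFuture<Void> f : pendingSyncs) {
                f.completeExceptionally(new IllegalStateException("WAL closed"));
            }
            pendingSyncs.clear();
        }
    }

    /** True if the future finished (normally or with an error) within the timeout. */
    static boolean await(CompletableFuture<Void> f, long timeoutMillis) {
        try {
            f.get(timeoutMillis, TimeUnit.MILLISECONDS);
            return true;
        } catch (ExecutionException e) {
            return true; // the sync was failed, so the waiter is released
        } catch (TimeoutException e) {
            return false; // still stuck on the pending sync
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Normal-stop ordering: wait for the region close, then close the WAL. */
    static boolean stopWaitingForRegionsFirst(ToyWal wal) {
        CompletableFuture<Void> flushMarker = wal.sync(); // region close writes a flush marker
        boolean regionClosed = await(flushMarker, 100);   // a real RS would wait without a timeout
        wal.close();
        return regionClosed;
    }

    /** Proposed ordering: close the WAL directly, then wait for the region close. */
    static boolean abortClosingWalFirst(ToyWal wal) {
        CompletableFuture<Void> flushMarker = wal.sync();
        wal.close();
        return await(flushMarker, 100);
    }

    public static void main(String[] args) {
        System.out.println("regions first: " + stopWaitingForRegionsFirst(new ToyWal()));
        System.out.println("WAL first: " + abortClosingWalFirst(new ToyWal()));
    }
}
```

In this toy model the first ordering only returns because of the 100 ms timeout, which is an assumption added for the demonstration. Without the timeout, the wait would never finish, which matches the hang described in the issue.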