[ https://issues.apache.org/jira/browse/HBASE-19929?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16351722#comment-16351722 ]

Duo Zhang commented on HBASE-19929:
-----------------------------------

AsyncDFSClient is not the problem; the problem is AsyncFSWAL. By design, it
never fails a request and always tries to open a new writer to write the
pending requests. When rolling fails, the log roller aborts the RS, and during
the abort we close the WAL, which notifies the pending syncs.
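To make this behaviour concrete, here is a toy Java model. It assumes a
simplified WAL in which each sync is a CompletableFuture. The class
RetryingWalSketch and its methods (sync, markWriterBroken, rollSucceeded,
close) are hypothetical and are not the AsyncFSWAL API. The sketch only shows
two things: a broken writer leaves syncs pending rather than failed, and
closing the WAL is what finally notifies them.

```java
import java.util.ArrayDeque;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;

// Hypothetical, simplified model of the WAL behaviour described above.
// It is NOT the HBase implementation; all names are illustrative only.
public class RetryingWalSketch {
  private final Queue<CompletableFuture<Long>> pendingSyncs = new ArrayDeque<>();
  private boolean writerBroken;
  private boolean closed;
  private long txid;

  // Queues a sync request; the returned future completes once the edit is synced.
  public synchronized CompletableFuture<Long> sync() {
    CompletableFuture<Long> f = new CompletableFuture<>();
    if (closed) {
      f.completeExceptionally(new IllegalStateException("WAL closed"));
      return f;
    }
    pendingSyncs.add(f);
    if (!writerBroken) {
      flushPending();
    }
    // With a broken writer the request is NOT failed: it stays pending until
    // a new writer is opened (rollSucceeded) or the WAL is closed.
    return f;
  }

  // Completes every pending sync successfully, assigning increasing txids.
  private void flushPending() {
    CompletableFuture<Long> f;
    while ((f = pendingSyncs.poll()) != null) {
      f.complete(++txid);
    }
  }

  public synchronized void markWriterBroken() {
    writerBroken = true;
  }

  // A new writer was opened successfully, so the pending syncs are retried.
  public synchronized void rollSucceeded() {
    writerBroken = false;
    flushPending();
  }

  // Abort path: closing the WAL is what finally notifies the pending syncs.
  public synchronized void close() {
    closed = true;
    CompletableFuture<Long> f;
    while ((f = pendingSyncs.poll()) != null) {
      f.completeExceptionally(new IllegalStateException("WAL closed"));
    }
  }

  public static void main(String[] args) {
    RetryingWalSketch wal = new RetryingWalSketch();
    wal.markWriterBroken();
    CompletableFuture<Long> pending = wal.sync();
    System.out.println("done before close: " + pending.isDone());
    wal.close();
    System.out.println("failed after close: " + pending.isCompletedExceptionally());
  }
}
```

Running main prints "done before close: false" followed by
"failed after close: true". A caller waiting on such a sync therefore stays
blocked until someone closes the WAL.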

The problem here is that we enter the shutdown processing before abortRequested
is set to true, so we first try to flush all the regions and wait for them to
be closed. We then find that the WAL is broken and that the log roller has
requested an abort, but this does not help: the WAL is closed only after we
finish waiting for the regions to close, so we end up in something like a
deadlock.
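The ordering problem can be reproduced in a toy model. The model assumes that
a region close must wait for a flush-marker sync, and that only closing the
WAL can complete that sync. Everything here (ShutdownOrderSketch, closeRegion,
and the timeout used to detect the hang) is hypothetical and is not HBase code.
The timeout exists only so the sketch terminates. The report above describes
the real wait as a hang.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical model of the shutdown ordering described above (not HBase code).
public class ShutdownOrderSketch {
  static final class Wal {
    // Stands in for a sync that can only be completed by closing the WAL.
    final CompletableFuture<Void> pendingSync = new CompletableFuture<>();

    void close() {
      pendingSync.completeExceptionally(new IllegalStateException("WAL closed"));
    }
  }

  // A region close must first write a flush marker, i.e. wait for the WAL sync.
  // Returns true once the wait ends, false if it is still blocked at the timeout.
  static boolean closeRegion(Wal wal, long timeoutMs) {
    try {
      wal.pendingSync.get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      return false; // still stuck: nothing will complete the sync in time
    } catch (Exception e) {
      return true; // the sync failed because the WAL was closed, so the wait ends
    }
  }

  // Problematic order: wait for the regions first, close the WAL afterwards.
  static boolean shutdownRegionsThenWal(long timeoutMs) {
    Wal wal = new Wal();
    boolean regionCloseUnblocked = closeRegion(wal, timeoutMs); // returns only via the timeout
    wal.close(); // too late to unblock the flush above
    return regionCloseUnblocked;
  }

  public static void main(String[] args) {
    System.out.println("region close unblocked: " + shutdownRegionsThenWal(200));
  }
}
```

Here the WAL close sits behind the very wait that it would unblock. That
circular dependency is the deadlock-like situation described above.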

So I think a possible solution is to close the WAL directly when the log roller
wants to abort an RS. Let me prepare a patch.
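The proposed fix can be sketched in the same kind of toy model. The model
assumes that the log roller runs on its own thread and closes the WAL as soon
as it decides to abort, while the shutdown thread is still waiting on a region.
The names AbortClosesWalSketch, requestAbortFromLogRoller and
shutdownWithRollerAbort are hypothetical. The sketch illustrates only the
ordering idea, not the actual patch.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch of the proposed ordering (not the actual HBase patch):
// the log roller closes the WAL directly when it aborts, instead of leaving
// the close until after all regions have been closed.
public class AbortClosesWalSketch {
  static final class Wal {
    // Stands in for a sync that can only be completed by closing the WAL.
    final CompletableFuture<Void> pendingSync = new CompletableFuture<>();

    void close() {
      pendingSync.completeExceptionally(new IllegalStateException("WAL closed"));
    }
  }

  // Hypothetical log roller abort path: close the WAL right away.
  static void requestAbortFromLogRoller(Wal wal) {
    wal.close();
  }

  // Region close waits for the flush-marker sync; true once it is no longer blocked.
  static boolean closeRegion(Wal wal, long timeoutMs) {
    try {
      wal.pendingSync.get(timeoutMs, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      return false;
    } catch (Exception e) {
      return true; // the sync failed because the WAL was closed, so the wait ends
    }
  }

  // The shutdown thread waits on a region while the log roller aborts concurrently.
  static boolean shutdownWithRollerAbort(long timeoutMs) {
    Wal wal = new Wal();
    Thread roller = new Thread(() -> requestAbortFromLogRoller(wal));
    roller.start();
    boolean unblocked = closeRegion(wal, timeoutMs);
    try {
      roller.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
    return unblocked;
  }

  public static void main(String[] args) {
    System.out.println("region close unblocked: " + shutdownWithRollerAbort(5_000));
  }
}
```

Closing the WAL on the abort path breaks the cycle. The pending flush-marker
sync is failed right away, so the shutdown thread stops waiting on the
regions.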

Thanks.

> Call RS.stop on a session expired RS may hang
> ---------------------------------------------
>
>                 Key: HBASE-19929
>                 URL: https://issues.apache.org/jira/browse/HBASE-19929
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Duo Zhang
>            Priority: Major
>
> See the discussion in HBASE-19927. The problem is that, for a normal stop, we
> try to close all the regions and wait until they are all closed. But if the
> RS's session has already expired, the master will start the failover work,
> which moves the WAL directory, and then we get stuck writing the flush
> marker.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
