[ 
https://issues.apache.org/jira/browse/HBASE-21325?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16656459#comment-16656459
 ] 

Guanghao Zhang commented on HBASE-21325:
----------------------------------------

I wrote a UT for this case and found that the regionserver does not hang in 
waitOnAllRegionsToClose, because we break out of the loop even when there are 
still online regions.
{code:java}
        // No regions in RIT, we could stop waiting now.
        if (this.regionsInTransitionInRS.isEmpty()) {
          if (!isOnlineRegionsEmpty()) {
            LOG.info("We were exiting though online regions are not empty," +
                " because some regions failed closing");
          }
          break;
        }
{code}

2018-10-19 16:26:28,449 INFO  [RS:1;hao-OptiPlex-7050:37602] regionserver.HRegionServer(1426): We were exiting though online regions are not empty, because some regions failed closing

But the regionserver still hangs while shutting down the WAL when it stops.
{code:java}
"RS:1;hao-OptiPlex-7050:37602" daemon prio=5 tid=380 in Object.wait()
java.lang.Thread.State: WAITING (on object monitor)
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
        at java.util.concurrent.locks.ReentrantLock$FairSync.lock(ReentrantLock.java:224)
        at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
        at org.apache.hadoop.hbase.regionserver.wal.AbstractFSWAL.shutdown(AbstractFSWAL.java:821)
        at org.apache.hadoop.hbase.wal.SyncReplicationWALProvider.shutdown(SyncReplicationWALProvider.java:225)
        at org.apache.hadoop.hbase.wal.WALFactory.shutdown(WALFactory.java:246)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.shutdownWAL(HRegionServer.java:1459)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1115)
        at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.runRegionServer(MiniHBaseCluster.java:184)
        at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.access$000(MiniHBaseCluster.java:130)
        at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer$1.run(MiniHBaseCluster.java:168)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:360)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1742)
        at org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs(User.java:341)
        at org.apache.hadoop.hbase.MiniHBaseCluster$MiniHBaseClusterRegionServer.run(MiniHBaseCluster.java:165)
        at java.lang.Thread.run(Thread.java:748)
{code}



> Add a max wait time for waitOnAllRegionsToClose
> -----------------------------------------------
>
>                 Key: HBASE-21325
>                 URL: https://issues.apache.org/jira/browse/HBASE-21325
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: Duo Zhang
>            Assignee: Guanghao Zhang
>            Priority: Major
>
> When testing sync replication, I found that if I transit the remote cluster 
> to DA while the local cluster is still in A, the region server will hang 
> on shutdown. As the fsOk flag only tests the local cluster (which is 
> reasonable), we will enter waitOnAllRegionsToClose, and since the WAL is 
> broken (the remote WAL directory is gone), closing the regions will never 
> succeed. This leads to an infinite wait inside waitOnAllRegionsToClose.
> So I think we should have an upper bound for the wait time in the 
> waitOnAllRegionsToClose method.
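
For illustration only, the upper bound proposed in the description could be sketched roughly as below. This is not the HBase implementation: the class BoundedWait, the method waitUntil, and the parameters maxWaitMs and pollMs are hypothetical names, and the "all regions closed" check is stood in for by a plain BooleanSupplier.

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not part of HBase: polls a condition with a deadline.
public class BoundedWait {

  /**
   * Polls {@code done} until it returns true or {@code maxWaitMs} has elapsed.
   *
   * @return true if the condition became true, false on timeout or interrupt
   */
  public static boolean waitUntil(BooleanSupplier done, long maxWaitMs, long pollMs) {
    // Use the monotonic clock so wall-clock adjustments cannot extend the wait.
    final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(maxWaitMs);
    while (!done.getAsBoolean()) {
      long remainingNanos = deadline - System.nanoTime();
      if (remainingNanos <= 0) {
        return false; // upper bound reached, give up instead of waiting forever
      }
      long remainingMs = Math.max(1L, TimeUnit.NANOSECONDS.toMillis(remainingNanos));
      try {
        // Never sleep past the deadline.
        Thread.sleep(Math.min(pollMs, remainingMs));
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // preserve the interrupt for the caller
        return false;
      }
    }
    return true;
  }

  public static void main(String[] args) {
    // Stand-in for a region close that can never succeed (e.g. a broken WAL).
    boolean closed = waitUntil(() -> false, 200, 50);
    System.out.println("closed=" + closed);
  }
}
```

Returning a boolean rather than throwing lets the caller log that it is exiting with regions still online and continue with the rest of the shutdown sequence.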



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
