[ 
https://issues.apache.org/jira/browse/HBASE-28113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Duo Zhang resolved HBASE-28113.
-------------------------------
    Fix Version/s: 2.6.0
                   2.4.18
                   3.0.0-beta-1
                   2.5.7
     Hadoop Flags: Reviewed
       Resolution: Fixed

Pushed to all active branches.

Thanks [~luoen] for contributing!

> Modify the way of acquiring the RegionStateNode lock in 
> checkOnlineRegionsReport to tryLock
> -------------------------------------------------------------------------------------------
>
>                 Key: HBASE-28113
>                 URL: https://issues.apache.org/jira/browse/HBASE-28113
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 3.0.0-beta-1
>            Reporter: Haiping lv
>            Assignee: Haiping lv
>            Priority: Major
>             Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7
>
>         Attachments: master.stack
>
>
> HBase cluster description: *1 master and 5 region servers*
> During an ITBLL run, ChaosMonkey's RestartRandomRsAction triggered this 
> issue. The steps of the RestartRandomRsAction operation were as follows:
>  # stop node-3, node-2, and node-4.
>  # then stop node-5, which hosts the meta region.
>  # start node-3.
>  # then stop node-1.
>  # start node-2, node-4, node-5, and node-1.
> *Fault description:*
> 1. The RegionServer nodes node-2, node-4, node-5, and node-1 are unable to 
> come online.
> In the RegionServer logs, the reportForDuty operation consistently times 
> out:
> {code:java}
> 2023-09-21T08:05:30,251 INFO  [regionserver/core-1-2:16020] 
> regionserver.HRegionServer: reportForDuty to 
> master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> 2023-09-21T08:05:43,581 INFO  [regionserver/core-1-2:16020] 
> regionserver.HRegionServer: reportForDuty to 
> master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> 2023-09-21T08:05:59,591 INFO  [regionserver/core-1-2:16020] 
> regionserver.HRegionServer: reportForDuty to 
> master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> 2023-09-21T08:06:21,601 INFO  [regionserver/core-1-2:16020] 
> regionserver.HRegionServer: reportForDuty to 
> master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> 2023-09-21T08:06:55,611 INFO  [regionserver/core-1-2:16020] 
> regionserver.HRegionServer: reportForDuty to 
> master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> 2023-09-21T08:07:53,620 INFO  [regionserver/core-1-2:16020] 
> regionserver.HRegionServer: reportForDuty to 
> master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> 2023-09-21T08:09:39,631 INFO  [regionserver/core-1-2:16020] 
> regionserver.HRegionServer: reportForDuty to 
> master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> 2023-09-21T08:13:01,642 INFO  [regionserver/core-1-2:16020] 
> regionserver.HRegionServer: reportForDuty to 
> master=master-1-1,16000,1695254395517 with port=16020, 
> startcode=1695254725874 {code}
> 2. Master threads are blocked.
>  * Both RpcServer.priority.RWQ.Fifo.write.handler threads are blocked on 
> RegionStateNode.lock (a sketch of the tryLock approach follows this stack 
> trace).
> {code:java}
> "RpcServer.priority.RWQ.Fifo.write.handler=1,queue=0,port=16000" #67 daemon 
> prio=5 os_prio=0 tid=0x00007f6ae3caf800 nid=0xea405 waiting on condition 
> [0x00007f6aa1dcd000]
>    java.lang.Thread.State: WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for  <0x00000004e3c8e6f0> (a 
> java.util.concurrent.locks.ReentrantLock$NonfairSync)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>     at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>     at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>     at 
> java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
>     at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
>     at 
> org.apache.hadoop.hbase.master.assignment.RegionStateNode.lock(RegionStateNode.java:323)
>     at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1401)
>     at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportOnlineRegions(AssignmentManager.java:1363)
>     at 
> org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:639)
>     at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:17395)
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:437)
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>     at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102)
>     at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82) {code}
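> The handlers park here because a procedure worker already holds the 
> RegionStateNode lock for the reported region (see the stacks below) while 
> waiting on a meta write that cannot complete. The change named in the issue 
> title is to acquire the lock with tryLock so regionServerReport can skip a 
> contended region instead of tying up the handler. A minimal sketch of that 
> pattern, with a hypothetical RegionStateNode modeled as a plain 
> ReentrantLock holder; this is not the committed patch and the real 
> signatures may differ:
> {code:java}
> // Hypothetical sketch only, not the committed patch: RegionStateNode is
> // modeled as a bare ReentrantLock holder to keep the example self-contained.
> import java.util.List;
> import java.util.concurrent.locks.ReentrantLock;
> 
> class RegionStateNodeSketch {
>   private final ReentrantLock lock = new ReentrantLock();
> 
>   // Non-blocking acquire: returns false instead of parking the caller.
>   boolean tryLock() {
>     return lock.tryLock();
>   }
> 
>   void unlock() {
>     lock.unlock();
>   }
> }
> 
> class AssignmentManagerSketch {
>   // Before: node.lock() could park the RPC handler behind a PEWorker that
>   // holds the lock while blocked on a meta update.
>   // After: skip the contended region and let a later report re-check it.
>   void checkOnlineRegionsReport(List<RegionStateNodeSketch> reportedRegions) {
>     for (RegionStateNodeSketch node : reportedRegions) {
>       if (!node.tryLock()) {
>         continue; // lock held by an in-flight procedure; do not block
>       }
>       try {
>         // ... verify the reported state against the master's view ...
>       } finally {
>         node.unlock();
>       }
>     }
>   }
> }
> {code}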
>  * 20 PEWorker threads are blocked on RegionStateStore.updateRegionLocation.
> {code:java}
> "PEWorker-1" #133 daemon prio=5 os_prio=0 tid=0x00007f6acdcf9800 nid=0xea5bc 
> waiting on condition [0x00007f6a9d799000]
>    java.lang.Thread.State: WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for  <0x00000004e4cc8e58> (a 
> java.util.concurrent.CompletableFuture$Signaller)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
>     at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>     at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
>     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>     at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:182)
>     at 
> org.apache.hadoop.hbase.client.TableOverAsyncTable.put(TableOverAsyncTable.java:213)
>     at 
> org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:259)
>     at 
> org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:224)
>     at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.regionClosedAbnormally(AssignmentManager.java:2076)
>     at 
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:305)
>     at 
> org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:57)
>     at 
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:921)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1962)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread$$Lambda$610/726348606.call(Unknown
>  Source)
>     at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1989)
>  {code}
>  * All four KeepAlivePEWorker threads are blocked.
> KeepAlivePEWorker-17, 18, and 19 are blocked on 
> RegionStateStore.updateRegionLocation.
> {code:java}
> "KeepAlivePEWorker-17" #381 daemon prio=5 os_prio=0 tid=0x000056260b75d000 
> nid=0xeffb0 waiting on condition [0x00007f6a94339000]
>    java.lang.Thread.State: WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for  <0x00000004ebf83440> (a 
> java.util.concurrent.CompletableFuture$Signaller)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at 
> java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
>     at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>     at 
> java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
>     at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>     at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:182)
>     at 
> org.apache.hadoop.hbase.client.TableOverAsyncTable.put(TableOverAsyncTable.java:213)
>     at 
> org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:259)
>     at 
> org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:224)
>     at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.transitStateAndUpdate(AssignmentManager.java:1982)
>     at 
> org.apache.hadoop.hbase.master.assignment.AssignmentManager.regionOpening(AssignmentManager.java:1997)
>     at 
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.openRegion(TransitRegionStateProcedure.java:279)
>     at 
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.executeFromState(TransitRegionStateProcedure.java:434)
>     at 
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.executeFromState(TransitRegionStateProcedure.java:111)
>     at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188)
>     at 
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.execute(TransitRegionStateProcedure.java:398)
>     at 
> org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.execute(TransitRegionStateProcedure.java:111)
>     at 
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:921)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1962)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread$$Lambda$610/726348606.call(Unknown
>  Source)
>     at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1989)
>  {code}
>  * KeepAlivePEWorker-20 is blocked on RegionStateNode.lock.
> {code:java}
> "KeepAlivePEWorker-20" #388 daemon prio=5 os_prio=0 tid=0x000056260b847800 
> nid=0xf02da waiting on condition [0x00007f6a92e25000]
>    java.lang.Thread.State: WAITING (parking)
>     at sun.misc.Unsafe.park(Native Method)
>     - parking to wait for  <0x00000004e3c8d990> (a 
> java.util.concurrent.locks.ReentrantLock$NonfairSync)
>     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>     at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>     at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>     at 
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>     at 
> java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
>     at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
>     at 
> org.apache.hadoop.hbase.master.assignment.RegionStateNode.lock(RegionStateNode.java:323)
>     at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.assignRegions(ServerCrashProcedure.java:551)
>     at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:243)
>     at 
> org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:68)
>     at 
> org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188)
>     at 
> org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:921)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1962)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread$$Lambda$610/726348606.call(Unknown
>  Source)
>     at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
>     at 
> org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1989)
>  {code}
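> Taken together, the stacks are consistent with a wait cycle: the procedure 
> workers hold RegionStateNode locks while blocked on hbase:meta writes, both 
> priority write handlers are parked trying to take those same locks, and 
> with the handlers tied up the restarted region servers' reportForDuty calls 
> time out, so meta never comes back online and the writes never complete. 
> A toy reproduction of the core pattern using plain JDK primitives as 
> stand-ins for RegionStateNode.lock and the meta put (not HBase code):
> {code:java}
> import java.util.concurrent.CompletableFuture;
> import java.util.concurrent.locks.ReentrantLock;
> 
> public class ReportWaitCycleDemo {
>   public static void main(String[] args) throws InterruptedException {
>     final ReentrantLock regionLock = new ReentrantLock();            // stand-in for RegionStateNode.lock
>     final CompletableFuture<Void> metaPut = new CompletableFuture<>(); // stand-in for the meta update
> 
>     // "PEWorker": holds the region lock while waiting on the meta write.
>     Thread peWorker = new Thread(() -> {
>       regionLock.lock();
>       try {
>         metaPut.join();           // never completes while meta is offline
>       } finally {
>         regionLock.unlock();
>       }
>     }, "PEWorker");
> 
>     // "RpcServer handler": regionServerReport checking the same region.
>     Thread reportHandler = new Thread(() -> {
>       regionLock.lock();          // parks here; tryLock() could skip instead
>       regionLock.unlock();
>     }, "RpcServer.handler");
> 
>     peWorker.start();
>     Thread.sleep(100);            // let the PEWorker grab the lock first
>     reportHandler.start();
> 
>     reportHandler.join(1000);
>     System.out.println("handler state after 1s: " + reportHandler.getState()); // WAITING
> 
>     metaPut.complete(null);       // unblock so the demo can exit
>     peWorker.join();
>     reportHandler.join();
>   }
> }
> {code}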



--
This message was sent by Atlassian Jira
(v8.20.10#820010)