[ https://issues.apache.org/jira/browse/HBASE-28113?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Duo Zhang resolved HBASE-28113.
-------------------------------
    Fix Version/s: 2.6.0
                   2.4.18
                   3.0.0-beta-1
                   2.5.7
     Hadoop Flags: Reviewed
       Resolution: Fixed

Pushed to all active branches.

Thanks [~luoen] for contributing!

> Modify the way of acquiring the RegionStateNode lock in checkOnlineRegionsReport to tryLock
> -------------------------------------------------------------------------------------------
>
>                 Key: HBASE-28113
>                 URL: https://issues.apache.org/jira/browse/HBASE-28113
>             Project: HBase
>          Issue Type: Improvement
>          Components: master
>    Affects Versions: 3.0.0-beta-1
>            Reporter: Haiping lv
>            Assignee: Haiping lv
>            Priority: Major
>             Fix For: 2.6.0, 2.4.18, 3.0.0-beta-1, 2.5.7
>
>         Attachments: master.stack
>
>
> HBase cluster description: *1 master and 5 region servers*
> During an ITBLL (IntegrationTestBigLinkedList) run, the issue is triggered when ChaosMonkey performs RestartRandomRsAction.
> The steps of the RestartRandomRsAction operation are as follows:
> # Stop node-3, node-2, and node-4.
> # Then stop node-5, which holds the meta region.
> # Start node-3.
> # Then stop node-1.
> # Start node-2, node-4, node-5, and node-1.
> *Fault description:*
> 1. The RegionServer nodes node-2, node-4, node-5, and node-1 are unable to come online.
> According to the RegionServer logs, the reportForDuty operation consistently times out.
> The log is as follows:
> {code:java}
> 2023-09-21T08:05:30,251 INFO [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> 2023-09-21T08:05:43,581 INFO [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> 2023-09-21T08:05:59,591 INFO [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> 2023-09-21T08:06:21,601 INFO [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> 2023-09-21T08:06:55,611 INFO [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> 2023-09-21T08:07:53,620 INFO [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> 2023-09-21T08:09:39,631 INFO [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> 2023-09-21T08:13:01,642 INFO [regionserver/core-1-2:16020] regionserver.HRegionServer: reportForDuty to master=master-1-1,16000,1695254395517 with port=16020, startcode=1695254725874
> {code}
> 2. The master thread is blocked.
> * Both RpcServer.priority.RWQ.Fifo.write.handler threads are blocked on RegionStateNode.lock.
> {code:java}
> "RpcServer.priority.RWQ.Fifo.write.handler=1,queue=0,port=16000" #67 daemon prio=5 os_prio=0 tid=0x00007f6ae3caf800 nid=0xea405 waiting on condition [0x00007f6aa1dcd000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00000004e3c8e6f0> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>         at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
>         at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
>         at org.apache.hadoop.hbase.master.assignment.RegionStateNode.lock(RegionStateNode.java:323)
>         at org.apache.hadoop.hbase.master.assignment.AssignmentManager.checkOnlineRegionsReport(AssignmentManager.java:1401)
>         at org.apache.hadoop.hbase.master.assignment.AssignmentManager.reportOnlineRegions(AssignmentManager.java:1363)
>         at org.apache.hadoop.hbase.master.MasterRpcServices.regionServerReport(MasterRpcServices.java:639)
>         at org.apache.hadoop.hbase.shaded.protobuf.generated.RegionServerStatusProtos$RegionServerStatusService$2.callBlockingMethod(RegionServerStatusProtos.java:17395)
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:437)
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124)
>         at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102)
>         at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82)
> {code}
> * 20 PEWorker threads are blocked on RegionStateStore.updateRegionLocation.
> {code:java}
> "PEWorker-1" #133 daemon prio=5 os_prio=0 tid=0x00007f6acdcf9800 nid=0xea5bc waiting on condition [0x00007f6a9d799000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00000004e4cc8e58> (a java.util.concurrent.CompletableFuture$Signaller)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
>         at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>         at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
>         at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>         at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:182)
>         at org.apache.hadoop.hbase.client.TableOverAsyncTable.put(TableOverAsyncTable.java:213)
>         at org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:259)
>         at org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:224)
>         at org.apache.hadoop.hbase.master.assignment.AssignmentManager.regionClosedAbnormally(AssignmentManager.java:2076)
>         at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:305)
>         at org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.execute(RegionRemoteProcedureBase.java:57)
>         at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:921)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1962)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread$$Lambda$610/726348606.call(Unknown Source)
>         at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1989)
> {code}
> * All four KeepAlivePEWorker threads are blocked.
> KeepAlivePEWorker-17, 18, and 19 are blocked on RegionStateStore.updateRegionLocation.
> {code:java}
> "KeepAlivePEWorker-17" #381 daemon prio=5 os_prio=0 tid=0x000056260b75d000 nid=0xeffb0 waiting on condition [0x00007f6a94339000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00000004ebf83440> (a java.util.concurrent.CompletableFuture$Signaller)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
>         at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
>         at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
>         at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
>         at org.apache.hadoop.hbase.util.FutureUtils.get(FutureUtils.java:182)
>         at org.apache.hadoop.hbase.client.TableOverAsyncTable.put(TableOverAsyncTable.java:213)
>         at org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:259)
>         at org.apache.hadoop.hbase.master.assignment.RegionStateStore.updateRegionLocation(RegionStateStore.java:224)
>         at org.apache.hadoop.hbase.master.assignment.AssignmentManager.transitStateAndUpdate(AssignmentManager.java:1982)
>         at org.apache.hadoop.hbase.master.assignment.AssignmentManager.regionOpening(AssignmentManager.java:1997)
>         at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.openRegion(TransitRegionStateProcedure.java:279)
>         at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.executeFromState(TransitRegionStateProcedure.java:434)
>         at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.executeFromState(TransitRegionStateProcedure.java:111)
>         at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188)
>         at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.execute(TransitRegionStateProcedure.java:398)
>         at org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.execute(TransitRegionStateProcedure.java:111)
>         at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:921)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1962)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread$$Lambda$610/726348606.call(Unknown Source)
>         at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1989)
> {code}
> * KeepAlivePEWorker-20 is blocked on RegionStateNode.lock.
> {code:java}
> "KeepAlivePEWorker-20" #388 daemon prio=5 os_prio=0 tid=0x000056260b847800 nid=0xf02da waiting on condition [0x00007f6a92e25000]
>    java.lang.Thread.State: WAITING (parking)
>         at sun.misc.Unsafe.park(Native Method)
>         - parking to wait for <0x00000004e3c8d990> (a java.util.concurrent.locks.ReentrantLock$NonfairSync)
>         at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireQueued(AbstractQueuedSynchronizer.java:870)
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquire(AbstractQueuedSynchronizer.java:1199)
>         at java.util.concurrent.locks.ReentrantLock$NonfairSync.lock(ReentrantLock.java:209)
>         at java.util.concurrent.locks.ReentrantLock.lock(ReentrantLock.java:285)
>         at org.apache.hadoop.hbase.master.assignment.RegionStateNode.lock(RegionStateNode.java:323)
>         at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.assignRegions(ServerCrashProcedure.java:551)
>         at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:243)
>         at org.apache.hadoop.hbase.master.procedure.ServerCrashProcedure.executeFromState(ServerCrashProcedure.java:68)
>         at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:188)
>         at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:921)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1650)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1396)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.access$1000(ProcedureExecutor.java:75)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.runProcedure(ProcedureExecutor.java:1962)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread$$Lambda$610/726348606.call(Unknown Source)
>         at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216)
>         at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1989)
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
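
Illustrative note on the change named in the issue title. In the stack traces, the RPC handler threads park inside ReentrantLock.lock() while running checkOnlineRegionsReport. The following is a minimal sketch of the general lock() versus tryLock() difference, not the actual HBase patch. The class TryLockReportSketch, the RegionNode type, and both methods are hypothetical names made up for this example. It also assumes that a region whose lock is busy can simply be skipped and checked again on a later report; the issue text does not state that policy.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.locks.ReentrantLock;

public class TryLockReportSketch {

  /** Hypothetical stand-in for a per-region state node guarded by a lock. */
  static final class RegionNode {
    final String name;
    final ReentrantLock lock = new ReentrantLock();

    RegionNode(String name) {
      this.name = name;
    }
  }

  /**
   * Walks the reported regions without ever parking the calling thread.
   * Returns the names of regions skipped because another thread held their lock.
   */
  static List<String> checkReport(List<RegionNode> reported) {
    List<String> skipped = new ArrayList<>();
    for (RegionNode node : reported) {
      // tryLock() returns immediately; lock() would wait here indefinitely.
      if (!node.lock.tryLock()) {
        skipped.add(node.name); // assumption: safe to re-check on a later report
        continue;
      }
      try {
        // A real consistency check on the region state would go here.
      } finally {
        node.lock.unlock();
      }
    }
    return skipped;
  }

  /** Runs checkReport while a second thread holds the lock of {@code busy}. */
  static List<String> checkWhileHeld(RegionNode busy, List<RegionNode> reported) {
    CountDownLatch held = new CountDownLatch(1);
    CountDownLatch release = new CountDownLatch(1);
    Thread holder = new Thread(() -> {
      busy.lock.lock();
      try {
        held.countDown();
        release.await(); // keep the lock until the check has finished
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      } finally {
        busy.lock.unlock();
      }
    });
    holder.start();
    try {
      held.await();
      List<String> skipped = checkReport(reported);
      release.countDown();
      holder.join();
      return skipped;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    RegionNode busy = new RegionNode("region-a");
    RegionNode idle = new RegionNode("region-b");
    List<RegionNode> reported = Arrays.asList(busy, idle);
    System.out.println("skipped while held: " + checkWhileHeld(busy, reported));
    System.out.println("skipped when free:  " + checkReport(reported));
  }
}
```

With lock() in place of tryLock(), the calling thread in this sketch would wait for as long as the other thread held region-a's lock, which is the waiting state the handler traces show. With tryLock() the caller returns promptly, at the cost of skipping the busy region.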