[ https://issues.apache.org/jira/browse/HBASE-29294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950486#comment-17950486 ]

Duo Zhang commented on HBASE-29294:
-----------------------------------

The exception is thrown when we reach the rpc timeout. If we choose sync = true when calling balanceSwitch, we synchronize on the balancer, and balancing itself also synchronizes on the balancer, which can take a long time. So by the time the balanceSwitch call enters the synchronized section to actually update the master region, it times out immediately...

After HBASE-29251, we abort the master when updating the master region fails, so I propose we apply the solution in HBASE-23895 to all updates to the master region, since an rpc timeout should not crash the master.

> Master crashed because of failing to update master region
> ---------------------------------------------------------
>
>                 Key: HBASE-29294
>                 URL: https://issues.apache.org/jira/browse/HBASE-29294
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Duo Zhang
>            Priority: Major
>
> {noformat}
> 2025-05-08T16:20:49,486 ERROR [RpcServer.default.FPBQ.Fifo.handler=27,queue=0,port=16000] master.HMaster: ***** ABORTING master meta02,16000,1746686264263: MasterRegion update is not successful *****
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Timed out waiting for lock for row: load_balancer_on in region 1595e783b53d99cd5eef43b6debb2682
>     at org.apache.hadoop.hbase.regionserver.HRegion.getRowLockInternal(HRegion.java:7131) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.regionserver.HRegion.lambda$getRowLock$26(HRegion.java:7164) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216) ~[hbase-common-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.regionserver.HRegion.getRowLock(HRegion.java:7164) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.regionserver.HRegion$BatchOperation.lockRowsAndBuildMiniBatch(HRegion.java:3686) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutate(HRegion.java:4882) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4848) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4765) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.regionserver.HRegion.mutate(HRegion.java:5264) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.regionserver.HRegion.mutate(HRegion.java:5258) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.regionserver.HRegion.mutate(HRegion.java:5254) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.regionserver.HRegion.lambda$put$11(HRegion.java:3399) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216) ~[hbase-common-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:3388) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.master.MasterStateStore.lambda$update$0(MasterStateStore.java:76) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.master.region.MasterRegion.update(MasterRegion.java:166) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.master.MasterStateStore.update(MasterStateStore.java:76) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.master.MasterStateStore.setState(MasterStateStore.java:68) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.master.BooleanStateStore.set(BooleanStateStore.java:59) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.master.MasterRpcServices.switchBalancer(MasterRpcServices.java:562) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.master.MasterRpcServices.synchronousBalanceSwitch(MasterRpcServices.java:579) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.master.MasterRpcServices.setBalancerRunning(MasterRpcServices.java:1732) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java) ~[hbase-protocol-shaded-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:457) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>     at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82) ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
> {noformat}
> Need to dig more.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
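
The failure mode described in the comment above can be reproduced in miniature with plain JDK classes. This is a rough sketch, not HBase code: `BALANCER`, `updateWithDeadline`, and `runScenario` are hypothetical stand-ins. The only assumption is that the update step checks a deadline that was fixed when the call started, and does so after it has entered the shared monitor.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

public class DeadlineAfterMonitorSketch {
  // Stand-in for the balancer object that both operations synchronize on.
  private static final Object BALANCER = new Object();

  // Hypothetical update step: the caller's deadline is only checked once the
  // monitor has been entered, so time spent waiting for the monitor counts.
  static boolean updateWithDeadline(long deadlineNanos) {
    synchronized (BALANCER) {
      return deadlineNanos - System.nanoTime() > 0;
    }
  }

  // Holds the monitor for holdMillis (the "balance run"), then issues a
  // switch call whose deadline is timeoutMillis from the moment it starts.
  static boolean runScenario(long holdMillis, long timeoutMillis) {
    CountDownLatch monitorHeld = new CountDownLatch(1);
    Thread balanceRun = new Thread(() -> {
      synchronized (BALANCER) {
        monitorHeld.countDown();
        try {
          Thread.sleep(holdMillis); // sleeping does not release the monitor
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
        }
      }
    });
    balanceRun.start();
    try {
      monitorHeld.await(); // the balance run now owns the monitor
      long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
      boolean updated = updateWithDeadline(deadline);
      balanceRun.join();
      return updated;
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while running the sketch", e);
    }
  }

  public static void main(String[] args) {
    System.out.println("balance run outlasts deadline: updated=" + runScenario(500, 100));
    System.out.println("balance run is short:          updated=" + runScenario(0, 2000));
  }
}
```

In the first scenario the update fails even though nothing contends for the update itself; the whole budget was spent waiting on the monitor. That is why a deadline inherited from the rpc call is a poor reason to treat the update as failed.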