[ 
https://issues.apache.org/jira/browse/HBASE-29294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17950486#comment-17950486
 ] 

Duo Zhang commented on HBASE-29294:
-----------------------------------

The exception is thrown when we reach the rpc timeout.

If we pass sync = true when calling balanceSwitch, we synchronize on the 
balancer, and a balance run also synchronizes on the balancer. Balancing can 
take a long time, so by the time the balanceSwitch call enters the synchronized 
section to actually update the master region, the rpc timeout may already have 
been reached, in which case the update times out immediately.
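
To make the failure mode concrete, here is a minimal, self-contained sketch. It 
is not HBase code: the class name, the method name and the explicit "now" 
parameter are all hypothetical, and it assumes the lock helper gives up without 
trying once the call's deadline has passed.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.locks.ReentrantLock;

public class DeadlineLockSketch {

  /**
   * Hypothetical helper: try to take the lock, waiting at most until the
   * deadline. Assumption: when no time is left we fail right away, even if
   * the lock itself is free.
   */
  static boolean tryLockBefore(ReentrantLock lock, long deadlineNanos, long nowNanos) {
    long wait = deadlineNanos - nowNanos;
    if (wait <= 0) {
      // The time budget was already spent elsewhere, for example while
      // blocked on a monitor held by a long running balance run.
      return false;
    }
    try {
      return lock.tryLock(wait, TimeUnit.NANOSECONDS);
    } catch (InterruptedException e) {
      // Restore the interrupt flag and report the attempt as failed.
      Thread.currentThread().interrupt();
      return false;
    }
  }

  public static void main(String[] args) {
    ReentrantLock rowLock = new ReentrantLock();
    long start = 0L;
    long deadline = start + TimeUnit.SECONDS.toNanos(60);

    // Case 1: we reach the lock with plenty of budget left.
    long early = start + TimeUnit.SECONDS.toNanos(1);
    System.out.println("early attempt: " + tryLockBefore(rowLock, deadline, early));
    if (rowLock.isHeldByCurrentThread()) {
      rowLock.unlock();
    }

    // Case 2: the wait in front of the lock used up the whole budget.
    long late = start + TimeUnit.SECONDS.toNanos(61);
    System.out.println("late attempt: " + tryLockBefore(rowLock, deadline, late));
  }
}
```

The point of the sketch is that the lock is uncontended in both cases; the late 
attempt fails only because the time budget was consumed before the lock was 
requested.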

After HBASE-29251, we abort the master when an update to the master region 
fails, so I propose we apply the solution from HBASE-23895 to all updates to 
the master region, since an rpc timeout should not crash the master.
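
For illustration only, one generic way to keep a critical update from 
inheriting the caller's rpc deadline is to clear the per-thread deadline for 
the duration of the update and restore it afterwards. Everything below is a 
hypothetical sketch: the ThreadLocal, the class and the method names are 
invented, and this is not taken from the HBASE-23895 patch.

```java
import java.util.function.Supplier;

public class DetachedUpdateSketch {

  // Hypothetical stand-in for "the deadline of the rpc call that the
  // current handler thread is serving". Null means "no deadline".
  static final ThreadLocal<Long> CALL_DEADLINE = new ThreadLocal<>();

  /** Run the update with no deadline visible, then restore the old one. */
  static <T> T runWithoutDeadline(Supplier<T> update) {
    Long saved = CALL_DEADLINE.get();
    CALL_DEADLINE.remove();
    try {
      return update.get();
    } finally {
      if (saved != null) {
        CALL_DEADLINE.set(saved);
      }
    }
  }

  public static void main(String[] args) {
    CALL_DEADLINE.set(1234L);
    boolean hidden = runWithoutDeadline(() -> CALL_DEADLINE.get() == null);
    System.out.println("deadline hidden during update: " + hidden);
    System.out.println("deadline restored afterwards: " + (CALL_DEADLINE.get() != null));
  }
}
```

With something along these lines, a slow client call can still time out on the 
client side, but the update itself is no longer failed by that timeout.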

> Master crashed because of failing to update master region
> ---------------------------------------------------------
>
>                 Key: HBASE-29294
>                 URL: https://issues.apache.org/jira/browse/HBASE-29294
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Duo Zhang
>            Priority: Major
>
> {noformat}
> 2025-05-08T16:20:49,486 ERROR 
> [RpcServer.default.FPBQ.Fifo.handler=27,queue=0,port=16000] master.HMaster: 
> ***** ABORTING master meta02,16000,1746686264263: MasterRegion update is not 
> successful *****
> org.apache.hadoop.hbase.exceptions.TimeoutIOException: Timed out waiting for 
> lock for row: load_balancer_on in region 1595e783b53d99cd5eef43b6debb2682
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.getRowLockInternal(HRegion.java:7131)
>  ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.lambda$getRowLock$26(HRegion.java:7164)
>  ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216) 
> ~[hbase-common-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.getRowLock(HRegion.java:7164) 
> ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion$BatchOperation.lockRowsAndBuildMiniBatch(HRegion.java:3686)
>  ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.doMiniBatchMutate(HRegion.java:4882)
>  ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4848) 
> ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.batchMutate(HRegion.java:4765) 
> ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.mutate(HRegion.java:5264) 
> ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.mutate(HRegion.java:5258) 
> ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.mutate(HRegion.java:5254) 
> ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.lambda$put$11(HRegion.java:3399) 
> ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at org.apache.hadoop.hbase.trace.TraceUtil.trace(TraceUtil.java:216) 
> ~[hbase-common-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.regionserver.HRegion.put(HRegion.java:3388) 
> ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.master.MasterStateStore.lambda$update$0(MasterStateStore.java:76)
>  ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.master.region.MasterRegion.update(MasterRegion.java:166)
>  ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.master.MasterStateStore.update(MasterStateStore.java:76)
>  ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.master.MasterStateStore.setState(MasterStateStore.java:68)
>  ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.master.BooleanStateStore.set(BooleanStateStore.java:59)
>  ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.master.MasterRpcServices.switchBalancer(MasterRpcServices.java:562)
>  ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.master.MasterRpcServices.synchronousBalanceSwitch(MasterRpcServices.java:579)
>  ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.master.MasterRpcServices.setBalancerRunning(MasterRpcServices.java:1732)
>  ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at 
> org.apache.hadoop.hbase.shaded.protobuf.generated.MasterProtos$MasterService$2.callBlockingMethod(MasterProtos.java)
>  ~[hbase-protocol-shaded-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:457) 
> ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124) 
> ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:102) 
> ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
>         at org.apache.hadoop.hbase.ipc.RpcHandler.run(RpcHandler.java:82) 
> ~[hbase-server-3.0.0-beta-2-SNAPSHOT.jar:3.0.0-beta-2-SNAPSHOT]
> {noformat}
> Need to dig more.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
