qazwsx created HBASE-29552:
------------------------------
Summary: RegionRemoteProcedureBase inconsistent state loading
caused startup failure.
Key: HBASE-29552
URL: https://issues.apache.org/jira/browse/HBASE-29552
Project: HBase
Issue Type: Bug
Reporter: qazwsx
Before the power failure, the balancer was moving the Region
(9bf8064aa66e5c6391bcf1d291f5e3fa), which triggered a
TransitRegionStateProcedure to execute the Move operation. Because part of the
HDFS data still held in memory had not been persisted when the power failed,
the state of the Region recorded in the META table was OPENING, and the
Procedure record with pid=53510 was lost.
After the system was restarted, loadProcedure reloaded the
CloseRegionProcedure, the transitionState operation failed, and the Master
service ultimately failed to start.
# log
before power off:
2025-08-24 13:53:33,254 | INFO | master/ndp-hbase-master-1:16000.Chore.5 |
balance hri=9bf8064aa66e5c6391bcf1d291f5e3fa,
source=ndp-hbase-region-0.hbaseregion.sop.svc.cluster.local,16020,1756010506000,
destination=ndp-hbase-region-1.hbaseregion.sop.svc.cluster.local,16020,1756010435228
|
org.apache.hadoop.hbase.master.HMaster.executeRegionPlansWithThrottling(HMaster.java:1987)
2025-08-24 13:53:33,266 | INFO | PEWorker-10 | Initialized
subprocedures=[{pid=53505, ppid=53504, state=RUNNABLE; CloseRegionProcedure
9bf8064aa66e5c6391bcf1d291f5e3fa,
server=ndp-hbase-region-0.hbaseregion.sop.svc.cluster.local,16020,1756010506000}]
|
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1685)
2025-08-24 13:53:33,423 | INFO | RSProcedureDispatcher-pool-23 | Using
KERBEROS authentication for service=AdminService, sasl=true, type='kerberos' |
org.apache.hadoop.hbase.ipc.RpcConnection.<init>(RpcConnection.java:124)
2025-08-24 14:01:50,565 | INFO | PEWorker-15 | pid=53504 updating hbase:meta
row=9bf8064aa66e5c6391bcf1d291f5e3fa, regionState=CLOSED |
org.apache.hadoop.hbase.master.assignment.RegionStateStore.createPutForRegionLocUpdate(RegionStateStore.java:253)
2025-08-24 14:01:50,569 | INFO | PEWorker-15 | Finished pid=53505, ppid=53504,
state=SUCCESS; CloseRegionProcedure 9bf8064aa66e5c6391bcf1d291f5e3fa,
server=ndp-hbase-region-0.hbaseregion.sop.svc.cluster.local,16020,1756010506000
in 8 mins, 17.302 sec |
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1414)
2025-08-24 14:01:50,569 | INFO | PEWorker-12 | Starting pid=53504,
state=RUNNABLE:REGION_STATE_TRANSITION_GET_ASSIGN_CANDIDATE, locked=true;
TransitRegionStateProcedure table=student001,
region=9bf8064aa66e5c6391bcf1d291f5e3fa, REOPEN/MOVE; state=CLOSED,
location=ndp-hbase-region-1.hbaseregion.sop.svc.cluster.local,16020,1756010435228;
forceNewPlan=false, retain=false |
org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.queueAssign(TransitRegionStateProcedure.java:250)
2025-08-24 14:01:50,720 | INFO | PEWorker-18 | pid=53504 updating hbase:meta
row=9bf8064aa66e5c6391bcf1d291f5e3fa, regionState=OPENING,
regionLocation=ndp-hbase-region-1.hbaseregion.sop.svc.cluster.local,16020,1756010435228
|
org.apache.hadoop.hbase.master.assignment.RegionStateStore.createPutForRegionLocUpdate(RegionStateStore.java:253)
2025-08-24 14:01:50,726 | INFO | PEWorker-18 | Initialized
subprocedures=[{pid=53510, ppid=53504, state=RUNNABLE; OpenRegionProcedure
9bf8064aa66e5c6391bcf1d291f5e3fa,
server=ndp-hbase-region-1.hbaseregion.sop.svc.cluster.local,16020,1756010435228}]
|
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1685)
2025-08-24 14:01:51,054 | INFO | PEWorker-5 | pid=53504 updating hbase:meta
row=9bf8064aa66e5c6391bcf1d291f5e3fa, regionState=OPEN, openSeqNum=96213,
regionLocation=ndp-hbase-region-1.hbaseregion.sop.svc.cluster.local,16020,1756010435228
|
org.apache.hadoop.hbase.master.assignment.RegionStateStore.createPutForRegionLocUpdate(RegionStateStore.java:253)
2025-08-24 14:01:51,059 | INFO | PEWorker-5 | Finished pid=53510, ppid=53504,
state=SUCCESS; OpenRegionProcedure 9bf8064aa66e5c6391bcf1d291f5e3fa,
server=ndp-hbase-region-1.hbaseregion.sop.svc.cluster.local,16020,1756010435228
in 330 msec |
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1414)
2025-08-24 14:01:51,060 | INFO | PEWorker-7 | Finished pid=53504,
state=SUCCESS; TransitRegionStateProcedure table=student001,
region=9bf8064aa66e5c6391bcf1d291f5e3fa, REOPEN/MOVE in 8 mins, 17.804 sec |
org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1414)
startup failed after recovery from the power failure:
2025-08-24 14:58:19,266 | ERROR |
master/ndp-hbase-master-1:16000:becomeActiveMaster | Failed to become active
master |
org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2393)
java.lang.AssertionError:
org.apache.hadoop.hbase.exceptions.UnexpectedStateException: Expected [CLOSING,
CLOSED] so could move to CLOSED but current state=OPENING
at
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.stateLoaded(RegionRemoteProcedureBase.java:290)
at
org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure.stateLoaded(TransitRegionStateProcedure.java:668)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager$RegionMetaLoadingVisitor.visitRegionState(AssignmentManager.java:1879)
at
org.apache.hadoop.hbase.master.assignment.RegionStateStore.visitMetaEntry(RegionStateStore.java:153)
at
org.apache.hadoop.hbase.master.assignment.RegionStateStore.access$100(RegionStateStore.java:66)
at
org.apache.hadoop.hbase.master.assignment.RegionStateStore$1.visit(RegionStateStore.java:95)
at
org.apache.hadoop.hbase.MetaTableAccessor.scanMeta(MetaTableAccessor.java:809)
at
org.apache.hadoop.hbase.MetaTableAccessor.scanMeta(MetaTableAccessor.java:755)
at
org.apache.hadoop.hbase.MetaTableAccessor.scanMeta(MetaTableAccessor.java:716)
at
org.apache.hadoop.hbase.MetaTableAccessor.fullScanRegions(MetaTableAccessor.java:193)
at
org.apache.hadoop.hbase.master.assignment.RegionStateStore.visitMeta(RegionStateStore.java:85)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.loadMeta(AssignmentManager.java:1909)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.joinCluster(AssignmentManager.java:1779)
at
org.apache.hadoop.hbase.master.HMaster.finishActiveMasterInitialization(HMaster.java:1035)
at
org.apache.hadoop.hbase.master.HMaster.startActiveMasterManager(HMaster.java:2389)
at org.apache.hadoop.hbase.master.HMaster.lambda$run$0(HMaster.java:558)
at java.lang.Thread.run(Thread.java:750)
Caused by: org.apache.hadoop.hbase.exceptions.UnexpectedStateException:
Expected [CLOSING, CLOSED] so could move to CLOSED but current state=OPENING
at
org.apache.hadoop.hbase.master.assignment.RegionStateNode.transitionState(RegionStateNode.java:142)
at
org.apache.hadoop.hbase.master.assignment.AssignmentManager.regionClosedWithoutPersistingToMeta(AssignmentManager.java:2234)
at
org.apache.hadoop.hbase.master.assignment.CloseRegionProcedure.restoreSucceedState(CloseRegionProcedure.java:116)
at
org.apache.hadoop.hbase.master.assignment.RegionRemoteProcedureBase.stateLoaded(RegionRemoteProcedureBase.java:287)
... 16 more
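To make the failing check easier to follow, here is a minimal, self-contained sketch of the kind of expected-state validation that the trace points to. It is a simplified model and not the HBase source: the class names TransitionCheckSketch and RegionNode, the reduced State enum, and the local UnexpectedStateException stand-in are all assumptions made for illustration. It only shows how a node loaded as OPENING rejects a transition to CLOSED that expects CLOSING or CLOSED.

```java
import java.util.Arrays;

public class TransitionCheckSketch {
    // Reduced set of region states, enough for this illustration only.
    enum State { OPENING, OPEN, CLOSING, CLOSED }

    // Local stand-in for the exception type seen in the trace (hypothetical, simplified).
    static class UnexpectedStateException extends Exception {
        UnexpectedStateException(String msg) { super(msg); }
    }

    // Simplified in-memory region node that holds nothing but a state.
    static class RegionNode {
        private State state;

        RegionNode(State initial) { this.state = initial; }

        State getState() { return state; }

        // Moves to 'update' only if the current state is one of 'expected';
        // otherwise throws and leaves the state unchanged.
        void transitionState(State update, State... expected) throws UnexpectedStateException {
            if (!Arrays.asList(expected).contains(state)) {
                throw new UnexpectedStateException("Expected " + Arrays.toString(expected)
                    + " so could move to " + update + " but current state=" + state);
            }
            state = update;
        }
    }

    public static void main(String[] args) {
        // The state loaded from hbase:meta after the restart was OPENING.
        RegionNode node = new RegionNode(State.OPENING);
        try {
            // Restoring a succeeded close expects CLOSING or CLOSED.
            node.transitionState(State.CLOSED, State.CLOSING, State.CLOSED);
        } catch (UnexpectedStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this model the exception is simply printed; in the trace above it surfaces as an AssertionError in stateLoaded, which is what stops the Master from becoming active.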
# my question
The {{RegionRemoteProcedureBase#restoreSucceedState}} method is entered only
when the procedure state is
{{RegionRemoteProcedureBaseState.REGION_REMOTE_PROCEDURE_REPORT_SUCCEED}}.
In that case, is it possible to skip the expected-state check when
{{regionNode.transitionState}} is executed?
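
To make the question concrete, the following is a rough sketch of one possible shape for a tolerant restore: it checks the loaded state first and reports a mismatch to the caller instead of throwing. Everything here (TolerantRestoreSketch, RegionNode, RestoreResult, restoreClosed) is hypothetical and is not HBase code or a proposed patch. Whether leaving the node untouched is actually safe for the assignment logic is exactly the open question.

```java
import java.util.EnumSet;
import java.util.Set;

public class TolerantRestoreSketch {
    // Reduced set of region states, enough for this illustration only.
    enum State { OPENING, OPEN, CLOSING, CLOSED }

    // Outcome reported to the caller instead of an exception (hypothetical).
    enum RestoreResult { APPLIED, SKIPPED_INCONSISTENT }

    // Simplified in-memory region node that holds nothing but a state.
    static class RegionNode {
        State state;
        RegionNode(State initial) { this.state = initial; }
    }

    // Applies the "region closed" restore only when the loaded state is
    // CLOSING or CLOSED; otherwise leaves the node untouched and says so.
    static RestoreResult restoreClosed(RegionNode node) {
        Set<State> expected = EnumSet.of(State.CLOSING, State.CLOSED);
        if (!expected.contains(node.state)) {
            return RestoreResult.SKIPPED_INCONSISTENT;
        }
        node.state = State.CLOSED;
        return RestoreResult.APPLIED;
    }

    public static void main(String[] args) {
        RegionNode fromMeta = new RegionNode(State.OPENING); // state seen after the power failure
        RestoreResult result = restoreClosed(fromMeta);
        System.out.println(result + ", state=" + fromMeta.state);
    }
}
```

A caller could log the SKIPPED_INCONSISTENT case and continue startup, leaving the region for later repair, rather than aborting the Master.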