[
https://issues.apache.org/jira/browse/HBASE-30143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18078745#comment-18078745
]
Kiran Kumar Maturi commented on HBASE-30143:
--------------------------------------------
SplitTableRegionProcedure failed at SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS
because recovered edits were present.
2026-04-01 16:18:24,423 ERROR [PEWorker-46]
assignment.SplitTableRegionProcedure: Splitting
fcc017f900f94981ad490e291dd70dfe, pid=14060510,
state=RUNNABLE:SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS, locked=true;
SplitTableRegionProcedure table=tsdb, parent=fcc017f900f94981ad490e291dd70dfe,
daughterA=a7439e4c913b08c90c2ca6be66d46683,
daughterB=f67ce33a4fcf4cc4f9bc8c829857dbf1
java.io.IOException: Recovered.edits are found in Region: {ENCODED => fcc017f900f94981ad490e291dd70dfe, NAME => 'tsdb,...,fcc017f900f94981ad490e291dd70dfe.', STARTKEY => '...', ENDKEY => '...'}, abort split/merge to prevent data loss
    at org.apache.hadoop.hbase.master.assignment.AssignmentManagerUtil.checkClosedRegion(AssignmentManagerUtil.java:307)
    at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.checkClosedRegions(SplitTableRegionProcedure.java:282)
    at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.executeFromState(SplitTableRegionProcedure.java:313)
    at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.executeFromState(SplitTableRegionProcedure.java:107)
    at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189)
    at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1660)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1417)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1976)
Logs leading up to the failure:
{code:java}
2026-04-01 16:18:23,833 INFO [PEWorker-46] procedure2.ProcedureExecutor:
Initialized subprocedures=[{pid=14060511, ppid=14060510, state=RUNNABLE;
org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure,
table=tsdb, region=fcc017f900f94981ad490e291dd70dfe, UNASSIGN}]
2026-04-01 16:18:23,837 DEBUG [PEWorker-46] procedure2.ProcedureExecutor:
Acquired lock for pid=14060510,
state=RUNNABLE:SPLIT_TABLE_REGION_GET_SPLITTING_TABLE_REGIONS,
locked=true; SplitTableRegionProcedure table=tsdb,
parent=fcc017f900f94981ad490e291dd70dfe,
daughterA=a7439e4c913b08c90c2ca6be66d46683,
daughterB=f67ce33a4fcf4cc4f9bc8c829857dbf1
2026-04-01 16:18:23,847 INFO [PEWorker-46]
assignment.TransitRegionStateProcedure: Starting pid=14060511, ppid=14060510,
state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE;
TransitRegionStateProcedure table=tsdb,
region=fcc017f900f94981ad490e291dd70dfe, UNASSIGN
2026-04-01 16:18:23,857 INFO [PEWorker-23] assignment.CloseRegionProcedure:
pid=14060512, ppid=14060511, state=RUNNABLE; CloseRegionProcedure table=tsdb,
region=fcc017f900f94981ad490e291dd70dfe, server=<rs-host>,16020,<startcode>
2026-04-01 16:18:24,412 INFO [PEWorker-23] procedure2.ProcedureExecutor:
Finished pid=14060512, ppid=14060511, state=SUCCESS; CloseRegionProcedure
table=tsdb,
region=fcc017f900f94981ad490e291dd70dfe, server=<rs-host>,16020,<startcode>
in 545 msec
2026-04-01 16:18:24,421 INFO [PEWorker-46] procedure2.ProcedureExecutor:
Finished pid=14060511, ppid=14060510, state=SUCCESS;
TransitRegionStateProcedure table=tsdb,
region=fcc017f900f94981ad490e291dd70dfe, UNASSIGN in 565 msec
2026-04-01 16:18:24,436 DEBUG [PEWorker-46] procedure2.ProcedureExecutor:
Child procedures of pid=14060510 finished; pid=14060511 SUCCESS, pid=14060512
SUCCESS
{code}
> ProcedureExecutor orphans FAILED procedures with holdLock=true when
> setRollback() races with child release()
> --------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-30143
> URL: https://issues.apache.org/jira/browse/HBASE-30143
> Project: HBase
> Issue Type: Bug
> Components: proc-v2, Region Assignment
> Affects Versions: 2.6.5, 2.5.14
> Environment: Any HBase deployment running splits/merges or other
> StateMachineProcedures with holdLock()==true under concurrent worker load
> Reporter: Kiran Kumar Maturi
> Assignee: Kiran Kumar Maturi
> Priority: Minor
>
> h3. Summary
> {{ProcedureExecutor.executeProcedure()}} can leave a
> {{StateMachineProcedure}} with {{holdLock()==true}} in an orphaned
> state: {{ProcedureState.FAILED}}, exclusive lock held, and not present on
> any scheduler queue. No event ever re-awakens it; the only recovery is master
> failover (via {{loadProcedures() -> failedList.forEach(scheduler::addBack)}}).
>
> In production we observed this as an HBase region stuck CLOSED for 5h 37m
> after a {{SplitTableRegionProcedure}} hit "Recovered.edits are found" during
> {{SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS}}. The region was completely
> unavailable to clients for the entire duration. Master failover released the
> lock and rollback finally ran.
>
> The cause is a race between two workers: a parent procedure calls
> {{setFailure()}} while a sibling/child procedure has not yet returned from
> {{procStack.release()}}.
>
> Relevant code paths (line numbers from branch-2.6):
> * {{ProcedureExecutor.executeProcedure()}}, lines 1414-1489: outer do-while loop.
> * {{RootProcedureState.setRollback()}}, line 85: guarded by {{running == 0 && state == FAILED}}.
> * {{RootProcedureState.acquire()}}, line 138: increments {{running}}; {{release()}} at line 150 decrements it.
> * {{ProcedureExecutor.releaseLock()}}, lines 1502-1509: skips the release when {{proc.holdLock(env)==true && !proc.isFinished()}}. {{isFinished()}} is only true for SUCCESS/ROLLEDBACK, not for FAILED.
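
The lock-release condition in the last bullet can be looked at in isolation. The following is a minimal sketch, not HBase code: the class {{LockReleaseModel}}, the enum {{ProcState}} and the method {{skipsRelease}} are hypothetical names introduced for illustration, and only the predicate itself is taken from the description above.

```java
// Hypothetical stand-alone model; it does not use or reproduce HBase classes.
final class LockReleaseModel {

    // Reduced set of procedure states, enough to show the predicate.
    enum ProcState { RUNNABLE, SUCCESS, FAILED, ROLLEDBACK }

    // As described in the report: only SUCCESS and ROLLEDBACK count as finished.
    static boolean isFinished(ProcState state) {
        return state == ProcState.SUCCESS || state == ProcState.ROLLEDBACK;
    }

    // As described in the report: the release is skipped when the procedure
    // holds its lock across steps and has not finished yet.
    static boolean skipsRelease(boolean holdLock, ProcState state) {
        return holdLock && !isFinished(state);
    }

    public static void main(String[] args) {
        for (ProcState state : ProcState.values()) {
            System.out.println(state + ", holdLock=true, release skipped: "
                + skipsRelease(true, state));
        }
    }
}
```

Under this model a FAILED procedure with {{holdLock=true}} keeps its lock until something drives it to ROLLEDBACK, which is why a missed rollback leaves the lock held.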
>
> Timeline of the race:
>
> || T || Worker-A (child) || Worker-B (parent) || running || state ||
> | 0 | acquire(child) | - | 1 | RUNNING |
> | 1 | child execute returns SUCCESS | - | 1 | RUNNING |
> | 2 | countDownChildren → scheduler.addFront(parent) | - | 1 | RUNNING |
> | 3 | - | picks up parent | 1 | RUNNING |
> | 4 | - | acquire(parent) | 2 | RUNNING |
> | 5 | - | executeFromState throws, setFailure() | 2 | FAILED |
> | 6 | - | execProcedure returns | 2 | FAILED |
> | 7 | - | do-while re-enters, acquire() returns false | 2 | FAILED |
> | 8 | - | setRollback() returns false (running != 0) | 2 | FAILED |
> | 9 | - | else-branch, wasExecuted()==true, break; | 2 | FAILED |
> | 10 | release(child) | Worker-B returns | 1 | FAILED |
>
> From T+10 on, the procedure is FAILED, {{holdLock=true}} has prevented
> {{releaseLock()}} at T+6 from releasing the xlock, and nothing re-enqueues
> the root. The child's {{countDownChildren}} wake-up was consumed at T+3 and
> there is no further event generator.
>
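
The timeline in the issue description can be replayed on a single thread to make the counter arithmetic concrete. This is a minimal sketch under stated assumptions, not the HBase implementation: {{RootStateModel}} and {{replayTimeline}} are hypothetical names, the {{setRollback()}} guard is the one quoted in the description, and the rule that {{acquire()}} fails once the state is no longer RUNNING is inferred from the T+7 row. The replay follows the table row by row, so the parent's own {{release()}}, which the table does not show, is left out as well.

```java
// Hypothetical single-threaded model of the counters described in the report.
final class RootStateModel {

    enum State { RUNNING, FAILED, ROLLINGBACK }

    private State state = State.RUNNING;
    private int running = 0;

    // Assumption inferred from T+7: acquire fails once the root is not RUNNING.
    synchronized boolean acquire() {
        if (state != State.RUNNING) {
            return false;
        }
        running++;
        return true;
    }

    synchronized void release() {
        running--;
    }

    synchronized void setFailure() {
        state = State.FAILED;
    }

    // Guard quoted in the report: running == 0 && state == FAILED.
    synchronized boolean setRollback() {
        if (running == 0 && state == State.FAILED) {
            state = State.ROLLINGBACK;
            return true;
        }
        return false;
    }

    synchronized State state() { return state; }

    synchronized int running() { return running; }

    // Replays the interleaving from the table in timeline order.
    static RootStateModel replayTimeline() {
        RootStateModel root = new RootStateModel();
        root.acquire();                                // T+0  Worker-A: acquire(child)
        root.acquire();                                // T+4  Worker-B: acquire(parent)
        root.setFailure();                             // T+5  Worker-B: setFailure()
        boolean reacquired = root.acquire();           // T+7  Worker-B: acquire() fails
        boolean rollbackStarted = root.setRollback();  // T+8  Worker-B: blocked by running != 0
        if (reacquired || rollbackStarted) {
            throw new IllegalStateException("model diverged from the reported timeline");
        }
        // T+9: Worker-B breaks out of its loop without scheduling anything.
        root.release();                                // T+10 Worker-A: release(child)
        return root;
    }

    public static void main(String[] args) {
        RootStateModel root = replayTimeline();
        System.out.println("state=" + root.state() + " running=" + root.running());
    }
}
```

In this model the root ends in FAILED with {{running}} at 1, matching the last table row, and no later step calls {{setRollback()}} again, which corresponds to the orphaned state described above.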