[
https://issues.apache.org/jira/browse/HBASE-30143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18078604#comment-18078604
]
Duo Zhang commented on HBASE-30143:
-----------------------------------
So the SplitTableRegionProcedure failed? At which step? Is the failure expected?
The rollback logic for most procedures is not well designed or reviewed;
usually a procedure can be rolled back successfully only at some very early
stages. Most procedures cannot be rolled back, especially region assignment
related procedures.
> ProcedureExecutor orphans FAILED procedures with holdLock=true when
> setRollback() races with child release()
> --------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-30143
> URL: https://issues.apache.org/jira/browse/HBASE-30143
> Project: HBase
> Issue Type: Bug
> Components: proc-v2, Region Assignment
> Affects Versions: 2.6.5, 2.5.14
> Environment: Any HBase deployment running splits/merges or other
> StateMachineProcedures with holdLock()==true under concurrent worker load
> Reporter: Kiran Kumar Maturi
> Assignee: Kiran Kumar Maturi
> Priority: Minor
>
> h3. Summary
> {{ProcedureExecutor.executeProcedure()}} can leave a
> {{StateMachineProcedure}} with {{holdLock()==true}} in an orphaned
> state: {{ProcedureState.FAILED}}, exclusive lock held, and not present on
> any scheduler queue. No event ever re-awakens it; the only recovery is master
> failover (via {{loadProcedures() -> failedList.forEach(scheduler::addBack)}}).
>
> In production we observed this as an HBase region stuck CLOSED for 5h 37m
> after a {{SplitTableRegionProcedure}} hit "Recovered.edits are found" during
> {{SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS}}. The region was completely
> unavailable to clients for the entire duration. Master failover released the
> lock and rollback finally ran.
>
> The race occurs between two workers when a parent procedure calls
> {{setFailure()}} while a sibling/child procedure has not yet returned from
> {{procStack.release()}}.
> Relevant code paths (line numbers from branch-2.6):
> * {{ProcedureExecutor.executeProcedure()}}, lines 1414-1489: outer do-while loop.
> * {{RootProcedureState.setRollback()}}, line 85: guarded by {{running == 0 && state == FAILED}}.
> * {{RootProcedureState.acquire()}}, line 138: increments {{running}}; {{release()}} at line 150 decrements it.
> * {{ProcedureExecutor.releaseLock()}}, lines 1502-1509: skips the release when {{proc.holdLock(env)==true && !proc.isFinished()}}. {{isFinished()}} is true only for SUCCESS/ROLLEDBACK, not for FAILED.
>
> Timeline of the race:
>
> || T || Worker-A (child) || Worker-B (parent) || running || state ||
> | 0 | acquire(child) | - | 1 | RUNNING |
> | 1 | child execute returns SUCCESS | - | 1 | RUNNING |
> | 2 | countDownChildren → scheduler.addFront(parent) | - | 1 | RUNNING |
> | 3 | - | picks up parent | 1 | RUNNING |
> | 4 | - | acquire(parent) | 2 | RUNNING |
> | 5 | - | executeFromState throws, setFailure() | 2 | FAILED |
> | 6 | - | execProcedure returns | 2 | FAILED |
> | 7 | - | do-while re-enters, acquire() returns false | 2 | FAILED |
> | 8 | - | setRollback() returns false (running != 0) | 2 | FAILED |
> | 9 | - | else-branch, wasExecuted()==true, break; | 2 | FAILED |
> | 10 | release(child) | Worker-B returns | 1 | FAILED |
>
> From T+10: the procedure is FAILED, {{holdLock=true}} prevented
> {{releaseLock()}} at T+6 from releasing the xlock, and nothing re-enqueues
> the root. The child's {{countDownChildren}} wake-up was consumed at T+3 and
> there is no further event generator.
>
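
The ordering problem described in the timeline above can be replayed with a
small single-threaded model. This is only a sketch under stated assumptions,
not the HBase implementation: RollbackRaceSketch and RootStateModel are
hypothetical names, the model keeps just the running counter and the
RUNNING/FAILED/ROLLINGBACK state, and it pairs every successful acquire() with
a release(), so its counter values are not meant to match the table column
for column. It shows only that the setRollback() guard fails while the child
is still counted, and that nothing retries it once the child releases.

```java
public class RollbackRaceSketch {
    public enum State { RUNNING, FAILED, ROLLINGBACK }

    // Hypothetical stand-in for the per-root bookkeeping; not the HBase class.
    public static final class RootStateModel {
        public State state = State.RUNNING;
        public int running = 0;

        // Assumption: acquire succeeds only while the root is still RUNNING.
        public boolean acquire() {
            if (state != State.RUNNING) {
                return false;
            }
            running++;
            return true;
        }

        public void release() {
            running--;
        }

        public void setFailure() {
            state = State.FAILED;
        }

        // Guard quoted in the report: running == 0 && state == FAILED.
        public boolean setRollback() {
            if (running == 0 && state == State.FAILED) {
                state = State.ROLLINGBACK;
                return true;
            }
            return false;
        }
    }

    public static void main(String[] args) {
        RootStateModel root = new RootStateModel();

        root.acquire();      // Worker-A: child starts executing
        // Worker-A: child finishes and re-schedules the parent, but has not released yet
        root.acquire();      // Worker-B: parent picked up and acquired
        root.setFailure();   // Worker-B: parent step throws
        root.release();      // Worker-B: parent execution returns

        boolean reacquired = root.acquire();           // false: state is FAILED
        boolean rollbackStarted = root.setRollback();  // false: child still counted
        System.out.println("reacquired=" + reacquired
            + " rollbackStarted=" + rollbackStarted);
        // Worker-B gives up here; nothing re-enqueues the root.

        root.release();      // Worker-A: child finally releases
        System.out.println("running=" + root.running + " state=" + root.state);

        // The guard would pass now, but in the reported scenario no worker calls it.
        System.out.println("lateRollback=" + root.setRollback());
    }
}
```

In this model the late setRollback() call succeeds, which suggests the missing
piece is a re-check or re-enqueue of the root after the last release(); whether
that is the right fix for ProcedureExecutor is a separate question.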