[
https://issues.apache.org/jira/browse/HBASE-30143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18078821#comment-18078821
]
Duo Zhang commented on HBASE-30143:
-----------------------------------
A region in CLOSED state should not have recovered.edits. Could you please check
the earlier logs from when the region was opened? Did it successfully remove the
recovered edits after opening?
> ProcedureExecutor orphans FAILED procedures with holdLock=true when
> setRollback() races with child release()
> --------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-30143
> URL: https://issues.apache.org/jira/browse/HBASE-30143
> Project: HBase
> Issue Type: Bug
> Components: proc-v2, Region Assignment
> Affects Versions: 2.6.5, 2.5.14
> Environment: Any HBase deployment running splits/merges or other
> StateMachineProcedures with holdLock()==true under concurrent worker load
> Reporter: Kiran Kumar Maturi
> Assignee: Kiran Kumar Maturi
> Priority: Minor
>
> h3. Summary
> {{ProcedureExecutor.executeProcedure()}} can leave a
> {{StateMachineProcedure}} with {{holdLock()==true}} in an orphaned
> state: {{ProcedureState.FAILED}}, exclusive lock held, and not present on
> any scheduler queue. No event ever re-awakens it; the only recovery is master
> failover (via {{loadProcedures() -> failedList.forEach(scheduler::addBack)}}).
>
> In production we observed this as an HBase region stuck CLOSED for 5h 37m
> after a {{SplitTableRegionProcedure}} hit "Recovered.edits are found" during
> {{SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS}}. The region was completely
> unavailable to clients for the entire duration. Master failover released the
> lock and rollback finally ran.
>
> The race is between two workers: a parent procedure calls {{setFailure()}}
> while a sibling/child procedure has not yet returned from
> {{procStack.release()}}.
> Relevant code paths (numbers from branch-2.6):
> * {{ProcedureExecutor.executeProcedure()}}, lines 1414-1489: outer do-while
> loop.
> * {{RootProcedureState.setRollback()}}, line 85: guarded by
> {{running == 0 && state == FAILED}}.
> * {{RootProcedureState.acquire()}}, line 138: increments {{running}};
> {{release()}} at line 150 decrements it.
> * {{ProcedureExecutor.releaseLock()}}, lines 1502-1509: skips the release when
> {{proc.holdLock(env)==true && !proc.isFinished()}}. {{isFinished()}} is true
> only for SUCCESS/ROLLEDBACK, not for FAILED.
>
> Timeline of the race:
>
> || T || Worker-A (child) || Worker-B (parent) || running || state ||
> | 0 | acquire(child) | - | 1 | RUNNING |
> | 1 | child execute returns SUCCESS | - | 1 | RUNNING |
> | 2 | countDownChildren → scheduler.addFront(parent) | - | 1 | RUNNING |
> | 3 | - | picks up parent | 1 | RUNNING |
> | 4 | - | acquire(parent) | 2 | RUNNING |
> | 5 | - | executeFromState throws, setFailure() | 2 | FAILED |
> | 6 | - | execProcedure returns | 2 | FAILED |
> | 7 | - | do-while re-enters, acquire() returns false | 2 | FAILED |
> | 8 | - | setRollback() returns false (running != 0) | 2 | FAILED |
> | 9 | - | else-branch, wasExecuted()==true, break; | 2 | FAILED |
> | 10 | release(child) | Worker-B returns | 1 | FAILED |
>
> From T+10: the procedure is FAILED, {{holdLock=true}} prevented
> {{releaseLock()}} at T+6 from releasing the xlock, and nothing re-enqueues
> the root. The child's {{countDownChildren}} wake-up was consumed at T+3 and
> there is no further event generator.
>
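To make the timeline in the description easier to follow, here is a minimal single-threaded sketch that replays steps T+0 through T+10 in order. It is a simplified model written for illustration, not the HBase source: the class OrphanRaceModel, its nested RootState and Outcome types, and the helper releaseLockSkipped() are hypothetical names, and the guards only mirror the conditions quoted in the description (setRollback() requires running == 0 && state == FAILED; releaseLock() is skipped when holdLock is true and the procedure is not finished). The assumption that acquire() fails once the root is no longer RUNNING is taken from step T+7 of the table.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Illustrative model of the reported race; NOT the real HBase classes. */
public class OrphanRaceModel {
    enum State { RUNNING, FAILED, ROLLINGBACK, ROLLEDBACK, SUCCESS }

    /** Hypothetical stand-in for RootProcedureState. */
    static final class RootState {
        int running = 0;
        State state = State.RUNNING;

        // Assumed from T+7: acquire fails once the root has left RUNNING.
        boolean acquire() {
            if (state != State.RUNNING) return false;
            running++;
            return true;
        }
        void release() { running--; }
        void setFailure() { state = State.FAILED; }

        // Guard as quoted in the description: running == 0 && state == FAILED.
        boolean setRollback() {
            if (running == 0 && state == State.FAILED) {
                state = State.ROLLINGBACK;
                return true;
            }
            return false;
        }
    }

    /** Snapshot taken after T+10. */
    static final class Outcome {
        final State state; final int running; final boolean rollbackStarted;
        final boolean lockHeld; final int queued;
        Outcome(State s, int r, boolean rb, boolean l, int q) {
            state = s; running = r; rollbackStarted = rb; lockHeld = l; queued = q;
        }
    }

    // Per the description, only SUCCESS and ROLLEDBACK count as finished.
    static boolean isFinished(State s) {
        return s == State.SUCCESS || s == State.ROLLEDBACK;
    }

    // Skip condition as quoted: holdLock == true && !isFinished().
    static boolean releaseLockSkipped(boolean holdLock, State s) {
        return holdLock && !isFinished(s);
    }

    static Outcome replay() {
        RootState root = new RootState();
        Deque<String> scheduler = new ArrayDeque<>();
        boolean holdLock = true;
        boolean lockHeld = true;          // parent already owns the xlock

        root.acquire();                   // T+0  A: acquire(child), running = 1
                                          // T+1  A: child returns SUCCESS
        scheduler.addFirst("parent");     // T+2  A: countDownChildren wakes parent
        scheduler.pollFirst();            // T+3  B: picks up parent
        root.acquire();                   // T+4  B: acquire(parent), running = 2
        root.setFailure();                // T+5  B: executeFromState throws
        if (!releaseLockSkipped(holdLock, root.state)) {
            lockHeld = false;             // T+6  B: not reached, lock stays held
        }
        boolean reacquired = root.acquire();                          // T+7
        boolean rollbackStarted = !reacquired && root.setRollback();  // T+8
                                          // T+9  B: breaks out, nothing re-enqueued
        root.release();                   // T+10 A: release(child), running = 1
        return new Outcome(root.state, root.running, rollbackStarted,
                           lockHeld, scheduler.size());
    }

    public static void main(String[] args) {
        Outcome o = replay();
        System.out.println("state=" + o.state + " running=" + o.running
            + " rollbackStarted=" + o.rollbackStarted
            + " lockHeld=" + o.lockHeld + " queued=" + o.queued);
    }
}
```

Under these assumptions the replay ends in the state the table shows at T+10: FAILED, lock still held, an empty scheduler queue, and no rollback started. The model stops where the table stops and does not attempt to show what a fix would look like.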
--
This message was sent by Atlassian Jira
(v8.20.10#820010)