[
https://issues.apache.org/jira/browse/HBASE-29713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Viraj Jasani resolved HBASE-29713.
----------------------------------
Fix Version/s: 2.5.14
Hadoop Flags: Reviewed
Resolution: Fixed
> HMaster loosing track of SplitWalProcedure because race condition while
> suspending and waking up the procedure
> --------------------------------------------------------------------------------------------------------------
>
> Key: HBASE-29713
> URL: https://issues.apache.org/jira/browse/HBASE-29713
> Project: HBase
> Issue Type: Bug
> Components: proc-v2
> Affects Versions: 2.5.0, 2.5.13
> Reporter: Umesh Kumar Kumawat
> Assignee: Umesh Kumar Kumawat
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.5.14
>
>
> When there are too many splitWalProecedure at a time and there are no worker
> available we suspend the procedure that try to acquire the worker and then
> wake the procedure again once some procedure get completed and it releases
> the worker.
> ([code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitWALManager.java#L155-L164])
>
> {code:java}
> public ServerName acquireSplitWALWorker(Procedure<?> procedure)
> throws ProcedureSuspendedException {
> Optional<ServerName> worker = splitWorkerAssigner.acquire();
> if (worker.isPresent()) {
> LOG.debug("Acquired split WAL worker={}", worker.get());
> return worker.get();
> }
> splitWorkerAssigner.suspend(procedure);
> throw new ProcedureSuspendedException();
> }{code}
>
> [Definition|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitWALManager.java#L228C1-L231C6]
> of splitWorkerAssigner.suspend(procedure)
>
> {code:java}
> public void suspend(Procedure<?> proc) {
> event.suspend();
> event.suspendIfNotReady(proc);
> }
> {code}
>
> [Definition of both event.suspend() and
> event.suspendIfNotReady(proc)|https://github.com/apache/hbase/blob/branch-2.5/hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureEvent.java#L47-L60]
> {code:java}
> /** Mark the event as not ready. */
> public synchronized void suspend() {
> ready = false;
> if (LOG.isTraceEnabled()) {
> LOG.trace("Suspend " + toString());
> }
> }
> /**
> * Returns true if event is not ready and adds procedure to suspended
> queue, else returns false.
> */
> public synchronized boolean suspendIfNotReady(Procedure proc) {
> if (!ready) {
> suspendedProcedures.addLast(proc);
> }
> return !ready;
> }
> {code}
>
> Notice In event.suspendIfNotReady(proc) we add the procedure to
> suspendedProcedure list only if read is false.
>
> *Flow of release the worker*
> Once a SplitWalProcedure got completed and a worker get freed, we wake this
> event and mark ready true.
> ([code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitWALManager.java#L233-L237])
> {code:java}
> public void wake(MasterProcedureScheduler scheduler) {
> if (!event.isReady()) {
> event.wake(scheduler);
> }
> }{code}
>
> The suspend() and wake() methods in SplitWorkerAssigner are not synchronized,
> allowing this race condition.
> Race Condition Timeline:
> Thread-1 (Suspending Procedure): Thread-2 (Releasing Worker):
> 1. suspend() called
> 2. event.suspend() called
> → Sets ready = false
> 3. [THREAD PAUSES before next line] 4. wake() called
> 5.
> if (!event.isReady()) → TRUE
> 6.
> event.wake(scheduler) caleed
>
> → Sets ready = true
>
> → Wakes suspended queue (empty!)
> 7. event.suspendIfNotReady(proc) at line 232
> → Checks ready = true
> → Does NOT add procedure to suspended queue
> 8. ProcedureExecuter consdier this procedure marked as SUSPENDED. Now there
> is not way to add this proceudre back to schedulerQueue.
> → NEVER WOKEN UP!
>
> In 2.6 branch when we introduce WorkerAssigner.java as part of
> SnapshotProcedure, we changed these methods to be syschronized, that solved
> the problem.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)