[ 
https://issues.apache.org/jira/browse/HBASE-29713?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Viraj Jasani resolved HBASE-29713.
----------------------------------
    Fix Version/s: 2.5.14
     Hadoop Flags: Reviewed
       Resolution: Fixed

> HMaster loosing track of SplitWalProcedure because race condition while 
> suspending and waking up the procedure
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-29713
>                 URL: https://issues.apache.org/jira/browse/HBASE-29713
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2
>    Affects Versions: 2.5.0, 2.5.13
>            Reporter: Umesh Kumar Kumawat
>            Assignee: Umesh Kumar Kumawat
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.5.14
>
>
> When there are too many splitWalProecedure at a time and there are no worker 
> available we suspend the procedure that try to acquire the worker and then 
> wake the procedure again once some procedure get completed and it releases 
> the worker. 
> ([code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitWALManager.java#L155-L164])
>  
> {code:java}
> public ServerName acquireSplitWALWorker(Procedure<?> procedure)
>   throws ProcedureSuspendedException {
>   Optional<ServerName> worker = splitWorkerAssigner.acquire();
>   if (worker.isPresent()) {
>     LOG.debug("Acquired split WAL worker={}", worker.get());
>     return worker.get();
>   }
>   splitWorkerAssigner.suspend(procedure);
>   throw new ProcedureSuspendedException();
> }{code}
>  
> [Definition|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitWALManager.java#L228C1-L231C6]
>  of splitWorkerAssigner.suspend(procedure)
>  
> {code:java}
>     public void suspend(Procedure<?> proc) {
>       event.suspend();
>       event.suspendIfNotReady(proc);
>     }
> {code}
>  
> [Definition of both event.suspend() and 
> event.suspendIfNotReady(proc)|https://github.com/apache/hbase/blob/branch-2.5/hbase-procedure/src/main/java/org/apache/hadoop/hbase/procedure2/ProcedureEvent.java#L47-L60]
> {code:java}
> /** Mark the event as not ready. */
>   public synchronized void suspend() {
>     ready = false;
>     if (LOG.isTraceEnabled()) {
>       LOG.trace("Suspend " + toString());
>     }
>   }
>   /**
>    * Returns true if event is not ready and adds procedure to suspended 
> queue, else returns false.
>    */
>   public synchronized boolean suspendIfNotReady(Procedure proc) {
>     if (!ready) {
>       suspendedProcedures.addLast(proc);
>     }
>     return !ready;
>   }
> {code}
>  
> Notice In event.suspendIfNotReady(proc) we add the procedure to 
> suspendedProcedure list only if read is false.
>  
> *Flow of release the worker*
> Once a SplitWalProcedure got completed and a worker get freed, we wake this 
> event and mark ready true. 
> ([code|https://github.com/apache/hbase/blob/branch-2.5/hbase-server/src/main/java/org/apache/hadoop/hbase/master/SplitWALManager.java#L233-L237])
> {code:java}
>     public void wake(MasterProcedureScheduler scheduler) {
>       if (!event.isReady()) {
>         event.wake(scheduler);
>       }
>     }{code}
>  
> The suspend() and wake() methods in SplitWorkerAssigner are not synchronized, 
> allowing this race condition.
>   Race Condition Timeline:
>   Thread-1 (Suspending Procedure):           Thread-2 (Releasing Worker):
>   1. suspend() called 
>   2. event.suspend() called
>      → Sets ready = false
>   3. [THREAD PAUSES before next line]        4. wake() called 
>                                                                           5. 
> if (!event.isReady()) → TRUE
>                                                                           6. 
> event.wake(scheduler) caleed
>                                                                               
> → Sets ready = true
>                                                                               
> → Wakes suspended queue (empty!)
>   7. event.suspendIfNotReady(proc) at line 232
>      → Checks ready = true
>      → Does NOT add procedure to suspended queue
>   8. ProcedureExecuter consdier this procedure marked as SUSPENDED. Now there 
> is not way to add this proceudre back to schedulerQueue.
>      → NEVER WOKEN UP!
>  
> In 2.6 branch when we introduce WorkerAssigner.java as part of 
> SnapshotProcedure, we changed these methods to be syschronized, that solved 
> the problem.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to