[jira] [Commented] (HBASE-30143) ProcedureExecutor orphans FAILED procedures with holdLock=true when setRollback() races with child release()

Kiran Kumar Maturi (Jira) Wed, 06 May 2026 10:47:09 -0700


    [ 
https://issues.apache.org/jira/browse/HBASE-30143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18078745#comment-18078745
 ]


Kiran Kumar Maturi commented on HBASE-30143:
--------------------------------------------

SplitTableRegionProcedure failed at SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS
because recovered edits were found for the parent region.

2026-04-01 16:18:24,423 ERROR [PEWorker-46] assignment.SplitTableRegionProcedure: Splitting fcc017f900f94981ad490e291dd70dfe, pid=14060510, state=RUNNABLE:SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS, locked=true; SplitTableRegionProcedure table=tsdb, parent=fcc017f900f94981ad490e291dd70dfe, daughterA=a7439e4c913b08c90c2ca6be66d46683, daughterB=f67ce33a4fcf4cc4f9bc8c829857dbf1
java.io.IOException: Recovered.edits are found in Region: {ENCODED => fcc017f900f94981ad490e291dd70dfe, NAME => 'tsdb,...,fcc017f900f94981ad490e291dd70dfe.', STARTKEY => '...', ENDKEY => '...'}, abort split/merge to prevent data loss
    at org.apache.hadoop.hbase.master.assignment.AssignmentManagerUtil.checkClosedRegion(AssignmentManagerUtil.java:307)
    at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.checkClosedRegions(SplitTableRegionProcedure.java:282)
    at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.executeFromState(SplitTableRegionProcedure.java:313)
    at org.apache.hadoop.hbase.master.assignment.SplitTableRegionProcedure.executeFromState(SplitTableRegionProcedure.java:107)
    at org.apache.hadoop.hbase.procedure2.StateMachineProcedure.execute(StateMachineProcedure.java:189)
    at org.apache.hadoop.hbase.procedure2.Procedure.doExecute(Procedure.java:962)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.execProcedure(ProcedureExecutor.java:1660)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor.executeProcedure(ProcedureExecutor.java:1417)
    at org.apache.hadoop.hbase.procedure2.ProcedureExecutor$WorkerThread.run(ProcedureExecutor.java:1976)

Logs before the issue
{code:java}
2026-04-01 16:18:23,833 INFO  [PEWorker-46] procedure2.ProcedureExecutor: Initialized subprocedures=[{pid=14060511, ppid=14060510, state=RUNNABLE; org.apache.hadoop.hbase.master.assignment.TransitRegionStateProcedure, table=tsdb, region=fcc017f900f94981ad490e291dd70dfe, UNASSIGN}]
2026-04-01 16:18:23,837 DEBUG [PEWorker-46] procedure2.ProcedureExecutor: Acquired lock for pid=14060510, state=RUNNABLE:SPLIT_TABLE_REGION_GET_SPLITTING_TABLE_REGIONS, locked=true; SplitTableRegionProcedure table=tsdb, parent=fcc017f900f94981ad490e291dd70dfe, daughterA=a7439e4c913b08c90c2ca6be66d46683, daughterB=f67ce33a4fcf4cc4f9bc8c829857dbf1
2026-04-01 16:18:23,847 INFO  [PEWorker-46] assignment.TransitRegionStateProcedure: Starting pid=14060511, ppid=14060510, state=RUNNABLE:REGION_STATE_TRANSITION_CLOSE; TransitRegionStateProcedure table=tsdb, region=fcc017f900f94981ad490e291dd70dfe, UNASSIGN
2026-04-01 16:18:23,857 INFO  [PEWorker-23] assignment.CloseRegionProcedure: pid=14060512, ppid=14060511, state=RUNNABLE; CloseRegionProcedure table=tsdb, region=fcc017f900f94981ad490e291dd70dfe, server=<rs-host>,16020,<startcode>
2026-04-01 16:18:24,412 INFO  [PEWorker-23] procedure2.ProcedureExecutor: Finished pid=14060512, ppid=14060511, state=SUCCESS; CloseRegionProcedure table=tsdb, region=fcc017f900f94981ad490e291dd70dfe, server=<rs-host>,16020,<startcode> in 545 msec
2026-04-01 16:18:24,421 INFO  [PEWorker-46] procedure2.ProcedureExecutor: Finished pid=14060511, ppid=14060510, state=SUCCESS; TransitRegionStateProcedure table=tsdb, region=fcc017f900f94981ad490e291dd70dfe, UNASSIGN in 565 msec
2026-04-01 16:18:24,436 DEBUG [PEWorker-46] procedure2.ProcedureExecutor: Child procedures of pid=14060510 finished; pid=14060511 SUCCESS, pid=14060512 SUCCESS
{code}
 

>  ProcedureExecutor orphans FAILED procedures with holdLock=true when 
> setRollback() races with child release() 
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: HBASE-30143
>                 URL: https://issues.apache.org/jira/browse/HBASE-30143
>             Project: HBase
>          Issue Type: Bug
>          Components: proc-v2, Region Assignment
>    Affects Versions: 2.6.5, 2.5.14
>         Environment: Any HBase deployment running splits/merges or other 
> StateMachineProcedures with holdLock()==true under concurrent worker load
>            Reporter: Kiran Kumar Maturi
>            Assignee: Kiran Kumar Maturi
>            Priority: Minor
>
> h3. Summary
> {{ProcedureExecutor.executeProcedure()}} can leave a
> {{StateMachineProcedure}} with {{holdLock()==true}} in an orphaned state:
> {{ProcedureState.FAILED}}, exclusive lock held, and not present on any
> scheduler queue. No event ever re-awakens it; the only recovery is master
> failover (via {{loadProcedures() -> failedList.forEach(scheduler::addBack)}}).
>
> In production we observed this as an HBase region stuck CLOSED for 5h 37m
> after a {{SplitTableRegionProcedure}} hit "Recovered.edits are found" during
> {{SPLIT_TABLE_REGIONS_CHECK_CLOSED_REGIONS}}. The region was completely
> unavailable to clients for the entire duration. Master failover released the
> lock and rollback finally ran.
>
> The cause is a race between two workers that occurs when a parent procedure
> calls {{setFailure()}} while a sibling/child procedure has not yet returned
> from {{procStack.release()}}.
>
> Relevant code paths (line numbers from branch-2.6):
> * {{ProcedureExecutor.executeProcedure()}}, lines 1414-1489: outer do-while loop.
> * {{RootProcedureState.setRollback()}}, line 85: guarded by {{running == 0 && state == FAILED}}.
> * {{RootProcedureState.acquire()}}, line 138: increments {{running}}; {{release()}} at line 150 decrements it.
> * {{ProcedureExecutor.releaseLock()}}, lines 1502-1509: skips the release when {{proc.holdLock(env)==true && !proc.isFinished()}}. {{isFinished()}} is only true for SUCCESS/ROLLEDBACK, NOT for FAILED.
>
> Timeline of the race:
>
> || T || Worker-A (child) || Worker-B (parent) || running || state ||
> | 0 | acquire(child) | - | 1 | RUNNING |
> | 1 | child execute returns SUCCESS | - | 1 | RUNNING |
> | 2 | countDownChildren → scheduler.addFront(parent) | - | 1 | RUNNING |
> | 3 | - | picks up parent | 1 | RUNNING |
> | 4 | - | acquire(parent) | 2 | RUNNING |
> | 5 | - | executeFromState throws, setFailure() | 2 | FAILED |
> | 6 | - | execProcedure returns | 2 | FAILED |
> | 7 | - | do-while re-enters, acquire() returns false | 2 | FAILED |
> | 8 | - | setRollback() returns false (running != 0) | 2 | FAILED |
> | 9 | - | else-branch, wasExecuted()==true, break; | 2 | FAILED |
> | 10 | release(child) | Worker-B returns | 1 | FAILED |
>
> From T+10 onward the procedure is FAILED and still holds the xlock, because
> {{holdLock=true}} prevented {{releaseLock()}} at T+6 from releasing it, and
> nothing re-enqueues the root. The child's {{countDownChildren}} wake-up was
> consumed at T+3 and there is no further event generator.
>  
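
The timeline in the quoted description can be replayed as a small single-threaded model. This is only a sketch under stated assumptions, not HBase code: the class OrphanRaceSketch, its fields, and the string-based scheduler queue are hypothetical stand-ins, and the single State enum collapses the root state and the procedure state into one. The guards mirror the ones quoted above (setRollback() requires running == 0 && state == FAILED, acquire() fails once the state is no longer RUNNING, releaseLock() is skipped when holdLock is true and the procedure is not finished), and the counter follows the values given in the timeline table.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical single-threaded replay of the timeline; a model, not HBase code. */
final class OrphanRaceSketch {
  enum State { RUNNING, FAILED, ROLLEDBACK, SUCCESS }

  int running = 0;                  // stand-in for the root's running counter
  State state = State.RUNNING;      // simplified: one state for root and procedure
  boolean xlockHeld = false;        // exclusive lock held by the parent
  boolean rollbackStarted = false;
  final Deque<String> schedulerQueue = new ArrayDeque<>(); // stand-in scheduler

  // A worker may start executing only while the state is RUNNING.
  boolean acquire() {
    if (state != State.RUNNING) return false;
    running++;
    return true;
  }

  void release() { running--; }

  // Rollback may start only when no worker is active and the state is FAILED.
  boolean setRollback() {
    if (running == 0 && state == State.FAILED) {
      rollbackStarted = true;
      return true;
    }
    return false;
  }

  // FAILED does not count as finished, per the description above.
  boolean isFinished() { return state == State.SUCCESS || state == State.ROLLEDBACK; }

  // The lock release is skipped for holdLock procedures that are not finished.
  void releaseLock(boolean holdLock) {
    if (holdLock && !isFinished()) return;
    xlockHeld = false;
  }

  static OrphanRaceSketch replay() {
    OrphanRaceSketch s = new OrphanRaceSketch();
    s.xlockHeld = true;                    // parent took its exclusive lock earlier
    s.acquire();                           // T0: Worker-A acquires for the child
    s.schedulerQueue.addFirst("parent");   // T2: child done, parent re-queued
    s.schedulerQueue.pollFirst();          // T3: Worker-B picks up the parent
    s.acquire();                           // T4: Worker-B acquires for the parent
    s.state = State.FAILED;                // T5: executeFromState throws, setFailure()
    s.releaseLock(true);                   // T6: skipped because holdLock is true
    boolean acquired = s.acquire();        // T7: returns false, state is FAILED
    if (!acquired) s.setRollback();        // T8: returns false, running != 0
    // T9: Worker-B breaks out of the loop without re-queueing anything.
    s.release();                           // T10: Worker-A releases the child
    return s;
  }

  public static void main(String[] args) {
    OrphanRaceSketch s = replay();
    System.out.println("state=" + s.state + " running=" + s.running
        + " xlockHeld=" + s.xlockHeld + " rollbackStarted=" + s.rollbackStarted
        + " queued=" + s.schedulerQueue.size());
  }
}
```

In this model the replay ends with the state FAILED, the lock still held, rollback never started, and an empty queue, which is the orphaned condition the description reports. The model is deterministic by construction, so it illustrates the ordering only and says nothing about how often the interleaving occurs in a live master.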



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
