Re: [PR] HDDS-15066. Read-Write Lock race leave stale references to container creating orphan replicas [ozone]

via GitHub Tue, 28 Apr 2026 23:59:42 -0700


Gargi-jais11 commented on PR #10109:
URL: https://github.com/apache/ozone/pull/10109#issuecomment-4341547514


   The same applies to markContainerUnhealthy when DiskBalancer runs in parallel. I 
believe it also needs to re-fetch the container after acquiring the writeLock.
   This path is not limited to one state: a container in CLOSED, QUASI_CLOSED, 
OPEN, or RECOVERING can be marked UNHEALTHY.
   
   ```
   Case 1: Container CLOSED/QUASI_CLOSED + DiskBalancer + Scanner in parallel

   DiskBalancer (DN1)                          Container Scanner (DN1)
   ──────────────────────────────              ──────────────────────────────────────────
   T1: selects C (CLOSED, Disk1)
       added to inProgressContainers

   T2: container.readLock() on OLD

   T3: copy Disk1 → Disk2                      scanner reads OLD files on Disk1
       (I/O in progress)                       finds checksum failure
                                               controller.markContainerUnhealthy(id, reason)
                                                 containerSet.getContainer(id)
                                                 → OLD container (Disk1)  ← stale ref
                                                 handler.markContainerUnhealthy(OLD, reason)
                                                 → writeLock() → BLOCKED (readLock held)

   T4: copy done, checksum verified
   T5: importContainer → NEW (CLOSED, Disk2)
   T6: containerSet.updateContainer(NEW)
       ← ContainerSet maps C → NEW (Disk2)

   T7: readUnlock() on OLD
                                                 writeLock ACQUIRED on OLD
                                                 state = CLOSED, not UNHEALTHY, volume not failed
                                                 OLD: CLOSED → UNHEALTHY   ← wrong container
                                                 writeUnlock
                                                 sendICR: C on DN1 = UNHEALTHY  ← stale, wrong

   T8: markContainerForDelete(OLD) → DELETED

   Final state:
     OLD (Disk1): DELETED
     NEW (Disk2): CLOSED  ← healthy, valid
     SCM view of container C:   C on DN1 = UNHEALTHY  ← wrong

   ```
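
The interleaving above can be replayed deterministically outside Ozone. The following is a minimal sketch, not Ozone code: `StaleRefRace`, `Replica`, and the string states are hypothetical stand-ins, and the map plays the role of ContainerSet. It only assumes that the container lock behaves like a `java.util.concurrent.locks.ReentrantReadWriteLock`.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class StaleRefRace {
  /** Hypothetical stand-in for a container replica; not Ozone's Container class. */
  static final class Replica {
    final String disk;
    volatile String state;
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    Replica(String disk, String state) {
      this.disk = disk;
      this.state = state;
    }
  }

  /** Replays T2..T8; returns {OLD state, state of the mapped replica, state reported to SCM}. */
  static String[] replay() {
    Map<Long, Replica> containerSet = new ConcurrentHashMap<>();
    Replica old = new Replica("Disk1", "CLOSED");
    containerSet.put(1L, old);
    String[] reported = new String[1];

    old.lock.readLock().lock();                 // T2: balancer holds the readLock on OLD
    Thread scanner = new Thread(() -> {
      Replica c = containerSet.get(1L);         // T3: lookup resolves to OLD
      c.lock.writeLock().lock();                // T3: blocked while the readLock is held
      try {
        c.state = "UNHEALTHY";                  // T7: no re-fetch, so this is still OLD
        reported[0] = c.state;                  // stands in for sendICR
      } finally {
        c.lock.writeLock().unlock();
      }
    });
    scanner.setDaemon(true);
    scanner.start();
    try {
      while (!old.lock.hasQueuedThreads()) {    // wait until the scanner is queued on the lock
        Thread.sleep(1);
      }
      containerSet.put(1L, new Replica("Disk2", "CLOSED")); // T5, T6: map now points to NEW
      old.lock.readLock().unlock();             // T7: scanner can now take the writeLock
      scanner.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    old.state = "DELETED";                      // T8: balancer deletes OLD
    return new String[] {old.state, containerSet.get(1L).state, reported[0]};
  }
}
```

Calling `StaleRefRace.replay()` yields the same final state as the timeline: OLD is DELETED, the mapped replica is CLOSED, and the reported state is UNHEALTHY.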
   
   **What SCM/RM does in response:**
   ```
   T9:  RM next cycle: ECUnderReplicationHandler.checkAndRemoveUnhealthyReplica()
          SCM replica record: DN1 has UNHEALTHY replica of C
          checks: is there a CLOSED replica for same index on another DN?
            → if YES: "prefer deleting the UNHEALTHY over CLOSED" → sendThrottledDeleteCommand(DN1)
            → if NO CLOSED elsewhere: "delete any UNHEALTHY" → sendThrottledDeleteCommand(DN1)
   T10: DN1 receives DeleteContainerCommand for C
          containerSet.getContainer(id) → NEW container (Disk2, CLOSED)
          NEW container DELETED  ← healthy valid replica gone

        Now container C is genuinely under-replicated.
        RM tries to fix it by replicating, but the replica it just deleted was the source.
   ```
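
The T10 step comes down to the datanode resolving the delete command by container ID alone. A small sketch of that lookup, assuming a plain map in place of ContainerSet; `DeleteByIdSketch`, `Replica`, and `handleDelete` are hypothetical names, not Ozone APIs:

```java
import java.util.HashMap;
import java.util.Map;

final class DeleteByIdSketch {
  /** Hypothetical replica record: the disk it lives on and its local state. */
  static final class Replica {
    final String disk;
    final String state;

    Replica(String disk, String state) {
      this.disk = disk;
      this.state = state;
    }
  }

  /** T10: the command is resolved by ID, so whatever is mapped right now gets removed. */
  static Replica handleDelete(Map<Long, Replica> containerSet, long containerId) {
    return containerSet.remove(containerId);
  }

  /** Replays T9..T10 and returns "disk/state" of the replica that was deleted. */
  static String replay() {
    Map<Long, Replica> containerSet = new HashMap<>();
    containerSet.put(1L, new Replica("Disk2", "CLOSED")); // after T6 the map holds NEW
    String scmViewOfDn1 = "UNHEALTHY";                    // stale report from T7

    if ("UNHEALTHY".equals(scmViewOfDn1)) {               // T9: both branches delete on DN1
      Replica removed = handleDelete(containerSet, 1L);
      return removed.disk + "/" + removed.state;
    }
    return "none";
  }
}
```

Here `DeleteByIdSketch.replay()` returns `Disk2/CLOSED`: the stale UNHEALTHY record described OLD, but the replica removed is the healthy NEW one.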
   
   **Outcome:** A healthy replica on Disk2 gets deleted, and container C becomes 
genuinely under-replicated.
   The critical window is between readUnlock (T7) and markContainerForDelete (T8). 
If the scanner's ICR reaches SCM and RM processes it before an FCR corrects the 
state, the delete command lands on DN1 and hits the healthy NEW container. This 
is why markContainerUnhealthy needs the re-fetch: the consequences of operating 
on the wrong container are irreversible.
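
One possible shape for the re-fetch is sketched below. It is an illustration under stated assumptions, not the change in this PR: `RefetchSketch`, `Replica`, and the two-argument `markUnhealthy` are hypothetical, and the choice to drop the scan result when the mapping has changed is an assumption (the checksum failure was found on the Disk1 files, so it says nothing about the Disk2 copy).

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantReadWriteLock;

final class RefetchSketch {
  /** Hypothetical stand-in for a container replica; not Ozone's Container class. */
  static final class Replica {
    final String disk;
    volatile String state;
    final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

    Replica(String disk, String state) {
      this.disk = disk;
      this.state = state;
    }
  }

  final Map<Long, Replica> containerSet = new ConcurrentHashMap<>();

  /**
   * Marks the scanned replica UNHEALTHY only if it is still the one mapped to id.
   * Returns the replica that was marked, or null if the scan result was dropped
   * because the replica was replaced while the caller waited for the writeLock.
   */
  Replica markUnhealthy(long id, Replica scanned) {
    scanned.lock.writeLock().lock();
    try {
      Replica current = containerSet.get(id);   // re-fetch after acquiring the writeLock
      if (current != scanned) {
        return null;                            // stale reference: do not mark, do not report
      }
      current.state = "UNHEALTHY";
      return current;
    } finally {
      scanned.lock.writeLock().unlock();
    }
  }
}
```

With this check, the scanner in the timeline above would find NEW in the map at T7, see that it differs from the OLD reference it scanned, and skip both the state change and the report.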


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

