Re: [PR] HDDS-15066. Read-Write Lock race leave stale references to container creating orphan replicas [ozone]

via GitHub Tue, 28 Apr 2026 21:25:06 -0700


Gargi-jais11 commented on PR #10109:
URL: https://github.com/apache/ozone/pull/10109#issuecomment-4340781753


   > This change looks good and I think it will solve the problem. However I 
think it reveals a problem with the code structure within the datanode. These 
services should not have to perform such complex locking to get a consistent 
view of a container and it would be very easy in the future for some other 
service to come along and not do things correctly, or indeed some other 
existing part of the DN code may also be doing things incorrectly already.
   > 
   > I think that all the container "logic" should be hidden behind an 
interface and all the services like disk balancer, or block deletion should 
call it, eg:
   > 
   > ```
   > containerManager.moveContainer(container_id, destination)
   > containerManager.deleteBlockList(container_id, blocksToDelete)
   > ...
   > etc
   > ```
   > 
   > Then all the locking etc happens behind the scenes in the same place where 
it can be controlled more closely.
   > 
   > Fixing this is a large exercise and not something we would want to take on 
in this PR.
   > 
   > There are other places in KeyValueHandler that take the container lock - 
are we sure they are OK or is any change needed in them too?
   
   I agree with u @sodonnel that all locking mechanism should be hidden behind 
an interface that it happens at the same place. 
   I was going through other write locks in the code base. I see that when **EC 
Under-replication + DiskBalancer**  there can be failure however it cannot be 
the problem but we can minimise the effort here.
   **Setup:** Container C (RS-6-3), shard index=2 exists only on DN1 
(under-replicated). DiskBalancer on DN1 simultaneously moves shard idx=2 from 
Disk1 → Disk2.
   
   ```
   The race:
   
   DiskBalancer (DN1)                    EC Reconstruction (DN5 coordinator)
   ─────────────────────────────────     
────────────────────────────────────────
   container.readLock()                  
     copy Disk1 → tmp/                                              pulls idx=0 
from DN3 ✓
     atomic move tmp → Disk2                                 pulls idx=1 from 
DN4 ✓
     containerSet.updateContainer()                   
containerSet.getContainer(C) → OLD obj (Disk1)
     readUnlock()                                                   tries 
writeLock → was blocked, now unblocked
     markContainerForDelete(OLD)                        OLD.state = DELETED
                                                                               
exportContainerData checks state → DELETED
                                                                              
throws IllegalStateException → FAILS
                                                                            ← 
entire ReconstructECContainersCommand fails
                                                                              
bandwidth for idx=0, idx=1 already wasted
   ```
   Net result: EC reconstruction fails just because the shard at DN1 has been 
marked as DELETED while it correctly pulled index from other DNs. Container C 
still has shard idx=2 missing. RM re-queues it next monitor cycle.
   
   Here we have two options:
   **Option 1 — Rely on RM to re-send (current behavior)**
   
   - RM re-queues the container after command failure/timeout
   - Next cycle: DN1's heartbeat has already reported the new path for idx=2 on 
Disk2
   - RM sends a fresh ReconstructECContainersCommand — this time export 
succeeds on the NEW container
   - Cost: 1 extra monitor cycle delay + bandwidth for idx=0, idx=1 already 
wasted in the failed attempt
   - Safe? Yes — no data loss, just delayed fix. But every failed attempt 
wastes bandwidth for all N-1 other shards too.
   
   **Option 2 — Re-fetch container from ContainerSet after getting the lock 
like others**
   In `exportContainerData`, after acquiring the lock, re-fetch the container 
from ContainerSet by ID, so even if the old container is marked DELETED the new 
container moved by diskbalancer will be fetched and will not fail 
ECReconstruction.
   
   Please let me know which option do @sodonnel @ChenSammiyou both prefer ? RM 
retry is simpler and already works — it just costs one extra cycle and wasted 
network I/O for every time the race window is hit. I just wanted to bring this 
in your knowledge and discuss.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] HDDS-15066. Read-Write Lock race leave stale references to container creating orphan replicas [ozone]

Reply via email to