Gargi-jais11 commented on PR #10109:
URL: https://github.com/apache/ozone/pull/10109#issuecomment-4427909615

   @sodonnel @ChenSammi 
   Yesterday night I tried implementing above diskbalancer flow but it seems to 
be complicated and bug prone.
   **(save old paths → readUnlock → writeLock → update 
volume/metadata/chunks/db paths → unlock → cleanup Disk1 from saved locals)**. 
_@sodonnel  direction is right in principle: one live Container object means no 
more updateContainer-style staleness for anything that always goes through 
getContainer(id) and the container lock._
   
   In practice I found it quite easy to get wrong, especially **Phase C 
(marking/deleting the old replica):** the live object must never be used for 
Disk1 cleanup, V3 DB/cache handling needs care, crash recovery between phases 
has to be spelled out, and the r**eadUnlock → writeLock** gap still needs a 
post-lock state check .
   
   Separately, @ChenSammi  already called out the bigger issue: many call sites 
read `KeyValueContainerData` paths without holding the container lock. In-place 
updates to several fields are atomic under the write lock, but not 
automatically safe for those **unsynchronized readers** unless we address 
visibility and torn reads (e.g. volatile/immutable snapshot) and review a lot 
of DN code — so the “fix in one place” story is not complete without that wider 
work.
   I’m leaning to keep the re-fetch/retry fix for this staleness issue.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to