Gargi-jais11 commented on PR #10109:
URL: https://github.com/apache/ozone/pull/10109#issuecomment-4340781753
> This change looks good and I think it will solve the problem. However I
think it reveals a problem with the code structure within the datanode. These
services should not have to perform such complex locking to get a consistent
view of a container and it would be very easy in the future for some other
service to come along and not do things correctly, or indeed some other
existing part of the DN code may also be doing things incorrectly already.
>
> I think that all the container "logic" should be hidden behind an
interface and all the services like disk balancer, or block deletion should
call it, eg:
>
> ```
> containerManager.moveContainer(container_id, destination)
> containerManager.deleteBlockList(container_id, blocksToDelete)
> ...
> etc
> ```
>
> Then all the locking etc happens behind the scenes in the same place where
it can be controlled more closely.
>
> Fixing this is a large exercise and not something we would want to take on
in this PR.
>
> There are other places in KeyValueHandler that take the container lock -
are we sure they are OK or is any change needed in them too?
I agree with u @sodonnel that all locking mechanism should be hidden behind
an interface that it happens at the same place.
I was going through other write locks in the code base. I see that when **EC
Under-replication + DiskBalancer** there can be failure however it cannot be
the problem but we can minimise the effort here.
**Setup:** Container C (RS-6-3), shard index=2 exists only on DN1
(under-replicated). DiskBalancer on DN1 simultaneously moves shard idx=2 from
Disk1 → Disk2.
```
The race:
DiskBalancer (DN1) EC Reconstruction (DN5 coordinator)
─────────────────────────────────
────────────────────────────────────────
container.readLock()
copy Disk1 → tmp/ pulls idx=0
from DN3 ✓
atomic move tmp → Disk2 pulls idx=1 from
DN4 ✓
containerSet.updateContainer()
containerSet.getContainer(C) → OLD obj (Disk1)
readUnlock() tries
writeLock → was blocked, now unblocked
markContainerForDelete(OLD) OLD.state = DELETED
exportContainerData checks state → DELETED
throws IllegalStateException → FAILS
←
entire ReconstructECContainersCommand fails
bandwidth for idx=0, idx=1 already wasted
```
Net result: EC reconstruction fails just because the shard at DN1 has been
marked as DELETED while it correctly pulled index from other DNs. Container C
still has shard idx=2 missing. RM re-queues it next monitor cycle.
Here we have two options:
**Option 1 — Rely on RM to re-send (current behavior)**
- RM re-queues the container after command failure/timeout
- Next cycle: DN1's heartbeat has already reported the new path for idx=2 on
Disk2
- RM sends a fresh ReconstructECContainersCommand — this time export
succeeds on the NEW container
- Cost: 1 extra monitor cycle delay + bandwidth for idx=0, idx=1 already
wasted in the failed attempt
- Safe? Yes — no data loss, just delayed fix. But every failed attempt
wastes bandwidth for all N-1 other shards too.
**Option 2 — Re-fetch container from ContainerSet after getting the lock
like others**
In `exportContainerData`, after acquiring the lock, re-fetch the container
from ContainerSet by ID, so even if the old container is marked DELETED the new
container moved by diskbalancer will be fetched and will not fail
ECReconstruction.
Please let me know which option do @sodonnel @ChenSammiyou both prefer ? RM
retry is simpler and already works — it just costs one extra cycle and wasted
network I/O for every time the race window is hit. I just wanted to bring this
in your knowledge and discuss.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]