Gargi Jaiswal created HDDS-15524:
------------------------------------
Summary: [DiskBalancer] Container parallel moves can overwrite
pending source replica deletions
Key: HDDS-15524
URL: https://issues.apache.org/jira/browse/HDDS-15524
Project: Apache Ozone
Issue Type: Sub-task
Affects Versions: 2.2.0
Reporter: Gargi Jaiswal
Assignee: Gargi Jaiswal
In {{DiskBalancerService.java}}, *at lines 642-643:*
{code:java}
pendingDeletionContainers.put(clock.millis() + replicaDeletionDelay,
container);{code}
{quote}After a successful container move, the old replica is queued for
deletion in {{{}pendingDeletionContainers{}}}, keyed by {{{}clock.millis() +
replicaDeletionDelay{}}}. That key has only millisecond precision, so if two
moves finish in the same millisecond, they get the same key.
Because the {*}map stores one container per key{*}, the second {{put()}}
overwrites the first. The overwritten container is never scheduled for
deletion, so its old replica stays on disk and wastes space. With
{{{}parallelThread > 1{}}}, this is realistic under normal load.{quote}
The key is: *{color:#172b4d}{{clock.millis() + replicaDeletionDelay}}{color}*
Both parts are the same for every thread finishing at the same millisecond:
* {{{}*clock.millis()*{}}}: wall-clock time at millisecond resolution. All JVM
threads share the same clock.
* {{{}*replicaDeletionDelay*{}}}: a single constant (default 5 minutes =
300,000 ms) shared by the whole service.
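The collision can be reproduced in isolation. The sketch below is not the service code: the class name, the {{ConcurrentSkipListMap}} map type, and the use of strings in place of container objects are assumptions made for illustration. It pins the clock to a single millisecond with {{Clock.fixed}} and performs the same {{put()}} twice.

```java
import java.time.Clock;
import java.time.Instant;
import java.time.ZoneOffset;
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical stand-alone demo; names and map type are assumptions.
class PendingDeletionCollisionDemo {
  // Default delay quoted in this issue: 5 minutes.
  static final long REPLICA_DELETION_DELAY_MS = 300_000L;

  /** Queues two old replicas at the same clock reading and returns the map. */
  static Map<Long, String> queueTwoAtSameMillis() {
    // A fixed clock simulates two moves finishing in the same millisecond.
    Clock clock = Clock.fixed(Instant.ofEpochMilli(1_000_000L), ZoneOffset.UTC);
    Map<Long, String> pending = new ConcurrentSkipListMap<>();

    // Same key pattern as the snippet above: clock.millis() + delay.
    pending.put(clock.millis() + REPLICA_DELETION_DELAY_MS, "C-101_old_replica");
    // Identical key, so this put() replaces the first value.
    pending.put(clock.millis() + REPLICA_DELETION_DELAY_MS, "C-202_old_replica");
    return pending;
  }

  public static void main(String[] args) {
    // Prints {1300000=C-202_old_replica}
    System.out.println(queueTwoAtSameMillis());
  }
}
```

Only one entry survives even though two replicas were queued. In the real service the two {{put()}} calls come from different threads, but the outcome for the map is the same.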
*Step-by-step analysis*
Assume {color:#00875a}{{replicaDeletionDelay = 300,000 ms}}{color}
and {color:#00875a}{{parallelThread = 5}}{color}.
Five container moves run in parallel. Moves for containers C-101 and C-202 both
finish at {color:#00875a}{{clock.millis() = 1,000,000}}{color}:
{code:java}
Thread-1 (moving C-101):
  key = 1,000,000 + 300,000 = 1,300,000
  pendingDeletionContainers.put(1_300_000, C-101_old_replica)
  Map now: { 1_300_000 → C-101_old }

Thread-2 (moving C-202), same millisecond:
  key = 1,000,000 + 300,000 = 1,300,000    <------ identical key!
  pendingDeletionContainers.put(1_300_000, C-202_old_replica)
  Map now: { 1_300_000 → C-202_old }        <------ C-101_old silently GONE{code}
C-101's old replica has been permanently lost from the pending-deletion queue.
It will never be scheduled for deletion.
* C-202's old replica gets deleted correctly.
* C-101's old replica is never visited. It sits on the source disk
indefinitely.
* The container is marked {{{}DELETED{}}} in metadata (line 605), so it won't
be served.
* But its data directory (chunks, {{{}container.db{}}}, {{{}.container{}}}
descriptor) remains on the source disk.
* {{{}decrementUsedSpace{}}} is only called inside {{{}deleteContainer(){}}},
so the source volume's used-space counter is never corrected.
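One possible way to avoid the overwrite is to keep a queue of containers per timestamp instead of a single container. The sketch below is an illustration under stated assumptions, not the fix adopted for this issue: the class {{PendingDeletionQueueSketch}} and its methods {{enqueue}} and {{drainDue}} are hypothetical, and container IDs are plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch: one queue of containers per deletion timestamp.
class PendingDeletionQueueSketch {
  private final ConcurrentSkipListMap<Long, Queue<String>> pending =
      new ConcurrentSkipListMap<>();

  /** Registers a replica for deletion; equal timestamps share one queue. */
  void enqueue(long deleteAtMillis, String containerId) {
    // computeIfAbsent returns whichever queue ends up mapped to the key,
    // so concurrent callers with the same key all append to the same queue.
    pending.computeIfAbsent(deleteAtMillis, k -> new ConcurrentLinkedQueue<>())
        .add(containerId);
  }

  /** Removes and returns every replica whose deletion time has passed. */
  List<String> drainDue(long nowMillis) {
    List<String> due = new ArrayList<>();
    Map.Entry<Long, Queue<String>> first;
    while ((first = pending.firstEntry()) != null
        && first.getKey() <= nowMillis) {
      Queue<String> queue = pending.remove(first.getKey());
      if (queue != null) {
        due.addAll(queue);
      }
    }
    return due;
  }

  public static void main(String[] args) {
    PendingDeletionQueueSketch sketch = new PendingDeletionQueueSketch();
    sketch.enqueue(1_300_000L, "C-101");
    sketch.enqueue(1_300_000L, "C-202");
    // Prints [C-101, C-202]
    System.out.println(sketch.drainDue(1_300_000L));
  }
}
```

This relies on keys being written well ahead of the time they are drained (the deletion delay), so a producer is not expected to append to a queue that the drainer is removing at the same moment. Other designs, such as a queue of (timestamp, container) pairs, would also avoid the shared-key problem.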