Arun Sarin created HDDS-15688:
---------------------------------
Summary: DiskBalancer reverts to RUNNING when diskBalancer.info
persist fails in nodeStateUpdated() during decommission/maintenance
Key: HDDS-15688
URL: https://issues.apache.org/jira/browse/HDDS-15688
Project: Apache Ozone
Issue Type: Bug
Components: Ozone Datanode
Affects Versions: 2.3.0
Reporter: Arun Sarin
Attachments: repro-diskbalancer-decommission-persist-failure.sh,
repro_diskbalancer_decommission_persist_20260626_131028.txt,
repro_diskbalancer_decommission_persist_20260626_131028_datanode_filtered.log,
repro_diskbalancer_decommission_persist_20260626_131028_datanode_full.log,
repro_diskbalancer_decommission_persist_20260626_131028_scm_filtered.log,
repro_diskbalancer_decommission_persist_20260626_131028_scm_full.log
Hi [~gargijaiswal] / [~sammichen],
Flagging a safety issue in the persist-failure path added around
{{nodeStateUpdated()}} in {{DiskBalancerService}}.
When a DN enters {{DECOMMISSIONING}} or {{ENTERING_MAINTENANCE}}, the
service correctly sets {{operationalState = PAUSED}} and calls
{{writeDiskBalancerInfoTo()}}. If that throws {{IOException}}, the
catch block reverts to {{originalServiceState}}, which is {{RUNNING}} when we
were trying to pause for decommission. {{getTasks()}} then keeps scheduling
moves on a node that is already leaving service.
The revert-to-last-persisted-state logic makes sense for {{PAUSED → RUNNING}}
(resume fails → stay paused). It is unsafe for {{RUNNING → PAUSED}} (pause
fails → must stay paused in memory).
[https://github.com/apache/ozone/blob/29b2ad5fcd40642c77f771128e7f0f7c4fe88e42/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/diskbalancer/DiskBalancerService.java#L853-L864]
In production, the same failure can occur in any of the following cases:
1. Decommission during metadata volume pressure
User decommissions a DN while the metadata disk is nearly full (RocksDB
container DBs, SCM heartbeat state, {{diskBalancer.info}}, etc. on the same
volume). The atomic write creates {{diskBalancer.info.tmp}} but the rename to
{{diskBalancer.info}} fails with no space. DN is already
{{DECOMMISSIONING}}; DiskBalancer reverts to {{RUNNING}} and may still be
moving containers off that node while SCM is draining it.
2. Datanode in Maintenance window
A node is put into {{ENTERING_MAINTENANCE}} during a disk/hardware incident on
the metadata mount. Same path as decommission: pause is attempted, persist
fails, and the state reverts to {{RUNNING}}. Maintenance is meant to stop user
I/O and background work, yet DiskBalancer keeps running.
3. Race with concurrent admin action
{{ozone admin datanode diskbalancer start}} (or stop) and decommission overlap.
Both paths persist via {{applyDiskBalancerInfo()}} /
{{nodeStateUpdated()}}. A failed write on the decommission pause path still
reverts to {{RUNNING}} regardless of what the admin command intended.
h3. Repro (confirmed on master)
Compose cluster, 5 DNs, RF=3, {{ozone-2.3.0-SNAPSHOT}}.
# Start DiskBalancer on IN_SERVICE DN → RPC {{RUNNING}}
# Break persist: {{rm -rf /data/metadata/diskBalancer.info && mkdir /data/metadata/diskBalancer.info}}
# {{ozone admin datanode decommission <dn-ip>:19864}}
# DN reaches {{DECOMMISSIONING}}; DiskBalancer RPC still {{RUNNING}}
*Automated repro script:*
[^repro-diskbalancer-decommission-persist-failure.sh]
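Why step 2 breaks the persist can be illustrated with a small standalone sketch. It assumes the write follows the tmp-file-then-rename pattern described in case 1 above; the class and method names ({{AtomicInfoWriteSketch}}, {{writeAtomically}}, {{persistFailsWhenTargetIsDirectory}}) are hypothetical and this is not the actual {{writeDiskBalancerInfoTo()}} code. Once {{diskBalancer.info}} is a directory, renaming a regular file onto it throws an {{IOException}}.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

/** Illustrative only: a tmp-then-rename write, not Ozone's real persistence code. */
final class AtomicInfoWriteSketch {

  /** Writes content to "<target>.tmp", then atomically renames it onto target. */
  static void writeAtomically(Path target, String content) throws IOException {
    Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
    Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
    // Fails with an IOException if the rename cannot be performed,
    // for example when the target path is a directory.
    Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
  }

  /** Mirrors repro step 2: the target path is a directory, so persisting fails. */
  static boolean persistFailsWhenTargetIsDirectory() throws IOException {
    Path dir = Files.createTempDirectory("diskbalancer-sketch");
    Path target = dir.resolve("diskBalancer.info");
    Files.createDirectory(target); // same effect as: mkdir .../diskBalancer.info
    try {
      writeAtomically(target, "PAUSED");
      return false;
    } catch (IOException e) {
      return true;
    }
  }

  public static void main(String[] args) throws IOException {
    System.out.println("persist failed: " + persistFailsWhenTargetIsDirectory());
  }
}
```

The same {{IOException}} path would be taken for a full metadata volume; the directory trick is just a deterministic way to trigger it in a compose cluster.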
h3. Expected vs actual
|| ||Expected||Actual||
|DN op state after decommission|{{DECOMMISSIONING}}|{{DECOMMISSIONING}} ✓|
|DiskBalancer in-memory / RPC|{{PAUSED}}|{{RUNNING}} ✗|
h3. Suggested fix
On persist failure in {{nodeStateUpdated()}}:
* If we were pausing ({{operationalState == PAUSED}} after transition):
keep {{PAUSED}} in memory and log that the state is retained despite the
persist failure
* If we were resuming ({{operationalState == RUNNING}} after transition):
revert to {{PAUSED}} (current revert behavior is already safe here)
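A minimal sketch of the direction-aware handling suggested above, to make the intent concrete. Everything in it is a hypothetical stand-in: {{PauseSafeRevertSketch}}, the two-value {{State}} enum, the {{Persister}} interface, and {{transitionTo()}} are not part of {{DiskBalancerService}}, and the real service has more states and its own persistence call. The only point is that the catch block branches on the direction of the transition instead of always reverting.

```java
import java.io.IOException;

/** Hypothetical stand-in for the service; names are illustrative, not Ozone's API. */
final class PauseSafeRevertSketch {

  enum State { RUNNING, PAUSED }

  /** Hypothetical hook standing in for the diskBalancer.info write. */
  interface Persister {
    void persist(State state) throws IOException;
  }

  private State operationalState;
  private final Persister persister;

  PauseSafeRevertSketch(State initial, Persister persister) {
    this.operationalState = initial;
    this.persister = persister;
  }

  State getOperationalState() {
    return operationalState;
  }

  /** Applies a transition; a persist failure never leaves the service RUNNING by accident. */
  void transitionTo(State target) {
    State original = operationalState;
    operationalState = target;
    try {
      persister.persist(target);
    } catch (IOException e) {
      if (target == State.PAUSED) {
        // Pausing: keep PAUSED in memory even though it could not be persisted.
        System.err.println("Persist failed; retaining PAUSED in memory: " + e.getMessage());
      } else {
        // Resuming: fall back to the previous state, which is the safe direction.
        operationalState = original;
        System.err.println("Persist failed; reverting to " + original + ": " + e.getMessage());
      }
    }
  }

  public static void main(String[] args) {
    Persister alwaysFails = state -> {
      throw new IOException("simulated persist failure");
    };
    PauseSafeRevertSketch svc = new PauseSafeRevertSketch(State.RUNNING, alwaysFails);
    svc.transitionTo(State.PAUSED);
    System.out.println("after failed pause: " + svc.getOperationalState());
  }
}
```

With this shape, the on-disk file can still say {{RUNNING}} after a failed pause, so a restart of the DN would need the existing startup check of the node's operational state to re-pause; that part is outside the sketch.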