[jira] [Created] (HDDS-15688) DiskBalancer reverts to RUNNING when diskBalancer.info persist fails in nodeStateUpdated() during decommission/maintenance

Arun Sarin (Jira) Fri, 26 Jun 2026 01:16:09 -0700

Arun Sarin created HDDS-15688:
---------------------------------

             Summary: DiskBalancer reverts to RUNNING when diskBalancer.info 
persist fails in nodeStateUpdated() during decommission/maintenance
                 Key: HDDS-15688
                 URL: https://issues.apache.org/jira/browse/HDDS-15688
             Project: Apache Ozone
          Issue Type: Bug
          Components: Ozone Datanode
    Affects Versions: 2.3.0
            Reporter: Arun Sarin
         Attachments: repro-diskbalancer-decommission-persist-failure.sh, 
repro_diskbalancer_decommission_persist_20260626_131028.txt, 
repro_diskbalancer_decommission_persist_20260626_131028_datanode_filtered.log, 
repro_diskbalancer_decommission_persist_20260626_131028_datanode_full.log, 
repro_diskbalancer_decommission_persist_20260626_131028_scm_filtered.log, 
repro_diskbalancer_decommission_persist_20260626_131028_scm_full.log


Hi [~gargijaiswal] / [~sammichen] ,

Flagging a safety issue in the persist-failure path added around 
{{nodeStateUpdated()}} in {{DiskBalancerService}}.

When a DN enters {{DECOMMISSIONING}} or {{ENTERING_MAINTENANCE}}, the 
service correctly sets {{operationalState = PAUSED}} and calls 
{{writeDiskBalancerInfoTo()}}. If that throws {{IOException}}, the 
catch block reverts to {{originalServiceState}}, which is {{RUNNING}} when we 
were trying to pause for decommission. {{getTasks()}} then keeps scheduling 
moves on a node that is already leaving service.

The revert-to-last-persisted-state logic makes sense for {{PAUSED → RUNNING}} 
(resume fails → stay paused). It is unsafe for {{RUNNING → PAUSED}} (pause 
fails → must stay paused in memory).

[https://github.com/apache/ozone/blob/29b2ad5fcd40642c77f771128e7f0f7c4fe88e42/hadoop-hdds/container-service/src/main/java/org/apache/hadoop/ozone/container/diskbalancer/DiskBalancerService.java#L853-L864]
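
A minimal model of the behavior described above may help when reading the linked lines. This is a sketch, not the Ozone source: the class name, the {{Persister}} interface (standing in for {{writeDiskBalancerInfoTo()}}) and the method {{pauseForNodeStateChange()}} are hypothetical; only {{operationalState}} and {{originalServiceState}} are names taken from this report.

```java
import java.io.IOException;

/** Simplified model of the revert-on-failure path; not the actual DiskBalancerService. */
public class RevertOnPersistFailure {
  enum State { RUNNING, PAUSED }

  /** Hypothetical stand-in for writeDiskBalancerInfoTo(). */
  interface Persister {
    void persist(State state) throws IOException;
  }

  private State operationalState = State.RUNNING;

  State getOperationalState() {
    return operationalState;
  }

  /** Pause for decommission/maintenance, reverting when the persist fails. */
  void pauseForNodeStateChange(Persister persister) {
    State originalServiceState = operationalState;
    operationalState = State.PAUSED;
    try {
      persister.persist(operationalState);
    } catch (IOException e) {
      // Revert to the last persisted state. When we were pausing a
      // running balancer, that state is RUNNING.
      operationalState = originalServiceState;
    }
  }

  public static void main(String[] args) {
    RevertOnPersistFailure service = new RevertOnPersistFailure();
    // Simulate a persist failure such as a full metadata volume.
    service.pauseForNodeStateChange(s -> {
      throw new IOException("No space left on device");
    });
    System.out.println(service.getOperationalState()); // prints RUNNING
  }
}
```

In this model the pause request ends with the balancer in {{RUNNING}}, which is the unsafe outcome reported here.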

In production, the same can happen in any of the following cases: 

1. Decommission during metadata volume pressure

User decommissions a DN while the metadata disk is nearly full (RocksDB 
container DBs, SCM heartbeat state, {{diskBalancer.info}}, etc. on the same 
volume). The atomic write creates {{diskBalancer.info.tmp}} but the rename to 
{{diskBalancer.info}} fails with no space. DN is already 
{{DECOMMISSIONING}}; DiskBalancer reverts to {{RUNNING}} and may still be 
moving containers between volumes on that node while SCM is draining it.

2. Datanode in a maintenance window

Node is put into {{ENTERING_MAINTENANCE}} during a disk/hardware incident on the 
metadata mount. Same path as decommission: pause is attempted, persist fails, 
state reverts to {{RUNNING}}. Maintenance is meant to stop user I/O and background 
work; DiskBalancer keeps running.

3. Race with concurrent admin action

{{ozone admin datanode diskbalancer start}} (or stop) and decommission overlap. 
Both paths persist via {{applyDiskBalancerInfo()}} / 
{{nodeStateUpdated()}}. A failed write on the decommission pause path still 
reverts to {{RUNNING}} regardless of what the admin command intended.
h3. Repro (confirmed on master)

Compose cluster, 5 DNs, RF=3, {{ozone-2.3.0-SNAPSHOT}}.
 # Start DiskBalancer on an IN_SERVICE DN → DiskBalancer RPC reports {{RUNNING}}
 # Break persist: {{rm -rf /data/metadata/diskBalancer.info && mkdir /data/metadata/diskBalancer.info}}
 # {{ozone admin datanode decommission <dn-ip>:19864}}
 # DN reaches {{DECOMMISSIONING}}; DiskBalancer RPC still reports {{RUNNING}}
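
Step 2 works because a tmp-then-rename write cannot replace a directory with a regular file. The sketch below illustrates that mechanism only; {{atomicWrite()}} and {{persistFails()}} are hypothetical helpers, and it assumes the DN persists with a rename equivalent to {{Files.move(..., ATOMIC_MOVE)}}, which this report implies but which is not checked against the Ozone code here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class PersistOntoDirectory {

  /** Hypothetical tmp-then-rename writer; returns normally only if the rename succeeds. */
  static void atomicWrite(Path target, String content) throws IOException {
    Path tmp = target.resolveSibling(target.getFileName() + ".tmp");
    Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
    // Renaming a regular file onto an existing directory is rejected
    // by the OS, so this surfaces as an IOException.
    Files.move(tmp, target, StandardCopyOption.ATOMIC_MOVE);
  }

  /** Mirrors repro step 2: replace diskBalancer.info with a directory, then try to persist. */
  static boolean persistFails(Path metadataDir) throws IOException {
    Path info = metadataDir.resolve("diskBalancer.info");
    Files.deleteIfExists(info);
    Files.createDirectory(info);
    try {
      atomicWrite(info, "placeholder content");
      return false;
    } catch (IOException e) {
      return true;
    }
  }

  public static void main(String[] args) throws IOException {
    Path metadataDir = Files.createTempDirectory("metadata");
    System.out.println("persist failed: " + persistFails(metadataDir));
  }
}
```

On a POSIX file system the rename is expected to fail, leaving {{diskBalancer.info.tmp}} behind, which matches the failure shape described in case 1.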

*Automated repro script:*

*[^repro-diskbalancer-decommission-persist-failure.sh]*
h3. Expected vs actual
|| ||Expected||Actual||
|DN op state after decommission|{{DECOMMISSIONING}}|{{DECOMMISSIONING}} ✓|
|DiskBalancer in-memory / RPC|{{PAUSED}}|{{RUNNING}} ✗|
h3. Suggested fix

On persist failure in {{nodeStateUpdated()}}:
 * If we were pausing ({{operationalState == PAUSED}} after transition): 
keep {{PAUSED}} in memory and log that the state is retained despite the persist failure
 * If we were resuming ({{operationalState == RUNNING}} after transition): 
revert to {{PAUSED}} (the current revert behavior is already safe here)
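
The two rules above could be sketched roughly as follows. This is an illustration under assumptions, not a patch: the class, the {{Persister}} interface (standing in for {{writeDiskBalancerInfoTo()}}) and {{transitionTo()}} are hypothetical names, and real code would use the service's logger rather than {{System.err}}.

```java
import java.io.IOException;

/** Sketch of direction-aware handling of a persist failure; not the Ozone source. */
public class DirectionAwarePersistFailure {
  enum State { RUNNING, PAUSED }

  /** Hypothetical stand-in for writeDiskBalancerInfoTo(). */
  interface Persister {
    void persist(State state) throws IOException;
  }

  private State operationalState;

  DirectionAwarePersistFailure(State initial) {
    this.operationalState = initial;
  }

  State getOperationalState() {
    return operationalState;
  }

  /** Apply a node-state-driven transition and persist it. */
  void transitionTo(State target, Persister persister) {
    operationalState = target;
    try {
      persister.persist(target);
    } catch (IOException e) {
      if (target == State.PAUSED) {
        // Pausing: keep PAUSED in memory even though it was not persisted.
        System.err.println("Persist failed, retaining PAUSED in memory: " + e.getMessage());
      } else {
        // Resuming: fall back to PAUSED, the safe side of the transition.
        operationalState = State.PAUSED;
        System.err.println("Persist failed, staying PAUSED: " + e.getMessage());
      }
    }
  }

  public static void main(String[] args) {
    DirectionAwarePersistFailure service =
        new DirectionAwarePersistFailure(State.RUNNING);
    service.transitionTo(State.PAUSED, s -> {
      throw new IOException("No space left on device");
    });
    System.out.println(service.getOperationalState()); // prints PAUSED
  }
}
```

In this sketch both failure branches end in {{PAUSED}}, so a failed persist never leaves the balancer {{RUNNING}} in memory.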



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
