[jira] [Updated] (HDDS-13694) Container Balancer Stop Command Fails with Error as Already Stopped

Siddhant Sangwan (Jira) Thu, 25 Sep 2025 05:00:05 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-13694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Siddhant Sangwan updated HDDS-13694:
------------------------------------
    Fix Version/s: 2.1.0
       Resolution: Fixed
           Status: Resolved  (was: Patch Available)

> Container Balancer Stop Command Fails with Error as Already Stopped
> -------------------------------------------------------------------
>
>                 Key: HDDS-13694
>                 URL: https://issues.apache.org/jira/browse/HDDS-13694
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Sarveksha Yeshavantha Raju
>            Assignee: Sarveksha Yeshavantha Raju
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.1.0
>
>
> If {{stopBalancer}} is called when the balancer is already in the STOPPED 
> state, it throws an exception instead of returning successfully.
> Proposed change:
> 1. Split the current {{validateState(boolean expectedRunning)}} into two
> methods:
>  * {{validateEligibility()}} – checks leader-ready and safe-mode only.
>  * {{validateState(expectedRunning)}} – delegates to 
> {{validateEligibility()}} and then performs the running / stopped assertions.
> 2. Change {{stopBalancer()}} to call {{validateEligibility()}} instead of 
> {{validateState(true)}}, persist {{shouldRun = false}} before looking at 
> {{taskStatus}}, and then interrupt a running task if present.
> Roughly what the changes look like in code:
> {code:java}
> private void validateEligibility() throws 
> IllegalContainerBalancerStateException {
>   if (!scmContext.isLeaderReady()) {
>     LOG.warn("SCM is not leader ready");
>     throw new IllegalContainerBalancerStateException("SCM is not leader " +
>           "ready");
>   }
>   if (scmContext.isInSafeMode()) {
>     LOG.warn("SCM is in safe mode");
>     throw new IllegalContainerBalancerStateException("SCM is in safe mode");
>   }
> }
> private void validateState(boolean expectedRunning) throws 
> IllegalContainerBalancerStateException {
>   validateEligibility();
>   if (!expectedRunning && !canBalancerStart()) {
>     ...
>   }
>   if (expectedRunning && !canBalancerStop()) {
>     ...
>   }
> }
> public void stopBalancer()
>     throws IOException, IllegalContainerBalancerStateException {
>   Thread balancingThread = null;
>   lock.lock();
>   try {
>     validateEligibility();               // only leadership / safemode
>     saveConfiguration(config, false, 0);
>     
>     if (isBalancerRunning()) {
>       LOG.info("Trying to stop ContainerBalancer service.");
>       task.stop();
>       balancingThread = currentBalancingThread;
>     }
>   } finally {
>     lock.unlock();
>   }
>   if (balancingThread != null) {
>     blockTillTaskStop(balancingThread);
>   }
> }
> {code}
> The main change here is that {{saveConfiguration()}} is called regardless of 
> whether the in-memory state of the balancer is running or stopped. It 
> persists the fact that the balancer should not run to RocksDB and replicates 
> it via Ratis to the other SCMs. Then, if the balancer is not running, we do 
> nothing else. This makes the API idempotent and also fixes a race condition:
> 1. When a leader steps down, it stops the balancer thread locally but does 
> not flip the persisted flag ({{shouldRun}} stays true).
> Result: {{task != null}}, {{taskStatus == STOPPED}}.
> 2. When that SCM later regains leadership its {{notifyStatusChanged()}} 
> thread reads {{shouldRun = true}} and – because {{taskStatus == STOPPED}} – 
> starts a new balancer thread.
> 3. If, in the same time-window, an administrator issues {{stopBalancer}} from 
> the CLI, that method
>  * acquires the same lock first,
>  * calls {{validateState(true)}} which expects the balancer to be RUNNING,
>  * finds it STOPPED and throws an exception before persisting {{shouldRun = 
> false}}.
> The command fails without persisting the stop request, and the balancer 
> continues to run when it should have stopped. The changes proposed above fix 
> this race condition as well.
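
The race described in steps 1-3 can be reproduced with a small stand-alone model. This is only a sketch under stated assumptions: {{BalancerModel}}, {{stopStrict()}}, {{stopIdempotent()}} and {{onLeaderReady()}} are hypothetical names invented for illustration and are not Ozone classes or methods. Two plain booleans stand in for the persisted {{shouldRun}} flag and the in-memory task status, and the leadership callback is called sequentially rather than from a separate thread.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Hypothetical stand-in for the balancer; NOT the Ozone ContainerBalancer. */
class BalancerModel {
  private final ReentrantLock lock = new ReentrantLock();
  private boolean shouldRun; // models the flag persisted to RocksDB
  private boolean running;   // models the in-memory task status

  BalancerModel(boolean shouldRun, boolean running) {
    this.shouldRun = shouldRun;
    this.running = running;
  }

  /** Old behaviour: assert RUNNING first, so a stopped balancer throws. */
  void stopStrict() {
    lock.lock();
    try {
      if (!running) {
        // Thrown before the flag is flipped, so the stop intent is lost.
        throw new IllegalStateException("balancer is already stopped");
      }
      shouldRun = false;
      running = false;
    } finally {
      lock.unlock();
    }
  }

  /** Proposed behaviour: record the intent first, then stop if running. */
  void stopIdempotent() {
    lock.lock();
    try {
      shouldRun = false;
      if (running) {
        running = false;
      }
    } finally {
      lock.unlock();
    }
  }

  /** Models the leadership-regained path: restart only if the flag is set. */
  void onLeaderReady() {
    lock.lock();
    try {
      if (shouldRun && !running) {
        running = true;
      }
    } finally {
      lock.unlock();
    }
  }

  boolean isRunning() {
    return running;
  }

  boolean getShouldRun() {
    return shouldRun;
  }
}

class BalancerModelDemo {
  public static void main(String[] args) {
    // State left behind by a leader step-down: thread stopped, flag still true.
    BalancerModel strict = new BalancerModel(true, false);
    try {
      strict.stopStrict();
    } catch (IllegalStateException e) {
      System.out.println("strict stop failed: " + e.getMessage());
    }
    strict.onLeaderReady();
    System.out.println("strict: running=" + strict.isRunning());

    BalancerModel idempotent = new BalancerModel(true, false);
    idempotent.stopIdempotent();
    idempotent.stopIdempotent(); // a second call is harmless
    idempotent.onLeaderReady();
    System.out.println("idempotent: running=" + idempotent.isRunning());
  }
}
```

In this model the strict variant leaves the flag set, so the later leadership callback restarts the balancer, while the idempotent variant clears the flag first and the callback does nothing.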



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

