[jira] [Updated] (HDDS-13694) Container Balancer Stop Command Fails with Error as Already Stopped

Sarveksha Yeshavantha Raju (Jira) Sat, 27 Sep 2025 08:20:31 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-13694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Sarveksha Yeshavantha Raju updated HDDS-13694:
----------------------------------------------
    Description: 
The CLI attempted to stop the Container Balancer, but it was already in the 
STOPPED state, causing an exception instead of a success message.

Proposed change:

Split the current validateState(boolean expectedRunning) into two

methods:

validateEligibility() – checks leader-ready and safe-mode only.

validateState(expectedRunning) – delegates to validateEligibility() and then 
performs the running / stopped assertions.

Change stopBalancer() to call validateEligibility() instead of 
validateState(true), persist shouldRun = false before looking at taskStatus, 
and then interrupt a running task if present.

Roughly how the changes look like in code:



private void validateEligibility() throws 
IllegalContainerBalancerStateException {
  if (!scmContext.isLeaderReady()) {
    LOG.warn("SCM is not leader ready");
    throw new IllegalContainerBalancerStateException("SCM is not leader " +
          "ready");
  }
  if (scmContext.isInSafeMode()) {
    LOG.warn("SCM is in safe mode");
    throw new IllegalContainerBalancerStateException("SCM is in safe mode");
  }
}

private void validateState(boolean expectedRunning) throws 
IllegalContainerBalancerStateException {
  validateEligibility();
  if (!expectedRunning && !canBalancerStart()) {
    ...
  }
  if (expectedRunning && !canBalancerStop()) {
    ...
  }
}

public void stopBalancer()
    throws IOException, IllegalContainerBalancerStateException {

  Thread balancingThread = null;
  lock.lock();
  try {
    validateEligibility();               // only leadership / safemode

    saveConfiguration(config, false, 0);
    
    if (isBalancerRunning()) {
      LOG.info("Trying to stop ContainerBalancer service.");
      task.stop();
      balancingThread = currentBalancingThread;
    }
  } finally {
    lock.unlock();
  }

  if (balancingThread != null) {
    blockTillTaskStop(balancingThread);
  }
}
(@Sarveksha Yeshavantha Raju this is rough code, please do due diligence)

The main change here is the saveConfiguration() call being made regardless of 
whether the in-memory state of balancer is running or stopped. It persists the 
fact that balancer should not run to rocksDB and replicates it via Ratis to the 
other SCMs. Then, if balancer is not running, we do nothing else. This makes 
the API idempotent but also fixes a race condition:

When a leader steps down it stops the balancer thread locally but does not flip 
the persisted flag (shouldRun stays true).

Result: task != null, taskStatus == STOPPED.

When that SCM later regains leadership its notifyStatusChanged() thread reads 
shouldRun = true and – because taskStatus == STOPPED – starts a new balancer 
thread.

If, in the same time-window, an administrator issues stopBalancer from the CLI, 
that method

acquires the same lock first,

calls validateState(true) which expects the balancer to be RUNNING,

finds it STOPPED and throws an exception before persisting shouldRun = false.

The command silently fails and the balancer continues to run, when it should 
have actually stopped. The changes proposed above fix this race condition as 
well.

  was:The CLI attempted to stop the Container Balancer, but it was already in 
the STOPPED state, causing an exception instead of a success message.


> Container Balancer Stop Command Fails with Error as Already Stopped
> -------------------------------------------------------------------
>
>                 Key: HDDS-13694
>                 URL: https://issues.apache.org/jira/browse/HDDS-13694
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Sarveksha Yeshavantha Raju
>            Assignee: Sarveksha Yeshavantha Raju
>            Priority: Major
>              Labels: pull-request-available
>
> The CLI attempted to stop the Container Balancer, but it was already in the 
> STOPPED state, causing an exception instead of a success message.
> Proposed change:
> Split the current validateState(boolean expectedRunning) into two
> methods:
> validateEligibility() – checks leader-ready and safe-mode only.
> validateState(expectedRunning) – delegates to validateEligibility() and then 
> performs the running / stopped assertions.
> Change stopBalancer() to call validateEligibility() instead of 
> validateState(true), persist shouldRun = false before looking at taskStatus, 
> and then interrupt a running task if present.
> Roughly how the changes look like in code:
> private void validateEligibility() throws 
> IllegalContainerBalancerStateException {
>   if (!scmContext.isLeaderReady()) {
>     LOG.warn("SCM is not leader ready");
>     throw new IllegalContainerBalancerStateException("SCM is not leader " +
>           "ready");
>   }
>   if (scmContext.isInSafeMode()) {
>     LOG.warn("SCM is in safe mode");
>     throw new IllegalContainerBalancerStateException("SCM is in safe mode");
>   }
> }
> private void validateState(boolean expectedRunning) throws 
> IllegalContainerBalancerStateException {
>   validateEligibility();
>   if (!expectedRunning && !canBalancerStart()) {
>     ...
>   }
>   if (expectedRunning && !canBalancerStop()) {
>     ...
>   }
> }
> public void stopBalancer()
>     throws IOException, IllegalContainerBalancerStateException {
>   Thread balancingThread = null;
>   lock.lock();
>   try {
>     validateEligibility();               // only leadership / safemode
>     saveConfiguration(config, false, 0);
>     
>     if (isBalancerRunning()) {
>       LOG.info("Trying to stop ContainerBalancer service.");
>       task.stop();
>       balancingThread = currentBalancingThread;
>     }
>   } finally {
>     lock.unlock();
>   }
>   if (balancingThread != null) {
>     blockTillTaskStop(balancingThread);
>   }
> }
> (@Sarveksha Yeshavantha Raju this is rough code, please do due diligence)
> The main change here is the saveConfiguration() call being made regardless of 
> whether the in-memory state of balancer is running or stopped. It persists 
> the fact that balancer should not run to rocksDB and replicates it via Ratis 
> to the other SCMs. Then, if balancer is not running, we do nothing else. This 
> makes the API idempotent but also fixes a race condition:
> When a leader steps down it stops the balancer thread locally but does not 
> flip the persisted flag (shouldRun stays true).
> Result: task != null, taskStatus == STOPPED.
> When that SCM later regains leadership its notifyStatusChanged() thread reads 
> shouldRun = true and – because taskStatus == STOPPED – starts a new balancer 
> thread.
> If, in the same time-window, an administrator issues stopBalancer from the 
> CLI, that method
> acquires the same lock first,
> calls validateState(true) which expects the balancer to be RUNNING,
> finds it STOPPED and throws an exception before persisting shouldRun = false.
> The command silently fails and the balancer continues to run, when it should 
> have actually stopped. The changes proposed above fix this race condition as 
> well.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Updated] (HDDS-13694) Container Balancer Stop Command Fails with Error as Already Stopped

Reply via email to