Siddhant Sangwan created HDDS-13698:
---------------------------------------

             Summary: Race condition in Container Balancer start/stop HA flow
                 Key: HDDS-13698
                 URL: https://issues.apache.org/jira/browse/HDDS-13698
             Project: Apache Ozone
          Issue Type: Bug
          Components: SCM
    Affects Versions: 2.0.0
            Reporter: Siddhant Sangwan


1. When a leader steps down it stops the balancer thread locally but does not 
flip the persisted flag (shouldRun stays true).

Result: task != null, taskStatus == STOPPED.

2. When that SCM later regains leadership its notifyStatusChanged() thread 
reads shouldRun = true and – because taskStatus == STOPPED – starts a new 
balancer thread.

3. If, in the same time window, an administrator issues stopBalancer from the 
CLI, that method:
 - acquires the same lock first,
 - calls validateState(true), which expects the balancer to be RUNNING,
 - finds it STOPPED and throws an exception before persisting shouldRun = false.

4. The command fails silently and the balancer continues to run when it should 
have stopped.
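
The interleaving above can be replayed deterministically with a toy model. This is only a sketch: ToyBalancer and its method names are hypothetical stand-ins rather than the real ContainerBalancer class, the persisted flag is modelled as a plain field, and the lock and threads are replaced by a fixed call order.

```java
// Toy model only: ToyBalancer and its methods are hypothetical names,
// not the real Ozone ContainerBalancer API. No threads or Ratis involved.
class ToyBalancer {
  enum Status { RUNNING, STOPPED }

  boolean shouldRun = true;            // stands in for the persisted flag
  Status taskStatus = Status.RUNNING;  // stands in for the local task state

  // Step 1: stepping down stops the local thread but leaves shouldRun alone.
  void onStepDown() {
    taskStatus = Status.STOPPED;
  }

  // Step 2: on regaining leadership, restart if the persisted flag says so.
  void onLeaderReady() {
    if (shouldRun && taskStatus == Status.STOPPED) {
      taskStatus = Status.RUNNING;
    }
  }

  // Step 3: the current stop path checks for RUNNING before persisting.
  void stopBalancer() {
    if (taskStatus != Status.RUNNING) {
      throw new IllegalStateException("balancer is not running");
    }
    shouldRun = false;
    taskStatus = Status.STOPPED;
  }

  public static void main(String[] args) {
    ToyBalancer b = new ToyBalancer();
    b.onStepDown();
    try {
      b.stopBalancer();   // the admin's stop gets in before the restart
    } catch (IllegalStateException e) {
      System.out.println("stop rejected: " + e.getMessage());
    }
    b.onLeaderReady();    // restart proceeds because shouldRun is still true
    System.out.println("shouldRun=" + b.shouldRun
        + ", taskStatus=" + b.taskStatus);
  }
}
```

In this model the stop request is rejected, shouldRun stays true, and the balancer ends up RUNNING, which matches step 4.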

h2. Proposed fix:

Split the current validateState(boolean expectedRunning) into two methods:

1. validateEligibility() – checks leader-ready and safe-mode only.

2. validateState(expectedRunning) – delegates to validateEligibility() and then 
performs the running / stopped assertions.

Then change stopBalancer() to call validateEligibility() instead of 
validateState(true), persist shouldRun = false before looking at taskStatus, 
and then interrupt a running task if present.

Roughly how the changes look in code:
{code:java}
private void validateEligibility()
    throws IllegalContainerBalancerStateException {
  if (!scmContext.isLeaderReady()) {
    LOG.warn("SCM is not leader ready");
    throw new IllegalContainerBalancerStateException("SCM is not leader " +
          "ready");
  }
  if (scmContext.isInSafeMode()) {
    LOG.warn("SCM is in safe mode");
    throw new IllegalContainerBalancerStateException("SCM is in safe mode");
  }
}

private void validateState(boolean expectedRunning)
    throws IllegalContainerBalancerStateException {
  validateEligibility();
  if (!expectedRunning && !canBalancerStart()) {
    ...
  }
  if (expectedRunning && !canBalancerStop()) {
    ...
  }
}

public void stopBalancer()
    throws IOException, IllegalContainerBalancerStateException {

  Thread balancingThread = null;
  lock.lock();
  try {
    validateEligibility();               // only leadership / safemode

    saveConfiguration(config, false, 0); // persist shouldRun = false first

    if (isBalancerRunning()) {
      LOG.info("Trying to stop ContainerBalancer service.");
      task.stop();
      balancingThread = currentBalancingThread;
    }
  } finally {
    lock.unlock();
  }

  if (balancingThread != null) {
    blockTillTaskStop(balancingThread);
  }
}
{code}
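
As a rough check of the proposed ordering, the same interleaving can be replayed against a toy model of the fixed stop path. FixedToyBalancer, its fields and its method names are hypothetical and only mimic the shape of the snippet above: the persisted flag is a plain field, and locking and threads are left out.

```java
// Toy model only: FixedToyBalancer is a hypothetical stand-in, not the real
// Ozone ContainerBalancer. It mirrors the proposed ordering in stopBalancer().
class FixedToyBalancer {
  enum Status { RUNNING, STOPPED }

  boolean leaderReady = true;
  boolean inSafeMode = false;
  boolean shouldRun = true;            // stands in for the persisted flag
  Status taskStatus = Status.RUNNING;  // stands in for the local task state

  void onStepDown() {
    taskStatus = Status.STOPPED;       // shouldRun is left untouched
  }

  void onLeaderReady() {
    if (shouldRun && taskStatus == Status.STOPPED) {
      taskStatus = Status.RUNNING;
    }
  }

  // Checks leadership and safe mode only, never the task state.
  void validateEligibility() {
    if (!leaderReady) {
      throw new IllegalStateException("SCM is not leader ready");
    }
    if (inSafeMode) {
      throw new IllegalStateException("SCM is in safe mode");
    }
  }

  void stopBalancer() {
    validateEligibility();
    shouldRun = false;                 // persist the decision first
    if (taskStatus == Status.RUNNING) {
      taskStatus = Status.STOPPED;     // then stop a running task, if any
    }
  }

  public static void main(String[] args) {
    FixedToyBalancer b = new FixedToyBalancer();
    b.onStepDown();
    b.stopBalancer();   // accepted even though the task is already STOPPED
    b.onLeaderReady();  // no restart: shouldRun is now false
    System.out.println("shouldRun=" + b.shouldRun
        + ", taskStatus=" + b.taskStatus);
  }
}
```

With this ordering the model ends with shouldRun = false and the task STOPPED, while a stop issued in safe mode is still rejected before the flag is touched.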


