Siddhant Sangwan created HDDS-13698:
---------------------------------------
Summary: Race condition in Container Balancer start/stop HA flow
Key: HDDS-13698
URL: https://issues.apache.org/jira/browse/HDDS-13698
Project: Apache Ozone
Issue Type: Bug
Components: SCM
Affects Versions: 2.0.0
Reporter: Siddhant Sangwan
1. When a leader steps down, it stops the balancer thread locally but does not
flip the persisted flag (shouldRun stays true).
Result: task != null, taskStatus == STOPPED.
2. When that SCM later regains leadership, its notifyStatusChanged() thread
reads shouldRun = true and, because taskStatus == STOPPED, starts a new
balancer thread.
3. If, in the same time window, an administrator issues stopBalancer from the
CLI and that call acquires the lock before the notifyStatusChanged() thread
does, stopBalancer()
- calls validateState(true), which expects the balancer to be RUNNING,
- finds it STOPPED and throws an exception before persisting shouldRun = false.
4. The stop command fails without persisting shouldRun = false, so the balancer
started in step 2 keeps running when it should have been stopped.
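The interleaving above can be reproduced with a small toy model. This is only a sketch: BalancerRaceSketch, its in-memory shouldRun field (standing in for the persisted flag) and the simplified state check are hypothetical and are not the real ContainerBalancer code.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Toy model of the pre-fix stop path; NOT the real ContainerBalancer class. */
final class BalancerRaceSketch {
  enum Status { RUNNING, STOPPED }

  private final ReentrantLock lock = new ReentrantLock();
  // Assumption: a plain field stands in for the persisted shouldRun flag.
  private boolean shouldRun = true;
  // State left behind by the step-down in step 1.
  private Status taskStatus = Status.STOPPED;

  /** Pre-fix behaviour: the state check runs before the flag is persisted. */
  void stopBalancer() {
    lock.lock();
    try {
      if (taskStatus != Status.RUNNING) {
        throw new IllegalStateException("balancer is not running");
      }
      shouldRun = false;
      taskStatus = Status.STOPPED;
    } finally {
      lock.unlock();
    }
  }

  /** Leadership regained: restart if the persisted flag still says so. */
  void notifyStatusChanged() {
    lock.lock();
    try {
      if (shouldRun && taskStatus == Status.STOPPED) {
        taskStatus = Status.RUNNING;
      }
    } finally {
      lock.unlock();
    }
  }

  boolean isRunning() {
    lock.lock();
    try {
      return taskStatus == Status.RUNNING;
    } finally {
      lock.unlock();
    }
  }

  public static void main(String[] args) {
    BalancerRaceSketch scm = new BalancerRaceSketch();
    try {
      scm.stopBalancer();        // admin command wins the lock first
    } catch (IllegalStateException e) {
      System.out.println("stop rejected: " + e.getMessage());
    }
    scm.notifyStatusChanged();   // leader-change handler runs next
    System.out.println("running after stop request: " + scm.isRunning());
  }
}
```

In this model the stop request is rejected while shouldRun is still true, so the later notifyStatusChanged() call starts the balancer anyway.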
h2. Proposed fix:
Split the current validateState(boolean expectedRunning) into two methods:
1. validateEligibility() – checks leader-ready and safe-mode only.
2. validateState(expectedRunning) – delegates to validateEligibility() and then
performs the running / stopped assertions.
Then change stopBalancer() to call validateEligibility() instead of
validateState(true), persist shouldRun = false before looking at taskStatus,
and then interrupt a running task if present.
Roughly how the changes look in code:
{code:java}
private void validateEligibility() throws
    IllegalContainerBalancerStateException {
  if (!scmContext.isLeaderReady()) {
    LOG.warn("SCM is not leader ready");
    throw new IllegalContainerBalancerStateException("SCM is not leader " +
        "ready");
  }
  if (scmContext.isInSafeMode()) {
    LOG.warn("SCM is in safe mode");
    throw new IllegalContainerBalancerStateException("SCM is in safe mode");
  }
}

private void validateState(boolean expectedRunning) throws
    IllegalContainerBalancerStateException {
  validateEligibility();
  if (!expectedRunning && !canBalancerStart()) {
    ...
  }
  if (expectedRunning && !canBalancerStop()) {
    ...
  }
}

public void stopBalancer()
    throws IOException, IllegalContainerBalancerStateException {
  Thread balancingThread = null;
  lock.lock();
  try {
    validateEligibility(); // only leadership / safemode
    saveConfiguration(config, false, 0);
    if (isBalancerRunning()) {
      LOG.info("Trying to stop ContainerBalancer service.");
      task.stop();
      balancingThread = currentBalancingThread;
    }
  } finally {
    lock.unlock();
  }
  if (balancingThread != null) {
    blockTillTaskStop(balancingThread);
  }
}
{code}
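To sanity-check the ordering proposed above, here is a small toy model of the fixed stop path. It is a sketch under assumptions, not the actual patch: BalancerStopFixSketch, the leaderReady / inSafeMode fields and the in-memory shouldRun field (standing in for saveConfiguration) are hypothetical.

```java
import java.util.concurrent.locks.ReentrantLock;

/** Toy model of the proposed stop path; NOT the real ContainerBalancer class. */
final class BalancerStopFixSketch {
  enum Status { RUNNING, STOPPED }

  private final ReentrantLock lock = new ReentrantLock();
  // Assumptions: plain fields stand in for scmContext and the persisted flag.
  private boolean leaderReady = true;
  private boolean inSafeMode = false;
  private boolean shouldRun = true;
  private Status taskStatus = Status.STOPPED;

  /** Checks leadership and safe mode only, never the task status. */
  private void validateEligibility() {
    if (!leaderReady) {
      throw new IllegalStateException("SCM is not leader ready");
    }
    if (inSafeMode) {
      throw new IllegalStateException("SCM is in safe mode");
    }
  }

  /** Proposed behaviour: persist shouldRun = false before checking the task. */
  void stopBalancer() {
    lock.lock();
    try {
      validateEligibility();
      shouldRun = false;
      if (taskStatus == Status.RUNNING) {
        taskStatus = Status.STOPPED;
      }
    } finally {
      lock.unlock();
    }
  }

  /** Leadership regained: restart only if the persisted flag says so. */
  void notifyStatusChanged() {
    lock.lock();
    try {
      if (shouldRun && taskStatus == Status.STOPPED) {
        taskStatus = Status.RUNNING;
      }
    } finally {
      lock.unlock();
    }
  }

  boolean isRunning() {
    lock.lock();
    try {
      return taskStatus == Status.RUNNING;
    } finally {
      lock.unlock();
    }
  }

  boolean persistedShouldRun() {
    lock.lock();
    try {
      return shouldRun;
    } finally {
      lock.unlock();
    }
  }

  public static void main(String[] args) {
    BalancerStopFixSketch scm = new BalancerStopFixSketch();
    scm.stopBalancer();          // admin command wins the lock first
    scm.notifyStatusChanged();   // leader-change handler runs next
    System.out.println("running after stop request: " + scm.isRunning());
  }
}
```

Because the flag is cleared even when no task is running, the later notifyStatusChanged() call in this model finds shouldRun = false and does not start a new balancer thread.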