[
https://issues.apache.org/jira/browse/HDDS-13694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Sarveksha Yeshavantha Raju updated HDDS-13694:
----------------------------------------------
Description:
The CLI attempted to stop the Container Balancer, but it was already in the
STOPPED state, causing an exception instead of a success message.
Proposed change:
Split the current validateState(boolean expectedRunning) into two
methods:
validateEligibility() – checks leader-ready and safe-mode only.
validateState(expectedRunning) – delegates to validateEligibility() and then
performs the running / stopped assertions.
Change stopBalancer() to call validateEligibility() instead of
validateState(true), persist shouldRun = false before looking at taskStatus,
and then interrupt a running task if present.
Roughly how the changes look like in code:
private void validateEligibility() throws
IllegalContainerBalancerStateException {
if (!scmContext.isLeaderReady()) {
LOG.warn("SCM is not leader ready");
throw new IllegalContainerBalancerStateException("SCM is not leader " +
"ready");
}
if (scmContext.isInSafeMode()) {
LOG.warn("SCM is in safe mode");
throw new IllegalContainerBalancerStateException("SCM is in safe mode");
}
}
private void validateState(boolean expectedRunning) throws
IllegalContainerBalancerStateException {
validateEligibility();
if (!expectedRunning && !canBalancerStart()) {
...
}
if (expectedRunning && !canBalancerStop()) {
...
}
}
public void stopBalancer()
throws IOException, IllegalContainerBalancerStateException {
Thread balancingThread = null;
lock.lock();
try {
validateEligibility(); // only leadership / safemode
saveConfiguration(config, false, 0);
if (isBalancerRunning()) {
LOG.info("Trying to stop ContainerBalancer service.");
task.stop();
balancingThread = currentBalancingThread;
}
} finally {
lock.unlock();
}
if (balancingThread != null) {
blockTillTaskStop(balancingThread);
}
}
(@Sarveksha Yeshavantha Raju this is rough code, please do due diligence)
The main change here is the saveConfiguration() call being made regardless of
whether the in-memory state of balancer is running or stopped. It persists the
fact that balancer should not run to rocksDB and replicates it via Ratis to the
other SCMs. Then, if balancer is not running, we do nothing else. This makes
the API idempotent but also fixes a race condition:
When a leader steps down it stops the balancer thread locally but does not flip
the persisted flag (shouldRun stays true).
Result: task != null, taskStatus == STOPPED.
When that SCM later regains leadership its notifyStatusChanged() thread reads
shouldRun = true and – because taskStatus == STOPPED – starts a new balancer
thread.
If, in the same time-window, an administrator issues stopBalancer from the CLI,
that method
acquires the same lock first,
calls validateState(true) which expects the balancer to be RUNNING,
finds it STOPPED and throws an exception before persisting shouldRun = false.
The command silently fails and the balancer continues to run, when it should
have actually stopped. The changes proposed above fix this race condition as
well.
was:The CLI attempted to stop the Container Balancer, but it was already in
the STOPPED state, causing an exception instead of a success message.
> Container Balancer Stop Command Fails with Error as Already Stopped
> -------------------------------------------------------------------
>
> Key: HDDS-13694
> URL: https://issues.apache.org/jira/browse/HDDS-13694
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Sarveksha Yeshavantha Raju
> Assignee: Sarveksha Yeshavantha Raju
> Priority: Major
> Labels: pull-request-available
>
> The CLI attempted to stop the Container Balancer, but it was already in the
> STOPPED state, causing an exception instead of a success message.
> Proposed change:
> Split the current validateState(boolean expectedRunning) into two
> methods:
> validateEligibility() – checks leader-ready and safe-mode only.
> validateState(expectedRunning) – delegates to validateEligibility() and then
> performs the running / stopped assertions.
> Change stopBalancer() to call validateEligibility() instead of
> validateState(true), persist shouldRun = false before looking at taskStatus,
> and then interrupt a running task if present.
> Roughly how the changes look like in code:
> private void validateEligibility() throws
> IllegalContainerBalancerStateException {
> if (!scmContext.isLeaderReady()) {
> LOG.warn("SCM is not leader ready");
> throw new IllegalContainerBalancerStateException("SCM is not leader " +
> "ready");
> }
> if (scmContext.isInSafeMode()) {
> LOG.warn("SCM is in safe mode");
> throw new IllegalContainerBalancerStateException("SCM is in safe mode");
> }
> }
> private void validateState(boolean expectedRunning) throws
> IllegalContainerBalancerStateException {
> validateEligibility();
> if (!expectedRunning && !canBalancerStart()) {
> ...
> }
> if (expectedRunning && !canBalancerStop()) {
> ...
> }
> }
> public void stopBalancer()
> throws IOException, IllegalContainerBalancerStateException {
> Thread balancingThread = null;
> lock.lock();
> try {
> validateEligibility(); // only leadership / safemode
> saveConfiguration(config, false, 0);
>
> if (isBalancerRunning()) {
> LOG.info("Trying to stop ContainerBalancer service.");
> task.stop();
> balancingThread = currentBalancingThread;
> }
> } finally {
> lock.unlock();
> }
> if (balancingThread != null) {
> blockTillTaskStop(balancingThread);
> }
> }
> (@Sarveksha Yeshavantha Raju this is rough code, please do due diligence)
> The main change here is the saveConfiguration() call being made regardless of
> whether the in-memory state of balancer is running or stopped. It persists
> the fact that balancer should not run to rocksDB and replicates it via Ratis
> to the other SCMs. Then, if balancer is not running, we do nothing else. This
> makes the API idempotent but also fixes a race condition:
> When a leader steps down it stops the balancer thread locally but does not
> flip the persisted flag (shouldRun stays true).
> Result: task != null, taskStatus == STOPPED.
> When that SCM later regains leadership its notifyStatusChanged() thread reads
> shouldRun = true and – because taskStatus == STOPPED – starts a new balancer
> thread.
> If, in the same time-window, an administrator issues stopBalancer from the
> CLI, that method
> acquires the same lock first,
> calls validateState(true) which expects the balancer to be RUNNING,
> finds it STOPPED and throws an exception before persisting shouldRun = false.
> The command silently fails and the balancer continues to run, when it should
> have actually stopped. The changes proposed above fix this race condition as
> well.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]