[
https://issues.apache.org/jira/browse/HDDS-13694?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Siddhant Sangwan updated HDDS-13694:
------------------------------------
Fix Version/s: 2.1.0
Resolution: Fixed
Status: Resolved (was: Patch Available)
> Container Balancer Stop Command Fails with Error as Already Stopped
> -------------------------------------------------------------------
>
> Key: HDDS-13694
> URL: https://issues.apache.org/jira/browse/HDDS-13694
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Sarveksha Yeshavantha Raju
> Assignee: Sarveksha Yeshavantha Raju
> Priority: Major
> Labels: pull-request-available
> Fix For: 2.1.0
>
>
> If {{stopBalancer}} is called when balancer is already in the STOPPED state,
> it throws an exception instead of returning successfully.
> Proposed change:
> 1. Split the current {{validateState(boolean expectedRunning)}} into two
> methods:
> * {{validateEligibility()}} – checks leader-ready and safe-mode only.
> * {{validateState(expectedRunning)}} – delegates to
> {{validateEligibility()}} and then performs the running / stopped assertions.
> 2. Change {{stopBalancer()}} to call {{validateEligibility()}} instead of
> {{validateState(true)}}, persist {{shouldRun = false}} before looking at
> {{taskStatus}}, and then interrupt a running task if present.
> Roughly how the changes look like in code:
> {code:java}
> private void validateEligibility() throws
> IllegalContainerBalancerStateException {
> if (!scmContext.isLeaderReady()) {
> LOG.warn("SCM is not leader ready");
> throw new IllegalContainerBalancerStateException("SCM is not leader " +
> "ready");
> }
> if (scmContext.isInSafeMode()) {
> LOG.warn("SCM is in safe mode");
> throw new IllegalContainerBalancerStateException("SCM is in safe mode");
> }
> }
> private void validateState(boolean expectedRunning) throws
> IllegalContainerBalancerStateException {
> validateEligibility();
> if (!expectedRunning && !canBalancerStart()) {
> ...
> }
> if (expectedRunning && !canBalancerStop()) {
> ...
> }
> }
> public void stopBalancer()
> throws IOException, IllegalContainerBalancerStateException {
> Thread balancingThread = null;
> lock.lock();
> try {
> validateEligibility(); // only leadership / safemode
> saveConfiguration(config, false, 0);
>
> if (isBalancerRunning()) {
> LOG.info("Trying to stop ContainerBalancer service.");
> task.stop();
> balancingThread = currentBalancingThread;
> }
> } finally {
> lock.unlock();
> }
> if (balancingThread != null) {
> blockTillTaskStop(balancingThread);
> }
> }
> {code}
> The main change here is the {{saveConfiguration()}} call being made
> regardless of whether the in-memory state of balancer is running or stopped.
> It persists the fact that balancer should not run to rocksDB and replicates
> it via Ratis to the other SCMs. Then, if balancer is not running, we do
> nothing else. This makes the API idempotent but also fixes a race condition:
> 1. When a leader steps down it stops the balancer thread locally but does not
> flip the persisted flag ({{{}shouldRun{}}} stays true).
> Result: {{task != null}}, {{taskStatus == STOPPED}}.
> 2. When that SCM later regains leadership its {{notifyStatusChanged()}}
> thread reads {{shouldRun = true}} and – because {{taskStatus == STOPPED}} –
> starts a new balancer thread.
> 3. If, in the same time-window, an administrator issues {{stopBalancer}} from
> the CLI, that method
> * acquires the same lock first,
> * calls {{validateState(true)}} which expects the balancer to be RUNNING,
> * finds it STOPPED and throws an exception before persisting {{shouldRun =
> false}}.
> The command silently fails and the balancer continues to run, when it should
> have actually stopped. The changes proposed above fix this race condition as
> well.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]