Ivan Andika created HDDS-14834:
----------------------------------
Summary: SCM NetworkTopology race condition
Key: HDDS-14834
URL: https://issues.apache.org/jira/browse/HDDS-14834
Project: Apache Ozone
Issue Type: Bug
Reporter: Ivan Andika
We found a race condition on the cluster map between
DeadNodeHandler and HealthyReadOnlyNodeHandler:
* DeadNodeHandler: Removes the node from the topology
** Triggered by NodeStateManager#checkNodesHealth, which is called from the
periodic NodeStateManager#run health check (see scheduleNextHealthCheck)
* HealthyReadOnlyNodeHandler: Adds the node to the topology
** Triggered by a heartbeat from a DN that was resurrected
If DeadNodeHandler and HealthyReadOnlyNodeHandler run at the same time, we
might get this interleaving:
# DeadNodeHandler is invoked, but has not yet removed the node from the
network topology because it is still working on other things such as closing
containers, destroying pipelines, etc.
# HealthyReadOnlyNodeHandler runs because the DN is detected to be alive, and
adds the node to the network topology
# DeadNodeHandler removes the node from the network topology
The outcome is that the node does not exist in the topology although it is
healthy. This can cause issues with the placement policy.
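
The interleaving above can be replayed deterministically with a small
stand-alone sketch. This is not Ozone code: the class TopologyRaceSketch, its
fields, and its methods are hypothetical stand-ins for the two handlers and the
cluster map, and the node state is updated inside the handler only to keep the
example short. The guarded variant is one possible mitigation, assumed here for
illustration only (re-check the node state under a lock shared with the add
path before removing); it is not a fix proposed in this issue.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical stand-in for the SCM handlers and cluster map; not Ozone code.
final class TopologyRaceSketch {
  enum NodeState { HEALTHY_READONLY, DEAD }

  private final Set<String> topology = ConcurrentHashMap.newKeySet();
  private final Map<String, NodeState> states = new ConcurrentHashMap<>();
  private final Object lock = new Object();

  // Tail of the dead-node handler as described in the report:
  // the removal is unconditional.
  void deadHandlerRemove(String dn) {
    topology.remove(dn);
  }

  // Assumed mitigation: remove only if the node is still DEAD,
  // under the same lock the add path uses.
  void deadHandlerRemoveGuarded(String dn) {
    synchronized (lock) {
      if (states.get(dn) == NodeState.DEAD) {
        topology.remove(dn);
      }
    }
  }

  // Heartbeat path: mark the node alive and add it to the topology.
  void healthyReadOnlyHandler(String dn) {
    synchronized (lock) {
      states.put(dn, NodeState.HEALTHY_READONLY);
      topology.add(dn);
    }
  }

  // Replays steps 1-3 in order; returns whether the node ends up in the topology.
  static boolean replay(boolean guarded) {
    TopologyRaceSketch scm = new TopologyRaceSketch();
    String dn = "dn-1";
    scm.topology.add(dn);
    scm.states.put(dn, NodeState.DEAD); // step 1: dead handler started, removal still pending
    scm.healthyReadOnlyHandler(dn);     // step 2: resurrected DN is added back
    if (guarded) {
      scm.deadHandlerRemoveGuarded(dn); // step 3 with the state re-check
    } else {
      scm.deadHandlerRemove(dn);        // step 3 as reported
    }
    return scm.topology.contains(dn);
  }

  public static void main(String[] args) {
    System.out.println("unguarded: in topology = " + replay(false)); // false
    System.out.println("guarded:   in topology = " + replay(true));  // true
  }
}
```

In the unguarded replay the healthy node ends up missing from the topology,
which matches the outcome described above. In the guarded replay the late
removal becomes a no-op because the node is no longer DEAD.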