[
https://issues.apache.org/jira/browse/HDDS-14834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Andika updated HDDS-14834:
-------------------------------
Summary: SCM NetworkTopology race condition between DeadNodeHandler and
HealthyReadOnlyNodeHandler (was: SCM NetworkTopology race condition)
> SCM NetworkTopology race condition between DeadNodeHandler and
> HealthyReadOnlyNodeHandler
> -----------------------------------------------------------------------------------------
>
> Key: HDDS-14834
> URL: https://issues.apache.org/jira/browse/HDDS-14834
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Ivan Andika
> Priority: Major
>
> We found a race condition on the cluster map between DeadNodeHandler and
> HealthyReadOnlyNodeHandler during rolling DN restarts:
> * DeadNodeHandler: Removes the node from the topology
> ** Triggered by NodeStateManager#checkNodesHealth, called from the periodic
> health check in NodeStateManager#run (see scheduleNextHealthCheck)
> * HealthyReadOnlyNodeHandler: Adds the node to the topology
> ** Triggered by a heartbeat from a DN that has come back up
> If DeadNodeHandler and HealthyReadOnlyNodeHandler run at the same time, the
> following interleaving is possible:
> # DeadNodeHandler is invoked but has not yet removed the node from the
> network topology, since it is still working on other things such as closing
> containers and destroying pipelines.
> # HealthyReadOnlyNodeHandler runs because the DN is detected to be alive, and
> adds the node to the network topology.
> # DeadNodeHandler removes the node from the network topology.
> The outcome is that the node does not exist in the topology although it is
> healthy. This can cause issues with the placement policy since the topology
> information of the DN does not exist.
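
The interleaving above can be sketched as a small standalone Java program.
This is only an illustration, not SCM code: the Topology class, the node name
"dn-1", and the two threads are hypothetical stand-ins for the cluster map and
the two handlers, and the latches force the ordering that in a real cluster
would happen only by timing.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class TopologyRaceSketch {

    // Hypothetical stand-in for the SCM cluster map (not the real NetworkTopology API).
    static final class Topology {
        private final Set<String> nodes = ConcurrentHashMap.newKeySet();
        void add(String node) { nodes.add(node); }
        void remove(String node) { nodes.remove(node); }
        boolean contains(String node) { return nodes.contains(node); }
    }

    // Blocks on the latch, converting the checked exception for brevity.
    private static void await(CountDownLatch latch) {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    private static void join(Thread t) {
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }

    // Replays the three steps and reports whether the node is still in the topology.
    public static boolean runInterleaving() {
        Topology topology = new Topology();
        String dn = "dn-1";
        topology.add(dn);

        CountDownLatch deadHandlerStarted = new CountDownLatch(1);
        CountDownLatch nodeReAdded = new CountDownLatch(1);

        // Stand-in for DeadNodeHandler.
        Thread deadNodeHandler = new Thread(() -> {
            // Step 1: invoked, but still busy with other cleanup work.
            deadHandlerStarted.countDown();
            await(nodeReAdded);
            // Step 3: the late removal wipes out the fresh entry.
            topology.remove(dn);
        });

        // Stand-in for HealthyReadOnlyNodeHandler.
        Thread healthyReadOnlyHandler = new Thread(() -> {
            await(deadHandlerStarted);
            // Step 2: the restarted DN heartbeats and is added back.
            topology.add(dn);
            nodeReAdded.countDown();
        });

        deadNodeHandler.start();
        healthyReadOnlyHandler.start();
        join(deadNodeHandler);
        join(healthyReadOnlyHandler);

        return topology.contains(dn);
    }

    public static void main(String[] args) {
        // Prints "node in topology after race: false"
        System.out.println("node in topology after race: " + runInterleaving());
    }
}
```

Even though each individual add and remove on the set is thread-safe here, the
node ends up missing, which suggests the problem is the ordering of the two
handlers rather than the thread safety of the map itself.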