[
https://issues.apache.org/jira/browse/HDDS-14834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ivan Andika updated HDDS-14834:
-------------------------------
Description:
We found a race condition on the cluster map between
DeadNodeHandler and HealthyReadOnlyNodeHandler during rolling DN restarts.
* DeadNodeHandler: Removes the node from the topology
** Triggered by NodeStateManager#checkNodesHealth, the periodic health check
run from NodeStateManager#run (see scheduleNextHealthCheck)
* HealthyReadOnlyNodeHandler: Adds the node to the topology
** Triggered by a heartbeat from a DN that has come back up
If DeadNodeHandler and HealthyReadOnlyNodeHandler run at the same time, the
following interleaving is possible:
# DeadNodeHandler is invoked, but has not yet removed the node from the network
topology because it is still doing other work such as closing containers and
destroying pipelines
# HealthyReadOnlyNodeHandler runs because the DN is detected to be alive, and
adds the node to the network topology
# DeadNodeHandler removes the node from the network topology
The outcome is that the node is missing from the topology although it is
healthy. This can cause issues with the placement policy.
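
For illustration only, the interleaving can be replayed deterministically
with a minimal sketch. Everything here is an assumption made for the example:
the class TopologyRaceSketch, the method replayInterleaving, and the
node name "dn-1" are hypothetical, and a plain concurrent set stands in for
the SCM cluster map. It does not use the real Ozone handler or
NetworkTopology APIs. Latches force the three steps to happen in the order
listed.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical stand-in for the SCM cluster map; NOT the real Ozone API.
class TopologyRaceSketch {
  // Each add/remove is thread-safe on its own, yet the final state is wrong.
  private final Set<String> topology = ConcurrentHashMap.newKeySet();

  /**
   * Replays the three-step interleaving and returns true if the node is
   * still present in the topology afterwards.
   */
  static boolean replayInterleaving(String node) throws InterruptedException {
    TopologyRaceSketch sketch = new TopologyRaceSketch();
    sketch.topology.add(node); // the node was registered before it went dead

    CountDownLatch deadHandlerStarted = new CountDownLatch(1);
    CountDownLatch nodeReAdded = new CountDownLatch(1);

    // Plays the role of DeadNodeHandler.
    Thread deadHandler = new Thread(() -> {
      // Step 1: invoked, but still busy (closing containers, destroying
      // pipelines), so the topology removal has not happened yet.
      deadHandlerStarted.countDown();
      try {
        nodeReAdded.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      // Step 3: the late removal wipes out the re-added node.
      sketch.topology.remove(node);
    });

    // Plays the role of HealthyReadOnlyNodeHandler.
    Thread healthyHandler = new Thread(() -> {
      try {
        deadHandlerStarted.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      // Step 2: the DN heartbeats again, so it is added to the topology.
      sketch.topology.add(node);
      nodeReAdded.countDown();
    });

    deadHandler.start();
    healthyHandler.start();
    deadHandler.join();
    healthyHandler.join();
    return sketch.topology.contains(node);
  }

  public static void main(String[] args) throws InterruptedException {
    // Prints "node in topology after race: false"
    System.out.println("node in topology after race: "
        + replayInterleaving("dn-1"));
  }
}
```

The point of the sketch is that thread-safe individual operations are not
enough: the node ends up absent because the removal in step 3 is ordered
after the add in step 2, even though the node is healthy by then.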
was:
We found a race condition on the cluster map between
DeadNodeHandler and HealthyReadOnlyNodeHandler.
* DeadNodeHandler: Removes the node from the topology
** Triggered by NodeStateManager#checkNodesHealth, the periodic health check
run from NodeStateManager#run (see scheduleNextHealthCheck)
* HealthyReadOnlyNodeHandler: Adds the node to the topology
** Triggered by a heartbeat from a DN that has come back up
If DeadNodeHandler and HealthyReadOnlyNodeHandler run at the same time, the
following interleaving is possible:
# DeadNodeHandler is invoked, but has not yet removed the node from the network
topology because it is still doing other work such as closing containers and
destroying pipelines
# HealthyReadOnlyNodeHandler runs because the DN is detected to be alive, and
adds the node to the network topology
# DeadNodeHandler removes the node from the network topology
The outcome is that the node is missing from the topology although it is
healthy. This can cause issues with the placement policy.
> SCM NetworkTopology race condition
> ----------------------------------
>
> Key: HDDS-14834
> URL: https://issues.apache.org/jira/browse/HDDS-14834
> Project: Apache Ozone
> Issue Type: Bug
> Reporter: Ivan Andika
> Priority: Major
>
> We found a race condition on the cluster map between
> DeadNodeHandler and HealthyReadOnlyNodeHandler during rolling DN restarts.
> * DeadNodeHandler: Removes the node from the topology
> ** Triggered by NodeStateManager#checkNodesHealth, the periodic health check
> run from NodeStateManager#run (see scheduleNextHealthCheck)
> * HealthyReadOnlyNodeHandler: Adds the node to the topology
> ** Triggered by a heartbeat from a DN that has come back up
> If DeadNodeHandler and HealthyReadOnlyNodeHandler run at the same time, the
> following interleaving is possible:
> # DeadNodeHandler is invoked, but has not yet removed the node from the
> network topology because it is still doing other work such as closing
> containers and destroying pipelines
> # HealthyReadOnlyNodeHandler runs because the DN is detected to be alive, and
> adds the node to the network topology
> # DeadNodeHandler removes the node from the network topology
> The outcome is that the node is missing from the topology although it is
> healthy. This can cause issues with the placement policy.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)