[jira] [Updated] (HDDS-14834) SCM NetworkTopology race condition

Ivan Andika (Jira) Fri, 13 Mar 2026 23:54:11 -0700


     [ https://issues.apache.org/jira/browse/HDDS-14834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Ivan Andika updated HDDS-14834:
-------------------------------
    Description: 
We found a race condition on the cluster map between DeadNodeHandler and 
HealthyReadOnlyNodeHandler during rolling DN restarts:
 * DeadNodeHandler: removes the node from the topology
 ** Triggered by NodeStateManager#checkNodesHealth, called from the 
NodeStateManager#run health check, which runs periodically (see 
scheduleNextHealthCheck)
 * HealthyReadOnlyNodeHandler: adds the node to the topology
 ** Triggered by a heartbeat from a DN that has come back up

If DeadNodeHandler and HealthyReadOnlyNodeHandler run at the same time, the 
following interleaving is possible:
 # DeadNodeHandler is invoked but has not yet removed the node from the 
network topology, because it is still busy with other work such as closing 
containers and destroying pipelines
 # HealthyReadOnlyNodeHandler runs because the DN has been detected as alive, 
and adds the node to the network topology
 # DeadNodeHandler removes the node from the network topology

The outcome is that the node is missing from the topology although it is 
healthy. This can cause issues with the placement policy, since the topology 
information of the DN does not exist.
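
The interleaving described above can be replayed deterministically with a toy model. This is only a sketch under stated assumptions: ToyTopology, the node ID dn-1 and the two threads are hypothetical stand-ins, not the actual SCM NetworkTopology or handler code, and the latches exist only to force the ordering of steps 1 to 3.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

final class TopologyRaceSketch {

  /** Hypothetical stand-in for the cluster map: tracks node membership only. */
  static final class ToyTopology {
    private final Set<String> nodes = ConcurrentHashMap.newKeySet();
    void add(String dn) { nodes.add(dn); }
    void remove(String dn) { nodes.remove(dn); }
    boolean contains(String dn) { return nodes.contains(dn); }
  }

  /** Waits on a latch, converting interruption into an unchecked exception. */
  private static void await(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  /** Replays steps 1 to 3 and reports whether the node is still in the topology. */
  static boolean replayInterleaving() {
    ToyTopology topology = new ToyTopology();
    String dn = "dn-1"; // hypothetical node ID
    topology.add(dn);

    CountDownLatch deadHandlerStarted = new CountDownLatch(1);
    CountDownLatch nodeReAdded = new CountDownLatch(1);

    // Stand-in for DeadNodeHandler: starts first, removes the node last.
    Thread deadNodeHandler = new Thread(() -> {
      deadHandlerStarted.countDown(); // step 1: invoked, busy with other cleanup
      await(nodeReAdded);             // cleanup outlasts the node's re-registration
      topology.remove(dn);            // step 3: stale removal
    });

    // Stand-in for HealthyReadOnlyNodeHandler: re-adds the node in between.
    Thread healthyHandler = new Thread(() -> {
      await(deadHandlerStarted);
      topology.add(dn);               // step 2: node is alive again
      nodeReAdded.countDown();
    });

    deadNodeHandler.start();
    healthyHandler.start();
    try {
      deadNodeHandler.join();
      healthyHandler.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return topology.contains(dn);
  }

  public static void main(String[] args) {
    // Prints "dn-1 in topology: false": the node is healthy but missing.
    System.out.println("dn-1 in topology: " + replayInterleaving());
  }
}
```

Note that each individual add and remove in the sketch is thread-safe. The node still goes missing because the removal acts on a liveness decision that was already stale when it ran.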

  was:
We found a race condition on the cluster map between DeadNodeHandler and 
HealthyReadOnlyNodeHandler during rolling DN restarts:
 * DeadNodeHandler: removes the node from the topology
 ** Triggered by NodeStateManager#checkNodesHealth, called from the 
NodeStateManager#run health check, which runs periodically (see 
scheduleNextHealthCheck)
 * HealthyReadOnlyNodeHandler: adds the node to the topology
 ** Triggered by a heartbeat from a DN that has come back up

If DeadNodeHandler and HealthyReadOnlyNodeHandler run at the same time, the 
following interleaving is possible:
 # DeadNodeHandler is invoked but has not yet removed the node from the 
network topology, because it is still busy with other work such as closing 
containers and destroying pipelines
 # HealthyReadOnlyNodeHandler runs because the DN has been detected as alive, 
and adds the node to the network topology
 # DeadNodeHandler removes the node from the network topology

The outcome is that the node is missing from the topology although it is 
healthy. This can cause issues with the placement policy.


> SCM NetworkTopology race condition
> ----------------------------------
>
>                 Key: HDDS-14834
>                 URL: https://issues.apache.org/jira/browse/HDDS-14834
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Ivan Andika
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
