[jira] [Updated] (HDDS-14834) SCM NetworkTopology race condition

Ivan Andika (Jira) Fri, 13 Mar 2026 01:07:58 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-14834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ivan Andika updated HDDS-14834:
-------------------------------
    Description: 
We found a race condition on the cluster map between DeadNodeHandler and 
HealthyReadOnlyNodeHandler during rolling DN restarts:
 * DeadNodeHandler: Removes the node from the topology
 ** Triggered by NodeStateManager#checkNodesHealth, called from the periodic 
health check in NodeStateManager#run (see scheduleNextHealthCheck)
 * HealthyReadOnlyNodeHandler: Adds the node to the topology
 ** Triggered by a heartbeat from a DN that has come back up

If DeadNodeHandler and HealthyReadOnlyNodeHandler run at the same time, the 
following interleaving is possible:
 # DeadNodeHandler is invoked, but has not yet removed the node from the 
network topology because it is still doing other work, such as closing 
containers and destroying pipelines
 # HealthyReadOnlyNodeHandler runs because the DN is detected to be alive, and 
adds the node to the network topology
 # DeadNodeHandler removes the node from the network topology

The outcome is that the node does not exist in the topology although it is 
healthy. This can cause issues with the placement policy.
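
The interleaving above can be reproduced in isolation with a small sketch. This is not Ozone code: the class TopologyRaceSketch, the set standing in for the cluster map, the state strings, and the latches that force the ordering are all hypothetical and chosen only for illustration. The real handlers do considerably more work.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

public class TopologyRaceSketch {
  // Hypothetical stand-in for the SCM cluster map (not the real NetworkTopology API).
  public static final Set<String> topology = ConcurrentHashMap.newKeySet();
  // Hypothetical stand-in for the node state tracked by SCM.
  public static volatile String nodeState = "DEAD";

  /** Forces the three-step interleaving; returns whether the node is still in the topology. */
  public static boolean runInterleaving(String dn) {
    topology.clear();
    topology.add(dn);
    nodeState = "DEAD";
    CountDownLatch deadHandlerStarted = new CountDownLatch(1);
    CountDownLatch nodeReAdded = new CountDownLatch(1);

    Thread deadNodeHandler = new Thread(() -> {
      // Step 1: the handler has been invoked and is busy with other cleanup
      // (closing containers, destroying pipelines) before touching the topology.
      deadHandlerStarted.countDown();
      try {
        nodeReAdded.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      // Step 3: the stale removal runs after the node was re-added.
      topology.remove(dn);
    });

    Thread healthyReadOnlyNodeHandler = new Thread(() -> {
      try {
        deadHandlerStarted.await();
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return;
      }
      // Step 2: a heartbeat arrived, so the node is marked alive and added back.
      nodeState = "HEALTHY_READONLY";
      topology.add(dn);
      nodeReAdded.countDown();
    });

    deadNodeHandler.start();
    healthyReadOnlyNodeHandler.start();
    try {
      deadNodeHandler.join();
      healthyReadOnlyNodeHandler.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException("interrupted while waiting for handlers", e);
    }
    return topology.contains(dn);
  }

  public static void main(String[] args) {
    boolean inTopology = runInterleaving("dn-1");
    // Prints: state=HEALTHY_READONLY, inTopology=false
    System.out.println("state=" + nodeState + ", inTopology=" + inTopology);
  }
}
```

Using a thread-safe set does not help here: each individual add and remove is atomic, but nothing orders the late removal against the re-add, so the final state depends on which handler touches the topology last.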

  was:
We found that there is a race condition on the cluster map between 
DeadNodeHandler and HealthyReadOnlyNodeHandler
 * DeadNodeHandler: Removes the node from the topology
 ** Triggered by NodeStateManager#checkNodesHealth, called from the periodic 
health check in NodeStateManager#run (see scheduleNextHealthCheck)
 * HealthyReadOnlyNodeHandler: Adds the node to the topology
 ** Triggered by a heartbeat from a DN that has come back up

If DeadNodeHandler and HealthyReadOnlyNodeHandler run at the same time, the 
following interleaving is possible:
 # DeadNodeHandler is invoked, but has not yet removed the node from the 
network topology because it is still doing other work, such as closing 
containers and destroying pipelines
 # HealthyReadOnlyNodeHandler runs because the DN is detected to be alive, and 
adds the node to the network topology
 # DeadNodeHandler removes the node from the network topology

The outcome is that the node does not exist in the topology although it is 
healthy. This can cause issues with the placement policy.


> SCM NetworkTopology race condition
> ----------------------------------
>
>                 Key: HDDS-14834
>                 URL: https://issues.apache.org/jira/browse/HDDS-14834
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Ivan Andika
>            Priority: Major


