[jira] [Updated] (HDDS-14834) SCM NetworkTopology race condition between DeadNodeHandler and HealthyReadOnlyNodeHandler

Ivan Andika (Jira) Sat, 14 Mar 2026 19:37:21 -0700


     [ 
https://issues.apache.org/jira/browse/HDDS-14834?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Ivan Andika updated HDDS-14834:
-------------------------------
    Summary: SCM NetworkTopology race condition between DeadNodeHandler and 
HealthyReadOnlyNodeHandler  (was: SCM NetworkTopology race condition)

> SCM NetworkTopology race condition between DeadNodeHandler and 
> HealthyReadOnlyNodeHandler
> -----------------------------------------------------------------------------------------
>
>                 Key: HDDS-14834
>                 URL: https://issues.apache.org/jira/browse/HDDS-14834
>             Project: Apache Ozone
>          Issue Type: Bug
>            Reporter: Ivan Andika
>            Priority: Major
>
> We found a race condition on the cluster map between DeadNodeHandler and 
> HealthyReadOnlyNodeHandler during rolling DN restarts:
>  * DeadNodeHandler: Removes the node from the topology
>  ** Triggered by NodeStateManager#checkNodesHealth, which is called from the 
> periodic health check in NodeStateManager#run (see scheduleNextHealthCheck)
>  * HealthyReadOnlyNodeHandler: Adds the node to the topology
>  ** Triggered by a heartbeat from a DN that has come back up
> If DeadNodeHandler and HealthyReadOnlyNodeHandler run at the same time, we 
> might see this interleaving:
>  # DeadNodeHandler is invoked, but has not yet removed the node from the 
> network topology because it is still working on other things such as closing 
> containers, destroying pipelines, etc.
>  # HealthyReadOnlyNodeHandler runs because the DN is detected to be alive, and 
> adds the node to the network topology
>  # DeadNodeHandler removes the node from the network topology
> The outcome is that the node does not exist in the topology although it is 
> healthy. This can cause issues with the placement policy since the topology 
> information of the DN does not exist.
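
The interleaving described above can be reproduced deterministically in a
small standalone sketch. This is not Ozone code: the class
TopologyRaceSketch, its methods, and the "dn-1" node name are hypothetical
stand-ins, the topology is modeled as a plain concurrent set, and two latches
are used only to force the three steps into the order listed in the
description.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CountDownLatch;

// Hypothetical model of the race; none of these names are real Ozone APIs.
class TopologyRaceSketch {
  // Stand-in for the SCM cluster map (assumption: a simple set of node names).
  private final Set<String> topology = ConcurrentHashMap.newKeySet();
  // Latches exist only to pin down the interleaving from the description.
  private final CountDownLatch deadHandlerStarted = new CountDownLatch(1);
  private final CountDownLatch nodeReAdded = new CountDownLatch(1);

  private static void awaitQuietly(CountDownLatch latch) {
    try {
      latch.await();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  // Models DeadNodeHandler: other cleanup first, topology removal last.
  void onDeadNode(String dn) {
    deadHandlerStarted.countDown(); // step 1: invoked, still busy with cleanup
    awaitQuietly(nodeReAdded);      // cleanup window in which step 2 happens
    topology.remove(dn);            // step 3: stale removal
  }

  // Models HealthyReadOnlyNodeHandler: re-adds the node on heartbeat.
  void onHealthyReadOnlyNode(String dn) {
    awaitQuietly(deadHandlerStarted);
    topology.add(dn);               // step 2: node added while step 1 is pending
    nodeReAdded.countDown();
  }

  // Runs both handlers concurrently and reports whether the node survived.
  static boolean nodeInTopologyAfterRace() {
    TopologyRaceSketch sketch = new TopologyRaceSketch();
    String dn = "dn-1";
    sketch.topology.add(dn); // node was registered before it went dead

    Thread dead = new Thread(() -> sketch.onDeadNode(dn));
    Thread healthy = new Thread(() -> sketch.onHealthyReadOnlyNode(dn));
    dead.start();
    healthy.start();
    try {
      dead.join();
      healthy.join();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return sketch.topology.contains(dn);
  }

  public static void main(String[] args) {
    // The node is healthy again, yet it is missing from the topology.
    System.out.println("node in topology after race: "
        + nodeInTopologyAfterRace());
  }
}
```

In this model the final removal always wins, so the healthy node ends up
absent from the set. Each individual set operation is thread-safe, so the
sketch also shows that a thread-safe topology structure alone does not
prevent the problem: the remove in step 3 acts on a decision that was made
before step 2.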



