[
https://issues.apache.org/jira/browse/HDDS-13544?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Siddhant Sangwan updated HDDS-13544:
------------------------------------
Summary: DN Decommission Fails When Other Datanodes Are Offline Due to
Invalid Affinity Node in Ratis Replication (was: DN Decommission Fails When
Other Datanodes Are Offline Due to Invalid Affinity Node in EC Replication)
> DN Decommission Fails When Other Datanodes Are Offline Due to Invalid
> Affinity Node in Ratis Replication
> --------------------------------------------------------------------------------------------------------
>
> Key: HDDS-13544
> URL: https://issues.apache.org/jira/browse/HDDS-13544
> Project: Apache Ozone
> Issue Type: Bug
> Components: Ozone Datanode
> Reporter: Yashaswini G A
> Assignee: Siddhant Sangwan
> Priority: Major
>
> When testing Erasure Coded (EC) container behavior with partial node
> unavailability, Datanode decommissioning fails if other Datanodes holding EC
> blocks are already offline. The same failure occurs with Ratis replication,
> as the stack trace below shows. Decommissioning results in a placement error
> because the SCM attempts to use an affinity node that has already been
> removed from the network topology.
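> For reference, here is a minimal, self-contained sketch of the failure mode
> (not the real NetworkTopologyImpl or SCM placement code; the class and method
> names below are illustrative assumptions only): the dead node is removed from
> the topology, but is still passed to the placement call as the affinity node,
> which then rejects it.
> {code:java}
> import java.util.ArrayList;
> import java.util.HashSet;
> import java.util.List;
> import java.util.Random;
> import java.util.Set;
>
> // Simplified stand-in for the SCM network topology, for illustration only.
> public class AffinityNodeDemo {
>
>   static final Set<String> topology = new HashSet<>(List.of(
>       "/default-rack/dn-1", "/default-rack/dn-2", "/default-rack/dn-3"));
>
>   // Mirrors the behaviour in the stack trace: an affinity node that is no
>   // longer a member of the topology causes an IllegalArgumentException.
>   static String chooseRandom(String affinityNode) {
>     if (affinityNode != null && !topology.contains(affinityNode)) {
>       throw new IllegalArgumentException("Affinity node " + affinityNode
>           + " is not a member of topology");
>     }
>     List<String> candidates = new ArrayList<>(topology);
>     return candidates.get(new Random().nextInt(candidates.size()));
>   }
>
>   public static void main(String[] args) {
>     String deadNode = "/default-rack/dn-3";
>     // The dead node handler removes the dead datanode from the topology ...
>     topology.remove(deadNode);
>     // ... but the replication path still uses it as the affinity node.
>     chooseRandom(deadNode);  // throws IllegalArgumentException
>   }
> }
> {code}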
> Steps to reproduce:
> 1. Take 1 or 2 datanodes offline
> 2. Decommission a datanode
> {noformat}
> 2025-07-23 16:36:14,245 INFO
> [node1-EventQueue-DeadNodeForDeadNodeHandler]-org.apache.hadoop.hdds.scm.node.DeadNodeHandler:
> A dead datanode is detected.
> 51a8bc95-909a-4c09-ac62-326a3f11640f(ccycloud-5.quasar-xxorwd.root.comops.site/10.140.185.70)
> 2025-07-23 16:36:14,246 INFO
> [node1-EventQueue-DeadNodeForDeadNodeHandler]-org.apache.hadoop.hdds.scm.node.DeadNodeHandler:
> Clearing command queue of size 1 for DN
> 51a8bc95-909a-4c09-ac62-326a3f11640f(ccycloud-5.quasar-xxorwd.root.comops.site/10.140.185.70)
> 2025-07-23 16:36:14,247 INFO
> [node1-EventQueue-DeadNodeForDeadNodeHandler]-org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl:
> Removed a node: /default-rack/51a8bc95-909a-4c09-ac62-326a3f11640f{noformat}
> Decommissioning fails, and the following error is logged repeatedly in the
> SCM:
> {noformat}
> 2025-07-23 16:37:59,263 INFO
> [node1-DatanodeAdminManager-0]-org.apache.hadoop.hdds.scm.node.DatanodeAdminMonitorImpl:
> There are 2 nodes tracked for decommission and maintenance. 0 pending nodes.
> 2025-07-23 16:37:59,280 ERROR
> [node1-UnderReplicatedProcessor]-org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor:
> Error processing Health result of class: class
> org.apache.hadoop.hdds.scm.container.replication.ContainerHealthResult$UnderReplicatedHealthResult
> for container ContainerInfo{id=#10003, state=CLOSED,
> stateEnterTime=2025-07-23T16:28:31.441334Z,
> pipelineID=PipelineID=be36404e-5477-4894-8e8f-5c176c0b72e1,
> owner=ozone1753253441}
> java.lang.IllegalArgumentException: Affinity node
> /default-rack/51a8bc95-909a-4c09-ac62-326a3f11640f is not a member of topology
> at
> org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.checkAffinityNode(NetworkTopologyImpl.java:931)
> at
> org.apache.hadoop.hdds.scm.net.NetworkTopologyImpl.chooseRandom(NetworkTopologyImpl.java:510)
> at
> org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseNode(SCMContainerPlacementRackAware.java:471)
> at
> org.apache.hadoop.hdds.scm.container.placement.algorithms.SCMContainerPlacementRackAware.chooseDatanodesInternal(SCMContainerPlacementRackAware.java:244)
> at
> org.apache.hadoop.hdds.scm.SCMCommonPlacementPolicy.chooseDatanodes(SCMCommonPlacementPolicy.java:209)
> at
> org.apache.hadoop.hdds.scm.container.replication.ReplicationManagerUtil.getTargetDatanodes(ReplicationManagerUtil.java:91)
> at
> org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.getTargets(RatisUnderReplicationHandler.java:457)
> at
> org.apache.hadoop.hdds.scm.container.replication.RatisUnderReplicationHandler.processAndSendCommands(RatisUnderReplicationHandler.java:130)
> at
> org.apache.hadoop.hdds.scm.container.replication.ReplicationManager.processUnderReplicatedContainer(ReplicationManager.java:786)
> at
> org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:60)
> at
> org.apache.hadoop.hdds.scm.container.replication.UnderReplicatedProcessor.sendDatanodeCommands(UnderReplicatedProcessor.java:29)
> at
> org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processContainer(UnhealthyReplicationProcessor.java:156)
> at
> org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.processAll(UnhealthyReplicationProcessor.java:116)
> at
> org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor.run(UnhealthyReplicationProcessor.java:165)
> at java.base/java.lang.Thread.run(Thread.java:829)
> 2025-07-23 16:37:59,281 INFO
> [node1-UnderReplicatedProcessor]-org.apache.hadoop.hdds.scm.container.replication.UnhealthyReplicationProcessor:
> Processed 0 containers with health state counts {}, failed processing 1,
> deferred due to load 0{noformat}
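> A minimal sketch of the kind of guard that could avoid this (purely
> illustrative and not the actual fix; the helper name validAffinityNodeOrNull
> is an assumption for this sketch): before the affinity node is handed to the
> placement policy, check that it is still a member of the topology and fall
> back to no affinity node otherwise.
> {code:java}
> import java.util.Set;
>
> // Illustrative guard only; not the actual Ozone SCM code.
> public final class AffinityNodeGuard {
>
>   private AffinityNodeGuard() { }
>
>   // Return the affinity node only if it is still part of the topology;
>   // otherwise return null so placement proceeds without an affinity node.
>   static String validAffinityNodeOrNull(Set<String> topology,
>       String affinityNode) {
>     return (affinityNode != null && topology.contains(affinityNode))
>         ? affinityNode : null;
>   }
>
>   public static void main(String[] args) {
>     Set<String> topology = Set.of("/default-rack/dn-1", "/default-rack/dn-2");
>     // dn-3 has been removed by the dead node handler, so it is not valid.
>     String affinity = validAffinityNodeOrNull(topology, "/default-rack/dn-3");
>     System.out.println("Affinity node used for placement: " + affinity); // null
>   }
> }
> {code}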