[ https://issues.apache.org/jira/browse/HDFS-17722?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18043955#comment-18043955 ]

Zbigniew Kostrzewa edited comment on HDFS-17722 at 12/9/25 8:19 PM:
--------------------------------------------------------------------

I was able to reproduce the same behavior in a non-HA setup after a series of
decommissions and recommissions of some of the datanodes. On one of the re-runs,
the decommission process hung with one of the blocks reported as:
{code:java}
[user@host1 ~]$ hdfs fsck -blockId blk_1073742218
Connecting to namenode via 
http://host1:30070/fsck?ugi=user&blockId=blk_1073742218+&path=%2F
FSCK started by user/host1@REALM (auth:KERBEROS_SSL) from /host1 at Tue Dec 09 20:13:54 GMT 2025
Block Id: blk_1073742218
Block belongs to: /round5/data4/file_92
No. of Expected Replica: 1
No. of live Replica: 0
No. of excess Replica: 1
No. of stale Replica: 0
No. of decommissioned Replica: 0
No. of decommissioning Replica: 1
No. of corrupted Replica: 0
Block replica on datanode/rack: host2/default-rack is DECOMMISSIONING
Block replica on datanode/rack: host1/default-rack is HEALTHY{code}
If logs are needed, I can provide them from any of the services at DEBUG level.
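Since the only signal here is the fsck text itself, a small parser over the `hdfs fsck -blockId` output (format as in the snippet above) can flag blocks in this state. This is an illustrative sketch only: the `looks_stuck` heuristic is my own shorthand for the pattern shown in this report, not logic taken from HDFS.

```python
import re

def parse_fsck_block_report(text):
    """Extract the 'No. of <kind> Replica: <n>' counters from the
    per-block section of `hdfs fsck -blockId` output."""
    return {m.group(1).lower(): int(m.group(2))
            for m in re.finditer(r"No\. of (\w+) Replica:\s*(\d+)", text)}

def looks_stuck(counts):
    """Heuristic (assumption, not HDFS logic): the block matches the
    pattern in this report if it still has decommissioning replicas
    even though enough other copies exist (live + excess >= expected),
    so re-replication has nothing left to do."""
    return (counts.get("decommissioning", 0) > 0
            and counts.get("live", 0) + counts.get("excess", 0)
            >= counts.get("expected", 0))

# Counters copied from the fsck report above.
report = """\
No. of Expected Replica: 1
No. of live Replica: 0
No. of excess Replica: 1
No. of stale Replica: 0
No. of decommissioned Replica: 0
No. of decommissioning Replica: 1
No. of corrupted Replica: 0"""

counts = parse_fsck_block_report(report)
```

Running `looks_stuck(counts)` on the report above returns True, matching the hang described.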



> DataNode stuck in decommissioning on standby NameNode
> -----------------------------------------------------
>
>                 Key: HDFS-17722
>                 URL: https://issues.apache.org/jira/browse/HDFS-17722
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.3.6
>            Reporter: Benoit Sigoure
>            Priority: Minor
>
> When decommissioning a DataNode in our cluster, we observed a situation where 
> the active NameNode had marked the DataNode as decommissioned but the standby 
> had it stuck in the decommissioning state indefinitely (we waited 8 hours) due 
> to a block allegedly being under-replicated (note: for this path the target 
> replication factor is 2).  The standby NameNode kept logging this in a loop:
> {{2025-01-31 12:02:35,963 INFO BlockStateChange: Block: 
> blk_1486338012_426727507, Expected Replicas: 2, live replicas: 1, corrupt 
> replicas: 0, decommissioned replicas: 0, decommissioning replicas: 1, 
> maintenance replicas: 0, live entering maintenance replicas: 0, replicas on 
> stale nodes: 0, readonly replicas: 0, excess replicas: 1, Is Open File: 
> false, Datanodes having this block: 10.128.89.32:9866 10.128.118.216:9866 
> 10.128.49.6:9866 , Current Datanode: 10.128.118.216:9866, Is current datanode 
> decommissioning: true, Is current datanode entering maintenance: false}}
> Looking at the fsck report for this block, the active NameNode was reporting 
> the following:
> {code:java}
> Block Id: blk_1486338012
> Block belongs to: /path/to/file
> No. of Expected Replica: 2
> No. of live Replica: 2
> No. of excess Replica: 0
> No. of stale Replica: 0
> No. of decommissioned Replica: 1
> No. of decommissioning Replica: 0
> No. of corrupted Replica: 0
> Block replica on datanode/rack: datanode-v3-25-hadoop.hadoop/default-rack is 
> HEALTHY
> Block replica on datanode/rack: datanode-v3-39-hadoop.hadoop/default-rack is 
> DECOMMISSIONED
> Block replica on datanode/rack: datanode-v3-26-hadoop.hadoop/default-rack is 
> HEALTHY
> {code}
> Whereas on the standby it says:
> {code:java}
> Block Id: blk_1486338012
> Block belongs to: /path/to/file
> No. of Expected Replica: 2
> No. of live Replica: 1
> No. of excess Replica: 1
> No. of stale Replica: 0
> No. of decommissioned Replica: 0
> No. of decommissioning Replica: 1
> No. of corrupted Replica: 0
> Block replica on datanode/rack: datanode-v3-25-hadoop.hadoop/default-rack is 
> HEALTHY
> Block replica on datanode/rack: datanode-v3-39-hadoop.hadoop/default-rack is 
> DECOMMISSIONING
> Block replica on datanode/rack: datanode-v3-26-hadoop.hadoop/default-rack is 
> HEALTHY
> {code}
> {code:java}
> hadoop@namenode-0:/$ hdfs dfs -ls /path/to/file
> -rw-r--r-- 2 hbase supergroup 32453388896 2025-01-02 16:15 /path/to/file
> {code}
> After restarting the standby NameNode, the problem disappeared and the datanode 
> in question transitioned to the decommissioned state as expected.
> Credits for the bug report go to Tomas Baltrunas at Arista.
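To make the active/standby divergence above easier to see at a glance, the two fsck views can be compared field by field. The counts below are transcribed from the two reports quoted above; `diff_counts` is just an illustrative helper, not an HDFS tool.

```python
def diff_counts(active, standby):
    """Return the replica counters on which the two NameNodes disagree,
    as {field: (active_value, standby_value)}."""
    return {k: (active.get(k, 0), standby.get(k, 0))
            for k in sorted(set(active) | set(standby))
            if active.get(k, 0) != standby.get(k, 0)}

# Counters transcribed from the active NameNode's fsck report above...
active = {"expected": 2, "live": 2, "excess": 0, "stale": 0,
          "decommissioned": 1, "decommissioning": 0, "corrupted": 0}
# ...and from the standby's report for the same block.
standby = {"expected": 2, "live": 1, "excess": 1, "stale": 0,
           "decommissioned": 0, "decommissioning": 1, "corrupted": 0}

divergence = diff_counts(active, standby)
# The standby still counts one decommissioning replica (plus an excess
# copy) where the active already sees the replica as decommissioned.
```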



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
