Kevin Wikant created HDFS-16303:
-----------------------------------

             Summary: Losing over 100 datanodes in state decommissioning 
results in full blockage of all datanode decommissioning
                 Key: HDFS-16303
                 URL: https://issues.apache.org/jira/browse/HDFS-16303
             Project: Hadoop HDFS
          Issue Type: Bug
    Affects Versions: 3.3.1, 2.10.1
            Reporter: Kevin Wikant


## Problem Description

The HDFS Namenode class "DatanodeAdminManager" is responsible for 
decommissioning datanodes.

As per this "hdfs-site" configuration:
{quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes
Default Value = 100

The maximum number of decommission-in-progress datanodes that will be
tracked at one time by the namenode. Tracking a decommission-in-progress
datanode consumes additional NN memory proportional to the number of blocks on
the datanode. Having a conservative limit reduces the potential impact of
decommissioning a large number of nodes at once. A value of 0 means no limit
will be enforced.
{quote}
By default, the Namenode will actively track at most 100 datanodes for
decommissioning at any given time, so as to avoid Namenode memory pressure.
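
For reference, the limit can be raised, or lifted entirely by setting it to 0. Below is a minimal sketch of setting it programmatically through the Java Configuration API, assuming the standard DFSConfigKeys constant for this property; in production it would normally be set in hdfs-site.xml instead.
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class TrackedNodesLimitExample {
  public static void main(String[] args) {
    Configuration conf = new HdfsConfiguration();
    // Default is 100; 0 removes the limit entirely (at the cost of
    // unbounded NN memory for tracking decommissioning datanodes).
    conf.setInt(
        DFSConfigKeys.DFS_NAMENODE_DECOMMISSION_MAX_CONCURRENT_TRACKED_NODES,
        200);
  }
}
{code}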

Looking into the "DatanodeAdminManager" code:
 * a datanode is only removed from the "tracked.nodes" set when it finishes 
decommissioning
 * a datanode is only added to the "tracked.nodes" set if there are fewer 
than 100 datanodes being tracked

So in the event that more than 100 datanodes are being decommissioned at 
a given time, some of those datanodes will not be in the "tracked.nodes" set 
until one or more datanodes in "tracked.nodes" finish decommissioning. This 
is generally not a problem because the datanodes in "tracked.nodes" will 
eventually finish decommissioning, but there is an edge case where this logic 
prevents the Namenode from making any forward progress towards decommissioning.

If all 100 datanodes in the "tracked.nodes" set are unable to finish 
decommissioning, then other datanodes (which may well be able to finish) 
will never be added to "tracked.nodes" and therefore will never get the 
opportunity to be decommissioned, as the toy model below demonstrates.
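
The following is a toy model of this starvation (a hypothetical simplification for illustration only; the names loosely echo the real "DatanodeAdminManager" fields rather than quoting them):
{code:java}
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Toy model: permanently-stuck tracked datanodes starve the pending queue.
public class TrackedSetStarvation {
  static final int MAX_TRACKED = 2; // stands in for the default of 100

  public static void main(String[] args) {
    Set<String> tracked = new HashSet<>();      // "tracked.nodes"
    Queue<String> pending = new ArrayDeque<>(); // datanodes awaiting a slot

    // Two datanodes enter decommissioning and then die; they are never
    // removed from the tracked set because they can never finish.
    tracked.add("deadDN1");
    tracked.add("deadDN2");

    // A healthy datanode now asks to be decommissioned.
    pending.add("liveDN3");

    // Each monitor tick only admits pending nodes when there is room,
    // and room only appears when a tracked node finishes.
    while (!pending.isEmpty() && tracked.size() < MAX_TRACKED) {
      tracked.add(pending.poll());
    }

    System.out.println("still pending: " + pending); // prints [liveDN3]
  }
}
{code}
Every monitor tick repeats the same admission check, so "liveDN3" stays queued for as long as the dead datanodes occupy the tracked slots.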

This can occur due to the following issue:
{quote}2021-10-21 12:39:24,048 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager 
(DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In 
Progress. Cannot be safely decommissioned or be in maintenance since there is 
risk of reduced data durability or data loss. Either restart the failed node or 
force decommissioning or maintenance by removing, calling refreshNodes, then 
re-adding to the excludes or host config files.
{quote}
If a datanode is lost while decommissioning (for example, because the 
underlying hardware fails or the host is terminated), then it will remain in 
state decommissioning forever.
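
The reason such a node can never finish is the monitor's health check, which refuses to finalize a dead datanode that still has blocks at risk. A paraphrased sketch of that check follows (approximate shape of the Hadoop 3.x code, with illustrative parameter names; not a verbatim quote):
{code:java}
// Paraphrase of the per-node health check run on each monitor tick.
boolean isHealthyForDecommission(DatanodeDescriptor node,
    long lowRedundancyBlocks, long pendingReconstructionBlocks) {
  if (node.isAlive()) {
    return true; // normal case: replication will eventually drain the node
  }
  if (lowRedundancyBlocks == 0 && pendingReconstructionBlocks == 0) {
    return true; // dead, but every block is already safe elsewhere
  }
  // Dead with blocks still at risk: decommissioning can never complete,
  // so the node stays DECOMMISSION_INPROGRESS and never leaves
  // "tracked.nodes".
  return false;
}
{code}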

If 100 or more datanodes are lost while decommissioning over the Hadoop 
cluster's lifetime, that is enough to completely fill the "tracked.nodes" set. 
With the entire set occupied by datanodes that can never finish 
decommissioning, any datanodes added after this point can never be 
decommissioned, because they will never be admitted to "tracked.nodes".

In this scenario:
 * the "tracked.nodes" set is filled with datanodes which are lost & cannot be 
recovered (and can never finish decommissioning so they will never be removed 
from the set)
 * the actual live datanodes being decommissioned are enqueued waiting to enter 
the "tracked.nodes" set (and are stuck waiting indefinitely)

This means that no progress towards decommissioning the live datanodes will be 
made unless the user takes the following action:
{quote}Either restart the failed node or force decommissioning or maintenance 
by removing, calling refreshNodes, then re-adding to the excludes or host 
config files.
{quote}
Ideally, the Namenode should be able to gracefully handle scenarios where the 
datanodes in the "tracked.nodes" set are not making forward progress towards 
decommissioning, while the enqueued datanodes may be able to make forward 
progress.
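
One possible shape for such handling (a hypothetical sketch only, not a committed fix): have the monitor evict unhealthy tracked nodes back onto the pending queue, so healthy queued datanodes can take the freed slots while the dead nodes cycle behind them. The isHealthy predicate below is assumed, not a real API:
{code:java}
import java.util.Iterator;
import java.util.Queue;
import java.util.Set;

// Hypothetical mitigation: on each monitor tick, requeue unhealthy tracked
// datanodes instead of letting them pin their slots forever. If a dead node
// ever comes back, it simply re-enters the tracked set from the queue.
void requeueUnhealthyTrackedNodes(Set<String> tracked, Queue<String> pending) {
  Iterator<String> it = tracked.iterator();
  while (it.hasNext()) {
    String dn = it.next();
    if (!isHealthy(dn)) { // assumed health predicate, see sketch above
      it.remove();        // free the tracked slot
      pending.add(dn);    // retry later, behind the healthy nodes
    }
  }
}
{code}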

## Reproduction Steps
 * create a Hadoop cluster
 * lose (i.e. terminate the host/process forever) over 100 datanodes while the 
datanodes are in state decommissioning
 * add additional datanodes to the cluster
 * attempt to decommission those new datanodes & observe that they are stuck in 
state decommissioning forever

Note that in this example each datanode, over the full history of the cluster, 
has a unique IP address.
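
For a scaled-down reproduction in a test, the tracked-nodes limit can be shrunk to 1 so that a single dead decommissioning datanode is enough to block the queue. Below is an outline using MiniDFSCluster (hypothetical scaffolding; the decommission steps themselves are normally driven through the hosts-exclude file plus refreshNodes, summarized here as comments):
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class DecommissionStarvationRepro {
  public static void main(String[] args) throws Exception {
    Configuration conf = new HdfsConfiguration();
    // Shrink the limit from the default 100 to 1 so that one stuck
    // datanode fills the entire tracked set.
    conf.setInt(
        DFSConfigKeys.DFS_NAMENODE_DECOMMISSION_MAX_CONCURRENT_TRACKED_NODES,
        1);
    MiniDFSCluster cluster =
        new MiniDFSCluster.Builder(conf).numDataNodes(3).build();
    try {
      cluster.waitActive();
      // 1. start decommissioning datanode 0 (exclude file + refreshNodes)
      // 2. kill it before it finishes, e.g. cluster.stopDataNode(0)
      // 3. start decommissioning datanode 1
      // 4. observe: datanode 1 stays DECOMMISSION_INPROGRESS indefinitely,
      //    because dead datanode 0 holds the only tracked slot
    } finally {
      cluster.shutdown();
    }
  }
}
{code}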


