[
https://issues.apache.org/jira/browse/HDFS-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kevin Wikant resolved HDFS-16303.
---------------------------------
Resolution: Fixed
Resolved by this commit:
https://github.com/apache/hadoop/commit/d20b598f97e76c67d6103a950ea9e89644be2c41
> Losing over 100 datanodes in state decommissioning results in full blockage
> of all datanode decommissioning
> -----------------------------------------------------------------------------------------------------------
>
> Key: HDFS-16303
> URL: https://issues.apache.org/jira/browse/HDFS-16303
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.10.1, 3.3.1
> Reporter: Kevin Wikant
> Assignee: Kevin Wikant
> Priority: Major
> Labels: pull-request-available
> Time Spent: 12h 10m
> Remaining Estimate: 0h
>
> h2. Impact
> HDFS datanode decommissioning makes no forward progress. For example, the
> user adds X datanodes to the "dfs.hosts.exclude" file, and all X of those
> datanodes remain in state decommissioning forever.
> h2. Root Cause
> The HDFS Namenode class "DatanodeAdminManager" is responsible for
> decommissioning datanodes.
> As per this "hdfs-site" configuration:
> {quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes
> Default Value = 100
> The maximum number of decommission-in-progress datanodes that will be
> tracked at one time by the namenode. Tracking a decommission-in-progress
> datanode consumes additional NN memory proportional to the number of blocks
> on the datanode. Having a conservative limit reduces the potential impact of
> decommissioning a large number of nodes at once. A value of 0 means no limit
> will be enforced.
> {quote}
> The Namenode will only actively track up to 100 datanodes for decommissioning
> at any given time, so as to avoid Namenode memory pressure.
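> The limit is set in "hdfs-site.xml". As a sketch, raising it might look like
> the following (the property name, the default of 100, and the meaning of 0
> come from the description quoted above; the value 200 is only an illustrative
> example):
> {code:xml}
> <!-- hdfs-site.xml: number of decommissioning datanodes the Namenode will
>      track concurrently (default 100; 0 means no limit). -->
> <property>
>   <name>dfs.namenode.decommission.max.concurrent.tracked.nodes</name>
>   <value>200</value>
> </property>
> {code}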
> Looking into the "DatanodeAdminManager" code:
> * a datanode is removed from the "tracked.nodes" set only when it finishes
> decommissioning
> * a datanode is added to the "tracked.nodes" set only if there are fewer
> than 100 datanodes being tracked
> So in the event that more than 100 datanodes are being decommissioned at a
> given time, some of those datanodes will not be in the "tracked.nodes" set
> until one or more datanodes in "tracked.nodes" finish decommissioning. This
> is generally not a problem because the datanodes in "tracked.nodes" will
> eventually finish decommissioning, but there is an edge case where this logic
> prevents the namenode from making any forward progress towards
> decommissioning.
> If all 100 datanodes in the "tracked.nodes" set are unable to finish
> decommissioning, then other datanodes (which may be able to be
> decommissioned) will never get added to "tracked.nodes" and therefore will
> never get the opportunity to be decommissioned.
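> To make the starvation concrete, here is a minimal Java sketch of the
> tracked-set / pending-queue pattern described above. It is an illustration of
> the logic only, not the actual DatanodeAdminManager code; the class and
> method names are hypothetical.
> {code:java}
> import java.util.ArrayDeque;
> import java.util.HashSet;
> import java.util.Queue;
> import java.util.Set;
>
> // Hypothetical illustration of the tracking logic; not the real
> // DatanodeAdminManager.
> class DecommissionTracker {
>   // dfs.namenode.decommission.max.concurrent.tracked.nodes (default 100)
>   private final int maxTrackedNodes = 100;
>   private final Set<String> trackedNodes = new HashSet<>();
>   private final Queue<String> pendingNodes = new ArrayDeque<>();
>
>   void startDecommission(String datanode) {
>     pendingNodes.add(datanode);
>   }
>
>   // Called periodically by the admin monitor thread.
>   void tick() {
>     // Nodes leave the tracked set ONLY when they finish decommissioning.
>     trackedNodes.removeIf(this::isDecommissioned);
>
>     // Nodes enter the tracked set ONLY while it is below the limit.
>     while (trackedNodes.size() < maxTrackedNodes && !pendingNodes.isEmpty()) {
>       trackedNodes.add(pendingNodes.poll());
>     }
>     // If all 100 tracked nodes are dead and can never finish, nothing is
>     // ever removed, the while-loop never admits new nodes, and every
>     // datanode in pendingNodes is starved indefinitely.
>   }
>
>   private boolean isDecommissioned(String datanode) {
>     return false; // placeholder: a dead datanode never completes
>   }
> }
> {code}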
> This can occur due to the following issue:
> {quote}2021-10-21 12:39:24,048 WARN
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager
> (DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In
> Progress. Cannot be safely decommissioned or be in maintenance since there is
> risk of reduced data durability or data loss. Either restart the failed node
> or force decommissioning or maintenance by removing, calling refreshNodes,
> then re-adding to the excludes or host config files.
> {quote}
> If a datanode is lost while decommissioning (for example, if the underlying
> hardware fails or is permanently terminated), then it will remain in state
> decommissioning forever.
> If 100 or more datanodes are lost while decommissioning over the Hadoop
> cluster lifetime, then this is enough to completely fill up the
> "tracked.nodes" set. With the entire "tracked.nodes" set filled with
> datanodes that can never finish decommissioning, any datanodes added after
> this point will never be able to be decommissioned because they will never be
> added to the "tracked.nodes" set.
> In this scenario:
> * the "tracked.nodes" set is filled with datanodes which are lost & cannot
> be recovered (and can never finish decommissioning so they will never be
> removed from the set)
> * the actual live datanodes being decommissioned are enqueued waiting to
> enter the "tracked.nodes" set (and are stuck waiting indefinitely)
> This means that no progress towards decommissioning the live datanodes will
> be made unless the user takes the following action:
> {quote}Either restart the failed node or force decommissioning or maintenance
> by removing, calling refreshNodes, then re-adding to the excludes or host
> config files.
> {quote}
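> In practice, that workaround amounts to editing the excludes file and
> re-running refreshNodes. A sketch, assuming the excludes file lives at
> /etc/hadoop/conf/dfs.hosts.exclude and using placeholder hostnames (both are
> installation-specific):
> {code:bash}
> # Drop the dead datanode from the excludes file, then have the Namenode
> # re-read it; per the log message above, this forces the dead node out of
> # Decommission In Progress.
> sed -i '/dead-datanode.example.com/d' /etc/hadoop/conf/dfs.hosts.exclude
> hdfs dfsadmin -refreshNodes
>
> # Re-add the live datanodes to be decommissioned and refresh again.
> echo 'live-datanode.example.com' >> /etc/hadoop/conf/dfs.hosts.exclude
> hdfs dfsadmin -refreshNodes
> {code}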
> Ideally, the Namenode should gracefully handle scenarios where the datanodes
> in the "tracked.nodes" set are not making forward progress towards
> decommissioning while the enqueued datanodes could make progress if given a
> tracked slot.
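> One possible shape for such handling (a hypothetical sketch building on the
> DecommissionTracker illustration above, not necessarily what the linked
> commit does): when the monitor notices that a tracked datanode is dead,
> requeue it instead of letting it pin a tracked slot forever.
> {code:java}
> // Hypothetical variation of tick() from the sketch above: dead datanodes
> // are cycled back to the pending queue so healthy datanodes can take
> // their tracked slots.
> void tick() {
>   trackedNodes.removeIf(this::isDecommissioned);
>
>   // Requeue dead datanodes rather than letting them occupy tracked slots.
>   trackedNodes.removeIf(dn -> {
>     if (isDead(dn)) {
>       pendingNodes.add(dn); // retried later, behind healthy datanodes
>       return true;
>     }
>     return false;
>   });
>
>   while (trackedNodes.size() < maxTrackedNodes && !pendingNodes.isEmpty()) {
>     trackedNodes.add(pendingNodes.poll());
>   }
> }
>
> private boolean isDead(String datanode) {
>   return false; // placeholder: a real NN would consult heartbeat state
> }
> {code}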
> h2. Reproduce Steps
> * create a Hadoop cluster
> * lose (i.e. terminate the host/process forever) over 100 datanodes while
> the datanodes are in state decommissioning
> * add additional datanodes to the cluster
> * attempt to decommission those new datanodes & observe that they are stuck
> in state decommissioning forever
> Note that in this example each datanode, over the full history of the
> cluster, has a unique IP address.