[ https://issues.apache.org/jira/browse/HDFS-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Akira Ajisaka updated HDFS-16303: --------------------------------- Fix Version/s: 3.4.0 > Losing over 100 datanodes in state decommissioning results in full blockage > of all datanode decommissioning > ----------------------------------------------------------------------------------------------------------- > > Key: HDFS-16303 > URL: https://issues.apache.org/jira/browse/HDFS-16303 > Project: Hadoop HDFS > Issue Type: Bug > Affects Versions: 2.10.1, 3.3.1 > Reporter: Kevin Wikant > Assignee: Kevin Wikant > Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 12h 10m > Remaining Estimate: 0h > > h2. Impact > HDFS datanode decommissioning does not make any forward progress. For > example, the user adds X datanodes to the "dfs.hosts.exclude" file and all X > of those datanodes remain in state decommissioning forever without making any > forward progress towards being decommissioned. > h2. Root Cause > The HDFS Namenode class "DatanodeAdminManager" is responsible for > decommissioning datanodes. > As per this "hdfs-site" configuration: > {quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes > Default Value = 100 > The maximum number of decommission-in-progress datanodes nodes that will be > tracked at one time by the namenode. Tracking a decommission-in-progress > datanode consumes additional NN memory proportional to the number of blocks > on the datnode. Having a conservative limit reduces the potential impact of > decomissioning a large number of nodes at once. A value of 0 means no limit > will be enforced. > {quote} > The Namenode will only actively track up to 100 datanodes for decommissioning > at any given time, as to avoid Namenode memory pressure. > Looking into the "DatanodeAdminManager" code: > * a new datanode is only removed from the "tracked.nodes" set when it > finishes decommissioning > * a new datanode is only added to the "tracked.nodes" set if there is fewer > than 100 datanodes being tracked > So in the event that there are more than 100 datanodes being decommissioned > at a given time, some of those datanodes will not be in the "tracked.nodes" > set until 1 or more datanodes in the "tracked.nodes" finishes > decommissioning. This is generally not a problem because the datanodes in > "tracked.nodes" will eventually finish decommissioning, but there is an edge > case where this logic prevents the namenode from making any forward progress > towards decommissioning. > If all 100 datanodes in the "tracked.nodes" are unable to finish > decommissioning, then other datanodes (which may be able to be > decommissioned) will never get added to "tracked.nodes" and therefore will > never get the opportunity to be decommissioned. > This can occur due the following issue: > {quote}2021-10-21 12:39:24,048 WARN > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager > (DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In > Progress. Cannot be safely decommissioned or be in maintenance since there is > risk of reduced data durability or data loss. Either restart the failed node > or force decommissioning or maintenance by removing, calling refreshNodes, > then re-adding to the excludes or host config files. > {quote} > If a Datanode is lost while decommissioning (for example if the underlying > hardware fails or is lost), then it will remain in state decommissioning > forever. > If 100 or more Datanodes are lost while decommissioning over the Hadoop > cluster lifetime, then this is enough to completely fill up the > "tracked.nodes" set. With the entire "tracked.nodes" set filled with > datanodes that can never finish decommissioning, any datanodes added after > this point will never be able to be decommissioned because they will never be > added to the "tracked.nodes" set. > In this scenario: > * the "tracked.nodes" set is filled with datanodes which are lost & cannot > be recovered (and can never finish decommissioning so they will never be > removed from the set) > * the actual live datanodes being decommissioned are enqueued waiting to > enter the "tracked.nodes" set (and are stuck waiting indefinitely) > This means that no progress towards decommissioning the live datanodes will > be made unless the user takes the following action: > {quote}Either restart the failed node or force decommissioning or maintenance > by removing, calling refreshNodes, then re-adding to the excludes or host > config files. > {quote} > Ideally, the Namenode should be able to gracefully handle scenarios where the > datanodes in the "tracked.nodes" set are not making forward progress towards > decommissioning while the enqueued datanodes may be able to make forward > progress. > h2. Reproduce Steps > * create a Hadoop cluster > * lose (i.e. terminate the host/process forever) over 100 datanodes while > the datanodes are in state decommissioning > * add additional datanodes to the cluster > * attempt to decommission those new datanodes & observe that they are stuck > in state decommissioning forever > Note that in this example each datanode, over the full history of the > cluster, has a unique IP address -- This message was sent by Atlassian Jira (v8.20.1#820001) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org