[ https://issues.apache.org/jira/browse/HDFS-16303?focusedWorklogId=688303&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-688303 ]
ASF GitHub Bot logged work on HDFS-16303:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 30/Nov/21 21:26
Start Date: 30/Nov/21 21:26
Worklog Time Spent: 10m

Work Description: KevinWikant commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-983034760

> DECOMMISSION_IN_PROGRESS + DEAD is an error state that means decommission has effectively failed. There is a case where it can complete, but what does that really mean - if the node is dead, it has not been gracefully stopped.

The case which I have described, where dead node decommissioning completes, can occur when:
- a decommissioning node goes dead, but all of its blocks still have block replicas on other live nodes
- the namenode is eventually able to satisfy the minimum replication of all blocks (by replicating the under-replicated blocks from the live nodes)
- the dead decommissioning node is transitioned to decommissioned

In this case, the node did go dead while decommissioning, but there was no data loss thanks to redundant block replicas. From the user perspective, the loss of the decommissioning node did not impact the outcome of the decommissioning process. Had the node not gone dead while decommissioning, the eventual outcome would have been the same: the node is decommissioned, there is no data loss, & all blocks have sufficient replicas.

If there is data loss, then a dead datanode will remain decommissioning, because if the dead node were to come alive again it may be able to recover the lost data. But if there is no data loss, then when the node comes alive again it will immediately be transitioned to decommissioned anyway, so why not make it decommissioned while it's still dead (and there is no data loss)?
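The decision argued above could be sketched roughly as follows. This is an illustrative stand-in, not the actual Hadoop implementation: the class, method, and `lowRedundancyBlocks` parameter are hypothetical names for the concept of "blocks that no longer meet minimum replication on live nodes".

```java
// Hypothetical sketch (not real Hadoop code) of the state decision above:
// a dead DECOMMISSION_IN_PROGRESS node can safely become DECOMMISSIONED
// only when none of its blocks are low-redundancy, i.e. there is no data
// loss that its restart could repair.
public class DeadNodeTransition {
    enum AdminState { DECOMMISSION_IN_PROGRESS, DECOMMISSIONED }

    // lowRedundancyBlocks counts blocks that no longer meet minimum
    // replication across the remaining live datanodes.
    static AdminState nextState(int lowRedundancyBlocks) {
        if (lowRedundancyBlocks == 0) {
            // No data loss: the node would transition immediately if it
            // came back alive, so complete the decommission now.
            return AdminState.DECOMMISSIONED;
        }
        // Data at risk: keep the node decommissioning in case it returns
        // with the missing replicas.
        return AdminState.DECOMMISSION_IN_PROGRESS;
    }
}
```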
Also, I don't think the priority queue is adding much complexity; it's just putting healthy nodes (with more recent heartbeat times) ahead of unhealthy nodes (with older heartbeat times) such that healthy nodes are decommissioned first.

----

I also want to call out another caveat with the approach of removing the node from the DatanodeAdminManager, which I uncovered while unit testing.

If we leave the node in DECOMMISSION_IN_PROGRESS & remove the node from DatanodeAdminManager, then the following callstack should re-add the datanode to the DatanodeAdminManager when it comes alive again:
- [DatanodeManager.registerDatanode](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1223)
- [DatanodeManager.startAdminOperationIfNecessary](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1109)
- [DatanodeAdminManager.startDecommission](https://github.com/apache/hadoop/blob/62c86eaa0e539a4307ca794e0fcd502a77ebceb8/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L187)
- [DatanodeAdminMonitorBase.startTrackingNode](https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java#L136)

The problem is this condition "!node.isDecommissionInProgress()": https://github.com/apache/hadoop/blob/62c86eaa0e539a4307ca794e0fcd502a77ebceb8/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L177

Because the dead datanode is left in DECOMMISSION_IN_PROGRESS, "startTrackingNode" is not invoked due to the "!node.isDecommissionInProgress()" condition.

Simply removing the condition "!node.isDecommissionInProgress()" will not function well because "startTrackingNode" is not idempotent:
- [startDecommission is invoked periodically when refreshDatanodes is called](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1339)
- [pendingNodes is an ArrayDeque which does not deduplicate the DatanodeDescriptor](https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java#L43)
- therefore, removing the "!node.isDecommissionInProgress()" check will cause a large number of duplicate DatanodeDescriptors to be added to the DatanodeAdminManager

I can think of 2 obvious ways to handle this:
A) make calls to "startTrackingNode" idempotent (meaning that if the DatanodeDescriptor is already tracked, it does not get added to the DatanodeAdminManager again)
B) modify "startDecommission" so that it is aware of whether the invocation is for a datanode which was just restarted after being dead, such that it can still invoke "startTrackingNode" even though "node.isDecommissionInProgress()" is true

For A), the challenge is that we need to ensure the DatanodeDescriptor is not in "pendingReplication" or "outOfServiceBlocks", which could be a fairly costly check to execute repeatedly. Also, I am not even sure such a check is thread-safe given there is no locking used as part of "startDecommission" or "startTrackingNode".

For B), the awareness of whether a registerDatanode call is related to a [restarted datanode is available here](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1177).
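To make the two options concrete, here is a hedged sketch of both ideas in one toy class. All names are illustrative stand-ins, not the real Hadoop classes or signatures: option A is modeled by backing the pending queue with a Set so repeated tracking calls are no-ops, and option B by an `isNodeRestart` flag that bypasses the in-progress guard.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Toy model of the DatanodeAdminManager tracking problem; String stands in
// for DatanodeDescriptor.
public class AdminTrackerSketch {
    private final Queue<String> pendingNodes = new ArrayDeque<>();
    private final Set<String> tracked = new HashSet<>();
    private final Set<String> decommissionInProgress = new HashSet<>();

    // Option A: idempotent tracking. A node already tracked is not
    // enqueued again, so periodic refreshDatanodes invocations cannot
    // flood pendingNodes with duplicates.
    public synchronized boolean startTrackingNode(String node) {
        if (tracked.add(node)) {
            pendingNodes.add(node);
            return true;
        }
        return false;
    }

    // Option B: a restarted dead node is re-tracked even though it is
    // already DECOMMISSION_IN_PROGRESS; all other call sites pass false
    // to preserve the existing guard.
    public synchronized void startDecommission(String node, boolean isNodeRestart) {
        if (!decommissionInProgress.contains(node) || isNodeRestart) {
            decommissionInProgress.add(node);
            startTrackingNode(node);
        }
    }

    public synchronized int pendingCount() {
        return pendingNodes.size();
    }
}
```

Note the toy model glosses over the hard part called out above for option A: in the real code, "already tracked" would also have to consult pendingReplication and outOfServiceBlocks, and do so thread-safely.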
So this information would need to be passed down the callstack to a method "startDecommission(DatanodeDescriptor node, boolean isNodeRestart)". Because of the modified method signature, all the other invocations of "startDecommission" will need to specify isNodeRestart=false.

Given this additional hurdle in the approach of removing a dead datanode from the DatanodeAdminManager, are we sure it will be less complex/impactful than the proposed change?

----

In short:
- I don't think there is any downside in moving a dead datanode to decommissioned when there are no LowRedundancy blocks, because this would happen immediately anyway were the node to come back alive (and get re-added to the DatanodeAdminManager)
- the approach of removing a dead datanode from the DatanodeAdminManager will not work properly without some significant refactoring of the "startDecommission" method & related code

@sodonnel @virajjasani @aajisaka let me know your thoughts. I am still more in favor of tracking dead datanodes in the DatanodeAdminManager (when there are LowRedundancy blocks), but if the community thinks it's better to remove the dead datanodes from the DatanodeAdminManager, I can implement proposal B).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------

Worklog Id: (was: 688303)
Time Spent: 4h 40m (was: 4.5h)

> Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning
> ------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-16303
> URL: https://issues.apache.org/jira/browse/HDFS-16303
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.10.1, 3.3.1
> Reporter: Kevin Wikant
> Priority: Major
> Labels: pull-request-available
> Time Spent: 4h 40m
> Remaining Estimate: 0h
>
> h2. Impact
> HDFS datanode decommissioning does not make any forward progress. For example, the user adds X datanodes to the "dfs.hosts.exclude" file and all X of those datanodes remain in state decommissioning forever without making any forward progress towards being decommissioned.
> h2. Root Cause
> The HDFS Namenode class "DatanodeAdminManager" is responsible for decommissioning datanodes.
> As per this "hdfs-site" configuration:
> {quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes
> Default Value = 100
> The maximum number of decommission-in-progress datanodes that will be tracked at one time by the namenode. Tracking a decommission-in-progress datanode consumes additional NN memory proportional to the number of blocks on the datanode. Having a conservative limit reduces the potential impact of decommissioning a large number of nodes at once. A value of 0 means no limit will be enforced.
> {quote}
> The Namenode will only actively track up to 100 datanodes for decommissioning at any given time, so as to avoid Namenode memory pressure.
> Looking into the "DatanodeAdminManager" code:
> * a datanode is only removed from the "tracked.nodes" set when it finishes decommissioning
> * a datanode is only added to the "tracked.nodes" set if there are fewer than 100 datanodes being tracked
> So in the event that there are more than 100 datanodes being decommissioned at a given time, some of those datanodes will not be in the "tracked.nodes" set until 1 or more datanodes in "tracked.nodes" finish decommissioning. This is generally not a problem because the datanodes in "tracked.nodes" will eventually finish decommissioning, but there is an edge case where this logic prevents the namenode from making any forward progress towards decommissioning.
> If all 100 datanodes in "tracked.nodes" are unable to finish decommissioning, then other datanodes (which may be able to be decommissioned) will never get added to "tracked.nodes" and therefore will never get the opportunity to be decommissioned.
> This can occur due to the following issue:
> {quote}2021-10-21 12:39:24,048 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager (DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In Progress. Cannot be safely decommissioned or be in maintenance since there is risk of reduced data durability or data loss. Either restart the failed node or force decommissioning or maintenance by removing, calling refreshNodes, then re-adding to the excludes or host config files.
> {quote}
> If a Datanode is lost while decommissioning (for example if the underlying hardware fails or is lost), then it will remain in state decommissioning forever.
> If 100 or more Datanodes are lost while decommissioning over the Hadoop cluster lifetime, then this is enough to completely fill up the "tracked.nodes" set.
> With the entire "tracked.nodes" set filled with datanodes that can never finish decommissioning, any datanodes added after this point will never be able to be decommissioned because they will never be added to the "tracked.nodes" set.
> In this scenario:
> * the "tracked.nodes" set is filled with datanodes which are lost & cannot be recovered (and can never finish decommissioning, so they will never be removed from the set)
> * the actual live datanodes being decommissioned are enqueued waiting to enter the "tracked.nodes" set (and are stuck waiting indefinitely)
> This means that no progress towards decommissioning the live datanodes will be made unless the user takes the following action:
> {quote}Either restart the failed node or force decommissioning or maintenance by removing, calling refreshNodes, then re-adding to the excludes or host config files.
> {quote}
> Ideally, the Namenode should be able to gracefully handle scenarios where the datanodes in the "tracked.nodes" set are not making forward progress towards decommissioning while the enqueued datanodes may be able to make forward progress.
> h2. Reproduce Steps
> * create a Hadoop cluster
> * lose (i.e. terminate the host/process forever) over 100 datanodes while the datanodes are in state decommissioning
> * add additional datanodes to the cluster
> * attempt to decommission those new datanodes & observe that they are stuck in state decommissioning forever
> Note that in this example each datanode, over the full history of the cluster, has a unique IP address

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org