[ https://issues.apache.org/jira/browse/HDFS-16303?focusedWorklogId=688303&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-688303 ]
ASF GitHub Bot logged work on HDFS-16303:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 30/Nov/21 21:26
Start Date: 30/Nov/21 21:26
Worklog Time Spent: 10m

Work Description: KevinWikant commented on pull request #3675:
URL: https://github.com/apache/hadoop/pull/3675#issuecomment-983034760

> DECOMMISSION_IN_PROGRESS + DEAD is an error state that means decommission has effectively failed. There is a case where it can complete, but what does that really mean - if the node is dead, it has not been gracefully stopped.

The case which I have described, where dead node decommissioning completes, can occur when:
- a decommissioning node goes dead, but all of its blocks still have block replicas on other live nodes
- the namenode is eventually able to satisfy the minimum replication of all blocks (by replicating the under-replicated blocks from the live nodes)
- the dead decommissioning node is transitioned to decommissioned

In this case, the node did go dead while decommissioning, but there was no data loss thanks to redundant block replicas. From the user perspective, the loss of the decommissioning node did not impact the outcome of the decommissioning process. Had the node not gone dead while decommissioning, the eventual outcome would have been the same: the node is decommissioned, there is no data loss, & all blocks have sufficient replicas.

If there is data loss, then a dead datanode will remain decommissioning, because if the dead node were to come alive again it may be able to recover the lost data. But if there is no data loss, then when the node comes alive again it will immediately be transitioned to decommissioned anyway, so why not make it decommissioned while it's still dead (and there is no data loss)?
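The decision argued above could be sketched roughly as follows. This is an illustrative stand-in, not the actual Hadoop implementation: the class, method, and `lowRedundancyBlocks` parameter are hypothetical names for the concept of "blocks that no longer meet minimum replication on live nodes".

```java
// Hypothetical sketch (not real Hadoop code) of the state decision above:
// a dead DECOMMISSION_IN_PROGRESS node can safely become DECOMMISSIONED
// only when none of its blocks are low-redundancy, i.e. there is no data
// loss that its restart could repair.
public class DeadNodeTransition {
    enum AdminState { DECOMMISSION_IN_PROGRESS, DECOMMISSIONED }

    // lowRedundancyBlocks counts blocks that no longer meet minimum
    // replication across the remaining live datanodes.
    static AdminState nextState(int lowRedundancyBlocks) {
        if (lowRedundancyBlocks == 0) {
            // No data loss: the node would transition immediately if it
            // came back alive, so complete the decommission now.
            return AdminState.DECOMMISSIONED;
        }
        // Data at risk: keep the node decommissioning in case it returns
        // with the missing replicas.
        return AdminState.DECOMMISSION_IN_PROGRESS;
    }
}
```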
Also, I don't think the priority queue is adding much complexity; it's just putting healthy nodes (with more recent heartbeat times) ahead of unhealthy nodes (with older heartbeat times) such that healthy nodes are decommissioned first.

----

I also want to call out another caveat with the approach of removing the node from the DatanodeAdminManager, which I uncovered while unit testing.

If we leave the node in DECOMMISSION_IN_PROGRESS & remove the node from DatanodeAdminManager, then the following callstack should re-add the datanode to the DatanodeAdminManager when it comes alive again:
- [DatanodeManager.registerDatanode](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1223)
- [DatanodeManager.startAdminOperationIfNecessary](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1109)
- [DatanodeAdminManager.startDecommission](https://github.com/apache/hadoop/blob/62c86eaa0e539a4307ca794e0fcd502a77ebceb8/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L187)
- [DatanodeAdminMonitorBase.startTrackingNode](https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java#L136)

The problem is this condition "!node.isDecommissionInProgress()": https://github.com/apache/hadoop/blob/62c86eaa0e539a4307ca794e0fcd502a77ebceb8/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminManager.java#L177

Because the dead datanode is left in DECOMMISSION_IN_PROGRESS, "startTrackingNode" is not invoked due to the "!node.isDecommissionInProgress()" condition.

Simply removing the condition "!node.isDecommissionInProgress()" will not function well because "startTrackingNode" is not idempotent:
- [startDecommission is invoked periodically when refreshDatanodes is called](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1339)
- [pendingNodes is an ArrayDeque which does not deduplicate the DatanodeDescriptor](https://github.com/apache/hadoop/blob/03cfc852791c14fad39db4e5b14104a276c08e59/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminMonitorBase.java#L43)
- therefore, removing the "!node.isDecommissionInProgress()" check will cause a large number of duplicate DatanodeDescriptors to be added to the DatanodeAdminManager

I can think of 2 obvious ways to handle this:
A) make calls to "startTrackingNode" idempotent (meaning that if the DatanodeDescriptor is already tracked, it does not get added to the DatanodeAdminManager again)
B) modify "startDecommission" so that it is aware of whether the invocation is for a datanode which was just restarted after being dead, such that it can still invoke "startTrackingNode" even though "node.isDecommissionInProgress()" is true

For A), the challenge is that we need to ensure the DatanodeDescriptor is not in "pendingReplication" or "outOfServiceBlocks", which could be a fairly costly check to execute repeatedly. Also, I am not even sure such a check is thread-safe given there is no locking used as part of "startDecommission" or "startTrackingNode".

For B), the awareness of whether a registerDatanode call is related to a [restarted datanode is available here](https://github.com/apache/hadoop/blob/db89a9411ebee11372314e82d7ea0606c348d014/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeManager.java#L1177).
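To make the two options concrete, here is a hedged sketch of both ideas in one toy class. All names are illustrative stand-ins, not the real Hadoop classes or signatures: option A is modeled by backing the pending queue with a Set so repeated tracking calls are no-ops, and option B by an `isNodeRestart` flag that bypasses the in-progress guard.

```java
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Toy model of the DatanodeAdminManager tracking problem; String stands in
// for DatanodeDescriptor.
public class AdminTrackerSketch {
    private final Queue<String> pendingNodes = new ArrayDeque<>();
    private final Set<String> tracked = new HashSet<>();
    private final Set<String> decommissionInProgress = new HashSet<>();

    // Option A: idempotent tracking. A node already tracked is not
    // enqueued again, so periodic refreshDatanodes invocations cannot
    // flood pendingNodes with duplicates.
    public synchronized boolean startTrackingNode(String node) {
        if (tracked.add(node)) {
            pendingNodes.add(node);
            return true;
        }
        return false;
    }

    // Option B: a restarted dead node is re-tracked even though it is
    // already DECOMMISSION_IN_PROGRESS; all other call sites pass false
    // to preserve the existing guard.
    public synchronized void startDecommission(String node, boolean isNodeRestart) {
        if (!decommissionInProgress.contains(node) || isNodeRestart) {
            decommissionInProgress.add(node);
            startTrackingNode(node);
        }
    }

    public synchronized int pendingCount() {
        return pendingNodes.size();
    }
}
```

Note the toy model glosses over the hard part called out above for option A: in the real code, "already tracked" would also have to consult pendingReplication and outOfServiceBlocks, and do so thread-safely.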
So this information would need to be passed down the callstack to a method "startDecommission(DatanodeDescriptor node, boolean isNodeRestart)". Because of the modified method signature, all the other invocations of "startDecommission" will need to specify isNodeRestart=false.

Given this additional hurdle in the approach of removing a dead datanode from the DatanodeAdminManager, are we sure it will be less complex/impactful than the proposed change?

----

In short:
- I don't think there is any downside in moving a dead datanode to decommissioned when there are no LowRedundancy blocks, because this would happen immediately anyway were the node to come back alive (and get re-added to the DatanodeAdminManager)
- the approach of removing a dead datanode from the DatanodeAdminManager will not work properly without some significant refactoring of the "startDecommission" method & related code

@sodonnel @virajjasani @aajisaka let me know your thoughts. I am still more in favor of tracking dead datanodes in the DatanodeAdminManager (when there are LowRedundancy blocks), but if the community thinks it's better to remove the dead datanodes from the DatanodeAdminManager, I can implement proposal B).

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: common-issues-unsubscr...@hadoop.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org

Issue Time Tracking
-------------------

Worklog Id: (was: 688303)
Time Spent: 4h 40m (was: 4.5h)

> Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning
> ------------------------------------------------------------------------------------------------------------
>
> Key: HDFS-16303
> URL: https://issues.apache.org/jira/browse/HDFS-16303
> Project: Hadoop HDFS
> Issue Type: Bug
> Affects Versions: 2.10.1, 3.3.1
> Reporter: Kevin Wikant
> Priority: Major
> Labels: pull-request-available
> Time Spent: 4h 40m
> Remaining Estimate: 0h
>
> h2. Impact
> HDFS datanode decommissioning does not make any forward progress. For example, the user adds X datanodes to the "dfs.hosts.exclude" file and all X of those datanodes remain in state decommissioning forever without making any forward progress towards being decommissioned.
> h2. Root Cause
> The HDFS Namenode class "DatanodeAdminManager" is responsible for decommissioning datanodes.
> As per this "hdfs-site" configuration:
> {quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes
> Default Value = 100
> The maximum number of decommission-in-progress datanodes that will be tracked at one time by the namenode. Tracking a decommission-in-progress datanode consumes additional NN memory proportional to the number of blocks on the datanode. Having a conservative limit reduces the potential impact of decommissioning a large number of nodes at once. A value of 0 means no limit will be enforced.
> {quote}
> The Namenode will only actively track up to 100 datanodes for decommissioning at any given time, so as to avoid Namenode memory pressure.
> Looking into the "DatanodeAdminManager" code:
> * a datanode is only removed from the "tracked.nodes" set when it finishes decommissioning
> * a datanode is only added to the "tracked.nodes" set if there are fewer than 100 datanodes being tracked
> So in the event that there are more than 100 datanodes being decommissioned at a given time, some of those datanodes will not be in the "tracked.nodes" set until 1 or more datanodes in "tracked.nodes" finish decommissioning. This is generally not a problem because the datanodes in "tracked.nodes" will eventually finish decommissioning, but there is an edge case where this logic prevents the namenode from making any forward progress towards decommissioning.
> If all 100 datanodes in "tracked.nodes" are unable to finish decommissioning, then other datanodes (which may be able to be decommissioned) will never get added to "tracked.nodes" and therefore will never get the opportunity to be decommissioned.
> This can occur due to the following issue:
> {quote}2021-10-21 12:39:24,048 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager (DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In Progress. Cannot be safely decommissioned or be in maintenance since there is risk of reduced data durability or data loss. Either restart the failed node or force decommissioning or maintenance by removing, calling refreshNodes, then re-adding to the excludes or host config files.
> {quote}
> If a Datanode is lost while decommissioning (for example if the underlying hardware fails or is lost), then it will remain in state decommissioning forever.
> If 100 or more Datanodes are lost while decommissioning over the Hadoop cluster lifetime, then this is enough to completely fill up the "tracked.nodes" set.
> With the entire "tracked.nodes" set filled with datanodes that can never finish decommissioning, any datanodes added after this point will never be able to be decommissioned because they will never be added to the "tracked.nodes" set.
> In this scenario:
> * the "tracked.nodes" set is filled with datanodes which are lost & cannot be recovered (and can never finish decommissioning, so they will never be removed from the set)
> * the actual live datanodes being decommissioned are enqueued waiting to enter the "tracked.nodes" set (and are stuck waiting indefinitely)
> This means that no progress towards decommissioning the live datanodes will be made unless the user takes the following action:
> {quote}Either restart the failed node or force decommissioning or maintenance by removing, calling refreshNodes, then re-adding to the excludes or host config files.
> {quote}
> Ideally, the Namenode should be able to gracefully handle scenarios where the datanodes in the "tracked.nodes" set are not making forward progress towards decommissioning while the enqueued datanodes may be able to make forward progress.
> h2. Reproduce Steps
> * create a Hadoop cluster
> * lose (i.e. terminate the host/process forever) over 100 datanodes while the datanodes are in state decommissioning
> * add additional datanodes to the cluster
> * attempt to decommission those new datanodes & observe that they are stuck in state decommissioning forever
> Note that in this example each datanode, over the full history of the cluster, has a unique IP address

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org