[ 
https://issues.apache.org/jira/browse/HDFS-16064?focusedWorklogId=780965&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-780965
 ]

ASF GitHub Bot logged work on HDFS-16064:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 14/Jun/22 02:02
            Start Date: 14/Jun/22 02:02
    Worklog Time Spent: 10m 
      Work Description: KevinWikant commented on PR #4410:
URL: https://github.com/apache/hadoop/pull/4410#issuecomment-1154627715

   Hi @ashutoshcipher, thank you for reviewing the PR! I have tried to address 
your comments below:
   
   ## I am thinking could there be a case where the block can show up in both 
liveReplicas and decommissioning, which could lead to an unnecessary call to 
invalidateCorruptReplicas()
   
   I am not sure it's possible for a block replica to be reported as both a 
liveReplica & a decommissioningReplica at the same time. It's my understanding 
that a block replica on a decommissioning datanode is counted as a 
decommissioningReplica & a non-corrupt replica on a live datanode is counted as 
a liveReplica. So a block replica on a decommissioning node will only be 
counted towards the decommissioningReplica count & not the liveReplica count. I 
have never seen the behavior you are mentioning in my experience; let me know 
if you have a reference JIRA.
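
   For what it's worth, my mental model is that each replica lands in exactly 
one bucket. Below is a minimal, self-contained sketch of that assumption 
(hypothetical code, not the actual namenode counting logic, which also handles 
maintenance, stale, and excess replicas):
   
   ```java
   // Hypothetical sketch only: illustrates the assumption that a replica is
   // counted as either live OR decommissioning, never both.
   import java.util.List;
   
   public class ReplicaCountSketch {
     enum NodeState { IN_SERVICE, DECOMMISSION_IN_PROGRESS }
   
     record Replica(NodeState nodeState, boolean corrupt) {}
   
     static void count(List<Replica> replicas) {
       int live = 0, decommissioning = 0, corrupt = 0;
       for (Replica r : replicas) {
         if (r.corrupt()) {
           corrupt++;                 // corrupt replicas are never usable
         } else if (r.nodeState() == NodeState.DECOMMISSION_IN_PROGRESS) {
           decommissioning++;         // counted here, not also as live
         } else {
           live++;                    // non-corrupt replica on an in-service node
         }
       }
       System.out.printf("live=%d decommissioning=%d corrupt=%d%n",
           live, decommissioning, corrupt);
     }
   
     public static void main(String[] args) {
       // One non-corrupt replica on a decommissioning node and one corrupt
       // replica on a live node: live=0, decommissioning=1, corrupt=1.
       count(List.of(
           new Replica(NodeState.DECOMMISSION_IN_PROGRESS, false),
           new Replica(NodeState.IN_SERVICE, true)));
     }
   }
   ```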
   
   If the case you described is possible, then in theory numUsableReplica would 
be greater than it should be. In the typical case where 
"dfs.namenode.replication.min = 1" this makes no difference, because even if 
there is only 1 non-corrupt block replica on 1 decommissioning node the corrupt 
blocks are invalidated regardless of whether numUsableReplica=1 or 
numUsableReplica=2 (due to double counting the replica as both a liveReplica & 
a decommissioningReplica). In the case where "dfs.namenode.replication.min > 1" 
there could arguably be a difference, because the corrupt replicas would not be 
invalidated if numUsableReplica=1 but would be invalidated if numUsableReplica=2 
(due to the same double counting).
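
   To make that arithmetic concrete, here is a hypothetical sketch of the kind 
of numUsableReplica comparison described above (simplified; the exact condition 
in the PR may differ in detail):
   
   ```java
   // Hypothetical sketch of the numUsableReplica comparison, not the exact
   // condition used in the PR.
   public class UsableReplicaSketch {
     static boolean canInvalidateCorrupt(int live, int decommissioning,
                                         int minReplication) {
       int numUsableReplica = live + decommissioning;
       return numUsableReplica >= minReplication;
     }
   
     public static void main(String[] args) {
       // dfs.namenode.replication.min = 1: double counting (usable 2 vs 1)
       // does not change the outcome.
       System.out.println(canInvalidateCorrupt(0, 1, 1)); // true
       System.out.println(canInvalidateCorrupt(1, 1, 1)); // true (double counted)
       // dfs.namenode.replication.min = 2: double counting flips the outcome.
       System.out.println(canInvalidateCorrupt(0, 1, 2)); // false
       System.out.println(canInvalidateCorrupt(1, 1, 2)); // true (double counted)
     }
   }
   ```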
   
   I think that if this scenario is possible, the correct fix would be to ensure 
that each block replica is only counted once, towards either liveReplica or 
decommissioningReplica. Let me know if there is a JIRA for this issue & I can 
potentially look into that bug fix separately.
   
   ## Edge case coming to my mind is when we are considering the block on a 
decommissioning node as usable but the very next moment, the node is 
decommissioned.
   
   Fair point, I had considered this & mentioned this edge case in the PR 
description:
   
   ```
   The only perceived risk here would be that the corrupt blocks are 
invalidated at around the same time the decommissioning and entering 
maintenance nodes are decommissioned. This could in theory bring the overall 
number of replicas below the "dfs.namenode.replication.min" (i.e. to 0 replicas 
in the worst case). This is however not an actual risk because the 
decommissioning and entering maintenance nodes will not finish decommissioning 
until there is a sufficient number of liveReplicas; so there is no possibility 
that the decommissioning and entering maintenance nodes will be decommissioned 
prematurely.
   ```
   
   Any replicas on decommissioning nodes will not be decommissioned until there 
are enough liveReplicas to satisfy the replication factor of the HDFS file. So 
it's only possible for decommissioningReplicas to be decommissioned at the same 
time the corruptReplicas are invalidated if there are sufficient liveReplicas 
to satisfy the replication factor; because of those live replicas it is safe to 
eliminate both the decommissioningReplicas & the corruptReplicas. If there is 
not a sufficient number of liveReplicas, then the decommissioningReplicas will 
not be decommissioned but the corruptReplicas will be invalidated; the block 
will not be lost because the decommissioningReplicas will still exist.
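
   A minimal sketch of the safety property I'm relying on here (hypothetical 
code; the real check is performed by the DatanodeAdminMonitor on the namenode):
   
   ```java
   // Hypothetical sketch: a block on a decommissioning node keeps that node in
   // DECOMMISSION_IN_PROGRESS until enough live replicas exist elsewhere, so
   // invalidating corrupt replicas cannot strand the block.
   public class DecommissionSafetySketch {
     static boolean blockAllowsDecommission(int liveReplicas, int expectedReplicas) {
       return liveReplicas >= expectedReplicas;
     }
   
     public static void main(String[] args) {
       // Matches the BlockStateChange log in the issue description below:
       // Expected Replicas: 2, live replicas: 1 -> decommission stays blocked.
       System.out.println(blockAllowsDecommission(1, 2)); // false
     }
   }
   ```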
   
   One case you could argue is that if:
   - corruptReplicas > 1
   - decommissioningReplicas = 1
   
   then by invalidating the corruptReplicas we are exposing the block to a risk 
of loss should the decommissioningReplica be unexpectedly terminated or fail. 
This is true, but the same risk already exists where:
   - corruptReplicas > 1
   - liveReplicas = 1
   
   and by invalidating those corruptReplicas we are exposing the block to a risk 
of loss should the liveReplica be unexpectedly terminated or fail.
   
   So I don't think this change introduces any new risk of data loss; in fact, 
it helps improve block redundancy in scenarios where a block cannot be 
sufficiently replicated because the corruptReplicas cannot be invalidated.




Issue Time Tracking
-------------------

    Worklog Id:     (was: 780965)
    Time Spent: 50m  (was: 40m)

> HDFS-721 causes DataNode decommissioning to get stuck indefinitely
> ------------------------------------------------------------------
>
>                 Key: HDFS-16064
>                 URL: https://issues.apache.org/jira/browse/HDFS-16064
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 3.2.1
>            Reporter: Kevin Wikant
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 50m
>  Remaining Estimate: 0h
>
> It seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a 
> non-issue under the assumption that if the namenode & a datanode get into an 
> inconsistent state for a given block pipeline, there should be another 
> datanode available to replicate the block to.
> While testing datanode decommissioning using "dfs.exclude.hosts", I have 
> encountered a scenario where the decommissioning gets stuck indefinitely.
> Below is the progression of events:
>  * there are initially 4 datanodes DN1, DN2, DN3, DN4
>  * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts"
>  * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in 
> order to satisfy their minimum replication factor of 2
>  * during this replication process 
> https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes 
> the following inconsistent state:
>  ** DN3 thinks it has the block pipeline in FINALIZED state
>  ** the namenode does not think DN3 has the block pipeline
> {code:java}
> 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
> (DataXceiver for client  at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): 
> DN3:9866:DataXceiver error processing WRITE_BLOCK operation  src: /DN2:45654 
> dst: /DN3:9866; 
> org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block 
> BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
> {code}
>  * the replication is attempted again, but:
>  ** DN4 has the block
>  ** DN1 and/or DN2 have the block, but don't count towards the minimum 
> replication factor because they are being decommissioned
>  ** DN3 does not have the block & cannot have the block replicated to it 
> because of HDFS-721
>  * the namenode repeatedly tries to replicate the block to DN3 & repeatedly 
> fails, this continues indefinitely
>  * therefore DN4 is the only live datanode with the block & the minimum 
> replication factor of 2 cannot be satisfied
>  * because the minimum replication factor cannot be satisfied for the 
> block(s) being moved off DN1 & DN2, the datanode decommissioning can never be 
> completed 
> {code:java}
> 2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0): 
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, 
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance 
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is 
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , 
> Current Datanode: DN1:9866, Is current datanode decommissioning: true, Is 
> current datanode entering maintenance: false
> ...
> 2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0): 
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, 
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance 
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is 
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , 
> Current Datanode: DN2:9866, Is current datanode decommissioning: true, Is 
> current datanode entering maintenance: false
> {code}
> Being stuck in decommissioning state forever is not an intended behavior of 
> DataNode decommissioning.
> A few potential solutions:
>  * Address the root cause of the problem which is an inconsistent state 
> between namenode & datanode: https://issues.apache.org/jira/browse/HDFS-721
>  * Detect when datanode decommissioning is stuck due to lack of available 
> datanodes for satisfying the minimum replication factor, then recover by 
> re-enabling the datanodes being decommissioned
>  


