[jira] [Commented] (HDFS-16064) Determine when to invalidate corrupt replicas based on number of usable replicas
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805302#comment-17805302 ]

Rushabh Shah commented on HDFS-16064:
-------------------------------------

[~KevinWikant] [~aajisaka] Any reason why we haven't backported this fix to branch-2.10?


> Determine when to invalidate corrupt replicas based on number of usable replicas
>
>                 Key: HDFS-16064
>                 URL: https://issues.apache.org/jira/browse/HDFS-16064
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode, namenode
>    Affects Versions: 3.2.1
>            Reporter: Kevin Wikant
>            Assignee: Kevin Wikant
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0, 3.2.4, 3.3.5
>
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a non-issue under the assumption that if the namenode & a datanode get into an inconsistent state for a given block pipeline, there should be another datanode available to replicate the block to.
> While testing datanode decommissioning using "dfs.exclude.hosts", I have encountered a scenario where the decommissioning gets stuck indefinitely.
> Below is the progression of events:
> * there are initially 4 datanodes: DN1, DN2, DN3, DN4
> * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts"
> * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in order to satisfy their minimum replication factor of 2
> * during this replication process https://issues.apache.org/jira/browse/HDFS-721 is encountered, which causes the following inconsistent state:
> ** DN3 thinks it has the block pipeline in FINALIZED state
> ** the namenode does not think DN3 has the block pipeline
> {code:java}
> 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (DataXceiver for client at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): DN3:9866:DataXceiver error processing WRITE_BLOCK operation src: /DN2:45654 dst: /DN3:9866; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
> {code}
> * the replication is attempted again, but:
> ** DN4 has the block
> ** DN1 and/or DN2 have the block, but don't count towards the minimum replication factor because they are being decommissioned
> ** DN3 does not have the block & cannot have the block replicated to it because of HDFS-721
> * the namenode repeatedly tries to replicate the block to DN3 & repeatedly fails; this continues indefinitely
> * therefore DN4 is the only live datanode with the block & the minimum replication factor of 2 cannot be satisfied
> * because the minimum replication factor cannot be satisfied for the block(s) being moved off DN1 & DN2, the datanode decommissioning can never be completed
> {code:java}
> 2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 2, maintenance replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , Current Datanode: DN1:9866, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
> ...
> 2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 2, maintenance replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , Current Datanode: DN2:9866, Is current datanode decommissioning: true, Is current datanode entering maintenance: false
> {code}
> Being stuck in decommissioning state forever is not an intended behavior of DataNode decommissioning.
> A few potential solutions:
> * Address the root cause of the problem, which is an inconsistent state between namenode & datanode: https://issues.apache.org/jira/browse/HDFS-721
> * Detect when datanode decommissioning is stuck due to lack of available datanodes for satisfying the minimum replication factor, then recover by re-enabling the datanodes being decommissioned
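A minimal worked example, plugging the numbers from the BlockStateChange log above into the "usable replica" accounting that HDFS-16064 introduces (live + decommissioning + live entering-maintenance). Variable names are illustrative only, not the actual BlockManager code:

{code:java}
// Numbers taken from the BlockStateChange log line above.
int expectedReplicas = 2;                 // replication factor of blk_XXX
int liveReplicas = 1;                     // DN4 only
int decommissioningReplicas = 2;          // DN1 and DN2
int liveEnteringMaintenanceReplicas = 0;

// Before HDFS-16064: only live replicas are considered, so the block never looks
// healthy enough for the namenode to invalidate DN3's stale replica, and the
// decommissioning of DN1/DN2 never completes.
boolean enoughBefore = liveReplicas >= expectedReplicas;                 // false

// With HDFS-16064: decommissioning and entering-maintenance replicas also count as
// usable, so the stale replica can be invalidated and then re-replicated to DN3.
int numUsableReplicas = liveReplicas + decommissioningReplicas
    + liveEnteringMaintenanceReplicas;
boolean enoughAfter = numUsableReplicas >= expectedReplicas;             // true
{code}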
[jira] [Commented] (HDFS-16064) Determine when to invalidate corrupt replicas based on number of usable replicas
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805315#comment-17805315 ]

ASF GitHub Bot commented on HDFS-16064:
---------------------------------------

zz12341 opened a new pull request, #6437:
URL: https://github.com/apache/hadoop/pull/6437

   …

   ### Description of PR
   [HDFS-16064](https://github.com/apache/hadoop/pull/4410) fixed an issue where decommissioning replicas were not counted as usable replicas, which caused decommissioning to get stuck forever on small clusters. We are seeing the same issue on 2.10 and are therefore backporting the change.

   ### How was this patch tested?

   ### For code changes:
   - [x] Does the title of this PR start with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')?
   - [ ] Object storage: have the integration tests been executed and the endpoint declared according to the connector-specific documentation?
   - [ ] If adding new dependencies to the code, are these dependencies licensed in a way that is compatible for inclusion under [ASF 2.0](http://www.apache.org/legal/resolved.html#category-a)?
   - [ ] If applicable, have you updated the `LICENSE`, `LICENSE-binary`, `NOTICE-binary` files?
[jira] [Commented] (HDFS-16064) Determine when to invalidate corrupt replicas based on number of usable replicas
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805317#comment-17805317 ]

ASF GitHub Bot commented on HDFS-16064:
---------------------------------------

shahrs87 commented on code in PR #6437:
URL: https://github.com/apache/hadoop/pull/6437#discussion_r1448086058

   ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java: ##

   @@ -791,15 +791,33 @@ public short getMinReplication() {
        return minReplication;
      }
   
   +  public short getMinStorageNum(BlockInfo block) {

   Review Comment:
   @zz12341 Why do we want these changes? I don't see them in the original patch [here](https://github.com/apache/hadoop/pull/4410/files).
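For context on the hunk being questioned: trunk's BlockManager carries a getMinStorageNum helper because erasure-coded (striped) blocks need a different minimum than dfs.namenode.replication.min, while branch-2.10 has no striped blocks in this path. Roughly, the trunk helper looks like the sketch below (paraphrased from the trunk sources, not the branch-2.10 code):

{code:java}
// Trunk-only sketch: minimum number of storages required for a block.
public short getMinStorageNum(BlockInfo block) {
  if (block.isStriped()) {
    // Striped (erasure-coded) blocks need at least their real number of data blocks.
    return ((BlockInfoStriped) block).getRealDataBlockNum();
  } else {
    // Contiguous blocks fall back to dfs.namenode.replication.min.
    return minReplication;
  }
}
{code}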
[jira] [Commented] (HDFS-16064) Determine when to invalidate corrupt replicas based on number of usable replicas
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805324#comment-17805324 ]

ASF GitHub Bot commented on HDFS-16064:
---------------------------------------

zz12341 commented on code in PR #6437:
URL: https://github.com/apache/hadoop/pull/6437#discussion_r1448119281

   ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java: ##

   @@ -791,15 +791,33 @@ public short getMinReplication() {
        return minReplication;
      }
   
   +  public short getMinStorageNum(BlockInfo block) {

   Review Comment:
   I was originally trying to keep this in sync with what the trunk branch is doing:
   https://github.com/apache/hadoop/pull/4410/files#diff-305ecf45a0f0708849b5e3c0d21a56c681db3a1497e52a19ef24939278dc99feL1922-R1926
   Let me revert this change.
[jira] [Commented] (HDFS-16064) Determine when to invalidate corrupt replicas based on number of usable replicas
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805579#comment-17805579 ]

Kevin Wikant commented on HDFS-16064:
-------------------------------------

{quote}Any reason why we haven't backported this fix to branch-2.10?
{quote}
Back in 2022, I did try to backport this change to the 2.10.1 branch & encountered a unit test failure due to behavior that is inconsistent with Hadoop 3.x:
{quote}> mvn test -Dtest=TestDecommission
...
[ERROR] Tests run: 27, Failures: 0, Errors: 1, Skipped: 1, Time elapsed: 263.603 s <<< FAILURE! - in org.apache.hadoop.hdfs.TestDecommission
[ERROR] testDeleteCorruptReplicaForUnderReplicatedBlock(org.apache.hadoop.hdfs.TestDecommission)  Time elapsed: 60.462 s  <<< ERROR!
java.lang.Exception: test timed out after 60000 milliseconds
    at java.lang.Thread.sleep(Native Method)
    at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:366)
    at org.apache.hadoop.hdfs.TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock(TestDecommission.java:1918)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
    at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
    at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
    at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
    at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{quote}
I do not remember all the root-cause details, but from my notes:
* "The inconsistent behavior has to do with when Datanodes in the MiniDFSCluster are sending full block reports vs incremental block reports and how that gets handled by the Namenode. Also, the triggerBlockReport method does not work in a MiniDFSCluster (i.e. no block report is sent) and there is no way to control sending of incremental vs full block reports."

These Hadoop 2.x behavior differences in the Namenode/Datanode/MiniDFSCluster were never fully root-caused & addressed, so this bug fix was only backported to Hadoop 3.x, which was sufficient for our needs.
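For readers unfamiliar with the failure mode above: the test hangs inside a polling wait and is eventually killed by the JUnit timeout. A hypothetical sketch of that kind of wait is shown below — illustrative only, not the actual TestDecommission code; it assumes branch-2.10's Supplier-based GenericTestUtils.waitFor and made-up dn1Info/dn2Info variables for the DatanodeInfo objects of the two nodes being decommissioned:

{code:java}
import com.google.common.base.Supplier;            // assumption: Guava Supplier on branch-2.10
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.test.GenericTestUtils;

// ... inside a test method on a running MiniDFSCluster ...
GenericTestUtils.waitFor(new Supplier<Boolean>() {
  @Override
  public Boolean get() {
    // On branch-2.10 this condition never became true, because the namenode's view of
    // the replicas depends on full vs. incremental block report timing, which differs
    // from Hadoop 3.x in MiniDFSCluster.
    return dn1Info.isDecommissioned() && dn2Info.isDecommissioned();
  }
}, 500, 60000);   // poll every 500 ms, for up to 60 s
{code}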
[jira] [Commented] (HDFS-16064) Determine when to invalidate corrupt replicas based on number of usable replicas
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805699#comment-17805699 ]

ASF GitHub Bot commented on HDFS-16064:
---------------------------------------

shahrs87 commented on code in PR #6437:
URL: https://github.com/apache/hadoop/pull/6437#discussion_r1449133212

   ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java: ##

   @@ -1463,21 +1463,28 @@ private void markBlockAsCorrupt(BlockToMarkCorrupt b,
            b.getReason(), b.getReasonCode());
   
        NumberReplicas numberOfReplicas = countNodes(b.getStored());
   -    boolean hasEnoughLiveReplicas = numberOfReplicas.liveReplicas() >=
   -        expectedReplicas;
   +    final int numUsableReplicas = numberOfReplicas.liveReplicas() +
   +        numberOfReplicas.decommissioning() +
   +        numberOfReplicas.liveEnteringMaintenanceReplicas();
   +    boolean hasEnoughLiveReplicas = numUsableReplicas >=
   +        expectedRedundancies;
   +
        boolean minReplicationSatisfied = numberOfReplicas.liveReplicas() >=
            minReplication;

   Review Comment:
   ```suggestion
           numUsableReplicas >= minReplication;
   ```
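Putting the hunk and the review suggestion together, the new code in markBlockAsCorrupt would read roughly as follows once the suggestion is applied — a sketch of this fragment only, with the rest of the method unchanged and elided:

{code:java}
    NumberReplicas numberOfReplicas = countNodes(b.getStored());
    // Usable replicas: live, decommissioning, and live entering-maintenance replicas.
    final int numUsableReplicas = numberOfReplicas.liveReplicas() +
        numberOfReplicas.decommissioning() +
        numberOfReplicas.liveEnteringMaintenanceReplicas();
    boolean hasEnoughLiveReplicas = numUsableReplicas >= expectedRedundancies;
    // Per the review suggestion: compare the usable count, not just live replicas,
    // against the configured minimum replication as well.
    boolean minReplicationSatisfied = numUsableReplicas >= minReplication;
{code}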
[jira] [Commented] (HDFS-16064) Determine when to invalidate corrupt replicas based on number of usable replicas
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805704#comment-17805704 ]

Rushabh Shah commented on HDFS-16064:
-------------------------------------

Thank you [~KevinWikant] for your reply. We also hit this bug recently in our production cluster. One of my colleagues is trying to backport this patch to branch-2.10. Let us see if we have any luck backporting this change.
[jira] [Commented] (HDFS-16064) Determine when to invalidate corrupt replicas based on number of usable replicas
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805723#comment-17805723 ]

Rushabh Shah commented on HDFS-16064:
-------------------------------------

[~KevinWikant] By any chance, do you have a draft PR for branch-2.10 saved?
[jira] [Commented] (HDFS-16064) Determine when to invalidate corrupt replicas based on number of usable replicas
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805728#comment-17805728 ]

ASF GitHub Bot commented on HDFS-16064:
---------------------------------------

shahrs87 commented on PR #6437:
URL: https://github.com/apache/hadoop/pull/6437#issuecomment-1887620199

   @zz12341 [branch-2.10](https://github.com/apache/hadoop/tree/branch-2.10) is the active branch for 2.10 (not [branch-2.10.0](https://github.com/apache/hadoop/tree/branch-2.10.0)). You can see all the active branches [here](https://github.com/apache/hadoop/branches).
[jira] [Commented] (HDFS-16064) Determine when to invalidate corrupt replicas based on number of usable replicas
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805742#comment-17805742 ]

ASF GitHub Bot commented on HDFS-16064:
---------------------------------------

shahrs87 commented on PR #6437:
URL: https://github.com/apache/hadoop/pull/6437#issuecomment-1887669346

   ![Screenshot 2024-01-11 at 9 45 49 AM](https://github.com/apache/hadoop/assets/5529085/8b1fbfaf-17bf-49d2-87bc-b413ae94d2e6)

   @zz12341 This change in `BlockManager#addStoredBlock` is missing from the original patch? https://github.com/apache/hadoop/pull/4410
[jira] [Commented] (HDFS-16064) Determine when to invalidate corrupt replicas based on number of usable replicas
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17805765#comment-17805765 ]

ASF GitHub Bot commented on HDFS-16064:
---------------------------------------

shahrs87 commented on PR #6437:
URL: https://github.com/apache/hadoop/pull/6437#issuecomment-1887735325

   > [HDFS-16064] backporting HDFS-16064. Determine when to invalidate corrupt replicas based on number of usable replicas (#4410) #6437

   @zz12341 Keep the title and the commit message the same as https://github.com/apache/hadoop/pull/4410. We shouldn't add a "backporting" keyword to the title or the commit message.