[jira] [Commented] (HDFS-16064) Determine when to invalidate corrupt replicas based on number of usable replicas
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805579#comment-17805579 ] Kevin Wikant commented on HDFS-16064: - {quote}Any reason why we haven't backported this fix to branch-2.10? {quote} Back in 2022, I did try to backport this change to the 2.10.1 branch & encountered a unit test failure due to inconsistent behavior when compared to Hadoop 3.x {quote}> mvn test -Dtest=TestDecommission ... [ERROR] Tests run: 27, Failures: 0, Errors: 1, Skipped: 1, Time elapsed: 263.603 s <<< FAILURE! - in org.apache.hadoop.hdfs.TestDecommission [ERROR] testDeleteCorruptReplicaForUnderReplicatedBlock(org.apache.hadoop.hdfs.TestDecommission) Time elapsed: 60.462 s <<< ERROR! java.lang.Exception: test timed out after 6 milliseconds at java.lang.Thread.sleep(Native Method) at org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:366) at org.apache.hadoop.hdfs.TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock(TestDecommission.java:1918) at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) at java.lang.reflect.Method.invoke(Method.java:498) at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47) at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44) at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) at org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74) {quote} I do not remember all the root cause details, but from my notes: * "The inconsistent behavior has to do with when Datanodes in the MiniDFSCluster are sending full block reports vs incremental block reports and how that gets handled by the Namenode. Also, the triggerBlockReport method does not work in a MiniDFSCluster (i.e. no block report is sent) and there is no way to control sending of incremental vs full block reports." These Hadoop 2.x behavior differences in the Namenode/Datanode/MiniDFSCluster were not fully root caused & addressed, so this bug fix was only backported to Hadoop 3.x, which was sufficient for our needs. 
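For reference, the Hadoop 3.x version of the test relies on the MiniDFSCluster test hooks below to force block reports; per the note above, these hooks did not appear to result in a block report being sent on the 2.10 line. This is a minimal, hypothetical sketch (the class name and cluster setup are illustrative, not taken from the actual test):

{code:java}
// Hypothetical sketch: forcing full block reports in a MiniDFSCluster-based test.
// On branch-2.10 these hooks reportedly did not result in a block report being sent.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;
import org.apache.hadoop.hdfs.server.datanode.DataNode;
import org.apache.hadoop.hdfs.server.datanode.DataNodeTestUtils;

public class BlockReportTriggerSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new HdfsConfiguration();
    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(3).build();
    try {
      cluster.waitActive();
      // Ask every datanode to send a full block report to the namenode.
      cluster.triggerBlockReports();
      // Or target a single datanode.
      DataNode dn = cluster.getDataNodes().get(0);
      DataNodeTestUtils.triggerBlockReport(dn);
    } finally {
      cluster.shutdown();
    }
  }
}
{code}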
> Determine when to invalidate corrupt replicas based on number of usable > replicas > > > Key: HDFS-16064 > URL: https://issues.apache.org/jira/browse/HDFS-16064 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, namenode >Affects Versions: 3.2.1 >Reporter: Kevin Wikant >Assignee: Kevin Wikant >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0, 3.2.4, 3.3.5 > > Time Spent: 2h > Remaining Estimate: 0h > > Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a > non-issue under the assumption that if the namenode & a datanode get into an > inconsistent state for a given block pipeline, there should be another > datanode available to replicate the block to > While testing datanode decommissioning using "dfs.exclude.hosts", I have > encountered a scenario where the decommissioning gets stuck indefinitely > Below is the progression of events: > * there are initially 4 datanodes DN1, DN2, DN3, DN4 > * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts" > * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in > order to satisfy their minimum replication factor of 2 > * during this replication process > https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes > the following inconsistent state: > ** DN3 thinks it has the block pipeline in FINALIZED state > ** the namenode does not think DN3 has the block pipeline > {code:java} > 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode > (DataXceiver for client at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): > DN3:9866:DataXceiver error processing WRITE_BLOCK operation src: /DN2:45654 > dst: /DN3:9866; > org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block > BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created. > {code} > * the replication is attempted again, but: > ** DN4 has the block > ** DN1 and/or DN2 have the block, but don't count towards the minimum > replication factor because they are being decommissioned > ** DN3 does not have the block & cannot have the block replicated to it > because of HDFS-721 > * the namenode
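As an illustration of the scale-down flow described in the HDFS-16064 description above, the sketch below shows how an exclude file is typically wired up and refreshed against a MiniDFSCluster. The file handling, cluster size, and helper flow are assumptions for illustration, not the original reproduction steps.

{code:java}
// Hypothetical sketch: starting datanode decommissioning by listing nodes in the
// exclude file referenced by "dfs.hosts.exclude" and asking the namenode to re-read it.
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.HdfsConfiguration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

public class DecommissionSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new HdfsConfiguration();
    Path excludeFile = Files.createTempFile("dfs-exclude", ".txt");
    conf.set(DFSConfigKeys.DFS_HOSTS_EXCLUDE, excludeFile.toString());
    conf.setInt(DFSConfigKeys.DFS_REPLICATION_KEY, 2); // replication factor of 2, as above

    MiniDFSCluster cluster = new MiniDFSCluster.Builder(conf).numDataNodes(4).build();
    try {
      cluster.waitActive();
      // Exclude DN1 and DN2, then tell the namenode to re-read the exclude file.
      String entries = cluster.getDataNodes().get(0).getDatanodeId().getXferAddr() + "\n"
          + cluster.getDataNodes().get(1).getDatanodeId().getXferAddr() + "\n";
      Files.write(excludeFile, entries.getBytes(StandardCharsets.UTF_8));
      cluster.getNamesystem().getBlockManager().getDatanodeManager().refreshNodes(conf);
      // DN1/DN2 now move to "Decommission In Progress" while their blocks are
      // re-replicated to DN3/DN4.
    } finally {
      cluster.shutdown();
    }
  }
}
{code}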
[jira] [Updated] (HDFS-16664) Use correct GenerationStamp when invalidating corrupt block replicas
[ https://issues.apache.org/jira/browse/HDFS-16664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Wikant updated HDFS-16664: Description: While trying to backport HDFS-16064 to an older Hadoop version, the new unit test "testDeleteCorruptReplicaForUnderReplicatedBlock" started failing unexpectedly. Upon deep diving into this unit test failure, I identified a bug in HDFS corrupt replica invalidation which results in the following datanode exception: {quote}2022-07-16 08:07:52,041 [BP-958471676-X-1657973243350 heartbeating to localhost/127.0.0.1:61365] WARN datanode.DataNode (BPServiceActor.java:processCommand(887)) - Error processing datanode Command java.io.IOException: Failed to delete 1 (out of 1) replica(s): 0) Failed to delete replica blk_1073741825_1005: GenerationStamp not matched, existing replica is blk_1073741825_1001 at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2139) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2034) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:735) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:883) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:678) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:849) at java.lang.Thread.run(Thread.java:750) {quote} The issue is that the Namenode is sending the wrong generationStamp to the datanode. By adding some additional logs, I was able to determine the root cause for this: * the generationStamp sent in the DNA_INVALIDATE is based on the [generationStamp of the block sent in the block report|https://github.com/apache/hadoop/blob/8774f178686487007dcf8c418c989b785a529000/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3733] * the problem is that the datanode with the corrupt block replica (that receives the DNA_INVALIDATE) is not necessarily the same datanode that sent the block report * this can cause the above exception when the corrupt block replica on the datanode receiving the DNA_INVALIDATE & the block replica on the datanode that sent the block report have different generationStamps The solution is to store the corrupt replica's generationStamp in the CorruptReplicasMap, then to extract this correct generationStamp value when sending the DNA_INVALIDATE to the datanode h2. Failed Test - Before the fix {quote}> mvn test -Dtest=TestDecommission#testDeleteCorruptReplicaForUnderReplicatedBlock [INFO] Results: [INFO] [ERROR] Failures: [ERROR] TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock:2035 Node 127.0.0.1:61366 failed to complete decommissioning. 
numTrackedNodes=1 , numPendingNodes=0 , adminState=Decommission In Progress , nodesWithReplica=[127.0.0.1:61366, 127.0.0.1:61419] {quote} Logs: {quote}> cat target/surefire-reports/org.apache.hadoop.hdfs.TestDecommission-output.txt | grep 'Expected Replicas:|XXX|FINALIZED|Block now|Failed to delete' 2022-07-16 08:07:45,891 [Listener at localhost/61378] INFO hdfs.TestDecommission (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1942)) - Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 1 live replica on 127.0.0.1:61366 2022-07-16 08:07:45,913 [Listener at localhost/61378] INFO hdfs.TestDecommission (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1974)) - Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 1 decommissioning replica on 127.0.0.1:61366 XXX invalidateBlock dn=127.0.0.1:61415 , blk=1073741825_1001 XXX postponeBlock dn=127.0.0.1:61415 , blk=1073741825_1001 XXX invalidateBlock dn=127.0.0.1:61419 , blk=1073741825_1003 XXX addToInvalidates dn=127.0.0.1:61419 , blk=1073741825_1003 XXX addBlocksToBeInvalidated dn=127.0.0.1:61419 , blk=1073741825_1003 XXX rescanPostponedMisreplicatedBlocks blk=1073741825_1005 XXX DNA_INVALIDATE dn=/127.0.0.1:61419 , blk=1073741825_1003 XXX invalidate(on DN) dn=/127.0.0.1:61419 , invalidBlk=blk_1073741825_1003 , blkByIdAndGenStamp = FinalizedReplica, blk_1073741825_1003, FINALIZED 2022-07-16 08:07:49,084 [BP-958471676-X-1657973243350 heartbeating to localhost/127.0.0.1:61365] INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:deleteAsync(226)) - Scheduling blk_1073741825_1003 replica FinalizedReplica, blk_1073741825_1003, FINALIZED XXX addBlock dn=127.0.0.1:61419 , blk=1073741825_1005 *<<< block report is coming from 127.0.0.1:61419 which has genStamp=1005* XXX invalidateCorruptReplicas
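Restating the proposed fix as code: remember the corrupt replica's own generation stamp when the replica is marked corrupt, and use that stored value when building the DNA_INVALIDATE for that datanode. The following is a simplified, hypothetical sketch of the idea only; it is not the actual CorruptReplicasMap/BlockManager change.

{code:java}
// Hypothetical sketch: track the generation stamp of each corrupt replica per datanode,
// so the invalidate command sent to a datanode carries that replica's own genstamp
// rather than the genstamp of whichever replica happened to be block-reported last.
import java.util.HashMap;
import java.util.Map;

class CorruptReplicaGenStampTrackerSketch {
  /** blockId -> (datanodeUuid -> generation stamp of the corrupt replica on that node). */
  private final Map<Long, Map<String, Long>> corruptReplicas = new HashMap<>();

  void markCorrupt(long blockId, String datanodeUuid, long replicaGenStamp) {
    corruptReplicas.computeIfAbsent(blockId, k -> new HashMap<>())
        .put(datanodeUuid, replicaGenStamp);
  }

  /** Generation stamp to place in the DNA_INVALIDATE for this datanode's replica. */
  long genStampForInvalidate(long blockId, String datanodeUuid, long reportedGenStamp) {
    Map<String, Long> perNode = corruptReplicas.get(blockId);
    if (perNode != null && perNode.containsKey(datanodeUuid)) {
      return perNode.get(datanodeUuid); // the stored, per-replica genstamp
    }
    return reportedGenStamp; // fall back to the block-report genstamp (pre-fix behaviour)
  }
}
{code}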
[jira] [Comment Edited] (HDFS-16664) Use correct GenerationStamp when invalidating corrupt block replicas
[ https://issues.apache.org/jira/browse/HDFS-16664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567522#comment-17567522 ] Kevin Wikant edited comment on HDFS-16664 at 7/16/22 5:14 PM: -- The test "testDeleteCorruptReplicaForUnderReplicatedBlock" is failing when backporting to: [https://github.com/apache/hadoop/tree/branch-3.2.1] See section "Why does unit test failure not reproduce in Hadoop trunk?" for additional details was (Author: kevinwikant): The test "testDeleteCorruptReplicaForUnderReplicatedBlock" is failing when backporting to: [https://github.com/apache/hadoop/tree/branch-3.2.1] > Use correct GenerationStamp when invalidating corrupt block replicas > > > Key: HDFS-16664 > URL: https://issues.apache.org/jira/browse/HDFS-16664 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kevin Wikant >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > While trying to backport HDFS-16064 to an older Hadoop version, the new unit > test "testDeleteCorruptReplicaForUnderReplicatedBlock" started failing > unexpectedly. > Upon deep diving this unit test failure, I identified a bug in HDFS corrupt > replica invalidation which results in the following datanode exception: > {quote}2022-07-16 08:07:52,041 [BP-958471676-X-1657973243350 heartbeating to > localhost/127.0.0.1:61365] WARN datanode.DataNode > (BPServiceActor.java:processCommand(887)) - Error processing datanode Command > java.io.IOException: Failed to delete 1 (out of 1) replica(s): > 0) Failed to delete replica blk_1073741825_1005: GenerationStamp not matched, > existing replica is blk_1073741825_1001 > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2139) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2034) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:735) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:883) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:678) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:849) > at java.lang.Thread.run(Thread.java:750) > {quote} > The issue is that the Namenode is sending wrong generationStamp to the > datanode. By adding some additional logs, I was able to determine the root > cause for this: > * the generationStamp sent in the DNA_INVALIDATE is based on the > [generationStamp of the block sent in the block > report|https://github.com/apache/hadoop/blob/8774f178686487007dcf8c418c989b785a529000/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3733] > * the problem is that the datanode with the corrupt block replica (that > receives the DNA_INVALIDATE) is not necissarily the same datanode that sent > the block report > * this can cause the above exception when the corrupt block replica on the > datanode receiving the DNA_INVALIDATE & the block replica on the datanode > that sent the block report have different generationStamps > The solution is to store the corrupt replicas generationStamp in the > CorruptReplicasMap, then to extract this correct generationStamp value when > sending the DNA_INVALIDATE to the datanode > > h2. 
Failed Test - Before the fix > {quote}> mvn test > -Dtest=TestDecommission#testDeleteCorruptReplicaForUnderReplicatedBlock > > [INFO] Results: > [INFO] > [ERROR] Failures: > [ERROR] > TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock:2035 Node > 127.0.0.1:61366 failed to complete decommissioning. numTrackedNodes=1 , > numPendingNodes=0 , adminState=Decommission In Progress , > nodesWithReplica=[127.0.0.1:61366, 127.0.0.1:61419] > {quote} > Logs: > {quote}> cat > target/surefire-reports/org.apache.hadoop.hdfs.TestDecommission-output.txt | > grep 'Expected Replicas:|XXX|FINALIZED|Block now|Failed to delete' > 2022-07-16 08:07:45,891 [Listener at localhost/61378] INFO > hdfs.TestDecommission > (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1942)) > - Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and > 1 live replica on 127.0.0.1:61366 > 2022-07-16 08:07:45,913 [Listener at localhost/61378] INFO > hdfs.TestDecommission > (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1974)) > - Block now has 2 corrupt replicas on
[jira] [Commented] (HDFS-16664) Use correct GenerationStamp when invalidating corrupt block replicas
[ https://issues.apache.org/jira/browse/HDFS-16664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567522#comment-17567522 ] Kevin Wikant commented on HDFS-16664: - The issue is occurring when backporting to: https://github.com/apache/hadoop/tree/branch-3.2.1 > Use correct GenerationStamp when invalidating corrupt block replicas > > > Key: HDFS-16664 > URL: https://issues.apache.org/jira/browse/HDFS-16664 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kevin Wikant >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > While trying to backport HDFS-16064 to an older Hadoop version, the new unit > test "testDeleteCorruptReplicaForUnderReplicatedBlock" started failing > unexpectedly. > Upon deep diving this unit test failure, I identified a bug in HDFS corrupt > replica invalidation which results in the following datanode exception: > {quote}2022-07-16 08:07:52,041 [BP-958471676-X-1657973243350 heartbeating to > localhost/127.0.0.1:61365] WARN datanode.DataNode > (BPServiceActor.java:processCommand(887)) - Error processing datanode Command > java.io.IOException: Failed to delete 1 (out of 1) replica(s): > 0) Failed to delete replica blk_1073741825_1005: GenerationStamp not matched, > existing replica is blk_1073741825_1001 > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2139) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2034) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:735) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:883) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:678) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:849) > at java.lang.Thread.run(Thread.java:750) > {quote} > The issue is that the Namenode is sending wrong generationStamp to the > datanode. By adding some additional logs, I was able to determine the root > cause for this: > * the generationStamp sent in the DNA_INVALIDATE is based on the > [generationStamp of the block sent in the block > report|https://github.com/apache/hadoop/blob/8774f178686487007dcf8c418c989b785a529000/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3733] > * the problem is that the datanode with the corrupt block replica (that > receives the DNA_INVALIDATE) is not necissarily the same datanode that sent > the block report > * this can cause the above exception when the corrupt block replica on the > datanode receiving the DNA_INVALIDATE & the block replica on the datanode > that sent the block report have different generationStamps > The solution is to store the corrupt replicas generationStamp in the > CorruptReplicasMap, then to extract this correct generationStamp value when > sending the DNA_INVALIDATE to the datanode > > h2. Failed Test - Before the fix > {quote}> mvn test > -Dtest=TestDecommission#testDeleteCorruptReplicaForUnderReplicatedBlock > > [INFO] Results: > [INFO] > [ERROR] Failures: > [ERROR] > TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock:2035 Node > 127.0.0.1:61366 failed to complete decommissioning. 
numTrackedNodes=1 , > numPendingNodes=0 , adminState=Decommission In Progress , > nodesWithReplica=[127.0.0.1:61366, 127.0.0.1:61419] > {quote} > Logs: > {quote}> cat > target/surefire-reports/org.apache.hadoop.hdfs.TestDecommission-output.txt | > grep 'Expected Replicas:|XXX|FINALIZED|Block now|Failed to delete' > 2022-07-16 08:07:45,891 [Listener at localhost/61378] INFO > hdfs.TestDecommission > (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1942)) > - Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and > 1 live replica on 127.0.0.1:61366 > 2022-07-16 08:07:45,913 [Listener at localhost/61378] INFO > hdfs.TestDecommission > (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1974)) > - Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and > 1 decommissioning replica on 127.0.0.1:61366 > XXX invalidateBlock dn=127.0.0.1:61415 , blk=1073741825_1001 > XXX postponeBlock dn=127.0.0.1:61415 , blk=1073741825_1001 > XXX invalidateBlock dn=127.0.0.1:61419 , blk=1073741825_1003 > XXX addToInvalidates dn=127.0.0.1:61419 , blk=1073741825_1003 > XXX addBlocksToBeInvalidated
[jira] [Comment Edited] (HDFS-16664) Use correct GenerationStamp when invalidating corrupt block replicas
[ https://issues.apache.org/jira/browse/HDFS-16664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567522#comment-17567522 ] Kevin Wikant edited comment on HDFS-16664 at 7/16/22 5:13 PM: -- The test "testDeleteCorruptReplicaForUnderReplicatedBlock" is failing when backporting to: [https://github.com/apache/hadoop/tree/branch-3.2.1] was (Author: kevinwikant): The issue is occurring when backporting to: https://github.com/apache/hadoop/tree/branch-3.2.1 > Use correct GenerationStamp when invalidating corrupt block replicas > > > Key: HDFS-16664 > URL: https://issues.apache.org/jira/browse/HDFS-16664 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kevin Wikant >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > While trying to backport HDFS-16064 to an older Hadoop version, the new unit > test "testDeleteCorruptReplicaForUnderReplicatedBlock" started failing > unexpectedly. > Upon deep diving this unit test failure, I identified a bug in HDFS corrupt > replica invalidation which results in the following datanode exception: > {quote}2022-07-16 08:07:52,041 [BP-958471676-X-1657973243350 heartbeating to > localhost/127.0.0.1:61365] WARN datanode.DataNode > (BPServiceActor.java:processCommand(887)) - Error processing datanode Command > java.io.IOException: Failed to delete 1 (out of 1) replica(s): > 0) Failed to delete replica blk_1073741825_1005: GenerationStamp not matched, > existing replica is blk_1073741825_1001 > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2139) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2034) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:735) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:883) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:678) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:849) > at java.lang.Thread.run(Thread.java:750) > {quote} > The issue is that the Namenode is sending wrong generationStamp to the > datanode. By adding some additional logs, I was able to determine the root > cause for this: > * the generationStamp sent in the DNA_INVALIDATE is based on the > [generationStamp of the block sent in the block > report|https://github.com/apache/hadoop/blob/8774f178686487007dcf8c418c989b785a529000/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3733] > * the problem is that the datanode with the corrupt block replica (that > receives the DNA_INVALIDATE) is not necissarily the same datanode that sent > the block report > * this can cause the above exception when the corrupt block replica on the > datanode receiving the DNA_INVALIDATE & the block replica on the datanode > that sent the block report have different generationStamps > The solution is to store the corrupt replicas generationStamp in the > CorruptReplicasMap, then to extract this correct generationStamp value when > sending the DNA_INVALIDATE to the datanode > > h2. 
Failed Test - Before the fix > {quote}> mvn test > -Dtest=TestDecommission#testDeleteCorruptReplicaForUnderReplicatedBlock > > [INFO] Results: > [INFO] > [ERROR] Failures: > [ERROR] > TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock:2035 Node > 127.0.0.1:61366 failed to complete decommissioning. numTrackedNodes=1 , > numPendingNodes=0 , adminState=Decommission In Progress , > nodesWithReplica=[127.0.0.1:61366, 127.0.0.1:61419] > {quote} > Logs: > {quote}> cat > target/surefire-reports/org.apache.hadoop.hdfs.TestDecommission-output.txt | > grep 'Expected Replicas:|XXX|FINALIZED|Block now|Failed to delete' > 2022-07-16 08:07:45,891 [Listener at localhost/61378] INFO > hdfs.TestDecommission > (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1942)) > - Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and > 1 live replica on 127.0.0.1:61366 > 2022-07-16 08:07:45,913 [Listener at localhost/61378] INFO > hdfs.TestDecommission > (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1974)) > - Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and > 1 decommissioning replica on 127.0.0.1:61366 > XXX invalidateBlock dn=127.0.0.1:61415 , blk=1073741825_1001 >
[jira] [Updated] (HDFS-16664) Use correct GenerationStamp when invalidating corrupt block replicas
[ https://issues.apache.org/jira/browse/HDFS-16664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Wikant updated HDFS-16664: Summary: Use correct GenerationStamp when invalidating corrupt block replicas (was: Use correct GenerationStamp when invalidating corrupt block replica) > Use correct GenerationStamp when invalidating corrupt block replicas > > > Key: HDFS-16664 > URL: https://issues.apache.org/jira/browse/HDFS-16664 > Project: Hadoop HDFS > Issue Type: Bug >Reporter: Kevin Wikant >Priority: Major > > While trying to backport HDFS-16064 to an older Hadoop version, the new unit > test "testDeleteCorruptReplicaForUnderReplicatedBlock" started failing > unexpectedly. > Upon deep diving this unit test failure, I identified a bug in HDFS corrupt > replica invalidation which results in the following datanode exception: > {quote}2022-07-16 08:07:52,041 [BP-958471676-X-1657973243350 heartbeating to > localhost/127.0.0.1:61365] WARN datanode.DataNode > (BPServiceActor.java:processCommand(887)) - Error processing datanode Command > java.io.IOException: Failed to delete 1 (out of 1) replica(s): > 0) Failed to delete replica blk_1073741825_1005: GenerationStamp not matched, > existing replica is blk_1073741825_1001 > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2139) > at > org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2034) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:735) > at > org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:883) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:678) > at > org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:849) > at java.lang.Thread.run(Thread.java:750) > {quote} > The issue is that the Namenode is sending wrong generationStamp to the > datanode. By adding some additional logs, I was able to determine the root > cause for this: > * the generationStamp sent in the DNA_INVALIDATE is based on the > [generationStamp of the block sent in the block > report|https://github.com/apache/hadoop/blob/8774f178686487007dcf8c418c989b785a529000/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3733] > * the problem is that the datanode with the corrupt block replica (that > receives the DNA_INVALIDATE) is not necissarily the same datanode that sent > the block report > * this can cause the above exception when the corrupt block replica on the > datanode receiving the DNA_INVALIDATE & the block replica on the datanode > that sent the block report have different generationStamps > The solution is to store the corrupt replicas generationStamp in the > CorruptReplicasMap, then to extract this correct generationStamp value when > sending the DNA_INVALIDATE to the datanode > > h2. Failed Test - Before the fix > {quote}> mvn test > -Dtest=TestDecommission#testDeleteCorruptReplicaForUnderReplicatedBlock > > [INFO] Results: > [INFO] > [ERROR] Failures: > [ERROR] > TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock:2035 Node > 127.0.0.1:61366 failed to complete decommissioning. 
numTrackedNodes=1 , > numPendingNodes=0 , adminState=Decommission In Progress , > nodesWithReplica=[127.0.0.1:61366, 127.0.0.1:61419] > {quote} > Logs: > {quote}> cat > target/surefire-reports/org.apache.hadoop.hdfs.TestDecommission-output.txt | > grep 'Expected Replicas:|XXX|FINALIZED|Block now|Failed to delete' > 2022-07-16 08:07:45,891 [Listener at localhost/61378] INFO > hdfs.TestDecommission > (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1942)) > - Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and > 1 live replica on 127.0.0.1:61366 > 2022-07-16 08:07:45,913 [Listener at localhost/61378] INFO > hdfs.TestDecommission > (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1974)) > - Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and > 1 decommissioning replica on 127.0.0.1:61366 > XXX invalidateBlock dn=127.0.0.1:61415 , blk=1073741825_1001 > XXX postponeBlock dn=127.0.0.1:61415 , blk=1073741825_1001 > XXX invalidateBlock dn=127.0.0.1:61419 , blk=1073741825_1003 > XXX addToInvalidates dn=127.0.0.1:61419 , blk=1073741825_1003 > XXX addBlocksToBeInvalidated dn=127.0.0.1:61419 , blk=1073741825_1003 > XXX rescanPostponedMisreplicatedBlocks
[jira] [Updated] (HDFS-16664) Use correct GenerationStamp when invalidating corrupt block replica
[ https://issues.apache.org/jira/browse/HDFS-16664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Wikant updated HDFS-16664: Description: While trying to backport HDFS-16064 to an older Hadoop version, the new unit test "testDeleteCorruptReplicaForUnderReplicatedBlock" started failing unexpectedly. Upon deep diving this unit test failure, I identified a bug in HDFS corrupt replica invalidation which results in the following datanode exception: {quote}2022-07-16 08:07:52,041 [BP-958471676-X-1657973243350 heartbeating to localhost/127.0.0.1:61365] WARN datanode.DataNode (BPServiceActor.java:processCommand(887)) - Error processing datanode Command java.io.IOException: Failed to delete 1 (out of 1) replica(s): 0) Failed to delete replica blk_1073741825_1005: GenerationStamp not matched, existing replica is blk_1073741825_1001 at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2139) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2034) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:735) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:883) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:678) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:849) at java.lang.Thread.run(Thread.java:750) {quote} The issue is that the Namenode is sending wrong generationStamp to the datanode. By adding some additional logs, I was able to determine the root cause for this: * the generationStamp sent in the DNA_INVALIDATE is based on the [generationStamp of the block sent in the block report|https://github.com/apache/hadoop/blob/8774f178686487007dcf8c418c989b785a529000/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3733] * the problem is that the datanode with the corrupt block replica (that receives the DNA_INVALIDATE) is not necissarily the same datanode that sent the block report * this can cause the above exception when the corrupt block replica on the datanode receiving the DNA_INVALIDATE & the block replica on the datanode that sent the block report have different generationStamps The solution is to store the corrupt replicas generationStamp in the CorruptReplicasMap, then to extract this correct generationStamp value when sending the DNA_INVALIDATE to the datanode h2. Failed Test - Before the fix {quote}> mvn test -Dtest=TestDecommission#testDeleteCorruptReplicaForUnderReplicatedBlock [INFO] Results: [INFO] [ERROR] Failures: [ERROR] TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock:2035 Node 127.0.0.1:61366 failed to complete decommissioning. 
numTrackedNodes=1 , numPendingNodes=0 , adminState=Decommission In Progress , nodesWithReplica=[127.0.0.1:61366, 127.0.0.1:61419] {quote} Logs: {quote}> cat target/surefire-reports/org.apache.hadoop.hdfs.TestDecommission-output.txt | grep 'Expected Replicas:|XXX|FINALIZED|Block now|Failed to delete' 2022-07-16 08:07:45,891 [Listener at localhost/61378] INFO hdfs.TestDecommission (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1942)) - Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 1 live replica on 127.0.0.1:61366 2022-07-16 08:07:45,913 [Listener at localhost/61378] INFO hdfs.TestDecommission (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1974)) - Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 1 decommissioning replica on 127.0.0.1:61366 XXX invalidateBlock dn=127.0.0.1:61415 , blk=1073741825_1001 XXX postponeBlock dn=127.0.0.1:61415 , blk=1073741825_1001 XXX invalidateBlock dn=127.0.0.1:61419 , blk=1073741825_1003 XXX addToInvalidates dn=127.0.0.1:61419 , blk=1073741825_1003 XXX addBlocksToBeInvalidated dn=127.0.0.1:61419 , blk=1073741825_1003 XXX rescanPostponedMisreplicatedBlocks blk=1073741825_1005 XXX DNA_INVALIDATE dn=/127.0.0.1:61419 , blk=1073741825_1003 XXX invalidate(on DN) dn=/127.0.0.1:61419 , invalidBlk=blk_1073741825_1003 , blkByIdAndGenStamp = FinalizedReplica, blk_1073741825_1003, FINALIZED 2022-07-16 08:07:49,084 [BP-958471676-X-1657973243350 heartbeating to localhost/127.0.0.1:61365] INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:deleteAsync(226)) - Scheduling blk_1073741825_1003 replica FinalizedReplica, blk_1073741825_1003, FINALIZED XXX addBlock dn=127.0.0.1:61419 , blk=1073741825_1005 *<<< block report is coming from 127.0.0.1:61419 which has genStamp=1005* XXX invalidateCorruptReplicas
[jira] [Created] (HDFS-16664) Use correct GenerationStamp when invalidating corrupt block replica
Kevin Wikant created HDFS-16664: --- Summary: Use correct GenerationStamp when invalidating corrupt block replica Key: HDFS-16664 URL: https://issues.apache.org/jira/browse/HDFS-16664 Project: Hadoop HDFS Issue Type: Bug Reporter: Kevin Wikant While trying to backport HDFS-16064 to an older Hadoop version, the new unit test "testDeleteCorruptReplicaForUnderReplicatedBlock" started failing unexpectedly. Upon deep diving into this unit test failure, I identified a bug in HDFS corrupt replica invalidation which results in the following datanode exception: {quote}2022-07-16 08:07:52,041 [BP-958471676-X-1657973243350 heartbeating to localhost/127.0.0.1:61365] WARN datanode.DataNode (BPServiceActor.java:processCommand(887)) - Error processing datanode Command java.io.IOException: Failed to delete 1 (out of 1) replica(s): 0) Failed to delete replica blk_1073741825_1005: GenerationStamp not matched, existing replica is blk_1073741825_1001 at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2139) at org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2034) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:735) at org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:883) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:678) at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:849) at java.lang.Thread.run(Thread.java:750) {quote} The issue is that the Namenode is sending the wrong generationStamp to the datanode. By adding some additional logs, I was able to determine the root cause for this: * the generationStamp sent in the DNA_INVALIDATE is based on the [generationStamp of the block sent in the block report|https://github.com/apache/hadoop/blob/8774f178686487007dcf8c418c989b785a529000/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3733] * the problem is that the datanode with the corrupt block replica (that receives the DNA_INVALIDATE) is not necessarily the same datanode that sent the block report * this can cause the above exception when the corrupt block replica on the datanode receiving the DNA_INVALIDATE & the block replica on the datanode that sent the block report have different generationStamps The solution is to store the corrupt replica's generationStamp in the CorruptReplicasMap, then to extract this correct generationStamp value when sending the DNA_INVALIDATE to the datanode h2. Failed Test - Before the fix {quote}> mvn test -Dtest=TestDecommission#testDeleteCorruptReplicaForUnderReplicatedBlock [INFO] Results: [INFO] [ERROR] Failures: [ERROR] TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock:2035 Node 127.0.0.1:61366 failed to complete decommissioning. 
numTrackedNodes=1 , numPendingNodes=0 , adminState=Decommission In Progress , nodesWithReplica=[127.0.0.1:61366, 127.0.0.1:61419] {quote} Logs: {quote}> cat target/surefire-reports/org.apache.hadoop.hdfs.TestDecommission-output.txt | grep 'Expected Replicas:\|XXX\|FINALIZED\|Block now\|Failed to delete' 2022-07-16 08:07:45,891 [Listener at localhost/61378] INFO hdfs.TestDecommission (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1942)) - Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 1 live replica on 127.0.0.1:61366 2022-07-16 08:07:45,913 [Listener at localhost/61378] INFO hdfs.TestDecommission (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1974)) - Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 1 decommissioning replica on 127.0.0.1:61366 XXX invalidateBlock dn=127.0.0.1:61415 , blk=1073741825_1001 XXX postponeBlock dn=127.0.0.1:61415 , blk=1073741825_1001 XXX invalidateBlock dn=127.0.0.1:61419 , blk=1073741825_1003 XXX addToInvalidates dn=127.0.0.1:61419 , blk=1073741825_1003 XXX addBlocksToBeInvalidated dn=127.0.0.1:61419 , blk=1073741825_1003 XXX rescanPostponedMisreplicatedBlocks blk=1073741825_1005 XXX DNA_INVALIDATE dn=/127.0.0.1:61419 , blk=1073741825_1003 XXX invalidate(on DN) dn=/127.0.0.1:61419 , invalidBlk=blk_1073741825_1003 , blkByIdAndGenStamp = FinalizedReplica, blk_1073741825_1003, FINALIZED 2022-07-16 08:07:49,084 [BP-958471676-X-1657973243350 heartbeating to localhost/127.0.0.1:61365] INFO impl.FsDatasetAsyncDiskService (FsDatasetAsyncDiskService.java:deleteAsync(226)) - Scheduling blk_1073741825_1003 replica FinalizedReplica, blk_1073741825_1003, FINALIZED XXX
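For context on why the mismatch surfaces as the IOException quoted above: the datanode refuses to delete a replica when the generation stamp in the invalidate request does not match the replica it actually holds on disk. A much-simplified, hypothetical version of that guard (not the real FsDatasetImpl code):

{code:java}
// Hypothetical simplification of the datanode-side check behind the
// "GenerationStamp not matched" error: the requested genstamp must match the
// on-disk replica's genstamp before the replica is scheduled for deletion.
import java.io.IOException;

class InvalidateGuardSketch {
  static final class Replica {
    final long blockId;
    final long genStamp;
    Replica(long blockId, long genStamp) { this.blockId = blockId; this.genStamp = genStamp; }
  }

  static void invalidate(Replica onDisk, long requestedBlockId, long requestedGenStamp)
      throws IOException {
    if (onDisk.genStamp != requestedGenStamp) {
      throw new IOException("Failed to delete replica blk_" + requestedBlockId + "_"
          + requestedGenStamp + ": GenerationStamp not matched, existing replica is blk_"
          + onDisk.blockId + "_" + onDisk.genStamp);
    }
    // ... otherwise schedule the block file for asynchronous deletion ...
  }

  public static void main(String[] args) throws IOException {
    // Namenode asks to invalidate genstamp 1005, but the local corrupt replica is at 1001.
    invalidate(new Replica(1073741825L, 1001L), 1073741825L, 1005L);
  }
}
{code}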
[jira] [Comment Edited] (HDFS-16064) HDFS-721 causes DataNode decommissioning to get stuck indefinitely
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550619#comment-17550619 ] Kevin Wikant edited comment on HDFS-16064 at 6/7/22 12:29 PM: -- Thanks [~it_singer] , you are correct in that my initial root cause was incomplete In the past few months I have seen this issue re-occur multiple times, I decided to do a deeper dive & I identified the bug described here: [https://github.com/apache/hadoop/pull/4410] I think the issue described in this ticket is occurring because the corrupt replica on DN3 will not be invalidated until DN3 either: * restarts & sends a block report * sends its next periodic block report (default interval is 6 hours) So in the worst case the decommissioning in the aforementioned scenario will take up to 6 hours to complete because DN3 may take up to 6 hours to send its next block report & have the corrupt replica invalidated. I have not targeted fixing this decommissioning blocker scenario because it is arguably expected behavior & will resolve in at most "dfs.blockreport.intervalMsec". Instead the fix [[https://github.com/apache/hadoop/pull/4410]] is targeting a more severe bug where decommissioning gets blocked indefinitely was (Author: kevinwikant): Thanks [~it_singer] , you are correct in that my initial root cause was very much incorrect In the past few months I have seen this issue re-occur multiple times, I decided to do a deeper dive & I identified the bug described here: [https://github.com/apache/hadoop/pull/4410] I think the issue described in this ticket is occurring because the corrupt replica on DN3 will not be invalidated until DN3 either: * restarts & sends a block report * sends its next periodic block report (default interval is 6 hours) So in the worst case the decommissioning in the aforementioned scenario will take up to 6 hours to complete because DN3 may take up to 6 hours to send its next block report & have the corrupt replica invalidated. I have not targeted fixing this decommissioning blocker scenario because it is arguably expected behavior & will resolve in at most "dfs.blockreport.intervalMsec". 
Instead the fix [[https://github.com/apache/hadoop/pull/4410]] is targeting a more severe bug where decommissioning gets blocked indefinitely > HDFS-721 causes DataNode decommissioning to get stuck indefinitely > -- > > Key: HDFS-16064 > URL: https://issues.apache.org/jira/browse/HDFS-16064 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, namenode >Affects Versions: 3.2.1 >Reporter: Kevin Wikant >Priority: Major > Labels: pull-request-available > Time Spent: 0.5h > Remaining Estimate: 0h > > Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a > non-issue under the assumption that if the namenode & a datanode get into an > inconsistent state for a given block pipeline, there should be another > datanode available to replicate the block to > While testing datanode decommissioning using "dfs.exclude.hosts", I have > encountered a scenario where the decommissioning gets stuck indefinitely > Below is the progression of events: > * there are initially 4 datanodes DN1, DN2, DN3, DN4 > * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts" > * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in > order to satisfy their minimum replication factor of 2 > * during this replication process > https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes > the following inconsistent state: > ** DN3 thinks it has the block pipeline in FINALIZED state > ** the namenode does not think DN3 has the block pipeline > {code:java} > 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode > (DataXceiver for client at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): > DN3:9866:DataXceiver error processing WRITE_BLOCK operation src: /DN2:45654 > dst: /DN3:9866; > org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block > BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created. > {code} > * the replication is attempted again, but: > ** DN4 has the block > ** DN1 and/or DN2 have the block, but don't count towards the minimum > replication factor because they are being decommissioned > ** DN3 does not have the block & cannot have the block replicated to it > because of HDFS-721 > * the namenode repeatedly tries to replicate the block to DN3 & repeatedly > fails, this continues indefinitely > * therefore DN4 is the only live datanode with the block & the minimum > replication factor of 2 cannot be satisfied > * because the minimum replication factor cannot be satisfied for the
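The six-hour worst case mentioned in the comment above is the default full block report interval. As an aside, the interval can be lowered so a stale corrupt replica gets re-reported (and invalidated) sooner, at the cost of more report traffic; the value below is purely illustrative.

{code:java}
// Hypothetical illustration: lowering the periodic full block report interval,
// which bounds how long a stale corrupt replica can linger before being re-reported.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.DFSConfigKeys;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class BlockReportIntervalSketch {
  public static void main(String[] args) {
    Configuration conf = new HdfsConfiguration();
    // Default is DFS_BLOCKREPORT_INTERVAL_MSEC_DEFAULT (6 hours).
    conf.setLong(DFSConfigKeys.DFS_BLOCKREPORT_INTERVAL_MSEC_KEY, 60L * 60 * 1000); // 1 hour
    System.out.println("dfs.blockreport.intervalMsec = "
        + conf.getLong(DFSConfigKeys.DFS_BLOCKREPORT_INTERVAL_MSEC_KEY,
            DFSConfigKeys.DFS_BLOCKREPORT_INTERVAL_MSEC_DEFAULT));
  }
}
{code}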
[jira] [Commented] (HDFS-16064) HDFS-721 causes DataNode decommissioning to get stuck indefinitely
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550619#comment-17550619 ] Kevin Wikant commented on HDFS-16064: - Thanks [~it_singer], you are correct in that my initial root cause was very much incorrect. In the past few months I have seen this issue re-occur multiple times, so I decided to do a deeper dive & identified the bug described here: [https://github.com/apache/hadoop/pull/4410] I think the issue described in this ticket is occurring because the corrupt replica on DN3 will not be invalidated until DN3 either: * restarts & sends a block report * sends its next periodic block report (default interval is 6 hours) So in the worst case the decommissioning in the aforementioned scenario will take up to 6 hours to complete, because DN3 may take up to 6 hours to send its next block report & have the corrupt replica invalidated. I have not targeted fixing this decommissioning blocker scenario because it is arguably expected behavior & will resolve in at most "dfs.blockreport.intervalMsec". Instead, the fix [https://github.com/apache/hadoop/pull/4410] targets a more severe bug where decommissioning gets blocked indefinitely. > HDFS-721 causes DataNode decommissioning to get stuck indefinitely > -- > > Key: HDFS-16064 > URL: https://issues.apache.org/jira/browse/HDFS-16064 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode, namenode >Affects Versions: 3.2.1 >Reporter: Kevin Wikant >Priority: Major > Labels: pull-request-available > Time Spent: 10m > Remaining Estimate: 0h > > Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a > non-issue under the assumption that if the namenode & a datanode get into an > inconsistent state for a given block pipeline, there should be another > datanode available to replicate the block to > While testing datanode decommissioning using "dfs.exclude.hosts", I have > encountered a scenario where the decommissioning gets stuck indefinitely > Below is the progression of events: > * there are initially 4 datanodes DN1, DN2, DN3, DN4 > * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts" > * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in > order to satisfy their minimum replication factor of 2 > * during this replication process > https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes > the following inconsistent state: > ** DN3 thinks it has the block pipeline in FINALIZED state > ** the namenode does not think DN3 has the block pipeline > {code:java} > 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode > (DataXceiver for client at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): > DN3:9866:DataXceiver error processing WRITE_BLOCK operation src: /DN2:45654 > dst: /DN3:9866; > org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block > BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created. 
> {code} > * the replication is attempted again, but: > ** DN4 has the block > ** DN1 and/or DN2 have the block, but don't count towards the minimum > replication factor because they are being decommissioned > ** DN3 does not have the block & cannot have the block replicated to it > because of HDFS-721 > * the namenode repeatedly tries to replicate the block to DN3 & repeatedly > fails, this continues indefinitely > * therefore DN4 is the only live datanode with the block & the minimum > replication factor of 2 cannot be satisfied > * because the minimum replication factor cannot be satisfied for the > block(s) being moved off DN1 & DN2, the datanode decommissioning can never be > completed > {code:java} > 2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0): > Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, > decommissioned replicas: 0, decommissioning replicas: 2, maintenance > replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is > Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , > Current Datanode: DN1:9866, Is current datanode decommissioning: true, Is > current datanode entering maintenance: false > ... > 2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0): > Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, > decommissioned replicas: 0, decommissioning replicas: 2, maintenance > replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is > Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , > Current Datanode: DN2:9866, Is current datanode decommissioning: true, Is > current datanode entering
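For context on the 6-hour worst case mentioned in the comment above: "dfs.blockreport.intervalMsec" is an ordinary hdfs-site property, so the invalidation delay can be bounded by lowering the interval. The snippet below is a minimal sketch, assuming the Hadoop client libraries are on the classpath; the 5-minute value is purely illustrative, and a shorter interval increases block-report load on the Namenode.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class BlockReportIntervalSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Default is 6 hours (21600000 ms); the corrupt replica on DN3 is only
    // invalidated once DN3's next full block report reaches the Namenode.
    long defaultMs = conf.getLong("dfs.blockreport.intervalMsec", 6L * 60 * 60 * 1000);
    // Illustrative override: bound the worst-case invalidation delay to 5 minutes.
    conf.setLong("dfs.blockreport.intervalMsec", 5L * 60 * 1000);
    System.out.println("old=" + defaultMs + " new="
        + conf.getLong("dfs.blockreport.intervalMsec", defaultMs));
  }
}
{code}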
[jira] [Created] (HDFS-16443) Fix edge case where DatanodeAdminDefaultMonitor doubly enqueues a DatanodeDescriptor on exception
Kevin Wikant created HDFS-16443: --- Summary: Fix edge case where DatanodeAdminDefaultMonitor doubly enqueues a DatanodeDescriptor on exception Key: HDFS-16443 URL: https://issues.apache.org/jira/browse/HDFS-16443 Project: Hadoop HDFS Issue Type: Bug Components: hdfs Reporter: Kevin Wikant As part of the fix merged in: https://issues.apache.org/jira/browse/HDFS-16303 There was a rare edge case noticed in DatanodeAdminDefaultMonitor which causes a DatanodeDescriptor to be added twice to the pendingNodes queue: * a [datanode is unhealthy so it gets added to "unhealthyDns"|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminDefaultMonitor.java#L227] * an exception is thrown which causes [this catch block|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminDefaultMonitor.java#L271] to execute * the [datanode is added to "pendingNodes"|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminDefaultMonitor.java#L276] * under certain conditions the [datanode can be added again from "unhealthyDns" to "pendingNodes" here|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminDefaultMonitor.java#L296] This Jira is to track the 1 line fix for this bug. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
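To make the race easier to follow, here is a self-contained sketch of the double-enqueue pattern described in the bullets above. It is a simplified stand-alone model, not the real DatanodeAdminDefaultMonitor code: the collection types, the simulated exception, and the contains() guard are all illustrative, and the actual one-line fix in the linked code may take a different form.
{code:java}
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Simplified model of one monitor tick: a datanode (modelled as a String) is first
// remembered as unhealthy, then re-queued by the catch block, and then re-queued again
// when the unhealthy set is drained -- unless a guard prevents the duplicate.
public class DoubleEnqueueSketch {
  public static void main(String[] args) {
    Queue<String> pendingNodes = new ArrayDeque<>();
    Set<String> unhealthyDns = new HashSet<>();
    String dn = "DN1";

    try {
      unhealthyDns.add(dn);                         // step 1: node flagged unhealthy
      throw new IllegalStateException("simulated"); // step 2: a later check blows up
    } catch (Exception e) {
      pendingNodes.add(dn);                         // step 3: catch block re-queues it
    }

    // Step 4: end-of-scan handling pushes unhealthy nodes back to pendingNodes.
    for (String unhealthy : unhealthyDns) {
      if (!pendingNodes.contains(unhealthy)) {      // illustrative guard against the duplicate
        pendingNodes.add(unhealthy);
      }
    }
    System.out.println(pendingNodes);               // [DN1] with the guard, [DN1, DN1] without
  }
}
{code}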
[jira] [Created] (HDFS-16442) TestBlockTokenWithShortCircuitRead.testShortCircuitReadWithInvalidToken fails
Kevin Wikant created HDFS-16442: --- Summary: TestBlockTokenWithShortCircuitRead.testShortCircuitReadWithInvalidToken fails Key: HDFS-16442 URL: https://issues.apache.org/jira/browse/HDFS-16442 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs Reporter: Kevin Wikant [https://ci-hadoop.apache.org/blue/organizations/jenkins/hadoop-multibranch/detail/PR-3920/2/pipeline] [https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3920/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt] {code} [ERROR] Failures: [ERROR] TestBlockTokenWithShortCircuitRead.testShortCircuitReadWithInvalidToken:153->checkSlotsAfterSSRWithTokenExpiration:178->checkShmAndSlots:184 expected:<1> but was:<2> [ERROR] TestDirectoryScanner.testThrottling:727 Throttle is too permissive [INFO] [ERROR] Tests run: 6208, Failures: 2, Errors: 0, Skipped: 22 {code} -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning
[ https://issues.apache.org/jira/browse/HDFS-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17481354#comment-17481354 ] Kevin Wikant commented on HDFS-16303: - My apologies for the delay, something high priority came up at work. * backport to Hadoop 2.x : https://github.com/apache/hadoop/pull/3920 * backport to Hadoop 3.x : https://github.com/apache/hadoop/pull/3921 I have also made 1 small change to the DatanodeAdminDefaultMonitor based on a rare edge case I identified: https://github.com/apache/hadoop/pull/3923 > Losing over 100 datanodes in state decommissioning results in full blockage > of all datanode decommissioning > --- > > Key: HDFS-16303 > URL: https://issues.apache.org/jira/browse/HDFS-16303 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.10.1, 3.3.1 >Reporter: Kevin Wikant >Assignee: Kevin Wikant >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 12h 40m > Remaining Estimate: 0h > > h2. Impact > HDFS datanode decommissioning does not make any forward progress. For > example, the user adds X datanodes to the "dfs.hosts.exclude" file and all X > of those datanodes remain in state decommissioning forever without making any > forward progress towards being decommissioned. > h2. Root Cause > The HDFS Namenode class "DatanodeAdminManager" is responsible for > decommissioning datanodes. > As per this "hdfs-site" configuration: > {quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes > Default Value = 100 > The maximum number of decommission-in-progress datanodes nodes that will be > tracked at one time by the namenode. Tracking a decommission-in-progress > datanode consumes additional NN memory proportional to the number of blocks > on the datnode. Having a conservative limit reduces the potential impact of > decomissioning a large number of nodes at once. A value of 0 means no limit > will be enforced. > {quote} > The Namenode will only actively track up to 100 datanodes for decommissioning > at any given time, as to avoid Namenode memory pressure. > Looking into the "DatanodeAdminManager" code: > * a new datanode is only removed from the "tracked.nodes" set when it > finishes decommissioning > * a new datanode is only added to the "tracked.nodes" set if there is fewer > than 100 datanodes being tracked > So in the event that there are more than 100 datanodes being decommissioned > at a given time, some of those datanodes will not be in the "tracked.nodes" > set until 1 or more datanodes in the "tracked.nodes" finishes > decommissioning. This is generally not a problem because the datanodes in > "tracked.nodes" will eventually finish decommissioning, but there is an edge > case where this logic prevents the namenode from making any forward progress > towards decommissioning. > If all 100 datanodes in the "tracked.nodes" are unable to finish > decommissioning, then other datanodes (which may be able to be > decommissioned) will never get added to "tracked.nodes" and therefore will > never get the opportunity to be decommissioned. > This can occur due the following issue: > {quote}2021-10-21 12:39:24,048 WARN > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager > (DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In > Progress. Cannot be safely decommissioned or be in maintenance since there is > risk of reduced data durability or data loss. 
Either restart the failed node > or force decommissioning or maintenance by removing, calling refreshNodes, > then re-adding to the excludes or host config files. > {quote} > If a Datanode is lost while decommissioning (for example if the underlying > hardware fails or is lost), then it will remain in state decommissioning > forever. > If 100 or more Datanodes are lost while decommissioning over the Hadoop > cluster lifetime, then this is enough to completely fill up the > "tracked.nodes" set. With the entire "tracked.nodes" set filled with > datanodes that can never finish decommissioning, any datanodes added after > this point will never be able to be decommissioned because they will never be > added to the "tracked.nodes" set. > In this scenario: > * the "tracked.nodes" set is filled with datanodes which are lost & cannot > be recovered (and can never finish decommissioning so they will never be > removed from the set) > * the actual live datanodes being decommissioned are enqueued waiting to > enter the "tracked.nodes" set (and are stuck waiting indefinitely) > This means that no progress towards decommissioning the live datanodes will > be made unless the user takes the following action: > {quote}Either restart the
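While the linked backports work their way through review, the tracked-nodes ceiling itself is tunable. The sketch below only illustrates the knob: per the property documentation quoted in the description, a value of 0 removes the limit, at the cost of extra Namenode memory for every decommission-in-progress datanode. Whether the setting can be picked up without a Namenode restart depends on the Hadoop version, so treat this as an assumption to verify rather than a recommendation.
{code:java}
import org.apache.hadoop.conf.Configuration;

public class TrackedNodesLimitSketch {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Default ceiling is 100 tracked decommission-in-progress datanodes.
    int current = conf.getInt("dfs.namenode.decommission.max.concurrent.tracked.nodes", 100);
    // 0 means "no limit" per the property documentation; this avoids starving queued
    // datanodes behind dead ones, but costs NN memory proportional to their block counts.
    conf.setInt("dfs.namenode.decommission.max.concurrent.tracked.nodes", 0);
    System.out.println("was=" + current + " now="
        + conf.getInt("dfs.namenode.decommission.max.concurrent.tracked.nodes", current));
  }
}
{code}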
[jira] [Reopened] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning
[ https://issues.apache.org/jira/browse/HDFS-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Wikant reopened HDFS-16303: - Re-opening the Jira to track work for cherry picking the change > Losing over 100 datanodes in state decommissioning results in full blockage > of all datanode decommissioning > --- > > Key: HDFS-16303 > URL: https://issues.apache.org/jira/browse/HDFS-16303 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.10.1, 3.3.1 >Reporter: Kevin Wikant >Assignee: Kevin Wikant >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > Time Spent: 12h 10m > Remaining Estimate: 0h > > h2. Impact > HDFS datanode decommissioning does not make any forward progress. For > example, the user adds X datanodes to the "dfs.hosts.exclude" file and all X > of those datanodes remain in state decommissioning forever without making any > forward progress towards being decommissioned. > h2. Root Cause > The HDFS Namenode class "DatanodeAdminManager" is responsible for > decommissioning datanodes. > As per this "hdfs-site" configuration: > {quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes > Default Value = 100 > The maximum number of decommission-in-progress datanodes nodes that will be > tracked at one time by the namenode. Tracking a decommission-in-progress > datanode consumes additional NN memory proportional to the number of blocks > on the datnode. Having a conservative limit reduces the potential impact of > decomissioning a large number of nodes at once. A value of 0 means no limit > will be enforced. > {quote} > The Namenode will only actively track up to 100 datanodes for decommissioning > at any given time, as to avoid Namenode memory pressure. > Looking into the "DatanodeAdminManager" code: > * a new datanode is only removed from the "tracked.nodes" set when it > finishes decommissioning > * a new datanode is only added to the "tracked.nodes" set if there is fewer > than 100 datanodes being tracked > So in the event that there are more than 100 datanodes being decommissioned > at a given time, some of those datanodes will not be in the "tracked.nodes" > set until 1 or more datanodes in the "tracked.nodes" finishes > decommissioning. This is generally not a problem because the datanodes in > "tracked.nodes" will eventually finish decommissioning, but there is an edge > case where this logic prevents the namenode from making any forward progress > towards decommissioning. > If all 100 datanodes in the "tracked.nodes" are unable to finish > decommissioning, then other datanodes (which may be able to be > decommissioned) will never get added to "tracked.nodes" and therefore will > never get the opportunity to be decommissioned. > This can occur due the following issue: > {quote}2021-10-21 12:39:24,048 WARN > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager > (DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In > Progress. Cannot be safely decommissioned or be in maintenance since there is > risk of reduced data durability or data loss. Either restart the failed node > or force decommissioning or maintenance by removing, calling refreshNodes, > then re-adding to the excludes or host config files. > {quote} > If a Datanode is lost while decommissioning (for example if the underlying > hardware fails or is lost), then it will remain in state decommissioning > forever. 
> If 100 or more Datanodes are lost while decommissioning over the Hadoop > cluster lifetime, then this is enough to completely fill up the > "tracked.nodes" set. With the entire "tracked.nodes" set filled with > datanodes that can never finish decommissioning, any datanodes added after > this point will never be able to be decommissioned because they will never be > added to the "tracked.nodes" set. > In this scenario: > * the "tracked.nodes" set is filled with datanodes which are lost & cannot > be recovered (and can never finish decommissioning so they will never be > removed from the set) > * the actual live datanodes being decommissioned are enqueued waiting to > enter the "tracked.nodes" set (and are stuck waiting indefinitely) > This means that no progress towards decommissioning the live datanodes will > be made unless the user takes the following action: > {quote}Either restart the failed node or force decommissioning or maintenance > by removing, calling refreshNodes, then re-adding to the excludes or host > config files. > {quote} > Ideally, the Namenode should be able to gracefully handle scenarios where the > datanodes in the "tracked.nodes" set are not making forward progress towards > decommissioning
[jira] [Resolved] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning
[ https://issues.apache.org/jira/browse/HDFS-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Wikant resolved HDFS-16303. - Resolution: Fixed Resolved by this commit: https://github.com/apache/hadoop/commit/d20b598f97e76c67d6103a950ea9e89644be2c41 > Losing over 100 datanodes in state decommissioning results in full blockage > of all datanode decommissioning > --- > > Key: HDFS-16303 > URL: https://issues.apache.org/jira/browse/HDFS-16303 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 2.10.1, 3.3.1 >Reporter: Kevin Wikant >Assignee: Kevin Wikant >Priority: Major > Labels: pull-request-available > Time Spent: 12h 10m > Remaining Estimate: 0h > > h2. Impact > HDFS datanode decommissioning does not make any forward progress. For > example, the user adds X datanodes to the "dfs.hosts.exclude" file and all X > of those datanodes remain in state decommissioning forever without making any > forward progress towards being decommissioned. > h2. Root Cause > The HDFS Namenode class "DatanodeAdminManager" is responsible for > decommissioning datanodes. > As per this "hdfs-site" configuration: > {quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes > Default Value = 100 > The maximum number of decommission-in-progress datanodes nodes that will be > tracked at one time by the namenode. Tracking a decommission-in-progress > datanode consumes additional NN memory proportional to the number of blocks > on the datnode. Having a conservative limit reduces the potential impact of > decomissioning a large number of nodes at once. A value of 0 means no limit > will be enforced. > {quote} > The Namenode will only actively track up to 100 datanodes for decommissioning > at any given time, as to avoid Namenode memory pressure. > Looking into the "DatanodeAdminManager" code: > * a new datanode is only removed from the "tracked.nodes" set when it > finishes decommissioning > * a new datanode is only added to the "tracked.nodes" set if there is fewer > than 100 datanodes being tracked > So in the event that there are more than 100 datanodes being decommissioned > at a given time, some of those datanodes will not be in the "tracked.nodes" > set until 1 or more datanodes in the "tracked.nodes" finishes > decommissioning. This is generally not a problem because the datanodes in > "tracked.nodes" will eventually finish decommissioning, but there is an edge > case where this logic prevents the namenode from making any forward progress > towards decommissioning. > If all 100 datanodes in the "tracked.nodes" are unable to finish > decommissioning, then other datanodes (which may be able to be > decommissioned) will never get added to "tracked.nodes" and therefore will > never get the opportunity to be decommissioned. > This can occur due the following issue: > {quote}2021-10-21 12:39:24,048 WARN > org.apache.hadoop.hdfs.server.blockmanagement.BlockManager > (DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In > Progress. Cannot be safely decommissioned or be in maintenance since there is > risk of reduced data durability or data loss. Either restart the failed node > or force decommissioning or maintenance by removing, calling refreshNodes, > then re-adding to the excludes or host config files. > {quote} > If a Datanode is lost while decommissioning (for example if the underlying > hardware fails or is lost), then it will remain in state decommissioning > forever. 
> If 100 or more Datanodes are lost while decommissioning over the Hadoop > cluster lifetime, then this is enough to completely fill up the > "tracked.nodes" set. With the entire "tracked.nodes" set filled with > datanodes that can never finish decommissioning, any datanodes added after > this point will never be able to be decommissioned because they will never be > added to the "tracked.nodes" set. > In this scenario: > * the "tracked.nodes" set is filled with datanodes which are lost & cannot > be recovered (and can never finish decommissioning so they will never be > removed from the set) > * the actual live datanodes being decommissioned are enqueued waiting to > enter the "tracked.nodes" set (and are stuck waiting indefinitely) > This means that no progress towards decommissioning the live datanodes will > be made unless the user takes the following action: > {quote}Either restart the failed node or force decommissioning or maintenance > by removing, calling refreshNodes, then re-adding to the excludes or host > config files. > {quote} > Ideally, the Namenode should be able to gracefully handle scenarios where the > datanodes in the "tracked.nodes" set are not making forward
[jira] [Created] (HDFS-16336) TestRollingUpgrade.testRollback fails
Kevin Wikant created HDFS-16336: --- Summary: TestRollingUpgrade.testRollback fails Key: HDFS-16336 URL: https://issues.apache.org/jira/browse/HDFS-16336 Project: Hadoop HDFS Issue Type: Sub-task Components: hdfs Affects Versions: 3.4.0 Reporter: Kevin Wikant This pull request: [https://github.com/apache/hadoop/pull/3675] Failed Jenkins pre-commit job due to an unrelated unit test failure: [https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt] {code:java} [ERROR] Failures: [ERROR] org.apache.hadoop.hdfs.TestRollingUpgrade.testRollback(org.apache.hadoop.hdfs.TestRollingUpgrade) [ERROR] Run 1: TestRollingUpgrade.testRollback:328->checkMxBeanIsNull:299 expected null, but was: [ERROR] Run 2: TestRollingUpgrade.testRollback:328->checkMxBeanIsNull:299 expected null, but was: [ERROR] Run 3: TestRollingUpgrade.testRollback:328->checkMxBeanIsNull:299 expected null, but was: {code} Seems that perhaps "TestRollingUpgrade.testRollback" is a flaky unit test -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning
[ https://issues.apache.org/jira/browse/HDFS-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Wikant updated HDFS-16303: Description: h2. Impact HDFS datanode decommissioning does not make any forward progress. For example, the user adds X datanodes to the "dfs.hosts.exclude" file and all X of those datanodes remain in state decommissioning forever without making any forward progress towards being decommissioned. h2. Root Cause The HDFS Namenode class "DatanodeAdminManager" is responsible for decommissioning datanodes. As per this "hdfs-site" configuration: {quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes Default Value = 100 The maximum number of decommission-in-progress datanodes nodes that will be tracked at one time by the namenode. Tracking a decommission-in-progress datanode consumes additional NN memory proportional to the number of blocks on the datnode. Having a conservative limit reduces the potential impact of decomissioning a large number of nodes at once. A value of 0 means no limit will be enforced. {quote} The Namenode will only actively track up to 100 datanodes for decommissioning at any given time, so as to avoid Namenode memory pressure. Looking into the "DatanodeAdminManager" code: * a new datanode is only removed from the "tracked.nodes" set when it finishes decommissioning * a new datanode is only added to the "tracked.nodes" set if there are fewer than 100 datanodes being tracked So in the event that there are more than 100 datanodes being decommissioned at a given time, some of those datanodes will not be in the "tracked.nodes" set until 1 or more datanodes in the "tracked.nodes" finish decommissioning. This is generally not a problem because the datanodes in "tracked.nodes" will eventually finish decommissioning, but there is an edge case where this logic prevents the namenode from making any forward progress towards decommissioning. If all 100 datanodes in the "tracked.nodes" are unable to finish decommissioning, then other datanodes (which may be able to be decommissioned) will never get added to "tracked.nodes" and therefore will never get the opportunity to be decommissioned. This can occur due to the following issue: {quote}2021-10-21 12:39:24,048 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager (DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In Progress. Cannot be safely decommissioned or be in maintenance since there is risk of reduced data durability or data loss. Either restart the failed node or force decommissioning or maintenance by removing, calling refreshNodes, then re-adding to the excludes or host config files. {quote} If a Datanode is lost while decommissioning (for example if the underlying hardware fails or is lost), then it will remain in state decommissioning forever. If 100 or more Datanodes are lost while decommissioning over the Hadoop cluster lifetime, then this is enough to completely fill up the "tracked.nodes" set. With the entire "tracked.nodes" set filled with datanodes that can never finish decommissioning, any datanodes added after this point will never be able to be decommissioned because they will never be added to the "tracked.nodes" set. 
In this scenario: * the "tracked.nodes" set is filled with datanodes which are lost & cannot be recovered (and can never finish decommissioning so they will never be removed from the set) * the actual live datanodes being decommissioned are enqueued waiting to enter the "tracked.nodes" set (and are stuck waiting indefinitely) This means that no progress towards decommissioning the live datanodes will be made unless the user takes the following action: {quote}Either restart the failed node or force decommissioning or maintenance by removing, calling refreshNodes, then re-adding to the excludes or host config files. {quote} Ideally, the Namenode should be able to gracefully handle scenarios where the datanodes in the "tracked.nodes" set are not making forward progress towards decommissioning while the enqueued datanodes may be able to make forward progress. h2. Reproduce Steps * create a Hadoop cluster * lose (i.e. terminate the host/process forever) over 100 datanodes while the datanodes are in state decommissioning * add additional datanodes to the cluster * attempt to decommission those new datanodes & observe that they are stuck in state decommissioning forever Note that in this example each datanode, over the full history of the cluster, has a unique IP address was: h2. Impact HDFS datanode decommissioning does not make any forward progress. For example, the user adds X datanodes to the "dfs.hosts.exclude" file and all of those datanodes remain in state decommissioning forever without making any forward progress towards decommissioning. h2. Root Cause The HDFS Namenode class "DatanodeAdminManager" is
[jira] [Updated] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning
[ https://issues.apache.org/jira/browse/HDFS-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Wikant updated HDFS-16303: Description: h2. Impact HDFS datanode decommissioning does not make any forward progress. For example, the user adds X datanodes to the "dfs.hosts.exclude" file and all of those datanodes remain in state decommissioning forever without making any forward progress towards decommissioning. h2. Root Cause The HDFS Namenode class "DatanodeAdminManager" is responsible for decommissioning datanodes. As per this "hdfs-site" configuration: {quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes Default Value = 100 The maximum number of decommission-in-progress datanodes nodes that will be tracked at one time by the namenode. Tracking a decommission-in-progress datanode consumes additional NN memory proportional to the number of blocks on the datnode. Having a conservative limit reduces the potential impact of decomissioning a large number of nodes at once. A value of 0 means no limit will be enforced. {quote} The Namenode will only actively track up to 100 datanodes for decommissioning at any given time, as to avoid Namenode memory pressure. Looking into the "DatanodeAdminManager" code: * a new datanode is only removed from the "tracked.nodes" set when it finishes decommissioning * a new datanode is only added to the "tracked.nodes" set if there is fewer than 100 datanodes being tracked So in the event that there are more than 100 datanodes being decommissioned at a given time, some of those datanodes will not be in the "tracked.nodes" set until 1 or more datanodes in the "tracked.nodes" finishes decommissioning. This is generally not a problem because the datanodes in "tracked.nodes" will eventually finish decommissioning, but there is an edge case where this logic prevents the namenode from making any forward progress towards decommissioning. If all 100 datanodes in the "tracked.nodes" are unable to finish decommissioning, then other datanodes (which may be able to be decommissioned) will never get added to "tracked.nodes" and therefore will never get the opportunity to be decommissioned. This can occur due the following issue: {quote}2021-10-21 12:39:24,048 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager (DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In Progress. Cannot be safely decommissioned or be in maintenance since there is risk of reduced data durability or data loss. Either restart the failed node or force decommissioning or maintenance by removing, calling refreshNodes, then re-adding to the excludes or host config files. {quote} If a Datanode is lost while decommissioning (for example if the underlying hardware fails or is lost), then it will remain in state decommissioning forever. If 100 or more Datanodes are lost while decommissioning over the Hadoop cluster lifetime, then this is enough to completely fill up the "tracked.nodes" set. With the entire "tracked.nodes" set filled with datanodes that can never finish decommissioning, any datanodes added after this point will never be able to be decommissioned because they will never be added to the "tracked.nodes" set. 
In this scenario: * the "tracked.nodes" set is filled with datanodes which are lost & cannot be recovered (and can never finish decommissioning so they will never be removed from the set) * the actual live datanodes being decommissioned are enqueued waiting to enter the "tracked.nodes" set (and are stuck waiting indefinitely) This means that no progress towards decommissioning the live datanodes will be made unless the user takes the following action: {quote}Either restart the failed node or force decommissioning or maintenance by removing, calling refreshNodes, then re-adding to the excludes or host config files. {quote} Ideally, the Namenode should be able to gracefully handle scenarios where the datanodes in the "tracked.nodes" set are not making forward progress towards decommissioning while the enqueued datanodes may be able to make forward progress. h2. Reproduce Steps * create a Hadoop cluster * lose (i.e. terminate the host/process forever) over 100 datanodes while the datanodes are in state decommissioning * add additional datanodes to the cluster * attempt to decommission those new datanodes & observe that they are stuck in state decommissioning forever Note that in this example each datanode, over the full history of the cluster, has a unique IP address was: h2. Problem Description The HDFS Namenode class "DatanodeAdminManager" is responsible for decommissioning datanodes. As per this "hdfs-site" configuration: {quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes Default Value = 100 The maximum number of decommission-in-progress datanodes nodes that will be tracked at one time by
[jira] [Updated] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning
[ https://issues.apache.org/jira/browse/HDFS-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Wikant updated HDFS-16303: Description: h2. Problem Description The HDFS Namenode class "DatanodeAdminManager" is responsible for decommissioning datanodes. As per this "hdfs-site" configuration: {quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes Default Value = 100 The maximum number of decommission-in-progress datanodes nodes that will be tracked at one time by the namenode. Tracking a decommission-in-progress datanode consumes additional NN memory proportional to the number of blocks on the datnode. Having a conservative limit reduces the potential impact of decomissioning a large number of nodes at once. A value of 0 means no limit will be enforced. {quote} The Namenode will only actively track up to 100 datanodes for decommissioning at any given time, as to avoid Namenode memory pressure. Looking into the "DatanodeAdminManager" code: * a new datanode is only removed from the "tracked.nodes" set when it finishes decommissioning * a new datanode is only added to the "tracked.nodes" set if there is fewer than 100 datanodes being tracked So in the event that there are more than 100 datanodes being decommissioned at a given time, some of those datanodes will not be in the "tracked.nodes" set until 1 or more datanodes in the "tracked.nodes" finishes decommissioning. This is generally not a problem because the datanodes in "tracked.nodes" will eventually finish decommissioning, but there is an edge case where this logic prevents the namenode from making any forward progress towards decommissioning. If all 100 datanodes in the "tracked.nodes" are unable to finish decommissioning, then other datanodes (which may be able to be decommissioned) will never get added to "tracked.nodes" and therefore will never get the opportunity to be decommissioned. This can occur due the following issue: {quote}2021-10-21 12:39:24,048 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager (DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In Progress. Cannot be safely decommissioned or be in maintenance since there is risk of reduced data durability or data loss. Either restart the failed node or force decommissioning or maintenance by removing, calling refreshNodes, then re-adding to the excludes or host config files. {quote} If a Datanode is lost while decommissioning (for example if the underlying hardware fails or is lost), then it will remain in state decommissioning forever. If 100 or more Datanodes are lost while decommissioning over the Hadoop cluster lifetime, then this is enough to completely fill up the "tracked.nodes" set. With the entire "tracked.nodes" set filled with datanodes that can never finish decommissioning, any datanodes added after this point will never be able to be decommissioned because they will never be added to the "tracked.nodes" set. 
In this scenario: * the "tracked.nodes" set is filled with datanodes which are lost & cannot be recovered (and can never finish decommissioning so they will never be removed from the set) * the actual live datanodes being decommissioned are enqueued waiting to enter the "tracked.nodes" set (and are stuck waiting indefinitely) This means that no progress towards decommissioning the live datanodes will be made unless the user takes the following action: {quote}Either restart the failed node or force decommissioning or maintenance by removing, calling refreshNodes, then re-adding to the excludes or host config files. {quote} Ideally, the Namenode should be able to gracefully handle scenarios where the datanodes in the "tracked.nodes" set are not making forward progress towards decommissioning while the enqueued datanodes may be able to make forward progress. h2. Reproduce Steps * create a Hadoop cluster * lose (i.e. terminate the host/process forever) over 100 datanodes while the datanodes are in state decommissioning * add additional datanodes to the cluster * attempt to decommission those new datanodes & observe that they are stuck in state decommissioning forever Note that in this example each datanode, over the full history of the cluster, has a unique IP address was: ## Problem Description The HDFS Namenode class "DatanodeAdminManager" is responsible for decommissioning datanodes. As per this "hdfs-site" configuration: {quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes Default Value = 100 The maximum number of decommission-in-progress datanodes nodes that will be tracked at one time by the namenode. Tracking a decommission-in-progress datanode consumes additional NN memory proportional to the number of blocks on the datnode. Having a conservative limit reduces the potential impact of decomissioning a large number of nodes at once. A value of 0 means
[jira] [Created] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning
Kevin Wikant created HDFS-16303: --- Summary: Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning Key: HDFS-16303 URL: https://issues.apache.org/jira/browse/HDFS-16303 Project: Hadoop HDFS Issue Type: Bug Affects Versions: 3.3.1, 2.10.1 Reporter: Kevin Wikant ## Problem Description The HDFS Namenode class "DatanodeAdminManager" is responsible for decommissioning datanodes. As per this "hdfs-site" configuration: {quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes Default Value = 100 The maximum number of decommission-in-progress datanodes nodes that will be tracked at one time by the namenode. Tracking a decommission-in-progress datanode consumes additional NN memory proportional to the number of blocks on the datnode. Having a conservative limit reduces the potential impact of decomissioning a large number of nodes at once. A value of 0 means no limit will be enforced. {quote} The Namenode will only actively track up to 100 datanodes for decommissioning at any given time, as to avoid Namenode memory pressure. Looking into the "DatanodeAdminManager" code: * a new datanode is only removed from the "tracked.nodes" set when it finishes decommissioning * a new datanode is only added to the "tracked.nodes" set if there is fewer than 100 datanodes being tracked So in the event that there are more than 100 datanodes being decommissioned at a given time, some of those datanodes will not be in the "tracked.nodes" set until 1 or more datanodes in the "tracked.nodes" finishes decommissioning. This is generally not a problem because the datanodes in "tracked.nodes" will eventually finish decommissioning, but there is an edge case where this logic prevents the namenode from making any forward progress towards decommissioning. If all 100 datanodes in the "tracked.nodes" are unable to finish decommissioning, then other datanodes (which may be able to be decommissioned) will never get added to "tracked.nodes" and therefore will never get the opportunity to be decommissioned. This can occur due the following issue: {quote}2021-10-21 12:39:24,048 WARN org.apache.hadoop.hdfs.server.blockmanagement.BlockManager (DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In Progress. Cannot be safely decommissioned or be in maintenance since there is risk of reduced data durability or data loss. Either restart the failed node or force decommissioning or maintenance by removing, calling refreshNodes, then re-adding to the excludes or host config files. {quote} If a Datanode is lost while decommissioning (for example if the underlying hardware fails or is lost), then it will remain in state decommissioning forever. If 100 or more Datanodes are lost while decommissioning over the Hadoop cluster lifetime, then this is enough to completely fill up the "tracked.nodes" set. With the entire "tracked.nodes" set filled with datanodes that can never finish decommissioning, any datanodes added after this point will never be able to be decommissioned because they will never be added to the "tracked.nodes" set. 
In this scenario: * the "tracked.nodes" set is filled with datanodes which are lost & cannot be recovered (and can never finish decommissioning so they will never be removed from the set) * the actual live datanodes being decommissioned are enqueued waiting to enter the "tracked.nodes" set (and are stuck waiting indefinitely) This means that no progress towards decommissioning the live datanodes will be made unless the user takes the following action: {quote}Either restart the failed node or force decommissioning or maintenance by removing, calling refreshNodes, then re-adding to the excludes or host config files. {quote} Ideally, the Namenode should be able to gracefully handle scenarios where the datanodes in the "tracked.nodes" set are not making forward progress towards decommissioning while the enqueued datanodes may be able to make forward progress. ## Reproduction Steps * create a Hadoop cluster * lose (i.e. terminate the host/process forever) over 100 datanodes while the datanodes are in state decommissioning * add additional datanodes to the cluster * attempt to decommission those new datanodes & observe that they are stuck in state decommissioning forever Note that in this example each datanode, over the full history of the cluster, has a unique IP address -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
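The starvation described above can be reproduced in miniature without a cluster. The following is a toy model only and is not the real DatanodeAdminManager: nodes leave the tracked set solely by finishing decommissioning, dead nodes never finish, so a healthy node queued behind them is never admitted.
{code:java}
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Toy model of the admission logic described above: trackedNodes is capped at 100,
// nodes leave it only when they finish decommissioning, and dead nodes never do.
public class TrackedNodesStarvationSketch {
  static final int MAX_TRACKED = 100;

  public static void main(String[] args) {
    Set<String> trackedNodes = new HashSet<>();
    Queue<String> pendingNodes = new ArrayDeque<>();

    // 100 datanodes were lost while decommissioning; they occupy every tracked slot.
    for (int i = 1; i <= MAX_TRACKED; i++) {
      trackedNodes.add("dead-dn-" + i);
    }
    // A healthy datanode is asked to decommission afterwards.
    pendingNodes.add("live-dn-1");

    // Each monitor tick only admits pending nodes while there is free tracking capacity.
    while (!pendingNodes.isEmpty() && trackedNodes.size() < MAX_TRACKED) {
      trackedNodes.add(pendingNodes.poll());
    }

    // Dead nodes never finish, so no slot is ever freed and the live node waits forever.
    System.out.println("live-dn-1 still pending: " + pendingNodes.contains("live-dn-1"));
  }
}
{code}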
[jira] [Updated] (HDFS-16064) HDFS-721 causes DataNode decommissioning to get stuck indefinitely
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Wikant updated HDFS-16064: Description: Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a non-issue under the assumption that if the namenode & a datanode get into an inconsistent state for a given block pipeline, there should be another datanode available to replicate the block to While testing datanode decommissioning using "dfs.exclude.hosts", I have encountered a scenario where the decommissioning gets stuck indefinitely Below is the progression of events: * there are initially 4 datanodes DN1, DN2, DN3, DN4 * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts" * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in order to satisfy their minimum replication factor of 2 * during this replication process https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes the following inconsistent state: ** DN3 thinks it has the block pipeline in FINALIZED state ** the namenode does not think DN3 has the block pipeline {code:java} 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (DataXceiver for client at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): DN3:9866:DataXceiver error processing WRITE_BLOCK operation src: /DN2:45654 dst: /DN3:9866; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created. {code} * the replication is attempted again, but: ** DN4 has the block ** DN1 and/or DN2 have the block, but don't count towards the minimum replication factor because they are being decommissioned ** DN3 does not have the block & cannot have the block replicated to it because of HDFS-721 * the namenode repeatedly tries to replicate the block to DN3 & repeatedly fails, this continues indefinitely * therefore DN4 is the only live datanode with the block & the minimum replication factor of 2 cannot be satisfied * because the minimum replication factor cannot be satisfied for the block(s) being moved off DN1 & DN2, the datanode decommissioning can never be completed {code:java} 2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 2, maintenance replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , Current Datanode: DN1:9866, Is current datanode decommissioning: true, Is current datanode entering maintenance: false ... 
2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 2, maintenance replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , Current Datanode: DN2:9866, Is current datanode decommissioning: true, Is current datanode entering maintenance: false {code} Being stuck in decommissioning state forever is not an intended behavior of DataNode decommissioning A few potential solutions: * Address the root cause of the problem which is an inconsistent state between namenode & datanode: https://issues.apache.org/jira/browse/HDFS-721 * Detect when datanode decommissioning is stuck due to lack of available datanodes for satisfying the minimum replication factor, then recover by re-enabling the datanodes being decommissioned was: Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a non-issue under the assumption that if the namenode & a datanode get into an inconsistent state for a given block pipeline, there should be another datanode available to replicate the block to While testing datanode decommissioning using "dfs.exclude.hosts", I have encountered a scenario where the decommissioning gets stuck indefinitely Below is the progression of events: * there are initially 4 datanodes DN1, DN2, DN3, DN4 * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts" * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in order to satisfy their minimum replication factor of 2 * during this replication process https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes the following inconsistent state: ** DN3 thinks it has the block pipeline in FINALIZED state ** the namenode does not think DN3 has the block pipeline {code:java} 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (DataXceiver for client at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): DN3:9866:DataXceiver error processing WRITE_BLOCK operation src:
[jira] [Updated] (HDFS-16064) HDFS-721 causes DataNode decommissioning to get stuck indefinitely
[ https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Kevin Wikant updated HDFS-16064: Description: Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a non-issue under the assumption that if the namenode & a datanode get into an inconsistent state for a given block pipeline, there should be another datanode available to replicate the block to While testing datanode decommissioning using "dfs.exclude.hosts", I have encountered a scenario where the decommissioning gets stuck indefinitely Below is the progression of events: * there are initially 4 datanodes DN1, DN2, DN3, DN4 * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts" * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in order to satisfy their minimum replication factor of 2 * during this replication process https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes the following inconsistent state: ** DN3 thinks it has the block pipeline in FINALIZED state ** the namenode does not think DN3 has the block pipeline {code:java} 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (DataXceiver for client at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): DN3:9866:DataXceiver error processing WRITE_BLOCK operation src: /DN2:45654 dst: /DN3:9866; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created. {code} * the replication is attempted again, but: ** DN4 has the block ** DN1 and/or DN2 have the block, but don't count towards the minimum replication factor because they are being decommissioned ** DN3 does not have the block & cannot have the block replicated to it because of HDFS-721 * the namenode repeatedly tries to replicate the block to DN3 & repeatedly fails, this continues indefinitely * therefore DN4 is the only live datanode with the block & the minimum replication factor of 2 cannot be satisfied * because the minimum replication factor cannot be satisfied for the block(s) being moved off DN1 & DN2, the datanode decommissioning can never be completed {code:java} 2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 2, maintenance replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , Current Datanode: DN1:9866, Is current datanode decommissioning: true, Is current datanode entering maintenance: false ... 
2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 2, maintenance replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , Current Datanode: DN2:9866, Is current datanode decommissioning: true, Is current datanode entering maintenance: false {code} Being stuck in decommissioning state forever is not an intended behavior of DataNode decommissioning A few potential solutions: * Address the root cause of the problem which is an inconsistent state between namenode & datanode: https://issues.apache.org/jira/browse/HDFS-721 * Detect when datanode decommissioning is stuck due to lack of available datanodes for satisfying the minimum replication factor, then recover by re-enabling the datanodes being decommissioned was: Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a non-issue under the assumption that if the namenode & a datanode get into an inconsistent state for a given block pipeline, there should be another datanode available to replicate the block to While testing datanode decommissioning using "dfs.exclude.hosts", I have encountered a scenario where the decommissioning gets stuck indefinitely Below is the progression of events: * there are initially 4 datanodes DN1, DN2, DN3, DN4 * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts" * HDFS block(s) on DN1 & DN2 must now be replicated to DN3 & DN4 in order to satisfy their minimum replication factor of 2 * during this replication process https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes the following inconsistent state: ** DN3 thinks it has the block pipeline in FINALIZED state ** the namenode does not think DN3 has the block pipeline {code:java} 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (DataXceiver for client at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): DN3:9866:DataXceiver error processing WRITE_BLOCK operation src:
[jira] [Created] (HDFS-16064) HDFS-721 causes DataNode decommissioning to get stuck indefinitely
Kevin Wikant created HDFS-16064: --- Summary: HDFS-721 causes DataNode decommissioning to get stuck indefinitely Key: HDFS-16064 URL: https://issues.apache.org/jira/browse/HDFS-16064 Project: Hadoop HDFS Issue Type: Bug Components: datanode, namenode Affects Versions: 3.2.1 Reporter: Kevin Wikant Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a non-issue under the assumption that if the namenode & a datanode get into an inconsistent state for a given block pipeline, there should be another datanode available to replicate the block to While testing datanode decommissioning using "dfs.exclude.hosts", I have encountered a scenario where the decommissioning gets stuck indefinitely Below is the progression of events: * there are initially 4 datanodes DN1, DN2, DN3, DN4 * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts" * HDFS block(s) on DN1 & DN2 must now be replicated to DN3 & DN4 in order to satisfy their minimum replication factor of 2 * during this replication process https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes the following inconsistent state: ** DN3 thinks it has the block pipeline in FINALIZED state ** the namenode does not think DN3 has the block pipeline {code:java} 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode (DataXceiver for client at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): DN3:9866:DataXceiver error processing WRITE_BLOCK operation src: /DN2:45654 dst: /DN3:9866; org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created. {code} * the replication is attempted again, but: ** DN4 has the block ** DN1 and/or DN2 have the block, but don't count towards the minimum replication factor because they are being decommissioned ** DN3 does not have the block & cannot have the block replicated to it because of HDFS-721 * the namenode repeatedly tries to replicate the block to DN3 & repeatedly fails, this continues indefinitely * therefore DN4 is the only live datanode with the block & the minimum replication factor of 2 cannot be satisfied * because the minimum replication factor cannot be satisfied for the block(s) being moved off DN1 & DN2, the datanode decommissioning can never be completed {code:java} 2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 2, maintenance replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , Current Datanode: DN1:9866, Is current datanode decommissioning: true, Is current datanode entering maintenance: false ... 
2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, decommissioned replicas: 0, decommissioning replicas: 2, maintenance replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , Current Datanode: DN2:9866, Is current datanode decommissioning: true, Is current datanode entering maintenance: false {code} Being stuck in decommissioning state forever is not an intended behavior of DataNode decommissioning A few potential solutions: * Address the root cause of the problem which is an inconsistent state between namenode & datanode: https://issues.apache.org/jira/browse/HDFS-721 * Detect when datanode decommissioning is stuck due to lack of available datanodes for satisfying the minimum replication factor, then recover by re-enabling the datanodes being decommissioned -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
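The second mitigation suggested above (detecting that decommissioning is stuck) can be expressed as a small predicate over the counters that already appear in the BlockStateChange log lines. The helper below is purely hypothetical: its names and signature are invented for illustration and do not correspond to the Namenode's real BlockManager API.
{code:java}
// Hypothetical helper illustrating the second mitigation suggested above: flag a block
// whose decommissioning cannot complete because live replicas alone cannot reach the
// replication factor and no additional target datanodes remain.
public class StuckDecommissionCheck {

  static boolean decommissionBlocked(int expectedReplicas,
                                     int liveReplicas,
                                     int decommissioningReplicas,
                                     int usableTargetDatanodes) {
    // Decommissioning replicas (DN1/DN2 in the example) do not count as live, so the
    // block stays under-replicated; with no usable target left (DN3 keeps rejecting the
    // transfer per HDFS-721), replication can never catch up.
    return liveReplicas < expectedReplicas
        && usableTargetDatanodes == 0
        && decommissioningReplicas > 0;
  }

  public static void main(String[] args) {
    // Values taken from the BlockStateChange log lines above:
    // Expected Replicas: 2, live replicas: 1, decommissioning replicas: 2.
    System.out.println(decommissionBlocked(2, 1, 2, 0)); // true -> decommissioning is stuck
  }
}
{code}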