[jira] [Commented] (HDFS-16064) Determine when to invalidate corrupt replicas based on number of usable replicas

2024-01-11 Thread Kevin Wikant (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17805579#comment-17805579
 ] 

Kevin Wikant commented on HDFS-16064:
-

{quote}Any reason why we haven't backported this fix to branch-2.10? 
{quote}
Back in 2022, I did try to backport this change to the 2.10.1 branch & 
encountered a unit test failure due to behavior that is inconsistent with Hadoop 3.x:
{quote}> mvn test -Dtest=TestDecommission
...

[ERROR] Tests run: 27, Failures: 0, Errors: 1, Skipped: 1, Time elapsed: 
263.603 s <<< FAILURE! - in org.apache.hadoop.hdfs.TestDecommission
[ERROR] 
testDeleteCorruptReplicaForUnderReplicatedBlock(org.apache.hadoop.hdfs.TestDecommission)
  Time elapsed: 60.462 s  <<< ERROR!
java.lang.Exception: test timed out after 6 milliseconds
        at java.lang.Thread.sleep(Native Method)
        at 
org.apache.hadoop.test.GenericTestUtils.waitFor(GenericTestUtils.java:366)
        at 
org.apache.hadoop.hdfs.TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock(TestDecommission.java:1918)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:47)
        at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
        at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:44)
        at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
        at 
org.junit.internal.runners.statements.FailOnTimeout$StatementThread.run(FailOnTimeout.java:74)
{quote}
I do not remember all the root cause details, but from my notes:
 * "The inconsistent behavior has to do with when Datanodes in the 
MiniDFSCluster are sending full block reports vs incremental block reports and 
how that gets handled by the Namenode. Also, the triggerBlockReport method does 
not work in a MiniDFSCluster (i.e. no block report is sent) and there is no way 
to control sending of incremental vs full block reports."

These Hadoop 2.x behavior differences in Namenode/Datanode/MiniDFSCluster were 
not fully root-caused & addressed, so this bug fix was only backported to 
Hadoop 3.x, which was sufficient for our needs.
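
For additional context on the timeout in the stack trace above: 
GenericTestUtils.waitFor is essentially a polling loop, so the failure means 
the backported Namenode/Datanode behavior never converged to the replica state 
the test waits for before the deadline. Below is a self-contained sketch of 
that general shape (illustrative only, not the actual Hadoop test code; the 
WaitForSketch class and its main() are made up for the example):
{code:java}
import java.util.concurrent.TimeoutException;
import java.util.function.Supplier;

public class WaitForSketch {
  // Simplified stand-in for the GenericTestUtils.waitFor(...) call in the
  // stack trace: poll a condition until it holds or the deadline passes.
  static void waitFor(Supplier<Boolean> check, long checkEveryMillis,
                      long waitForMillis)
      throws TimeoutException, InterruptedException {
    long deadline = System.currentTimeMillis() + waitForMillis;
    while (!check.get()) {
      if (System.currentTimeMillis() > deadline) {
        // This is the path the backported test effectively hits on 2.10.x:
        // the expected replica state is never reached, so the condition
        // never becomes true before the test times out.
        throw new TimeoutException("Timed out waiting for condition");
      }
      Thread.sleep(checkEveryMillis);
    }
  }

  public static void main(String[] args) throws Exception {
    // In the real test the condition would be "the corrupt replica has been
    // invalidated"; here a condition that never holds triggers the timeout.
    waitFor(() -> false, 100, 5_000);
  }
}
{code}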

> Determine when to invalidate corrupt replicas based on number of usable 
> replicas
> 
>
> Key: HDFS-16064
> URL: https://issues.apache.org/jira/browse/HDFS-16064
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 3.2.1
>Reporter: Kevin Wikant
>Assignee: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0, 3.2.4, 3.3.5
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a 
> non-issue under the assumption that if the namenode & a datanode get into an 
> inconsistent state for a given block pipeline, there should be another 
> datanode available to replicate the block to
> While testing datanode decommissioning using "dfs.exclude.hosts", I have 
> encountered a scenario where the decommissioning gets stuck indefinitely
> Below is the progression of events:
>  * there are initially 4 datanodes DN1, DN2, DN3, DN4
>  * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts"
>  * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in 
> order to satisfy their minimum replication factor of 2
>  * during this replication process 
> https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes 
> the following inconsistent state:
>  ** DN3 thinks it has the block pipeline in FINALIZED state
>  ** the namenode does not think DN3 has the block pipeline
> {code:java}
> 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
> (DataXceiver for client  at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): 
> DN3:9866:DataXceiver error processing WRITE_BLOCK operation  src: /DN2:45654 
> dst: /DN3:9866; 
> org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block 
> BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
> {code}
>  * the replication is attempted again, but:
>  ** DN4 has the block
>  ** DN1 and/or DN2 have the block, but don't count towards the minimum 
> replication factor because they are being decommissioned
>  ** DN3 does not have the block & cannot have the block replicated to it 
> because of HDFS-721
>  * the namenode 

[jira] [Updated] (HDFS-16664) Use correct GenerationStamp when invalidating corrupt block replicas

2022-07-16 Thread Kevin Wikant (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wikant updated HDFS-16664:

Description: 
While trying to backport HDFS-16064 to an older Hadoop version, the new unit 
test "testDeleteCorruptReplicaForUnderReplicatedBlock" started failing 
unexpectedly.

Upon deep diving this unit test failure, I identified a bug in HDFS corrupt 
replica invalidation which results in the following datanode exception:
{quote}2022-07-16 08:07:52,041 [BP-958471676-X-1657973243350 heartbeating to 
localhost/127.0.0.1:61365] WARN  datanode.DataNode 
(BPServiceActor.java:processCommand(887)) - Error processing datanode Command
java.io.IOException: Failed to delete 1 (out of 1) replica(s):
0) Failed to delete replica blk_1073741825_1005: GenerationStamp not matched, 
existing replica is blk_1073741825_1001
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2139)
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2034)
        at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:735)
        at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:883)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:678)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:849)
        at java.lang.Thread.run(Thread.java:750)
{quote}
The issue is that the Namenode is sending the wrong generationStamp to the 
datanode. By adding some additional logs, I was able to determine the root 
cause for this:
 * the generationStamp sent in the DNA_INVALIDATE is based on the 
[generationStamp of the block sent in the block 
report|https://github.com/apache/hadoop/blob/8774f178686487007dcf8c418c989b785a529000/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3733]
 * the problem is that the datanode with the corrupt block replica (that 
receives the DNA_INVALIDATE) is not necessarily the same datanode that sent the 
block report
 * this can cause the above exception when the corrupt block replica on the 
datanode receiving the DNA_INVALIDATE & the block replica on the datanode that 
sent the block report have different generationStamps

The solution is to store the corrupt replica's generationStamp in the 
CorruptReplicasMap, then to extract this correct generationStamp value when 
sending the DNA_INVALIDATE to the datanode
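
Conceptually, the fix amounts to something like the sketch below (illustrative 
only, not the actual CorruptReplicasMap/BlockManager code; the class and method 
names other than the CorruptReplicasMap concept itself are made up for the 
example):
{code:java}
import java.util.HashMap;
import java.util.Map;

// Sketch of the idea: remember, per datanode, the generationStamp of the
// replica that was marked corrupt, and use that value (rather than the
// genStamp taken from whichever block report triggered the invalidation)
// when building the DNA_INVALIDATE command for that datanode.
class CorruptReplicasSketch {
  // blockId -> (datanode id -> generationStamp of the corrupt replica)
  private final Map<Long, Map<String, Long>> corruptReplicas = new HashMap<>();

  void markCorrupt(long blockId, String datanodeId, long replicaGenStamp) {
    corruptReplicas
        .computeIfAbsent(blockId, k -> new HashMap<>())
        .put(datanodeId, replicaGenStamp);
  }

  // GenerationStamp to put in the DNA_INVALIDATE sent to this datanode.
  long genStampForInvalidate(long blockId, String datanodeId,
                             long genStampFromBlockReport) {
    Map<String, Long> perNode = corruptReplicas.get(blockId);
    if (perNode != null && perNode.containsKey(datanodeId)) {
      // Use the stamp recorded for the replica actually stored on this
      // datanode, so FsDatasetImpl.invalidate() does not reject the delete
      // with "GenerationStamp not matched".
      return perNode.get(datanodeId);
    }
    // Fallback: the pre-fix behavior (genStamp from the reporting datanode).
    return genStampFromBlockReport;
  }
}
{code}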

 
h2. Failed Test - Before the fix
{quote}> mvn test 
-Dtest=TestDecommission#testDeleteCorruptReplicaForUnderReplicatedBlock

 

[INFO] Results:
[INFO] 
[ERROR] Failures: 
[ERROR]   TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock:2035 
Node 127.0.0.1:61366 failed to complete decommissioning. numTrackedNodes=1 , 
numPendingNodes=0 , adminState=Decommission In Progress , 
nodesWithReplica=[127.0.0.1:61366, 127.0.0.1:61419]
{quote}
Logs:
{quote}> cat 
target/surefire-reports/org.apache.hadoop.hdfs.TestDecommission-output.txt | 
grep 'Expected Replicas:\|XXX\|FINALIZED\|Block now\|Failed to delete'

2022-07-16 08:07:45,891 [Listener at localhost/61378] INFO  
hdfs.TestDecommission 
(TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1942)) - 
Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 1 
live replica on 127.0.0.1:61366
2022-07-16 08:07:45,913 [Listener at localhost/61378] INFO  
hdfs.TestDecommission 
(TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1974)) - 
Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 1 
decommissioning replica on 127.0.0.1:61366
XXX invalidateBlock dn=127.0.0.1:61415 , blk=1073741825_1001
XXX postponeBlock dn=127.0.0.1:61415 , blk=1073741825_1001
XXX invalidateBlock dn=127.0.0.1:61419 , blk=1073741825_1003
XXX addToInvalidates dn=127.0.0.1:61419 , blk=1073741825_1003
XXX addBlocksToBeInvalidated dn=127.0.0.1:61419 , blk=1073741825_1003
XXX rescanPostponedMisreplicatedBlocks blk=1073741825_1005
XXX DNA_INVALIDATE dn=/127.0.0.1:61419 , blk=1073741825_1003
XXX invalidate(on DN) dn=/127.0.0.1:61419 , invalidBlk=blk_1073741825_1003 , 
blkByIdAndGenStamp = FinalizedReplica, blk_1073741825_1003, FINALIZED
2022-07-16 08:07:49,084 [BP-958471676-X-1657973243350 heartbeating to 
localhost/127.0.0.1:61365] INFO  impl.FsDatasetAsyncDiskService 
(FsDatasetAsyncDiskService.java:deleteAsync(226)) - Scheduling 
blk_1073741825_1003 replica FinalizedReplica, blk_1073741825_1003, FINALIZED
XXX addBlock dn=127.0.0.1:61419 , blk=1073741825_1005   *<<<  block report is 
coming from 127.0.0.1:61419 which has genStamp=1005*
XXX invalidateCorruptReplicas 

[jira] [Comment Edited] (HDFS-16664) Use correct GenerationStamp when invalidating corrupt block replicas

2022-07-16 Thread Kevin Wikant (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567522#comment-17567522
 ] 

Kevin Wikant edited comment on HDFS-16664 at 7/16/22 5:14 PM:
--

The test "testDeleteCorruptReplicaForUnderReplicatedBlock" is failing when 
backporting to: [https://github.com/apache/hadoop/tree/branch-3.2.1]

See section "Why does unit test failure not reproduce in Hadoop trunk?" for 
additional details


was (Author: kevinwikant):
The test "testDeleteCorruptReplicaForUnderReplicatedBlock" is failing when 
backporting to: [https://github.com/apache/hadoop/tree/branch-3.2.1]

> Use correct GenerationStamp when invalidating corrupt block replicas
> 
>
> Key: HDFS-16664
> URL: https://issues.apache.org/jira/browse/HDFS-16664
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> While trying to backport HDFS-16064 to an older Hadoop version, the new unit 
> test "testDeleteCorruptReplicaForUnderReplicatedBlock" started failing 
> unexpectedly.
> Upon deep diving this unit test failure, I identified a bug in HDFS corrupt 
> replica invalidation which results in the following datanode exception:
> {quote}2022-07-16 08:07:52,041 [BP-958471676-X-1657973243350 heartbeating to 
> localhost/127.0.0.1:61365] WARN  datanode.DataNode 
> (BPServiceActor.java:processCommand(887)) - Error processing datanode Command
> java.io.IOException: Failed to delete 1 (out of 1) replica(s):
> 0) Failed to delete replica blk_1073741825_1005: GenerationStamp not matched, 
> existing replica is blk_1073741825_1001
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2139)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2034)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:735)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:883)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:678)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:849)
>         at java.lang.Thread.run(Thread.java:750)
> {quote}
> The issue is that the Namenode is sending the wrong generationStamp to the 
> datanode. By adding some additional logs, I was able to determine the root 
> cause for this:
>  * the generationStamp sent in the DNA_INVALIDATE is based on the 
> [generationStamp of the block sent in the block 
> report|https://github.com/apache/hadoop/blob/8774f178686487007dcf8c418c989b785a529000/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3733]
>  * the problem is that the datanode with the corrupt block replica (that 
> receives the DNA_INVALIDATE) is not necessarily the same datanode that sent 
> the block report
>  * this can cause the above exception when the corrupt block replica on the 
> datanode receiving the DNA_INVALIDATE & the block replica on the datanode 
> that sent the block report have different generationStamps
> The solution is to store the corrupt replica's generationStamp in the 
> CorruptReplicasMap, then to extract this correct generationStamp value when 
> sending the DNA_INVALIDATE to the datanode
>  
> h2. Failed Test - Before the fix
> {quote}> mvn test 
> -Dtest=TestDecommission#testDeleteCorruptReplicaForUnderReplicatedBlock
>  
> [INFO] Results:
> [INFO] 
> [ERROR] Failures: 
> [ERROR]   
> TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock:2035 Node 
> 127.0.0.1:61366 failed to complete decommissioning. numTrackedNodes=1 , 
> numPendingNodes=0 , adminState=Decommission In Progress , 
> nodesWithReplica=[127.0.0.1:61366, 127.0.0.1:61419]
> {quote}
> Logs:
> {quote}> cat 
> target/surefire-reports/org.apache.hadoop.hdfs.TestDecommission-output.txt | 
> grep 'Expected Replicas:\|XXX\|FINALIZED\|Block now\|Failed to delete'
> 2022-07-16 08:07:45,891 [Listener at localhost/61378] INFO  
> hdfs.TestDecommission 
> (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1942)) 
> - Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 
> 1 live replica on 127.0.0.1:61366
> 2022-07-16 08:07:45,913 [Listener at localhost/61378] INFO  
> hdfs.TestDecommission 
> (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1974)) 
> - Block now has 2 corrupt replicas on 

[jira] [Commented] (HDFS-16664) Use correct GenerationStamp when invalidating corrupt block replicas

2022-07-16 Thread Kevin Wikant (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567522#comment-17567522
 ] 

Kevin Wikant commented on HDFS-16664:
-

The issue is occurring when backporting to: 
https://github.com/apache/hadoop/tree/branch-3.2.1

> Use correct GenerationStamp when invalidating corrupt block replicas
> 
>
> Key: HDFS-16664
> URL: https://issues.apache.org/jira/browse/HDFS-16664
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> While trying to backport HDFS-16064 to an older Hadoop version, the new unit 
> test "testDeleteCorruptReplicaForUnderReplicatedBlock" started failing 
> unexpectedly.
> Upon deep diving this unit test failure, I identified a bug in HDFS corrupt 
> replica invalidation which results in the following datanode exception:
> {quote}2022-07-16 08:07:52,041 [BP-958471676-X-1657973243350 heartbeating to 
> localhost/127.0.0.1:61365] WARN  datanode.DataNode 
> (BPServiceActor.java:processCommand(887)) - Error processing datanode Command
> java.io.IOException: Failed to delete 1 (out of 1) replica(s):
> 0) Failed to delete replica blk_1073741825_1005: GenerationStamp not matched, 
> existing replica is blk_1073741825_1001
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2139)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2034)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:735)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:883)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:678)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:849)
>         at java.lang.Thread.run(Thread.java:750)
> {quote}
> The issue is that the Namenode is sending the wrong generationStamp to the 
> datanode. By adding some additional logs, I was able to determine the root 
> cause for this:
>  * the generationStamp sent in the DNA_INVALIDATE is based on the 
> [generationStamp of the block sent in the block 
> report|https://github.com/apache/hadoop/blob/8774f178686487007dcf8c418c989b785a529000/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3733]
>  * the problem is that the datanode with the corrupt block replica (that 
> receives the DNA_INVALIDATE) is not necessarily the same datanode that sent 
> the block report
>  * this can cause the above exception when the corrupt block replica on the 
> datanode receiving the DNA_INVALIDATE & the block replica on the datanode 
> that sent the block report have different generationStamps
> The solution is to store the corrupt replica's generationStamp in the 
> CorruptReplicasMap, then to extract this correct generationStamp value when 
> sending the DNA_INVALIDATE to the datanode
>  
> h2. Failed Test - Before the fix
> {quote}> mvn test 
> -Dtest=TestDecommission#testDeleteCorruptReplicaForUnderReplicatedBlock
>  
> [INFO] Results:
> [INFO] 
> [ERROR] Failures: 
> [ERROR]   
> TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock:2035 Node 
> 127.0.0.1:61366 failed to complete decommissioning. numTrackedNodes=1 , 
> numPendingNodes=0 , adminState=Decommission In Progress , 
> nodesWithReplica=[127.0.0.1:61366, 127.0.0.1:61419]
> {quote}
> Logs:
> {quote}> cat 
> target/surefire-reports/org.apache.hadoop.hdfs.TestDecommission-output.txt | 
> grep 'Expected Replicas:\|XXX\|FINALIZED\|Block now\|Failed to delete'
> 2022-07-16 08:07:45,891 [Listener at localhost/61378] INFO  
> hdfs.TestDecommission 
> (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1942)) 
> - Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 
> 1 live replica on 127.0.0.1:61366
> 2022-07-16 08:07:45,913 [Listener at localhost/61378] INFO  
> hdfs.TestDecommission 
> (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1974)) 
> - Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 
> 1 decommissioning replica on 127.0.0.1:61366
> XXX invalidateBlock dn=127.0.0.1:61415 , blk=1073741825_1001
> XXX postponeBlock dn=127.0.0.1:61415 , blk=1073741825_1001
> XXX invalidateBlock dn=127.0.0.1:61419 , blk=1073741825_1003
> XXX addToInvalidates dn=127.0.0.1:61419 , blk=1073741825_1003
> XXX addBlocksToBeInvalidated 

[jira] [Comment Edited] (HDFS-16664) Use correct GenerationStamp when invalidating corrupt block replicas

2022-07-16 Thread Kevin Wikant (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16664?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17567522#comment-17567522
 ] 

Kevin Wikant edited comment on HDFS-16664 at 7/16/22 5:13 PM:
--

The test "testDeleteCorruptReplicaForUnderReplicatedBlock" is failing when 
backporting to: [https://github.com/apache/hadoop/tree/branch-3.2.1]


was (Author: kevinwikant):
The issue is occurring when backporting to: 
https://github.com/apache/hadoop/tree/branch-3.2.1

> Use correct GenerationStamp when invalidating corrupt block replicas
> 
>
> Key: HDFS-16664
> URL: https://issues.apache.org/jira/browse/HDFS-16664
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> While trying to backport HDFS-16064 to an older Hadoop version, the new unit 
> test "testDeleteCorruptReplicaForUnderReplicatedBlock" started failing 
> unexpectedly.
> Upon deep diving this unit test failure, I identified a bug in HDFS corrupt 
> replica invalidation which results in the following datanode exception:
> {quote}2022-07-16 08:07:52,041 [BP-958471676-X-1657973243350 heartbeating to 
> localhost/127.0.0.1:61365] WARN  datanode.DataNode 
> (BPServiceActor.java:processCommand(887)) - Error processing datanode Command
> java.io.IOException: Failed to delete 1 (out of 1) replica(s):
> 0) Failed to delete replica blk_1073741825_1005: GenerationStamp not matched, 
> existing replica is blk_1073741825_1001
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2139)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2034)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:735)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:883)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:678)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:849)
>         at java.lang.Thread.run(Thread.java:750)
> {quote}
> The issue is that the Namenode is sending the wrong generationStamp to the 
> datanode. By adding some additional logs, I was able to determine the root 
> cause for this:
>  * the generationStamp sent in the DNA_INVALIDATE is based on the 
> [generationStamp of the block sent in the block 
> report|https://github.com/apache/hadoop/blob/8774f178686487007dcf8c418c989b785a529000/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3733]
>  * the problem is that the datanode with the corrupt block replica (that 
> receives the DNA_INVALIDATE) is not necessarily the same datanode that sent 
> the block report
>  * this can cause the above exception when the corrupt block replica on the 
> datanode receiving the DNA_INVALIDATE & the block replica on the datanode 
> that sent the block report have different generationStamps
> The solution is to store the corrupt replica's generationStamp in the 
> CorruptReplicasMap, then to extract this correct generationStamp value when 
> sending the DNA_INVALIDATE to the datanode
>  
> h2. Failed Test - Before the fix
> {quote}> mvn test 
> -Dtest=TestDecommission#testDeleteCorruptReplicaForUnderReplicatedBlock
>  
> [INFO] Results:
> [INFO] 
> [ERROR] Failures: 
> [ERROR]   
> TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock:2035 Node 
> 127.0.0.1:61366 failed to complete decommissioning. numTrackedNodes=1 , 
> numPendingNodes=0 , adminState=Decommission In Progress , 
> nodesWithReplica=[127.0.0.1:61366, 127.0.0.1:61419]
> {quote}
> Logs:
> {quote}> cat 
> target/surefire-reports/org.apache.hadoop.hdfs.TestDecommission-output.txt | 
> grep 'Expected Replicas:\|XXX\|FINALIZED\|Block now\|Failed to delete'
> 2022-07-16 08:07:45,891 [Listener at localhost/61378] INFO  
> hdfs.TestDecommission 
> (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1942)) 
> - Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 
> 1 live replica on 127.0.0.1:61366
> 2022-07-16 08:07:45,913 [Listener at localhost/61378] INFO  
> hdfs.TestDecommission 
> (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1974)) 
> - Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 
> 1 decommissioning replica on 127.0.0.1:61366
> XXX invalidateBlock dn=127.0.0.1:61415 , blk=1073741825_1001
> 

[jira] [Updated] (HDFS-16664) Use correct GenerationStamp when invalidating corrupt block replicas

2022-07-16 Thread Kevin Wikant (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wikant updated HDFS-16664:

Summary: Use correct GenerationStamp when invalidating corrupt block 
replicas  (was: Use correct GenerationStamp when invalidating corrupt block 
replica)

> Use correct GenerationStamp when invalidating corrupt block replicas
> 
>
> Key: HDFS-16664
> URL: https://issues.apache.org/jira/browse/HDFS-16664
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Kevin Wikant
>Priority: Major
>
> While trying to backport HDFS-16064 to an older Hadoop version, the new unit 
> test "testDeleteCorruptReplicaForUnderReplicatedBlock" started failing 
> unexpectedly.
> Upon deep diving this unit test failure, I identified a bug in HDFS corrupt 
> replica invalidation which results in the following datanode exception:
> {quote}2022-07-16 08:07:52,041 [BP-958471676-X-1657973243350 heartbeating to 
> localhost/127.0.0.1:61365] WARN  datanode.DataNode 
> (BPServiceActor.java:processCommand(887)) - Error processing datanode Command
> java.io.IOException: Failed to delete 1 (out of 1) replica(s):
> 0) Failed to delete replica blk_1073741825_1005: GenerationStamp not matched, 
> existing replica is blk_1073741825_1001
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2139)
>         at 
> org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2034)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:735)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:883)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:678)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:849)
>         at java.lang.Thread.run(Thread.java:750)
> {quote}
> The issue is that the Namenode is sending the wrong generationStamp to the 
> datanode. By adding some additional logs, I was able to determine the root 
> cause for this:
>  * the generationStamp sent in the DNA_INVALIDATE is based on the 
> [generationStamp of the block sent in the block 
> report|https://github.com/apache/hadoop/blob/8774f178686487007dcf8c418c989b785a529000/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3733]
>  * the problem is that the datanode with the corrupt block replica (that 
> receives the DNA_INVALIDATE) is not necessarily the same datanode that sent 
> the block report
>  * this can cause the above exception when the corrupt block replica on the 
> datanode receiving the DNA_INVALIDATE & the block replica on the datanode 
> that sent the block report have different generationStamps
> The solution is to store the corrupt replica's generationStamp in the 
> CorruptReplicasMap, then to extract this correct generationStamp value when 
> sending the DNA_INVALIDATE to the datanode
>  
> h2. Failed Test - Before the fix
> {quote}> mvn test 
> -Dtest=TestDecommission#testDeleteCorruptReplicaForUnderReplicatedBlock
>  
> [INFO] Results:
> [INFO] 
> [ERROR] Failures: 
> [ERROR]   
> TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock:2035 Node 
> 127.0.0.1:61366 failed to complete decommissioning. numTrackedNodes=1 , 
> numPendingNodes=0 , adminState=Decommission In Progress , 
> nodesWithReplica=[127.0.0.1:61366, 127.0.0.1:61419]
> {quote}
> Logs:
> {quote}> cat 
> target/surefire-reports/org.apache.hadoop.hdfs.TestDecommission-output.txt | 
> grep 'Expected Replicas:\|XXX\|FINALIZED\|Block now\|Failed to delete'
> 2022-07-16 08:07:45,891 [Listener at localhost/61378] INFO  
> hdfs.TestDecommission 
> (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1942)) 
> - Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 
> 1 live replica on 127.0.0.1:61366
> 2022-07-16 08:07:45,913 [Listener at localhost/61378] INFO  
> hdfs.TestDecommission 
> (TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1974)) 
> - Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 
> 1 decommissioning replica on 127.0.0.1:61366
> XXX invalidateBlock dn=127.0.0.1:61415 , blk=1073741825_1001
> XXX postponeBlock dn=127.0.0.1:61415 , blk=1073741825_1001
> XXX invalidateBlock dn=127.0.0.1:61419 , blk=1073741825_1003
> XXX addToInvalidates dn=127.0.0.1:61419 , blk=1073741825_1003
> XXX addBlocksToBeInvalidated dn=127.0.0.1:61419 , blk=1073741825_1003
> XXX rescanPostponedMisreplicatedBlocks 

[jira] [Updated] (HDFS-16664) Use correct GenerationStamp when invalidating corrupt block replica

2022-07-16 Thread Kevin Wikant (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wikant updated HDFS-16664:

Description: 
While trying to backport HDFS-16064 to an older Hadoop version, the new unit 
test "testDeleteCorruptReplicaForUnderReplicatedBlock" started failing 
unexpectedly.

Upon deep diving this unit test failure, I identified a bug in HDFS corrupt 
replica invalidation which results in the following datanode exception:
{quote}2022-07-16 08:07:52,041 [BP-958471676-X-1657973243350 heartbeating to 
localhost/127.0.0.1:61365] WARN  datanode.DataNode 
(BPServiceActor.java:processCommand(887)) - Error processing datanode Command
java.io.IOException: Failed to delete 1 (out of 1) replica(s):
0) Failed to delete replica blk_1073741825_1005: GenerationStamp not matched, 
existing replica is blk_1073741825_1001
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2139)
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2034)
        at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:735)
        at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:883)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:678)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:849)
        at java.lang.Thread.run(Thread.java:750)
{quote}
The issue is that the Namenode is sending the wrong generationStamp to the 
datanode. By adding some additional logs, I was able to determine the root 
cause for this:
 * the generationStamp sent in the DNA_INVALIDATE is based on the 
[generationStamp of the block sent in the block 
report|https://github.com/apache/hadoop/blob/8774f178686487007dcf8c418c989b785a529000/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3733]
 * the problem is that the datanode with the corrupt block replica (that 
receives the DNA_INVALIDATE) is not necessarily the same datanode that sent the 
block report
 * this can cause the above exception when the corrupt block replica on the 
datanode receiving the DNA_INVALIDATE & the block replica on the datanode that 
sent the block report have different generationStamps

The solution is to store the corrupt replica's generationStamp in the 
CorruptReplicasMap, then to extract this correct generationStamp value when 
sending the DNA_INVALIDATE to the datanode

 
h2. Failed Test - Before the fix
{quote}> mvn test 
-Dtest=TestDecommission#testDeleteCorruptReplicaForUnderReplicatedBlock

 

[INFO] Results:
[INFO] 
[ERROR] Failures: 
[ERROR]   TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock:2035 
Node 127.0.0.1:61366 failed to complete decommissioning. numTrackedNodes=1 , 
numPendingNodes=0 , adminState=Decommission In Progress , 
nodesWithReplica=[127.0.0.1:61366, 127.0.0.1:61419]
{quote}
Logs:
{quote}> cat 
target/surefire-reports/org.apache.hadoop.hdfs.TestDecommission-output.txt | 
grep 'Expected Replicas:\|XXX\|FINALIZED\|Block now\|Failed to delete'

2022-07-16 08:07:45,891 [Listener at localhost/61378] INFO  
hdfs.TestDecommission 
(TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1942)) - 
Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 1 
live replica on 127.0.0.1:61366
2022-07-16 08:07:45,913 [Listener at localhost/61378] INFO  
hdfs.TestDecommission 
(TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1974)) - 
Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 1 
decommissioning replica on 127.0.0.1:61366
XXX invalidateBlock dn=127.0.0.1:61415 , blk=1073741825_1001
XXX postponeBlock dn=127.0.0.1:61415 , blk=1073741825_1001
XXX invalidateBlock dn=127.0.0.1:61419 , blk=1073741825_1003
XXX addToInvalidates dn=127.0.0.1:61419 , blk=1073741825_1003
XXX addBlocksToBeInvalidated dn=127.0.0.1:61419 , blk=1073741825_1003
XXX rescanPostponedMisreplicatedBlocks blk=1073741825_1005
XXX DNA_INVALIDATE dn=/127.0.0.1:61419 , blk=1073741825_1003
XXX invalidate(on DN) dn=/127.0.0.1:61419 , invalidBlk=blk_1073741825_1003 , 
blkByIdAndGenStamp = FinalizedReplica, blk_1073741825_1003, FINALIZED
2022-07-16 08:07:49,084 [BP-958471676-X-1657973243350 heartbeating to 
localhost/127.0.0.1:61365] INFO  impl.FsDatasetAsyncDiskService 
(FsDatasetAsyncDiskService.java:deleteAsync(226)) - Scheduling 
blk_1073741825_1003 replica FinalizedReplica, blk_1073741825_1003, FINALIZED
XXX addBlock dn=127.0.0.1:61419 , blk=1073741825_1005   *<<<  block report is 
coming from 127.0.0.1:61419 which has genStamp=1005*
XXX invalidateCorruptReplicas 

[jira] [Updated] (HDFS-16664) Use correct GenerationStamp when invalidating corrupt block replica

2022-07-16 Thread Kevin Wikant (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16664?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wikant updated HDFS-16664:

Description: 
While trying to backport HDFS-16064 to an older Hadoop version, the new unit 
test "testDeleteCorruptReplicaForUnderReplicatedBlock" started failing 
unexpectedly.

Upon deep diving this unit test failure, I identified a bug in HDFS corrupt 
replica invalidation which results in the following datanode exception:
{quote}2022-07-16 08:07:52,041 [BP-958471676-X-1657973243350 heartbeating to 
localhost/127.0.0.1:61365] WARN  datanode.DataNode 
(BPServiceActor.java:processCommand(887)) - Error processing datanode Command
java.io.IOException: Failed to delete 1 (out of 1) replica(s):
0) Failed to delete replica blk_1073741825_1005: GenerationStamp not matched, 
existing replica is blk_1073741825_1001
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2139)
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2034)
        at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:735)
        at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:883)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:678)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:849)
        at java.lang.Thread.run(Thread.java:750)
{quote}
The issue is that the Namenode is sending the wrong generationStamp to the 
datanode. By adding some additional logs, I was able to determine the root 
cause for this:
 * the generationStamp sent in the DNA_INVALIDATE is based on the 
[generationStamp of the block sent in the block 
report|https://github.com/apache/hadoop/blob/8774f178686487007dcf8c418c989b785a529000/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3733]
 * the problem is that the datanode with the corrupt block replica (that 
receives the DNA_INVALIDATE) is not necessarily the same datanode that sent the 
block report
 * this can cause the above exception when the corrupt block replica on the 
datanode receiving the DNA_INVALIDATE & the block replica on the datanode that 
sent the block report have different generationStamps

The solution is to store the corrupt replica's generationStamp in the 
CorruptReplicasMap, then to extract this correct generationStamp value when 
sending the DNA_INVALIDATE to the datanode

 
h2. Failed Test - Before the fix
{quote}> mvn test 
-Dtest=TestDecommission#testDeleteCorruptReplicaForUnderReplicatedBlock

 

[INFO] Results:
[INFO] 
[ERROR] Failures: 
[ERROR]   TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock:2035 
Node 127.0.0.1:61366 failed to complete decommissioning. numTrackedNodes=1 , 
numPendingNodes=0 , adminState=Decommission In Progress , 
nodesWithReplica=[127.0.0.1:61366, 127.0.0.1:61419]
{quote}
Logs:
{quote}> cat 
target/surefire-reports/org.apache.hadoop.hdfs.TestDecommission-output.txt | 
grep 'Expected Replicas:\|XXX\|FINALIZED\|Block now\|Failed to delete'

2022-07-16 08:07:45,891 [Listener at localhost/61378] INFO  
hdfs.TestDecommission 
(TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1942)) - 
Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 1 
live replica on 127.0.0.1:61366
2022-07-16 08:07:45,913 [Listener at localhost/61378] INFO  
hdfs.TestDecommission 
(TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1974)) - 
Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 1 
decommissioning replica on 127.0.0.1:61366
XXX invalidateBlock dn=127.0.0.1:61415 , blk=1073741825_1001
XXX postponeBlock dn=127.0.0.1:61415 , blk=1073741825_1001
XXX invalidateBlock dn=127.0.0.1:61419 , blk=1073741825_1003
XXX addToInvalidates dn=127.0.0.1:61419 , blk=1073741825_1003
XXX addBlocksToBeInvalidated dn=127.0.0.1:61419 , blk=1073741825_1003
XXX rescanPostponedMisreplicatedBlocks blk=1073741825_1005
XXX DNA_INVALIDATE dn=/127.0.0.1:61419 , blk=1073741825_1003
XXX invalidate(on DN) dn=/127.0.0.1:61419 , invalidBlk=blk_1073741825_1003 , 
blkByIdAndGenStamp = FinalizedReplica, blk_1073741825_1003, FINALIZED
2022-07-16 08:07:49,084 [BP-958471676-X-1657973243350 heartbeating to 
localhost/127.0.0.1:61365] INFO  impl.FsDatasetAsyncDiskService 
(FsDatasetAsyncDiskService.java:deleteAsync(226)) - Scheduling 
blk_1073741825_1003 replica FinalizedReplica, blk_1073741825_1003, FINALIZED
XXX addBlock dn=127.0.0.1:61419 , blk=1073741825_1005   <<<  block report is 
coming from 127.0.0.1:61419 which has genStamp=1005
XXX invalidateCorruptReplicas 

[jira] [Created] (HDFS-16664) Use correct GenerationStamp when invalidating corrupt block replica

2022-07-16 Thread Kevin Wikant (Jira)
Kevin Wikant created HDFS-16664:
---

 Summary: Use correct GenerationStamp when invalidating corrupt 
block replica
 Key: HDFS-16664
 URL: https://issues.apache.org/jira/browse/HDFS-16664
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Kevin Wikant


While trying to backport HDFS-16064 to an older Hadoop version, the new unit 
test "testDeleteCorruptReplicaForUnderReplicatedBlock" started failing 
unexpectedly.

Upon deep diving this unit test failure, I identified a bug in HDFS corrupt 
replica invalidation which results in the following datanode exception:
{quote}2022-07-16 08:07:52,041 [BP-958471676-X-1657973243350 heartbeating to 
localhost/127.0.0.1:61365] WARN  datanode.DataNode 
(BPServiceActor.java:processCommand(887)) - Error processing datanode Command
java.io.IOException: Failed to delete 1 (out of 1) replica(s):
0) Failed to delete replica blk_1073741825_1005: GenerationStamp not matched, 
existing replica is blk_1073741825_1001
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2139)
        at 
org.apache.hadoop.hdfs.server.datanode.fsdataset.impl.FsDatasetImpl.invalidate(FsDatasetImpl.java:2034)
        at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActive(BPOfferService.java:735)
        at 
org.apache.hadoop.hdfs.server.datanode.BPOfferService.processCommandFromActor(BPOfferService.java:680)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.processCommand(BPServiceActor.java:883)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:678)
        at 
org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:849)
        at java.lang.Thread.run(Thread.java:750)
{quote}
The issue is that the Namenode is sending the wrong generationStamp to the 
datanode. By adding some additional logs, I was able to determine the root 
cause for this:
 * the generationStamp sent in the DNA_INVALIDATE is based on the 
[generationStamp of the block sent in the block 
report|https://github.com/apache/hadoop/blob/8774f178686487007dcf8c418c989b785a529000/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3733]
 * the problem is that the datanode with the corrupt block replica (that 
receives the DNA_INVALIDATE) is not necessarily the same datanode that sent the 
block report
 * this can cause the above exception when the corrupt block replica on the 
datanode receiving the DNA_INVALIDATE & the block replica on the datanode that 
sent the block report have different generationStamps

The solution is to store the corrupt replica's generationStamp in the 
CorruptReplicasMap, then to extract this correct generationStamp value when 
sending the DNA_INVALIDATE to the datanode

 
h2. Failed Test - Before the fix
{quote}> mvn test 
-Dtest=TestDecommission#testDeleteCorruptReplicaForUnderReplicatedBlock

 

[INFO] Results:
[INFO] 
[ERROR] Failures: 
[ERROR]   TestDecommission.testDeleteCorruptReplicaForUnderReplicatedBlock:2035 
Node 127.0.0.1:61366 failed to complete decommissioning. numTrackedNodes=1 , 
numPendingNodes=0 , adminState=Decommission In Progress , 
nodesWithReplica=[127.0.0.1:61366, 127.0.0.1:61419]
{quote}
Logs:
{quote}> cat 
target/surefire-reports/org.apache.hadoop.hdfs.TestDecommission-output.txt | 
grep 'Expected Replicas:\|XXX\|FINALIZED\|Block now\|Failed to delete'


2022-07-16 08:07:45,891 [Listener at localhost/61378] INFO  
hdfs.TestDecommission 
(TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1942)) - 
Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 1 
live replica on 127.0.0.1:61366
2022-07-16 08:07:45,913 [Listener at localhost/61378] INFO  
hdfs.TestDecommission 
(TestDecommission.java:testDeleteCorruptReplicaForUnderReplicatedBlock(1974)) - 
Block now has 2 corrupt replicas on [127.0.0.1:61370 , 127.0.0.1:61375] and 1 
decommissioning replica on 127.0.0.1:61366
XXX invalidateBlock dn=127.0.0.1:61415 , blk=1073741825_1001
XXX postponeBlock dn=127.0.0.1:61415 , blk=1073741825_1001
XXX invalidateBlock dn=127.0.0.1:61419 , blk=1073741825_1003
XXX addToInvalidates dn=127.0.0.1:61419 , blk=1073741825_1003
XXX addBlocksToBeInvalidated dn=127.0.0.1:61419 , blk=1073741825_1003
XXX rescanPostponedMisreplicatedBlocks blk=1073741825_1005
XXX DNA_INVALIDATE dn=/127.0.0.1:61419 , blk=1073741825_1003
XXX invalidate(on DN) dn=/127.0.0.1:61419 , invalidBlk=blk_1073741825_1003 , 
blkByIdAndGenStamp = FinalizedReplica, blk_1073741825_1003, FINALIZED
2022-07-16 08:07:49,084 [BP-958471676-X-1657973243350 heartbeating to 
localhost/127.0.0.1:61365] INFO  impl.FsDatasetAsyncDiskService 
(FsDatasetAsyncDiskService.java:deleteAsync(226)) - Scheduling 
blk_1073741825_1003 replica FinalizedReplica, blk_1073741825_1003, FINALIZED
XXX 

[jira] [Comment Edited] (HDFS-16064) HDFS-721 causes DataNode decommissioning to get stuck indefinitely

2022-06-07 Thread Kevin Wikant (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550619#comment-17550619
 ] 

Kevin Wikant edited comment on HDFS-16064 at 6/7/22 12:29 PM:
--

Thanks [~it_singer] , you are correct in that my initial root cause was 
incomplete

In the past few months I have seen this issue re-occur multiple times, so I 
decided to do a deeper dive & identified the bug described here: 
[https://github.com/apache/hadoop/pull/4410]

I think the issue described in this ticket is occurring because the corrupt 
replica on DN3 will not be invalidated until DN3 either:
 * restarts & sends a block report
 * sends its next periodic block report (default interval is 6 hours)

So in the worst case the decommissioning in the aforementioned scenario will 
take up to 6 hours to complete because DN3 may take up to 6 hours to send its 
next block report & have the corrupt replica invalidated. I have not targeted 
fixing this decommissioning blocker scenario because it is arguably expected 
behavior & will resolve in at most "dfs.blockreport.intervalMsec". Instead the 
fix [[https://github.com/apache/hadoop/pull/4410]] is targeting a more severe 
bug where decommissioning gets blocked indefinitely
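
For reference, the interval in question is the "dfs.blockreport.intervalMsec" 
setting (default 21600000 ms, i.e. the 6 hours mentioned above). Below is a 
minimal sketch of reading/overriding it via a standard Hadoop Configuration; 
the 1-hour override is only an example value:
{code:java}
import org.apache.hadoop.conf.Configuration;

public class BlockReportIntervalExample {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Default full block report interval is 21600000 ms (6 hours), which is
    // the upper bound for the corrupt replica on DN3 to get invalidated.
    long current = conf.getLong("dfs.blockreport.intervalMsec", 21600000L);
    System.out.println("block report interval = " + current + " ms");
    // Example override (1 hour); shortens the worst-case wait described above.
    conf.setLong("dfs.blockreport.intervalMsec", 3600000L);
  }
}
{code}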


was (Author: kevinwikant):
Thanks [~it_singer] , you are correct in that my initial root cause was very 
much incorrect

In the past few months I have seen this issue re-occur multiple times, I 
decided to do a deeper dive & I identified the bug described here: 
[https://github.com/apache/hadoop/pull/4410]

I think the issue described in this ticket is occurring because the corrupt 
replica on DN3 will not be invalidated until DN3 either:
 * restarts & sends a block report
 * sends its next periodic block report (default interval is 6 hours)

So in the worst case the decommissioning in the aforementioned scenario will 
take up to 6 hours to complete because DN3 may take up to 6 hours to send its 
next block report & have the corrupt replica invalidated. I have not targeted 
fixing this decommissioning blocker scenario because it is arguably expected 
behavior & will resolve in at most "dfs.blockreport.intervalMsec". Instead the 
fix [[https://github.com/apache/hadoop/pull/4410]] is targeting a more severe 
bug where decommissioning gets blocked indefinitely

> HDFS-721 causes DataNode decommissioning to get stuck indefinitely
> --
>
> Key: HDFS-16064
> URL: https://issues.apache.org/jira/browse/HDFS-16064
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 3.2.1
>Reporter: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a 
> non-issue under the assumption that if the namenode & a datanode get into an 
> inconsistent state for a given block pipeline, there should be another 
> datanode available to replicate the block to
> While testing datanode decommissioning using "dfs.exclude.hosts", I have 
> encountered a scenario where the decommissioning gets stuck indefinitely
> Below is the progression of events:
>  * there are initially 4 datanodes DN1, DN2, DN3, DN4
>  * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts"
>  * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in 
> order to satisfy their minimum replication factor of 2
>  * during this replication process 
> https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes 
> the following inconsistent state:
>  ** DN3 thinks it has the block pipeline in FINALIZED state
>  ** the namenode does not think DN3 has the block pipeline
> {code:java}
> 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
> (DataXceiver for client  at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): 
> DN3:9866:DataXceiver error processing WRITE_BLOCK operation  src: /DN2:45654 
> dst: /DN3:9866; 
> org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block 
> BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
> {code}
>  * the replication is attempted again, but:
>  ** DN4 has the block
>  ** DN1 and/or DN2 have the block, but don't count towards the minimum 
> replication factor because they are being decommissioned
>  ** DN3 does not have the block & cannot have the block replicated to it 
> because of HDFS-721
>  * the namenode repeatedly tries to replicate the block to DN3 & repeatedly 
> fails, this continues indefinitely
>  * therefore DN4 is the only live datanode with the block & the minimum 
> replication factor of 2 cannot be satisfied
>  * because the minimum replication factor cannot be satisfied for the 

[jira] [Commented] (HDFS-16064) HDFS-721 causes DataNode decommissioning to get stuck indefinitely

2022-06-06 Thread Kevin Wikant (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17550619#comment-17550619
 ] 

Kevin Wikant commented on HDFS-16064:
-

Thanks [~it_singer], you are correct that my initial root cause was very much 
incorrect.

In the past few months I have seen this issue re-occur multiple times, so I did a 
deeper dive & identified the bug described here: 
[https://github.com/apache/hadoop/pull/4410]

I think the issue described in this ticket is occurring because the corrupt 
replica on DN3 will not be invalidated until DN3 either:
 * restarts & sends a block report
 * sends its next periodic block report (default interval is 6 hours)

So in the worst case, decommissioning in the aforementioned scenario takes up to 
6 hours to complete, because DN3 may take up to 6 hours to send its next block 
report & have the corrupt replica invalidated. I have not targeted fixing this 
decommissioning blocker because it is arguably expected behavior & resolves 
within at most "dfs.blockreport.intervalMsec". Instead, the fix 
([https://github.com/apache/hadoop/pull/4410]) targets a more severe bug where 
decommissioning gets blocked indefinitely.

> HDFS-721 causes DataNode decommissioning to get stuck indefinitely
> --
>
> Key: HDFS-16064
> URL: https://issues.apache.org/jira/browse/HDFS-16064
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode, namenode
>Affects Versions: 3.2.1
>Reporter: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a 
> non-issue under the assumption that if the namenode & a datanode get into an 
> inconsistent state for a given block pipeline, there should be another 
> datanode available to replicate the block to
> While testing datanode decommissioning using "dfs.exclude.hosts", I have 
> encountered a scenario where the decommissioning gets stuck indefinitely
> Below is the progression of events:
>  * there are initially 4 datanodes DN1, DN2, DN3, DN4
>  * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts"
>  * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in 
> order to satisfy their minimum replication factor of 2
>  * during this replication process 
> https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes 
> the following inconsistent state:
>  ** DN3 thinks it has the block pipeline in FINALIZED state
>  ** the namenode does not think DN3 has the block pipeline
> {code:java}
> 2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
> (DataXceiver for client  at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): 
> DN3:9866:DataXceiver error processing WRITE_BLOCK operation  src: /DN2:45654 
> dst: /DN3:9866; 
> org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block 
> BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
> {code}
>  * the replication is attempted again, but:
>  ** DN4 has the block
>  ** DN1 and/or DN2 have the block, but don't count towards the minimum 
> replication factor because they are being decommissioned
>  ** DN3 does not have the block & cannot have the block replicated to it 
> because of HDFS-721
>  * the namenode repeatedly tries to replicate the block to DN3 & repeatedly 
> fails, this continues indefinitely
>  * therefore DN4 is the only live datanode with the block & the minimum 
> replication factor of 2 cannot be satisfied
>  * because the minimum replication factor cannot be satisfied for the 
> block(s) being moved off DN1 & DN2, the datanode decommissioning can never be 
> completed 
> {code:java}
> 2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0): 
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, 
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance 
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is 
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , 
> Current Datanode: DN1:9866, Is current datanode decommissioning: true, Is 
> current datanode entering maintenance: false
> ...
> 2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0): 
> Block: blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, 
> decommissioned replicas: 0, decommissioning replicas: 2, maintenance 
> replicas: 0, live entering maintenance replicas: 0, excess replicas: 0, Is 
> Open File: false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , 
> Current Datanode: DN2:9866, Is current datanode decommissioning: true, Is 
> current datanode entering 

[jira] [Created] (HDFS-16443) Fix edge case where DatanodeAdminDefaultMonitor doubly enqueues a DatanodeDescriptor on exception

2022-01-28 Thread Kevin Wikant (Jira)
Kevin Wikant created HDFS-16443:
---

 Summary: Fix edge case where DatanodeAdminDefaultMonitor doubly 
enqueues a DatanodeDescriptor on exception
 Key: HDFS-16443
 URL: https://issues.apache.org/jira/browse/HDFS-16443
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: hdfs
Reporter: Kevin Wikant


As part of the fix merged in: https://issues.apache.org/jira/browse/HDFS-16303

There was a rare edge case noticed in DatanodeAdminDefaultMonitor which causes 
a DatanodeDescriptor to be added twice to the pendingNodes queue:
 * a datanode is unhealthy, so it gets added to "unhealthyDns": 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminDefaultMonitor.java#L227
 * an exception is thrown which causes this catch block to execute: 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminDefaultMonitor.java#L271
 * the datanode is added to "pendingNodes": 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminDefaultMonitor.java#L276
 * under certain conditions the datanode can be added again from 
"unhealthyDns" to "pendingNodes" here: 
https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/DatanodeAdminDefaultMonitor.java#L296

This Jira is to track the one-line fix for this bug; a simplified sketch of the 
double-enqueue is shown below.
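
The following self-contained sketch uses hypothetical local collections in place 
of the real DatanodeAdminDefaultMonitor fields; the contains() check is just one 
possible guard, not necessarily the actual one-line fix:
{code:java}
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Sketch only: a node enqueued in the catch block can be enqueued a second
// time when the unhealthy set is drained later in the same scan.
public class DoubleEnqueueSketch {
  public static void main(String[] args) {
    Queue<String> pendingNodes = new ArrayDeque<>();
    Set<String> unhealthyDns = new HashSet<>();

    String dn = "DN3";
    unhealthyDns.add(dn);            // datanode is flagged unhealthy

    boolean scanThrewException = true;
    if (scanThrewException) {
      pendingNodes.add(dn);          // first enqueue (catch block)
    }

    // later the scan requeues unhealthy nodes; without a guard the same
    // descriptor would be added to pendingNodes twice
    for (String unhealthy : unhealthyDns) {
      if (!pendingNodes.contains(unhealthy)) {
        pendingNodes.add(unhealthy);
      }
    }
    System.out.println("pendingNodes = " + pendingNodes); // [DN3]
  }
}
{code}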



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-16442) TestBlockTokenWithShortCircuitRead.testShortCircuitReadWithInvalidToken fails

2022-01-27 Thread Kevin Wikant (Jira)
Kevin Wikant created HDFS-16442:
---

 Summary: 
TestBlockTokenWithShortCircuitRead.testShortCircuitReadWithInvalidToken fails
 Key: HDFS-16442
 URL: https://issues.apache.org/jira/browse/HDFS-16442
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: hdfs
Reporter: Kevin Wikant


[https://ci-hadoop.apache.org/blue/organizations/jenkins/hadoop-multibranch/detail/PR-3920/2/pipeline]

 

[https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3920/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt]

 

{code:java}
[ERROR] Failures: 
[ERROR]   TestBlockTokenWithShortCircuitRead.testShortCircuitReadWithInvalidToken:153->checkSlotsAfterSSRWithTokenExpiration:178->checkShmAndSlots:184 expected:<1> but was:<2>
[ERROR]   TestDirectoryScanner.testThrottling:727 Throttle is too permissive
[INFO] 
[ERROR] Tests run: 6208, Failures: 2, Errors: 0, Skipped: 22
{code}



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning

2022-01-24 Thread Kevin Wikant (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17481354#comment-17481354
 ] 

Kevin Wikant commented on HDFS-16303:
-

My apologies for the delay; something high-priority came up at work.
 * backport to Hadoop 2.x: https://github.com/apache/hadoop/pull/3920
 * backport to Hadoop 3.x: https://github.com/apache/hadoop/pull/3921

I have also made one small change to the DatanodeAdminDefaultMonitor based on a 
rare edge case I identified: https://github.com/apache/hadoop/pull/3923

> Losing over 100 datanodes in state decommissioning results in full blockage 
> of all datanode decommissioning
> ---
>
> Key: HDFS-16303
> URL: https://issues.apache.org/jira/browse/HDFS-16303
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Kevin Wikant
>Assignee: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 12h 40m
>  Remaining Estimate: 0h
>
> h2. Impact
> HDFS datanode decommissioning does not make any forward progress. For 
> example, the user adds X datanodes to the "dfs.hosts.exclude" file and all X 
> of those datanodes remain in state decommissioning forever without making any 
> forward progress towards being decommissioned.
> h2. Root Cause
> The HDFS Namenode class "DatanodeAdminManager" is responsible for 
> decommissioning datanodes.
> As per this "hdfs-site" configuration:
> {quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes 
>  Default Value = 100
> The maximum number of decommission-in-progress datanodes nodes that will be 
> tracked at one time by the namenode. Tracking a decommission-in-progress 
> datanode consumes additional NN memory proportional to the number of blocks 
> on the datnode. Having a conservative limit reduces the potential impact of 
> decomissioning a large number of nodes at once. A value of 0 means no limit 
> will be enforced.
> {quote}
> The Namenode will only actively track up to 100 datanodes for decommissioning 
> at any given time, as to avoid Namenode memory pressure.
> Looking into the "DatanodeAdminManager" code:
>  * a new datanode is only removed from the "tracked.nodes" set when it 
> finishes decommissioning
>  * a new datanode is only added to the "tracked.nodes" set if there is fewer 
> than 100 datanodes being tracked
> So in the event that there are more than 100 datanodes being decommissioned 
> at a given time, some of those datanodes will not be in the "tracked.nodes" 
> set until 1 or more datanodes in the "tracked.nodes" finishes 
> decommissioning. This is generally not a problem because the datanodes in 
> "tracked.nodes" will eventually finish decommissioning, but there is an edge 
> case where this logic prevents the namenode from making any forward progress 
> towards decommissioning.
> If all 100 datanodes in the "tracked.nodes" are unable to finish 
> decommissioning, then other datanodes (which may be able to be 
> decommissioned) will never get added to "tracked.nodes" and therefore will 
> never get the opportunity to be decommissioned.
> This can occur due the following issue:
> {quote}2021-10-21 12:39:24,048 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager 
> (DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In 
> Progress. Cannot be safely decommissioned or be in maintenance since there is 
> risk of reduced data durability or data loss. Either restart the failed node 
> or force decommissioning or maintenance by removing, calling refreshNodes, 
> then re-adding to the excludes or host config files.
> {quote}
> If a Datanode is lost while decommissioning (for example if the underlying 
> hardware fails or is lost), then it will remain in state decommissioning 
> forever.
> If 100 or more Datanodes are lost while decommissioning over the Hadoop 
> cluster lifetime, then this is enough to completely fill up the 
> "tracked.nodes" set. With the entire "tracked.nodes" set filled with 
> datanodes that can never finish decommissioning, any datanodes added after 
> this point will never be able to be decommissioned because they will never be 
> added to the "tracked.nodes" set.
> In this scenario:
>  * the "tracked.nodes" set is filled with datanodes which are lost & cannot 
> be recovered (and can never finish decommissioning so they will never be 
> removed from the set)
>  * the actual live datanodes being decommissioned are enqueued waiting to 
> enter the "tracked.nodes" set (and are stuck waiting indefinitely)
> This means that no progress towards decommissioning the live datanodes will 
> be made unless the user takes the following action:
> {quote}Either restart the 

[jira] [Reopened] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning

2021-12-23 Thread Kevin Wikant (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wikant reopened HDFS-16303:
-

Re-opening the Jira to track the work for cherry-picking the change.

> Losing over 100 datanodes in state decommissioning results in full blockage 
> of all datanode decommissioning
> ---
>
> Key: HDFS-16303
> URL: https://issues.apache.org/jira/browse/HDFS-16303
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Kevin Wikant
>Assignee: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>  Time Spent: 12h 10m
>  Remaining Estimate: 0h
>
> h2. Impact
> HDFS datanode decommissioning does not make any forward progress. For 
> example, the user adds X datanodes to the "dfs.hosts.exclude" file and all X 
> of those datanodes remain in state decommissioning forever without making any 
> forward progress towards being decommissioned.
> h2. Root Cause
> The HDFS Namenode class "DatanodeAdminManager" is responsible for 
> decommissioning datanodes.
> As per this "hdfs-site" configuration:
> {quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes 
>  Default Value = 100
> The maximum number of decommission-in-progress datanodes nodes that will be 
> tracked at one time by the namenode. Tracking a decommission-in-progress 
> datanode consumes additional NN memory proportional to the number of blocks 
> on the datnode. Having a conservative limit reduces the potential impact of 
> decomissioning a large number of nodes at once. A value of 0 means no limit 
> will be enforced.
> {quote}
> The Namenode will only actively track up to 100 datanodes for decommissioning 
> at any given time, as to avoid Namenode memory pressure.
> Looking into the "DatanodeAdminManager" code:
>  * a new datanode is only removed from the "tracked.nodes" set when it 
> finishes decommissioning
>  * a new datanode is only added to the "tracked.nodes" set if there is fewer 
> than 100 datanodes being tracked
> So in the event that there are more than 100 datanodes being decommissioned 
> at a given time, some of those datanodes will not be in the "tracked.nodes" 
> set until 1 or more datanodes in the "tracked.nodes" finishes 
> decommissioning. This is generally not a problem because the datanodes in 
> "tracked.nodes" will eventually finish decommissioning, but there is an edge 
> case where this logic prevents the namenode from making any forward progress 
> towards decommissioning.
> If all 100 datanodes in the "tracked.nodes" are unable to finish 
> decommissioning, then other datanodes (which may be able to be 
> decommissioned) will never get added to "tracked.nodes" and therefore will 
> never get the opportunity to be decommissioned.
> This can occur due the following issue:
> {quote}2021-10-21 12:39:24,048 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager 
> (DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In 
> Progress. Cannot be safely decommissioned or be in maintenance since there is 
> risk of reduced data durability or data loss. Either restart the failed node 
> or force decommissioning or maintenance by removing, calling refreshNodes, 
> then re-adding to the excludes or host config files.
> {quote}
> If a Datanode is lost while decommissioning (for example if the underlying 
> hardware fails or is lost), then it will remain in state decommissioning 
> forever.
> If 100 or more Datanodes are lost while decommissioning over the Hadoop 
> cluster lifetime, then this is enough to completely fill up the 
> "tracked.nodes" set. With the entire "tracked.nodes" set filled with 
> datanodes that can never finish decommissioning, any datanodes added after 
> this point will never be able to be decommissioned because they will never be 
> added to the "tracked.nodes" set.
> In this scenario:
>  * the "tracked.nodes" set is filled with datanodes which are lost & cannot 
> be recovered (and can never finish decommissioning so they will never be 
> removed from the set)
>  * the actual live datanodes being decommissioned are enqueued waiting to 
> enter the "tracked.nodes" set (and are stuck waiting indefinitely)
> This means that no progress towards decommissioning the live datanodes will 
> be made unless the user takes the following action:
> {quote}Either restart the failed node or force decommissioning or maintenance 
> by removing, calling refreshNodes, then re-adding to the excludes or host 
> config files.
> {quote}
> Ideally, the Namenode should be able to gracefully handle scenarios where the 
> datanodes in the "tracked.nodes" set are not making forward progress towards 
> decommissioning 

[jira] [Resolved] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning

2021-12-23 Thread Kevin Wikant (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wikant resolved HDFS-16303.
-
Resolution: Fixed

Resolved by this commit: 
https://github.com/apache/hadoop/commit/d20b598f97e76c67d6103a950ea9e89644be2c41

> Losing over 100 datanodes in state decommissioning results in full blockage 
> of all datanode decommissioning
> ---
>
> Key: HDFS-16303
> URL: https://issues.apache.org/jira/browse/HDFS-16303
> Project: Hadoop HDFS
>  Issue Type: Bug
>Affects Versions: 2.10.1, 3.3.1
>Reporter: Kevin Wikant
>Assignee: Kevin Wikant
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 12h 10m
>  Remaining Estimate: 0h
>
> h2. Impact
> HDFS datanode decommissioning does not make any forward progress. For 
> example, the user adds X datanodes to the "dfs.hosts.exclude" file and all X 
> of those datanodes remain in state decommissioning forever without making any 
> forward progress towards being decommissioned.
> h2. Root Cause
> The HDFS Namenode class "DatanodeAdminManager" is responsible for 
> decommissioning datanodes.
> As per this "hdfs-site" configuration:
> {quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes 
>  Default Value = 100
> The maximum number of decommission-in-progress datanodes nodes that will be 
> tracked at one time by the namenode. Tracking a decommission-in-progress 
> datanode consumes additional NN memory proportional to the number of blocks 
> on the datnode. Having a conservative limit reduces the potential impact of 
> decomissioning a large number of nodes at once. A value of 0 means no limit 
> will be enforced.
> {quote}
> The Namenode will only actively track up to 100 datanodes for decommissioning 
> at any given time, as to avoid Namenode memory pressure.
> Looking into the "DatanodeAdminManager" code:
>  * a new datanode is only removed from the "tracked.nodes" set when it 
> finishes decommissioning
>  * a new datanode is only added to the "tracked.nodes" set if there is fewer 
> than 100 datanodes being tracked
> So in the event that there are more than 100 datanodes being decommissioned 
> at a given time, some of those datanodes will not be in the "tracked.nodes" 
> set until 1 or more datanodes in the "tracked.nodes" finishes 
> decommissioning. This is generally not a problem because the datanodes in 
> "tracked.nodes" will eventually finish decommissioning, but there is an edge 
> case where this logic prevents the namenode from making any forward progress 
> towards decommissioning.
> If all 100 datanodes in the "tracked.nodes" are unable to finish 
> decommissioning, then other datanodes (which may be able to be 
> decommissioned) will never get added to "tracked.nodes" and therefore will 
> never get the opportunity to be decommissioned.
> This can occur due the following issue:
> {quote}2021-10-21 12:39:24,048 WARN 
> org.apache.hadoop.hdfs.server.blockmanagement.BlockManager 
> (DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In 
> Progress. Cannot be safely decommissioned or be in maintenance since there is 
> risk of reduced data durability or data loss. Either restart the failed node 
> or force decommissioning or maintenance by removing, calling refreshNodes, 
> then re-adding to the excludes or host config files.
> {quote}
> If a Datanode is lost while decommissioning (for example if the underlying 
> hardware fails or is lost), then it will remain in state decommissioning 
> forever.
> If 100 or more Datanodes are lost while decommissioning over the Hadoop 
> cluster lifetime, then this is enough to completely fill up the 
> "tracked.nodes" set. With the entire "tracked.nodes" set filled with 
> datanodes that can never finish decommissioning, any datanodes added after 
> this point will never be able to be decommissioned because they will never be 
> added to the "tracked.nodes" set.
> In this scenario:
>  * the "tracked.nodes" set is filled with datanodes which are lost & cannot 
> be recovered (and can never finish decommissioning so they will never be 
> removed from the set)
>  * the actual live datanodes being decommissioned are enqueued waiting to 
> enter the "tracked.nodes" set (and are stuck waiting indefinitely)
> This means that no progress towards decommissioning the live datanodes will 
> be made unless the user takes the following action:
> {quote}Either restart the failed node or force decommissioning or maintenance 
> by removing, calling refreshNodes, then re-adding to the excludes or host 
> config files.
> {quote}
> Ideally, the Namenode should be able to gracefully handle scenarios where the 
> datanodes in the "tracked.nodes" set are not making forward 

[jira] [Created] (HDFS-16336) TestRollingUpgrade.testRollback fails

2021-11-18 Thread Kevin Wikant (Jira)
Kevin Wikant created HDFS-16336:
---

 Summary: TestRollingUpgrade.testRollback fails
 Key: HDFS-16336
 URL: https://issues.apache.org/jira/browse/HDFS-16336
 Project: Hadoop HDFS
  Issue Type: Sub-task
  Components: hdfs
Affects Versions: 3.4.0
Reporter: Kevin Wikant


This pull request ([https://github.com/apache/hadoop/pull/3675]) failed its 
Jenkins pre-commit job due to an unrelated unit test failure: 
[https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-3675/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt]
{code:java}
[ERROR] Failures: 
[ERROR] 
org.apache.hadoop.hdfs.TestRollingUpgrade.testRollback(org.apache.hadoop.hdfs.TestRollingUpgrade)
[ERROR]   Run 1: TestRollingUpgrade.testRollback:328->checkMxBeanIsNull:299 
expected null, but 
was:
[ERROR]   Run 2: TestRollingUpgrade.testRollback:328->checkMxBeanIsNull:299 
expected null, but 
was:
[ERROR]   Run 3: TestRollingUpgrade.testRollback:328->checkMxBeanIsNull:299 
expected null, but 
was: {code}
It seems that "TestRollingUpgrade.testRollback" is a flaky unit test.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning

2021-11-05 Thread Kevin Wikant (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wikant updated HDFS-16303:

Description: 
h2. Impact

HDFS datanode decommissioning does not make any forward progress. For example, 
the user adds X datanodes to the "dfs.hosts.exclude" file and all X of those 
datanodes remain in state decommissioning forever without making any forward 
progress towards being decommissioned.
h2. Root Cause

The HDFS Namenode class "DatanodeAdminManager" is responsible for 
decommissioning datanodes.

As per this "hdfs-site" configuration:
{quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes 
 Default Value = 100

The maximum number of decommission-in-progress datanodes nodes that will be 
tracked at one time by the namenode. Tracking a decommission-in-progress 
datanode consumes additional NN memory proportional to the number of blocks on 
the datnode. Having a conservative limit reduces the potential impact of 
decomissioning a large number of nodes at once. A value of 0 means no limit 
will be enforced.
{quote}
The Namenode will only actively track up to 100 datanodes for decommissioning 
at any given time, as to avoid Namenode memory pressure.

Looking into the "DatanodeAdminManager" code:
 * a new datanode is only removed from the "tracked.nodes" set when it finishes 
decommissioning
 * a new datanode is only added to the "tracked.nodes" set if there are fewer 
than 100 datanodes being tracked

So in the event that there are more than 100 datanodes being decommissioned at 
a given time, some of those datanodes will not be in the "tracked.nodes" set 
until 1 or more datanodes in the "tracked.nodes" finishes decommissioning. This 
is generally not a problem because the datanodes in "tracked.nodes" will 
eventually finish decommissioning, but there is an edge case where this logic 
prevents the namenode from making any forward progress towards decommissioning.

If all 100 datanodes in the "tracked.nodes" are unable to finish 
decommissioning, then other datanodes (which may be able to be decommissioned) 
will never get added to "tracked.nodes" and therefore will never get the 
opportunity to be decommissioned.
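
The sketch below illustrates that starvation, using hypothetical local 
collections rather than the real DatanodeAdminManager fields, to show why a 
tracked set full of dead datanodes blocks every queued live datanode:
{code:java}
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;

// Sketch only: nodes enter the tracked set only while it is below the limit,
// and leave it only when they finish decommissioning. If the limit is reached
// by nodes that can never finish, the pending queue is never drained.
public class TrackedNodesSketch {
  static final int MAX_TRACKED = 100; // dfs.namenode.decommission.max.concurrent.tracked.nodes

  public static void main(String[] args) {
    Set<String> trackedNodes = new HashSet<>();
    Queue<String> pendingNodes = new ArrayDeque<>();

    for (int i = 0; i < MAX_TRACKED; i++) {
      trackedNodes.add("deadDN" + i);   // lost while decommissioning, never finish
    }
    pendingNodes.add("liveDN1");        // could decommission, but must wait

    // each monitor tick only promotes pending nodes while there is room
    while (!pendingNodes.isEmpty() && trackedNodes.size() < MAX_TRACKED) {
      trackedNodes.add(pendingNodes.poll());
    }
    System.out.println("still pending: " + pendingNodes); // [liveDN1], forever
  }
}
{code}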

This can occur due to the following issue:
{quote}2021-10-21 12:39:24,048 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager 
(DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In 
Progress. Cannot be safely decommissioned or be in maintenance since there is 
risk of reduced data durability or data loss. Either restart the failed node or 
force decommissioning or maintenance by removing, calling refreshNodes, then 
re-adding to the excludes or host config files.
{quote}
If a Datanode is lost while decommissioning (for example if the underlying 
hardware fails or is lost), then it will remain in state decommissioning 
forever.

If 100 or more Datanodes are lost while decommissioning over the Hadoop cluster 
lifetime, then this is enough to completely fill up the "tracked.nodes" set. 
With the entire "tracked.nodes" set filled with datanodes that can never finish 
decommissioning, any datanodes added after this point will never be able to be 
decommissioned because they will never be added to the "tracked.nodes" set.

In this scenario:
 * the "tracked.nodes" set is filled with datanodes which are lost & cannot be 
recovered (and can never finish decommissioning so they will never be removed 
from the set)
 * the actual live datanodes being decommissioned are enqueued waiting to enter 
the "tracked.nodes" set (and are stuck waiting indefinitely)

This means that no progress towards decommissioning the live datanodes will be 
made unless the user takes the following action:
{quote}Either restart the failed node or force decommissioning or maintenance 
by removing, calling refreshNodes, then re-adding to the excludes or host 
config files.
{quote}
Ideally, the Namenode should be able to gracefully handle scenarios where the 
datanodes in the "tracked.nodes" set are not making forward progress towards 
decommissioning while the enqueued datanodes may be able to make forward 
progress.
h2. Reproduce Steps
 * create a Hadoop cluster
 * lose (i.e. terminate the host/process forever) over 100 datanodes while the 
datanodes are in state decommissioning
 * add additional datanodes to the cluster
 * attempt to decommission those new datanodes & observe that they are stuck in 
state decommissioning forever

Note that in this example each datanode, over the full history of the cluster, 
has a unique IP address

  was:
h2. Impact

HDFS datanode decommissioning does not make any forward progress. For example, 
the user adds X datanodes to the "dfs.hosts.exclude" file and all of those 
datanodes remain in state decommissioning forever without making any forward 
progress towards decommissioning.
h2. Root Cause

The HDFS Namenode class "DatanodeAdminManager" is 

[jira] [Updated] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning

2021-11-05 Thread Kevin Wikant (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wikant updated HDFS-16303:

Description: 
h2. Impact

HDFS datanode decommissioning does not make any forward progress. For example, 
the user adds X datanodes to the "dfs.hosts.exclude" file and all of those 
datanodes remain in state decommissioning forever without making any forward 
progress towards decommissioning.
h2. Root Cause

The HDFS Namenode class "DatanodeAdminManager" is responsible for 
decommissioning datanodes.

As per this "hdfs-site" configuration:
{quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes 
 Default Value = 100

The maximum number of decommission-in-progress datanodes nodes that will be 
tracked at one time by the namenode. Tracking a decommission-in-progress 
datanode consumes additional NN memory proportional to the number of blocks on 
the datnode. Having a conservative limit reduces the potential impact of 
decomissioning a large number of nodes at once. A value of 0 means no limit 
will be enforced.
{quote}
The Namenode will only actively track up to 100 datanodes for decommissioning 
at any given time, as to avoid Namenode memory pressure.

Looking into the "DatanodeAdminManager" code:
 * a new datanode is only removed from the "tracked.nodes" set when it finishes 
decommissioning
 * a new datanode is only added to the "tracked.nodes" set if there are fewer 
than 100 datanodes being tracked

So in the event that there are more than 100 datanodes being decommissioned at 
a given time, some of those datanodes will not be in the "tracked.nodes" set 
until 1 or more datanodes in the "tracked.nodes" finishes decommissioning. This 
is generally not a problem because the datanodes in "tracked.nodes" will 
eventually finish decommissioning, but there is an edge case where this logic 
prevents the namenode from making any forward progress towards decommissioning.

If all 100 datanodes in the "tracked.nodes" are unable to finish 
decommissioning, then other datanodes (which may be able to be decommissioned) 
will never get added to "tracked.nodes" and therefore will never get the 
opportunity to be decommissioned.

This can occur due to the following issue:
{quote}2021-10-21 12:39:24,048 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager 
(DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In 
Progress. Cannot be safely decommissioned or be in maintenance since there is 
risk of reduced data durability or data loss. Either restart the failed node or 
force decommissioning or maintenance by removing, calling refreshNodes, then 
re-adding to the excludes or host config files.
{quote}
If a Datanode is lost while decommissioning (for example if the underlying 
hardware fails or is lost), then it will remain in state decommissioning 
forever.

If 100 or more Datanodes are lost while decommissioning over the Hadoop cluster 
lifetime, then this is enough to completely fill up the "tracked.nodes" set. 
With the entire "tracked.nodes" set filled with datanodes that can never finish 
decommissioning, any datanodes added after this point will never be able to be 
decommissioned because they will never be added to the "tracked.nodes" set.

In this scenario:
 * the "tracked.nodes" set is filled with datanodes which are lost & cannot be 
recovered (and can never finish decommissioning so they will never be removed 
from the set)
 * the actual live datanodes being decommissioned are enqueued waiting to enter 
the "tracked.nodes" set (and are stuck waiting indefinitely)

This means that no progress towards decommissioning the live datanodes will be 
made unless the user takes the following action:
{quote}Either restart the failed node or force decommissioning or maintenance 
by removing, calling refreshNodes, then re-adding to the excludes or host 
config files.
{quote}
Ideally, the Namenode should be able to gracefully handle scenarios where the 
datanodes in the "tracked.nodes" set are not making forward progress towards 
decommissioning while the enqueued datanodes may be able to make forward 
progress.
h2. Reproduce Steps
 * create a Hadoop cluster
 * lose (i.e. terminate the host/process forever) over 100 datanodes while the 
datanodes are in state decommissioning
 * add additional datanodes to the cluster
 * attempt to decommission those new datanodes & observe that they are stuck in 
state decommissioning forever

Note that in this example each datanode, over the full history of the cluster, 
has a unique IP address

  was:
h2. Problem Description

The HDFS Namenode class "DatanodeAdminManager" is responsible for 
decommissioning datanodes.

As per this "hdfs-site" configuration:
{quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes 
 Default Value = 100

The maximum number of decommission-in-progress datanodes nodes that will be 
tracked at one time by 

[jira] [Updated] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning

2021-11-05 Thread Kevin Wikant (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16303?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wikant updated HDFS-16303:

Description: 
h2. Problem Description

The HDFS Namenode class "DatanodeAdminManager" is responsible for 
decommissioning datanodes.

As per this "hdfs-site" configuration:
{quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes 
 Default Value = 100

The maximum number of decommission-in-progress datanodes nodes that will be 
tracked at one time by the namenode. Tracking a decommission-in-progress 
datanode consumes additional NN memory proportional to the number of blocks on 
the datnode. Having a conservative limit reduces the potential impact of 
decomissioning a large number of nodes at once. A value of 0 means no limit 
will be enforced.
{quote}
The Namenode will only actively track up to 100 datanodes for decommissioning 
at any given time, as to avoid Namenode memory pressure.

Looking into the "DatanodeAdminManager" code:
 * a new datanode is only removed from the "tracked.nodes" set when it finishes 
decommissioning
 * a new datanode is only added to the "tracked.nodes" set if there are fewer 
than 100 datanodes being tracked

So in the event that there are more than 100 datanodes being decommissioned at 
a given time, some of those datanodes will not be in the "tracked.nodes" set 
until 1 or more datanodes in the "tracked.nodes" finishes decommissioning. This 
is generally not a problem because the datanodes in "tracked.nodes" will 
eventually finish decommissioning, but there is an edge case where this logic 
prevents the namenode from making any forward progress towards decommissioning.

If all 100 datanodes in the "tracked.nodes" are unable to finish 
decommissioning, then other datanodes (which may be able to be decommissioned) 
will never get added to "tracked.nodes" and therefore will never get the 
opportunity to be decommissioned.

This can occur due to the following issue:
{quote}2021-10-21 12:39:24,048 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager 
(DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In 
Progress. Cannot be safely decommissioned or be in maintenance since there is 
risk of reduced data durability or data loss. Either restart the failed node or 
force decommissioning or maintenance by removing, calling refreshNodes, then 
re-adding to the excludes or host config files.
{quote}
If a Datanode is lost while decommissioning (for example if the underlying 
hardware fails or is lost), then it will remain in state decommissioning 
forever.

If 100 or more Datanodes are lost while decommissioning over the Hadoop cluster 
lifetime, then this is enough to completely fill up the "tracked.nodes" set. 
With the entire "tracked.nodes" set filled with datanodes that can never finish 
decommissioning, any datanodes added after this point will never be able to be 
decommissioned because they will never be added to the "tracked.nodes" set.

In this scenario:
 * the "tracked.nodes" set is filled with datanodes which are lost & cannot be 
recovered (and can never finish decommissioning so they will never be removed 
from the set)
 * the actual live datanodes being decommissioned are enqueued waiting to enter 
the "tracked.nodes" set (and are stuck waiting indefinitely)

This means that no progress towards decommissioning the live datanodes will be 
made unless the user takes the following action:
{quote}Either restart the failed node or force decommissioning or maintenance 
by removing, calling refreshNodes, then re-adding to the excludes or host 
config files.
{quote}
Ideally, the Namenode should be able to gracefully handle scenarios where the 
datanodes in the "tracked.nodes" set are not making forward progress towards 
decommissioning while the enqueued datanodes may be able to make forward 
progress.
h2. Reproduce Steps
 * create a Hadoop cluster
 * lose (i.e. terminate the host/process forever) over 100 datanodes while the 
datanodes are in state decommissioning
 * add additional datanodes to the cluster
 * attempt to decommission those new datanodes & observe that they are stuck in 
state decommissioning forever

Note that in this example each datanode, over the full history of the cluster, 
has a unique IP address

  was:
## Problem Description

The HDFS Namenode class "DatanodeAdminManager" is responsible for 
decommissioning datanodes.

As per this "hdfs-site" configuration:
{quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes 
Default Value = 100

The maximum number of decommission-in-progress datanodes nodes that will be 
tracked at one time by the namenode. Tracking a decommission-in-progress 
datanode consumes additional NN memory proportional to the number of blocks on 
the datnode. Having a conservative limit reduces the potential impact of 
decomissioning a large number of nodes at once. A value of 0 means 

[jira] [Created] (HDFS-16303) Losing over 100 datanodes in state decommissioning results in full blockage of all datanode decommissioning

2021-11-05 Thread Kevin Wikant (Jira)
Kevin Wikant created HDFS-16303:
---

 Summary: Losing over 100 datanodes in state decommissioning 
results in full blockage of all datanode decommissioning
 Key: HDFS-16303
 URL: https://issues.apache.org/jira/browse/HDFS-16303
 Project: Hadoop HDFS
  Issue Type: Bug
Affects Versions: 3.3.1, 2.10.1
Reporter: Kevin Wikant


## Problem Description

The HDFS Namenode class "DatanodeAdminManager" is responsible for 
decommissioning datanodes.

As per this "hdfs-site" configuration:
{quote}Config = dfs.namenode.decommission.max.concurrent.tracked.nodes 
Default Value = 100

The maximum number of decommission-in-progress datanodes nodes that will be 
tracked at one time by the namenode. Tracking a decommission-in-progress 
datanode consumes additional NN memory proportional to the number of blocks on 
the datnode. Having a conservative limit reduces the potential impact of 
decomissioning a large number of nodes at once. A value of 0 means no limit 
will be enforced.
{quote}
The Namenode will only actively track up to 100 datanodes for decommissioning 
at any given time, as to avoid Namenode memory pressure.

Looking into the "DatanodeAdminManager" code:
 * a new datanode is only removed from the "tracked.nodes" set when it finishes 
decommissioning
 * a new datanode is only added to the "tracked.nodes" set if there are fewer 
than 100 datanodes being tracked

So in the event that there are more than 100 datanodes being decommissioned at 
a given time, some of those datanodes will not be in the "tracked.nodes" set 
until 1 or more datanodes in the "tracked.nodes" finishes decommissioning. This 
is generally not a problem because the datanodes in "tracked.nodes" will 
eventually finish decommissioning, but there is an edge case where this logic 
prevents the namenode from making any forward progress towards decommissioning.

If all 100 datanodes in the "tracked.nodes" are unable to finish 
decommissioning, then other datanodes (which may be able to be decommissioned) 
will never get added to "tracked.nodes" and therefore will never get the 
opportunity to be decommissioned.

This can occur due to the following issue:
{quote}2021-10-21 12:39:24,048 WARN 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager 
(DatanodeAdminMonitor-0): Node W.X.Y.Z:50010 is dead while in Decommission In 
Progress. Cannot be safely decommissioned or be in maintenance since there is 
risk of reduced data durability or data loss. Either restart the failed node or 
force decommissioning or maintenance by removing, calling refreshNodes, then 
re-adding to the excludes or host config files.
{quote}
If a Datanode is lost while decommissioning (for example if the underlying 
hardware fails or is lost), then it will remain in state decommissioning 
forever.

If 100 or more Datanodes are lost while decommissioning over the Hadoop cluster 
lifetime, then this is enough to completely fill up the "tracked.nodes" set. 
With the entire "tracked.nodes" set filled with datanodes that can never finish 
decommissioning, any datanodes added after this point will never be able to be 
decommissioned because they will never be added to the "tracked.nodes" set.

In this scenario:
 * the "tracked.nodes" set is filled with datanodes which are lost & cannot be 
recovered (and can never finish decommissioning so they will never be removed 
from the set)
 * the actual live datanodes being decommissioned are enqueued waiting to enter 
the "tracked.nodes" set (and are stuck waiting indefinitely)

This means that no progress towards decommissioning the live datanodes will be 
made unless the user takes the following action:
{quote}Either restart the failed node or force decommissioning or maintenance 
by removing, calling refreshNodes, then re-adding to the excludes or host 
config files.
{quote}
Ideally, the Namenode should be able to gracefully handle scenarios where the 
datanodes in the "tracked.nodes" set are not making forward progress towards 
decommissioning while the enqueued datanodes may be able to make forward 
progress.

## Reproduction Steps
 * create a Hadoop cluster
 * lose (i.e. terminate the host/process forever) over 100 datanodes while the 
datanodes are in state decommissioning
 * add additional datanodes to the cluster
 * attempt to decommission those new datanodes & observe that they are stuck in 
state decommissioning forever

Note that in this example each datanode, over the full history of the cluster, 
has a unique IP address



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-16064) HDFS-721 causes DataNode decommissioning to get stuck indefinitely

2021-06-10 Thread Kevin Wikant (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wikant updated HDFS-16064:

Description: 
Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a 
non-issue under the assumption that if the namenode & a datanode get into an 
inconsistent state for a given block pipeline, there should be another datanode 
available to replicate the block to

While testing datanode decommissioning using "dfs.exclude.hosts", I have 
encountered a scenario where the decommissioning gets stuck indefinitely

Below is the progression of events:
 * there are initially 4 datanodes DN1, DN2, DN3, DN4
 * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts"
 * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in 
order to satisfy their minimum replication factor of 2
 * during this replication process 
https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes the 
following inconsistent state:
 ** DN3 thinks it has the block pipeline in FINALIZED state
 ** the namenode does not think DN3 has the block pipeline

{code:java}
2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
(DataXceiver for client  at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): 
DN3:9866:DataXceiver error processing WRITE_BLOCK operation  src: /DN2:45654 
dst: /DN3:9866; 
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block 
BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
{code}
 * the replication is attempted again, but:
 ** DN4 has the block
 ** DN1 and/or DN2 have the block, but don't count towards the minimum 
replication factor because they are being decommissioned
 ** DN3 does not have the block & cannot have the block replicated to it 
because of HDFS-721
 * the namenode repeatedly tries to replicate the block to DN3 & repeatedly 
fails, this continues indefinitely
 * therefore DN4 is the only live datanode with the block & the minimum 
replication factor of 2 cannot be satisfied
 * because the minimum replication factor cannot be satisfied for the block(s) 
being moved off DN1 & DN2, the datanode decommissioning can never be completed 

{code:java}
2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: 
blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, 
decommissioned replicas: 0, decommissioning replicas: 2, maintenance replicas: 
0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: 
false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , Current 
Datanode: DN1:9866, Is current datanode decommissioning: true, Is current 
datanode entering maintenance: false
...
2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: 
blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, 
decommissioned replicas: 0, decommissioning replicas: 2, maintenance replicas: 
0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: 
false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , Current 
Datanode: DN2:9866, Is current datanode decommissioning: true, Is current 
datanode entering maintenance: false
{code}
Being stuck in decommissioning state forever is not an intended behavior of 
DataNode decommissioning.
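
To make the arithmetic in the log above explicit, here is a trivial sketch 
(hypothetical names, not the actual BlockManager logic) of why the block never 
satisfies its expected replication while only DN4 is live:
{code:java}
// Sketch only: replicas on decommissioning datanodes (DN1, DN2) do not count
// towards the usable replica count, so with expected replication 2 and a
// single live replica on DN4 the block stays under-replicated and the
// decommission of DN1 & DN2 can never complete.
public class DecommissionCheckSketch {
  public static void main(String[] args) {
    int expectedReplicas = 2;
    int liveReplicas = 1;            // DN4
    int decommissioningReplicas = 2; // DN1, DN2

    boolean sufficientlyReplicated = liveReplicas >= expectedReplicas;
    System.out.println("sufficiently replicated: " + sufficientlyReplicated); // false
    System.out.println("replicas stuck decommissioning: " + decommissioningReplicas);
  }
}
{code}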

A few potential solutions:
 * Address the root cause of the problem which is an inconsistent state between 
namenode & datanode: https://issues.apache.org/jira/browse/HDFS-721
 * Detect when datanode decommissioning is stuck due to lack of available 
datanodes for satisfying the minimum replication factor, then recover by 
re-enabling the datanodes being decommissioned

 

  was:
Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a 
non-issue under the assumption that if the namenode & a datanode get into an 
inconsistent state for a given block pipeline, there should be another datanode 
available to replicate the block to

While testing datanode decommissioning using "dfs.exclude.hosts", I have 
encountered a scenario where the decommissioning gets stuck indefinitely

Below is the progression of events:
 * there are initially 4 datanodes DN1, DN2, DN3, DN4
 * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts"
 * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in 
order to satisfy their minimum replication factor of 2
 * during this replication process 
https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes the 
following inconsistent state:
 ** DN3 thinks it has the block pipeline in FINALIZED state
 ** the namenode does not think DN3 has the block pipeline

{code:java}
2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
(DataXceiver for client  at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): 
DN3:9866:DataXceiver error processing WRITE_BLOCK operation  src: 

[jira] [Updated] (HDFS-16064) HDFS-721 causes DataNode decommissioning to get stuck indefinitely

2021-06-10 Thread Kevin Wikant (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-16064?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kevin Wikant updated HDFS-16064:

Description: 
Seems that https://issues.apache.org/jira/browse/HDFS-721 was resolved as a 
non-issue under the assumption that if the namenode & a datanode get into an 
inconsistent state for a given block pipeline, there should be another datanode 
available to replicate the block to

While testing datanode decommissioning using "dfs.exclude.hosts", I have 
encountered a scenario where the decommissioning gets stuck indefinitely

Below is the progression of events:
 * there are initially 4 datanodes DN1, DN2, DN3, DN4
 * scale-down is started by adding DN1 & DN2 to "dfs.exclude.hosts"
 * HDFS block pipelines on DN1 & DN2 must now be replicated to DN3 & DN4 in 
order to satisfy their minimum replication factor of 2
 * during this replication process 
https://issues.apache.org/jira/browse/HDFS-721 is encountered which causes the 
following inconsistent state:
 ** DN3 thinks it has the block pipeline in FINALIZED state
 ** the namenode does not think DN3 has the block pipeline

{code:java}
2021-06-06 10:38:23,604 INFO org.apache.hadoop.hdfs.server.datanode.DataNode 
(DataXceiver for client  at /DN2:45654 [Receiving block BP-YYY:blk_XXX]): 
DN3:9866:DataXceiver error processing WRITE_BLOCK operation  src: /DN2:45654 
dst: /DN3:9866; 
org.apache.hadoop.hdfs.server.datanode.ReplicaAlreadyExistsException: Block 
BP-YYY:blk_XXX already exists in state FINALIZED and thus cannot be created.
{code}
 * the replication is attempted again, but:
 ** DN4 has the block
 ** DN1 and/or DN2 have the block, but don't count towards the minimum 
replication factor because they are being decommissioned
 ** DN3 does not have the block & cannot have the block replicated to it 
because of HDFS-721
 * the namenode repeatedly tries to replicate the block to DN3 & repeatedly 
fails, this continues indefinitely
 * therefore DN4 is the only live datanode with the block & the minimum 
replication factor of 2 cannot be satisfied
 * because the minimum replication factor cannot be satisfied for the block(s) 
being moved off DN1 & DN2, the datanode decommissioning can never be completed

 
{code:java}
2021-06-06 10:39:10,106 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: 
blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, 
decommissioned replicas: 0, decommissioning replicas: 2, maintenance replicas: 
0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: 
false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , Current 
Datanode: DN1:9866, Is current datanode decommissioning: true, Is current 
datanode entering maintenance: false
...
2021-06-06 10:57:10,105 INFO BlockStateChange (DatanodeAdminMonitor-0): Block: 
blk_XXX, Expected Replicas: 2, live replicas: 1, corrupt replicas: 0, 
decommissioned replicas: 0, decommissioning replicas: 2, maintenance replicas: 
0, live entering maintenance replicas: 0, excess replicas: 0, Is Open File: 
false, Datanodes having this block: DN1:9866 DN2:9866 DN4:9866 , Current 
Datanode: DN2:9866, Is current datanode decommissioning: true, Is current 
datanode entering maintenance: false
{code}
Being stuck in decommissioning state forever is not an intended behavior of 
DataNode decommissioning

A few potential solutions:
 * Address the root cause of the problem which is an inconsistent state between 
namenode & datanode: https://issues.apache.org/jira/browse/HDFS-721
 * Detect when datanode decommissioning is stuck due to lack of available 
datanodes for satisfying the minimum replication factor, then recover by 
re-enabling the datanodes being decommissioned

 

[jira] [Created] (HDFS-16064) HDFS-721 causes DataNode decommissioning to get stuck indefinitely

2021-06-10 Thread Kevin Wikant (Jira)
Kevin Wikant created HDFS-16064:
---

 Summary: HDFS-721 causes DataNode decommissioning to get stuck 
indefinitely
 Key: HDFS-16064
 URL: https://issues.apache.org/jira/browse/HDFS-16064
 Project: Hadoop HDFS
  Issue Type: Bug
  Components: datanode, namenode
Affects Versions: 3.2.1
Reporter: Kevin Wikant

