[ 
https://issues.apache.org/jira/browse/HDFS-17453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834496#comment-17834496
 ] 

ASF GitHub Bot commented on HDFS-17453:
---------------------------------------

hadoop-yetus commented on PR #6708:
URL: https://github.com/apache/hadoop/pull/6708#issuecomment-2040998440

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |:----:|----------:|--------:|:--------:|:-------:|
   | +0 :ok: |  reexec  |   0m 31s |  |  Docker mode activated.  |
   |||| _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  |
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to include 2 new or modified test files.  |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  47m 20s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 22s |  |  trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  compile  |   1m 16s |  |  trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  checkstyle  |   1m 10s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 29s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 15s |  |  trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 42s |  |  trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   3m 21s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  38m 30s |  |  branch has no errors when building and testing our client artifacts.  |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 13s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 15s |  |  the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javac  |   1m 15s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  8s |  |  the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  javac  |   1m  8s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks issues.  |
   | -0 :warning: |  checkstyle  |   0m 59s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6708/3/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs-project/hadoop-hdfs: The patch generated 6 new + 86 unchanged - 0 fixed = 92 total (was 86)  |
   | +1 :green_heart: |  mvnsite  |   1m 20s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 55s |  |  the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 35s |  |  the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06  |
   | +1 :green_heart: |  spotbugs  |   3m 28s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  39m  2s |  |  patch has no errors when building and testing our client artifacts.  |
   |||| _ Other Tests _ |
   | -1 :x: |  unit  | 237m  9s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6708/3/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) |  hadoop-hdfs in the patch failed.  |
   | +1 :green_heart: |  asflicense  |   0m 42s |  |  The patch does not generate ASF License warnings.  |
   |  |   | 387m 32s |  |  |
   
   
   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.server.mover.TestMover |
   |   | hadoop.hdfs.server.blockmanagement.TestPendingDataNodeMessages |
   |   | hadoop.hdfs.server.common.blockaliasmap.impl.TestLevelDbMockAliasMapClient |
   |   | hadoop.hdfs.server.diskbalancer.command.TestDiskBalancerCommand |
   |   | hadoop.hdfs.server.diskbalancer.TestDiskBalancerWithMockMover |
   |   | hadoop.hdfs.server.namenode.ha.TestDNFencing |
   
   
   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6708/3/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/6708 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 43af3362757d 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / d637a41658748bc01a7bf5f821b105449de82959 |
   | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6708/3/testReport/ |
   | Max. process+thread count | 4400 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
   | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6708/3/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




> IncrementalBlockReport can have race condition with Edit Log Tailer
> -------------------------------------------------------------------
>
>                 Key: HDFS-17453
>                 URL: https://issues.apache.org/jira/browse/HDFS-17453
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: auto-failover, ha, hdfs, namenode
>    Affects Versions: 3.3.0, 3.3.1, 2.10.2, 3.3.2, 3.3.5, 3.3.4, 3.3.6
>            Reporter: Danny Becker
>            Assignee: Danny Becker
>            Priority: Major
>              Labels: pull-request-available
>
> h2. Summary
> There is a race condition between IncrementalBlockReports (IBR) and the 
> EditLogTailer in the Standby NameNode (SNN) which can lead to leaked IBRs 
> and blocks being falsely marked corrupt after an HA failover. The race 
> condition occurs when the SNN loads the edit logs before it receives the 
> corresponding block reports from a DataNode (DN).
> h2. Example
> In the following example, there is a block (b1) with 3 generation stamps 
> (gs1, gs2, gs3).
>  # SNN1 loads the edit logs for b1gs1 and b1gs2.
>  # DN1 sends the IBR for b1gs1 to SNN1.
>  # SNN1 determines that the reported block b1gs1 from DN1 is corrupt, so the 
> report is queued for later processing. 
> [BlockManager.java|https://github.com/apache/hadoop/blob/6ed73896f6e8b4b7c720eff64193cb30b3e77fb2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3447C1-L3464C6]
> {code:java}
>     BlockToMarkCorrupt c = checkReplicaCorrupt(
>         block, reportedState, storedBlock, ucState, dn);
>     if (c != null) {
>       if (shouldPostponeBlocksFromFuture) {
>         // If the block is an out-of-date generation stamp or state,
>         // but we're the standby, we shouldn't treat it as corrupt,
>         // but instead just queue it for later processing.
>         // Storing the reported block for later processing, as that is what
>         // comes from the IBR / FBR and hence what we should use to compare
>         // against the memory state.
>         // See HDFS-6289 and HDFS-15422 for more context.
>         queueReportedBlock(storageInfo, block, reportedState,
>             QUEUE_REASON_CORRUPT_STATE);
>       } else {
>         toCorrupt.add(c);
>       }
>       return storedBlock;
>     } {code}
>  # DN1 sends IBRs for b1gs2 and b1gs3 to SNN1.
>  # SNN1 processes b1gs2 and updates the blocks map.
>  # SNN1 queues b1gs3 for later because it determines that b1gs3 is a future 
> genstamp.
>  # SNN1 loads the edit logs for b1gs3 and processes the queued reports for b1.
>  # SNN1 processes b1gs1 first and puts it back in the queue.
>  # SNN1 processes b1gs3 next and updates the blocks map.
>  # Later, SNN1 becomes the Active NameNode (ANN) during an HA Failover.
>  # SNN1 catches up to the latest edit logs, then processes all queued block 
> reports to become the ANN.
>  # ANN1 will process b1gs1 and mark it as corrupt.
> If the example above happens for every DN that stores b1, then b1 will be 
> incorrectly marked as corrupt when the HA failover happens. This will be 
> fixed once the first DN sends a FullBlockReport or an IBR for b1. The sketch 
> below models the requeue-and-replay behavior.
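> Below is a minimal, self-contained sketch of this behavior. It is not HDFS 
> code; the class, fields, and printed messages are hypothetical, and it only 
> models the ordering of the postponed-report queue described above:
> {code:java}
> import java.util.ArrayDeque;
> import java.util.Deque;
> 
> // Hypothetical model of the standby's postponed-report queue. Real HDFS
> // routes this through BlockManager#queueReportedBlock; here each report is
> // reduced to a genstamp so the requeue-and-replay ordering is visible.
> public class StaleReportReplay {
>   static long storedGenStamp = 0;    // genstamp currently in the blocks map
>   static boolean standby = true;     // stands in for shouldPostponeBlocksFromFuture
>   static final Deque<Long> queue = new ArrayDeque<>();
> 
>   // Process one reported genstamp, mirroring the corrupt-check branch.
>   static void process(long reportedGs) {
>     if (reportedGs == storedGenStamp) {
>       System.out.println("applied gs" + reportedGs);
>     } else if (standby) {
>       // The standby postpones any mismatched report, even a stale one.
>       queue.addLast(reportedGs);
>       System.out.println("queued gs" + reportedGs);
>     } else {
>       System.out.println("marked CORRUPT: gs" + reportedGs);
>     }
>   }
> 
>   public static void main(String[] args) {
>     storedGenStamp = 2;  // steps 1-2: edits for gs1 and gs2 already tailed
>     process(1);          // step 3: stale IBR for gs1 is queued
>     process(2);          // step 5: IBR for gs2 matches and is applied
>     process(3);          // step 6: future IBR for gs3 is queued
> 
>     storedGenStamp = 3;  // step 7: edit log for gs3 is tailed
>     int n = queue.size();
>     for (int i = 0; i < n; i++) {
>       process(queue.removeFirst());  // steps 8-9: gs1 re-queued, gs3 applied
>     }
> 
>     standby = false;     // steps 10-11: HA failover, postponing stops
>     while (!queue.isEmpty()) {
>       process(queue.removeFirst());  // step 12: stale gs1 marked corrupt
>     }
>   }
> }
> {code}
> Running this prints "queued gs1" twice and finally "marked CORRUPT: gs1", 
> matching the leaked IBR that only surfaces after failover.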
> h2. Logs from Active Cluster
> I added the following log statements to confirm this issue in an active cluster:
> {code:java}
> BlockToMarkCorrupt c = checkReplicaCorrupt(
>     block, reportedState, storedBlock, ucState, dn);
> if (c != null) {
>   DatanodeStorageInfo storedStorageInfo = storedBlock.findStorageInfo(dn);
>   LOG.info("Found corrupt block {} [{}, {}] from DN {}. Stored block {} from 
> DN {}",
>       block, reportedState.name(), ucState.name(), storageInfo, storedBlock, 
> storedStorageInfo);
>   if (storageInfo.equals(storedStorageInfo) &&
>         storedBlock.getGenerationStamp() > block.getGenerationStamp()) {
>     LOG.info("Stored Block {} from the same DN {} has a newer GenStamp." +
>         storedBlock, storedStorageInfo);
>   }
>   if (shouldPostponeBlocksFromFuture) {
>     // If the block is an out-of-date generation stamp or state,
>     // but we're the standby, we shouldn't treat it as corrupt,
>     // but instead just queue it for later processing.
>     // Storing the reported block for later processing, as that is what
>     // comes from the IBR / FBR and hence what we should use to compare
>     // against the memory state.
>     // See HDFS-6289 and HDFS-15422 for more context.
>     queueReportedBlock(storageInfo, block, reportedState,
>         QUEUE_REASON_CORRUPT_STATE);
>     LOG.info("Queueing the block {} for later processing", block);
>   } else {
>     toCorrupt.add(c);
>     LOG.info("Marking the block {} as corrupt", block);
>   }
>   return storedBlock;
> } {code}
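> 
> As an aside, the condition instrumented above (same storage, stored genstamp 
> strictly newer than the reported one) suggests one possible guard against 
> queueing such stale reports. The following is only an illustrative sketch, 
> not necessarily what the attached PR implements:
> {code:java}
> // Sketch only: before postponing a mismatched report on the standby, drop
> // it when the same storage already holds a strictly newer replica in the
> // blocks map, since the stale report can never become valid again.
> DatanodeStorageInfo storedStorageInfo = storedBlock.findStorageInfo(dn);
> if (storageInfo.equals(storedStorageInfo) &&
>     storedBlock.getGenerationStamp() > block.getGenerationStamp()) {
>   return storedBlock; // discard the stale report instead of queueing it
> }
> {code}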
>  
> Logs from nn1 (Active):
> {code:java}
> 2024-04-03T03:00:52.524-0700,INFO,[IPC Server handler 6 on default port 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700910634, newGS=65700925027, newLength=10485760, newNodes=[[DN1]:10010, [DN2]:10010, [DN3]:10010, client=client1)"
> 2024-04-03T03:00:52.539-0700,INFO,[IPC Server handler 6 on default port 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700910634 => blk_66092666802_65700925027) success"
> 2024-04-03T03:01:07.413-0700,INFO,[IPC Server handler 6 on default port 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700925027, newGS=65700933553, newLength=20971520, newNodes=[[DN1]:10010, [DN2]:10010, [DN3]:10010, client=client1)"
> 2024-04-03T03:01:07.413-0700,INFO,[IPC Server handler 6 on default port 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700925027 => blk_66092666802_65700933553) success" {code}
>  
> Logs from nn2 (Standby):
> {code:java}
> 2024-04-03T03:01:23.067-0700,INFO,[Block report processor],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Found corrupt block blk_66092666802_65700925027 [FINALIZED, COMPLETE] from DN [DISK]DS-1:NORMAL:[DN1]:10010. Stored block blk_66092666802_65700933553 from DN null"
> 2024-04-03T03:01:23.067-0700,INFO,[Block report processor],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Queueing the block blk_66092666802_65700925027 for later processing"
> 2024-04-03T03:01:24.159-0700,INFO,[Block report processor],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Found corrupt block blk_66092666802_65700925027 [FINALIZED, COMPLETE] from DN [DISK]DS-3:NORMAL:[DN3]:10010. Stored block blk_66092666802_65700933553 from DN null"
> 2024-04-03T03:01:24.159-0700,INFO,[Block report processor],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Queueing the block blk_66092666802_65700925027 for later processing"
> 2024-04-03T03:01:24.159-0700,INFO,[Block report processor],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Found corrupt block blk_66092666802_65700925027 [FINALIZED, COMPLETE] from DN [DISK]DS-2:NORMAL:[DN2]:10010. Stored block blk_66092666802_65700933553 from DN null"
> 2024-04-03T03:01:24.159-0700,INFO,[Block report processor],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Queueing the block blk_66092666802_65700925027 for later processing" {code}
>  
> Logs from nn2 when it transitions to Active:
> {code:java}
> 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Found corrupt block blk_66092666802_65700925027 [FINALIZED, COMPLETE] from DN [DISK]DS-1:NORMAL:[DN1]:10010. Stored block blk_66092666802_65700933553 from DN [DISK]DS-1:NORMAL:[DN1]:10010"
> 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Stored Block blk_66092666802_65700933553 from the same DN [DISK]DS-1:NORMAL:[DN1]:10010 has a newer GenStamp."
> 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Marking the block blk_66092666802_65700925027 as corrupt"
> 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Found corrupt block blk_66092666802_65700925027 [FINALIZED, COMPLETE] from DN [DISK]DS-2:NORMAL:[DN2]:10010. Stored block blk_66092666802_65700933553 from DN [DISK]DS-2:NORMAL:[DN2]:10010"
> 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Stored Block blk_66092666802_65700933553 from the same DN [DISK]DS-2:NORMAL:[DN2]:10010 has a newer GenStamp."
> 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Marking the block blk_66092666802_65700925027 as corrupt"
> 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Found corrupt block blk_66092666802_65700925027 [FINALIZED, COMPLETE] from DN [DISK]DS-3:NORMAL:[DN3]:10010. Stored block blk_66092666802_65700933553 from DN [DISK]DS-3:NORMAL:[DN3]:10010"
> 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Stored Block blk_66092666802_65700933553 from the same DN [DISK]DS-3:NORMAL:[DN3]:10010 has a newer GenStamp."
> 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Marking the block blk_66092666802_65700925027 as corrupt"
>  {code}
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
