[ https://issues.apache.org/jira/browse/HDFS-17453?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17834124#comment-17834124 ]
ASF GitHub Bot commented on HDFS-17453: --------------------------------------- hadoop-yetus commented on PR #6708: URL: https://github.com/apache/hadoop/pull/6708#issuecomment-2038639280 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |:----:|----------:|--------:|:--------:|:-------:| | +0 :ok: | reexec | 0m 28s | | Docker mode activated. | |||| _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | |||| _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 45m 30s | | trunk passed | | +1 :green_heart: | compile | 1m 22s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | compile | 1m 15s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | checkstyle | 1m 17s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 21s | | trunk passed | | +1 :green_heart: | javadoc | 1m 7s | | trunk passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 41s | | trunk passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 3m 16s | | trunk passed | | +1 :green_heart: | shadedclient | 36m 52s | | branch has no errors when building and testing our client artifacts. | |||| _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 15s | | the patch passed | | +1 :green_heart: | compile | 1m 17s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javac | 1m 17s | | the patch passed | | +1 :green_heart: | compile | 1m 14s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | javac | 1m 14s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 1m 3s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6708/2/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 86 unchanged - 0 fixed = 87 total (was 86) | | +1 :green_heart: | mvnsite | 1m 17s | | the patch passed | | +1 :green_heart: | javadoc | 0m 51s | | the patch passed with JDK Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 | | +1 :green_heart: | javadoc | 1m 34s | | the patch passed with JDK Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | +1 :green_heart: | spotbugs | 3m 18s | | the patch passed | | +1 :green_heart: | shadedclient | 37m 52s | | patch has no errors when building and testing our client artifacts. | |||| _ Other Tests _ | | -1 :x: | unit | 231m 45s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6708/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 46s | | The patch does not generate ASF License warnings. | | | | 377m 44s | | | | Reason | Tests | |-------:|:------| | Failed junit tests | hadoop.hdfs.server.blockmanagement.TestPendingDataNodeMessages | | Subsystem | Report/Notes | |----------:|:-------------| | Docker | ClientAPI=1.45 ServerAPI=1.45 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6708/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/6708 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux bf6537361823 5.15.0-94-generic #104-Ubuntu SMP Tue Jan 9 15:25:40 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 96cbeddc700d96b57c1d818a173657a91efcc8cc | | Default Java | Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.22+7-post-Ubuntu-0ubuntu220.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_402-8u402-ga-2ubuntu1~20.04-b06 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6708/2/testReport/ | | Max. process+thread count | 4458 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-6708/2/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org | This message was automatically generated. > IncrementalBlockReport can have race condition with Edit Log Tailer > ------------------------------------------------------------------- > > Key: HDFS-17453 > URL: https://issues.apache.org/jira/browse/HDFS-17453 > Project: Hadoop HDFS > Issue Type: Bug > Components: auto-failover, ha, hdfs, namenode > Affects Versions: 3.3.0, 3.3.1, 2.10.2, 3.3.2, 3.3.5, 3.3.4, 3.3.6 > Reporter: Danny Becker > Assignee: Danny Becker > Priority: Major > Labels: pull-request-available > > h2. Summary > There is a race condition between IncrementalBlockReports (IBR) and > EditLogTailer in Standby NameNode (SNN) which can lead to leaked IBRs and > false corrupt blocks after HA Failover. The race condition occurs when the > SNN loads the edit logs before it receives the block reports from DataNode > (DN). > h2. Example > In the following example there is a block (b1) with 3 generation stamps (gs1, > gs2, gs3). > # SNN1 loads edit logs for b1gs1 and b1gs2. > # DN1 sends the IBR for b1gs1 to SNN1. > # SNN1 will determine that the reported block b1gs1 from DN1 is corrupt and > it will be queued for later. > [BlockManager.java|https://github.com/apache/hadoop/blob/6ed73896f6e8b4b7c720eff64193cb30b3e77fb2/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java#L3447C1-L3464C6] > {code:java} > BlockToMarkCorrupt c = checkReplicaCorrupt( > block, reportedState, storedBlock, ucState, dn); > if (c != null) { > if (shouldPostponeBlocksFromFuture) { > // If the block is an out-of-date generation stamp or state, > // but we're the standby, we shouldn't treat it as corrupt, > // but instead just queue it for later processing. > // Storing the reported block for later processing, as that is what > // comes from the IBR / FBR and hence what we should use to compare > // against the memory state. > // See HDFS-6289 and HDFS-15422 for more context. > queueReportedBlock(storageInfo, block, reportedState, > QUEUE_REASON_CORRUPT_STATE); > } else { > toCorrupt.add(c); > } > return storedBlock; > } {code} > # DN1 sends IBR for b1gs2 and b1gs3 to SNN1. > # SNN1 processes b1sg2 and updates the blocks map. > # SNN1 queues b1gs3 for later because it determines that b1gs3 is a future > genstamp. > # SNN1 loads b1gs3 edit logs and processes the queued reports for b1. > # SNN1 processes b1gs1 first and puts it back in the queue. > # SNN1 processes b1gs3 next and updates the blocks map. > # Later, SNN1 becomes the Active NameNode (ANN) during an HA Failover. > # SNN1 will catch to the latest edit logs, then process all queued block > reports to become the ANN. > # ANN1 will process b1gs1 and mark it as corrupt. > If the example above happens for every DN which stores b1, then when the HA > failover happens, b1 will be incorrectly marked as corrupt. This will be > fixed when the first DN sends a FullBlockReport or an IBR for b1. > h2. Logs from Active Cluster > I added the following logs to confirm this issue in an active cluster: > {code:java} > BlockToMarkCorrupt c = checkReplicaCorrupt( > block, reportedState, storedBlock, ucState, dn); > if (c != null) { > DatanodeStorageInfo storedStorageInfo = storedBlock.findStorageInfo(dn); > LOG.info("Found corrupt block {} [{}, {}] from DN {}. Stored block {} from > DN {}", > block, reportedState.name(), ucState.name(), storageInfo, storedBlock, > storedStorageInfo); > if (storageInfo.equals(storedStorageInfo) && > storedBlock.getGenerationStamp() > block.getGenerationStamp()) { > LOG.info("Stored Block {} from the same DN {} has a newer GenStamp." + > storedBlock, storedStorageInfo); > } > if (shouldPostponeBlocksFromFuture) { > // If the block is an out-of-date generation stamp or state, > // but we're the standby, we shouldn't treat it as corrupt, > // but instead just queue it for later processing. > // Storing the reported block for later processing, as that is what > // comes from the IBR / FBR and hence what we should use to compare > // against the memory state. > // See HDFS-6289 and HDFS-15422 for more context. > queueReportedBlock(storageInfo, block, reportedState, > QUEUE_REASON_CORRUPT_STATE); > LOG.info("Queueing the block {} for later processing", block); > } else { > toCorrupt.add(c); > LOG.info("Marking the block {} as corrupt", block); > } > return storedBlock; > } {code} > > Logs from nn1 (Active): > {code:java} > 2024-04-03T03:00:52.524-0700,INFO,[IPC Server handler 6 on default port > 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700910634, > newGS=65700925027, newLength=10485760, newNodes=[[DN1]:10010, [DN2]:10010, > [DN3]:10010, client=client1)" > 2024-04-03T03:00:52.539-0700,INFO,[IPC Server handler 6 on default port > 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700910634 > => blk_66092666802_65700925027) success" > 2024-04-03T03:01:07.413-0700,INFO,[IPC Server handler 6 on default port > 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700925027, > newGS=65700933553, newLength=20971520, newNodes=[[DN1]:10010, [DN2]:10010, > [DN3]:10010, client=client1)" > 2024-04-03T03:01:07.413-0700,INFO,[IPC Server handler 6 on default port > 443],org.apache.hadoop.hdfs.server.namenode.FSNamesystem,"updatePipeline(blk_66092666802_65700925027 > => blk_66092666802_65700933553) success" {code} > > Logs from nn2 (Standby): > {code:java} > 2024-04-03T03:01:23.067-0700,INFO,[Block report > processor],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Found > corrupt block blk_66092666802_65700925027 [FINALIZED, COMPLETE] from DN > [DISK]DS-1:NORMAL:[DN1]:10010. Stored block blk_66092666802_65700933553 from > DN null" > 2024-04-03T03:01:23.067-0700,INFO,[Block report > processor],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Queueing > the block blk_66092666802_65700925027 for later processing" > 2024-04-03T03:01:24.159-0700,INFO,[Block report > processor],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Found > corrupt block blk_66092666802_65700925027 [FINALIZED, COMPLETE] from DN > [DISK]DS-3:NORMAL:[DN3]:10010. Stored block blk_66092666802_65700933553 from > DN null" > 2024-04-03T03:01:24.159-0700,INFO,[Block report > processor],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Queueing > the block blk_66092666802_65700925027 for later processing" > 2024-04-03T03:01:24.159-0700,INFO,[Block report > processor],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Found > corrupt block blk_66092666802_65700925027 [FINALIZED, COMPLETE] from DN > [DISK]DS-2:NORMAL:[DN2]:10010. Stored block blk_66092666802_65700933553 from > DN null" > 2024-04-03T03:01:24.159-0700,INFO,[Block report > processor],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Queueing > the block blk_66092666802_65700925027 for later processing" {code} > > Logs from nn2 when it transitions to Active: > {code:java} > 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port > 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Found > corrupt block blk_66092666802_65700925027 [FINALIZED, COMPLETE] from DN > [DISK]DS-1:NORMAL:[DN1]:10010. Stored block blk_66092666802_65700933553 from > DN [DISK]DS-1:NORMAL:[DN1]:10010" > 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port > 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Stored > Block blk_66092666802_65700933553 from the same DN > [DISK]DS-1:NORMAL:[DN1]:10010 has a newer GenStamp." > 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port > 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Marking the > block blk_66092666802_65700925027 as corrupt" > 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port > 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Found > corrupt block blk_66092666802_65700925027 [FINALIZED, COMPLETE] from DN > [DISK]DS-2:NORMAL:[DN2]:10010. Stored block blk_66092666802_65700933553 from > DN [DISK]DS-2:NORMAL:[DN2]:10010" > 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port > 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Stored > Block blk_66092666802_65700933553 from the same DN > [DISK]DS-2:NORMAL:[DN2]:10010 has a newer GenStamp." > 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port > 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Marking the > block blk_66092666802_65700925027 as corrupt" > 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port > 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Found > corrupt block blk_66092666802_65700925027 [FINALIZED, COMPLETE] from DN > [DISK]DS-3:NORMAL:[DN3]:10010. Stored block blk_66092666802_65700933553 from > DN [DISK]DS-3:NORMAL:[DN3]:10010" > 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port > 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Stored > Block blk_66092666802_65700933553 from the same DN > [DISK]DS-3:NORMAL:[DN3]:10010 has a newer GenStamp." > 2024-04-03T15:39:09.050-0700,INFO,[IPC Server handler 8 on default port > 8020],org.apache.hadoop.hdfs.server.blockmanagement.BlockManager,"Marking the > block blk_66092666802_65700925027 as corrupt" > {code} > -- This message was sent by Atlassian Jira (v8.20.10#820010) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org