[jira] [Work logged] (HDFS-16622) addRDBI in IncrementalBlockReportManager may remove the block with bigger GS.
[ https://issues.apache.org/jira/browse/HDFS-16622?focusedWorklogId=779952&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-779952 ]

ASF GitHub Bot logged work on HDFS-16622:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 09/Jun/22 13:35
Start Date: 09/Jun/22 13:35
Worklog Time Spent: 10m
Work Description: Hexiaoqiao commented on code in PR #4407:
URL: https://github.com/apache/hadoop/pull/4407#discussion_r893514408

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/IncrementalBlockReportManager.java: ##

@@ -251,12 +251,20 @@ synchronized void addRDBI(ReceivedDeletedBlockInfo rdbi,
       DatanodeStorage storage) {
     // Make sure another entry for the same block is first removed.
     // There may only be one such entry.
+    ReceivedDeletedBlockInfo removedInfo = null;
     for (PerStorageIBR perStorage : pendingIBRs.values()) {
-      if (perStorage.remove(rdbi.getBlock()) != null) {
+      removedInfo = perStorage.remove(rdbi.getBlock());
+      if (removedInfo != null) {
         break;
       }
     }
-    getPerStorageIBR(storage).put(rdbi);
+    if (removedInfo != null &&

Review Comment:
@ZanderXu Thanks for the detailed information. It is an interesting case, and IMO this improvement makes sense. Would you mind adding a unit test to cover it?

Issue Time Tracking
-------------------

Worklog Id: (was: 779952)
Time Spent: 1h (was: 50m)

> addRDBI in IncrementalBlockReportManager may remove the block with bigger GS.
> -----------------------------------------------------------------------------
>
> Key: HDFS-16622
> URL: https://issues.apache.org/jira/browse/HDFS-16622
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Major
> Labels: pull-request-available
> Time Spent: 1h
> Remaining Estimate: 0h
>
> In our production environment there was a strange missing block. According
> to the log, I suspect there is a bug in the function
> addRDBI(ReceivedDeletedBlockInfo rdbi, DatanodeStorage storage) (line 250).
> Bug code in the for loop:
> {code:java}
> synchronized void addRDBI(ReceivedDeletedBlockInfo rdbi,
>     DatanodeStorage storage) {
>   // Make sure another entry for the same block is first removed.
>   // There may only be one such entry.
>   for (PerStorageIBR perStorage : pendingIBRs.values()) {
>     if (perStorage.remove(rdbi.getBlock()) != null) {
>       break;
>     }
>   }
>   getPerStorageIBR(storage).put(rdbi);
> }
> {code}
> The GS (generation stamp) of the Block in the removed ReceivedDeletedBlockInfo
> may be greater than the GS of the Block in rdbi, and the NN will invalidate
> the replica with the smaller GS when it completes the block.
> So if a block has only one replica, this wrong logic can lead to a
> missing block.

--
This message was sent by Atlassian Jira
(v8.20.7#820007)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
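
For readers following the thread, here is a minimal sketch of the idea discussed in the review: keep the pending entry with the larger generation stamp instead of letting the last caller win. The condition in the PR diff is truncated in this message, so the policy below is an assumption, not the code that was merged. `GsAwarePendingReports`, `add`, and `pendingGenStamp` are hypothetical stand-ins that use plain maps rather than the real `PerStorageIBR` and `ReceivedDeletedBlockInfo` classes.

```java
import java.util.HashMap;
import java.util.Map;

/** Simplified stand-in for the pending-IBR bookkeeping; NOT the Hadoop class. */
public class GsAwarePendingReports {
  // storage id -> (block id -> generation stamp of the pending report entry)
  private final Map<String, Map<Long, Long>> pending = new HashMap<>();

  /** Queues a report entry, never replacing a newer GS with an older one. */
  public synchronized void add(String storageId, long blockId, long genStamp) {
    // Remove any existing entry for the same block, remembering what was removed.
    String removedStorage = null;
    Long removedGs = null;
    for (Map.Entry<String, Map<Long, Long>> e : pending.entrySet()) {
      removedGs = e.getValue().remove(blockId);
      if (removedGs != null) {
        removedStorage = e.getKey();
        break;
      }
    }
    if (removedGs != null && removedGs > genStamp) {
      // Assumed policy: the incoming entry is stale, so restore the removed one.
      pending.get(removedStorage).put(blockId, removedGs);
    } else {
      pending.computeIfAbsent(storageId, k -> new HashMap<>()).put(blockId, genStamp);
    }
  }

  /** Returns the GS currently queued for the block, or null if none is queued. */
  public synchronized Long pendingGenStamp(long blockId) {
    for (Map<Long, Long> perStorage : pending.values()) {
      Long gs = perStorage.get(blockId);
      if (gs != null) {
        return gs;
      }
    }
    return null;
  }
}
```

With this policy, queuing GS 20 and then GS 10 for the same block leaves GS 20 pending, while a later GS 30 still replaces it. Whether a stale entry should be dropped silently or logged is a design choice this sketch does not settle.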
[jira] [Work logged] (HDFS-16622) addRDBI in IncrementalBlockReportManager may remove the block with bigger GS.
[ https://issues.apache.org/jira/browse/HDFS-16622?focusedWorklogId=779121&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-779121 ]

ASF GitHub Bot logged work on HDFS-16622:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 07/Jun/22 14:24
Start Date: 07/Jun/22 14:24
Worklog Time Spent: 10m
Work Description: ZanderXu commented on code in PR #4407:
URL: https://github.com/apache/hadoop/pull/4407#discussion_r891298950

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/IncrementalBlockReportManager.java: ##

@@ -251,12 +251,20 @@ synchronized void addRDBI(ReceivedDeletedBlockInfo rdbi,
       DatanodeStorage storage) {
     // Make sure another entry for the same block is first removed.
     // There may only be one such entry.
+    ReceivedDeletedBlockInfo removedInfo = null;
     for (PerStorageIBR perStorage : pendingIBRs.values()) {
-      if (perStorage.remove(rdbi.getBlock()) != null) {
+      removedInfo = perStorage.remove(rdbi.getBlock());
+      if (removedInfo != null) {
         break;
       }
     }
-    getPerStorageIBR(storage).put(rdbi);
+    if (removedInfo != null &&

Review Comment:
We hit a case of concurrent CloseRecovery. The CloseRecovery with the smaller GS processed the block on storage first but was added to pendingIBRs later, while the CloseRecovery with the bigger GS processed the block on storage later but was added to pendingIBRs first. As a result, the block with the larger GS is stored on disk, but the block with the smaller GS is reported to the NameNode. Unfortunately, this was the block's only valid replica, which led to the missing block.

Issue Time Tracking
-------------------

Worklog Id: (was: 779121)
Time Spent: 50m (was: 40m)

> addRDBI in IncrementalBlockReportManager may remove the block with bigger GS.
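
The ordering described in the review comment above can be replayed deterministically. The sketch below is only an illustration under simplified assumptions: `StaleReportReplay` and `replay` are hypothetical names, and a single map stands in for `pendingIBRs`. It mimics the original remove-then-put behaviour, where whichever entry is queued last is the one that gets reported.

```java
import java.util.HashMap;
import java.util.Map;

/** Replays the queue ordering from the review comment; NOT Hadoop code. */
public class StaleReportReplay {

  /**
   * Mimics the original addRDBI logic: any existing entry for the block is
   * removed and the incoming one is stored, regardless of generation stamp.
   * Expects at least one generation stamp in the array.
   */
  public static long replay(long blockId, long[] genStampsInQueueOrder) {
    Map<Long, Long> pending = new HashMap<>();
    for (long gs : genStampsInQueueOrder) {
      pending.remove(blockId);
      pending.put(blockId, gs);
    }
    return pending.get(blockId);
  }

  public static void main(String[] args) {
    // The recovery with GS 20 wrote to disk last, so disk holds GS 20,
    // but it was queued first; the recovery with GS 10 was queued last.
    long onDisk = 20L;
    long reported = replay(1001L, new long[] {20L, 10L});
    System.out.println("on disk GS=" + onDisk + ", reported GS=" + reported);
  }
}
```

In this replay the reported generation stamp is 10 while the replica on disk carries 20, which is the mismatch the comment describes.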
[jira] [Work logged] (HDFS-16622) addRDBI in IncrementalBlockReportManager may remove the block with bigger GS.
[ https://issues.apache.org/jira/browse/HDFS-16622?focusedWorklogId=779118&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-779118 ]

ASF GitHub Bot logged work on HDFS-16622:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 07/Jun/22 14:21
Start Date: 07/Jun/22 14:21
Worklog Time Spent: 10m
Work Description: ZanderXu commented on code in PR #4407:
URL: https://github.com/apache/hadoop/pull/4407#discussion_r891298950

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/IncrementalBlockReportManager.java: ##

@@ -251,12 +251,20 @@ synchronized void addRDBI(ReceivedDeletedBlockInfo rdbi,
       DatanodeStorage storage) {
     // Make sure another entry for the same block is first removed.
     // There may only be one such entry.
+    ReceivedDeletedBlockInfo removedInfo = null;
     for (PerStorageIBR perStorage : pendingIBRs.values()) {
-      if (perStorage.remove(rdbi.getBlock()) != null) {
+      removedInfo = perStorage.remove(rdbi.getBlock());
+      if (removedInfo != null) {
         break;
       }
     }
-    getPerStorageIBR(storage).put(rdbi);
+    if (removedInfo != null &&

Review Comment:
We hit a case of concurrent CloseRecovery. The CloseRecovery with the smaller GS processed the block on storage first but was added to pendingIBRs later, while the CloseRecovery with the bigger GS processed the block on storage later but was added to pendingIBRs first. As a result, the block with the larger GS is stored on disk, but the block with the smaller GS is reported to the NameNode.

Issue Time Tracking
-------------------

Worklog Id: (was: 779118)
Time Spent: 40m (was: 0.5h)

> addRDBI in IncrementalBlockReportManager may remove the block with bigger GS.
[jira] [Work logged] (HDFS-16622) addRDBI in IncrementalBlockReportManager may remove the block with bigger GS.
[ https://issues.apache.org/jira/browse/HDFS-16622?focusedWorklogId=779101&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-779101 ]

ASF GitHub Bot logged work on HDFS-16622:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 07/Jun/22 13:41
Start Date: 07/Jun/22 13:41
Worklog Time Spent: 10m
Work Description: Hexiaoqiao commented on code in PR #4407:
URL: https://github.com/apache/hadoop/pull/4407#discussion_r891245808

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/IncrementalBlockReportManager.java: ##

@@ -251,12 +251,20 @@ synchronized void addRDBI(ReceivedDeletedBlockInfo rdbi,
       DatanodeStorage storage) {
     // Make sure another entry for the same block is first removed.
     // There may only be one such entry.
+    ReceivedDeletedBlockInfo removedInfo = null;
     for (PerStorageIBR perStorage : pendingIBRs.values()) {
-      if (perStorage.remove(rdbi.getBlock()) != null) {
+      removedInfo = perStorage.remove(rdbi.getBlock());
+      if (removedInfo != null) {
         break;
       }
     }
-    getPerStorageIBR(storage).put(rdbi);
+    if (removedInfo != null &&

Review Comment:
My first impression is that `pendingIBRs` should keep the freshest set of `rdbi`s to report to the NameNode. After this change, though, it would no longer hold the freshest data and would also be inconsistent with the block data on storage, right?

Issue Time Tracking
-------------------

Worklog Id: (was: 779101)
Time Spent: 0.5h (was: 20m)

> addRDBI in IncrementalBlockReportManager may remove the block with bigger GS.
> -----------------------------------------------------------------------------
>
> Key: HDFS-16622
> URL: https://issues.apache.org/jira/browse/HDFS-16622
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Major
> Labels: pull-request-available
> Time Spent: 0.5h
> Remaining Estimate: 0h
>
> In our production environment there was a strange missing block. According
> to the log, I suspect there is a bug in the function
> addRDBI(ReceivedDeletedBlockInfo rdbi, DatanodeStorage storage) (line 250).
[jira] [Work logged] (HDFS-16622) addRDBI in IncrementalBlockReportManager may remove the block with bigger GS.
[ https://issues.apache.org/jira/browse/HDFS-16622?focusedWorklogId=778582&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778582 ]

ASF GitHub Bot logged work on HDFS-16622:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 06/Jun/22 11:27
Start Date: 06/Jun/22 11:27
Worklog Time Spent: 10m
Work Description: hadoop-yetus commented on PR #4407:
URL: https://github.com/apache/hadoop/pull/4407#issuecomment-1147344802

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 1m 3s | | Docker mode activated. |
| | | | | _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| | | | | _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 40m 18s | | trunk passed |
| +1 :green_heart: | compile | 1m 44s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | compile | 1m 32s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 1m 21s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 41s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 22s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 40s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 48s | | trunk passed |
| +1 :green_heart: | shadedclient | 26m 0s | | branch has no errors when building and testing our client artifacts. |
| | | | | _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 24s | | the patch passed |
| +1 :green_heart: | compile | 1m 32s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javac | 1m 32s | | the patch passed |
| +1 :green_heart: | compile | 1m 21s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 1m 21s | | the patch passed |
| +1 :green_heart: | blanks | 0m 1s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 1s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 29s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 59s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 31s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 3m 33s | | the patch passed |
| +1 :green_heart: | shadedclient | 25m 41s | | patch has no errors when building and testing our client artifacts. |
| | | | | _ Other Tests _ |
| +1 :green_heart: | unit | 395m 0s | | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 1m 2s | | The patch does not generate ASF License warnings. |
| | | 512m 44s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4407/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4407 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux e7745f582308 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 91f7ff3a9989a9a18398cf8c82b1e30492a86bad |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4407/1/testReport/ |
| Max. process+thread count | 2066 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U:
[jira] [Work logged] (HDFS-16622) addRDBI in IncrementalBlockReportManager may remove the block with bigger GS.
[ https://issues.apache.org/jira/browse/HDFS-16622?focusedWorklogId=778501&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-778501 ]

ASF GitHub Bot logged work on HDFS-16622:
-----------------------------------------

Author: ASF GitHub Bot
Created on: 06/Jun/22 02:53
Start Date: 06/Jun/22 02:53
Worklog Time Spent: 10m
Work Description: ZanderXu opened a new pull request, #4407:
URL: https://github.com/apache/hadoop/pull/4407

JIRA: [HDFS-16622](https://issues.apache.org/jira/browse/HDFS-16622). addRDBI in IncrementalBlockReportManager may remove the block with bigger GS.

I suspect there is a bug in the function addRDBI(ReceivedDeletedBlockInfo rdbi, DatanodeStorage storage) (line 250).

Bug code in the for loop:

    synchronized void addRDBI(ReceivedDeletedBlockInfo rdbi,
        DatanodeStorage storage) {
      // Make sure another entry for the same block is first removed.
      // There may only be one such entry.
      for (PerStorageIBR perStorage : pendingIBRs.values()) {
        if (perStorage.remove(rdbi.getBlock()) != null) {
          break;
        }
      }
      getPerStorageIBR(storage).put(rdbi);
    }

The GS (generation stamp) of the Block in the removed ReceivedDeletedBlockInfo may be greater than the GS of the Block in rdbi, and the NN will invalidate the replica with the smaller GS when it completes the block.

Issue Time Tracking
-------------------

Worklog Id: (was: 778501)
Remaining Estimate: 0h
Time Spent: 10m

> addRDBI in IncrementalBlockReportManager may remove the block with bigger GS.
> -----------------------------------------------------------------------------
>
> Key: HDFS-16622
> URL: https://issues.apache.org/jira/browse/HDFS-16622
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In our production environment there was a strange missing block. According
> to the log, I suspect there is a bug in the function
> addRDBI(ReceivedDeletedBlockInfo rdbi, DatanodeStorage storage) (line 250).