[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801009#comment-17801009 ]

Takanobu Asanuma commented on HDFS-17150:
-----------------------------------------

Cherry-picked to branch-3.3.

> EC: Fix the bug of failed lease recovery.
> -----------------------------------------
>
> Key: HDFS-17150
> URL: https://issues.apache.org/jira/browse/HDFS-17150
> Project: Hadoop HDFS
> Issue Type: Bug
> Reporter: Shuyan Zhang
> Assignee: Shuyan Zhang
> Priority: Major
> Labels: pull-request-available
> Fix For: 3.4.0, 3.3.9
>
> If the client crashes without writing the minimum number of internal blocks required by the EC policy, the lease recovery process for the corresponding unclosed file may keep failing. Taking the RS(6,3) policy as an example, the timeline is as follows:
> 1. The client writes some data to only 5 datanodes;
> 2. The client crashes;
> 3. The NN fails over;
> 4. The result of `uc.getNumExpectedLocations()` now depends entirely on block reports, and there are 5 datanodes reporting internal blocks;
> 5. When the lease exceeds the hard limit, the NN issues a block recovery command;
> 6. The datanode checks the command, finds that the number of internal blocks is insufficient, and the recovery fails with an error;
> 7. The lease exceeds the hard limit again, the NN issues another block recovery command, and the recovery fails again.
> When the number of internal blocks written by the client is less than 6, the block group is actually unrecoverable. We should treat this situation the same as the zero-replica case for replicated files, i.e., directly remove the last block group and close the file.

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
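The decision described in the issue can be modeled compactly. The following is an illustrative sketch only, not the actual FSNamesystem code: the class name `LeaseRecoverySketch` and both helper methods are hypothetical, and the real logic lives in `FSNamesystem#internalReleaseLease` (see the PR diff quoted later in the thread).

```java
// Illustrative model of the HDFS-17150 decision: when can the NameNode
// simply drop the last under-construction block and close the file?
// Hypothetical sketch; not the actual FSNamesystem implementation.
public class LeaseRecoverySketch {

    // Minimum reported locations for the last block to be worth recovering:
    // 1 for a replicated block, the real data-block count for a striped
    // (EC) block group, e.g. 6 for RS(6,3).
    static int minLocationsNum(boolean striped, int realDataBlockNum) {
        return striped ? realDataBlockNum : 1;
    }

    // The patch removes the last block (and closes the file) when fewer
    // locations than the minimum were reported and the block has no bytes;
    // such a block group is unrecoverable anyway.
    static boolean shouldRemoveLastBlock(int numExpectedLocations,
                                         long numBytes,
                                         boolean striped,
                                         int realDataBlockNum) {
        return numExpectedLocations < minLocationsNum(striped, realDataBlockNum)
                && numBytes == 0;
    }

    public static void main(String[] args) {
        // RS(6,3), only 5 internal blocks reported: remove and close,
        // instead of issuing block recovery commands that keep failing.
        if (!shouldRemoveLastBlock(5, 0L, true, 6)) {
            throw new AssertionError("expected removal for 5 < 6 locations");
        }
        // Replicated file with one reported location: normal recovery path.
        if (shouldRemoveLastBlock(1, 0L, false, 3)) {
            throw new AssertionError("replicated block with a location kept");
        }
        System.out.println("sketch checks passed");
    }
}
```

Before the patch, the condition was effectively `numExpectedLocations == 0`, so the 5-location EC case above fell through to a recovery command that the datanode rejects for lack of internal blocks.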
[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754598#comment-17754598 ]

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

Hexiaoqiao commented on PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#issuecomment-1678808788

Committed to trunk. Thanks @zhangshuyan0 for your work, and @haiyang1987 and @hfutatzhanghb for your reviews!
[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754597#comment-17754597 ]

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

Hexiaoqiao merged PR #5937:
URL: https://github.com/apache/hadoop/pull/5937
[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754540#comment-17754540 ]

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

haiyang1987 commented on PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#issuecomment-1678720922

LGTM. +1.
[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754445#comment-17754445 ]

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

hadoop-yetus commented on PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#issuecomment-1678509228

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 28s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 32m 56s | | trunk passed |
| +1 :green_heart: | compile | 0m 54s | | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | compile | 0m 50s | | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | checkstyle | 0m 44s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 55s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 51s | | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 12s | | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | spotbugs | 2m 0s | | trunk passed |
| +1 :green_heart: | shadedclient | 22m 11s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 47s | | the patch passed |
| +1 :green_heart: | compile | 0m 48s | | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javac | 0m 48s | | the patch passed |
| +1 :green_heart: | compile | 0m 43s | | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | javac | 0m 43s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 0m 34s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5937/3/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 110 unchanged - 0 fixed = 111 total (was 110) |
| +1 :green_heart: | mvnsite | 0m 45s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 40s | | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 4s | | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | spotbugs | 1m 55s | | the patch passed |
| +1 :green_heart: | shadedclient | 22m 18s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| -1 :x: | unit | 183m 51s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5937/3/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 39s | | The patch does not generate ASF License warnings. |
| | | 278m 3s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.server.blockmanagement.TestBlockTokenWithShortCircuitRead |
| | hadoop.hdfs.TestFileChecksum |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5937/3/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5937 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 41e15ee1ded0 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / b0ce7e38eb84e7ffc0f14493ce3dbae8c8f7393c |
| Default Java | Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| Multi-JDK versions |
[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754409#comment-17754409 ]

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

hfutatzhanghb commented on PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#issuecomment-1678450434

LGTM. +1.
[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754372#comment-17754372 ]

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

zhangshuyan0 commented on PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#issuecomment-1678346266

@Hexiaoqiao @haiyang1987 Some comments have been added. Please take a look when you have free time.
[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754030#comment-17754030 ]

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

haiyang1987 commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1293312263

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:

@@ -3802,16 +3803,26 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
           lastBlock.getBlockType());
     }

-    if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+    int minLocationsNum = 1;
+    if (lastBlock.isStriped()) {
+      minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+    }
+    if (uc.getNumExpectedLocations() < minLocationsNum &&
+        lastBlock.getNumBytes() == 0) {
       // There is no datanode reported to this block.
       // may be client have crashed before writing data to pipeline.
       // This blocks doesn't need any recovery.
       // We can remove this block and close the file.
       pendingFile.removeLastBlock(lastBlock);
       finalizeINodeFileUnderConstruction(src, pendingFile,
           iip.getLatestSnapshotId(), false);
-      NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
-          + "Removed empty last block and closed file " + src);
+      if (uc.getNumExpectedLocations() == 0) {
+        NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
+            + "Removed empty last block and closed file " + src);
+      } else {
+        NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "

Review Comment:
Yeah, I understand your thoughts. Perhaps it would be better to add a descriptive comment.
[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17753917#comment-17753917 ]

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

zhangshuyan0 commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1293027160

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:

Review Comment:
> If uc.getNumExpectedLocations() is 0, regardless of whether it is a striped block or not, I think we should all consider it to be an empty block, not unrecoverable.

How about adding some comments here?
[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17753889#comment-17753889 ]

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

Hexiaoqiao commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1292951771

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:

Review Comment:
Totally true, but poor readability. Any other way to improve it, such as `lastBlock.isStriped()` or others?
[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17753465#comment-17753465 ]

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

zhangshuyan0 commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1292104709

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:

Review Comment:
`uc.getNumExpectedLocations() != 0` means `minLocationsNum != 1`, so it must be an EC file according to lines 3806-3809.
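The invariant zhangshuyan0 points out can be stated as a small check. This is a hypothetical sketch (the class and method names are invented, and the second log message is elided in the quoted diff, so placeholder tags are used): inside the removal branch, `numExpectedLocations < minLocationsNum` already holds, so `numExpectedLocations != 0` forces `minLocationsNum > 1`, which the patch only sets for striped blocks.

```java
// Hypothetical sketch of the log-branch invariant discussed above.
// Within the removal branch, numExpectedLocations < minLocationsNum holds,
// so a non-zero location count implies minLocationsNum > 1, i.e. striped.
public class LogBranchSketch {

    static String classify(int numExpectedLocations, int minLocationsNum) {
        if (numExpectedLocations >= minLocationsNum) {
            throw new IllegalArgumentException("not in the removal branch");
        }
        if (numExpectedLocations == 0) {
            // Matches the existing "Removed empty last block" log line;
            // possible for both replicated and striped blocks.
            return "empty-last-block";
        }
        // Reaching here implies minLocationsNum > 1, which the patch only
        // sets for striped blocks; the log line can safely name the EC case.
        return "unrecoverable-ec-block-group";
    }

    public static void main(String[] args) {
        if (!classify(0, 1).equals("empty-last-block")) {
            throw new AssertionError();
        }
        if (!classify(5, 6).equals("unrecoverable-ec-block-group")) {
            throw new AssertionError();
        }
        System.out.println("invariant holds for the sampled cases");
    }
}
```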
[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17753209#comment-17753209 ]

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

Hexiaoqiao commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1291256537

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:

Review Comment:
It is a little odd: can we determine that it is an EC file here when `uc.getNumExpectedLocations() != 0`? If not, this log message will be ambiguous.
[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752717#comment-17752717 ]

ASF GitHub Bot commented on HDFS-17150:
---

hadoop-yetus commented on PR #5937: URL: https://github.com/apache/hadoop/pull/5937#issuecomment-1672842556

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 30s | | Docker mode activated. |
| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 34m 54s | | trunk passed |
| +1 :green_heart: | compile | 0m 52s | | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | compile | 0m 50s | | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | checkstyle | 0m 47s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 52s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 48s | | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 8s | | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | spotbugs | 1m 57s | | trunk passed |
| +1 :green_heart: | shadedclient | 27m 6s | | branch has no errors when building and testing our client artifacts. |
| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 49s | | the patch passed |
| +1 :green_heart: | compile | 0m 52s | | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javac | 0m 52s | | the patch passed |
| +1 :green_heart: | compile | 0m 45s | | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | javac | 0m 45s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 0m 38s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5937/2/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 110 unchanged - 0 fixed = 111 total (was 110) |
| +1 :green_heart: | mvnsite | 0m 47s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 41s | | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 16s | | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | spotbugs | 2m 17s | | the patch passed |
| +1 :green_heart: | shadedclient | 27m 59s | | patch has no errors when building and testing our client artifacts. |
| _ Other Tests _ |
| +1 :green_heart: | unit | 194m 35s | | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 39s | | The patch does not generate ASF License warnings. |
| | | | 302m 11s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5937/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5937 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux e3965eccf0d3 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 35eb68154529a0568e80d3ca94540225b1dda911 |
| Default Java | Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5937/2/testReport/ |
| Max. process+thread count | 3238 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752684#comment-17752684 ]

ASF GitHub Bot commented on HDFS-17150:
---

hadoop-yetus commented on PR #5937: URL: https://github.com/apache/hadoop/pull/5937#issuecomment-1672759197

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 29s | | Docker mode activated. |
| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 32m 0s | | trunk passed |
| +1 :green_heart: | compile | 0m 53s | | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | compile | 0m 48s | | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | checkstyle | 0m 44s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 55s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 52s | | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 10s | | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | spotbugs | 1m 58s | | trunk passed |
| +1 :green_heart: | shadedclient | 23m 55s | | branch has no errors when building and testing our client artifacts. |
| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 46s | | the patch passed |
| +1 :green_heart: | compile | 0m 45s | | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javac | 0m 45s | | the patch passed |
| +1 :green_heart: | compile | 0m 38s | | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | javac | 0m 38s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 32s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 42s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 38s | | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
| +1 :green_heart: | javadoc | 1m 6s | | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| +1 :green_heart: | spotbugs | 1m 52s | | the patch passed |
| +1 :green_heart: | shadedclient | 24m 48s | | patch has no errors when building and testing our client artifacts. |
| _ Other Tests _ |
| -1 :x: | unit | 190m 50s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5937/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 39s | | The patch does not generate ASF License warnings. |
| | | | 288m 11s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.TestFileChecksum |
| | hadoop.hdfs.server.namenode.ha.TestObserverNode |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5937/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5937 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 068005fabb65 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 4e7cd881a92b7deb7d8d188b8c1dc85ca6e8ee5f |
| Default Java | Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5937/1/testReport/ |
| Max. process+thread count | 4019
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752671#comment-17752671 ]

ASF GitHub Bot commented on HDFS-17150:
---

zhangshuyan0 commented on code in PR #5937: URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289707937

hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:

```java
@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
           lastBlock.getBlockType());
     }
-    if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+    int minLocationsNum = 1;
+    if (lastBlock.isStriped()) {
+      minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+    }
+    if (uc.getNumExpectedLocations() < minLocationsNum &&
```

Review Comment: @hfutatzhanghb `AddBlockOp` only stores the blockId, numBytes, and generationStamp of the last block.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752648#comment-17752648 ]

ASF GitHub Bot commented on HDFS-17150:
---

hfutatzhanghb commented on code in PR #5937: URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289651742

hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:

```java
@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
           lastBlock.getBlockType());
     }
-    if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+    int minLocationsNum = 1;
+    if (lastBlock.isStriped()) {
+      minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+    }
+    if (uc.getNumExpectedLocations() < minLocationsNum &&
```

Review Comment: @zhangshuyan0 Hi, Shuyan. Please also check the code snippet below in the method `FSDirWriteFileOp#storeAllocatedBlock`:

```java
final BlockType blockType = pendingFile.getBlockType();
// allocate new block, record block locations in INode.
Block newBlock = fsn.createNewBlock(blockType);
INodesInPath inodesInPath = INodesInPath.fromINode(pendingFile);
saveAllocatedBlock(fsn, src, inodesInPath, newBlock, targets, blockType);
persistNewBlock(fsn, src, pendingFile);
```

Is `BlockUnderConstructionFeature#replicas` also written to the edit log, since it is part of the last block? Thanks a lot.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752642#comment-17752642 ]

ASF GitHub Bot commented on HDFS-17150:
---

zhangshuyan0 commented on code in PR #5937: URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289631865

hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:

```java
@@ -3802,16 +3803,26 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
           lastBlock.getBlockType());
     }
-    if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+    int minLocationsNum = 1;
+    if (lastBlock.isStriped()) {
+      minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+    }
+    if (uc.getNumExpectedLocations() < minLocationsNum &&
+        lastBlock.getNumBytes() == 0) {
       // There is no datanode reported to this block.
       // may be client have crashed before writing data to pipeline.
       // This blocks doesn't need any recovery.
       // We can remove this block and close the file.
       pendingFile.removeLastBlock(lastBlock);
       finalizeINodeFileUnderConstruction(src, pendingFile,
           iip.getLatestSnapshotId(), false);
-      NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
-          + "Removed empty last block and closed file " + src);
+      if (uc.getNumExpectedLocations() == 0) {
```

Review Comment: If `uc.getNumExpectedLocations()` is 0, regardless of whether it is a striped block or not, I think we should consider it an empty block, not an unrecoverable one. So I think the previous code is better. What's your opinion?
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752640#comment-17752640 ]

ASF GitHub Bot commented on HDFS-17150:
---

zhangshuyan0 commented on code in PR #5937: URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289626419

hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:

```java
@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
           lastBlock.getBlockType());
     }
-    if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+    int minLocationsNum = 1;
+    if (lastBlock.isStriped()) {
+      minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+    }
+    if (uc.getNumExpectedLocations() < minLocationsNum &&
```

Review Comment:
> First, the writing process of EC files never calls `getAdditionalBlock`. Second, after a NameNode failover, the content of `BlockUnderConstructionFeature#replicas` depends entirely on block reports (IBR or FBR).

Sorry, I mistook `getAdditionalBlock` for `getAdditionalDatanode` just now, but the conclusion still holds because the failover occurs.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752631#comment-17752631 ]

ASF GitHub Bot commented on HDFS-17150:
---

haiyang1987 commented on code in PR #5937: URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289579716

hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:

```java
@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
           lastBlock.getBlockType());
     }
-    if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+    int minLocationsNum = 1;
+    if (lastBlock.isStriped()) {
+      minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+    }
+    if (uc.getNumExpectedLocations() < minLocationsNum &&
```

Review Comment: I think the writing process of EC files does call `getAdditionalBlock` and sets `ReplicaUnderConstruction[] replicas`. Is it only on a standby NameNode that becomes active after an HA failover that `BlockUnderConstructionFeature#replicas` is populated entirely from block reports (IBR or FBR)?
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752630#comment-17752630 ]

ASF GitHub Bot commented on HDFS-17150:
---

hfutatzhanghb commented on code in PR #5937: URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289579189

hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:

```java
@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
           lastBlock.getBlockType());
     }
-    if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+    int minLocationsNum = 1;
+    if (lastBlock.isStriped()) {
+      minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+    }
+    if (uc.getNumExpectedLocations() < minLocationsNum &&
```

Review Comment: @zhangshuyan0 Thanks a lot. I had ignored the failover condition.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752626#comment-17752626 ]

ASF GitHub Bot commented on HDFS-17150:
---

haiyang1987 commented on code in PR #5937: URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289559249

hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:

```java
@@ -3802,16 +3803,26 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
           lastBlock.getBlockType());
     }
-    if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+    int minLocationsNum = 1;
+    if (lastBlock.isStriped()) {
+      minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+    }
+    if (uc.getNumExpectedLocations() < minLocationsNum &&
+        lastBlock.getNumBytes() == 0) {
       // There is no datanode reported to this block.
       // may be client have crashed before writing data to pipeline.
       // This blocks doesn't need any recovery.
       // We can remove this block and close the file.
       pendingFile.removeLastBlock(lastBlock);
       finalizeINodeFileUnderConstruction(src, pendingFile,
           iip.getLatestSnapshotId(), false);
-      NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
-          + "Removed empty last block and closed file " + src);
+      if (uc.getNumExpectedLocations() == 0) {
```

Review Comment: How about updating this judgment logic as follows?

```java
if (lastBlock.isStriped()) {
  NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
      + "Removed last unrecoverable block group and closed file " + src);
} else {
  NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
      + "Removed empty last block and closed file " + src);
}
```
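The log-message branching suggested above can be lifted into a small runnable sketch. The helper name `pickMessage` is hypothetical; the real code emits the message through `NameNode.stateChangeLog.warn`, and `striped` stands in for `lastBlock.isStriped()`.

```java
// Minimal sketch of choosing the internalReleaseLease warn message by block
// type, mirroring the if/else suggested in the review comment above.
public class ReleaseLeaseLogSketch {

    public static String pickMessage(boolean striped, String src) {
        // Striped (EC) last blocks are removed because the block group is
        // unrecoverable; contiguous last blocks are removed because they are
        // empty. The two cases get distinct messages.
        return "BLOCK* internalReleaseLease: "
            + (striped
                ? "Removed last unrecoverable block group and closed file "
                : "Removed empty last block and closed file ")
            + src;
    }

    public static void main(String[] args) {
        System.out.println(pickMessage(true, "/ec/file"));
        System.out.println(pickMessage(false, "/replicated/file"));
    }
}
```

One design point raised in the thread: branching on `lastBlock.isStriped()` keeps the message unambiguous even when `uc.getNumExpectedLocations()` is non-zero, which is the ambiguity Hexiaoqiao's later review comment points out.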
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752615#comment-17752615 ]

ASF GitHub Bot commented on HDFS-17150:
---

zhangshuyan0 commented on code in PR #5937: URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289529656

hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:

```java
@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
           lastBlock.getBlockType());
     }
-    if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+    int minLocationsNum = 1;
+    if (lastBlock.isStriped()) {
+      minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+    }
+    if (uc.getNumExpectedLocations() < minLocationsNum &&
```

Review Comment: First, the writing process of EC files never calls `getAdditionalBlock`. Second, after a NameNode failover, the content of `BlockUnderConstructionFeature#replicas` depends entirely on block reports (IBR or FBR).
[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752612#comment-17752612 ] ASF GitHub Bot commented on HDFS-17150: --- hfutatzhanghb commented on code in PR #5937: URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289519764

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:

@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
         lastBlock.getBlockType());
     }
-    if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+    int minLocationsNum = 1;
+    if (lastBlock.isStriped()) {
+      minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+    }
+    if (uc.getNumExpectedLocations() < minLocationsNum &&

Review Comment: Hi @zhangshuyan0, I also have a question here: is `uc.getNumExpectedLocations()` always >= `minLocationsNum`? When the `getAdditionalBlock` RPC is invoked, the `replicas` field in `BlockUnderConstructionFeature` is set to the targets.
[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752611#comment-17752611 ] ASF GitHub Bot commented on HDFS-17150: --- zhangshuyan0 commented on code in PR #5937: URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289517665

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:

@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
         lastBlock.getBlockType());
     }
-    if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+    int minLocationsNum = 1;
+    if (lastBlock.isStriped()) {
+      minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+    }
+    if (uc.getNumExpectedLocations() < minLocationsNum &&
+        lastBlock.getNumBytes() == 0) {

Review Comment: Thanks for your review. This log message is not very suitable; I will make some changes.
[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752606#comment-17752606 ] ASF GitHub Bot commented on HDFS-17150: --- hfutatzhanghb commented on code in PR #5937: URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289509777

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:

@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
         lastBlock.getBlockType());
     }
-    if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+    int minLocationsNum = 1;
+    if (lastBlock.isStriped()) {
+      minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+    }
+    if (uc.getNumExpectedLocations() < minLocationsNum &&
+        lastBlock.getNumBytes() == 0) {

Review Comment: @zhangshuyan0 Should we log this special case or not? https://github.com/apache/hadoop/blob/4e7cd881a92b7deb7d8d188b8c1dc85ca6e8ee5f/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java#L3819
[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752602#comment-17752602 ] ASF GitHub Bot commented on HDFS-17150: --- hfutatzhanghb commented on PR #5937: URL: https://github.com/apache/hadoop/pull/5937#issuecomment-1672500893

@zhangshuyan0 Thanks a lot for reporting this bug. This phenomenon also happens on our clusters. I have left some questions and hope to receive your reply. Thanks.
[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.
[ https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752600#comment-17752600 ] ASF GitHub Bot commented on HDFS-17150: --- zhangshuyan0 opened a new pull request, #5937: URL: https://github.com/apache/hadoop/pull/5937

EC: Fix the bug of failed lease recovery.

If the client crashes without writing the minimum number of internal blocks required by the EC policy, the lease recovery process for the corresponding unclosed file may continue to fail. Taking the RS(6,3) policy as an example, the timeline is as follows:
1. The client writes some data to only 5 datanodes;
2. The client crashes;
3. The NN fails over;
4. Now the result of `uc.getNumExpectedLocations()` depends entirely on block reports, and there are 5 datanodes reporting internal blocks;
5. When the lease exceeds the hard limit, the NN issues a block recovery command;
6. The datanode checks the command and finds that the number of internal blocks is insufficient, resulting in an exception and recovery failure: https://github.com/apache/hadoop/blob/b6edcb9a84ceac340c79cd692637b3e11c997cc5/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockRecoveryWorker.java#L534-L540
7. The lease exceeds the hard limit again, and the NN issues another block recovery command, but recovery fails again.
When the number of internal blocks written by the client is less than 6, the block group is actually unrecoverable. We should treat this situation the same as the zero-replica case for replicated files, i.e., directly remove the last block group and close the file.
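Step 6 of the timeline refers to a datanode-side safety check in BlockRecoveryWorker (linked in the PR description above). The standalone class below is a hypothetical simplification of that check, not the actual Hadoop code; the names `checkRecoverable`, `survivingInternalBlocks`, and `dataBlkNum` are illustrative:

```java
import java.io.IOException;

// Hypothetical simplification of the datanode-side check that step 6 refers
// to (BlockRecoveryWorker in the linked source); names are illustrative.
public class RecoveryCheck {

    // Aborts EC block recovery when fewer internal blocks survive than the
    // policy's data-unit count, since the block group cannot be reconstructed.
    static void checkRecoverable(int survivingInternalBlocks, int dataBlkNum)
            throws IOException {
        if (survivingInternalBlocks < dataBlkNum) {
            // Before HDFS-17150, the NN re-issued recovery on every hard-limit
            // expiry and hit this error each time, because the number of
            // surviving internal blocks can never grow on its own.
            throw new IOException("Found " + survivingInternalBlocks
                + " internal blocks but " + dataBlkNum + " are required");
        }
    }

    public static void main(String[] args) {
        try {
            // RS(6,3) block group with only 5 internal blocks: recovery fails.
            checkRecoverable(5, 6);
            System.out.println("recovery proceeds");
        } catch (IOException e) {
            System.out.println("recovery failed: " + e.getMessage());
        }
    }
}
```

This is why the fix lives on the NameNode side: once the check above can never pass, the only way out of the retry loop is for the NN to drop the empty last block group and close the file.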