[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-12-28 Thread Takanobu Asanuma (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17801009#comment-17801009
 ] 

Takanobu Asanuma commented on HDFS-17150:
-----------------------------------------

Cherry-picked to branch-3.3.

> EC: Fix the bug of failed lease recovery.
> -----------------------------------------
>
>                 Key: HDFS-17150
>                 URL: https://issues.apache.org/jira/browse/HDFS-17150
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Shuyan Zhang
>            Assignee: Shuyan Zhang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0, 3.3.9
>
>
> If the client crashes without writing the minimum number of internal blocks
> required by the EC policy, the lease recovery process for the corresponding
> unclosed file may keep failing. Taking the RS(6,3) policy as an example, the
> timeline is as follows:
> 1. The client writes some data to only 5 datanodes;
> 2. The client crashes;
> 3. The NN fails over;
> 4. Now the result of `uc.getNumExpectedLocations()` depends entirely on
> block reports, and 5 datanodes report internal blocks;
> 5. When the lease exceeds its hard limit, the NN issues a block recovery
> command;
> 6. The datanode checks the command, finds that the number of internal
> blocks is insufficient, and fails the recovery with an error;
> 7. The lease exceeds its hard limit again, the NN issues another block
> recovery command, and the recovery fails again, indefinitely.
> When the number of internal blocks written by the client is less than 6, the
> block group is actually unrecoverable. We should treat this situation the
> same as the case where a replicated file has 0 replicas, i.e., directly
> remove the last block group and close the file.
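
The fix merged in PR #5937 (the hunk is quoted in the review comments below) widens this empty-last-block check in `FSNamesystem#internalReleaseLease`. As a reading aid, here is a minimal, self-contained model of the repaired decision; the names (`decide`, `realDataBlockNum`, `Action`) are hypothetical stand-ins, and this is a sketch of the logic, not the actual NameNode code:

```java
// Toy model of the patched empty-last-block decision in lease recovery.
// All names are hypothetical; the real check lives in
// FSNamesystem#internalReleaseLease and uses BlockInfoStriped.
public class LeaseRecoveryDecision {

  enum Action { REMOVE_BLOCK_AND_CLOSE_FILE, START_BLOCK_RECOVERY }

  static Action decide(int numExpectedLocations, boolean isStriped,
                       int realDataBlockNum, long numBytes) {
    // Replicated files: one reported replica is enough to attempt recovery.
    // Striped files: fewer than realDataBlockNum internal blocks can never
    // be reconstructed, so issuing recovery commands would fail forever.
    int minLocationsNum = isStriped ? realDataBlockNum : 1;
    if (numExpectedLocations < minLocationsNum && numBytes == 0) {
      return Action.REMOVE_BLOCK_AND_CLOSE_FILE;
    }
    return Action.START_BLOCK_RECOVERY;
  }

  public static void main(String[] args) {
    // The RS(6,3) scenario above: only 5 of the required 6 datanodes report.
    System.out.println(decide(5, true, 6, 0));  // REMOVE_BLOCK_AND_CLOSE_FILE
    // A replicated file with one reported replica still goes through recovery.
    System.out.println(decide(1, false, 1, 0)); // START_BLOCK_RECOVERY
  }
}
```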



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754598#comment-17754598
 ] 

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

Hexiaoqiao commented on PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#issuecomment-1678808788

   Committed to trunk. Thanks @zhangshuyan0 for the work, and @haiyang1987 and
@hfutatzhanghb for the reviews!







[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754597#comment-17754597
 ] 

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

Hexiaoqiao merged PR #5937:
URL: https://github.com/apache/hadoop/pull/5937







[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754540#comment-17754540
 ] 

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

haiyang1987 commented on PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#issuecomment-1678720922

   LGTM. +1.
   
   







[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-15 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754445#comment-17754445
 ] 

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

hadoop-yetus commented on PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#issuecomment-1678509228

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Logfile | Comment |
   |:----:|----------:|:--------|:-------:|:-------:|
   | +0 :ok: | reexec | 0m 28s |  | Docker mode activated. |
   |||| _ Prechecks _ |
   | +1 :green_heart: | dupname | 0m 0s |  | No case conflicting files found. |
   | +0 :ok: | codespell | 0m 1s |  | codespell was not available. |
   | +0 :ok: | detsecrets | 0m 1s |  | detect-secrets was not available. |
   | +1 :green_heart: | @author | 0m 0s |  | The patch does not contain any @author tags. |
   | -1 :x: | test4tests | 0m 0s |  | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: | mvninstall | 32m 56s |  | trunk passed |
   | +1 :green_heart: | compile | 0m 54s |  | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
   | +1 :green_heart: | compile | 0m 50s |  | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | +1 :green_heart: | checkstyle | 0m 44s |  | trunk passed |
   | +1 :green_heart: | mvnsite | 0m 55s |  | trunk passed |
   | +1 :green_heart: | javadoc | 0m 51s |  | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
   | +1 :green_heart: | javadoc | 1m 12s |  | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | +1 :green_heart: | spotbugs | 2m 0s |  | trunk passed |
   | +1 :green_heart: | shadedclient | 22m 11s |  | branch has no errors when building and testing our client artifacts. |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: | mvninstall | 0m 47s |  | the patch passed |
   | +1 :green_heart: | compile | 0m 48s |  | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
   | +1 :green_heart: | javac | 0m 48s |  | the patch passed |
   | +1 :green_heart: | compile | 0m 43s |  | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | +1 :green_heart: | javac | 0m 43s |  | the patch passed |
   | +1 :green_heart: | blanks | 0m 0s |  | The patch has no blanks issues. |
   | -0 :warning: | checkstyle | 0m 34s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5937/3/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 110 unchanged - 0 fixed = 111 total (was 110) |
   | +1 :green_heart: | mvnsite | 0m 45s |  | the patch passed |
   | +1 :green_heart: | javadoc | 0m 40s |  | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
   | +1 :green_heart: | javadoc | 1m 4s |  | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | +1 :green_heart: | spotbugs | 1m 55s |  | the patch passed |
   | +1 :green_heart: | shadedclient | 22m 18s |  | patch has no errors when building and testing our client artifacts. |
   |||| _ Other Tests _ |
   | -1 :x: | unit | 183m 51s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5937/3/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
   | +1 :green_heart: | asflicense | 0m 39s |  | The patch does not generate ASF License warnings. |
   |  |  | 278m 3s |  |  |

   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.server.blockmanagement.TestBlockTokenWithShortCircuitRead |
   |  | hadoop.hdfs.TestFileChecksum |

   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5937/3/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5937 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 41e15ee1ded0 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / b0ce7e38eb84e7ffc0f14493ce3dbae8c8f7393c |
   | Default Java | Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | Multi-JDK versions |

[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754409#comment-17754409
 ] 

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

hfutatzhanghb commented on PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#issuecomment-1678450434

   LGTM. +1.







[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754372#comment-17754372
 ] 

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

zhangshuyan0 commented on PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#issuecomment-1678346266

   @Hexiaoqiao @haiyang1987 I have added some comments. Please take a look
when you have time.







[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-14 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17754030#comment-17754030
 ] 

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

haiyang1987 commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1293312263


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##########
@@ -3802,16 +3803,26 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
             lastBlock.getBlockType());
       }

-      if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+      int minLocationsNum = 1;
+      if (lastBlock.isStriped()) {
+        minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+      }
+      if (uc.getNumExpectedLocations() < minLocationsNum &&
+          lastBlock.getNumBytes() == 0) {
         // There is no datanode reported to this block.
         // may be client have crashed before writing data to pipeline.
         // This blocks doesn't need any recovery.
         // We can remove this block and close the file.
         pendingFile.removeLastBlock(lastBlock);
         finalizeINodeFileUnderConstruction(src, pendingFile,
             iip.getLatestSnapshotId(), false);
-        NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
-            + "Removed empty last block and closed file " + src);
+        if (uc.getNumExpectedLocations() == 0) {

Review Comment:
   Yeah, I understand your thinking.

   Perhaps it would be better to add a comment describing this.








[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17753917#comment-17753917
 ] 

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

zhangshuyan0 commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1293027160


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##########
@@ -3802,16 +3803,26 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
             lastBlock.getBlockType());
       }

-      if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+      int minLocationsNum = 1;
+      if (lastBlock.isStriped()) {
+        minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+      }
+      if (uc.getNumExpectedLocations() < minLocationsNum &&
+          lastBlock.getNumBytes() == 0) {
         // There is no datanode reported to this block.
         // may be client have crashed before writing data to pipeline.
         // This blocks doesn't need any recovery.
         // We can remove this block and close the file.
         pendingFile.removeLastBlock(lastBlock);
         finalizeINodeFileUnderConstruction(src, pendingFile,
             iip.getLatestSnapshotId(), false);
-        NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
-            + "Removed empty last block and closed file " + src);
+        if (uc.getNumExpectedLocations() == 0) {
+          NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
+              + "Removed empty last block and closed file " + src);
+        } else {
+          NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "

Review Comment:
   > If uc.getNumExpectedLocations() is 0, regardless of whether it is a
   > striped block or not, I think we should consider it an empty block, not
   > unrecoverable.

   How about adding some comments here?








[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-13 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17753889#comment-17753889
 ] 

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

Hexiaoqiao commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1292951771


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##########
@@ -3802,16 +3803,26 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
             lastBlock.getBlockType());
       }

-      if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+      int minLocationsNum = 1;
+      if (lastBlock.isStriped()) {
+        minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+      }
+      if (uc.getNumExpectedLocations() < minLocationsNum &&
+          lastBlock.getNumBytes() == 0) {
         // There is no datanode reported to this block.
         // may be client have crashed before writing data to pipeline.
         // This blocks doesn't need any recovery.
         // We can remove this block and close the file.
         pendingFile.removeLastBlock(lastBlock);
         finalizeINodeFileUnderConstruction(src, pendingFile,
             iip.getLatestSnapshotId(), false);
-        NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
-            + "Removed empty last block and closed file " + src);
+        if (uc.getNumExpectedLocations() == 0) {
+          NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
+              + "Removed empty last block and closed file " + src);
+        } else {
+          NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "

Review Comment:
   Totally true, but the readability is poor. Is there another way to improve
it, such as using `lastBlock.isStriped()`?
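
One way to read this suggestion, as a hedged sketch in the style of the hunk above (hypothetical wording, not necessarily what was merged): branch on the block type so the message stays unambiguous whatever the location count is.

```java
// Hypothetical alternative to the else-branch above: key the message off
// lastBlock.isStriped() rather than off the expected-location count.
if (lastBlock.isStriped()) {
  NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
      + "Removed last unrecoverable block group and closed file " + src);
} else {
  NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
      + "Removed empty last block and closed file " + src);
}
```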








[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17753465#comment-17753465
 ] 

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

zhangshuyan0 commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1292104709


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##########
@@ -3802,16 +3803,26 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
             lastBlock.getBlockType());
       }

-      if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+      int minLocationsNum = 1;
+      if (lastBlock.isStriped()) {
+        minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+      }
+      if (uc.getNumExpectedLocations() < minLocationsNum &&
+          lastBlock.getNumBytes() == 0) {
         // There is no datanode reported to this block.
         // may be client have crashed before writing data to pipeline.
         // This blocks doesn't need any recovery.
         // We can remove this block and close the file.
         pendingFile.removeLastBlock(lastBlock);
         finalizeINodeFileUnderConstruction(src, pendingFile,
             iip.getLatestSnapshotId(), false);
-        NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
-            + "Removed empty last block and closed file " + src);
+        if (uc.getNumExpectedLocations() == 0) {
+          NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
+              + "Removed empty last block and closed file " + src);
+        } else {
+          NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "

Review Comment:
   `uc.getNumExpectedLocations() != 0` means `minLocationsNum != 1`, so it must
be an EC file according to lines 3806-3809.








[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-11 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17753209#comment-17753209
 ] 

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

Hexiaoqiao commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1291256537


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##########
@@ -3802,16 +3803,26 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
             lastBlock.getBlockType());
       }

-      if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+      int minLocationsNum = 1;
+      if (lastBlock.isStriped()) {
+        minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+      }
+      if (uc.getNumExpectedLocations() < minLocationsNum &&
+          lastBlock.getNumBytes() == 0) {
         // There is no datanode reported to this block.
         // may be client have crashed before writing data to pipeline.
         // This blocks doesn't need any recovery.
         // We can remove this block and close the file.
         pendingFile.removeLastBlock(lastBlock);
         finalizeINodeFileUnderConstruction(src, pendingFile,
             iip.getLatestSnapshotId(), false);
-        NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
-            + "Removed empty last block and closed file " + src);
+        if (uc.getNumExpectedLocations() == 0) {
+          NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
+              + "Removed empty last block and closed file " + src);
+        } else {
+          NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "

Review Comment:
   It is a little weird unless we can determine that it is an EC file here when
`uc.getNumExpectedLocations() != 0`; if we cannot, this log message will be
ambiguous.








[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752717#comment-17752717
 ] 

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

hadoop-yetus commented on PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#issuecomment-1672842556

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Logfile | Comment |
   |:----:|----------:|:--------|:-------:|:-------:|
   | +0 :ok: | reexec | 0m 30s |  | Docker mode activated. |
   |||| _ Prechecks _ |
   | +1 :green_heart: | dupname | 0m 0s |  | No case conflicting files found. |
   | +0 :ok: | codespell | 0m 1s |  | codespell was not available. |
   | +0 :ok: | detsecrets | 0m 1s |  | detect-secrets was not available. |
   | +1 :green_heart: | @author | 0m 0s |  | The patch does not contain any @author tags. |
   | -1 :x: | test4tests | 0m 0s |  | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: | mvninstall | 34m 54s |  | trunk passed |
   | +1 :green_heart: | compile | 0m 52s |  | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
   | +1 :green_heart: | compile | 0m 50s |  | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | +1 :green_heart: | checkstyle | 0m 47s |  | trunk passed |
   | +1 :green_heart: | mvnsite | 0m 52s |  | trunk passed |
   | +1 :green_heart: | javadoc | 0m 48s |  | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
   | +1 :green_heart: | javadoc | 1m 8s |  | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | +1 :green_heart: | spotbugs | 1m 57s |  | trunk passed |
   | +1 :green_heart: | shadedclient | 27m 6s |  | branch has no errors when building and testing our client artifacts. |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: | mvninstall | 0m 49s |  | the patch passed |
   | +1 :green_heart: | compile | 0m 52s |  | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
   | +1 :green_heart: | javac | 0m 52s |  | the patch passed |
   | +1 :green_heart: | compile | 0m 45s |  | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | +1 :green_heart: | javac | 0m 45s |  | the patch passed |
   | +1 :green_heart: | blanks | 0m 0s |  | The patch has no blanks issues. |
   | -0 :warning: | checkstyle | 0m 38s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5937/2/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 110 unchanged - 0 fixed = 111 total (was 110) |
   | +1 :green_heart: | mvnsite | 0m 47s |  | the patch passed |
   | +1 :green_heart: | javadoc | 0m 41s |  | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
   | +1 :green_heart: | javadoc | 1m 16s |  | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | +1 :green_heart: | spotbugs | 2m 17s |  | the patch passed |
   | +1 :green_heart: | shadedclient | 27m 59s |  | patch has no errors when building and testing our client artifacts. |
   |||| _ Other Tests _ |
   | +1 :green_heart: | unit | 194m 35s |  | hadoop-hdfs in the patch passed. |
   | +1 :green_heart: | asflicense | 0m 39s |  | The patch does not generate ASF License warnings. |
   |  |  | 302m 11s |  |  |

   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5937/2/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5937 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux e3965eccf0d3 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 35eb68154529a0568e80d3ca94540225b1dda911 |
   | Default Java | Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5937/2/testReport/ |
   | Max. process+thread count | 3238 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-

[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752684#comment-17752684
 ] 

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

hadoop-yetus commented on PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#issuecomment-1672759197

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime | Logfile | Comment |
   |:----:|----------:|:--------|:-------:|:-------:|
   | +0 :ok: | reexec | 0m 29s |  | Docker mode activated. |
   |||| _ Prechecks _ |
   | +1 :green_heart: | dupname | 0m 0s |  | No case conflicting files found. |
   | +0 :ok: | codespell | 0m 1s |  | codespell was not available. |
   | +0 :ok: | detsecrets | 0m 1s |  | detect-secrets was not available. |
   | +1 :green_heart: | @author | 0m 0s |  | The patch does not contain any @author tags. |
   | -1 :x: | test4tests | 0m 0s |  | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
   |||| _ trunk Compile Tests _ |
   | +1 :green_heart: | mvninstall | 32m 0s |  | trunk passed |
   | +1 :green_heart: | compile | 0m 53s |  | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
   | +1 :green_heart: | compile | 0m 48s |  | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | +1 :green_heart: | checkstyle | 0m 44s |  | trunk passed |
   | +1 :green_heart: | mvnsite | 0m 55s |  | trunk passed |
   | +1 :green_heart: | javadoc | 0m 52s |  | trunk passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
   | +1 :green_heart: | javadoc | 1m 10s |  | trunk passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | +1 :green_heart: | spotbugs | 1m 58s |  | trunk passed |
   | +1 :green_heart: | shadedclient | 23m 55s |  | branch has no errors when building and testing our client artifacts. |
   |||| _ Patch Compile Tests _ |
   | +1 :green_heart: | mvninstall | 0m 46s |  | the patch passed |
   | +1 :green_heart: | compile | 0m 45s |  | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
   | +1 :green_heart: | javac | 0m 45s |  | the patch passed |
   | +1 :green_heart: | compile | 0m 38s |  | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | +1 :green_heart: | javac | 0m 38s |  | the patch passed |
   | +1 :green_heart: | blanks | 0m 0s |  | The patch has no blanks issues. |
   | +1 :green_heart: | checkstyle | 0m 32s |  | the patch passed |
   | +1 :green_heart: | mvnsite | 0m 42s |  | the patch passed |
   | +1 :green_heart: | javadoc | 0m 38s |  | the patch passed with JDK Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 |
   | +1 :green_heart: | javadoc | 1m 6s |  | the patch passed with JDK Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | +1 :green_heart: | spotbugs | 1m 52s |  | the patch passed |
   | +1 :green_heart: | shadedclient | 24m 48s |  | patch has no errors when building and testing our client artifacts. |
   |||| _ Other Tests _ |
   | -1 :x: | unit | 190m 50s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5937/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
   | +1 :green_heart: | asflicense | 0m 39s |  | The patch does not generate ASF License warnings. |
   |  |  | 288m 11s |  |  |

   | Reason | Tests |
   |-------:|:------|
   | Failed junit tests | hadoop.hdfs.TestFileChecksum |
   |  | hadoop.hdfs.server.namenode.ha.TestObserverNode |

   | Subsystem | Report/Notes |
   |----------:|:-------------|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5937/1/artifact/out/Dockerfile |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5937 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 068005fabb65 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 4e7cd881a92b7deb7d8d188b8c1dc85ca6e8ee5f |
   | Default Java | Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.20+8-post-Ubuntu-1ubuntu120.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_382-8u382-ga-1~20.04.1-b05 |
   | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5937/1/testReport/ |
   | Max. process+thread count | 4019

[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752671#comment-17752671
 ] 

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

zhangshuyan0 commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289707937


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##########
@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
             lastBlock.getBlockType());
       }

-      if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+      int minLocationsNum = 1;
+      if (lastBlock.isStriped()) {
+        minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+      }
+      if (uc.getNumExpectedLocations() < minLocationsNum &&

Review Comment:
   @hfutatzhanghb `AddBlockOp` only stores the blockId, numBytes, and
generationStamp of the last block.
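
This point is the crux of the failover behavior: because `AddBlockOp` persists no locations, the expected-location list lives only in NameNode memory and has to be rebuilt from block reports after a failover. A toy, self-contained model of that behavior follows; every class here is a hypothetical stand-in, not a Hadoop type:

```java
import java.util.ArrayList;
import java.util.List;

// Models what AddBlockOp persists to the edit log: identity only, no pipeline.
class AddBlockOpModel {
  final long blockId, numBytes, generationStamp;
  AddBlockOpModel(long id, long bytes, long gs) {
    blockId = id; numBytes = bytes; generationStamp = gs;
  }
}

// Models an under-construction block on the NameNode: the replica list is
// in-memory state, populated only by incoming block reports.
class UnderConstructionBlockModel {
  final AddBlockOpModel persisted;
  final List<String> expectedLocations = new ArrayList<>();
  UnderConstructionBlockModel(AddBlockOpModel op) { persisted = op; }
  void onBlockReport(String datanode) { expectedLocations.add(datanode); }
}

public class FailoverReplayDemo {
  public static void main(String[] args) {
    // Replaying the edit log recreates the block with zero known locations.
    UnderConstructionBlockModel uc = new UnderConstructionBlockModel(
        new AddBlockOpModel(1073741825L, 0L, 1001L));
    System.out.println(uc.expectedLocations.size()); // 0 until reports arrive
    // Only 5 of the 6 datanodes required by RS(6,3) ever report.
    for (int i = 1; i <= 5; i++) {
      uc.onBlockReport("dn" + i);
    }
    System.out.println(uc.expectedLocations.size()); // 5 < 6: unrecoverable
  }
}
```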








[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-10 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752648#comment-17752648
 ] 

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

hfutatzhanghb commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289651742


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##########
@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
             lastBlock.getBlockType());
       }

-      if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+      int minLocationsNum = 1;
+      if (lastBlock.isStriped()) {
+        minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+      }
+      if (uc.getNumExpectedLocations() < minLocationsNum &&

Review Comment:
   @zhangshuyan0 Hi, Shuyan. Please also check the code snippet below from
FSDirWriteFileOp#storeAllocatedBlock:
   ```java
   final BlockType blockType = pendingFile.getBlockType();
   // allocate new block, record block locations in INode.
   Block newBlock = fsn.createNewBlock(blockType);
   INodesInPath inodesInPath = INodesInPath.fromINode(pendingFile);
   saveAllocatedBlock(fsn, src, inodesInPath, newBlock, targets, blockType);

   persistNewBlock(fsn, src, pendingFile);
   ```
   Is `BlockUnderConstructionFeature#replicas` also written to the edit log,
since it is part of lastBlock? Thanks a lot.








[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752642#comment-17752642
 ] 

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

zhangshuyan0 commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289631865


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##########
@@ -3802,16 +3803,26 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
             lastBlock.getBlockType());
       }

-      if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+      int minLocationsNum = 1;
+      if (lastBlock.isStriped()) {
+        minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+      }
+      if (uc.getNumExpectedLocations() < minLocationsNum &&
+          lastBlock.getNumBytes() == 0) {
         // There is no datanode reported to this block.
         // may be client have crashed before writing data to pipeline.
         // This blocks doesn't need any recovery.
         // We can remove this block and close the file.
         pendingFile.removeLastBlock(lastBlock);
         finalizeINodeFileUnderConstruction(src, pendingFile,
             iip.getLatestSnapshotId(), false);
-        NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
-            + "Removed empty last block and closed file " + src);
+        if (uc.getNumExpectedLocations() == 0) {

Review Comment:
   If `uc.getNumExpectedLocations()` is 0, regardless of whether it is a
striped block or not, I think we should consider it an empty block, not
unrecoverable. So I think the previous code is better here. What's your
opinion?








[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752640#comment-17752640
 ] 

ASF GitHub Bot commented on HDFS-17150:
---------------------------------------

zhangshuyan0 commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289626419


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##########
@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
             lastBlock.getBlockType());
       }

-      if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+      int minLocationsNum = 1;
+      if (lastBlock.isStriped()) {
+        minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+      }
+      if (uc.getNumExpectedLocations() < minLocationsNum &&

Review Comment:
   > First, the writing process of EC files never calls `getAdditionalBlock`.
   > Second, after a NameNode failover, the content of
   > `BlockUnderConstructionFeature#replicas` depends entirely on block
   > reports (IBR or FBR).

   Sorry, I mistook `getAdditionalBlock` for `getAdditionalDatanode` just now,
but the conclusion still holds because the failover occurs.








[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752631#comment-17752631
 ] 

ASF GitHub Bot commented on HDFS-17150:
---

haiyang1987 commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289579716


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##########
@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
         lastBlock.getBlockType());
       }
 
-      if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+      int minLocationsNum = 1;
+      if (lastBlock.isStriped()) {
+        minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+      }
+      if (uc.getNumExpectedLocations() < minLocationsNum &&

Review Comment:
   Here I think the writing process of EC files will call `getAdditionalBlock`
   and set `ReplicaUnderConstruction[] replicas`. But when the standby NameNode
   takes over after an HA failover, is `BlockUnderConstructionFeature#replicas`
   then populated entirely from block reports (IBR or FBR)?








[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752630#comment-17752630
 ] 

ASF GitHub Bot commented on HDFS-17150:
---

hfutatzhanghb commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289579189


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##########
@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
         lastBlock.getBlockType());
       }
 
-      if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+      int minLocationsNum = 1;
+      if (lastBlock.isStriped()) {
+        minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+      }
+      if (uc.getNumExpectedLocations() < minLocationsNum &&

Review Comment:
   @zhangshuyan0 Thanks a lot. I had overlooked the failover case.








[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752626#comment-17752626
 ] 

ASF GitHub Bot commented on HDFS-17150:
---

haiyang1987 commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289559249


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##########
@@ -3802,16 +3803,26 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
         lastBlock.getBlockType());
       }
 
-      if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+      int minLocationsNum = 1;
+      if (lastBlock.isStriped()) {
+        minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+      }
+      if (uc.getNumExpectedLocations() < minLocationsNum &&
+          lastBlock.getNumBytes() == 0) {
         // There is no datanode reported to this block.
         // may be client have crashed before writing data to pipeline.
         // This blocks doesn't need any recovery.
         // We can remove this block and close the file.
         pendingFile.removeLastBlock(lastBlock);
         finalizeINodeFileUnderConstruction(src, pendingFile,
             iip.getLatestSnapshotId(), false);
-        NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
-            + "Removed empty last block and closed file " + src);
+        if (uc.getNumExpectedLocations() == 0) {

Review Comment:
   How about updating this log logic as follows?
   ```
   if (lastBlock.isStriped()) {
     NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
         + "Removed last unrecoverable block group and closed file " + src);
   } else {
     NameNode.stateChangeLog.warn("BLOCK* internalReleaseLease: "
         + "Removed empty last block and closed file " + src);
   }
   ```








[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752615#comment-17752615
 ] 

ASF GitHub Bot commented on HDFS-17150:
---

zhangshuyan0 commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289529656


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##########
@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
         lastBlock.getBlockType());
       }
 
-      if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+      int minLocationsNum = 1;
+      if (lastBlock.isStriped()) {
+        minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+      }
+      if (uc.getNumExpectedLocations() < minLocationsNum &&

Review Comment:
   First, the writing process of EC files never calls `getAdditionalBlock`.
   Second, after a failover in the NameNode, the content of
   `BlockUnderConstructionFeature#replicas` depends entirely on block reports
   (IBR or FBR).
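
   As a toy model of this point (purely illustrative, not Hadoop code): after a
   failover, the new active NameNode can only count the storages that have
   actually reported an internal block of the group, so the expected-locations
   count can legitimately sit below the EC data-unit count:
   ```
   import java.util.HashSet;
   import java.util.Set;

   // Toy model: the new active NN rebuilds its view of an under-construction
   // EC block group solely from incremental/full block reports.
   final class FailoverLocationsModel {
     public static void main(String[] args) {
       final int dataUnits = 6; // RS(6,3): k = 6 data units needed to decode
       Set<String> reportedStorages = new HashSet<>();
       // The crashed client only ever wrote 5 internal blocks, so at most
       // 5 datanodes can report one after the failover.
       for (int dn = 0; dn < 5; dn++) {
         reportedStorages.add("dn-" + dn);
       }
       int numExpectedLocations = reportedStorages.size();
       System.out.println(numExpectedLocations >= dataUnits
           ? "block recovery can proceed"
           : "unrecoverable: " + numExpectedLocations + " of "
               + dataUnits + " required internal blocks");
     }
   }
   ```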








[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752612#comment-17752612
 ] 

ASF GitHub Bot commented on HDFS-17150:
---

hfutatzhanghb commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289519764


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##########
@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
         lastBlock.getBlockType());
       }
 
-      if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+      int minLocationsNum = 1;
+      if (lastBlock.isStriped()) {
+        minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+      }
+      if (uc.getNumExpectedLocations() < minLocationsNum &&

Review Comment:
   Hi @zhangshuyan0, I also have a question here: is
   `uc.getNumExpectedLocations()` always `>=` `minLocationsNum`? Because when
   the `getAdditionalBlock` RPC is invoked, the `replicas` field in
   `BlockUnderConstructionFeature` has already been set to the targets.








[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752611#comment-17752611
 ] 

ASF GitHub Bot commented on HDFS-17150:
---

zhangshuyan0 commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289517665


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##########
@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
         lastBlock.getBlockType());
       }
 
-      if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+      int minLocationsNum = 1;
+      if (lastBlock.isStriped()) {
+        minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+      }
+      if (uc.getNumExpectedLocations() < minLocationsNum &&
+          lastBlock.getNumBytes() == 0) {

Review Comment:
   Thanks for your review. This log message is not quite suitable; I will make
   some changes.








[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752606#comment-17752606
 ] 

ASF GitHub Bot commented on HDFS-17150:
---

hfutatzhanghb commented on code in PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#discussion_r1289509777


##########
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java:
##########
@@ -3802,7 +3803,12 @@ boolean internalReleaseLease(Lease lease, String src, INodesInPath iip,
         lastBlock.getBlockType());
       }
 
-      if (uc.getNumExpectedLocations() == 0 && lastBlock.getNumBytes() == 0) {
+      int minLocationsNum = 1;
+      if (lastBlock.isStriped()) {
+        minLocationsNum = ((BlockInfoStriped) lastBlock).getRealDataBlockNum();
+      }
+      if (uc.getNumExpectedLocations() < minLocationsNum &&
+          lastBlock.getNumBytes() == 0) {

Review Comment:
   @zhangshuyan0 Should we log this special case or not?
https://github.com/apache/hadoop/blob/4e7cd881a92b7deb7d8d188b8c1dc85ca6e8ee5f/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java#L3819
   








[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752602#comment-17752602
 ] 

ASF GitHub Bot commented on HDFS-17150:
---

hfutatzhanghb commented on PR #5937:
URL: https://github.com/apache/hadoop/pull/5937#issuecomment-1672500893

   @zhangshuyan0 Thanks a lot for reporting this bug. This phenomenon also
   happens on our clusters. I have left some questions and hope to receive your
   reply. Thanks.







[jira] [Commented] (HDFS-17150) EC: Fix the bug of failed lease recovery.

2023-08-09 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17150?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17752600#comment-17752600
 ] 

ASF GitHub Bot commented on HDFS-17150:
---

zhangshuyan0 opened a new pull request, #5937:
URL: https://github.com/apache/hadoop/pull/5937

   EC: Fix the bug of failed lease recovery.
   
   If the client crashes without writing the minimum number of internal blocks 
required by the EC policy, the lease recovery process for the corresponding 
unclosed file may continue to fail. Taking RS(6,3) policy as an example, the 
timeline is as follows:
   1. The client writes some data to only 5 datanodes;
   2. Client crashes;
   3. NN fails over;
   4. Now the result of `uc.getNumExpectedLocations()` depends entirely on
block reports, and there are 5 datanodes reporting internal blocks;
   5. When the lease hard limit expires, the NN issues a block recovery command;
   6. The datanode checks the command and finds that the number of internal 
blocks is insufficient, resulting in an exception and recovery failure;
   
https://github.com/apache/hadoop/blob/b6edcb9a84ceac340c79cd692637b3e11c997cc5/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BlockRecoveryWorker.java#L534-L540
   7. The lease hard limit expires again, and the NN issues a block recovery
command again, but the recovery fails again, and so on indefinitely.
   
   When the number of internal blocks written by the client is less than 6, the 
block group is actually unrecoverable. We should equate this situation to the 
case where the number of replicas is 0 when processing replica files, i.e., 
directly remove the last block group and close the file.
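
   For step 6, a simplified sketch of the kind of count check the striped
   recovery task performs at the BlockRecoveryWorker lines linked above; the
   class name, method name, and message text here are assumptions for
   illustration, not the exact Hadoop code:
   ```
   import java.io.IOException;

   // Sketch: a striped block group can only be reconstructed from at least
   // k (= data-unit count) internal blocks, so the recovery task aborts early.
   final class StripedRecoveryPrecheck {
     static void checkRecoverable(int reportedInternalBlocks, int dataBlockNum)
         throws IOException {
       if (reportedInternalBlocks < dataBlockNum) {
         throw new IOException("Found " + reportedInternalBlocks
             + " internal blocks but " + dataBlockNum + " are required");
       }
     }

     public static void main(String[] args) throws IOException {
       // RS(6,3) with only 5 internal blocks: this always throws, which is why
       // each reissued recovery command fails until the NN drops the group.
       checkRecoverable(5, 6);
     }
   }
   ```
   Since nothing on the datanode side can ever raise that count, the
   NameNode-side change (dropping the block group and closing the file) is the
   only way to break the retry loop.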
   
   







--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org