[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744882#comment-17744882
 ] 

ASF GitHub Bot commented on HDFS-17093:
---

Tre2878 commented on PR #5856:
URL: https://github.com/apache/hadoop/pull/5856#issuecomment-1643253493

   @Hexiaoqiao Ok, I've committed to the original branch




> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Priority: Minor
>  Labels: pull-request-available
>
> In our cluster of 800+ nodes, after restarting the namenode we found that 
> some datanodes did not report all of their blocks, causing the namenode to 
> stay in safe mode for a long time after the restart because of the incomplete 
> block reports.
> In the logs of a datanode with an incomplete block report, I found that the 
> first FBR attempt failed, possibly due to namenode load, and that a second 
> FBR attempt was then made:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There is nothing wrong with that on the datanode side: it retries the send 
> when it fails. But the namenode-side logic is:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a storage is identified as having already reported (that is, 
> storageInfo.getBlockReportCount() > 0), the namenode removes the datanode's 
> lease, so the second report fails because no lease is left.
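To make the failure mode concrete, the sketch below reuses the identifiers from the snippet quoted above; the only change is the guard around removeLease(). It is an illustration of the idea (keep the lease until every storage on the node has reported), not necessarily what the linked PR does.
{code:java}
// Sketch only, in the same context as the snippet above. The idea: still
// discard the non-initial report during startup safe mode, but keep the lease
// while some storages on this datanode have not yet reported, so a datanode
// retrying a partially sent first FBR is not left without a lease.
if (namesystem.isInStartupSafeMode()
    && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
    && storageInfo.getBlockReportCount() > 0) {
  blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
      + "discarded non-initial block report from {}"
      + " because namenode still in startup phase",
      strBlockReportId, fullBrLeaseId, nodeID);
  if (!node.hasStaleStorages()) {
    // Every storage has reported at least once; only now is it safe to
    // release the datanode's block-report lease.
    blockReportLeaseManager.removeLease(node);
  }
  return !node.hasStaleStorages();
}
{code}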



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-19 Thread Yanlei Yu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744861#comment-17744861
 ] 

Yanlei Yu commented on HDFS-17093:
--

{quote}Please push the update to your same original branch (in this case 
Tre2878:HDFS-17093); DO NOT open a repeat pull request for the same issue.
{quote}
Sorry, I have now made the changes and committed them in the original branch

> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Priority: Minor
>  Labels: pull-request-available
>
> In our cluster of 800+ nodes, after restarting the namenode we found that 
> some datanodes did not report all of their blocks, causing the namenode to 
> stay in safe mode for a long time after the restart because of the incomplete 
> block reports.
> In the logs of a datanode with an incomplete block report, I found that the 
> first FBR attempt failed, possibly due to namenode load, and that a second 
> FBR attempt was then made:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There is nothing wrong with that on the datanode side: it retries the send 
> when it fails. But the namenode-side logic is:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a storage is identified as having already reported (that is, 
> storageInfo.getBlockReportCount() > 0), the namenode removes the datanode's 
> lease, so the second report fails because no lease is left.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744854#comment-17744854
 ] 

ASF GitHub Bot commented on HDFS-17094:
---

tomscut commented on PR #5854:
URL: https://github.com/apache/hadoop/pull/5854#issuecomment-1643035843

   > @tomscut This PR can be cherry-picked to branch-3.3 smoothly. Please 
cherry-pick it directly if you think it also needs to be fixed for branch-3.3, 
rather than submitting another PR. Thanks.
   
   OK, I have backported this to branch-3.3. I thought it would be safer to 
trigger Jenkins, but for this PR it's really not necessary. Thank you for your 
advice.




> EC: Fix bug in block recovery when there are stale datanodes
> 
>
> Key: HDFS-17094
> URL: https://issues.apache.org/jira/browse/HDFS-17094
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When a block recovery occurs, `RecoveryTaskStriped` in datanode expects 
> `rBlock.getLocations()` and `rBlock.getBlockIndices()` to be in one-to-one 
> correspondence. However, if there are locations in stale state when NameNode 
> handles heartbeat, this correspondence will be disrupted. In detail, there is 
> no stale location in `recoveryLocations`, but the block indices array is 
> still complete (i.e. contains the indices of all the locations). This will 
> cause `BlockRecoveryWorker.RecoveryTaskStriped#recover` to generate a wrong 
> internal block ID, and the corresponding datanode cannot find the replica, 
> thus making the recovery process fail. This bug needs to be fixed.
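To make the correspondence problem concrete, here is a minimal, simplified sketch (plain Java collections and hypothetical names stand in for the HDFS-internal types; this is not the committed patch): whatever filters stale locations out of `recoveryLocations` must drop the matching block indices in the same pass, so that locations and indices stay aligned.
{code:java}
import java.util.List;
import java.util.Set;

// Sketch only: "locations" and "indices" are parallel lists, as the
// description says rBlock.getLocations() and rBlock.getBlockIndices() are.
final class StaleFilterSketch {
  static void filterStale(List<String> locations, List<Byte> indices,
      Set<String> staleNodes,
      List<String> recoveryLocations, List<Byte> recoveryIndices) {
    for (int i = 0; i < locations.size(); i++) {
      if (staleNodes.contains(locations.get(i))) {
        continue; // drop the stale location AND its block index together
      }
      recoveryLocations.add(locations.get(i));
      recoveryIndices.add(indices.get(i)); // keeps the one-to-one mapping
    }
  }
}
{code}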



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17111) RBF: Optimize msync to only call nameservices that have observer reads enabled.

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744850#comment-17744850
 ] 

ASF GitHub Bot commented on HDFS-17111:
---

hadoop-yetus commented on PR #5860:
URL: https://github.com/apache/hadoop/pull/5860#issuecomment-1643028418

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 56s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  50m  8s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 42s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  compile  |   0m 38s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  checkstyle  |   0m 29s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 43s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 42s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 30s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   1m 27s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  38m 49s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 34s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 34s |  |  the patch passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javac  |   0m 34s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 31s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  javac  |   0m 31s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 18s |  |  
hadoop-hdfs-project/hadoop-hdfs-rbf: The patch generated 0 new + 0 unchanged - 
2 fixed = 0 total (was 2)  |
   | +1 :green_heart: |  mvnsite  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 29s |  |  the patch passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 23s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   1m 23s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  40m  1s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  21m 27s |  |  hadoop-hdfs-rbf in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 37s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 166m 27s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/6/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5860 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 56fb1db7c75c 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 
13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 6e81175b5dfc13c8f1bc09604bd08b02023e1d06 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/6/testReport/ |
   | Max. process+thread count | 2609 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: 
hadoop-hdfs-project/hadoop-hdfs-rbf |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/6/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 

[jira] [Commented] (HDFS-17111) RBF: Optimize msync to only call nameservices that have observer reads enabled.

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744845#comment-17744845
 ] 

ASF GitHub Bot commented on HDFS-17111:
---

hadoop-yetus commented on PR #5860:
URL: https://github.com/apache/hadoop/pull/5860#issuecomment-1643006475

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 49s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  50m 13s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 53s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  compile  |   0m 37s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  checkstyle  |   0m 31s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 42s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 41s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 30s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   1m 26s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  39m  3s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 36s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 35s |  |  the patch passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javac  |   0m 35s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 29s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  javac  |   0m 29s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   0m 18s |  |  
hadoop-hdfs-project/hadoop-hdfs-rbf: The patch generated 0 new + 0 unchanged - 
2 fixed = 0 total (was 2)  |
   | +1 :green_heart: |  mvnsite  |   0m 32s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 29s |  |  the patch passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 23s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   1m 22s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  39m  8s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  |  21m 48s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt)
 |  hadoop-hdfs-rbf in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 36s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 165m 46s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.federation.router.TestRouter |
   |   | 
hadoop.hdfs.server.federation.fairness.TestRouterRefreshFairnessPolicyController
 |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/5/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5860 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 16b73b494aa3 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 
13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 732691ee8552a3ad24e7642266a74c05321efbae |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
   |  Test Results | 

[jira] [Commented] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744839#comment-17744839
 ] 

ASF GitHub Bot commented on HDFS-17094:
---

Hexiaoqiao commented on PR #5854:
URL: https://github.com/apache/hadoop/pull/5854#issuecomment-1642997064

   @tomscut This PR can be cherry-picked to branch-3.3 smoothly. Please 
cherry-pick it directly if you think it also needs to be fixed for branch-3.3, 
rather than submitting another PR. Thanks.




> EC: Fix bug in block recovery when there are stale datanodes
> 
>
> Key: HDFS-17094
> URL: https://issues.apache.org/jira/browse/HDFS-17094
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When a block recovery occurs, `RecoveryTaskStriped` in datanode expects 
> `rBlock.getLocations()` and `rBlock.getBlockIndices()` to be in one-to-one 
> correspondence. However, if there are locations in stale state when NameNode 
> handles heartbeat, this correspondence will be disrupted. In detail, there is 
> no stale location in `recoveryLocations`, but the block indices array is 
> still complete (i.e. contains the indices of all the locations). This will 
> cause `BlockRecoveryWorker.RecoveryTaskStriped#recover` to generate a wrong 
> internal block ID, and the corresponding datanode cannot find the replica, 
> thus making the recovery process fail. This bug needs to be fixed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744838#comment-17744838
 ] 

ASF GitHub Bot commented on HDFS-17094:
---

zhangshuyan0 commented on PR #5854:
URL: https://github.com/apache/hadoop/pull/5854#issuecomment-1642992778

   > @zhangshuyan0 Could you please backport this to branch-3.3? Thanks!
   
   Ok, I'll do this later.




> EC: Fix bug in block recovery when there are stale datanodes
> 
>
> Key: HDFS-17094
> URL: https://issues.apache.org/jira/browse/HDFS-17094
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When a block recovery occurs, `RecoveryTaskStriped` in datanode expects 
> `rBlock.getLocations()` and `rBlock.getBlockIndices()` to be in one-to-one 
> correspondence. However, if there are locations in stale state when NameNode 
> handles heartbeat, this correspondence will be disrupted. In detail, there is 
> no stale location in `recoveryLocations`, but the block indices array is 
> still complete (i.e. contains the indices of all the locations). This will 
> cause `BlockRecoveryWorker.RecoveryTaskStriped#recover` to generate a wrong 
> internal block ID, and the corresponding datanode cannot find the replica, 
> thus making the recovery process fail. This bug needs to be fixed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17092) Datanode Full Block Report failed can lead to missing and under replicated blocks

2023-07-19 Thread Tao Li (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744836#comment-17744836
 ] 

Tao Li commented on HDFS-17092:
---

This seems to be a duplicate of HDFS-17093.

> Datanode Full Block Report failed can lead to missing and under replicated 
> blocks
> -
>
> Key: HDFS-17092
> URL: https://issues.apache.org/jira/browse/HDFS-17092
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: datanode
>Reporter: microle.dong
>Priority: Major
>
> When restarting the namenode, we found that some datanodes did not report all 
> of their blocks, which can lead to missing and under-replicated blocks. 
> Datanodes use multiple RPCs to report blocks. In the logs of a datanode with 
> an incomplete block report, I found that the first FBR attempt failed due to a 
> namenode error:
>  
> {code:java}
> 2023-07-14 17:29:24,776 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x7b738b02996cd2,  containing 12 storage 
> report(s), of which we sent 1. The reports had 633013 total blocks and used 1 
> RPC(s). This took 234 msec to generate and 98739 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-14 17:29:24,776 WARN org.apache.hadoop.hdfs.server.datanode.DataNode: 
> IOException in offerService
> java.net.SocketTimeoutException: Call From x.x.x.x/x.x.x.x to x.x.x.x:9002 
> failed on socket timeout exception: java.net.SocketTimeoutException: 6 
> millis timeout while waiting for channel to be ready for read. ch : 
> java.nio.channels.SocketChannel[connected local=/x.x.x.x:13868 
> remote=x.x.x.x/x.x.x.x:9002]; For more details see:  
> http://wiki.apache.org/hadoop/SocketTimeout 
>         at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
>         at 
> sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
>         at 
> sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
>         at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
>         at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:863)
>         at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:822)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1480)
>         at org.apache.hadoop.ipc.Client.call(Client.java:1413)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
>         at com.sun.proxy.$Proxy14.blockReport(Unknown Source)
>         at 
> org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:205)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:333)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:572)
>         at 
> org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:706)
>         at java.lang.Thread.run(Thread.java:745){code}
> The datanode's second FBR will use the same lease, which makes the namenode 
> remove the datanode's lease (just as in HDFS-8930), so the remaining FBR RPCs 
> fail because no lease is left.
> We should request a new lease and retry when a datanode's FBR fails.
>  
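As an illustration of that last suggestion (hypothetical types and names, not the actual BPServiceActor code), the idea is simply that a failed FBR should discard the lease ID it was using and obtain a fresh lease before the retry:
{code:java}
import java.io.IOException;

// Sketch only: hypothetical helper interface, not the real datanode code.
class FbrRetrySketch {
  private long fullBlockReportLeaseId; // 0 means "no lease held"

  void reportAllBlocks(BlockReporter reporter, int maxAttempts)
      throws IOException {
    for (int attempt = 1; ; attempt++) {
      try {
        reporter.sendReports(fullBlockReportLeaseId);
        return; // report accepted
      } catch (IOException e) {
        if (attempt >= maxAttempts) {
          throw e;
        }
        // Drop the (possibly revoked) lease and ask the namenode for a
        // fresh one before retrying the full block report.
        fullBlockReportLeaseId = reporter.requestNewLease();
      }
    }
  }

  interface BlockReporter {
    void sendReports(long leaseId) throws IOException;
    long requestNewLease();
  }
}
{code}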



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-19 Thread Yanlei Yu (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yanlei Yu updated HDFS-17093:
-
Attachment: (was: HDFS-17093.patch)

> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Priority: Minor
>  Labels: pull-request-available
>
> In our cluster of 800+ nodes, after restarting the namenode we found that 
> some datanodes did not report all of their blocks, causing the namenode to 
> stay in safe mode for a long time after the restart because of the incomplete 
> block reports.
> In the logs of a datanode with an incomplete block report, I found that the 
> first FBR attempt failed, possibly due to namenode load, and that a second 
> FBR attempt was then made:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There is nothing wrong with that on the datanode side: it retries the send 
> when it fails. But the namenode-side logic is:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a storage is identified as having already reported (that is, 
> storageInfo.getBlockReportCount() > 0), the namenode removes the datanode's 
> lease, so the second report fails because no lease is left.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744835#comment-17744835
 ] 

ASF GitHub Bot commented on HDFS-17094:
---

tomscut commented on PR #5854:
URL: https://github.com/apache/hadoop/pull/5854#issuecomment-1642968772

   @zhangshuyan0 Could you please backport this to branch-3.3? Thanks!




> EC: Fix bug in block recovery when there are stale datanodes
> 
>
> Key: HDFS-17094
> URL: https://issues.apache.org/jira/browse/HDFS-17094
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When a block recovery occurs, `RecoveryTaskStriped` in datanode expects 
> `rBlock.getLocations()` and `rBlock.getBlockIndices()` to be in one-to-one 
> correspondence. However, if there are locations in stale state when NameNode 
> handles heartbeat, this correspondence will be disrupted. In detail, there is 
> no stale location in `recoveryLocations`, but the block indices array is 
> still complete (i.e. contains the indices of all the locations). This will 
> cause `BlockRecoveryWorker.RecoveryTaskStriped#recover` to generate a wrong 
> internal block ID, and the corresponding datanode cannot find the replica, 
> thus making the recovery process fail. This bug needs to be fixed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Resolved] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes

2023-07-19 Thread Tao Li (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tao Li resolved HDFS-17094.
---
Fix Version/s: 3.4.0
   Resolution: Fixed

> EC: Fix bug in block recovery when there are stale datanodes
> 
>
> Key: HDFS-17094
> URL: https://issues.apache.org/jira/browse/HDFS-17094
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> When a block recovery occurs, `RecoveryTaskStriped` in datanode expects 
> `rBlock.getLocations()` and `rBlock.getBlockIndices()` to be in one-to-one 
> correspondence. However, if there are locations in stale state when NameNode 
> handles heartbeat, this correspondence will be disrupted. In detail, there is 
> no stale location in `recoveryLocations`, but the block indices array is 
> still complete (i.e. contains the indices of all the locations). This will 
> cause `BlockRecoveryWorker.RecoveryTaskStriped#recover` to generate a wrong 
> internal block ID, and the corresponding datanode cannot find the replica, 
> thus making the recovery process fail. This bug needs to be fixed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744834#comment-17744834
 ] 

ASF GitHub Bot commented on HDFS-17094:
---

tomscut commented on PR #5854:
URL: https://github.com/apache/hadoop/pull/5854#issuecomment-1642967557

   Thanks @zhangshuyan0 for your contribution! Thanks @Hexiaoqiao for your 
review!




> EC: Fix bug in block recovery when there are stale datanodes
> 
>
> Key: HDFS-17094
> URL: https://issues.apache.org/jira/browse/HDFS-17094
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
>
> When a block recovery occurs, `RecoveryTaskStriped` in datanode expects 
> `rBlock.getLocations()` and `rBlock.getBlockIndices()` to be in one-to-one 
> correspondence. However, if there are locations in stale state when NameNode 
> handles heartbeat, this correspondence will be disrupted. In detail, there is 
> no stale location in `recoveryLocations`, but the block indices array is 
> still complete (i.e. contains the indices of all the locations). This will 
> cause `BlockRecoveryWorker.RecoveryTaskStriped#recover` to generate a wrong 
> internal block ID, and the corresponding datanode cannot find the replica, 
> thus making the recovery process fail. This bug needs to be fixed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744833#comment-17744833
 ] 

ASF GitHub Bot commented on HDFS-17094:
---

tomscut merged PR #5854:
URL: https://github.com/apache/hadoop/pull/5854




> EC: Fix bug in block recovery when there are stale datanodes
> 
>
> Key: HDFS-17094
> URL: https://issues.apache.org/jira/browse/HDFS-17094
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
>
> When a block recovery occurs, `RecoveryTaskStriped` in datanode expects 
> `rBlock.getLocations()` and `rBlock.getBlockIndices()` to be in one-to-one 
> correspondence. However, if there are locations in stale state when NameNode 
> handles heartbeat, this correspondence will be disrupted. In detail, there is 
> no stale location in `recoveryLocations`, but the block indices array is 
> still complete (i.e. contains the indices of all the locations). This will 
> cause `BlockRecoveryWorker.RecoveryTaskStriped#recover` to generate a wrong 
> internal block ID, and the corresponding datanode cannot find the replica, 
> thus making the recovery process fail. This bug needs to be fixed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17111) RBF: Optimize msync to only call nameservices that have observer reads enabled.

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744820#comment-17744820
 ] 

ASF GitHub Bot commented on HDFS-17111:
---

hadoop-yetus commented on PR #5860:
URL: https://github.com/apache/hadoop/pull/5860#issuecomment-1642893412

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   1m 12s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  2s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  2s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  51m 48s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 55s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  compile  |   0m 45s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  checkstyle  |   0m 32s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 47s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 47s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 33s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   1m 35s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  40m  4s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 35s |  |  the patch passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javac  |   0m 35s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 30s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  javac  |   0m 30s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  1s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 18s | 
[/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/3/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt)
 |  hadoop-hdfs-project/hadoop-hdfs-rbf: The patch generated 2 new + 2 
unchanged - 0 fixed = 4 total (was 2)  |
   | +1 :green_heart: |  mvnsite  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 29s |  |  the patch passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 23s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   1m 25s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  39m 21s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  |  22m 47s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/3/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt)
 |  hadoop-hdfs-rbf in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 37s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 170m 45s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.federation.router.TestRouter |
   |   | 
hadoop.hdfs.server.federation.fairness.TestRouterRefreshFairnessPolicyController
 |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/3/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5860 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 65e197c55b1e 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 
13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 350292a609d80aaee0ad44de4dd0e1b8393a63fa |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
   | Multi-JDK versions | 

[jira] [Commented] (HDFS-17111) RBF: Optimize msync to only call nameservices that have observer reads enabled.

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744815#comment-17744815
 ] 

ASF GitHub Bot commented on HDFS-17111:
---

hadoop-yetus commented on PR #5860:
URL: https://github.com/apache/hadoop/pull/5860#issuecomment-1642886100

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 39s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  46m  1s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 43s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  compile  |   0m 42s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  checkstyle  |   0m 36s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 47s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 49s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 37s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   1m 29s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  33m 36s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 36s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 35s |  |  the patch passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javac  |   0m 35s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 33s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  javac  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 21s | 
[/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/4/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt)
 |  hadoop-hdfs-project/hadoop-hdfs-rbf: The patch generated 2 new + 2 
unchanged - 0 fixed = 4 total (was 2)  |
   | +1 :green_heart: |  mvnsite  |   0m 35s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 32s |  |  the patch passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 26s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   1m 23s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  34m 16s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  |  21m 32s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/4/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt)
 |  hadoop-hdfs-rbf in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 42s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 152m  7s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | 
hadoop.hdfs.server.federation.router.TestRouterRPCMultipleDestinationMountTableResolver
 |
   |   | 
hadoop.hdfs.server.federation.fairness.TestRouterRefreshFairnessPolicyController
 |
   |   | hadoop.hdfs.server.federation.router.TestRouter |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/4/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5860 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 767001fb66ff 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 
13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 66a10a183bc0b6eb1d669cc0854ba8dd0bff1154 |
   | Default Java | Private 

[jira] [Commented] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744814#comment-17744814
 ] 

ASF GitHub Bot commented on HDFS-17094:
---

hadoop-yetus commented on PR #5854:
URL: https://github.com/apache/hadoop/pull/5854#issuecomment-1642880725

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 42s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  49m 27s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 27s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  compile  |   1m 21s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  checkstyle  |   1m 14s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 29s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 12s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 38s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   3m 20s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  35m 43s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 16s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 12s |  |  the patch passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javac  |   1m 12s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 11s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  javac  |   1m 11s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   1m  1s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 16s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 57s |  |  the patch passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 31s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   3m 14s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  36m  2s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  | 215m 51s |  |  hadoop-hdfs in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 58s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 362m 17s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5854/3/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5854 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 3913e9b84c85 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 
13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 446ddffc53cb891e0a410bd76a6864666f22ff11 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5854/3/testReport/ |
   | Max. process+thread count | 3028 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5854/3/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




> EC: Fix bug in 

[jira] [Commented] (HDFS-17111) RBF: Optimize msync to only call nameservices that have observer reads enabled.

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744813#comment-17744813
 ] 

ASF GitHub Bot commented on HDFS-17111:
---

hadoop-yetus commented on PR #5860:
URL: https://github.com/apache/hadoop/pull/5860#issuecomment-1642870547

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 38s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  47m  2s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 43s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  compile  |   0m 44s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  checkstyle  |   0m 35s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 46s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 49s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 37s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   1m 28s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  34m  4s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 36s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 34s |  |  the patch passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javac  |   0m 34s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 31s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  javac  |   0m 31s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 20s | 
[/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/2/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt)
 |  hadoop-hdfs-project/hadoop-hdfs-rbf: The patch generated 2 new + 2 
unchanged - 0 fixed = 4 total (was 2)  |
   | +1 :green_heart: |  mvnsite  |   0m 34s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 31s |  |  the patch passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 26s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   1m 23s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  34m 36s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  |  21m  6s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt)
 |  hadoop-hdfs-rbf in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 40s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 153m  4s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.server.federation.router.TestRouter |
   |   | 
hadoop.hdfs.server.federation.fairness.TestRouterRefreshFairnessPolicyController
 |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5860 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 2057df3c064a 4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 
13:30:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / d5e218d17f658bafb0005e39e815ea9e8fc24bb5 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
   | Multi-JDK versions | 

[jira] [Commented] (HDFS-17111) RBF: Optimize msync to only call nameservices that have observer reads enabled.

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744796#comment-17744796
 ] 

ASF GitHub Bot commented on HDFS-17111:
---

hadoop-yetus commented on PR #5860:
URL: https://github.com/apache/hadoop/pull/5860#issuecomment-1642792350

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |  18m 19s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  49m 48s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   0m 42s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  compile  |   0m 36s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  checkstyle  |   0m 30s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   0m 42s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   0m 42s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 30s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   1m 26s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  39m 37s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 34s |  |  the patch passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javac  |   0m 34s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   0m 29s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  javac  |   0m 29s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  1s |  |  The patch has no blanks 
issues.  |
   | -0 :warning: |  checkstyle  |   0m 19s | 
[/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/1/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt)
 |  hadoop-hdfs-project/hadoop-hdfs-rbf: The patch generated 1 new + 2 
unchanged - 0 fixed = 3 total (was 2)  |
   | +1 :green_heart: |  mvnsite  |   0m 33s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 29s |  |  the patch passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   0m 23s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   1m 24s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  39m 13s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  |  21m 39s |  |  hadoop-hdfs-rbf in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 36s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 183m  7s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5860 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux c67ae1fedf74 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 
13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 75f439c08cac2b8f4d7d79ed882b5e165d75b55d |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/1/testReport/ |
   | Max. process+thread count | 2194 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: 
hadoop-hdfs-project/hadoop-hdfs-rbf |
   

[jira] [Updated] (HDFS-17111) RBF: Optimize msync to only call nameservices that have observer reads enabled.

2023-07-19 Thread Simbarashe Dzinamarira (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simbarashe Dzinamarira updated HDFS-17111:
--
Summary: RBF: Optimize msync to only call nameservices that have observer 
reads enabled.  (was: RBF: Optimize msync to only call nameservices with 
observer namenodes.)

> RBF: Optimize msync to only call nameservices that have observer reads 
> enabled.
> ---
>
> Key: HDFS-17111
> URL: https://issues.apache.org/jira/browse/HDFS-17111
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Simbarashe Dzinamarira
>Assignee: Simbarashe Dzinamarira
>Priority: Major
>  Labels: pull-request-available
>
> Right now when a client MSYNCs to the router, the call is fanned out to all 
> nameservices. We only need to proxy the msync to nameservices that have 
> observer reads configured.
> We can do this either by adding a new config for the admin to specify which 
> nameservices have CRS configured, or we can try to automatically detect these.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744762#comment-17744762
 ] 

ASF GitHub Bot commented on HDFS-17093:
---

hadoop-yetus commented on PR #5856:
URL: https://github.com/apache/hadoop/pull/5856#issuecomment-1642638647

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 49s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  0s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  0s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  52m 27s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 25s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  compile  |   1m 15s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  checkstyle  |   1m 11s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 23s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 11s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 38s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   3m 25s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  41m  6s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 14s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 17s |  |  the patch passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javac  |   1m 17s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m  8s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  javac  |   1m  8s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   1m  3s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 18s |  |  the patch passed  |
   | -1 :x: |  javadoc  |   0m 57s | 
[/patch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5856/1/artifact/out/patch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1.txt)
 |  hadoop-hdfs in the patch failed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1.  |
   | +1 :green_heart: |  javadoc  |   1m 29s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   3m 26s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  41m  7s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | -1 :x: |  unit  | 251m 21s | 
[/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5856/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch passed.  |
   | +1 :green_heart: |  asflicense  |   0m 50s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 411m 12s |  |  |
   
   
   | Reason | Tests |
   |---:|:--|
   | Failed junit tests | hadoop.hdfs.TestRollingUpgrade |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5856/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5856 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux dffc1d606d5b 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 
13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 757c68f18d3b5ff89cf750a1e02116dd86ff07b2 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 

[jira] [Commented] (HDFS-17042) Add rpcCallSuccesses and OverallRpcProcessingTime to RpcMetrics for Namenode

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744753#comment-17744753
 ] 

ASF GitHub Bot commented on HDFS-17042:
---

goiri merged PR #5804:
URL: https://github.com/apache/hadoop/pull/5804




> Add rpcCallSuccesses and OverallRpcProcessingTime to RpcMetrics for Namenode
> 
>
> Key: HDFS-17042
> URL: https://issues.apache.org/jira/browse/HDFS-17042
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: hdfs
>Affects Versions: 3.4.0, 3.3.9
>Reporter: Xing Lin
>Assignee: Xing Lin
>Priority: Major
>  Labels: pull-request-available
> Fix For: 3.4.0
>
>
> We'd like to add two new types of metrics to the existing NN 
> RpcMetrics/RpcDetailedMetrics. These two metrics can then be used as part of 
> SLA/SLO for the HDFS service.
>  * {_}RpcCallSuccesses{_}: it measures the number of RPC requests where they 
> are successfully processed by a NN (e.g., with a response with an RpcStatus 
> {_}RpcStatusProto.SUCCESS){_}{_}.{_} Then, together with {_}RpcQueueNumOps 
> ({_}which refers the total number of RPC requests{_}){_}, we can derive the 
> RpcErrorRate for our NN, as (RpcQueueNumOps - RpcCallSuccesses) / 
> RpcQueueNumOps. 
>  * OverallRpcProcessingTime for each RPC method: this metric measures the 
> overall RPC processing time for each RPC method at the NN. It covers the time 
> from when a request arrives at the NN to when a response is sent back. We are 
> already emitting processingTime for each RPC method today in 
> RpcDetailedMetrics. We want to extend it to emit overallRpcProcessingTime for 
> each RPC method, which includes enqueueTime, queueTime, processingTime, 
> responseTime, and handlerTime.
>  
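
A tiny, self-contained illustration of the derived error rate described above (the counter values are invented for the example; only the formula comes from the issue text):

{code:java}
public class RpcErrorRateSketch {
  public static void main(String[] args) {
    // Hypothetical counter values read from the NameNode RpcMetrics.
    long rpcQueueNumOps   = 1_000_000L; // total RPC requests received
    long rpcCallSuccesses =   998_500L; // requests answered with RpcStatusProto.SUCCESS
    double rpcErrorRate = (double) (rpcQueueNumOps - rpcCallSuccesses) / rpcQueueNumOps;
    System.out.printf("RpcErrorRate = %.4f%%%n", rpcErrorRate * 100); // prints 0.1500%
  }
}
{code}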



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-19 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744751#comment-17744751
 ] 

Xing Lin commented on HDFS-17093:
-

FYI, we set dfs.namenode.max.full.block.report.leases = 6, even though we are 
running clusters at 10k DNs per cluster.
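
For reference, a minimal sketch of applying the same setting programmatically (an illustration only, not taken from this thread; in practice the key would normally be set in the namenode's hdfs-site.xml):

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.HdfsConfiguration;

public class FbrLeaseConfigSketch {
  public static void main(String[] args) {
    // Number of concurrent full block report (FBR) leases the NameNode hands out;
    // 6 is simply the value mentioned in the comment above.
    Configuration conf = new HdfsConfiguration();
    conf.setInt("dfs.namenode.max.full.block.report.leases", 6);
    System.out.println(conf.getInt("dfs.namenode.max.full.block.report.leases", -1));
  }
}
{code}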

> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Priority: Minor
>  Labels: pull-request-available
> Attachments: HDFS-17093.patch
>
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> safe mode for a long time after restarting because of incomplete block 
> reporting.
> I found in the logs of the datanode with incomplete block reporting that the 
> first FBR attempt failed, possibly due to namenode stress, and then a second 
> FBR attempt was made as follows:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that: retry the send if it fails. But on the 
> namenode side, the logic is:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a storage (disk) is identified as having already reported, i.e. 
> storageInfo.getBlockReportCount() > 0, the lease is removed from the 
> datanode, so the second report fails because there is no lease.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17111) RBF: Optimize msync to only call nameservices with observer namenodes.

2023-07-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-17111:
--
Labels: pull-request-available  (was: )

> RBF: Optimize msync to only call nameservices with observer namenodes.
> --
>
> Key: HDFS-17111
> URL: https://issues.apache.org/jira/browse/HDFS-17111
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Simbarashe Dzinamarira
>Assignee: Simbarashe Dzinamarira
>Priority: Major
>  Labels: pull-request-available
>
> Right now when a client MSYNCs to the router, the call is fanned out to all 
> nameservices. We only need to proxy the msync to nameservices that have 
> observer reads configured.
> We can do this either by adding a new config for the admin to specify which 
> nameservices have CRS configured, or we can try to automatically detect these.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17111) RBF: Optimize msync to only call nameservices with observer namenodes.

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744750#comment-17744750
 ] 

ASF GitHub Bot commented on HDFS-17111:
---

simbadzina opened a new pull request, #5860:
URL: https://github.com/apache/hadoop/pull/5860

   HDFS-17111. RBF: Optimize msync to only call nameservices with observer 
namenodes.
   
   
   
   ### Description of PR
   Routers only need to msync to nameservices that have CRS configured.
   
   I'm still considering whether to just use a static configuration instead of 
trying to automatically identify the nameservices to msync to.
   
   ### How was this patch tested?
   New unit test.
   
   ### For code changes:
   
   - [ X] Does the title or this PR starts with the corresponding JIRA issue id 
(e.g. 'HADOOP-17799. Your PR title ...')?
   




> RBF: Optimize msync to only call nameservices with observer namenodes.
> --
>
> Key: HDFS-17111
> URL: https://issues.apache.org/jira/browse/HDFS-17111
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Simbarashe Dzinamarira
>Assignee: Simbarashe Dzinamarira
>Priority: Major
>
> Right now when a client MSYNCs to the router, the call is fanned out to all 
> nameservices. We only need to proxy the msync to nameservices that have 
> observer reads configured.
> We can do this either by adding a new config for the admin to specify which 
> nameservices have CRS configured, or we can try to automatically detect these.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-19 Thread Xing Lin (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744749#comment-17744749
 ] 

Xing Lin commented on HDFS-17093:
-

{quote}[~xinglin], I think your modification is more reasonable: the operations for 
each of the datanode's disks should be processed first, and

blockReportLeaseManager.removeLease(node);
return !node.hasStaleStorages();

should only be executed at the end, since both of these act at the datanode level.
{quote}
 

Not sure I understand what you said here.

> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Priority: Minor
>  Labels: pull-request-available
> Attachments: HDFS-17093.patch
>
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> safe mode for a long time after restarting because of incomplete block 
> reporting.
> I found in the logs of the datanode with incomplete block reporting that the 
> first FBR attempt failed, possibly due to namenode stress, and then a second 
> FBR attempt was made as follows:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that: retry the send if it fails. But on the 
> namenode side, the logic is:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a storage (disk) is identified as having already reported, i.e. 
> storageInfo.getBlockReportCount() > 0, the lease is removed from the 
> datanode, so the second report fails because there is no lease.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Assigned] (HDFS-17111) RBF: Optimize msync to only call nameservices with observer namenodes.

2023-07-19 Thread Simbarashe Dzinamarira (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Simbarashe Dzinamarira reassigned HDFS-17111:
-

Assignee: Simbarashe Dzinamarira

> RBF: Optimize msync to only call nameservices with observer namenodes.
> --
>
> Key: HDFS-17111
> URL: https://issues.apache.org/jira/browse/HDFS-17111
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Simbarashe Dzinamarira
>Assignee: Simbarashe Dzinamarira
>Priority: Major
>
> Right now when a client MSYNCs to the router, the call is fanned out to all 
> nameservices. We only need to proxy the msync to nameservices that have 
> observer reads configured.
> We can do this either by adding a new config for the admin to specify which 
> nameservices have CRS configured, or we can try to automatically detect these.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17111) RBF: Optimize msync to only call nameservices with observer namenodes.

2023-07-19 Thread Simbarashe Dzinamarira (Jira)
Simbarashe Dzinamarira created HDFS-17111:
-

 Summary: RBF: Optimize msync to only call nameservices with 
observer namenodes.
 Key: HDFS-17111
 URL: https://issues.apache.org/jira/browse/HDFS-17111
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: Simbarashe Dzinamarira


Right now when a client MSYNCs to the router, the call is fanned out to all 
nameservices. We only need to proxy the msync to nameservices that have 
observer reads configured.

We can do this either by adding a new config for the admin to specify which 
nameservices have CRS configured, or we can try to automatically detect these.
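
A hedged sketch of the fan-out filtering idea (not the patch in PR #5860; the observer-read check is a stand-in for whatever config or auto-detection is ultimately chosen):

{code:java}
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;
import java.util.function.Predicate;

public class MsyncFanoutSketch {
  /** Keep only the nameservices that should receive the proxied msync. */
  static Set<String> selectMsyncTargets(Set<String> allNameservices,
                                        Predicate<String> hasObserverReads) {
    Set<String> targets = new LinkedHashSet<>();
    for (String ns : allNameservices) {
      if (hasObserverReads.test(ns)) { // stand-in for a config lookup or auto-detection
        targets.add(ns);
      }
    }
    return targets;
  }

  public static void main(String[] args) {
    Set<String> all = new LinkedHashSet<>(Arrays.asList("ns0", "ns1", "ns2"));
    Set<String> observerEnabled = Collections.singleton("ns1"); // pretend only ns1 has CRS
    System.out.println(selectMsyncTargets(all, observerEnabled::contains)); // [ns1]
  }
}
{code}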



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-15042) Add more tests for ByteBufferPositionedReadable

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744741#comment-17744741
 ] 

ASF GitHub Bot commented on HDFS-15042:
---

mukund-thakur commented on code in PR #1747:
URL: https://github.com/apache/hadoop/pull/1747#discussion_r956411939


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java:
##
@@ -1684,6 +1685,9 @@ public int read(long position, final ByteBuffer buf) 
throws IOException {
   @Override
   public void readFully(long position, final ByteBuffer buf)
   throws IOException {
+if (position < 0) {
+  throw new EOFException(NEGATIVE_POSITION_READ);
+}

Review Comment:
   Yeah I would also not want to change this. 





> Add more tests for ByteBufferPositionedReadable 
> 
>
> Key: HDFS-15042
> URL: https://issues.apache.org/jira/browse/HDFS-15042
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: fs, test
>Affects Versions: 3.3.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> There's a few corner cases of ByteBufferPositionedReadable which need to be 
> tested, mainly illegal read positions. Add them



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744715#comment-17744715
 ] 

ASF GitHub Bot commented on HDFS-17094:
---

hadoop-yetus commented on PR #5854:
URL: https://github.com/apache/hadoop/pull/5854#issuecomment-1642442293

   :broken_heart: **-1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 42s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  52m 58s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 42s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  compile  |   1m 29s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  checkstyle  |   1m 23s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 40s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 21s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   2m  0s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   4m  5s |  |  trunk passed  |
   | -1 :x: |  shadedclient  |  42m 13s |  |  branch has errors when building 
and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | -1 :x: |  mvninstall  |   0m 23s | 
[/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5854/2/artifact/out/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt)
 |  hadoop-hdfs in the patch failed.  |
   | +1 :green_heart: |  compile  |   1m 32s |  |  the patch passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javac  |   1m 32s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 26s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  javac  |   1m 26s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   1m 13s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 30s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   1m  6s |  |  the patch passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 36s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   3m 48s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  36m 38s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  | 223m 42s |  |  hadoop-hdfs in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 58s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 382m 58s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5854/2/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5854 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 79c529060291 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 
13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / 446ddffc53cb891e0a410bd76a6864666f22ff11 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5854/2/testReport/ |
   | Max. process+thread count | 3594 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5854/2/console |
   | versions | 

[jira] [Commented] (HDFS-15042) Add more tests for ByteBufferPositionedReadable

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744709#comment-17744709
 ] 

ASF GitHub Bot commented on HDFS-15042:
---

steveloughran commented on code in PR #1747:
URL: https://github.com/apache/hadoop/pull/1747#discussion_r1268335189


##
hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestByteBufferPread.java:
##
@@ -161,130 +229,264 @@ private void testPreadWithFullByteBuffer(ByteBuffer 
buffer)
* {@link ByteBuffer#limit()} on the buffer. Validates that only half of the
* testFile is loaded into the buffer.
*/
-  private void testPreadWithLimitedByteBuffer(
-  ByteBuffer buffer) throws IOException {
+  @Test
+  public void testPreadWithLimitedByteBuffer() throws IOException {
 int bytesRead;
 int totalBytesRead = 0;
 // Set the buffer limit to half the size of the file
-buffer.limit(FILE_SIZE / 2);
+buffer.limit(HALF_SIZE);
 
 try (FSDataInputStream in = fs.open(testFile)) {
+  in.seek(EOF_POS);
   while ((bytesRead = in.read(totalBytesRead, buffer)) > 0) {
 totalBytesRead += bytesRead;
 // Check that each call to read changes the position of the ByteBuffer
 // correctly
-assertEquals(totalBytesRead, buffer.position());
+assertBufferPosition(totalBytesRead);
   }
 
   // Since we set the buffer limit to half the size of the file, we should
   // have only read half of the file into the buffer
-  assertEquals(totalBytesRead, FILE_SIZE / 2);
+  assertEquals(HALF_SIZE, totalBytesRead);
   // Check that the buffer is full and the contents equal the first half of
   // the file
-  assertFalse(buffer.hasRemaining());
-  buffer.position(0);
-  byte[] bufferContents = new byte[FILE_SIZE / 2];
-  buffer.get(bufferContents);
-  assertArrayEquals(bufferContents,
-  Arrays.copyOfRange(fileContents, 0, FILE_SIZE / 2));
+  assertBufferIsFull();
+  assertBufferEqualsFileContents(0, HALF_SIZE, 0);
+
+  // position hasn't changed
+  assertStreamPosition(in, EOF_POS);
 }
   }
 
   /**
* Reads half of the testFile into the {@link ByteBuffer} by setting the
* {@link ByteBuffer#position()} the half the size of the file. Validates 
that
* only half of the testFile is loaded into the buffer.
+   * 
+   * This test interleaves reading from the stream by the classic input
+   * stream API, verifying those bytes are also as expected.
+   * This lets us validate the requirement that these positions reads must
+   * not interfere with the conventional read sequence.
*/
-  private void testPreadWithPositionedByteBuffer(
-  ByteBuffer buffer) throws IOException {
+  @Test
+  public void testPreadWithPositionedByteBuffer() throws IOException {
 int bytesRead;
 int totalBytesRead = 0;
 // Set the buffer position to half the size of the file
-buffer.position(FILE_SIZE / 2);
+buffer.position(HALF_SIZE);
+int counter = 0;
 
 try (FSDataInputStream in = fs.open(testFile)) {
+  assertEquals("Byte read from stream",
+  fileContents[counter++], in.read());
   while ((bytesRead = in.read(totalBytesRead, buffer)) > 0) {
 totalBytesRead += bytesRead;
 // Check that each call to read changes the position of the ByteBuffer
 // correctly
-assertEquals(totalBytesRead + FILE_SIZE / 2, buffer.position());
+assertBufferPosition(totalBytesRead + HALF_SIZE);
+// read the next byte.
+assertEquals("Byte read from stream",
+fileContents[counter++], in.read());
   }
 
   // Since we set the buffer position to half the size of the file, we
   // should have only read half of the file into the buffer
-  assertEquals(totalBytesRead, FILE_SIZE / 2);
+  assertEquals("bytes read",
+  HALF_SIZE, totalBytesRead);
   // Check that the buffer is full and the contents equal the first half of
   // the file
-  assertFalse(buffer.hasRemaining());
-  buffer.position(FILE_SIZE / 2);
-  byte[] bufferContents = new byte[FILE_SIZE / 2];
-  buffer.get(bufferContents);
-  assertArrayEquals(bufferContents,
-  Arrays.copyOfRange(fileContents, 0, FILE_SIZE / 2));
+  assertBufferIsFull();
+  assertBufferEqualsFileContents(HALF_SIZE, HALF_SIZE, 0);
 }
   }
 
+  /**
+   * Assert the buffer ranges matches that in the file.
+   * @param bufferPosition buffer position
+   * @param length length of data to check
+   * @param fileOffset offset in file.
+   */
+  private void assertBufferEqualsFileContents(int bufferPosition,
+  int length,
+  int fileOffset) {
+buffer.position(bufferPosition);
+byte[] bufferContents = new byte[length];
+buffer.get(bufferContents);
+assertArrayEquals(
+"Buffer data from [" + 

[jira] [Commented] (HDFS-15042) Add more tests for ByteBufferPositionedReadable

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-15042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744708#comment-17744708
 ] 

ASF GitHub Bot commented on HDFS-15042:
---

steveloughran commented on code in PR #1747:
URL: https://github.com/apache/hadoop/pull/1747#discussion_r1268334625


##
hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java:
##
@@ -1684,6 +1685,9 @@ public int read(long position, final ByteBuffer buf) 
throws IOException {
   @Override
   public void readFully(long position, final ByteBuffer buf)
   throws IOException {
+if (position < 0) {
+  throw new EOFException(NEGATIVE_POSITION_READ);
+}

Review Comment:
   ok





> Add more tests for ByteBufferPositionedReadable 
> 
>
> Key: HDFS-15042
> URL: https://issues.apache.org/jira/browse/HDFS-15042
> Project: Hadoop HDFS
>  Issue Type: Improvement
>  Components: fs, test
>Affects Versions: 3.3.0
>Reporter: Steve Loughran
>Assignee: Steve Loughran
>Priority: Major
>  Labels: pull-request-available
>  Time Spent: 1.5h
>  Remaining Estimate: 0h
>
> There's a few corner cases of ByteBufferPositionedReadable which need to be 
> tested, mainly illegal read positions. Add them



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17110) Null Pointer Exception when running TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort

2023-07-19 Thread ConfX (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ConfX updated HDFS-17110:
-
Attachment: (was: reproduce.sh)

>  Null Pointer Exception when running 
> TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort
> --
>
> Key: HDFS-17110
> URL: https://issues.apache.org/jira/browse/HDFS-17110
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ConfX
>Priority: Critical
> Attachments: reproduce.sh
>
>
> h2. What happened
> After setting {{{}dfs.namenode.replication.min=12396{}}}, running test 
> {{org.apache.hadoop.hdfs.server.namenode.ha.TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort}}
>  results in a {{{}NullPointerException{}}}.
> h2. Where's the bug
> In the test 
> {{{}org.apache.hadoop.hdfs.server.namenode.ha.TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort{}}}:
> {noformat}
>     } finally {
>       cluster.shutdown();
>     }{noformat}
> the test tries to shut down the cluster to clean up. However, if the 
> cluster was never created and cluster == null, the resulting NPE conceals the 
> original failure.
> h2. How to reproduce
>  # Set {{dfs.namenode.replication.min=12396}}
>  # Run 
> {{org.apache.hadoop.hdfs.server.namenode.ha.TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort}}
> and the following exception should be observed:
> {noformat}
> java.lang.NullPointerException
>     at 
> org.apache.hadoop.hdfs.server.namenode.ha.TestHarFileSystemWithHA.testHarUriWithHaUriWithNoPort(TestHarFileSystemWithHA.java:60){noformat}
> For an easy reproduction, run the reproduce.sh in the attachment.
> We are happy to provide a patch if this issue is confirmed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17110) Null Pointer Exception when running TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort

2023-07-19 Thread ConfX (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ConfX updated HDFS-17110:
-
Attachment: reproduce.sh

>  Null Pointer Exception when running 
> TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort
> --
>
> Key: HDFS-17110
> URL: https://issues.apache.org/jira/browse/HDFS-17110
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: ConfX
>Priority: Critical
> Attachments: reproduce.sh
>
>
> h2. What happened
> After setting {{{}dfs.namenode.replication.min=12396{}}}, running test 
> {{org.apache.hadoop.hdfs.server.namenode.ha.TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort}}
>  results in a {{{}NullPointerException{}}}.
> h2. Where's the bug
> In the test 
> {{{}org.apache.hadoop.hdfs.server.namenode.ha.TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort{}}}:
> {noformat}
>     } finally {
>       cluster.shutdown();
>     }{noformat}
> the test tries to shut down the cluster to clean up. However, if the 
> cluster was never created and cluster == null, the resulting NPE conceals the 
> original failure.
> h2. How to reproduce
>  # Set {{dfs.namenode.replication.min=12396}}
>  # Run 
> {{org.apache.hadoop.hdfs.server.namenode.ha.TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort}}
> and the following exception should be observed:
> {noformat}
> java.lang.NullPointerException
>     at 
> org.apache.hadoop.hdfs.server.namenode.ha.TestHarFileSystemWithHA.testHarUriWithHaUriWithNoPort(TestHarFileSystemWithHA.java:60){noformat}
> For an easy reproduction, run the reproduce.sh in the attachment.
> We are happy to provide a patch if this issue is confirmed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17110) Null Pointer Exception when running TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort

2023-07-19 Thread ConfX (Jira)
ConfX created HDFS-17110:


 Summary:  Null Pointer Exception when running 
TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort
 Key: HDFS-17110
 URL: https://issues.apache.org/jira/browse/HDFS-17110
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ConfX
 Attachments: reproduce.sh

h2. What happened

After setting {{{}dfs.namenode.replication.min=12396{}}}, running test 
{{org.apache.hadoop.hdfs.server.namenode.ha.TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort}}
 results in a {{{}NullPointerException{}}}.
h2. Where's the bug

In the test 
{{{}org.apache.hadoop.hdfs.server.namenode.ha.TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort{}}}:
{noformat}
    } finally {
      cluster.shutdown();
    }{noformat}
the test tries to shut down the cluster to clean up. However, if the cluster 
was never created and cluster == null, the resulting NPE conceals the original failure.
h2. How to reproduce
 # Set {{dfs.namenode.replication.min=12396}}
 # Run 
{{org.apache.hadoop.hdfs.server.namenode.ha.TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort}}
and the following exception should be observed:
{noformat}
java.lang.NullPointerException
    at 
org.apache.hadoop.hdfs.server.namenode.ha.TestHarFileSystemWithHA.testHarUriWithHaUriWithNoPort(TestHarFileSystemWithHA.java:60){noformat}
For an easy reproduction, run the reproduce.sh in the attachment.

We are happy to provide a patch if this issue is confirmed.
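
A small stand-alone demonstration of why the missing null guard matters (plain Java with no HDFS dependencies; "cluster" is only a placeholder for the MiniDFSCluster that failed to start):

{code:java}
public class FinallyMaskingSketch {
  public static void main(String[] args) {
    Object cluster = null; // cluster creation failed, so the reference is still null
    try {
      throw new IllegalStateException("the failure we actually want to see");
    } finally {
      // With this guard, the original IllegalStateException propagates.
      // Without it, cluster.toString() would throw an NPE that replaces it.
      if (cluster != null) {
        cluster.toString(); // placeholder for cluster.shutdown()
      }
    }
  }
}
{code}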



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744696#comment-17744696
 ] 

ASF GitHub Bot commented on HDFS-17093:
---

hadoop-yetus commented on PR #5855:
URL: https://github.com/apache/hadoop/pull/5855#issuecomment-1642377590

   :confetti_ball: **+1 overall**
   
   
   
   
   
   
   | Vote | Subsystem | Runtime |  Logfile | Comment |
   |::|--:|:|::|:---:|
   | +0 :ok: |  reexec  |   0m 43s |  |  Docker mode activated.  |
    _ Prechecks _ |
   | +1 :green_heart: |  dupname  |   0m  0s |  |  No case conflicting files 
found.  |
   | +0 :ok: |  codespell  |   0m  1s |  |  codespell was not available.  |
   | +0 :ok: |  detsecrets  |   0m  1s |  |  detect-secrets was not available.  
|
   | +1 :green_heart: |  @author  |   0m  0s |  |  The patch does not contain 
any @author tags.  |
   | +1 :green_heart: |  test4tests  |   0m  0s |  |  The patch appears to 
include 1 new or modified test files.  |
    _ trunk Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |  51m 44s |  |  trunk passed  |
   | +1 :green_heart: |  compile  |   1m 31s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  compile  |   1m 24s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  checkstyle  |   1m 15s |  |  trunk passed  |
   | +1 :green_heart: |  mvnsite  |   1m 30s |  |  trunk passed  |
   | +1 :green_heart: |  javadoc  |   1m 13s |  |  trunk passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 41s |  |  trunk passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   3m 41s |  |  trunk passed  |
   | +1 :green_heart: |  shadedclient  |  38m 37s |  |  branch has no errors 
when building and testing our client artifacts.  |
    _ Patch Compile Tests _ |
   | +1 :green_heart: |  mvninstall  |   1m 15s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 17s |  |  the patch passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javac  |   1m 17s |  |  the patch passed  |
   | +1 :green_heart: |  compile  |   1m 15s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  javac  |   1m 15s |  |  the patch passed  |
   | +1 :green_heart: |  blanks  |   0m  0s |  |  The patch has no blanks 
issues.  |
   | +1 :green_heart: |  checkstyle  |   1m  4s |  |  the patch passed  |
   | +1 :green_heart: |  mvnsite  |   1m 19s |  |  the patch passed  |
   | +1 :green_heart: |  javadoc  |   0m 58s |  |  the patch passed with JDK 
Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1  |
   | +1 :green_heart: |  javadoc  |   1m 31s |  |  the patch passed with JDK 
Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09  |
   | +1 :green_heart: |  spotbugs  |   3m 15s |  |  the patch passed  |
   | +1 :green_heart: |  shadedclient  |  38m  6s |  |  patch has no errors 
when building and testing our client artifacts.  |
    _ Other Tests _ |
   | +1 :green_heart: |  unit  | 212m 51s |  |  hadoop-hdfs in the patch 
passed.  |
   | +1 :green_heart: |  asflicense  |   0m 56s |  |  The patch does not 
generate ASF License warnings.  |
   |  |   | 367m 50s |  |  |
   
   
   | Subsystem | Report/Notes |
   |--:|:-|
   | Docker | ClientAPI=1.43 ServerAPI=1.43 base: 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5855/1/artifact/out/Dockerfile
 |
   | GITHUB PR | https://github.com/apache/hadoop/pull/5855 |
   | Optional Tests | dupname asflicense compile javac javadoc mvninstall 
mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
   | uname | Linux 897492e024b6 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 
13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
   | Build tool | maven |
   | Personality | dev-support/bin/hadoop.sh |
   | git revision | trunk / a4b76d3d1e3785641758e2aca40069504b3c99b9 |
   | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
   | Multi-JDK versions | 
/usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 
/usr/lib/jvm/java-8-openjdk-amd64:Private 
Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
   |  Test Results | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5855/1/testReport/ |
   | Max. process+thread count | 2916 (vs. ulimit of 5500) |
   | modules | C: hadoop-hdfs-project/hadoop-hdfs U: 
hadoop-hdfs-project/hadoop-hdfs |
   | Console output | 
https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5855/1/console |
   | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
   | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |
   
   
   This message was automatically generated.
   
   




> In the case of 

[jira] [Updated] (HDFS-17109) Null Pointer Exception when running TestBlockManager

2023-07-19 Thread ConfX (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ConfX updated HDFS-17109:
-
Description: 
h2. What happened

After setting {{{}dfs.namenode.redundancy.considerLoadByStorageType=true{}}}, 
running test 
{{org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager#testOneOfTwoRacksDecommissioned}}
 results in a {{{}NullPointerException{}}}.
h2. Where's the bug

In the class {{{}BlockPlacementPolicyDefault{}}}:
{noformat}
    for (StorageType s : storageTypes) {
      StorageTypeStats storageTypeStats = storageStats.get(s);
      numNodes += storageTypeStats.getNodesInService();
      numXceiver += storageTypeStats.getNodesInServiceXceiverCount();
    }{noformat}
However, the class does not check if the storageTypeStats is null, causing the 
NPE.
h2. How to reproduce
 # Set {{dfs.namenode.redundancy.considerLoadByStorageType=true}}
 # Run 
{{org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager#testOneOfTwoRacksDecommissioned}}
and the following exception should be observed:
{noformat}
java.lang.NullPointerException
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getInServiceXceiverAverageByStorageType(BlockPlacementPolicyDefault.java:1044)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getInServiceXceiverAverage(BlockPlacementPolicyDefault.java:1023)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.excludeNodeByLoad(BlockPlacementPolicyDefault.java:1000)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.isGoodDatanode(BlockPlacementPolicyDefault.java:1086)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:855)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:782)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:557)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:478)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:350)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:170)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:51)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:2031)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.scheduleSingleReplication(TestBlockManager.java:641)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestOneOfTwoRacksDecommissioned(TestBlockManager.java:364)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testOneOfTwoRacksDecommissioned(TestBlockManager.java:351){noformat}
 
For an easy reproduction, run the reproduce.sh in the attachment.  
We are happy to provide a patch if this issue is confirmed.

  was:
h2. What happened

After setting {{{}dfs.namenode.redundancy.considerLoadByStorageType=true{}}}, 
running test 
{{org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager#testOneOfTwoRacksDecommissioned}}
 results in a {{{}NullPointerException{}}}.
h2. Where's the bug

In the class {{{}BlockPlacementPolicyDefault{}}}:
{noformat}
    for (StorageType s : storageTypes) {
      StorageTypeStats storageTypeStats = storageStats.get(s);
      numNodes += storageTypeStats.getNodesInService();
      numXceiver += storageTypeStats.getNodesInServiceXceiverCount();
    }{noformat}
However, the class does not check if the storageTypeStats is null, causing the 
NPE.
h2. How to reproduce
 # Set {{dfs.namenode.redundancy.considerLoadByStorageType=true}}
 # Run 
{{org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager#testOneOfTwoRacksDecommissioned}}
and the following exception should be observed:
{noformat}
java.lang.NullPointerException
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getInServiceXceiverAverageByStorageType(BlockPlacementPolicyDefault.java:1044)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getInServiceXceiverAverage(BlockPlacementPolicyDefault.java:1023)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.excludeNodeByLoad(BlockPlacementPolicyDefault.java:1000)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.isGoodDatanode(BlockPlacementPolicyDefault.java:1086)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:855)
    at 

[jira] [Created] (HDFS-17109) Null Pointer Exception when running TestBlockManager

2023-07-19 Thread ConfX (Jira)
ConfX created HDFS-17109:


 Summary: Null Pointer Exception when running TestBlockManager
 Key: HDFS-17109
 URL: https://issues.apache.org/jira/browse/HDFS-17109
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ConfX
 Attachments: reproduce.sh

h2. What happened

After setting {{{}dfs.namenode.redundancy.considerLoadByStorageType=true{}}}, 
running test 
{{org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager#testOneOfTwoRacksDecommissioned}}
 results in a {{{}NullPointerException{}}}.
h2. Where's the bug

In the class {{{}BlockPlacementPolicyDefault{}}}:
{noformat}
    for (StorageType s : storageTypes) {
      StorageTypeStats storageTypeStats = storageStats.get(s);
      numNodes += storageTypeStats.getNodesInService();
      numXceiver += storageTypeStats.getNodesInServiceXceiverCount();
    }{noformat}
However, the class does not check if the storageTypeStats is null, causing the 
NPE.
h2. How to reproduce
 # Set {{dfs.namenode.redundancy.considerLoadByStorageType=true}}
 # Run 
{{org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager#testOneOfTwoRacksDecommissioned}}
and the following exception should be observed:
{noformat}
java.lang.NullPointerException
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getInServiceXceiverAverageByStorageType(BlockPlacementPolicyDefault.java:1044)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getInServiceXceiverAverage(BlockPlacementPolicyDefault.java:1023)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.excludeNodeByLoad(BlockPlacementPolicyDefault.java:1000)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.isGoodDatanode(BlockPlacementPolicyDefault.java:1086)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:855)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:782)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:557)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:478)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:350)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:170)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:51)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:2031)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.scheduleSingleReplication(TestBlockManager.java:641)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestOneOfTwoRacksDecommissioned(TestBlockManager.java:364)
    at 
org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testOneOfTwoRacksDecommissioned(TestBlockManager.java:351){noformat}
For an easy reproduction, run the reproduce.sh in the attachment.

We are happy to provide a patch if this issue is confirmed.
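
A minimal sketch of the kind of guard that would avoid the NPE (ordinary Java with a Map standing in for storageStats; this is not the actual BlockPlacementPolicyDefault code):

{code:java}
import java.util.EnumMap;
import java.util.Map;

public class StorageTypeStatsGuardSketch {
  enum StorageType { DISK, SSD, ARCHIVE }

  static final class StorageTypeStats {
    final int nodesInService;
    final int nodesInServiceXceiverCount;
    StorageTypeStats(int nodes, int xceivers) {
      this.nodesInService = nodes;
      this.nodesInServiceXceiverCount = xceivers;
    }
  }

  public static void main(String[] args) {
    Map<StorageType, StorageTypeStats> storageStats = new EnumMap<>(StorageType.class);
    storageStats.put(StorageType.DISK, new StorageTypeStats(10, 80));
    // Note: no entry for SSD or ARCHIVE, mimicking the situation that triggers the NPE.

    int numNodes = 0;
    int numXceiver = 0;
    for (StorageType s : StorageType.values()) {
      StorageTypeStats stats = storageStats.get(s);
      if (stats == null) {   // the missing check described above
        continue;
      }
      numNodes += stats.nodesInService;
      numXceiver += stats.nodesInServiceXceiverCount;
    }
    System.out.println(numNodes + " nodes, " + numXceiver + " xceivers"); // 10 nodes, 80 xceivers
  }
}
{code}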



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17108) Null Pointer Exception when running TestDecommissionWithBackoffMonitor

2023-07-19 Thread ConfX (Jira)
ConfX created HDFS-17108:


 Summary: Null Pointer Exception when running 
TestDecommissionWithBackoffMonitor
 Key: HDFS-17108
 URL: https://issues.apache.org/jira/browse/HDFS-17108
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ConfX
 Attachments: reproduce.sh

h2. What happened

After setting {{{}dfs.client.read.shortcircuit=true{}}}, running test 
{{org.apache.hadoop.hdfs.TestDecommissionWithBackoffMonitor#testNodeUsageWhileDecommissioining}}
 results in a {{{}NullPointerException{}}}.
h2. Where's the bug

In the test class {{{}org.apache.hadoop.hdfs.TestDecommission{}}}:
{noformat}
    } finally {
      cleanupFile(fileSys, file1);
    }{noformat}
However, the class does not check if the fileSys is null, causing the NPE.
h2. How to reproduce
 # Set {{dfs.client.read.shortcircuit=true}}
 # Run 
{{org.apache.hadoop.hdfs.TestDecommissionWithBackoffMonitor#testNodeUsageWhileDecommissioining}}
and the following exception should be observed:
{noformat}
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.AdminStatesBaseTest.cleanupFile(AdminStatesBaseTest.java:459)
    at org.apache.hadoop.hdfs.TestDecommission.nodeUsageVerification(TestDecommission.java:1575)
    at org.apache.hadoop.hdfs.TestDecommission.testNodeUsageWhileDecommissioining(TestDecommission.java:1510){noformat}
For an easy reproduction, run the attached reproduce.sh.

We are happy to provide a patch if this issue is confirmed.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Created] (HDFS-17107) Null Pointer Exception after turned on detail metric for namenode lock

2023-07-19 Thread ConfX (Jira)
ConfX created HDFS-17107:


 Summary: Null Pointer Exception after turned on detail metric for 
namenode lock
 Key: HDFS-17107
 URL: https://issues.apache.org/jira/browse/HDFS-17107
 Project: Hadoop HDFS
  Issue Type: Bug
Reporter: ConfX
 Attachments: reproduce.sh

h2. What happened

After setting {{dfs.namenode.lock.detailed-metrics.enabled=true}}, running the 
test 
{{org.apache.hadoop.hdfs.server.namenode.TestFSNamesystemLock#testFSWriteLockReportSuppressed}}
results in a {{NullPointerException}}.
h2. Where's the bug

In class {{FSNamesystemLock}}:
{noformat}
    if (metricsEnabled) {
      String opMetric = getMetricName(operationName, isWrite);
      detailedHoldTimeMetrics.add(opMetric, value);{noformat}
Here the metrics flag may be enabled while {{detailedHoldTimeMetrics}} is still 
null, which triggers the NPE.
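A hedged sketch of the guard that would avoid this, with illustrative names only (this is not the actual {{FSNamesystemLock}} code, and the real fix may differ):
{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.atomic.LongAdder;

public class LockMetricsExample {
  private final boolean metricsEnabled;
  // may legitimately be null if the metrics sink was never wired up
  private final ConcurrentMap<String, LongAdder> detailedHoldTimeMetrics;

  LockMetricsExample(boolean enabled, boolean wireSink) {
    this.metricsEnabled = enabled;
    this.detailedHoldTimeMetrics = wireSink ? new ConcurrentHashMap<>() : null;
  }

  void addMetric(String operationName, long value, boolean isWrite) {
    // guard both the flag and the sink; the reported NPE suggests that only
    // the flag was checked before dereferencing the sink
    if (metricsEnabled && detailedHoldTimeMetrics != null) {
      String opMetric = (isWrite ? "write:" : "read:") + operationName;
      detailedHoldTimeMetrics.computeIfAbsent(opMetric, k -> new LongAdder()).add(value);
    }
  }

  public static void main(String[] args) {
    // metrics enabled but the sink never wired: with the extra null check
    // this call is a no-op instead of an NPE
    new LockMetricsExample(true, false).addMetric("rename", 5, true);
    System.out.println("no NPE");
  }
}
{code}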
h2. How to reproduce
 # Set {{dfs.namenode.lock.detailed-metrics.enabled=true}}
 # Run 
{{org.apache.hadoop.hdfs.server.namenode.TestFSNamesystemLock#testFSWriteLockReportSuppressed}}
and the following exception should be observed:
{noformat}
java.lang.NullPointerException
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.addMetric(FSNamesystemLock.java:359)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:287)
    at org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:236)
    at org.apache.hadoop.hdfs.server.namenode.TestFSNamesystemLock.testFSWriteLockReportSuppressed(TestFSNamesystemLock.java:433){noformat}
For an easy reproduction, run the attached reproduce.sh.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744617#comment-17744617
 ] 

ASF GitHub Bot commented on HDFS-17093:
---

Hexiaoqiao closed pull request #5856: HDFS-17093. In the case of all datanodes 
sending FBR when the namenode restarts (large clusters), there is an issue with 
incomplete block reporting
URL: https://github.com/apache/hadoop/pull/5856




> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Priority: Minor
>  Labels: pull-request-available
> Attachments: HDFS-17093.patch
>
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> secure mode for a long time after restarting because of incomplete block 
> reporting
> I found in the logs of the datanode with incomplete block reporting that the 
> first FBR attempt failed, possibly due to namenode stress, and then a second 
> FBR attempt was made as follows:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that. Retry the send if it fails But on the 
> namenode side of the logic:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a disk was identified as the report is not the first time, namely 
> storageInfo. GetBlockReportCount > 0, Will remove the ticket from the 
> datanode, lead to a second report failed because no lease



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744616#comment-17744616
 ] 

ASF GitHub Bot commented on HDFS-17093:
---

Hexiaoqiao commented on PR #5856:
URL: https://github.com/apache/hadoop/pull/5856#issuecomment-1642036068

   Please push update to your same original branch (for this case which is 
Tre2878:HDFS-17093), DO NOT pull repeat request for same issue.




> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Priority: Minor
>  Labels: pull-request-available
> Attachments: HDFS-17093.patch
>
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> secure mode for a long time after restarting because of incomplete block 
> reporting
> I found in the logs of the datanode with incomplete block reporting that the 
> first FBR attempt failed, possibly due to namenode stress, and then a second 
> FBR attempt was made as follows:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that. Retry the send if it fails But on the 
> namenode side of the logic:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a disk was identified as the report is not the first time, namely 
> storageInfo. GetBlockReportCount > 0, Will remove the ticket from the 
> datanode, lead to a second report failed because no lease



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-19 Thread Yanlei Yu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744602#comment-17744602
 ] 

Yanlei Yu commented on HDFS-17093:
--

[~hexiaoqiao] I have modified and resubmitted the PR: [GitHub Pull Request 
#5856|https://github.com/apache/hadoop/pull/5856]

> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Priority: Minor
>  Labels: pull-request-available
> Attachments: HDFS-17093.patch
>
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> secure mode for a long time after restarting because of incomplete block 
> reporting
> I found in the logs of the datanode with incomplete block reporting that the 
> first FBR attempt failed, possibly due to namenode stress, and then a second 
> FBR attempt was made as follows:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that. Retry the send if it fails But on the 
> namenode side of the logic:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a disk was identified as the report is not the first time, namely 
> storageInfo. GetBlockReportCount > 0, Will remove the ticket from the 
> datanode, lead to a second report failed because no lease



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744600#comment-17744600
 ] 

ASF GitHub Bot commented on HDFS-17093:
---

Tre2878 commented on code in PR #5855:
URL: https://github.com/apache/hadoop/pull/5855#discussion_r1268012644


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java:
##
@@ -2873,7 +2873,9 @@ public boolean checkBlockReportLease(BlockReportContext 
context,
   public boolean processReport(final DatanodeID nodeID,
   final DatanodeStorage storage,
   final BlockListAsLongs newReport,
-  BlockReportContext context) throws IOException {
+  BlockReportContext context,
+  int totalReportNum,
+  int currentReportNum) throws IOException {

Review Comment:
   I think it's ok





> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Priority: Minor
>  Labels: pull-request-available
> Attachments: HDFS-17093.patch
>
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> secure mode for a long time after restarting because of incomplete block 
> reporting
> I found in the logs of the datanode with incomplete block reporting that the 
> first FBR attempt failed, possibly due to namenode stress, and then a second 
> FBR attempt was made as follows:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that. Retry the send if it fails But on the 
> namenode side of the logic:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a disk was identified as the report is not the first time, namely 
> storageInfo. GetBlockReportCount > 0, Will remove the ticket from the 
> datanode, lead to a second report failed because no lease



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744599#comment-17744599
 ] 

ASF GitHub Bot commented on HDFS-17093:
---

Tre2878 opened a new pull request, #5856:
URL: https://github.com/apache/hadoop/pull/5856

   In the case of all datanodes sending FBR when the namenode restarts (large 
clusters), there is an issue with incomplete block reporting




> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Priority: Minor
>  Labels: pull-request-available
> Attachments: HDFS-17093.patch
>
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> secure mode for a long time after restarting because of incomplete block 
> reporting
> I found in the logs of the datanode with incomplete block reporting that the 
> first FBR attempt failed, possibly due to namenode stress, and then a second 
> FBR attempt was made as follows:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that. Retry the send if it fails But on the 
> namenode side of the logic:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a disk was identified as the report is not the first time, namely 
> storageInfo. GetBlockReportCount > 0, Will remove the ticket from the 
> datanode, lead to a second report failed because no lease



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Updated] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-19 Thread ASF GitHub Bot (Jira)


 [ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

ASF GitHub Bot updated HDFS-17093:
--
Labels: pull-request-available  (was: )

> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Priority: Minor
>  Labels: pull-request-available
> Attachments: HDFS-17093.patch
>
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> secure mode for a long time after restarting because of incomplete block 
> reporting
> I found in the logs of the datanode with incomplete block reporting that the 
> first FBR attempt failed, possibly due to namenode stress, and then a second 
> FBR attempt was made as follows:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that. Retry the send if it fails But on the 
> namenode side of the logic:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a disk was identified as the report is not the first time, namely 
> storageInfo. GetBlockReportCount > 0, Will remove the ticket from the 
> datanode, lead to a second report failed because no lease



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744537#comment-17744537
 ] 

ASF GitHub Bot commented on HDFS-17094:
---

zhangshuyan0 commented on PR #5854:
URL: https://github.com/apache/hadoop/pull/5854#issuecomment-1641844035

   @Hexiaoqiao @tomscut Thanks for your review. I've updated this PR according 
to the suggestions. Please take a look, thanks again.




> EC: Fix bug in block recovery when there are stale datanodes
> 
>
> Key: HDFS-17094
> URL: https://issues.apache.org/jira/browse/HDFS-17094
> Project: Hadoop HDFS
>  Issue Type: Bug
>Reporter: Shuyan Zhang
>Assignee: Shuyan Zhang
>Priority: Major
>  Labels: pull-request-available
>
> When a block recovery occurs, `RecoveryTaskStriped` in datanode expects 
> `rBlock.getLocations()` and `rBlock. getBlockIndices()` to be in one-to-one 
> correspondence. However, if there are locations in stale state when NameNode 
> handles heartbeat, this correspondence will be disrupted. In detail, there is 
> no stale location in `recoveryLocations`, but the block indices array is 
> still complete (i.e. contains the indices of all the locations). This will 
> cause `BlockRecoveryWorker.RecoveryTaskStriped#recover` to generate a wrong 
> internal block ID, and the corresponding datanode cannot find the replica, 
> thus making the recovery process fail. This bug needs to be fixed.
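A small, self-contained illustration of the invariant described above (names are hypothetical, not the actual {{BlockRecoveryWorker}} or NameNode code): when stale locations are filtered out, the parallel block-index array must be filtered in lockstep, otherwise index i no longer describes location i.
{code:java}
import java.util.ArrayList;
import java.util.List;

public class ParallelArrayFilterExample {
  static class Loc {
    final String name; final boolean stale;
    Loc(String name, boolean stale) { this.name = name; this.stale = stale; }
    public String toString() { return name; }
  }

  public static void main(String[] args) {
    Loc[] locations = { new Loc("dn1", false), new Loc("dn2", true), new Loc("dn3", false) };
    byte[] blockIndices = { 0, 1, 2 };           // index i describes locations[i]

    List<Loc> recoveryLocations = new ArrayList<>();
    List<Byte> recoveryIndices = new ArrayList<>();
    for (int i = 0; i < locations.length; i++) {
      if (!locations[i].stale) {                 // drop stale locations ...
        recoveryLocations.add(locations[i]);
        recoveryIndices.add(blockIndices[i]);    // ... but keep the indices in lockstep
      }
    }
    // Filtering only the locations would leave indices [0, 1, 2] for [dn1, dn3]
    // and make dn3 look like internal block 1 instead of 2.
    System.out.println(recoveryLocations + " -> " + recoveryIndices);
  }
}
{code}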



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-19 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744538#comment-17744538
 ] 

ASF GitHub Bot commented on HDFS-17093:
---

Hexiaoqiao commented on code in PR #5855:
URL: https://github.com/apache/hadoop/pull/5855#discussion_r1267889826


##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java:
##
@@ -2873,7 +2873,9 @@ public boolean checkBlockReportLease(BlockReportContext 
context,
   public boolean processReport(final DatanodeID nodeID,
   final DatanodeStorage storage,
   final BlockListAsLongs newReport,
-  BlockReportContext context) throws IOException {
+  BlockReportContext context,
+  int totalReportNum,
+  int currentReportNum) throws IOException {

Review Comment:
   a. Please add some Javadoc for the added parameters.
   b. Would these names be more readable?
   totalReportNum  -> totalStorageReportsNum, 
   currentReportNum -> storageReportIndex
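Illustrative only, one possible shape of the requested Javadoc using the reviewer's suggested names (the final wording and signature are up to the PR author, and the assumption that the index is used to defer lease removal comes from the issue description, not the patch itself):
{code:java}
interface BlockReportProcessorSketch {
  /**
   * Processes one storage report from a datanode full block report.
   *
   * @param totalStorageReportsNum total number of storage reports contained
   *        in this full block report
   * @param storageReportIndex 1-based index of this storage report within the
   *        full block report (equal to totalStorageReportsNum for the last one)
   * @return true if the node has no stale storages after processing
   */
  boolean processReport(int totalStorageReportsNum, int storageReportIndex);
}
{code}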



##
hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java:
##
@@ -1650,7 +1650,7 @@ public DatanodeCommand blockReport(final 
DatanodeRegistration nodeReg,
   final int index = r;
   noStaleStorages = bm.runBlockOp(() ->
 bm.processReport(nodeReg, reports[index].getStorage(),
-blocks, context));
+blocks, context, reports.length, index+1));

Review Comment:
   codestyle: `index + 1` (leave one space here)





> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Priority: Minor
> Attachments: HDFS-17093.patch
>
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> secure mode for a long time after restarting because of incomplete block 
> reporting
> I found in the logs of the datanode with incomplete block reporting that the 
> first FBR attempt failed, possibly due to namenode stress, and then a second 
> FBR attempt was made as follows:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that. Retry the send if it fails But on the 
> namenode side of the logic:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a disk was identified as the report is not the first time, namely 
> storageInfo. GetBlockReportCount > 0, Will remove the ticket from the 
> datanode, lead to a second report failed because no lease



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-19 Thread Xiaoqiao He (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744526#comment-17744526
 ] 

Xiaoqiao He commented on HDFS-17093:


{quote}# DN wants to send 12 reports but only sent 1 report.
# NN processes 1 report (then storageInfo.getBlockReportCount() > 0 will be 
true)
# DN continues to send 12 reports to NN.
# NN will simply discard these reports, because 
storageInfo.getBlockReportCount() > 0{quote}

From this description and the log information, it is actually different from 
HDFS-17090, IMO.

For this case, the bugfix makes sense to me.
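For readers following the thread, here is a hypothetical, self-contained sketch of the failure sequence above, plus one possible use of the totalReportNum/currentReportNum parameters added in the PR (gate lease removal on the last storage report). This is not the Hadoop code and may differ from the committed patch.
{code:java}
import java.util.HashSet;
import java.util.Set;

public class FbrLeaseSketch {
  private final Set<String> leases = new HashSet<>();
  private int blockReportCount = 0;       // storage reports already processed for this node

  FbrLeaseSketch() { leases.add("dn1"); } // NN granted an FBR lease to dn1

  boolean process(String node, int index, int total, boolean deferRemoval) {
    if (!leases.contains(node)) {
      System.out.printf("report %d/%d failed: no lease%n", index, total);
      return false;
    }
    if (blockReportCount > 0) {           // non-initial report while NN is in startup safe mode
      if (!deferRemoval || index == total) {
        leases.remove(node);              // the current code removes the lease unconditionally here
      }
      System.out.printf("report %d/%d discarded (startup phase)%n", index, total);
      return false;
    }
    blockReportCount++;
    System.out.printf("report %d/%d processed%n", index, total);
    return true;
  }

  public static void main(String[] args) {
    FbrLeaseSketch nn = new FbrLeaseSketch();
    nn.process("dn1", 1, 12, false);      // first attempt: 1 of 12 reports sent, then the RPC fails
    for (int i = 1; i <= 12; i++) {       // retry: report 1 is discarded and the lease removed,
      nn.process("dn1", i, 12, false);    // so reports 2..12 fail with "no lease";
    }                                     // passing true would defer removal to index == total
  }
}
{code}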

> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Priority: Minor
> Attachments: HDFS-17093.patch
>
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> secure mode for a long time after restarting because of incomplete block 
> reporting
> I found in the logs of the datanode with incomplete block reporting that the 
> first FBR attempt failed, possibly due to namenode stress, and then a second 
> FBR attempt was made as follows:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that. Retry the send if it fails But on the 
> namenode side of the logic:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a disk was identified as the report is not the first time, namely 
> storageInfo. GetBlockReportCount > 0, Will remove the ticket from the 
> datanode, lead to a second report failed because no lease



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-19 Thread Yanlei Yu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744524#comment-17744524
 ] 

Yanlei Yu edited comment on HDFS-17093 at 7/19/23 10:24 AM:


[~hexiaoqiao] ,
{quote}would you mind to submit PR via Github if need?
{quote}
PR:[GitHub Pull Request #5855|https://github.com/apache/hadoop/pull/5855]


was (Author: JIRAUSER294151):
[~hexiaoqiao] ,
{quote}would you mind to submit PR via Github if need?
{quote}
PR:[[GitHub Pull Request 
#5855|https://github.com/apache/hadoop/pull/5855]|[http://example.com|https://github.com/apache/hadoop/pull/5855]]

> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Priority: Minor
> Attachments: HDFS-17093.patch
>
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> secure mode for a long time after restarting because of incomplete block 
> reporting
> I found in the logs of the datanode with incomplete block reporting that the 
> first FBR attempt failed, possibly due to namenode stress, and then a second 
> FBR attempt was made as follows:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that. Retry the send if it fails But on the 
> namenode side of the logic:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a disk was identified as the report is not the first time, namely 
> storageInfo. GetBlockReportCount > 0, Will remove the ticket from the 
> datanode, lead to a second report failed because no lease



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-19 Thread Yanlei Yu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744524#comment-17744524
 ] 

Yanlei Yu commented on HDFS-17093:
--

[~hexiaoqiao] ,
{quote}would you mind to submit PR via Github if need?
{quote}
PR:[[GitHub Pull Request 
#5855|https://github.com/apache/hadoop/pull/5855]|[http://example.com|https://github.com/apache/hadoop/pull/5855]]

> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Priority: Minor
> Attachments: HDFS-17093.patch
>
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> secure mode for a long time after restarting because of incomplete block 
> reporting
> I found in the logs of the datanode with incomplete block reporting that the 
> first FBR attempt failed, possibly due to namenode stress, and then a second 
> FBR attempt was made as follows:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that. Retry the send if it fails But on the 
> namenode side of the logic:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a disk was identified as the report is not the first time, namely 
> storageInfo. GetBlockReportCount > 0, Will remove the ticket from the 
> datanode, lead to a second report failed because no lease



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Comment Edited] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-19 Thread Yanlei Yu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1778#comment-1778
 ] 

Yanlei Yu edited comment on HDFS-17093 at 7/19/23 6:08 AM:
---

Just to add to that: our cluster is configured with 
dfs.namenode.max.full.block.report.leases=1500 (we have 800+ nodes). When the 
namenode restarts, all 800+ nodes send FBRs, and this issue shows up when the 
namenode is under a lot of pressure. Of course, we cannot rule out that with 
dfs.namenode.max.full.block.report.leases set to a smaller value it might not 
happen; we are not sure.


was (Author: JIRAUSER294151):
Just to add to that,Our cluster configuration is 
dfs.namenode.max.full.block.report.leases=1500(we have 800+ nodes),When the 
namenode restarts, all 800+ nodes will send FBRS,This happens when the namenode 
is under a lot of pressure,Of course will not rule out 
dfs.namenode.max.full.block.report.leases Set to a smaller value,will not happen

> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Priority: Minor
> Attachments: HDFS-17093.patch
>
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> secure mode for a long time after restarting because of incomplete block 
> reporting
> I found in the logs of the datanode with incomplete block reporting that the 
> first FBR attempt failed, possibly due to namenode stress, and then a second 
> FBR attempt was made as follows:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that. Retry the send if it fails But on the 
> namenode side of the logic:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a disk was identified as the report is not the first time, namely 
> storageInfo. GetBlockReportCount > 0, Will remove the ticket from the 
> datanode, lead to a second report failed because no lease



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org



[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting

2023-07-19 Thread Yanlei Yu (Jira)


[ 
https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1778#comment-1778
 ] 

Yanlei Yu commented on HDFS-17093:
--

Just to add to that: our cluster is configured with 
dfs.namenode.max.full.block.report.leases=1500 (we have 800+ nodes). When the 
namenode restarts, all 800+ nodes send FBRs, and this issue shows up when the 
namenode is under a lot of pressure. Of course, we cannot rule out that with 
dfs.namenode.max.full.block.report.leases set to a smaller value it will not 
happen.

> In the case of all datanodes sending FBR when the namenode restarts (large 
> clusters), there is an issue with incomplete block reporting
> ---
>
> Key: HDFS-17093
> URL: https://issues.apache.org/jira/browse/HDFS-17093
> Project: Hadoop HDFS
>  Issue Type: Bug
>  Components: namenode
>Affects Versions: 3.3.4
>Reporter: Yanlei Yu
>Priority: Minor
> Attachments: HDFS-17093.patch
>
>
> In our cluster of 800+ nodes, after restarting the namenode, we found that 
> some datanodes did not report enough blocks, causing the namenode to stay in 
> secure mode for a long time after restarting because of incomplete block 
> reporting
> I found in the logs of the datanode with incomplete block reporting that the 
> first FBR attempt failed, possibly due to namenode stress, and then a second 
> FBR attempt was made as follows:
> {code:java}
> 
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Unsuccessfully sent block report 0x6237a52c1e817e,  containing 12 storage 
> report(s), of which we sent 1. The reports had 1099057 total blocks and used 
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN 
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode: 
> Successfully sent block report 0x62382416f3f055,  containing 12 storage 
> report(s), of which we sent 12. The reports had 1099048 total blocks and used 
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN 
> processing. Got back no commands. {code}
> There's nothing wrong with that. Retry the send if it fails But on the 
> namenode side of the logic:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a disk was identified as the report is not the first time, namely 
> storageInfo. GetBlockReportCount > 0, Will remove the ticket from the 
> datanode, lead to a second report failed because no lease



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org