[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744882#comment-17744882 ]

ASF GitHub Bot commented on HDFS-17093:
---------------------------------------

Tre2878 commented on PR #5856:
URL: https://github.com/apache/hadoop/pull/5856#issuecomment-1643253493

@Hexiaoqiao OK, I've committed to the original branch.

> In the case of all datanodes sending FBR when the namenode restarts (large
> clusters), there is an issue with incomplete block reporting
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-17093
>                 URL: https://issues.apache.org/jira/browse/HDFS-17093
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: namenode
>    Affects Versions: 3.3.4
>            Reporter: Yanlei Yu
>            Priority: Minor
>              Labels: pull-request-available
>
> In our cluster of 800+ nodes, after restarting the namenode we found that
> some datanodes did not report all of their blocks, causing the namenode to
> stay in safe mode for a long time after the restart because of the
> incomplete block reports.
> In the logs of a datanode with an incomplete block report, I found that the
> first FBR attempt failed, possibly due to namenode load, and a second FBR
> attempt was then made:
> {code:java}
> 2023-07-17 11:29:28,982 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> Unsuccessfully sent block report 0x6237a52c1e817e, containing 12 storage
> report(s), of which we sent 1. The reports had 1099057 total blocks and used
> 1 RPC(s). This took 294 msec to generate and 101721 msecs for RPC and NN
> processing. Got back no commands.
> 2023-07-17 11:37:04,014 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> Successfully sent block report 0x62382416f3f055, containing 12 storage
> report(s), of which we sent 12. The reports had 1099048 total blocks and used
> 12 RPC(s). This took 295 msec to generate and 11647 msecs for RPC and NN
> processing. Got back no commands. {code}
> There is nothing wrong with that: the datanode retries the report if it
> fails. But on the namenode side, the logic is:
> {code:java}
> if (namesystem.isInStartupSafeMode()
>     && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
>     && storageInfo.getBlockReportCount() > 0) {
>   blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
>       + "discarded non-initial block report from {}"
>       + " because namenode still in startup phase",
>       strBlockReportId, fullBrLeaseId, nodeID);
>   blockReportLeaseManager.removeLease(node);
>   return !node.hasStaleStorages();
> } {code}
> When a storage is identified as having already reported, i.e.
> storageInfo.getBlockReportCount() > 0, the namenode removes the lease from
> the datanode, so the second report fails because no lease is left.
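To make the failure mode concrete, here is a toy, self-contained simulation of the check quoted above (plain Java, not Hadoop code; the single-lease model, the storage names, and every identifier are simplifying assumptions made for illustration):

{code:java}
import java.util.HashMap;
import java.util.Map;

/** Toy simulation of the lease problem described above (not Hadoop code). */
public class FbrLeaseSimulation {
  static Map<String, Integer> blockReportCount = new HashMap<>();
  static boolean leaseHeld = true;          // the datanode's single FBR lease
  static boolean inStartupSafeMode = true;  // namenode is restarting

  /** Mimics the processReport() check quoted above. */
  static boolean processStorageReport(String storage) {
    if (!leaseHeld) {
      return false; // report rejected: no lease left
    }
    if (inStartupSafeMode && blockReportCount.getOrDefault(storage, 0) > 0) {
      leaseHeld = false; // removeLease(node): strands all remaining storages
      return false;
    }
    blockReportCount.merge(storage, 1, Integer::sum);
    return true;
  }

  public static void main(String[] args) {
    // First FBR attempt: only storage s1 gets through before the RPC fails.
    processStorageReport("s1");
    // Second FBR attempt re-sends all 12 storages.
    for (int i = 1; i <= 12; i++) {
      boolean ok = processStorageReport("s" + i);
      System.out.println("s" + i + (ok ? " accepted" : " rejected"));
    }
  }
}
{code}

On the retry, the first already-counted storage trips the lease removal, and the remaining eleven storages are then refused for lack of a lease, matching the "second report failed because no lease" behavior described above.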
[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744861#comment-17744861 ]

Yanlei Yu commented on HDFS-17093:
----------------------------------

{quote}Please push the update to your same original branch (in this case Tre2878:HDFS-17093); DO NOT open a repeat pull request for the same issue.
{quote}
Sorry, I have now made the changes and committed them in the original branch.
[jira] [Commented] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes
[ https://issues.apache.org/jira/browse/HDFS-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744854#comment-17744854 ]

ASF GitHub Bot commented on HDFS-17094:
---------------------------------------

tomscut commented on PR #5854:
URL: https://github.com/apache/hadoop/pull/5854#issuecomment-1643035843

> @tomscut This PR can be cherry-picked to branch-3.3 smoothly. Please cherry-pick it directly if you judge that it also needs to be fixed in branch-3.3, rather than submitting another PR. Thanks.

OK, I have backported it to branch-3.3. I thought it would be safer to trigger Jenkins, but for this PR it's really not necessary. Thank you for your advice.

> EC: Fix bug in block recovery when there are stale datanodes
> ------------------------------------------------------------
>
>                 Key: HDFS-17094
>                 URL: https://issues.apache.org/jira/browse/HDFS-17094
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Shuyan Zhang
>            Assignee: Shuyan Zhang
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 3.4.0
>
> When a block recovery occurs, `RecoveryTaskStriped` in the datanode expects
> `rBlock.getLocations()` and `rBlock.getBlockIndices()` to be in one-to-one
> correspondence. However, if some locations are in a stale state when the
> NameNode handles the heartbeat, this correspondence is broken. In detail,
> there is no stale location in `recoveryLocations`, but the block indices
> array is still complete (i.e. it contains the indices of all the locations).
> This causes `BlockRecoveryWorker.RecoveryTaskStriped#recover` to generate a
> wrong internal block ID, and the corresponding datanode cannot find the
> replica, making the recovery process fail. This bug needs to be fixed.
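The invariant at stake can be shown with a small, self-contained sketch (plain Java, not the actual patch; `Location`, `filterStale`, and the sample datanode names are invented). When stale locations are dropped, the paired index array has to be filtered in lockstep; otherwise the surviving locations line up with the wrong indices and recovery derives the wrong internal block ID:

{code:java}
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only: keep locations and block indices paired. */
public class StripedRecoveryPairing {
  record Location(String datanode, boolean stale) {}

  /** Filters stale entries while preserving the one-to-one correspondence. */
  static void filterStale(List<Location> locs, List<Byte> indices,
                          List<Location> outLocs, List<Byte> outIndices) {
    for (int i = 0; i < locs.size(); i++) {
      if (!locs.get(i).stale()) {
        outLocs.add(locs.get(i));
        outIndices.add(indices.get(i)); // drop the index together with the location
      }
    }
  }

  public static void main(String[] args) {
    List<Location> locs = List.of(
        new Location("dn1", false),
        new Location("dn2", true),   // stale: must be skipped
        new Location("dn3", false));
    List<Byte> indices = List.of((byte) 0, (byte) 1, (byte) 2);

    List<Location> recoveryLocations = new ArrayList<>();
    List<Byte> recoveryIndices = new ArrayList<>();
    filterStale(locs, indices, recoveryLocations, recoveryIndices);

    // dn3 stays paired with index 2, not with dn2's index 1.
    System.out.println(recoveryLocations + " -> " + recoveryIndices);
  }
}
{code}

Printing the result shows dn3 paired with index 2; filtering only the location list, as in the bug, would instead pair dn3 with dn2's index 1.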
[jira] [Commented] (HDFS-17111) RBF: Optimize msync to only call nameservices that have observer reads enabled.
[ https://issues.apache.org/jira/browse/HDFS-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744850#comment-17744850 ]

ASF GitHub Bot commented on HDFS-17111:
---------------------------------------

hadoop-yetus commented on PR #5860:
URL: https://github.com/apache/hadoop/pull/5860#issuecomment-1643028418

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 56s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 50m 8s | | trunk passed |
| +1 :green_heart: | compile | 0m 42s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | compile | 0m 38s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | checkstyle | 0m 29s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 43s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 42s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 0m 30s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 1m 27s | | trunk passed |
| +1 :green_heart: | shadedclient | 38m 49s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 34s | | the patch passed |
| +1 :green_heart: | compile | 0m 34s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javac | 0m 34s | | the patch passed |
| +1 :green_heart: | compile | 0m 31s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | javac | 0m 31s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 18s | | hadoop-hdfs-project/hadoop-hdfs-rbf: The patch generated 0 new + 0 unchanged - 2 fixed = 0 total (was 2) |
| +1 :green_heart: | mvnsite | 0m 33s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 29s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 0m 23s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 1m 23s | | the patch passed |
| +1 :green_heart: | shadedclient | 40m 1s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 21m 27s | | hadoop-hdfs-rbf in the patch passed. |
| +1 :green_heart: | asflicense | 0m 37s | | The patch does not generate ASF License warnings. |
| | | | 166m 27s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/6/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5860 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 56fb1db7c75c 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 6e81175b5dfc13c8f1bc09604bd08b02023e1d06 |
| Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/6/testReport/ |
| Max. process+thread count | 2609 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project/hadoop-hdfs-rbf |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/6/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 |
[jira] [Commented] (HDFS-17111) RBF: Optimize msync to only call nameservices that have observer reads enabled.
[ https://issues.apache.org/jira/browse/HDFS-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744845#comment-17744845 ]

ASF GitHub Bot commented on HDFS-17111:
---------------------------------------

hadoop-yetus commented on PR #5860:
URL: https://github.com/apache/hadoop/pull/5860#issuecomment-1643006475

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 49s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 50m 13s | | trunk passed |
| +1 :green_heart: | compile | 0m 53s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | compile | 0m 37s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | checkstyle | 0m 31s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 42s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 41s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 0m 30s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 1m 26s | | trunk passed |
| +1 :green_heart: | shadedclient | 39m 3s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 36s | | the patch passed |
| +1 :green_heart: | compile | 0m 35s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javac | 0m 35s | | the patch passed |
| +1 :green_heart: | compile | 0m 29s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | javac | 0m 29s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 18s | | hadoop-hdfs-project/hadoop-hdfs-rbf: The patch generated 0 new + 0 unchanged - 2 fixed = 0 total (was 2) |
| +1 :green_heart: | mvnsite | 0m 32s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 29s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 0m 23s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 1m 22s | | the patch passed |
| +1 :green_heart: | shadedclient | 39m 8s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| -1 :x: | unit | 21m 48s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt) | hadoop-hdfs-rbf in the patch passed. |
| +1 :green_heart: | asflicense | 0m 36s | | The patch does not generate ASF License warnings. |
| | | | 165m 46s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.server.federation.router.TestRouter |
| | hadoop.hdfs.server.federation.fairness.TestRouterRefreshFairnessPolicyController |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/5/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5860 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 16b73b494aa3 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 732691ee8552a3ad24e7642266a74c05321efbae |
| Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| Test Results |
[jira] [Commented] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes
[ https://issues.apache.org/jira/browse/HDFS-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744839#comment-17744839 ]

ASF GitHub Bot commented on HDFS-17094:
---------------------------------------

Hexiaoqiao commented on PR #5854:
URL: https://github.com/apache/hadoop/pull/5854#issuecomment-1642997064

@tomscut This PR can be cherry-picked to branch-3.3 smoothly. Please cherry-pick it directly if you judge that it also needs to be fixed in branch-3.3, rather than submitting another PR. Thanks.
[jira] [Commented] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes
[ https://issues.apache.org/jira/browse/HDFS-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744838#comment-17744838 ]

ASF GitHub Bot commented on HDFS-17094:
---------------------------------------

zhangshuyan0 commented on PR #5854:
URL: https://github.com/apache/hadoop/pull/5854#issuecomment-1642992778

> @zhangshuyan0 Could you please backport this to branch-3.3? Thanks!

OK, I'll do this later.
[jira] [Commented] (HDFS-17092) Datanode Full Block Report failed can lead to missing and under replicated blocks
[ https://issues.apache.org/jira/browse/HDFS-17092?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744836#comment-17744836 ]

Tao Li commented on HDFS-17092:
-------------------------------

Seems to be a duplicate of HDFS-17093.

> Datanode Full Block Report failed can lead to missing and under replicated
> blocks
> ---------------------------------------------------------------------------
>
>                 Key: HDFS-17092
>                 URL: https://issues.apache.org/jira/browse/HDFS-17092
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: datanode
>            Reporter: microle.dong
>            Priority: Major
>
> When restarting the namenode, we found that some datanodes did not report
> all of their blocks, which can lead to missing and under-replicated blocks.
> A datanode uses multiple RPCs to send its block report. In the logs of a
> datanode with an incomplete block report, I found that the first FBR attempt
> failed due to a namenode error:
> {code:java}
> 2023-07-14 17:29:24,776 INFO org.apache.hadoop.hdfs.server.datanode.DataNode:
> Unsuccessfully sent block report 0x7b738b02996cd2, containing 12 storage
> report(s), of which we sent 1. The reports had 633013 total blocks and used 1
> RPC(s). This took 234 msec to generate and 98739 msecs for RPC and NN
> processing. Got back no commands.
> 2023-07-14 17:29:24,776 WARN org.apache.hadoop.hdfs.server.datanode.DataNode:
> IOException in offerService
> java.net.SocketTimeoutException: Call From x.x.x.x/x.x.x.x to x.x.x.x:9002
> failed on socket timeout exception: java.net.SocketTimeoutException: 6
> millis timeout while waiting for channel to be ready for read. ch :
> java.nio.channels.SocketChannel[connected local=/x.x.x.x:13868
> remote=x.x.x.x/x.x.x.x:9002]; For more details see:
> http://wiki.apache.org/hadoop/SocketTimeout
> at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
> at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62)
> at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45)
> at java.lang.reflect.Constructor.newInstance(Constructor.java:422)
> at org.apache.hadoop.net.NetUtils.wrapWithMessage(NetUtils.java:863)
> at org.apache.hadoop.net.NetUtils.wrapException(NetUtils.java:822)
> at org.apache.hadoop.ipc.Client.call(Client.java:1480)
> at org.apache.hadoop.ipc.Client.call(Client.java:1413)
> at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:229)
> at com.sun.proxy.$Proxy14.blockReport(Unknown Source)
> at org.apache.hadoop.hdfs.protocolPB.DatanodeProtocolClientSideTranslatorPB.blockReport(DatanodeProtocolClientSideTranslatorPB.java:205)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.blockReport(BPServiceActor.java:333)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.offerService(BPServiceActor.java:572)
> at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:706)
> at java.lang.Thread.run(Thread.java:745){code}
> The datanode's second FBR will use the same lease, which makes the namenode
> remove the datanode's lease (just as in HDFS-8930), causing the remaining
> FBR RPCs to fail because no lease is left.
> We should request a new lease and try again when a datanode FBR fails.
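As a concrete illustration of that suggestion, here is a minimal, self-contained sketch of a retry loop that obtains a fresh lease before every FBR attempt instead of reusing the failed one (plain Java, not the actual Hadoop code or patch; `NameNodeStub`, `requestBlockReportLease`, and `fullBlockReport` are invented names):

{code:java}
import java.util.concurrent.ThreadLocalRandom;

/** Toy sketch of the retry policy suggested above (not the Hadoop patch). */
public class FbrRetrySketch {
  interface NameNodeStub {
    long requestBlockReportLease();          // assumption: grants a new lease id
    void fullBlockReport(long leaseId) throws Exception;
  }

  static void reportWithRetry(NameNodeStub nn, int maxAttempts) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      long leaseId = nn.requestBlockReportLease(); // fresh lease per attempt
      try {
        nn.fullBlockReport(leaseId);
        return; // success
      } catch (Exception e) {
        System.out.println("FBR attempt " + attempt + " failed: " + e.getMessage());
      }
    }
    throw new IllegalStateException("FBR failed after " + maxAttempts + " attempts");
  }

  public static void main(String[] args) {
    // Stub namenode that fails the first attempt, mimicking the log above.
    NameNodeStub nn = new NameNodeStub() {
      int calls = 0;
      public long requestBlockReportLease() {
        return ThreadLocalRandom.current().nextLong();
      }
      public void fullBlockReport(long leaseId) throws Exception {
        if (++calls == 1) throw new Exception("socket timeout");
      }
    };
    reportWithRetry(nn, 3);
    System.out.println("FBR succeeded");
  }
}
{code}

The stub fails the first attempt, mimicking the socket timeout in the log above; the second attempt is made under a newly granted lease rather than the one the namenode may already have removed.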
[jira] [Updated] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yanlei Yu updated HDFS-17093:
-----------------------------
    Attachment: (was: HDFS-17093.patch)
[jira] [Commented] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes
[ https://issues.apache.org/jira/browse/HDFS-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744835#comment-17744835 ]

ASF GitHub Bot commented on HDFS-17094:
---------------------------------------

tomscut commented on PR #5854:
URL: https://github.com/apache/hadoop/pull/5854#issuecomment-1642968772

@zhangshuyan0 Could you please backport this to branch-3.3? Thanks!
[jira] [Resolved] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes
[ https://issues.apache.org/jira/browse/HDFS-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tao Li resolved HDFS-17094.
---------------------------
    Fix Version/s: 3.4.0
       Resolution: Fixed
[jira] [Commented] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes
[ https://issues.apache.org/jira/browse/HDFS-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744834#comment-17744834 ]

ASF GitHub Bot commented on HDFS-17094:
---------------------------------------

tomscut commented on PR #5854:
URL: https://github.com/apache/hadoop/pull/5854#issuecomment-1642967557

Thanks @zhangshuyan0 for your contribution! Thanks @Hexiaoqiao for your review!
[jira] [Commented] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes
[ https://issues.apache.org/jira/browse/HDFS-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744833#comment-17744833 ]

ASF GitHub Bot commented on HDFS-17094:
---------------------------------------

tomscut merged PR #5854:
URL: https://github.com/apache/hadoop/pull/5854
[jira] [Commented] (HDFS-17111) RBF: Optimize msync to only call nameservices that have observer reads enabled.
[ https://issues.apache.org/jira/browse/HDFS-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744820#comment-17744820 ]

ASF GitHub Bot commented on HDFS-17111:
---------------------------------------

hadoop-yetus commented on PR #5860:
URL: https://github.com/apache/hadoop/pull/5860#issuecomment-1642893412

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 1m 12s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 2s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 2s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 51m 48s | | trunk passed |
| +1 :green_heart: | compile | 0m 55s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | compile | 0m 45s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | checkstyle | 0m 32s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 47s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 47s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 0m 33s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 1m 35s | | trunk passed |
| +1 :green_heart: | shadedclient | 40m 4s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 33s | | the patch passed |
| +1 :green_heart: | compile | 0m 35s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javac | 0m 35s | | the patch passed |
| +1 :green_heart: | compile | 0m 30s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | javac | 0m 30s | | the patch passed |
| +1 :green_heart: | blanks | 0m 1s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 0m 18s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/3/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt) | hadoop-hdfs-project/hadoop-hdfs-rbf: The patch generated 2 new + 2 unchanged - 0 fixed = 4 total (was 2) |
| +1 :green_heart: | mvnsite | 0m 33s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 29s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 0m 23s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 1m 25s | | the patch passed |
| +1 :green_heart: | shadedclient | 39m 21s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| -1 :x: | unit | 22m 47s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/3/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt) | hadoop-hdfs-rbf in the patch passed. |
| +1 :green_heart: | asflicense | 0m 37s | | The patch does not generate ASF License warnings. |
| | | | 170m 45s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.server.federation.router.TestRouter |
| | hadoop.hdfs.server.federation.fairness.TestRouterRefreshFairnessPolicyController |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/3/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5860 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 65e197c55b1e 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 350292a609d80aaee0ad44de4dd0e1b8393a63fa |
| Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| Multi-JDK versions |
[jira] [Commented] (HDFS-17111) RBF: Optimize msync to only call nameservices that have observer reads enabled.
[ https://issues.apache.org/jira/browse/HDFS-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744815#comment-17744815 ]

ASF GitHub Bot commented on HDFS-17111:
---------------------------------------

hadoop-yetus commented on PR #5860:
URL: https://github.com/apache/hadoop/pull/5860#issuecomment-1642886100

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 39s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 46m 1s | | trunk passed |
| +1 :green_heart: | compile | 0m 43s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | compile | 0m 42s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | checkstyle | 0m 36s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 47s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 49s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 0m 37s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 1m 29s | | trunk passed |
| +1 :green_heart: | shadedclient | 33m 36s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 36s | | the patch passed |
| +1 :green_heart: | compile | 0m 35s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javac | 0m 35s | | the patch passed |
| +1 :green_heart: | compile | 0m 33s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | javac | 0m 33s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 0m 21s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/4/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt) | hadoop-hdfs-project/hadoop-hdfs-rbf: The patch generated 2 new + 2 unchanged - 0 fixed = 4 total (was 2) |
| +1 :green_heart: | mvnsite | 0m 35s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 32s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 0m 26s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 1m 23s | | the patch passed |
| +1 :green_heart: | shadedclient | 34m 16s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| -1 :x: | unit | 21m 32s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/4/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt) | hadoop-hdfs-rbf in the patch passed. |
| +1 :green_heart: | asflicense | 0m 42s | | The patch does not generate ASF License warnings. |
| | | | 152m 7s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.server.federation.router.TestRouterRPCMultipleDestinationMountTableResolver |
| | hadoop.hdfs.server.federation.fairness.TestRouterRefreshFairnessPolicyController |
| | hadoop.hdfs.server.federation.router.TestRouter |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/4/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5860 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 767001fb66ff 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 66a10a183bc0b6eb1d669cc0854ba8dd0bff1154 |
| Default Java | Private
[jira] [Commented] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes
[ https://issues.apache.org/jira/browse/HDFS-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744814#comment-17744814 ]

ASF GitHub Bot commented on HDFS-17094:
---------------------------------------

hadoop-yetus commented on PR #5854:
URL: https://github.com/apache/hadoop/pull/5854#issuecomment-1642880725

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 42s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 49m 27s | | trunk passed |
| +1 :green_heart: | compile | 1m 27s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | compile | 1m 21s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | checkstyle | 1m 14s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 29s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 12s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 1m 38s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 3m 20s | | trunk passed |
| +1 :green_heart: | shadedclient | 35m 43s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 16s | | the patch passed |
| +1 :green_heart: | compile | 1m 12s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javac | 1m 12s | | the patch passed |
| +1 :green_heart: | compile | 1m 11s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | javac | 1m 11s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 1s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 16s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 57s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 1m 31s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 3m 14s | | the patch passed |
| +1 :green_heart: | shadedclient | 36m 2s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 215m 51s | | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 58s | | The patch does not generate ASF License warnings. |
| | | | 362m 17s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5854/3/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5854 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 3913e9b84c85 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 446ddffc53cb891e0a410bd76a6864666f22ff11 |
| Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5854/3/testReport/ |
| Max. process+thread count | 3028 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5854/3/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.
[jira] [Commented] (HDFS-17111) RBF: Optimize msync to only call nameservices that have observer reads enabled.
[ https://issues.apache.org/jira/browse/HDFS-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744813#comment-17744813 ]

ASF GitHub Bot commented on HDFS-17111:
---------------------------------------

hadoop-yetus commented on PR #5860:
URL: https://github.com/apache/hadoop/pull/5860#issuecomment-1642870547

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 38s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 47m 2s | | trunk passed |
| +1 :green_heart: | compile | 0m 43s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | compile | 0m 44s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | checkstyle | 0m 35s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 46s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 49s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 0m 37s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 1m 28s | | trunk passed |
| +1 :green_heart: | shadedclient | 34m 4s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 36s | | the patch passed |
| +1 :green_heart: | compile | 0m 34s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javac | 0m 34s | | the patch passed |
| +1 :green_heart: | compile | 0m 31s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | javac | 0m 31s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 0m 20s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/2/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt) | hadoop-hdfs-project/hadoop-hdfs-rbf: The patch generated 2 new + 2 unchanged - 0 fixed = 4 total (was 2) |
| +1 :green_heart: | mvnsite | 0m 34s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 31s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 0m 26s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 1m 23s | | the patch passed |
| +1 :green_heart: | shadedclient | 34m 36s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| -1 :x: | unit | 21m 6s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs-rbf.txt) | hadoop-hdfs-rbf in the patch passed. |
| +1 :green_heart: | asflicense | 0m 40s | | The patch does not generate ASF License warnings. |
| | | | 153m 4s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.server.federation.router.TestRouter |
| | hadoop.hdfs.server.federation.fairness.TestRouterRefreshFairnessPolicyController |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5860 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 2057df3c064a 4.15.0-213-generic #224-Ubuntu SMP Mon Jun 19 13:30:12 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / d5e218d17f658bafb0005e39e815ea9e8fc24bb5 |
| Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| Multi-JDK versions |
[jira] [Commented] (HDFS-17111) RBF: Optimize msync to only call nameservices that have observer reads enabled.
[ https://issues.apache.org/jira/browse/HDFS-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17744796#comment-17744796 ]

ASF GitHub Bot commented on HDFS-17111:
---------------------------------------

hadoop-yetus commented on PR #5860:
URL: https://github.com/apache/hadoop/pull/5860#issuecomment-1642792350

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 18m 19s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 49m 48s | | trunk passed |
| +1 :green_heart: | compile | 0m 42s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | compile | 0m 36s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | checkstyle | 0m 30s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 42s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 42s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 0m 30s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 1m 26s | | trunk passed |
| +1 :green_heart: | shadedclient | 39m 37s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 33s | | the patch passed |
| +1 :green_heart: | compile | 0m 34s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javac | 0m 34s | | the patch passed |
| +1 :green_heart: | compile | 0m 29s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | javac | 0m 29s | | the patch passed |
| +1 :green_heart: | blanks | 0m 1s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 0m 19s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/1/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt) | hadoop-hdfs-project/hadoop-hdfs-rbf: The patch generated 1 new + 2 unchanged - 0 fixed = 3 total (was 2) |
| +1 :green_heart: | mvnsite | 0m 33s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 29s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 0m 23s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 1m 24s | | the patch passed |
| +1 :green_heart: | shadedclient | 39m 13s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 21m 39s | | hadoop-hdfs-rbf in the patch passed. |
| +1 :green_heart: | asflicense | 0m 36s | | The patch does not generate ASF License warnings. |
| | | | 183m 7s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5860 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux c67ae1fedf74 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 75f439c08cac2b8f4d7d79ed882b5e165d75b55d |
| Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5860/1/testReport/ |
| Max. process+thread count | 2194 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project/hadoop-hdfs-rbf |
[jira] [Updated] (HDFS-17111) RBF: Optimize msync to only call nameservices that have observer reads enabled.
[ https://issues.apache.org/jira/browse/HDFS-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Simbarashe Dzinamarira updated HDFS-17111:
------------------------------------------
    Summary: RBF: Optimize msync to only call nameservices that have observer reads enabled.  (was: RBF: Optimize msync to only call nameservices with observer namenodes.)

> RBF: Optimize msync to only call nameservices that have observer reads
> enabled.
> -----------------------------------------------------------------------
>
>                 Key: HDFS-17111
>                 URL: https://issues.apache.org/jira/browse/HDFS-17111
>             Project: Hadoop HDFS
>          Issue Type: Bug
>            Reporter: Simbarashe Dzinamarira
>            Assignee: Simbarashe Dzinamarira
>            Priority: Major
>              Labels: pull-request-available
>
> Right now, when a client msyncs to the router, the call is fanned out to all
> nameservices. We only need to proxy the msync to nameservices that have
> observer reads configured.
> We can do this either by adding a new config that lets the admin specify
> which nameservices have CRS configured, or by trying to detect these
> automatically.
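A minimal sketch of the proposed routing change (plain Java, not the actual router code; the `OBSERVER_READS_ENABLED` map stands in for whatever config flag or auto-detection the patch ends up using): the router computes the msync fan-out set by filtering the known nameservices against the observer-reads flag.

{code:java}
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

/** Toy sketch of the msync fan-out optimization described above. */
public class MsyncFanoutSketch {
  // Assumption: the router learns this set from config or auto-detection.
  static final Map<String, Boolean> OBSERVER_READS_ENABLED = Map.of(
      "ns0", true,
      "ns1", false,
      "ns2", true);

  /** Only nameservices with observer reads enabled need the msync call. */
  static List<String> msyncTargets(List<String> allNameservices) {
    return allNameservices.stream()
        .filter(ns -> OBSERVER_READS_ENABLED.getOrDefault(ns, false))
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    // Instead of calling msync on ns0, ns1 and ns2, only ns0 and ns2 are hit.
    System.out.println(msyncTargets(List.of("ns0", "ns1", "ns2")));
  }
}
{code}

Whether the map is populated from an explicit admin config or detected automatically is exactly the open question in the description; the filtering step is the same either way.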
[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744762#comment-17744762 ] ASF GitHub Bot commented on HDFS-17093: --- hadoop-yetus commented on PR #5856: URL: https://github.com/apache/hadoop/pull/5856#issuecomment-1642638647 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 49s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 52m 27s | | trunk passed | | +1 :green_heart: | compile | 1m 25s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 | | +1 :green_heart: | compile | 1m 15s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 | | +1 :green_heart: | checkstyle | 1m 11s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 23s | | trunk passed | | +1 :green_heart: | javadoc | 1m 11s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 | | +1 :green_heart: | javadoc | 1m 38s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 | | +1 :green_heart: | spotbugs | 3m 25s | | trunk passed | | +1 :green_heart: | shadedclient | 41m 6s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 14s | | the patch passed | | +1 :green_heart: | compile | 1m 17s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 | | +1 :green_heart: | javac | 1m 17s | | the patch passed | | +1 :green_heart: | compile | 1m 8s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 | | +1 :green_heart: | javac | 1m 8s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 1m 3s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 18s | | the patch passed | | -1 :x: | javadoc | 0m 57s | [/patch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5856/1/artifact/out/patch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1.txt) | hadoop-hdfs in the patch failed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1. | | +1 :green_heart: | javadoc | 1m 29s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 | | +1 :green_heart: | spotbugs | 3m 26s | | the patch passed | | +1 :green_heart: | shadedclient | 41m 7s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 251m 21s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5856/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 50s | | The patch does not generate ASF License warnings. 
| | | | 411m 12s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.TestRollingUpgrade | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5856/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5856 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux dffc1d606d5b 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 757c68f18d3b5ff89cf750a1e02116dd86ff07b2 | | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private
[jira] [Commented] (HDFS-17042) Add rpcCallSuccesses and OverallRpcProcessingTime to RpcMetrics for Namenode
[ https://issues.apache.org/jira/browse/HDFS-17042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744753#comment-17744753 ] ASF GitHub Bot commented on HDFS-17042: --- goiri merged PR #5804: URL: https://github.com/apache/hadoop/pull/5804 > Add rpcCallSuccesses and OverallRpcProcessingTime to RpcMetrics for Namenode > > > Key: HDFS-17042 > URL: https://issues.apache.org/jira/browse/HDFS-17042 > Project: Hadoop HDFS > Issue Type: Improvement > Components: hdfs >Affects Versions: 3.4.0, 3.3.9 >Reporter: Xing Lin >Assignee: Xing Lin >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > We'd like to add two new types of metrics to the existing NN > RpcMetrics/RpcDetailedMetrics. These two metrics can then be used as part of > SLA/SLO for the HDFS service. > * {_}RpcCallSuccesses{_}: it measures the number of RPC requests where they > are successfully processed by a NN (e.g., with a response with an RpcStatus > {_}RpcStatusProto.SUCCESS){_}{_}.{_} Then, together with {_}RpcQueueNumOps > ({_}which refers the total number of RPC requests{_}){_}, we can derive the > RpcErrorRate for our NN, as (RpcQueueNumOps - RpcCallSuccesses) / > RpcQueueNumOps. > * OverallRpcProcessingTime for each RPC method: this metric measures the > overall RPC processing time for each RPC method at the NN. It covers the time > from when a request arrives at the NN to when a response is sent back. We are > already emitting processingTime for each RPC method today in > RpcDetailedMetrics. We want to extend it to emit overallRpcProcessingTime for > each RPC method, which includes enqueueTime, queueTime, processingTime, > responseTime, and handlerTime. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
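To make the error-rate derivation above concrete, here is a minimal sketch computing it from the two counters; the sample values and variable names are illustrative only, not part of the Hadoop API:
{code:java}
// RpcErrorRate = (RpcQueueNumOps - RpcCallSuccesses) / RpcQueueNumOps
long rpcQueueNumOps = 1_000_000L;   // total RPCs received (sample value scraped from JMX)
long rpcCallSuccesses = 999_000L;   // RPCs answered with RpcStatusProto.SUCCESS (sample value)

double rpcErrorRate = rpcQueueNumOps == 0
    ? 0.0
    : (double) (rpcQueueNumOps - rpcCallSuccesses) / rpcQueueNumOps;
System.out.printf("RpcErrorRate = %.4f%n", rpcErrorRate);  // prints 0.0010 for these samples
{code}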
[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744751#comment-17744751 ] Xing Lin commented on HDFS-17093: - FYI, we set dfs.namenode.max.full.block.report.leases = 6, even though our clusters run at about 10k DNs each. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
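For reference, a hedged sketch of setting the lease knobs mentioned above programmatically; both keys are standard HDFS configuration, though most deployments would set them in hdfs-site.xml instead:
{code:java}
import org.apache.hadoop.conf.Configuration;

Configuration conf = new Configuration();
// Number of concurrent full-block-report leases the NN hands out (default 6).
conf.setInt("dfs.namenode.max.full.block.report.leases", 6);
// How long a granted FBR lease stays valid before it expires (default 5 minutes).
conf.setLong("dfs.namenode.full.block.report.lease.length.ms", 300_000L);
{code}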
[jira] [Updated] (HDFS-17111) RBF: Optimize msync to only call nameservices with observer namenodes.
[ https://issues.apache.org/jira/browse/HDFS-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDFS-17111: -- Labels: pull-request-available (was: ) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-17111) RBF: Optimize msync to only call nameservices with observer namenodes.
[ https://issues.apache.org/jira/browse/HDFS-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744750#comment-17744750 ] ASF GitHub Bot commented on HDFS-17111: --- simbadzina opened a new pull request, #5860: URL: https://github.com/apache/hadoop/pull/5860 HDFS-17111. RBF: Optimize msync to only call nameservices with observer namenodes. ### Description of PR Routers only need to msync to nameservices that have CRS configured. I'm still considering whether to just use a static configuration instead of trying to automatically identify the nameservices to msync to. ### How was this patch tested? New unit test. ### For code changes: - [ X] Does the title or this PR starts with the corresponding JIRA issue id (e.g. 'HADOOP-17799. Your PR title ...')? -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
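One way to picture the static-configuration option weighed in the PR description above is the following hypothetical sketch; the config key, class, and method names here are invented for illustration and are not the actual RBF API:
{code:java}
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.conf.Configuration;

public class MsyncTargetFilter {
  // Hypothetical key an admin could use to list observer-enabled nameservices;
  // the real patch may auto-detect observers instead.
  static final String OBSERVER_NS_KEY = "dfs.federation.router.observer.nameservices";

  static Set<String> msyncTargets(Configuration conf, Set<String> allNameservices) {
    String[] enabled = conf.getTrimmedStrings(OBSERVER_NS_KEY);
    if (enabled.length == 0) {
      // Fall back to current behavior: fan msync out to every nameservice.
      return allNameservices;
    }
    Set<String> targets = new HashSet<>(Arrays.asList(enabled));
    targets.retainAll(allNameservices);  // ignore unknown entries
    return targets;
  }
}
{code}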
[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744749#comment-17744749 ] Xing Lin commented on HDFS-17093: - {quote}[~xinglin] ,I think you modify some more reasonable, datanode separate disk operation should be processed in the final set to perform blockReportLeaseManager. RemoveLease (node); return ! node.hasStaleStorages(); This is all at the datanode level {quote} Not sure I understand what you said here. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-17111) RBF: Optimize msync to only call nameservices with observer namenodes.
[ https://issues.apache.org/jira/browse/HDFS-17111?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Simbarashe Dzinamarira reassigned HDFS-17111: - Assignee: Simbarashe Dzinamarira -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-17111) RBF: Optimize msync to only call nameservices with observer namenodes.
Simbarashe Dzinamarira created HDFS-17111: - Summary: RBF: Optimize msync to only call nameservices with observer namenodes. Key: HDFS-17111 URL: https://issues.apache.org/jira/browse/HDFS-17111 Project: Hadoop HDFS Issue Type: Bug Reporter: Simbarashe Dzinamarira Right now when a client MSYNCs to the router, the call is fanned out to all nameservices. We only need to proxy the msync to nameservices that have observer reads configured. We can do this either by adding a new config for the admin to specify which nameservices have CRS configured, or we can try to automatically detect these. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-15042) Add more tests for ByteBufferPositionedReadable
[ https://issues.apache.org/jira/browse/HDFS-15042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744741#comment-17744741 ] ASF GitHub Bot commented on HDFS-15042: --- mukund-thakur commented on code in PR #1747: URL: https://github.com/apache/hadoop/pull/1747#discussion_r956411939 ## hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java: ## @@ -1684,6 +1685,9 @@ public int read(long position, final ByteBuffer buf) throws IOException { @Override public void readFully(long position, final ByteBuffer buf) throws IOException { +if (position < 0) { + throw new EOFException(NEGATIVE_POSITION_READ); +} Review Comment: Yeah I would also not want to change this. > Add more tests for ByteBufferPositionedReadable > > > Key: HDFS-15042 > URL: https://issues.apache.org/jira/browse/HDFS-15042 > Project: Hadoop HDFS > Issue Type: Improvement > Components: fs, test >Affects Versions: 3.3.0 >Reporter: Steve Loughran >Assignee: Steve Loughran >Priority: Major > Labels: pull-request-available > Time Spent: 1.5h > Remaining Estimate: 0h > > There's a few corner cases of ByteBufferPositionedReadable which need to be > tested, mainly illegal read positions. Add them -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes
[ https://issues.apache.org/jira/browse/HDFS-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744715#comment-17744715 ] ASF GitHub Bot commented on HDFS-17094: --- hadoop-yetus commented on PR #5854: URL: https://github.com/apache/hadoop/pull/5854#issuecomment-1642442293 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 42s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 52m 58s | | trunk passed | | +1 :green_heart: | compile | 1m 42s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 | | +1 :green_heart: | compile | 1m 29s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 | | +1 :green_heart: | checkstyle | 1m 23s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 40s | | trunk passed | | +1 :green_heart: | javadoc | 1m 21s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 | | +1 :green_heart: | javadoc | 2m 0s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 | | +1 :green_heart: | spotbugs | 4m 5s | | trunk passed | | -1 :x: | shadedclient | 42m 13s | | branch has errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | -1 :x: | mvninstall | 0m 23s | [/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5854/2/artifact/out/patch-mvninstall-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch failed. | | +1 :green_heart: | compile | 1m 32s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 | | +1 :green_heart: | javac | 1m 32s | | the patch passed | | +1 :green_heart: | compile | 1m 26s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 | | +1 :green_heart: | javac | 1m 26s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 1m 13s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 30s | | the patch passed | | +1 :green_heart: | javadoc | 1m 6s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 | | +1 :green_heart: | javadoc | 1m 36s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 | | +1 :green_heart: | spotbugs | 3m 48s | | the patch passed | | +1 :green_heart: | shadedclient | 36m 38s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 223m 42s | | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 58s | | The patch does not generate ASF License warnings. 
| | | | 382m 58s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5854/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5854 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 79c529060291 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 446ddffc53cb891e0a410bd76a6864666f22ff11 | | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5854/2/testReport/ | | Max. process+thread count | 3594 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5854/2/console | | versions |
[jira] [Commented] (HDFS-15042) Add more tests for ByteBufferPositionedReadable
[ https://issues.apache.org/jira/browse/HDFS-15042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744709#comment-17744709 ] ASF GitHub Bot commented on HDFS-15042: --- steveloughran commented on code in PR #1747: URL: https://github.com/apache/hadoop/pull/1747#discussion_r1268335189 ## hadoop-hdfs-project/hadoop-hdfs/src/test/java/org/apache/hadoop/hdfs/TestByteBufferPread.java: ## @@ -161,130 +229,264 @@ private void testPreadWithFullByteBuffer(ByteBuffer buffer) * {@link ByteBuffer#limit()} on the buffer. Validates that only half of the * testFile is loaded into the buffer. */ - private void testPreadWithLimitedByteBuffer( - ByteBuffer buffer) throws IOException { + @Test + public void testPreadWithLimitedByteBuffer() throws IOException { int bytesRead; int totalBytesRead = 0; // Set the buffer limit to half the size of the file -buffer.limit(FILE_SIZE / 2); +buffer.limit(HALF_SIZE); try (FSDataInputStream in = fs.open(testFile)) { + in.seek(EOF_POS); while ((bytesRead = in.read(totalBytesRead, buffer)) > 0) { totalBytesRead += bytesRead; // Check that each call to read changes the position of the ByteBuffer // correctly -assertEquals(totalBytesRead, buffer.position()); +assertBufferPosition(totalBytesRead); } // Since we set the buffer limit to half the size of the file, we should // have only read half of the file into the buffer - assertEquals(totalBytesRead, FILE_SIZE / 2); + assertEquals(HALF_SIZE, totalBytesRead); // Check that the buffer is full and the contents equal the first half of // the file - assertFalse(buffer.hasRemaining()); - buffer.position(0); - byte[] bufferContents = new byte[FILE_SIZE / 2]; - buffer.get(bufferContents); - assertArrayEquals(bufferContents, - Arrays.copyOfRange(fileContents, 0, FILE_SIZE / 2)); + assertBufferIsFull(); + assertBufferEqualsFileContents(0, HALF_SIZE, 0); + + // position hasn't changed + assertStreamPosition(in, EOF_POS); } } /** * Reads half of the testFile into the {@link ByteBuffer} by setting the * {@link ByteBuffer#position()} the half the size of the file. Validates that * only half of the testFile is loaded into the buffer. + * + * This test interleaves reading from the stream by the classic input + * stream API, verifying those bytes are also as expected. + * This lets us validate the requirement that these positions reads must + * not interfere with the conventional read sequence. */ - private void testPreadWithPositionedByteBuffer( - ByteBuffer buffer) throws IOException { + @Test + public void testPreadWithPositionedByteBuffer() throws IOException { int bytesRead; int totalBytesRead = 0; // Set the buffer position to half the size of the file -buffer.position(FILE_SIZE / 2); +buffer.position(HALF_SIZE); +int counter = 0; try (FSDataInputStream in = fs.open(testFile)) { + assertEquals("Byte read from stream", + fileContents[counter++], in.read()); while ((bytesRead = in.read(totalBytesRead, buffer)) > 0) { totalBytesRead += bytesRead; // Check that each call to read changes the position of the ByteBuffer // correctly -assertEquals(totalBytesRead + FILE_SIZE / 2, buffer.position()); +assertBufferPosition(totalBytesRead + HALF_SIZE); +// read the next byte. 
+assertEquals("Byte read from stream", +fileContents[counter++], in.read()); } // Since we set the buffer position to half the size of the file, we // should have only read half of the file into the buffer - assertEquals(totalBytesRead, FILE_SIZE / 2); + assertEquals("bytes read", + HALF_SIZE, totalBytesRead); // Check that the buffer is full and the contents equal the first half of // the file - assertFalse(buffer.hasRemaining()); - buffer.position(FILE_SIZE / 2); - byte[] bufferContents = new byte[FILE_SIZE / 2]; - buffer.get(bufferContents); - assertArrayEquals(bufferContents, - Arrays.copyOfRange(fileContents, 0, FILE_SIZE / 2)); + assertBufferIsFull(); + assertBufferEqualsFileContents(HALF_SIZE, HALF_SIZE, 0); } } + /** + * Assert the buffer ranges matches that in the file. + * @param bufferPosition buffer position + * @param length length of data to check + * @param fileOffset offset in file. + */ + private void assertBufferEqualsFileContents(int bufferPosition, + int length, + int fileOffset) { +buffer.position(bufferPosition); +byte[] bufferContents = new byte[length]; +buffer.get(bufferContents); +assertArrayEquals( +"Buffer data from [" +
[jira] [Commented] (HDFS-15042) Add more tests for ByteBufferPositionedReadable
[ https://issues.apache.org/jira/browse/HDFS-15042?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744708#comment-17744708 ] ASF GitHub Bot commented on HDFS-15042: --- steveloughran commented on code in PR #1747: URL: https://github.com/apache/hadoop/pull/1747#discussion_r1268334625 ## hadoop-hdfs-project/hadoop-hdfs-client/src/main/java/org/apache/hadoop/hdfs/DFSInputStream.java: ## @@ -1684,6 +1685,9 @@ public int read(long position, final ByteBuffer buf) throws IOException { @Override public void readFully(long position, final ByteBuffer buf) throws IOException { +if (position < 0) { + throw new EOFException(NEGATIVE_POSITION_READ); +} Review Comment: ok -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
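For readers following this thread, the contract under review can be pictured as the small usage sketch below, written as if inside a test method; fs and testFile are assumed to exist already, and the assertion helpers are JUnit's:
{code:java}
import java.io.EOFException;
import java.nio.ByteBuffer;
import org.apache.hadoop.fs.FSDataInputStream;
import static org.junit.Assert.assertEquals;
import static org.junit.Assert.fail;

try (FSDataInputStream in = fs.open(testFile)) {
  ByteBuffer buf = ByteBuffer.allocate(16);
  try {
    // A negative position must be rejected before any bytes are read.
    in.readFully(-1L, buf);
    fail("expected EOFException for a negative read position");
  } catch (EOFException expected) {
    assertEquals(0, buf.position());  // buffer must be left untouched
  }
}
{code}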
[jira] [Updated] (HDFS-17110) Null Pointer Exception when running TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort
[ https://issues.apache.org/jira/browse/HDFS-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ConfX updated HDFS-17110: - Attachment: (was: reproduce.sh) -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-17110) Null Pointer Exception when running TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort
[ https://issues.apache.org/jira/browse/HDFS-17110?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ConfX updated HDFS-17110: - Attachment: reproduce.sh -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Created] (HDFS-17110) Null Pointer Exception when running TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort
ConfX created HDFS-17110: Summary: Null Pointer Exception when running TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort Key: HDFS-17110 URL: https://issues.apache.org/jira/browse/HDFS-17110 Project: Hadoop HDFS Issue Type: Bug Reporter: ConfX Attachments: reproduce.sh h2. What happened After setting {{{}dfs.namenode.replication.min=12396{}}}, running test {{org.apache.hadoop.hdfs.server.namenode.ha.TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort}} results in a {{{}NullPointerException{}}}. h2. Where's the bug In the test {{{}org.apache.hadoop.hdfs.server.namenode.ha.TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort{}}}: {noformat} } finally { cluster.shutdown(); }{noformat} the test tries to shut down the cluster during cleanup. However, it does not check whether the cluster was ever created; if cluster is null, the resulting NPE conceals the original failure. h2. How to reproduce # Set {{dfs.namenode.replication.min=12396}} # Run {{org.apache.hadoop.hdfs.server.namenode.ha.TestHarFileSystemWithHA#testHarUriWithHaUriWithNoPort}} and the following exception should be observed: {noformat} java.lang.NullPointerException at org.apache.hadoop.hdfs.server.namenode.ha.TestHarFileSystemWithHA.testHarUriWithHaUriWithNoPort(TestHarFileSystemWithHA.java:60){noformat} For an easy reproduction, run the reproduce.sh in the attachment. We are happy to provide a patch if this issue is confirmed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
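The fix the reporter hints at is a null guard in the cleanup path; a minimal sketch, with the surrounding test structure simplified:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hdfs.MiniDFSCluster;

MiniDFSCluster cluster = null;
try {
  cluster = new MiniDFSCluster.Builder(new Configuration()).build();
  // ... test body ...
} finally {
  if (cluster != null) {  // guard: a failed startup leaves cluster null
    cluster.shutdown();
  }
}
{code}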
[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744696#comment-17744696 ] ASF GitHub Bot commented on HDFS-17093: --- hadoop-yetus commented on PR #5855: URL: https://github.com/apache/hadoop/pull/5855#issuecomment-1642377590 :confetti_ball: **+1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 43s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 1s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 51m 44s | | trunk passed | | +1 :green_heart: | compile | 1m 31s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 | | +1 :green_heart: | compile | 1m 24s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 | | +1 :green_heart: | checkstyle | 1m 15s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 30s | | trunk passed | | +1 :green_heart: | javadoc | 1m 13s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 | | +1 :green_heart: | javadoc | 1m 41s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 | | +1 :green_heart: | spotbugs | 3m 41s | | trunk passed | | +1 :green_heart: | shadedclient | 38m 37s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 15s | | the patch passed | | +1 :green_heart: | compile | 1m 17s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 | | +1 :green_heart: | javac | 1m 17s | | the patch passed | | +1 :green_heart: | compile | 1m 15s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 | | +1 :green_heart: | javac | 1m 15s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 1m 4s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 19s | | the patch passed | | +1 :green_heart: | javadoc | 0m 58s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 | | +1 :green_heart: | javadoc | 1m 31s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 | | +1 :green_heart: | spotbugs | 3m 15s | | the patch passed | | +1 :green_heart: | shadedclient | 38m 6s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 212m 51s | | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 56s | | The patch does not generate ASF License warnings. 
| | | | 367m 50s | | | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5855/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5855 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 897492e024b6 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / a4b76d3d1e3785641758e2aca40069504b3c99b9 | | Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5855/1/testReport/ | | Max. process+thread count | 2916 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5855/1/console | | versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org | This message was automatically generated. > In the case of
[jira] [Updated] (HDFS-17109) Null Pointer Exception when running TestBlockManager
[ https://issues.apache.org/jira/browse/HDFS-17109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ConfX updated HDFS-17109: - Description: h2. What happened After setting {{{}dfs.namenode.redundancy.considerLoadByStorageType=true{}}}, running test {{org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager#testOneOfTwoRacksDecommissioned}} results in a {{{}NullPointerException{}}}. h2. Where's the bug In the class {{{}BlockPlacementPolicyDefault{}}}: {noformat} for (StorageType s : storageTypes) { StorageTypeStats storageTypeStats = storageStats.get(s); numNodes += storageTypeStats.getNodesInService(); numXceiver += storageTypeStats.getNodesInServiceXceiverCount(); }{noformat} However, the class does not check if the storageTypeStats is null, causing the NPE. h2. How to reproduce # Set {{dfs.namenode.redundancy.considerLoadByStorageType=true}} # Run {{org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager#testOneOfTwoRacksDecommissioned}} and the following exception should be observed: {noformat} java.lang.NullPointerException at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getInServiceXceiverAverageByStorageType(BlockPlacementPolicyDefault.java:1044) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getInServiceXceiverAverage(BlockPlacementPolicyDefault.java:1023) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.excludeNodeByLoad(BlockPlacementPolicyDefault.java:1000) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.isGoodDatanode(BlockPlacementPolicyDefault.java:1086) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:855) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:782) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:557) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:478) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:350) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:170) at org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:51) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:2031) at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.scheduleSingleReplication(TestBlockManager.java:641) at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestOneOfTwoRacksDecommissioned(TestBlockManager.java:364) at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testOneOfTwoRacksDecommissioned(TestBlockManager.java:351){noformat} For an easy reproduction, run the reproduce.sh in the attachment. We are happy to provide a patch if this issue is confirmed. was: h2. What happened After setting {{{}dfs.namenode.redundancy.considerLoadByStorageType=true{}}}, running test {{org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager#testOneOfTwoRacksDecommissioned}} results in a {{{}NullPointerException{}}}. h2. 
Where's the bug In the class {{{}BlockPlacementPolicyDefault{}}}: {noformat} for (StorageType s : storageTypes) { StorageTypeStats storageTypeStats = storageStats.get(s); numNodes += storageTypeStats.getNodesInService(); numXceiver += storageTypeStats.getNodesInServiceXceiverCount(); }{noformat} However, the class does not check if the storageTypeStats is null, causing the NPE. h2. How to reproduce # Set {{dfs.namenode.redundancy.considerLoadByStorageType=true}} # Run {{org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager#testOneOfTwoRacksDecommissioned}} and the following exception should be observed: {noformat} java.lang.NullPointerException at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getInServiceXceiverAverageByStorageType(BlockPlacementPolicyDefault.java:1044) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getInServiceXceiverAverage(BlockPlacementPolicyDefault.java:1023) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.excludeNodeByLoad(BlockPlacementPolicyDefault.java:1000) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.isGoodDatanode(BlockPlacementPolicyDefault.java:1086) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:855) at
[jira] [Created] (HDFS-17109) Null Pointer Exception when running TestBlockManager
ConfX created HDFS-17109: Summary: Null Pointer Exception when running TestBlockManager Key: HDFS-17109 URL: https://issues.apache.org/jira/browse/HDFS-17109 Project: Hadoop HDFS Issue Type: Bug Reporter: ConfX Attachments: reproduce.sh h2. What happened After setting {{{}dfs.namenode.redundancy.considerLoadByStorageType=true{}}}, running test {{org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager#testOneOfTwoRacksDecommissioned}} results in a {{{}NullPointerException{}}}. h2. Where's the bug In the class {{{}BlockPlacementPolicyDefault{}}}: {noformat} for (StorageType s : storageTypes) { StorageTypeStats storageTypeStats = storageStats.get(s); numNodes += storageTypeStats.getNodesInService(); numXceiver += storageTypeStats.getNodesInServiceXceiverCount(); }{noformat} However, the class does not check if the storageTypeStats is null, causing the NPE. h2. How to reproduce # Set {{dfs.namenode.redundancy.considerLoadByStorageType=true}} # Run {{org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager#testOneOfTwoRacksDecommissioned}} and the following exception should be observed: {noformat} java.lang.NullPointerException at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getInServiceXceiverAverageByStorageType(BlockPlacementPolicyDefault.java:1044) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.getInServiceXceiverAverage(BlockPlacementPolicyDefault.java:1023) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.excludeNodeByLoad(BlockPlacementPolicyDefault.java:1000) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.isGoodDatanode(BlockPlacementPolicyDefault.java:1086) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRandom(BlockPlacementPolicyDefault.java:855) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseRemoteRack(BlockPlacementPolicyDefault.java:782) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTargetInOrder(BlockPlacementPolicyDefault.java:557) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:478) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:350) at org.apache.hadoop.hdfs.server.blockmanagement.BlockPlacementPolicyDefault.chooseTarget(BlockPlacementPolicyDefault.java:170) at org.apache.hadoop.hdfs.server.blockmanagement.ReplicationWork.chooseTargets(ReplicationWork.java:51) at org.apache.hadoop.hdfs.server.blockmanagement.BlockManager.computeReconstructionWorkForBlocks(BlockManager.java:2031) at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.scheduleSingleReplication(TestBlockManager.java:641) at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.doTestOneOfTwoRacksDecommissioned(TestBlockManager.java:364) at org.apache.hadoop.hdfs.server.blockmanagement.TestBlockManager.testOneOfTwoRacksDecommissioned(TestBlockManager.java:351){noformat} For an easy reproduction, run the reproduce.sh in the attachment. We are happy to provide a patch if this issue is confirmed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
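A minimal sketch of the null check the report suggests, adapted from the loop quoted above (a sketch only, not a committed fix):
{code:java}
for (StorageType s : storageTypes) {
  StorageTypeStats storageTypeStats = storageStats.get(s);
  if (storageTypeStats == null) {
    // No datanode has registered storage of this type yet;
    // skip it instead of dereferencing null.
    continue;
  }
  numNodes += storageTypeStats.getNodesInService();
  numXceiver += storageTypeStats.getNodesInServiceXceiverCount();
}
{code}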
[jira] [Created] (HDFS-17108) Null Pointer Exception when running TestDecommissionWithBackoffMonitor
ConfX created HDFS-17108: Summary: Null Pointer Exception when running TestDecommissionWithBackoffMonitor Key: HDFS-17108 URL: https://issues.apache.org/jira/browse/HDFS-17108 Project: Hadoop HDFS Issue Type: Bug Reporter: ConfX Attachments: reproduce.sh h2. What happened After setting {{{}dfs.client.read.shortcircuit=true{}}}, running test {{org.apache.hadoop.hdfs.TestDecommissionWithBackoffMonitor#testNodeUsageWhileDecommissioining}} results in a {{{}NullPointerException{}}}. h2. Where's the bug In the test class {{{}org.apache.hadoop.hdfs.TestDecommission{}}}: {noformat} } finally { cleanupFile(fileSys, file1); }{noformat} However, the class does not check if the fileSys is null, causing the NPE. h2. How to reproduce # Set {{dfs.client.read.shortcircuit=true}} # Run {{org.apache.hadoop.hdfs.TestDecommissionWithBackoffMonitor#testNodeUsageWhileDecommissioining}} and the following exception should be observed: {noformat} java.lang.NullPointerException at org.apache.hadoop.hdfs.AdminStatesBaseTest.cleanupFile(AdminStatesBaseTest.java:459) at org.apache.hadoop.hdfs.TestDecommission.nodeUsageVerification(TestDecommission.java:1575) at org.apache.hadoop.hdfs.TestDecommission.testNodeUsageWhileDecommissioining(TestDecommission.java:1510){noformat} For an easy reproduction, run the reproduce.sh in the attachment. We are happy to provide a patch if this issue is confirmed. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
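The same null-guard pattern applies here, this time around the quoted cleanup call (sketch under the assumption that fileSys stays null when cluster startup fails):
{code:java}
} finally {
  if (fileSys != null) {  // guard: fileSys is null if the cluster never came up
    cleanupFile(fileSys, file1);
  }
}
{code}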
[jira] [Created] (HDFS-17107) Null Pointer Exception after turning on detailed metrics for namenode lock
ConfX created HDFS-17107: Summary: Null Pointer Exception after turning on detailed metrics for namenode lock Key: HDFS-17107 URL: https://issues.apache.org/jira/browse/HDFS-17107 Project: Hadoop HDFS Issue Type: Bug Reporter: ConfX Attachments: reproduce.sh h2. What happened After setting {{{}dfs.namenode.lock.detailed-metrics.enabled=true{}}}, running test {{org.apache.hadoop.hdfs.server.namenode.TestFSNamesystemLock#testFSWriteLockReportSuppressed}} results in a {{{}NullPointerException{}}}. h2. Where's the bug In class {{{}FSNamesystemLock{}}}: {noformat} if (metricsEnabled) { String opMetric = getMetricName(operationName, isWrite); detailedHoldTimeMetrics.add(opMetric, value);{noformat} here metricsEnabled can be true while detailedHoldTimeMetrics is still null, which triggers the NPE. h2. How to reproduce # Set {{dfs.namenode.lock.detailed-metrics.enabled=true}} # Run {{org.apache.hadoop.hdfs.server.namenode.TestFSNamesystemLock#testFSWriteLockReportSuppressed}} and the following exception should be observed: {noformat} java.lang.NullPointerException at org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.addMetric(FSNamesystemLock.java:359) at org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:287) at org.apache.hadoop.hdfs.server.namenode.FSNamesystemLock.writeUnlock(FSNamesystemLock.java:236) at org.apache.hadoop.hdfs.server.namenode.TestFSNamesystemLock.testFSWriteLockReportSuppressed(TestFSNamesystemLock.java:433){noformat} For an easy reproduction, run the reproduce.sh in the attachment. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
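One possible guard, adapted from the snippet quoted above (illustrative; the actual fix might instead ensure detailedHoldTimeMetrics is always initialized whenever the flag is set):
{code:java}
// Guard against a null metrics sink even when
// dfs.namenode.lock.detailed-metrics.enabled is true.
if (metricsEnabled && detailedHoldTimeMetrics != null) {
  String opMetric = getMetricName(operationName, isWrite);
  detailedHoldTimeMetrics.add(opMetric, value);
}
{code}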
[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744617#comment-17744617 ] ASF GitHub Bot commented on HDFS-17093: --- Hexiaoqiao closed pull request #5856: HDFS-17093. In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting URL: https://github.com/apache/hadoop/pull/5856 -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744616#comment-17744616 ] ASF GitHub Bot commented on HDFS-17093: --- Hexiaoqiao commented on PR #5856: URL: https://github.com/apache/hadoop/pull/5856#issuecomment-1642036068 Please push update to your same original branch (for this case which is Tre2878:HDFS-17093), DO NOT pull repeat request for same issue. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744602#comment-17744602 ] Yanlei Yu commented on HDFS-17093: -- [~hexiaoqiao] I have modified and resubmitted the PR: [GitHub Pull Request #5856|https://github.com/apache/hadoop/pull/5856]
[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744600#comment-17744600 ] ASF GitHub Bot commented on HDFS-17093: --- Tre2878 commented on code in PR #5855: URL: https://github.com/apache/hadoop/pull/5855#discussion_r1268012644

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java:

@@ -2873,7 +2873,9 @@ public boolean checkBlockReportLease(BlockReportContext context,
 public boolean processReport(final DatanodeID nodeID,
     final DatanodeStorage storage, final BlockListAsLongs newReport,
-    BlockReportContext context) throws IOException {
+    BlockReportContext context,
+    int totalReportNum,
+    int currentReportNum) throws IOException {

Review Comment: I think it's OK.
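For context on the diff above: the two added parameters let processReport know whether it is handling the last storage report of a datanode's FBR, so the startup-safe-mode path can defer revoking the block report lease until every storage has been processed. A minimal sketch of that idea, reusing the names from the diff (totalReportNum, currentReportNum); it illustrates the approach rather than reproducing the exact merged change:
{code:java}
if (namesystem.isInStartupSafeMode()
    && !StorageType.PROVIDED.equals(storageInfo.getStorageType())
    && storageInfo.getBlockReportCount() > 0) {
  blockLog.info("BLOCK* processReport 0x{} with lease ID 0x{}: "
      + "discarded non-initial block report from {}"
      + " because namenode still in startup phase",
      strBlockReportId, fullBrLeaseId, nodeID);
  // Only remove the lease after the datanode's final storage report of this
  // FBR; removing it on an earlier storage leaves the remaining per-storage
  // reports without a lease, so the FBR ends up incomplete.
  if (totalReportNum == currentReportNum) {
    blockReportLeaseManager.removeLease(node);
  }
  return !node.hasStaleStorages();
}
{code}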
[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744599#comment-17744599 ] ASF GitHub Bot commented on HDFS-17093: --- Tre2878 opened a new pull request, #5856: URL: https://github.com/apache/hadoop/pull/5856 In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[jira] [Updated] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] ASF GitHub Bot updated HDFS-17093: -- Labels: pull-request-available (was: )
[jira] [Commented] (HDFS-17094) EC: Fix bug in block recovery when there are stale datanodes
[ https://issues.apache.org/jira/browse/HDFS-17094?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744537#comment-17744537 ] ASF GitHub Bot commented on HDFS-17094: --- zhangshuyan0 commented on PR #5854: URL: https://github.com/apache/hadoop/pull/5854#issuecomment-1641844035 @Hexiaoqiao @tomscut Thanks for your review. I've updated this PR according to the suggestions. Please take a look, thanks again. > EC: Fix bug in block recovery when there are stale datanodes > > > Key: HDFS-17094 > URL: https://issues.apache.org/jira/browse/HDFS-17094 > Project: Hadoop HDFS > Issue Type: Bug > Reporter: Shuyan Zhang > Assignee: Shuyan Zhang > Priority: Major > Labels: pull-request-available > > When a block recovery occurs, `RecoveryTaskStriped` in the datanode expects > `rBlock.getLocations()` and `rBlock.getBlockIndices()` to be in one-to-one > correspondence. However, if some locations are in a stale state when the NameNode > handles the heartbeat, this correspondence is disrupted: `recoveryLocations` > contains no stale locations, but the block indices array is still complete > (i.e. it contains the indices of all the locations). This causes > `BlockRecoveryWorker.RecoveryTaskStriped#recover` to generate a wrong > internal block ID, and the corresponding datanode cannot find the replica, > making the recovery process fail. This bug needs to be fixed.
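A small self-contained sketch of why the locations/indices mismatch breaks striped recovery: internal blocks of an EC block group take IDs derived from the group ID plus the block index, so each recovery location must be paired with its own index. The class and the group ID value below are hypothetical, for illustration only, not HDFS source:
{code:java}
// Hypothetical illustration: pairing the surviving locations against an
// unfiltered index array makes recovery ask for the wrong internal block.
public class StripedIdMismatchSketch {
  static long internalBlockId(long blockGroupId, int blockIndex) {
    // Convention for striped groups: internal block ID = group ID + index.
    return blockGroupId + blockIndex;
  }

  public static void main(String[] args) {
    long groupId = 0x1000;               // hypothetical group ID (real IDs differ)
    int[] allIndices = {0, 1, 2, 3, 4};  // indices of ALL original locations
    // Suppose the location with index 2 was stale and dropped from
    // recoveryLocations while allIndices was left complete:
    int[] survivingIndices = {0, 1, 3, 4}; // the pairing recovery actually needs
    for (int i = 0; i < survivingIndices.length; i++) {
      long expected = internalBlockId(groupId, survivingIndices[i]);
      long actual = internalBlockId(groupId, allIndices[i]); // naive pairing
      System.out.println("location " + i + ": expected internal block "
          + expected + ", naive pairing asks for " + actual);
    }
  }
}
{code}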
[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744538#comment-17744538 ] ASF GitHub Bot commented on HDFS-17093: --- Hexiaoqiao commented on code in PR #5855: URL: https://github.com/apache/hadoop/pull/5855#discussion_r1267889826

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/blockmanagement/BlockManager.java:

@@ -2873,7 +2873,9 @@ public boolean checkBlockReportLease(BlockReportContext context,
 public boolean processReport(final DatanodeID nodeID,
     final DatanodeStorage storage, final BlockListAsLongs newReport,
-    BlockReportContext context) throws IOException {
+    BlockReportContext context,
+    int totalReportNum,
+    int currentReportNum) throws IOException {

Review Comment: a. Please add some Javadoc for the added parameters. b. Would these names be more readable? totalReportNum -> totalStorageReportsNum, currentReportNum -> storageReportIndex

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNodeRpcServer.java:

@@ -1650,7 +1650,7 @@ public DatanodeCommand blockReport(final DatanodeRegistration nodeReg,
     final int index = r;
     noStaleStorages = bm.runBlockOp(() -> bm.processReport(nodeReg,
         reports[index].getStorage(),
-        blocks, context));
+        blocks, context, reports.length, index+1));

Review Comment: codestyle: `index + 1` (leave one space here)
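Taken together, the two review suggestions would amount to something like the following; the Javadoc text and renamed parameters are sketched from the reviewer's proposals, not committed code:
{code:java}
/**
 * Process one storage report of a datanode's full block report (FBR).
 *
 * @param totalStorageReportsNum number of storage reports in this FBR
 *                               (suggested rename of totalReportNum)
 * @param storageReportIndex     1-based position of this report within the
 *                               FBR (suggested rename of currentReportNum)
 */
public boolean processReport(final DatanodeID nodeID,
    final DatanodeStorage storage, final BlockListAsLongs newReport,
    BlockReportContext context,
    int totalStorageReportsNum,
    int storageReportIndex) throws IOException {
  // ... body unchanged ...
}

// Call site in NameNodeRpcServer#blockReport with the codestyle fix applied:
noStaleStorages = bm.runBlockOp(() -> bm.processReport(nodeReg,
    reports[index].getStorage(), blocks, context,
    reports.length, index + 1));
{code}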
[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744526#comment-17744526 ] Xiaoqiao He commented on HDFS-17093: 
{quote}
# DN wants to send 12 reports but only sent 1 report.
# NN processes 1 report (then storageInfo.getBlockReportCount() > 0 will be true).
# DN continues to send 12 reports to NN.
# NN will simply discard these reports, because storageInfo.getBlockReportCount() > 0.
{quote}
From this description and the log information, it is actually different from HDFS-17090, IMO. For this case and bugfix, it makes sense to me.
[jira] [Comment Edited] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744524#comment-17744524 ] Yanlei Yu edited comment on HDFS-17093 at 7/19/23 10:24 AM: [~hexiaoqiao], {quote}would you mind to submit PR via Github if need? {quote} PR: [GitHub Pull Request #5855|https://github.com/apache/hadoop/pull/5855]

was (Author: JIRAUSER294151): [~hexiaoqiao], {quote}would you mind to submit PR via Github if need? {quote} PR: [[GitHub Pull Request #5855|https://github.com/apache/hadoop/pull/5855]|[http://example.com|https://github.com/apache/hadoop/pull/5855]]
[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17744524#comment-17744524 ] Yanlei Yu commented on HDFS-17093: -- [~hexiaoqiao], {quote}would you mind to submit PR via Github if need? {quote} PR: [[GitHub Pull Request #5855|https://github.com/apache/hadoop/pull/5855]|[http://example.com|https://github.com/apache/hadoop/pull/5855]]
[jira] [Comment Edited] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1778#comment-1778 ] Yanlei Yu edited comment on HDFS-17093 at 7/19/23 6:08 AM: --- Just to add to that: our cluster configuration is dfs.namenode.max.full.block.report.leases=1500 (we have 800+ nodes). When the namenode restarts, all 800+ nodes send FBRs, so this happens when the namenode is under a lot of pressure. Of course, we cannot rule out setting dfs.namenode.max.full.block.report.leases to a smaller value; not sure whether it would still happen then.

was (Author: JIRAUSER294151): Just to add to that: our cluster configuration is dfs.namenode.max.full.block.report.leases=1500 (we have 800+ nodes). When the namenode restarts, all 800+ nodes send FBRs, so this happens when the namenode is under a lot of pressure. Of course, we cannot rule out setting dfs.namenode.max.full.block.report.leases to a smaller value; then it would not happen.
[jira] [Commented] (HDFS-17093) In the case of all datanodes sending FBR when the namenode restarts (large clusters), there is an issue with incomplete block reporting
[ https://issues.apache.org/jira/browse/HDFS-17093?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1778#comment-1778 ] Yanlei Yu commented on HDFS-17093: -- Just to add to that: our cluster configuration is dfs.namenode.max.full.block.report.leases=1500 (we have 800+ nodes). When the namenode restarts, all 800+ nodes send FBRs, so this happens when the namenode is under a lot of pressure. Of course, we cannot rule out setting dfs.namenode.max.full.block.report.leases to a smaller value; then it would not happen.
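For reference, the lease limit discussed above is an hdfs-site.xml setting; a minimal snippet with this cluster's value (the stock Hadoop default is 6):
{code:xml}
<!-- hdfs-site.xml: maximum number of datanodes that may hold a full block
     report (FBR) lease at the same time. 1500 matches the 800+ node cluster
     described above; Hadoop ships with a default of 6. -->
<property>
  <name>dfs.namenode.max.full.block.report.leases</name>
  <value>1500</value>
</property>
{code}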