[jira] [Commented] (HDFS-16896) HDFS Client hedged read has increased failure rate than without hedged read
[ https://issues.apache.org/jira/browse/HDFS-16896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690152#comment-17690152 ] ASF GitHub Bot commented on HDFS-16896: --- hadoop-yetus commented on PR #5322: URL: https://github.com/apache/hadoop/pull/5322#issuecomment-1434122848 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 50s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 15m 18s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 33m 57s | | trunk passed | | +1 :green_heart: | compile | 7m 0s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 6m 31s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 26s | | trunk passed | | +1 :green_heart: | mvnsite | 2m 35s | | trunk passed | | +1 :green_heart: | javadoc | 1m 58s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 2m 17s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 6m 34s | | trunk passed | | +1 :green_heart: | shadedclient | 28m 32s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 43s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 2m 25s | | the patch passed | | +1 :green_heart: | compile | 5m 53s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 5m 53s | | the patch passed | | +1 :green_heart: | compile | 5m 41s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 5m 41s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 1m 6s | [/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5322/8/artifact/out/results-checkstyle-hadoop-hdfs-project.txt) | hadoop-hdfs-project: The patch generated 1 new + 42 unchanged - 0 fixed = 43 total (was 42) | | +1 :green_heart: | mvnsite | 2m 15s | | the patch passed | | +1 :green_heart: | javadoc | 1m 28s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 2m 0s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 5m 50s | | the patch passed | | +1 :green_heart: | shadedclient | 26m 20s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 2m 26s | | hadoop-hdfs-client in the patch passed. | | -1 :x: | unit | 204m 16s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5322/8/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. 
| | +1 :green_heart: | asflicense | 0m 51s | | The patch does not generate ASF License warnings. | | | | 366m 39s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.namenode.TestAuditLogs | | | hadoop.hdfs.server.namenode.TestFsck | | | hadoop.hdfs.server.namenode.TestAuditLogger | | | hadoop.hdfs.server.namenode.TestFSNamesystemLockReport | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5322/8/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5322 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux b5a1be0c0e37 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bi
[jira] [Commented] (HDFS-16896) HDFS Client hedged read has increased failure rate than without hedged read
[ https://issues.apache.org/jira/browse/HDFS-16896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690148#comment-17690148 ] ASF GitHub Bot commented on HDFS-16896: --- hadoop-yetus commented on PR #5322: URL: https://github.com/apache/hadoop/pull/5322#issuecomment-1434115847 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 43s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 15m 31s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 31m 12s | | trunk passed | | +1 :green_heart: | compile | 6m 14s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 5m 41s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 21s | | trunk passed | | +1 :green_heart: | mvnsite | 2m 30s | | trunk passed | | +1 :green_heart: | javadoc | 1m 49s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 2m 18s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 5m 57s | | trunk passed | | +1 :green_heart: | shadedclient | 25m 50s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 47s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 2m 20s | | the patch passed | | +1 :green_heart: | compile | 5m 54s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 5m 54s | | the patch passed | | +1 :green_heart: | compile | 5m 37s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 5m 37s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 1m 4s | [/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5322/7/artifact/out/results-checkstyle-hadoop-hdfs-project.txt) | hadoop-hdfs-project: The patch generated 2 new + 42 unchanged - 0 fixed = 44 total (was 42) | | +1 :green_heart: | mvnsite | 2m 14s | | the patch passed | | +1 :green_heart: | javadoc | 1m 26s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 2m 1s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 5m 49s | | the patch passed | | +1 :green_heart: | shadedclient | 25m 46s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 2m 29s | | hadoop-hdfs-client in the patch passed. | | -1 :x: | unit | 207m 40s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5322/7/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. 
| | +1 :green_heart: | asflicense | 0m 51s | | The patch does not generate ASF License warnings. | | | | 361m 50s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.namenode.TestAuditLogs | | | hadoop.hdfs.server.namenode.TestAuditLogger | | | hadoop.hdfs.server.namenode.TestFSNamesystemLockReport | | | hadoop.hdfs.server.namenode.TestFsck | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5322/7/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5322 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 41ce24a7c7bf 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bi
[jira] [Updated] (HDFS-16925) Namenode audit log to only include IP address of client
[ https://issues.apache.org/jira/browse/HDFS-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16925: Summary: Namenode audit log to only include IP address of client (was: Fix regex pattern for namenode audit log tests) > Namenode audit log to only include IP address of client > --- > > Key: HDFS-16925 > URL: https://issues.apache.org/jira/browse/HDFS-16925 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > With HADOOP-18628 in place, we perform InetAddress#getHostName in addition to > InetAddress#getHostAddress, to save host name with IPC Connection object. > When we perform InetAddress#getHostName, toString() of InetAddress would > automatically print \{hostName}/\{hostIPAddress} if hostname is already > resolved: > {code:java} > /** > * Converts this IP address to a {@code String}. The > * string returned is of the form: hostname / literal IP > * address. > * > * If the host name is unresolved, no reverse name service lookup > * is performed. The hostname part will be represented by an empty string. > * > * @return a string representation of this IP address. > */ > public String toString() { > String hostName = holder().getHostName(); > return ((hostName != null) ? hostName : "") > + "/" + getHostAddress(); > }{code} > > For namenode audit logs, this means that when dfs client makes filesystem > updates, the audit logs would also print host name in the audit logs in > addition to ip address. We have some tests that performs regex pattern > matching to identify the log pattern of audit logs, we will have to change > them to reflect the change in host address. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
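A quick way to see the behavior the HDFS-16925 description above refers to: once a hostname has been resolved for an InetAddress, its toString() yields "hostname/ip", while getHostAddress() yields only the literal IP. The snippet below is an illustrative, standalone example (the class name is made up and is not part of the Hadoop code base).

```java
import java.net.InetAddress;

public class AuditLogAddressDemo {
  public static void main(String[] args) throws Exception {
    // Resolving by name stores the hostname with the InetAddress instance.
    InetAddress addr = InetAddress.getByName("localhost");
    // Implicit string concatenation calls toString(), printing e.g. "ip=localhost/127.0.0.1".
    System.out.println("ip=" + addr);
    // getHostAddress() returns only the literal IP, e.g. "ip=127.0.0.1",
    // which is the form the namenode audit log is expected to keep.
    System.out.println("ip=" + addr.getHostAddress());
  }
}
```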
[jira] [Commented] (HDFS-16925) Fix regex pattern for namenode audit log tests
[ https://issues.apache.org/jira/browse/HDFS-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690114#comment-17690114 ] ASF GitHub Bot commented on HDFS-16925: --- hadoop-yetus commented on PR #5407: URL: https://github.com/apache/hadoop/pull/5407#issuecomment-1434043812 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 44s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 4 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 50m 6s | | trunk passed | | +1 :green_heart: | compile | 1m 27s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 1m 21s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 9s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 30s | | trunk passed | | +1 :green_heart: | javadoc | 1m 9s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 33s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 30s | | trunk passed | | +1 :green_heart: | shadedclient | 26m 12s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 27s | | the patch passed | | +1 :green_heart: | compile | 1m 18s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 1m 18s | | the patch passed | | +1 :green_heart: | compile | 1m 13s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 1m 13s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 53s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 21s | | the patch passed | | +1 :green_heart: | javadoc | 0m 51s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 25s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 16s | | the patch passed | | +1 :green_heart: | shadedclient | 25m 30s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 206m 29s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5407/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 50s | | The patch does not generate ASF License warnings. 
| | | | 331m 13s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.namenode.TestNamenodeRetryCache | | | hadoop.hdfs.server.namenode.TestFSNamesystemLockReport | | | hadoop.hdfs.server.namenode.TestAuditLoggerWithCommands | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5407/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5407 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux d143c4098f32 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / d5510935e82f3353c374a3fd4024c3746574105c | | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5407/2/testReport/ | | Max. process+thread count | 2982 (vs. ulimit of 5500) | | modules | C: ha
[jira] [Commented] (HDFS-16925) Fix regex pattern for namenode audit log tests
[ https://issues.apache.org/jira/browse/HDFS-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690106#comment-17690106 ] ASF GitHub Bot commented on HDFS-16925: --- virajjasani commented on PR #5407: URL: https://github.com/apache/hadoop/pull/5407#issuecomment-1434027211 Apologies. The above IS.Stable would only mean that method arguments do not change, it doesn't matter what we do as part of the logic. Please ignore above comment about making the method IS.Stable. > Fix regex pattern for namenode audit log tests > -- > > Key: HDFS-16925 > URL: https://issues.apache.org/jira/browse/HDFS-16925 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > With HADOOP-18628 in place, we perform InetAddress#getHostName in addition to > InetAddress#getHostAddress, to save host name with IPC Connection object. > When we perform InetAddress#getHostName, toString() of InetAddress would > automatically print \{hostName}/\{hostIPAddress} if hostname is already > resolved: > {code:java} > /** > * Converts this IP address to a {@code String}. The > * string returned is of the form: hostname / literal IP > * address. > * > * If the host name is unresolved, no reverse name service lookup > * is performed. The hostname part will be represented by an empty string. > * > * @return a string representation of this IP address. > */ > public String toString() { > String hostName = holder().getHostName(); > return ((hostName != null) ? hostName : "") > + "/" + getHostAddress(); > }{code} > > For namenode audit logs, this means that when dfs client makes filesystem > updates, the audit logs would also print host name in the audit logs in > addition to ip address. We have some tests that performs regex pattern > matching to identify the log pattern of audit logs, we will have to change > them to reflect the change in host address. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16917) Add transfer rate quantile metrics for DataNode reads
[ https://issues.apache.org/jira/browse/HDFS-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690105#comment-17690105 ] ASF GitHub Bot commented on HDFS-16917: --- hadoop-yetus commented on PR #5397: URL: https://github.com/apache/hadoop/pull/5397#issuecomment-1434026427 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 40s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +0 :ok: | markdownlint | 0m 0s | | markdownlint was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 15m 30s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 30m 54s | | trunk passed | | +1 :green_heart: | compile | 23m 17s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 20m 32s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 3m 48s | | trunk passed | | +1 :green_heart: | mvnsite | 3m 28s | | trunk passed | | +1 :green_heart: | javadoc | 2m 28s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 2m 40s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 6m 8s | | trunk passed | | +1 :green_heart: | shadedclient | 26m 26s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 29s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 2m 32s | | the patch passed | | +1 :green_heart: | compile | 22m 36s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 22m 36s | | the patch passed | | +1 :green_heart: | compile | 20m 34s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 20m 34s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 3m 41s | | the patch passed | | +1 :green_heart: | mvnsite | 3m 29s | | the patch passed | | +1 :green_heart: | javadoc | 2m 20s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 2m 42s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 6m 20s | | the patch passed | | +1 :green_heart: | shadedclient | 26m 42s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 18m 16s | | hadoop-common in the patch passed. | | -1 :x: | unit | 204m 35s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5397/10/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 1m 10s | | The patch does not generate ASF License warnings. 
| | | | 451m 15s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.namenode.TestAuditLogger | | | hadoop.hdfs.server.namenode.TestFsck | | | hadoop.hdfs.server.namenode.TestAuditLogs | | | hadoop.hdfs.server.namenode.TestFSNamesystemLockReport | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5397/10/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5397 | | Optional Tests | dupname asflicense mvnsite codespell detsecrets markdownlint compile javac javadoc mvninstall unit shadedclient spotbugs checkstyle | | uname | Linux 873e7c4ff4e0 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / b3267f35f78f0625039a10d1c970e1f05cf677cd | | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b0
[jira] [Commented] (HDFS-16925) Fix regex pattern for namenode audit log tests
[ https://issues.apache.org/jira/browse/HDFS-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690101#comment-17690101 ] ASF GitHub Bot commented on HDFS-16925: --- virajjasani commented on code in PR #5407: URL: https://github.com/apache/hadoop/pull/5407#discussion_r1109234913 ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java: ## @@ -8787,7 +8787,10 @@ public void logAuditEvent(boolean succeeded, String userName, sb.setLength(0); sb.append("allowed=").append(succeeded).append("\t") .append("ugi=").append(userName).append("\t") -.append("ip=").append(addr).append("\t") +// InetAddress#tostring can also return hostname in addition to IP address. +// To not include hostname regardless of the hostname resolution, we should +// only include IP address here. See HADOOP-18628. Review Comment: Done > Fix regex pattern for namenode audit log tests > -- > > Key: HDFS-16925 > URL: https://issues.apache.org/jira/browse/HDFS-16925 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > With HADOOP-18628 in place, we perform InetAddress#getHostName in addition to > InetAddress#getHostAddress, to save host name with IPC Connection object. > When we perform InetAddress#getHostName, toString() of InetAddress would > automatically print \{hostName}/\{hostIPAddress} if hostname is already > resolved: > {code:java} > /** > * Converts this IP address to a {@code String}. The > * string returned is of the form: hostname / literal IP > * address. > * > * If the host name is unresolved, no reverse name service lookup > * is performed. The hostname part will be represented by an empty string. > * > * @return a string representation of this IP address. > */ > public String toString() { > String hostName = holder().getHostName(); > return ((hostName != null) ? hostName : "") > + "/" + getHostAddress(); > }{code} > > For namenode audit logs, this means that when dfs client makes filesystem > updates, the audit logs would also print host name in the audit logs in > addition to ip address. We have some tests that performs regex pattern > matching to identify the log pattern of audit logs, we will have to change > them to reflect the change in host address. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
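For context on the diff above, here is a minimal sketch of how the audit-log line can be built so that only the literal IP lands in the "ip=" field. This is not the exact change from PR #5407; the class, the sample values, and the null guard are assumptions for illustration only.

```java
import java.net.InetAddress;

public class AuditLogLineSketch {
  public static void main(String[] args) throws Exception {
    InetAddress addr = InetAddress.getByName("localhost"); // stand-in for the caller's address
    StringBuilder sb = new StringBuilder();
    sb.append("allowed=").append(true).append("\t")
        .append("ugi=").append("hdfs").append("\t")
        // addr.toString() may render as "hostname/ip" once the name is resolved;
        // appending getHostAddress() keeps the historical "ip=<address>" format.
        .append("ip=").append(addr == null ? null : addr.getHostAddress()).append("\t")
        .append("cmd=").append("create").append("\t");
    System.out.println(sb);
  }
}
```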
[jira] [Commented] (HDFS-16925) Fix regex pattern for namenode audit log tests
[ https://issues.apache.org/jira/browse/HDFS-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690100#comment-17690100 ] ASF GitHub Bot commented on HDFS-16925: --- virajjasani commented on PR #5407: URL: https://github.com/apache/hadoop/pull/5407#issuecomment-1434019817 We have this currently and I looked at this first for the first commit on this PR: ``` /** * Interface defining an audit logger. */ @InterfaceAudience.Public @InterfaceStability.Evolving public interface AuditLogger { ... ... ``` and ``` /** * Extension of {@link AuditLogger}. */ @InterfaceAudience.Public @InterfaceStability.Evolving public abstract class HdfsAuditLogger implements AuditLogger { ... ``` Looks like they should rather be IS.Stable or at least this method should be IS.Stable, WDYT? ``` public abstract void logAuditEvent(boolean succeeded, String userName, InetAddress addr, String cmd, String src, String dst, FileStatus stat, CallerContext callerContext, UserGroupInformation ugi, DelegationTokenSecretManager dtSecretManager); ... ... ``` > Fix regex pattern for namenode audit log tests > -- > > Key: HDFS-16925 > URL: https://issues.apache.org/jira/browse/HDFS-16925 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > With HADOOP-18628 in place, we perform InetAddress#getHostName in addition to > InetAddress#getHostAddress, to save host name with IPC Connection object. > When we perform InetAddress#getHostName, toString() of InetAddress would > automatically print \{hostName}/\{hostIPAddress} if hostname is already > resolved: > {code:java} > /** > * Converts this IP address to a {@code String}. The > * string returned is of the form: hostname / literal IP > * address. > * > * If the host name is unresolved, no reverse name service lookup > * is performed. The hostname part will be represented by an empty string. > * > * @return a string representation of this IP address. > */ > public String toString() { > String hostName = holder().getHostName(); > return ((hostName != null) ? hostName : "") > + "/" + getHostAddress(); > }{code} > > For namenode audit logs, this means that when dfs client makes filesystem > updates, the audit logs would also print host name in the audit logs in > addition to ip address. We have some tests that performs regex pattern > matching to identify the log pattern of audit logs, we will have to change > them to reflect the change in host address. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16925) Fix regex pattern for namenode audit log tests
[ https://issues.apache.org/jira/browse/HDFS-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690098#comment-17690098 ] ASF GitHub Bot commented on HDFS-16925: --- virajjasani commented on PR #5407: URL: https://github.com/apache/hadoop/pull/5407#issuecomment-1434017468 > Sorry, I didn't think #5385 changes the audit log format. We have to keep it as it was before. So it shouldn't print hostname in audit logs. The last change looks good to me if CI is ok. I was not aware of the compatibility guidelines for audit logs until Ayush mentioned the link. @ayushtkn @tasanuma do you think it makes sense to keep the method that produces the final string that we print in the audit log to be marked as `IA.Public` and `IS.Stable`? > Fix regex pattern for namenode audit log tests > -- > > Key: HDFS-16925 > URL: https://issues.apache.org/jira/browse/HDFS-16925 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > With HADOOP-18628 in place, we perform InetAddress#getHostName in addition to > InetAddress#getHostAddress, to save host name with IPC Connection object. > When we perform InetAddress#getHostName, toString() of InetAddress would > automatically print \{hostName}/\{hostIPAddress} if hostname is already > resolved: > {code:java} > /** > * Converts this IP address to a {@code String}. The > * string returned is of the form: hostname / literal IP > * address. > * > * If the host name is unresolved, no reverse name service lookup > * is performed. The hostname part will be represented by an empty string. > * > * @return a string representation of this IP address. > */ > public String toString() { > String hostName = holder().getHostName(); > return ((hostName != null) ? hostName : "") > + "/" + getHostAddress(); > }{code} > > For namenode audit logs, this means that when dfs client makes filesystem > updates, the audit logs would also print host name in the audit logs in > addition to ip address. We have some tests that performs regex pattern > matching to identify the log pattern of audit logs, we will have to change > them to reflect the change in host address. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16925) Fix regex pattern for namenode audit log tests
[ https://issues.apache.org/jira/browse/HDFS-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690096#comment-17690096 ] ASF GitHub Bot commented on HDFS-16925: --- tasanuma commented on code in PR #5407: URL: https://github.com/apache/hadoop/pull/5407#discussion_r1109228327 ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/FSNamesystem.java: ## @@ -8787,7 +8787,10 @@ public void logAuditEvent(boolean succeeded, String userName, sb.setLength(0); sb.append("allowed=").append(succeeded).append("\t") .append("ugi=").append(userName).append("\t") -.append("ip=").append(addr).append("\t") +// InetAddress#tostring can also return hostname in addition to IP address. +// To not include hostname regardless of the hostname resolution, we should +// only include IP address here. See HADOOP-18628. Review Comment: I think this change is independent of HADOOP-18628. So we don't need to refer to it. ```suggestion // only include IP address here. ``` > Fix regex pattern for namenode audit log tests > -- > > Key: HDFS-16925 > URL: https://issues.apache.org/jira/browse/HDFS-16925 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > With HADOOP-18628 in place, we perform InetAddress#getHostName in addition to > InetAddress#getHostAddress, to save host name with IPC Connection object. > When we perform InetAddress#getHostName, toString() of InetAddress would > automatically print \{hostName}/\{hostIPAddress} if hostname is already > resolved: > {code:java} > /** > * Converts this IP address to a {@code String}. The > * string returned is of the form: hostname / literal IP > * address. > * > * If the host name is unresolved, no reverse name service lookup > * is performed. The hostname part will be represented by an empty string. > * > * @return a string representation of this IP address. > */ > public String toString() { > String hostName = holder().getHostName(); > return ((hostName != null) ? hostName : "") > + "/" + getHostAddress(); > }{code} > > For namenode audit logs, this means that when dfs client makes filesystem > updates, the audit logs would also print host name in the audit logs in > addition to ip address. We have some tests that performs regex pattern > matching to identify the log pattern of audit logs, we will have to change > them to reflect the change in host address. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16925) Fix regex pattern for namenode audit log tests
[ https://issues.apache.org/jira/browse/HDFS-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690095#comment-17690095 ] ASF GitHub Bot commented on HDFS-16925: --- tasanuma commented on PR #5407: URL: https://github.com/apache/hadoop/pull/5407#issuecomment-1434014182 Sorry, I didn't think #5385 changes the audit log format. We have to keep it as it was before. So it shouldn't print hostname in audit logs. The last change looks good to me if CI is ok. > Fix regex pattern for namenode audit log tests > -- > > Key: HDFS-16925 > URL: https://issues.apache.org/jira/browse/HDFS-16925 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > With HADOOP-18628 in place, we perform InetAddress#getHostName in addition to > InetAddress#getHostAddress, to save host name with IPC Connection object. > When we perform InetAddress#getHostName, toString() of InetAddress would > automatically print \{hostName}/\{hostIPAddress} if hostname is already > resolved: > {code:java} > /** > * Converts this IP address to a {@code String}. The > * string returned is of the form: hostname / literal IP > * address. > * > * If the host name is unresolved, no reverse name service lookup > * is performed. The hostname part will be represented by an empty string. > * > * @return a string representation of this IP address. > */ > public String toString() { > String hostName = holder().getHostName(); > return ((hostName != null) ? hostName : "") > + "/" + getHostAddress(); > }{code} > > For namenode audit logs, this means that when dfs client makes filesystem > updates, the audit logs would also print host name in the audit logs in > addition to ip address. We have some tests that performs regex pattern > matching to identify the log pattern of audit logs, we will have to change > them to reflect the change in host address. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16917) Add transfer rate quantile metrics for DataNode reads
[ https://issues.apache.org/jira/browse/HDFS-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690081#comment-17690081 ] ASF GitHub Bot commented on HDFS-16917: --- hadoop-yetus commented on PR #5397: URL: https://github.com/apache/hadoop/pull/5397#issuecomment-1433979733 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 37s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +0 :ok: | markdownlint | 0m 0s | | markdownlint was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 15m 38s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 31m 23s | | trunk passed | | +1 :green_heart: | compile | 23m 9s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 20m 39s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 3m 47s | | trunk passed | | +1 :green_heart: | mvnsite | 3m 27s | | trunk passed | | +1 :green_heart: | javadoc | 2m 27s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 2m 41s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 6m 4s | | trunk passed | | +1 :green_heart: | shadedclient | 26m 10s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 28s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 2m 31s | | the patch passed | | +1 :green_heart: | compile | 22m 26s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 22m 26s | | the patch passed | | +1 :green_heart: | compile | 20m 38s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 20m 38s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 3m 39s | | the patch passed | | +1 :green_heart: | mvnsite | 3m 26s | | the patch passed | | +1 :green_heart: | javadoc | 2m 22s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 2m 40s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 6m 18s | | the patch passed | | +1 :green_heart: | shadedclient | 26m 23s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 18m 18s | | hadoop-common in the patch passed. | | -1 :x: | unit | 205m 38s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5397/9/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 1m 14s | | The patch does not generate ASF License warnings. 
| | | | 451m 41s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.namenode.TestAuditLogs | | | hadoop.hdfs.server.namenode.TestFSNamesystemLockReport | | | hadoop.hdfs.server.namenode.TestFsck | | | hadoop.hdfs.server.namenode.TestAuditLogger | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5397/9/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5397 | | Optional Tests | dupname asflicense mvnsite codespell detsecrets markdownlint compile javac javadoc mvninstall unit shadedclient spotbugs checkstyle | | uname | Linux e20e3c851708 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 42714178a1e6034a88a612b20972b2e40bc125fd | | Default Java | Private Build-1.8.0_352-8u352-ga-1~20.04-b08
[jira] [Commented] (HDFS-16896) HDFS Client hedged read has increased failure rate than without hedged read
[ https://issues.apache.org/jira/browse/HDFS-16896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690075#comment-17690075 ] ASF GitHub Bot commented on HDFS-16896: --- hadoop-yetus commented on PR #5322: URL: https://github.com/apache/hadoop/pull/5322#issuecomment-1433945193 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 46s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 23m 40s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 33m 28s | | trunk passed | | +1 :green_heart: | compile | 6m 6s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 5m 55s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 19s | | trunk passed | | +1 :green_heart: | mvnsite | 2m 30s | | trunk passed | | +1 :green_heart: | javadoc | 1m 50s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 2m 16s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 5m 53s | | trunk passed | | +1 :green_heart: | shadedclient | 25m 51s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 28s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 2m 20s | | the patch passed | | +1 :green_heart: | compile | 6m 4s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 6m 4s | | the patch passed | | +1 :green_heart: | compile | 5m 38s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 5m 38s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 1m 6s | [/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5322/6/artifact/out/results-checkstyle-hadoop-hdfs-project.txt) | hadoop-hdfs-project: The patch generated 2 new + 42 unchanged - 0 fixed = 44 total (was 42) | | +1 :green_heart: | mvnsite | 2m 14s | | the patch passed | | +1 :green_heart: | javadoc | 1m 27s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 57s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 5m 49s | | the patch passed | | +1 :green_heart: | shadedclient | 25m 37s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 2m 28s | | hadoop-hdfs-client in the patch passed. | | -1 :x: | unit | 205m 29s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5322/6/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. 
| | +1 :green_heart: | asflicense | 0m 51s | | The patch does not generate ASF License warnings. | | | | 369m 46s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.TestPread | | | hadoop.hdfs.server.namenode.TestAuditLogs | | | hadoop.hdfs.server.namenode.TestFsck | | | hadoop.hdfs.server.namenode.TestAuditLogger | | | hadoop.hdfs.server.namenode.TestFSNamesystemLockReport | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5322/6/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5322 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux d9582a898494 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven |
[jira] [Commented] (HDFS-16925) Fix regex pattern for namenode audit log tests
[ https://issues.apache.org/jira/browse/HDFS-16925?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690028#comment-17690028 ] ASF GitHub Bot commented on HDFS-16925: --- virajjasani commented on PR #5407: URL: https://github.com/apache/hadoop/pull/5407#issuecomment-1433760364 @ayushtkn can you please take a look? > Fix regex pattern for namenode audit log tests > -- > > Key: HDFS-16925 > URL: https://issues.apache.org/jira/browse/HDFS-16925 > Project: Hadoop HDFS > Issue Type: Task >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > With HADOOP-18628 in place, we perform InetAddress#getHostName in addition to > InetAddress#getHostAddress, to save host name with IPC Connection object. > When we perform InetAddress#getHostName, toString() of InetAddress would > automatically print \{hostName}/\{hostIPAddress} if hostname is already > resolved: > {code:java} > /** > * Converts this IP address to a {@code String}. The > * string returned is of the form: hostname / literal IP > * address. > * > * If the host name is unresolved, no reverse name service lookup > * is performed. The hostname part will be represented by an empty string. > * > * @return a string representation of this IP address. > */ > public String toString() { > String hostName = holder().getHostName(); > return ((hostName != null) ? hostName : "") > + "/" + getHostAddress(); > }{code} > > For namenode audit logs, this means that when dfs client makes filesystem > updates, the audit logs would also print host name in the audit logs in > addition to ip address. We have some tests that performs regex pattern > matching to identify the log pattern of audit logs, we will have to change > them to reflect the change in host address. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Resolved] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani resolved HDFS-16918. - Resolution: Won't Fix > Optionally shut down datanode if it does not stay connected to active namenode > -- > > Key: HDFS-16918 > URL: https://issues.apache.org/jira/browse/HDFS-16918 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > While deploying Hdfs on Envoy proxy setup, depending on the socket timeout > configured at envoy, the network connection issues or packet loss could be > observed. All of envoys basically form a transparent communication mesh in > which each app can send and receive packets to and from localhost and is > unaware of the network topology. > The primary purpose of Envoy is to make the network transparent to > applications, in order to identify network issues reliably. However, > sometimes such proxy based setup could result into socket connection issues > b/ datanode and namenode. > Many deployment frameworks provide auto-start functionality when any of the > hadoop daemons are stopped. If a given datanode does not stay connected to > active namenode in the cluster i.e. does not receive heartbeat response in > time from active namenode (even though active namenode is not terminated), it > would not be much useful. We should be able to provide configurable behavior > such that if a given datanode cannot receive heartbeat response from active > namenode in configurable time duration, it should terminate itself to avoid > impacting the availability SLA. This is specifically helpful when the > underlying deployment or observability framework (e.g. K8S) can start up the > datanode automatically upon it's shutdown (unless it is being restarted as > part of rolling upgrade) and help the newly brought up datanode (in case of > k8s, a new pod with dynamically changing nodes) establish new socket > connection to active and standby namenodes. This should be an opt-in behavior > and not default one. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Viraj Jasani updated HDFS-16918: Description: While deploying HDFS on an Envoy proxy setup, depending on the socket timeout configured at Envoy, network connection issues or packet loss can be observed. All the Envoy proxies basically form a transparent communication mesh in which each app can send and receive packets to and from localhost and is unaware of the network topology. The primary purpose of Envoy is to make the network transparent to applications, in order to identify network issues reliably. However, sometimes such a proxy-based setup can result in socket connection issues between the datanode and the namenode. Many deployment frameworks provide auto-start functionality when any of the Hadoop daemons are stopped. If a given datanode does not stay connected to the active namenode in the cluster, i.e. does not receive a heartbeat response in time from the active namenode (even though the active namenode is not terminated), it is not of much use. We should be able to provide configurable behavior such that if a given datanode cannot receive a heartbeat response from the active namenode within a configurable time duration, it terminates itself to avoid impacting the availability SLA. This is specifically helpful when the underlying deployment or observability framework (e.g. K8S) can start up the datanode automatically upon its shutdown (unless it is being restarted as part of a rolling upgrade) and help the newly brought up datanode (in the case of k8s, a new pod with dynamically changing nodes) establish a new socket connection to the active and standby namenodes. This should be an opt-in behavior and not the default one. was: While deploying HDFS on an Envoy proxy setup, depending on the socket timeout configured at Envoy, network connection issues or packet loss can be observed. All the Envoy proxies basically form a transparent communication mesh in which each app can send and receive packets to and from localhost and is unaware of the network topology. The primary purpose of Envoy is to make the network transparent to applications, in order to identify network issues reliably. However, sometimes such a proxy-based setup can result in socket connection issues between the datanode and the namenode. Many deployment frameworks provide auto-start functionality when any of the Hadoop daemons are stopped. If a given datanode does not stay connected to the active namenode in the cluster, i.e. does not receive a heartbeat response in time from the active namenode (even though the active namenode is not terminated), it is not of much use. We should be able to provide configurable behavior such that if a given datanode cannot receive a heartbeat response from the active namenode within a configurable time duration, it terminates itself to avoid impacting the availability SLA. This is specifically helpful when the underlying deployment or observability framework (e.g. K8S) can start up the datanode automatically upon its shutdown (unless it is being restarted as part of a rolling upgrade) and help the newly brought up datanode (in the case of k8s, a new pod with dynamically changing nodes) establish a new socket connection to the active and standby namenodes. This should be an opt-in behavior and not the default one. In a distributed system, it is essential to have robust fail-fast mechanisms in place to prevent issues related to network partitioning. 
The system must be designed to prevent further degradation of availability and consistency in the event of a network partition. Several distributed systems offer fail-safe approaches, and for some, partition tolerance is critical to the extent that even a few seconds of heartbeat loss can trigger the removal of an application server instance from the cluster. For instance, a majority of ZooKeeper clients utilize ephemeral nodes for this purpose to make the system reliable, fault-tolerant, and strongly consistent in the event of a network partition. From the HDFS architecture viewpoint, it is crucial to understand the critical role that the active and observer namenodes play in file system operations. In a large-scale cluster, if the datanodes holding the same block (primary and replicas) lose connection to both active and observer namenodes for a significant amount of time, delaying the process of shutting down such datanodes and restarting them to re-establish the connection with the namenodes (assuming the active namenode is alive; this assumption is important in the event of a network partition to re-establish the connection) will further deteriorate the availability of the service. This scenario underscores the importance of resolving network partitioning. This is a real use case for HDFS, and it is not prudent to assume that every deployment or cluster management application must be able to restart datanodes based on JM
[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17690020#comment-17690020 ] ASF GitHub Bot commented on HDFS-16918: --- virajjasani closed pull request #5396: HDFS-16918. Optionally shut down datanode if it does not stay connected to active namenode URL: https://github.com/apache/hadoop/pull/5396 > Optionally shut down datanode if it does not stay connected to active namenode > -- > > Key: HDFS-16918 > URL: https://issues.apache.org/jira/browse/HDFS-16918 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > While deploying Hdfs on Envoy proxy setup, depending on the socket timeout > configured at envoy, the network connection issues or packet loss could be > observed. All of envoys basically form a transparent communication mesh in > which each app can send and receive packets to and from localhost and is > unaware of the network topology. > The primary purpose of Envoy is to make the network transparent to > applications, in order to identify network issues reliably. However, > sometimes such proxy based setup could result into socket connection issues > b/ datanode and namenode. > Many deployment frameworks provide auto-start functionality when any of the > hadoop daemons are stopped. If a given datanode does not stay connected to > active namenode in the cluster i.e. does not receive heartbeat response in > time from active namenode (even though active namenode is not terminated), it > would not be much useful. We should be able to provide configurable behavior > such that if a given datanode cannot receive heartbeat response from active > namenode in configurable time duration, it should terminate itself to avoid > impacting the availability SLA. This is specifically helpful when the > underlying deployment or observability framework (e.g. K8S) can start up the > datanode automatically upon it's shutdown (unless it is being restarted as > part of rolling upgrade) and help the newly brought up datanode (in case of > k8s, a new pod with dynamically changing nodes) establish new socket > connection to active and standby namenodes. This should be an opt-in behavior > and not default one. > > In a distributed system, it is essential to have robust fail-fast mechanisms > in place to prevent issues related to network partitioning. The system must > be designed to prevent further degradation of availability and consistency in > the event of a network partition. Several distributed systems offer fail-safe > approaches, and for some, partition tolerance is critical to the extent that > even a few seconds of heartbeat loss can trigger the removal of an > application server instance from the cluster. For instance, a majority of > zooKeeper clients utilize the ephemeral nodes for this purpose to make system > reliable, fault-tolerant and strongly consistent in the event of network > partition. > From the hdfs architecture viewpoint, it is crucial to understand the > critical role that active and observer namenode play in file system > operations. 
In a large-scale cluster, if the datanodes holding the same block > (primary and replicas) lose connection to both active and observer namenodes > for a significant amount of time, delaying the process of shutting down such > datanodes and restarting it to re-establish the connection with the namenodes > (assuming the active namenode is alive, assumption is important in the even > of network partition to reestablish the connection) will further deteriorate > the availability of the service. This scenario underscores the importance of > resolving network partitioning. > This is a real use case for hdfs and it is not prudent to assume that every > deployment or cluster management application must be able to restart > datanodes based on JMX metrics, as this would introduce another application > to resolve the network partition impact of hdfs. Besides, popular cluster > management applications are not typically used in all cloud-native env. Even > if these cluster management applications are deployed, certain security > constraints may restrict their access to JMX metrics and prevent them from > interfering with hdfs operations. The applications that can only trigger > alerts for users based on set parameters (for instance, missing blocks > 0) > are allowed to access JMX metrics. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issu
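To make the proposal concrete, here is a minimal, purely illustrative sketch of the opt-in rule described in the issue ("terminate if no heartbeat response from the active namenode within a configurable duration"). The configuration keys and class name below are hypothetical and do not correspond to any committed patch; the PR was ultimately closed.
```
// Illustrative sketch only: config keys and class are made up for illustration.
import java.util.concurrent.TimeUnit;

public class DatanodeSelfShutdownSketch {
  // Hypothetical configuration keys (not real Hadoop properties).
  static final String ENABLED_KEY =
      "dfs.datanode.shutdown.on.lost.active.namenode.enabled";
  static final String TIMEOUT_KEY =
      "dfs.datanode.shutdown.on.lost.active.namenode.timeout.ms";

  private final boolean enabled;
  private final long timeoutMs;

  DatanodeSelfShutdownSketch(boolean enabled, long timeoutMs) {
    this.enabled = enabled;
    this.timeoutMs = timeoutMs;
  }

  /** True if the DN has gone too long without a heartbeat response from the active NN. */
  boolean shouldTerminate(long msSinceLastActiveHeartbeatResponse) {
    return enabled && msSinceLastActiveHeartbeatResponse > timeoutMs;
  }

  public static void main(String[] args) {
    DatanodeSelfShutdownSketch policy =
        new DatanodeSelfShutdownSketch(true, TimeUnit.SECONDS.toMillis(60));
    System.out.println(policy.shouldTerminate(75_000)); // true: 75s of silence
  }
}
```
Because the behavior would be opt-in and disabled by default, deployments without an external restarter (e.g. Kubernetes or systemd) would be unaffected.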
[jira] [Updated] (HDFS-16917) Add transfer rate quantile metrics for DataNode reads
[ https://issues.apache.org/jira/browse/HDFS-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Dingankar updated HDFS-16917: -- Description: Currently we have the following metrics for datanode reads. |BytesRead BlocksRead TotalReadTime|Total number of bytes read from DataNode Total number of blocks read from DataNode Total number of milliseconds spent on read operation| We would like to add a new quantile metric calculating the transfer rate for datanode reads. This will give us a distribution across a window of the read transfer rate for each datanode. Quantiles for transfer rate per host will help in identifying issues like hotspotting of datasets as well as finding repetitive slow datanodes. was: Currently we have the following metrics for datanode reads. |BytesRead BlocksRead TotalReadTime|Total number of bytes read from DataNode Total number of blocks read from DataNode Total number of milliseconds spent on read operation| We would like to add a new quantile metric calculating the transfer rate for datanode reads. This will give us a distribution across a window of the read transfer rate for each datanode. Percentiles for transfer rate will help in identifying issues like hotspotting of datasets as well as finding repetitive slow datanodes. > Add transfer rate quantile metrics for DataNode reads > - > > Key: HDFS-16917 > URL: https://issues.apache.org/jira/browse/HDFS-16917 > Project: Hadoop HDFS > Issue Type: Task > Components: datanode >Reporter: Ravindra Dingankar >Priority: Minor > Labels: pull-request-available > > Currently we have the following metrics for datanode reads. > |BytesRead > BlocksRead > TotalReadTime|Total number of bytes read from DataNode > Total number of blocks read from DataNode > Total number of milliseconds spent on read operation| > We would like to add a new quantile metric calculating the transfer rate for > datanode reads. > This will give us a distribution across a window of the read transfer rate > for each datanode. > Quantiles for transfer rate per host will help in identifying issues like > hotspotting of datasets as well as finding repetitive slow datanodes. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
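As a rough illustration of the proposal (not the actual patch), the per-read rate could be derived from bytes and milliseconds and fed into a Hadoop MutableQuantiles. The metric name, registry wiring, and 60-second rollover window below are assumptions; only the MB/s helper and its zero-duration guard appear in the PR discussion further down.
```
// Sketch only: shows one way a read transfer-rate quantile could be recorded.
import org.apache.hadoop.metrics2.lib.MetricsRegistry;
import org.apache.hadoop.metrics2.lib.MutableQuantiles;

public class ReadTransferRateMetricsSketch {
  private final MetricsRegistry registry = new MetricsRegistry("datanode");
  // 60s rollover window is an assumption; the real patch may derive it from
  // dfs.metrics.percentiles.intervals.
  private final MutableQuantiles readTransferRateQuantiles =
      registry.newQuantiles("readTransferRateMBs",
          "Read transfer rate in MB/s", "ops", "rate", 60);

  /** Record one read's transfer rate, skipping zero-length intervals. */
  public void recordRead(long bytes, long durationMs) {
    if (durationMs == 0) {
      return; // cannot derive a rate from a zero-length interval
    }
    long mbPerSec = bytes * 1000 / durationMs / (1024 * 1024);
    readTransferRateQuantiles.add(mbPerSec);
  }
}
```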
[jira] [Updated] (HDFS-16917) Add transfer rate quantile metrics for DataNode reads
[ https://issues.apache.org/jira/browse/HDFS-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Dingankar updated HDFS-16917: -- Description: Currently we have the following metrics for datanode reads. |BytesRead BlocksRead TotalReadTime|Total number of bytes read from DataNode Total number of blocks read from DataNode Total number of milliseconds spent on read operation| We would like to add a new quantile metric calculating the transfer rate for datanode reads. This will give us a distribution across a window of the read transfer rate for each datanode. Percentiles for transfer rate will help in identifying issues like hotspotting of datasets as well as finding repetitive slow datanodes. was: Currently we have the following metrics for datanode reads. |BytesRead BlocksRead TotalReadTime|Total number of bytes read from DataNode Total number of blocks read from DataNode Total number of milliseconds spent on read operation| We would like to add a new quantile metric calculating the data transfer rate for datanode reads. This will give us a distribution across a window of the transfer rate for each datanode. Percentiles for transfer rate will help in identifying repetitive slow datanodes or nodes with hotspots. > Add transfer rate quantile metrics for DataNode reads > - > > Key: HDFS-16917 > URL: https://issues.apache.org/jira/browse/HDFS-16917 > Project: Hadoop HDFS > Issue Type: Task > Components: datanode >Reporter: Ravindra Dingankar >Priority: Minor > Labels: pull-request-available > > Currently we have the following metrics for datanode reads. > |BytesRead > BlocksRead > TotalReadTime|Total number of bytes read from DataNode > Total number of blocks read from DataNode > Total number of milliseconds spent on read operation| > We would like to add a new quantile metric calculating the transfer rate for > datanode reads. > This will give us a distribution across a window of the read transfer rate > for each datanode. > Percentiles for transfer rate will help in identifying issues like > hotspotting of datasets as well as finding repetitive slow datanodes. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16917) Add transfer rate quantile metrics for DataNode reads
[ https://issues.apache.org/jira/browse/HDFS-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Ravindra Dingankar updated HDFS-16917: -- Description: Currently we have the following metrics for datanode reads. |BytesRead BlocksRead TotalReadTime|Total number of bytes read from DataNode Total number of blocks read from DataNode Total number of milliseconds spent on read operation| We would like to add a new quantile metric calculating the data transfer rate for datanode reads. This will give us a distribution across a window of the transfer rate for each datanode. Percentiles for transfer rate will help in identifying repetitive slow datanodes or nodes with hotspots. was: Currently we have the following metrics for datanode reads. |BytesRead BlocksRead TotalReadTime|Total number of bytes read from DataNode Total number of blocks read from DataNode Total number of milliseconds spent on read operation| We would like to add a new quantile metric calculating the distribution of data transfer rate for datanode reads. > Add transfer rate quantile metrics for DataNode reads > - > > Key: HDFS-16917 > URL: https://issues.apache.org/jira/browse/HDFS-16917 > Project: Hadoop HDFS > Issue Type: Task > Components: datanode >Reporter: Ravindra Dingankar >Priority: Minor > Labels: pull-request-available > > Currently we have the following metrics for datanode reads. > |BytesRead > BlocksRead > TotalReadTime|Total number of bytes read from DataNode > Total number of blocks read from DataNode > Total number of milliseconds spent on read operation| > We would like to add a new quantile metric calculating the data transfer rate > for datanode reads. > This will give us a distribution across a window of the transfer rate for > each datanode. > Percentiles for transfer rate will help in identifying repetitive slow > datanodes or nodes with hotspots. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689990#comment-17689990 ] ASF GitHub Bot commented on HDFS-16918: --- virajjasani commented on PR #5396: URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1433713252 > The Dn case seems a corner case, it won't be very common and need to be careful around not getting pass a split-brain scenario. There are bunch of checks around though, but they are just to verify we don't get a false active claim acknowledged.. Oh yes, this was my first focus, I tried adding bunch of logs internally just to ensure if we are seeing some bugs here, but so far things look good. It's the TCP connection that Envoy infra is messing up (only sometimes, not often). But nvm, this is still worth spending time on. Thanks for all good points, I am going to close this PR mostly soon (just figuring out a few more details) so that it doesn't pile up in the open PRs. Thanks a lot for spending a lot of your time here, it's just so priceless! > Optionally shut down datanode if it does not stay connected to active namenode > -- > > Key: HDFS-16918 > URL: https://issues.apache.org/jira/browse/HDFS-16918 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > While deploying Hdfs on Envoy proxy setup, depending on the socket timeout > configured at envoy, the network connection issues or packet loss could be > observed. All of envoys basically form a transparent communication mesh in > which each app can send and receive packets to and from localhost and is > unaware of the network topology. > The primary purpose of Envoy is to make the network transparent to > applications, in order to identify network issues reliably. However, > sometimes such proxy based setup could result into socket connection issues > b/ datanode and namenode. > Many deployment frameworks provide auto-start functionality when any of the > hadoop daemons are stopped. If a given datanode does not stay connected to > active namenode in the cluster i.e. does not receive heartbeat response in > time from active namenode (even though active namenode is not terminated), it > would not be much useful. We should be able to provide configurable behavior > such that if a given datanode cannot receive heartbeat response from active > namenode in configurable time duration, it should terminate itself to avoid > impacting the availability SLA. This is specifically helpful when the > underlying deployment or observability framework (e.g. K8S) can start up the > datanode automatically upon it's shutdown (unless it is being restarted as > part of rolling upgrade) and help the newly brought up datanode (in case of > k8s, a new pod with dynamically changing nodes) establish new socket > connection to active and standby namenodes. This should be an opt-in behavior > and not default one. > > In a distributed system, it is essential to have robust fail-fast mechanisms > in place to prevent issues related to network partitioning. The system must > be designed to prevent further degradation of availability and consistency in > the event of a network partition. Several distributed systems offer fail-safe > approaches, and for some, partition tolerance is critical to the extent that > even a few seconds of heartbeat loss can trigger the removal of an > application server instance from the cluster. 
For instance, a majority of > zooKeeper clients utilize the ephemeral nodes for this purpose to make system > reliable, fault-tolerant and strongly consistent in the event of network > partition. > From the hdfs architecture viewpoint, it is crucial to understand the > critical role that active and observer namenode play in file system > operations. In a large-scale cluster, if the datanodes holding the same block > (primary and replicas) lose connection to both active and observer namenodes > for a significant amount of time, delaying the process of shutting down such > datanodes and restarting it to re-establish the connection with the namenodes > (assuming the active namenode is alive, assumption is important in the even > of network partition to reestablish the connection) will further deteriorate > the availability of the service. This scenario underscores the importance of > resolving network partitioning. > This is a real use case for hdfs and it is not prudent to assume that every > deployment or cluster management application must be able to restart > datanodes based on JMX metrics, as this would introduce another application > to resolve the network part
[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689989#comment-17689989 ] ASF GitHub Bot commented on HDFS-16918: --- ayushtkn commented on PR #5396: URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1433706087 I think I have lost the flow now 😅 But I think using the getDataNodeStats is a cool thing to explore, it is under a read lock so not costly either, and would be easier to process also may be... "Usually" around metrics, if things can be derived using the exposed ones, we don't coin new ones, generally, there are tools which can do that logics and show you fancy graphs and all also combining metrics together and doing maths on them as well... The Dn case seems a corner case, it won't be very common and need to be careful around not getting pass a split-brain scenario. There are bunch of checks around though, but they are just to verify we don't get a false active claim acknowledged.. But just thinking about this case, it can be figured out by simple logic or scripts, if there are two claiming active, the one from which the last response time is less can be used for those decisions. Something like ``` Variables to Store: activeNnId and LastActiveResponseTime=MAX Fetch Metrics From DN Iterate over all Namenodes. Check if Active NnLastResponseTime < LastActiveResponseTime Store the nnId and last Response Time else Move Forward if LastActiveResponseTime < configurable value conclude dead and do ``` May be some if else or equality might have got inverted, just for idea sake... > Optionally shut down datanode if it does not stay connected to active namenode > -- > > Key: HDFS-16918 > URL: https://issues.apache.org/jira/browse/HDFS-16918 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > While deploying Hdfs on Envoy proxy setup, depending on the socket timeout > configured at envoy, the network connection issues or packet loss could be > observed. All of envoys basically form a transparent communication mesh in > which each app can send and receive packets to and from localhost and is > unaware of the network topology. > The primary purpose of Envoy is to make the network transparent to > applications, in order to identify network issues reliably. However, > sometimes such proxy based setup could result into socket connection issues > b/ datanode and namenode. > Many deployment frameworks provide auto-start functionality when any of the > hadoop daemons are stopped. If a given datanode does not stay connected to > active namenode in the cluster i.e. does not receive heartbeat response in > time from active namenode (even though active namenode is not terminated), it > would not be much useful. We should be able to provide configurable behavior > such that if a given datanode cannot receive heartbeat response from active > namenode in configurable time duration, it should terminate itself to avoid > impacting the availability SLA. This is specifically helpful when the > underlying deployment or observability framework (e.g. K8S) can start up the > datanode automatically upon it's shutdown (unless it is being restarted as > part of rolling upgrade) and help the newly brought up datanode (in case of > k8s, a new pod with dynamically changing nodes) establish new socket > connection to active and standby namenodes. This should be an opt-in behavior > and not default one. 
> > In a distributed system, it is essential to have robust fail-fast mechanisms > in place to prevent issues related to network partitioning. The system must > be designed to prevent further degradation of availability and consistency in > the event of a network partition. Several distributed systems offer fail-safe > approaches, and for some, partition tolerance is critical to the extent that > even a few seconds of heartbeat loss can trigger the removal of an > application server instance from the cluster. For instance, a majority of > zooKeeper clients utilize the ephemeral nodes for this purpose to make system > reliable, fault-tolerant and strongly consistent in the event of network > partition. > From the hdfs architecture viewpoint, it is crucial to understand the > critical role that active and observer namenode play in file system > operations. In a large-scale cluster, if the datanodes holding the same block > (primary and replicas) lose connection to both active and observer namenodes > for a significant amount of time, delaying the process of shutting down such >
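A minimal Java rendering of the pseudocode sketched in the comment above. The actor interface and accessor names are hypothetical stand-ins for whatever the datanode actually exposes over JMX; only the idea of "trust the active namenode with the most recent response, then compare against a configurable threshold" comes from the thread.
```
// Sketch: decide from BP service actor metrics whether the active NN looks dead.
import java.util.List;
import java.util.concurrent.TimeUnit;

public class StaleActiveNamenodeCheck {

  static final long MAX_SILENCE_MS = TimeUnit.SECONDS.toMillis(60); // configurable

  /** Hypothetical view of one BP service actor as reported over JMX. */
  interface BpServiceActorInfo {
    String getNamenodeId();
    boolean isActive();                     // "Namenode HA state" == Active
    long getLastHeartbeatResponseTimeMs();  // ms since last heartbeat response
  }

  /** True if no actor reporting an Active NN has responded recently enough. */
  static boolean activeNamenodeLooksDead(List<BpServiceActorInfo> actors) {
    long freshestActiveSilence = Long.MAX_VALUE;
    for (BpServiceActorInfo actor : actors) {
      if (actor.isActive()) {
        freshestActiveSilence =
            Math.min(freshestActiveSilence, actor.getLastHeartbeatResponseTimeMs());
      }
    }
    // If two NNs both claim active (split brain), trust the most recent responder.
    return freshestActiveSilence > MAX_SILENCE_MS;
  }
}
```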
[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689973#comment-17689973 ] ASF GitHub Bot commented on HDFS-16918: --- virajjasani commented on PR #5396: URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1433668204 Any BP service actor with "Namenode HA state" as "Active" and "Last Heartbeat Response" > 60s (configurable), should be treated as "State Active Namenode". Maybe we can do that. Alright, sorry for adding up more and more comments, let me find the best way to expose things. For cloud native infra, it's still not easy to let metrics be exposed to the pod where we want to but will have to go for some security approvals, will work on this in parallel. Let me try fixing or at least normalizing the Namenode states in such a manner that we can expose "Stale Active Namenode" kind of Namenode HA state in metrics. That would be fairly easy for client to consume. It should also not be backward incompatible given that HDFS-16902 has been a very recent change. So making changes now in it before it can make it to a release should be fine I guess. > Optionally shut down datanode if it does not stay connected to active namenode > -- > > Key: HDFS-16918 > URL: https://issues.apache.org/jira/browse/HDFS-16918 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > While deploying Hdfs on Envoy proxy setup, depending on the socket timeout > configured at envoy, the network connection issues or packet loss could be > observed. All of envoys basically form a transparent communication mesh in > which each app can send and receive packets to and from localhost and is > unaware of the network topology. > The primary purpose of Envoy is to make the network transparent to > applications, in order to identify network issues reliably. However, > sometimes such proxy based setup could result into socket connection issues > b/ datanode and namenode. > Many deployment frameworks provide auto-start functionality when any of the > hadoop daemons are stopped. If a given datanode does not stay connected to > active namenode in the cluster i.e. does not receive heartbeat response in > time from active namenode (even though active namenode is not terminated), it > would not be much useful. We should be able to provide configurable behavior > such that if a given datanode cannot receive heartbeat response from active > namenode in configurable time duration, it should terminate itself to avoid > impacting the availability SLA. This is specifically helpful when the > underlying deployment or observability framework (e.g. K8S) can start up the > datanode automatically upon it's shutdown (unless it is being restarted as > part of rolling upgrade) and help the newly brought up datanode (in case of > k8s, a new pod with dynamically changing nodes) establish new socket > connection to active and standby namenodes. This should be an opt-in behavior > and not default one. > > In a distributed system, it is essential to have robust fail-fast mechanisms > in place to prevent issues related to network partitioning. The system must > be designed to prevent further degradation of availability and consistency in > the event of a network partition. 
Several distributed systems offer fail-safe > approaches, and for some, partition tolerance is critical to the extent that > even a few seconds of heartbeat loss can trigger the removal of an > application server instance from the cluster. For instance, a majority of > zooKeeper clients utilize the ephemeral nodes for this purpose to make system > reliable, fault-tolerant and strongly consistent in the event of network > partition. > From the hdfs architecture viewpoint, it is crucial to understand the > critical role that active and observer namenode play in file system > operations. In a large-scale cluster, if the datanodes holding the same block > (primary and replicas) lose connection to both active and observer namenodes > for a significant amount of time, delaying the process of shutting down such > datanodes and restarting it to re-establish the connection with the namenodes > (assuming the active namenode is alive, assumption is important in the even > of network partition to reestablish the connection) will further deteriorate > the availability of the service. This scenario underscores the importance of > resolving network partitioning. > This is a real use case for hdfs and it is not prudent to assume that every > deployment or cluster management application must be able to restart > datanodes based on J
[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689971#comment-17689971 ] ASF GitHub Bot commented on HDFS-16918: --- virajjasani commented on PR #5396: URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1433662089 In the second case where dn is not connected to active nn, the BP offer service would still list active nn as nn-1. The only way for us to actually let a client (administrative applications in this case) know that the given dn is actually out of luck connecting to active nn is by exposing new metric which does internal check of looping through BP service actor metrics and making sure that all BPs have exactly one nn listed as "Active" and has lastHeartbeatReponseTime within few seconds. This is the logic we somehow needs to expose for the clients (admins to take actions, for k8s, it will be some scripting that checks health of dn pods periodically). > Optionally shut down datanode if it does not stay connected to active namenode > -- > > Key: HDFS-16918 > URL: https://issues.apache.org/jira/browse/HDFS-16918 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > While deploying Hdfs on Envoy proxy setup, depending on the socket timeout > configured at envoy, the network connection issues or packet loss could be > observed. All of envoys basically form a transparent communication mesh in > which each app can send and receive packets to and from localhost and is > unaware of the network topology. > The primary purpose of Envoy is to make the network transparent to > applications, in order to identify network issues reliably. However, > sometimes such proxy based setup could result into socket connection issues > b/ datanode and namenode. > Many deployment frameworks provide auto-start functionality when any of the > hadoop daemons are stopped. If a given datanode does not stay connected to > active namenode in the cluster i.e. does not receive heartbeat response in > time from active namenode (even though active namenode is not terminated), it > would not be much useful. We should be able to provide configurable behavior > such that if a given datanode cannot receive heartbeat response from active > namenode in configurable time duration, it should terminate itself to avoid > impacting the availability SLA. This is specifically helpful when the > underlying deployment or observability framework (e.g. K8S) can start up the > datanode automatically upon it's shutdown (unless it is being restarted as > part of rolling upgrade) and help the newly brought up datanode (in case of > k8s, a new pod with dynamically changing nodes) establish new socket > connection to active and standby namenodes. This should be an opt-in behavior > and not default one. > > In a distributed system, it is essential to have robust fail-fast mechanisms > in place to prevent issues related to network partitioning. The system must > be designed to prevent further degradation of availability and consistency in > the event of a network partition. Several distributed systems offer fail-safe > approaches, and for some, partition tolerance is critical to the extent that > even a few seconds of heartbeat loss can trigger the removal of an > application server instance from the cluster. 
For instance, a majority of > zooKeeper clients utilize the ephemeral nodes for this purpose to make system > reliable, fault-tolerant and strongly consistent in the event of network > partition. > From the hdfs architecture viewpoint, it is crucial to understand the > critical role that active and observer namenode play in file system > operations. In a large-scale cluster, if the datanodes holding the same block > (primary and replicas) lose connection to both active and observer namenodes > for a significant amount of time, delaying the process of shutting down such > datanodes and restarting it to re-establish the connection with the namenodes > (assuming the active namenode is alive, assumption is important in the even > of network partition to reestablish the connection) will further deteriorate > the availability of the service. This scenario underscores the importance of > resolving network partitioning. > This is a real use case for hdfs and it is not prudent to assume that every > deployment or cluster management application must be able to restart > datanodes based on JMX metrics, as this would introduce another application > to resolve the network partition impact of hdfs. Besides, popular cluster > management applications are not typically used in all cloud-n
[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689968#comment-17689968 ] ASF GitHub Bot commented on HDFS-16918: --- virajjasani commented on PR #5396: URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1433655322 Btw just to give you more insights, what I am worried about is cases like this: In this case, dn is connected to active nn https://user-images.githubusercontent.com/34790606/219476091-989e6a73-d54d-4c34-b60f-65152cf6980c.png";> However in this case, dn is live, it's TCP connection is lost to active nn and it is healthy and yet not connected to active nn https://user-images.githubusercontent.com/34790606/219476263-9a039e14-9a0a-4c18-a103-fb448a61cf58.png";> > Optionally shut down datanode if it does not stay connected to active namenode > -- > > Key: HDFS-16918 > URL: https://issues.apache.org/jira/browse/HDFS-16918 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > While deploying Hdfs on Envoy proxy setup, depending on the socket timeout > configured at envoy, the network connection issues or packet loss could be > observed. All of envoys basically form a transparent communication mesh in > which each app can send and receive packets to and from localhost and is > unaware of the network topology. > The primary purpose of Envoy is to make the network transparent to > applications, in order to identify network issues reliably. However, > sometimes such proxy based setup could result into socket connection issues > b/ datanode and namenode. > Many deployment frameworks provide auto-start functionality when any of the > hadoop daemons are stopped. If a given datanode does not stay connected to > active namenode in the cluster i.e. does not receive heartbeat response in > time from active namenode (even though active namenode is not terminated), it > would not be much useful. We should be able to provide configurable behavior > such that if a given datanode cannot receive heartbeat response from active > namenode in configurable time duration, it should terminate itself to avoid > impacting the availability SLA. This is specifically helpful when the > underlying deployment or observability framework (e.g. K8S) can start up the > datanode automatically upon it's shutdown (unless it is being restarted as > part of rolling upgrade) and help the newly brought up datanode (in case of > k8s, a new pod with dynamically changing nodes) establish new socket > connection to active and standby namenodes. This should be an opt-in behavior > and not default one. > > In a distributed system, it is essential to have robust fail-fast mechanisms > in place to prevent issues related to network partitioning. The system must > be designed to prevent further degradation of availability and consistency in > the event of a network partition. Several distributed systems offer fail-safe > approaches, and for some, partition tolerance is critical to the extent that > even a few seconds of heartbeat loss can trigger the removal of an > application server instance from the cluster. For instance, a majority of > zooKeeper clients utilize the ephemeral nodes for this purpose to make system > reliable, fault-tolerant and strongly consistent in the event of network > partition. 
> From the hdfs architecture viewpoint, it is crucial to understand the > critical role that active and observer namenode play in file system > operations. In a large-scale cluster, if the datanodes holding the same block > (primary and replicas) lose connection to both active and observer namenodes > for a significant amount of time, delaying the process of shutting down such > datanodes and restarting it to re-establish the connection with the namenodes > (assuming the active namenode is alive, assumption is important in the even > of network partition to reestablish the connection) will further deteriorate > the availability of the service. This scenario underscores the importance of > resolving network partitioning. > This is a real use case for hdfs and it is not prudent to assume that every > deployment or cluster management application must be able to restart > datanodes based on JMX metrics, as this would introduce another application > to resolve the network partition impact of hdfs. Besides, popular cluster > management applications are not typically used in all cloud-native env. Even > if these cluster management applications are deployed, certain security > constraints may restrict their access to JMX metrics and prevent them from > in
[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689966#comment-17689966 ] ASF GitHub Bot commented on HDFS-16918: --- virajjasani commented on PR #5396: URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1433649342 Yes, we do have check at namenode side: ``` @Override // NameNodeMXBean public String getDeadNodes() { final Map> info = new HashMap>(); final List dead = new ArrayList(); blockManager.getDatanodeManager().fetchDatanodes(null, dead, false); for (DatanodeDescriptor node : dead) { ... ... ... ``` I am thinking more about `getDataNodeStats()` API > Optionally shut down datanode if it does not stay connected to active namenode > -- > > Key: HDFS-16918 > URL: https://issues.apache.org/jira/browse/HDFS-16918 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > While deploying Hdfs on Envoy proxy setup, depending on the socket timeout > configured at envoy, the network connection issues or packet loss could be > observed. All of envoys basically form a transparent communication mesh in > which each app can send and receive packets to and from localhost and is > unaware of the network topology. > The primary purpose of Envoy is to make the network transparent to > applications, in order to identify network issues reliably. However, > sometimes such proxy based setup could result into socket connection issues > b/ datanode and namenode. > Many deployment frameworks provide auto-start functionality when any of the > hadoop daemons are stopped. If a given datanode does not stay connected to > active namenode in the cluster i.e. does not receive heartbeat response in > time from active namenode (even though active namenode is not terminated), it > would not be much useful. We should be able to provide configurable behavior > such that if a given datanode cannot receive heartbeat response from active > namenode in configurable time duration, it should terminate itself to avoid > impacting the availability SLA. This is specifically helpful when the > underlying deployment or observability framework (e.g. K8S) can start up the > datanode automatically upon it's shutdown (unless it is being restarted as > part of rolling upgrade) and help the newly brought up datanode (in case of > k8s, a new pod with dynamically changing nodes) establish new socket > connection to active and standby namenodes. This should be an opt-in behavior > and not default one. > > In a distributed system, it is essential to have robust fail-fast mechanisms > in place to prevent issues related to network partitioning. The system must > be designed to prevent further degradation of availability and consistency in > the event of a network partition. Several distributed systems offer fail-safe > approaches, and for some, partition tolerance is critical to the extent that > even a few seconds of heartbeat loss can trigger the removal of an > application server instance from the cluster. For instance, a majority of > zooKeeper clients utilize the ephemeral nodes for this purpose to make system > reliable, fault-tolerant and strongly consistent in the event of network > partition. > From the hdfs architecture viewpoint, it is crucial to understand the > critical role that active and observer namenode play in file system > operations. 
In a large-scale cluster, if the datanodes holding the same block > (primary and replicas) lose connection to both active and observer namenodes > for a significant amount of time, delaying the process of shutting down such > datanodes and restarting it to re-establish the connection with the namenodes > (assuming the active namenode is alive, assumption is important in the even > of network partition to reestablish the connection) will further deteriorate > the availability of the service. This scenario underscores the importance of > resolving network partitioning. > This is a real use case for hdfs and it is not prudent to assume that every > deployment or cluster management application must be able to restart > datanodes based on JMX metrics, as this would introduce another application > to resolve the network partition impact of hdfs. Besides, popular cluster > management applications are not typically used in all cloud-native env. Even > if these cluster management applications are deployed, certain security > constraints may restrict their access to JMX metrics and prevent them from > interfering with hdfs operations. The applications that can only trigger > aler
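For the getDataNodeStats() direction mentioned above, an external checker could ask the namenode directly which datanodes it already considers dead, rather than having the datanode terminate itself. A small sketch follows; the cluster URI is a placeholder, and DistributedFileSystem#getDataNodeStats(DatanodeReportType) is the existing client API the comment refers to.
```
// Sketch: periodic admin-side check using the NN's own view of dead datanodes.
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hdfs.DistributedFileSystem;
import org.apache.hadoop.hdfs.protocol.DatanodeInfo;
import org.apache.hadoop.hdfs.protocol.HdfsConstants.DatanodeReportType;

public class DeadDatanodeReport {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // "hdfs://mycluster" is a placeholder for the nameservice URI.
    try (FileSystem fs = FileSystem.get(URI.create("hdfs://mycluster"), conf)) {
      DistributedFileSystem dfs = (DistributedFileSystem) fs;
      for (DatanodeInfo dn : dfs.getDataNodeStats(DatanodeReportType.DEAD)) {
        // A periodic external checker (cron, k8s job, ...) could restart these.
        System.out.println("Dead according to the NN: " + dn.getXferAddr());
      }
    }
  }
}
```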
[jira] [Commented] (HDFS-16917) Add transfer rate quantile metrics for DataNode reads
[ https://issues.apache.org/jira/browse/HDFS-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689959#comment-17689959 ] ASF GitHub Bot commented on HDFS-16917: --- rdingankar commented on PR #5397: URL: https://github.com/apache/hadoop/pull/5397#issuecomment-1433632301 The UTs failing in the build are unrelated to my change. Verified locally that the same tests fail after rebasing to current trunk. > Add transfer rate quantile metrics for DataNode reads > - > > Key: HDFS-16917 > URL: https://issues.apache.org/jira/browse/HDFS-16917 > Project: Hadoop HDFS > Issue Type: Task > Components: datanode >Reporter: Ravindra Dingankar >Priority: Minor > Labels: pull-request-available > > Currently we have the following metrics for datanode reads. > |BytesRead > BlocksRead > TotalReadTime|Total number of bytes read from DataNode > Total number of blocks read from DataNode > Total number of milliseconds spent on read operation| > We would like to add a new quantile metric calculating the distribution of > data transfer rate for datanode reads. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689958#comment-17689958 ] ASF GitHub Bot commented on HDFS-16918: --- ayushtkn commented on PR #5396: URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1433632089 The active NN knows that which datanode is dead. That is how it shows in the UI as well. There would be some param in the JMX which must be telling the state of the datanode to the active namenode. I can pull that out for you, if you want, but it is in the UI, so there would be a metric for sure, just being lazy to check the code again: ![image](https://user-images.githubusercontent.com/25608848/219469872-4d561fdb-98b0-46d8-abb7-e5c2eb0dcd46.png) Datanode has metrics and you know post what time it is declared dead. Any service can have periodic health checks and have a check. Anyway you have a service which checks if datanode is dead and restarts, some logics here and there in that to have a periodic check to shoot a shutdown as well, should do. https://user-images.githubusercontent.com/25608848/219470918-db38d602-984f-4baa-9860-aee19b2af646.png";> Code point of view implementing such a logic sounds very naive to me. or may be minimal effort thing Not dragging the use case list either, because there ain't no end to that, client was X and he was in Y state and blah blah, datanode block reconstruction works, around block movements and it won't end > Optionally shut down datanode if it does not stay connected to active namenode > -- > > Key: HDFS-16918 > URL: https://issues.apache.org/jira/browse/HDFS-16918 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > While deploying Hdfs on Envoy proxy setup, depending on the socket timeout > configured at envoy, the network connection issues or packet loss could be > observed. All of envoys basically form a transparent communication mesh in > which each app can send and receive packets to and from localhost and is > unaware of the network topology. > The primary purpose of Envoy is to make the network transparent to > applications, in order to identify network issues reliably. However, > sometimes such proxy based setup could result into socket connection issues > b/ datanode and namenode. > Many deployment frameworks provide auto-start functionality when any of the > hadoop daemons are stopped. If a given datanode does not stay connected to > active namenode in the cluster i.e. does not receive heartbeat response in > time from active namenode (even though active namenode is not terminated), it > would not be much useful. We should be able to provide configurable behavior > such that if a given datanode cannot receive heartbeat response from active > namenode in configurable time duration, it should terminate itself to avoid > impacting the availability SLA. This is specifically helpful when the > underlying deployment or observability framework (e.g. K8S) can start up the > datanode automatically upon it's shutdown (unless it is being restarted as > part of rolling upgrade) and help the newly brought up datanode (in case of > k8s, a new pod with dynamically changing nodes) establish new socket > connection to active and standby namenodes. This should be an opt-in behavior > and not default one. > > In a distributed system, it is essential to have robust fail-fast mechanisms > in place to prevent issues related to network partitioning. 
The system must > be designed to prevent further degradation of availability and consistency in > the event of a network partition. Several distributed systems offer fail-safe > approaches, and for some, partition tolerance is critical to the extent that > even a few seconds of heartbeat loss can trigger the removal of an > application server instance from the cluster. For instance, a majority of > zooKeeper clients utilize the ephemeral nodes for this purpose to make system > reliable, fault-tolerant and strongly consistent in the event of network > partition. > From the hdfs architecture viewpoint, it is crucial to understand the > critical role that active and observer namenode play in file system > operations. In a large-scale cluster, if the datanodes holding the same block > (primary and replicas) lose connection to both active and observer namenodes > for a significant amount of time, delaying the process of shutting down such > datanodes and restarting it to re-establish the connection with the namenodes > (assuming the active namenode is alive, assumption is important in the even > of network partition to rees
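The JMXJsonServlet route mentioned above can be exercised with nothing more than an HTTP GET. A sketch is below; the host and port are placeholders, and the NameNodeInfo bean's DeadNodes attribute is the one backed by the getDeadNodes() MXBean method quoted elsewhere in this thread.
```
// Sketch: pull the NameNode's dead-node view over the JMX JSON servlet.
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class JmxDeadNodesProbe {
  public static void main(String[] args) throws Exception {
    // namenode-host:9870 is a placeholder for the active NN's HTTP address.
    String url = "http://namenode-host:9870/jmx"
        + "?qry=Hadoop:service=NameNode,name=NameNodeInfo";
    HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    // The response is JSON; the "DeadNodes" attribute is itself a JSON map
    // keyed by datanode name. A real checker would parse it instead of printing.
    System.out.println(response.body());
  }
}
```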
[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689957#comment-17689957 ] ASF GitHub Bot commented on HDFS-16918: --- virajjasani commented on PR #5396: URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1433630564 > or there is an API which getDatanodeStats and which can take dead as param or so, and such a logic can be developed by a periodic check or so. I think this is interesting point, indeed I should have considered this API that is already in use. Let me get back with some improvements to it and then I can suggest building K8S script around this, that should have similar impact as what we are trying to achieve within datanode itself. > Optionally shut down datanode if it does not stay connected to active namenode > -- > > Key: HDFS-16918 > URL: https://issues.apache.org/jira/browse/HDFS-16918 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > While deploying Hdfs on Envoy proxy setup, depending on the socket timeout > configured at envoy, the network connection issues or packet loss could be > observed. All of envoys basically form a transparent communication mesh in > which each app can send and receive packets to and from localhost and is > unaware of the network topology. > The primary purpose of Envoy is to make the network transparent to > applications, in order to identify network issues reliably. However, > sometimes such proxy based setup could result into socket connection issues > b/ datanode and namenode. > Many deployment frameworks provide auto-start functionality when any of the > hadoop daemons are stopped. If a given datanode does not stay connected to > active namenode in the cluster i.e. does not receive heartbeat response in > time from active namenode (even though active namenode is not terminated), it > would not be much useful. We should be able to provide configurable behavior > such that if a given datanode cannot receive heartbeat response from active > namenode in configurable time duration, it should terminate itself to avoid > impacting the availability SLA. This is specifically helpful when the > underlying deployment or observability framework (e.g. K8S) can start up the > datanode automatically upon it's shutdown (unless it is being restarted as > part of rolling upgrade) and help the newly brought up datanode (in case of > k8s, a new pod with dynamically changing nodes) establish new socket > connection to active and standby namenodes. This should be an opt-in behavior > and not default one. > > In a distributed system, it is essential to have robust fail-fast mechanisms > in place to prevent issues related to network partitioning. The system must > be designed to prevent further degradation of availability and consistency in > the event of a network partition. Several distributed systems offer fail-safe > approaches, and for some, partition tolerance is critical to the extent that > even a few seconds of heartbeat loss can trigger the removal of an > application server instance from the cluster. For instance, a majority of > zooKeeper clients utilize the ephemeral nodes for this purpose to make system > reliable, fault-tolerant and strongly consistent in the event of network > partition. > From the hdfs architecture viewpoint, it is crucial to understand the > critical role that active and observer namenode play in file system > operations. 
In a large-scale cluster, if the datanodes holding the same block > (primary and replicas) lose connection to both active and observer namenodes > for a significant amount of time, delaying the process of shutting down such > datanodes and restarting it to re-establish the connection with the namenodes > (assuming the active namenode is alive, assumption is important in the even > of network partition to reestablish the connection) will further deteriorate > the availability of the service. This scenario underscores the importance of > resolving network partitioning. > This is a real use case for hdfs and it is not prudent to assume that every > deployment or cluster management application must be able to restart > datanodes based on JMX metrics, as this would introduce another application > to resolve the network partition impact of hdfs. Besides, popular cluster > management applications are not typically used in all cloud-native env. Even > if these cluster management applications are deployed, certain security > constraints may restrict their access to JMX metrics and prevent them from > interfering with hdfs operations. The applications that
[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689949#comment-17689949 ] ASF GitHub Bot commented on HDFS-16918: --- virajjasani commented on PR #5396: URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1433614873 > I have mentioned a bunch of reasons above, I even think if some client is connected to datanode and happily reading a file, he might get impacted, AFAIK block location can be cached as well and there are many other reasons Yes this is valid point but only until block locations stay cached at client :) But I understand there is no point discussing on the same reasons as we would keep dragging the same point. Thanks!! > Optionally shut down datanode if it does not stay connected to active namenode > -- > > Key: HDFS-16918 > URL: https://issues.apache.org/jira/browse/HDFS-16918 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > While deploying Hdfs on Envoy proxy setup, depending on the socket timeout > configured at envoy, the network connection issues or packet loss could be > observed. All of envoys basically form a transparent communication mesh in > which each app can send and receive packets to and from localhost and is > unaware of the network topology. > The primary purpose of Envoy is to make the network transparent to > applications, in order to identify network issues reliably. However, > sometimes such proxy based setup could result into socket connection issues > b/ datanode and namenode. > Many deployment frameworks provide auto-start functionality when any of the > hadoop daemons are stopped. If a given datanode does not stay connected to > active namenode in the cluster i.e. does not receive heartbeat response in > time from active namenode (even though active namenode is not terminated), it > would not be much useful. We should be able to provide configurable behavior > such that if a given datanode cannot receive heartbeat response from active > namenode in configurable time duration, it should terminate itself to avoid > impacting the availability SLA. This is specifically helpful when the > underlying deployment or observability framework (e.g. K8S) can start up the > datanode automatically upon it's shutdown (unless it is being restarted as > part of rolling upgrade) and help the newly brought up datanode (in case of > k8s, a new pod with dynamically changing nodes) establish new socket > connection to active and standby namenodes. This should be an opt-in behavior > and not default one. > > In a distributed system, it is essential to have robust fail-fast mechanisms > in place to prevent issues related to network partitioning. The system must > be designed to prevent further degradation of availability and consistency in > the event of a network partition. Several distributed systems offer fail-safe > approaches, and for some, partition tolerance is critical to the extent that > even a few seconds of heartbeat loss can trigger the removal of an > application server instance from the cluster. For instance, a majority of > zooKeeper clients utilize the ephemeral nodes for this purpose to make system > reliable, fault-tolerant and strongly consistent in the event of network > partition. > From the hdfs architecture viewpoint, it is crucial to understand the > critical role that active and observer namenode play in file system > operations. 
In a large-scale cluster, if the datanodes holding the same block > (primary and replicas) lose connection to both active and observer namenodes > for a significant amount of time, delaying the process of shutting down such > datanodes and restarting it to re-establish the connection with the namenodes > (assuming the active namenode is alive, assumption is important in the even > of network partition to reestablish the connection) will further deteriorate > the availability of the service. This scenario underscores the importance of > resolving network partitioning. > This is a real use case for hdfs and it is not prudent to assume that every > deployment or cluster management application must be able to restart > datanodes based on JMX metrics, as this would introduce another application > to resolve the network partition impact of hdfs. Besides, popular cluster > management applications are not typically used in all cloud-native env. Even > if these cluster management applications are deployed, certain security > constraints may restrict their access to JMX metrics and prevent them from > interfering with hdfs operations. The applications that can only
[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689948#comment-17689948 ] ASF GitHub Bot commented on HDFS-16918: --- virajjasani commented on PR #5396: URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1433612218 Thanks for the reply Ayush, appreciate it as always. Are you saying that default implementation of "logging and/or exposing JMX metrics" for a given datanode if it doesn't stay connected is also not feasible according to you? I know we have metric that says "lastHeartbeat" and "lastHeartbeatResponseTime" but it's still difficult for user or script to apply a loop into BP service actor metrics rather than getting as simple log or metric as "this datanode has not heard from active namenode in the last 60s or so". Are you at least fine with keeping this as default implementation logic? > Optionally shut down datanode if it does not stay connected to active namenode > -- > > Key: HDFS-16918 > URL: https://issues.apache.org/jira/browse/HDFS-16918 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > While deploying Hdfs on Envoy proxy setup, depending on the socket timeout > configured at envoy, the network connection issues or packet loss could be > observed. All of envoys basically form a transparent communication mesh in > which each app can send and receive packets to and from localhost and is > unaware of the network topology. > The primary purpose of Envoy is to make the network transparent to > applications, in order to identify network issues reliably. However, > sometimes such proxy based setup could result into socket connection issues > b/ datanode and namenode. > Many deployment frameworks provide auto-start functionality when any of the > hadoop daemons are stopped. If a given datanode does not stay connected to > active namenode in the cluster i.e. does not receive heartbeat response in > time from active namenode (even though active namenode is not terminated), it > would not be much useful. We should be able to provide configurable behavior > such that if a given datanode cannot receive heartbeat response from active > namenode in configurable time duration, it should terminate itself to avoid > impacting the availability SLA. This is specifically helpful when the > underlying deployment or observability framework (e.g. K8S) can start up the > datanode automatically upon it's shutdown (unless it is being restarted as > part of rolling upgrade) and help the newly brought up datanode (in case of > k8s, a new pod with dynamically changing nodes) establish new socket > connection to active and standby namenodes. This should be an opt-in behavior > and not default one. > > In a distributed system, it is essential to have robust fail-fast mechanisms > in place to prevent issues related to network partitioning. The system must > be designed to prevent further degradation of availability and consistency in > the event of a network partition. Several distributed systems offer fail-safe > approaches, and for some, partition tolerance is critical to the extent that > even a few seconds of heartbeat loss can trigger the removal of an > application server instance from the cluster. For instance, a majority of > zooKeeper clients utilize the ephemeral nodes for this purpose to make system > reliable, fault-tolerant and strongly consistent in the event of network > partition. 
> From the hdfs architecture viewpoint, it is crucial to understand the > critical role that active and observer namenode play in file system > operations. In a large-scale cluster, if the datanodes holding the same block > (primary and replicas) lose connection to both active and observer namenodes > for a significant amount of time, delaying the process of shutting down such > datanodes and restarting it to re-establish the connection with the namenodes > (assuming the active namenode is alive, assumption is important in the even > of network partition to reestablish the connection) will further deteriorate > the availability of the service. This scenario underscores the importance of > resolving network partitioning. > This is a real use case for hdfs and it is not prudent to assume that every > deployment or cluster management application must be able to restart > datanodes based on JMX metrics, as this would introduce another application > to resolve the network partition impact of hdfs. Besides, popular cluster > management applications are not typically used in all cloud-native env. Even > if these cluster management applications are deploy
[jira] [Commented] (HDFS-16917) Add transfer rate quantile metrics for DataNode reads
[ https://issues.apache.org/jira/browse/HDFS-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689947#comment-17689947 ] ASF GitHub Bot commented on HDFS-16917: --- rdingankar commented on code in PR #5397: URL: https://github.com/apache/hadoop/pull/5397#discussion_r1108939915 ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/DFSUtil.java: ## @@ -1936,4 +1936,17 @@ public static boolean isParentEntry(final String path, final String parent) { return path.charAt(parent.length()) == Path.SEPARATOR_CHAR || parent.equals(Path.SEPARATOR); } + + /** + * Calculate the transfer rate in megabytes/second. + * @param bytes bytes + * @param durationMS duration in milliseconds + * @return the number of megabytes/second of the transfer rate + */ + public static long transferRateMBs(long bytes, long durationMS) { +if (durationMS == 0) { Review Comment: updated > Add transfer rate quantile metrics for DataNode reads > - > > Key: HDFS-16917 > URL: https://issues.apache.org/jira/browse/HDFS-16917 > Project: Hadoop HDFS > Issue Type: Task > Components: datanode >Reporter: Ravindra Dingankar >Priority: Minor > Labels: pull-request-available > > Currently we have the following metrics for datanode reads. > |BytesRead|Total number of bytes read from DataNode| > |BlocksRead|Total number of blocks read from DataNode| > |TotalReadTime|Total number of milliseconds spent on read operation| > We would like to add a new quantile metric calculating the distribution of > data transfer rate for datanode reads. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
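For readers following the review thread above, a minimal sketch of what the discussed utility could look like may help. This is only an illustration under assumptions: the zero-duration handling and the rounding behaviour are guesses, not necessarily the code that was merged in PR #5397.

{code:java}
// Hypothetical sketch of the DFSUtil helper discussed above; the merged
// implementation in PR #5397 may differ, in particular in how a zero
// duration is handled.
public static long transferRateMBs(long bytes, long durationMS) {
  // Assumption: clamp a sub-millisecond transfer to 1 ms so we never divide
  // by zero and the reported rate stays finite.
  if (durationMS == 0) {
    durationMS = 1;
  }
  // MB/s = (bytes / 2^20) / (durationMS / 1000)
  return (bytes * 1000) / (durationMS * 1024 * 1024);
}
{code}

A per-read rate computed along these lines could then presumably be fed into the same quantile machinery the DataNode metrics already use for latency distributions, which is what the "quantile metrics" in the issue title refers to.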
[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689944#comment-17689944 ] ASF GitHub Bot commented on HDFS-16918: --- ayushtkn commented on PR #5396: URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1433606299 Viraj, I am sorry, I am totally against this. I am writing this because I don't want to ghost you and then, if someone comes and agrees, shoot a vote against it. I have mentioned a bunch of reasons above. I even think that if some client is connected to the datanode and happily reading a file, it might get impacted; AFAIK block locations can be cached as well, and there are many other reasons. I don't want to get you a list, I am pretty sure you would be aware of almost all of them... A service like the datanode killing itself doesn't sound feasible to me at all. Having these hooks and all in a service which holds data sounds like doing the same thing but opening ways to get exploited. That sounds even more risky to me. This is something cluster admin services should handle. A datanode going down or having trouble is a basic use case for HDFS; that is where replication pitches in. Ideally it should just alert the admins and they should figure out what went wrong; maybe a restart won't fix things and you would be in a loop: do a shutdown, shoot a BR to the ones you are still connected to, and then restart. Metrics are there which can tell you which datanode is dead, so advanced cluster administration services can leverage that. There is [JMXJsonServlet](https://github.com/apache/hadoop/blob/trunk/hadoop-common-project/hadoop-common/src/main/java/org/apache/hadoop/jmx/JMXJsonServlet.java) which can be leveraged. If those services can trigger a restart and do operations like shutdown, they should be allowed to fetch metrics as well. Or there is an API like getDatanodeStats which can take dead as a param or so, and such logic can be developed with a periodic check. Regarding the cloud thing and metrics stuff: I got a chance to talk to some cloud infra folks at my org and we do have ways to get metrics. I am not sharing how, because I don't know how professionally safe it is for me, but there are ways to do so. So, this can be handled at the deployment level and should be done there only; as for this auto-shutdown logic based on some factors, I am just repeating myself: I am totally against it > Optionally shut down datanode if it does not stay connected to active namenode > -- > > Key: HDFS-16918 > URL: https://issues.apache.org/jira/browse/HDFS-16918 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > While deploying HDFS on an Envoy proxy setup, depending on the socket timeout > configured at Envoy, network connection issues or packet loss could be > observed. All of the Envoys basically form a transparent communication mesh in > which each app can send and receive packets to and from localhost and is > unaware of the network topology. > The primary purpose of Envoy is to make the network transparent to > applications, in order to identify network issues reliably. However, > sometimes such a proxy based setup could result in socket connection issues > b/w datanode and namenode. > Many deployment frameworks provide auto-start functionality when any of the > hadoop daemons are stopped. If a given datanode does not stay connected to > the active namenode in the cluster, i.e. 
does not receive a heartbeat response in > time from the active namenode (even though the active namenode is not terminated), it > would not be of much use. We should be able to provide configurable behavior > such that if a given datanode cannot receive a heartbeat response from the active > namenode within a configurable time duration, it should terminate itself to avoid > impacting the availability SLA. This is specifically helpful when the > underlying deployment or observability framework (e.g. K8S) can start up the > datanode automatically upon its shutdown (unless it is being restarted as > part of a rolling upgrade) and help the newly brought up datanode (in case of > k8s, a new pod with dynamically changing nodes) establish new socket > connections to the active and standby namenodes. This should be an opt-in behavior > and not the default one. > > In a distributed system, it is essential to have robust fail-fast mechanisms > in place to prevent issues related to network partitioning. The system must > be designed to prevent further degradation of availability and consistency in > the event of a network partition. Several distributed systems o
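To make the "handle it from outside HDFS" suggestion in the comment above concrete, here is a rough sketch of an external probe against the JMXJsonServlet. The host/port, query string and the DeadNodes attribute are assumptions based on the default NameNode web UI endpoint, not anything mandated by this JIRA; a real deployment-level tool would parse the JSON and decide whether to page an operator or restart a pod.

{code:java}
// Hedged sketch of an out-of-band probe; the endpoint and attribute names
// are assumptions (default NameNode HTTP port, NameNodeInfo MBean).
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class DeadNodeProbe {
  public static void main(String[] args) throws Exception {
    String nnHttpAddress = args.length > 0 ? args[0] : "http://nn.example.com:9870";
    // JMXJsonServlet exposes MBeans as JSON under /jmx; the NameNodeInfo bean
    // carries a DeadNodes attribute that the NameNode UI itself renders.
    HttpRequest request = HttpRequest.newBuilder(
        URI.create(nnHttpAddress + "/jmx?qry=Hadoop:service=NameNode,name=NameNodeInfo"))
        .GET()
        .build();
    HttpResponse<String> response = HttpClient.newHttpClient()
        .send(request, HttpResponse.BodyHandlers.ofString());
    // A real monitor would parse this JSON and act on a non-empty DeadNodes list.
    System.out.println(response.body());
  }
}
{code}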
[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689900#comment-17689900 ] ASF GitHub Bot commented on HDFS-16918: --- virajjasani commented on PR #5396: URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1433513544 How about this? This change is working fine on the cluster as is, and it is a real requirement as I explained in the [above comment](https://github.com/apache/hadoop/pull/5396#issuecomment-1432160563). If we do not want to keep this change as is, i.e. shut down the datanode if it is not connected to the active namenode, how about we provide a pluggable implementation? Let's say, by default, if the datanode does not stay connected to the active namenode for 60s, in the default implementation (that we can provide with this patch) we take the action of just logging (or maybe exposing a metric, whatever reviewers feel is feasible) the fact that this datanode is not being useful for clients as it has lost connection to the active namenode for more than the past 60s. This is the default implementation that we can keep. On the other hand, users are allowed to have their own pluggable implementation, so let's say if someone wants to shut down the datanode after 60s (default) of losing connection, they will have to use a new implementation with the action "shutdown datanode". Hence, we have two configs for this change: 1. the time duration for losing connection (`dfs.datanode.health.activennconnect.timeout`), which we already have, but with a default value of 60s; 2. the action to be performed by the datanode when the above threshold is reached (maybe something like `dfs.datanode.activennconnect.timeout.action.impl`), with a default implementation that would just log or expose a metric as per consensus. Any user can have their own separately maintained implementation, and that implementation can take the action of shutting down the datanode, or running another script that could invoke a dfsadmin action. Anything should be fine, but now the code stays with users. Thoughts? > Optionally shut down datanode if it does not stay connected to active namenode > -- > > Key: HDFS-16918 > URL: https://issues.apache.org/jira/browse/HDFS-16918 > Project: Hadoop HDFS > Issue Type: New Feature >Reporter: Viraj Jasani >Assignee: Viraj Jasani >Priority: Major > Labels: pull-request-available > > While deploying HDFS on an Envoy proxy setup, depending on the socket timeout > configured at Envoy, network connection issues or packet loss could be > observed. All of the Envoys basically form a transparent communication mesh in > which each app can send and receive packets to and from localhost and is > unaware of the network topology. > The primary purpose of Envoy is to make the network transparent to > applications, in order to identify network issues reliably. However, > sometimes such a proxy based setup could result in socket connection issues > b/w datanode and namenode. > Many deployment frameworks provide auto-start functionality when any of the > hadoop daemons are stopped. If a given datanode does not stay connected to > the active namenode in the cluster, i.e. does not receive a heartbeat response in > time from the active namenode (even though the active namenode is not terminated), it > would not be of much use. We should be able to provide configurable behavior > such that if a given datanode cannot receive a heartbeat response from the active > namenode within a configurable time duration, it should terminate itself to avoid > impacting the availability SLA. 
This is specifically helpful when the > underlying deployment or observability framework (e.g. K8S) can start up the > datanode automatically upon its shutdown (unless it is being restarted as > part of a rolling upgrade) and help the newly brought up datanode (in case of > k8s, a new pod with dynamically changing nodes) establish new socket > connections to the active and standby namenodes. This should be an opt-in behavior > and not the default one. > > In a distributed system, it is essential to have robust fail-fast mechanisms > in place to prevent issues related to network partitioning. The system must > be designed to prevent further degradation of availability and consistency in > the event of a network partition. Several distributed systems offer fail-safe > approaches, and for some, partition tolerance is critical to the extent that > even a few seconds of heartbeat loss can trigger the removal of an > application server instance from the cluster. For instance, a majority of > ZooKeeper clients utilize ephemeral nodes for this purpose to make the system > reliable, fault-tolerant and strongly consistent in the event of a network > partition. >
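Purely as an illustration of the pluggable-action proposal in the comment above, the shape could look roughly like the following. The interface and class names are hypothetical and do not correspond to anything in PR #5396; only the two configuration keys quoted in the comment are taken from the discussion.

{code:java}
// Hypothetical sketch only; names below are illustrative and not part of the
// actual patch under review.
public interface ActiveNNConnectTimeoutAction {
  /**
   * Called by the datanode once no heartbeat response has been received from
   * the active namenode for longer than the configured timeout
   * (dfs.datanode.health.activennconnect.timeout, proposed default 60s).
   */
  void onTimeout(long millisSinceLastHeartbeatResponse);
}

/** Proposed default: just log (and/or expose a metric), take no action. */
class LogOnlyTimeoutAction implements ActiveNNConnectTimeoutAction {
  @Override
  public void onTimeout(long millisSinceLastHeartbeatResponse) {
    System.err.println("Datanode has not heard from the active namenode for "
        + millisSinceLastHeartbeatResponse + " ms");
  }
}

/**
 * Opt-in behaviour a user could select via
 * dfs.datanode.activennconnect.timeout.action.impl: exit the process so an
 * external framework (e.g. Kubernetes) restarts the datanode.
 */
class ShutdownTimeoutAction implements ActiveNNConnectTimeoutAction {
  @Override
  public void onTimeout(long millisSinceLastHeartbeatResponse) {
    Runtime.getRuntime().exit(1);
  }
}
{code}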
[jira] [Commented] (HDFS-16761) Namenode UI for Datanodes page not loading if any data node is down
[ https://issues.apache.org/jira/browse/HDFS-16761?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689891#comment-17689891 ] Stephen O'Donnell commented on HDFS-16761: -- Branch 3.2 seems to be OK too, so resolving this one. > Namenode UI for Datanodes page not loading if any data node is down > --- > > Key: HDFS-16761 > URL: https://issues.apache.org/jira/browse/HDFS-16761 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.2 >Reporter: Krishna Reddy >Assignee: Zita Dombi >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > Steps to reproduce: > - Install the hadoop components and add 3 datanodes > - Enable namenode HA > - Open the Namenode UI and check the datanode page > - Check that all datanodes are displayed > - Now bring one datanode down > - Wait for 10 minutes for the heartbeat to expire > - Refresh the namenode page and check > > Actual Result: It shows the error message "NameNode is still loading. > Redirecting to the Startup Progress page." -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Updated] (HDFS-16761) Namenode UI for Datanodes page not loading if any data node is down
[ https://issues.apache.org/jira/browse/HDFS-16761?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Stephen O'Donnell updated HDFS-16761: - Resolution: Fixed Status: Resolved (was: Patch Available) > Namenode UI for Datanodes page not loading if any data node is down > --- > > Key: HDFS-16761 > URL: https://issues.apache.org/jira/browse/HDFS-16761 > Project: Hadoop HDFS > Issue Type: Bug >Affects Versions: 3.2.2 >Reporter: Krishna Reddy >Assignee: Zita Dombi >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > Steps to reproduce: > - Install the hadoop components and add 3 datanodes > - Enable namenode HA > - Open the Namenode UI and check the datanode page > - Check that all datanodes are displayed > - Now bring one datanode down > - Wait for 10 minutes for the heartbeat to expire > - Refresh the namenode page and check > > Actual Result: It shows the error message "NameNode is still loading. > Redirecting to the Startup Progress page." -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16917) Add transfer rate quantile metrics for DataNode reads
[ https://issues.apache.org/jira/browse/HDFS-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689786#comment-17689786 ] ASF GitHub Bot commented on HDFS-16917: --- hadoop-yetus commented on PR #5397: URL: https://github.com/apache/hadoop/pull/5397#issuecomment-1433166779 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 47s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +0 :ok: | markdownlint | 0m 0s | | markdownlint was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 15m 35s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 30m 57s | | trunk passed | | +1 :green_heart: | compile | 23m 9s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 20m 35s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 3m 49s | | trunk passed | | +1 :green_heart: | mvnsite | 3m 28s | | trunk passed | | +1 :green_heart: | javadoc | 2m 30s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 2m 38s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 6m 8s | | trunk passed | | +1 :green_heart: | shadedclient | 26m 12s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 29s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 2m 34s | | the patch passed | | +1 :green_heart: | compile | 22m 34s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 22m 34s | | the patch passed | | +1 :green_heart: | compile | 20m 32s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 20m 32s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 3m 41s | [/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5397/8/artifact/out/results-checkstyle-root.txt) | root: The patch generated 2 new + 139 unchanged - 0 fixed = 141 total (was 139) | | +1 :green_heart: | mvnsite | 3m 35s | | the patch passed | | +1 :green_heart: | javadoc | 2m 20s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 2m 42s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 6m 25s | | the patch passed | | +1 :green_heart: | shadedclient | 26m 54s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 18m 21s | | hadoop-common in the patch passed. | | -1 :x: | unit | 210m 41s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5397/8/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. 
| | +1 :green_heart: | asflicense | 1m 15s | | The patch does not generate ASF License warnings. | | | | 457m 38s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.namenode.TestAuditLogger | | | hadoop.hdfs.server.namenode.TestFsck | | | hadoop.hdfs.server.namenode.TestAuditLogs | | | hadoop.hdfs.server.namenode.ha.TestObserverNode | | | hadoop.hdfs.server.namenode.TestFSNamesystemLockReport | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5397/8/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5397 | | Optional Tests | dupname asflicense mvnsite codespell detsecrets markdownlint compile javac javadoc mvninstall unit shadedclient spotbugs checkstyle | | uname | Linux 4a7bede6aee2 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 1
[jira] [Commented] (HDFS-16922) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.
[ https://issues.apache.org/jira/browse/HDFS-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689725#comment-17689725 ] ASF GitHub Bot commented on HDFS-16922: --- hfutatzhanghb commented on PR #5398: URL: https://github.com/apache/hadoop/pull/5398#issuecomment-1433001412 > Hi @zhangshuyan0, thanks for your reply~. Yes, it is still possible to miss blocks in trunk when the replace policy is set to NEVER. I'm developing a UT to reproduce this case. > The logic of IncrementalBlockReportManager#addRDBI method may cause missing > blocks when cluster is busy. > > > Key: HDFS-16922 > URL: https://issues.apache.org/jira/browse/HDFS-16922 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: ZhangHB >Priority: Major > Labels: pull-request-available > > The current logic of the IncrementalBlockReportManager#addRDBI method could lead > to missing blocks when datanodes in the pipeline are I/O busy. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16922) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.
[ https://issues.apache.org/jira/browse/HDFS-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689720#comment-17689720 ] ASF GitHub Bot commented on HDFS-16922: --- zhangshuyan0 commented on PR #5398: URL: https://github.com/apache/hadoop/pull/5398#issuecomment-1432985896 It's great to make sure the DN only reports the replica with the maximum timestamp. Even though [HDFS-16146](https://issues.apache.org/jira/browse/HDFS-16146) has already been merged, is it still possible to miss blocks in trunk when the replace policy is set to NEVER? Would you mind adding a UT for reproducing this case? > The logic of IncrementalBlockReportManager#addRDBI method may cause missing > blocks when cluster is busy. > > > Key: HDFS-16922 > URL: https://issues.apache.org/jira/browse/HDFS-16922 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: ZhangHB >Priority: Major > Labels: pull-request-available > > The current logic of the IncrementalBlockReportManager#addRDBI method could lead > to missing blocks when datanodes in the pipeline are I/O busy. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16922) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.
[ https://issues.apache.org/jira/browse/HDFS-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689715#comment-17689715 ] ASF GitHub Bot commented on HDFS-16922: --- zhangshuyan0 commented on PR #5398: URL: https://github.com/apache/hadoop/pull/5398#issuecomment-1432976667 It's great to make sure the DN only reports the replica with the maximum timestamp. Even though [HDFS-16146](https://issues.apache.org/jira/browse/HDFS-16146) has already been merged, is it still possible to miss blocks in trunk when the replication factor is 2? Would you mind adding a UT for reproducing this case? > The logic of IncrementalBlockReportManager#addRDBI method may cause missing > blocks when cluster is busy. > > > Key: HDFS-16922 > URL: https://issues.apache.org/jira/browse/HDFS-16922 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: ZhangHB >Priority: Major > Labels: pull-request-available > > The current logic of the IncrementalBlockReportManager#addRDBI method could lead > to missing blocks when datanodes in the pipeline are I/O busy. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Commented] (HDFS-16898) Remove write lock for processCommandFromActor of DataNode to reduce impact on heartbeat
[ https://issues.apache.org/jira/browse/HDFS-16898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689656#comment-17689656 ] ASF GitHub Bot commented on HDFS-16898: --- Hexiaoqiao commented on PR #5408: URL: https://github.com/apache/hadoop/pull/5408#issuecomment-1432809951 @hfutatzhanghb Please check whether the failed unit tests relate to this change. Thanks. > Remove write lock for processCommandFromActor of DataNode to reduce impact on > heartbeat > --- > > Key: HDFS-16898 > URL: https://issues.apache.org/jira/browse/HDFS-16898 > Project: Hadoop HDFS > Issue Type: Improvement >Affects Versions: 3.3.4 >Reporter: ZhangHB >Assignee: ZhangHB >Priority: Major > Labels: pull-request-available > Fix For: 3.4.0 > > > Now, in the method processCommandFromActor, we have code like below: > > {code:java} > writeLock(); > try { > if (actor == bpServiceToActive) { > return processCommandFromActive(cmd, actor); > } else { > return processCommandFromStandby(cmd, actor); > } > } finally { > writeUnlock(); > } {code} > If the method processCommandFromActive costs much time, the write lock would not > be released. > > It may block the updateActorStatesFromHeartbeat method in > offerService; furthermore, it can cause the lastContact of the datanode to grow very high, > and the datanode may even be marked dead when lastContact goes beyond 600s. > {code:java} > bpos.updateActorStatesFromHeartbeat( > this, resp.getNameNodeHaState());{code} > Here we can make the write lock fine-grained in the processCommandFromActor method to > address this problem. > -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
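For readers who want to picture the change described in the issue above, here is a minimal sketch of the fine-grained locking idea. It is illustrative only: the surrounding BPOfferService context and the method signature are assumed, and the actual patch in PR #5408 may carve up the lock differently.

{code:java}
// Illustrative sketch, not the PR #5408 change: hold the BPOfferService
// write lock only while reading the shared actor state, and run the
// potentially slow command processing outside the lock so heartbeat handling
// (updateActorStatesFromHeartbeat) is not blocked behind it.
boolean processCommandFromActorSketch(DatanodeCommand cmd,
    BPServiceActor actor) throws IOException {
  final boolean fromActive;
  writeLock();
  try {
    fromActive = (actor == bpServiceToActive);
  } finally {
    writeUnlock();
  }
  // Slow work runs without holding the outer write lock.
  return fromActive
      ? processCommandFromActive(cmd, actor)
      : processCommandFromStandby(cmd, actor);
}
{code}

The trade-off is making sure nothing inside the command-processing paths still relies on the outer lock, which is presumably why the failed unit tests are being checked against the change.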
[jira] [Commented] (HDFS-16922) The logic of IncrementalBlockReportManager#addRDBI method may cause missing blocks when cluster is busy.
[ https://issues.apache.org/jira/browse/HDFS-16922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689654#comment-17689654 ] ASF GitHub Bot commented on HDFS-16922: --- Hexiaoqiao commented on code in PR #5398: URL: https://github.com/apache/hadoop/pull/5398#discussion_r1108243429 ## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/IncrementalBlockReportManager.java: ## @@ -252,7 +256,9 @@ synchronized void addRDBI(ReceivedDeletedBlockInfo rdbi, // Make sure another entry for the same block is first removed. // There may only be one such entry. for (PerStorageIBR perStorage : pendingIBRs.values()) { - if (perStorage.remove(rdbi.getBlock()) != null) { + ReceivedDeletedBlockInfo oldRdbi = perStorage.get(rdbi.getBlock()); + if (oldRdbi != null && oldRdbi.getBlock().getGenerationStamp() < rdbi.getBlock().getGenerationStamp() Review Comment: This fix still leaves one unexpected case: consider when the new entry's generation stamp is less than the old one; it will still be put at line 265 and overwrite it, right? How about the following patch? Also, it would be better to add a new unit test to verify this bugfix. Thanks. ``` - > The logic of IncrementalBlockReportManager#addRDBI method may cause missing > blocks when cluster is busy. > > > Key: HDFS-16922 > URL: https://issues.apache.org/jira/browse/HDFS-16922 > Project: Hadoop HDFS > Issue Type: Bug > Components: datanode >Reporter: ZhangHB >Priority: Major > Labels: pull-request-available > > The current logic of the IncrementalBlockReportManager#addRDBI method could lead > to missing blocks when datanodes in the pipeline are I/O busy. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
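To make the corner case in the review comment above concrete, here is a hedged sketch of the intended handling. It only illustrates the idea that an incremental block report entry with an older generation stamp should never replace a pending entry with a newer one; the actual patch suggested in the review (elided above) and the final code in PR #5398 may differ.

{code:java}
// Illustrative sketch of the logic inside addRDBI(), not the PR #5398 patch.
// Goal: never let an incoming report with an older generation stamp
// overwrite a pending entry that carries a newer generation stamp.
for (PerStorageIBR perStorage : pendingIBRs.values()) {
  ReceivedDeletedBlockInfo oldRdbi = perStorage.get(rdbi.getBlock());
  if (oldRdbi == null) {
    continue;
  }
  if (oldRdbi.getBlock().getGenerationStamp() > rdbi.getBlock().getGenerationStamp()) {
    // The pending entry is newer: keep it and drop the incoming report,
    // instead of falling through to the put() that would overwrite it
    // (the case flagged in the review above).
    return;
  }
  // The incoming entry is at least as new: remove the stale pending entry so
  // the subsequent put() replaces it cleanly.
  perStorage.remove(rdbi.getBlock());
}
{code}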
[jira] [Commented] (HDFS-16898) Remove write lock for processCommandFromActor of DataNode to reduce impact on heartbeat
[ https://issues.apache.org/jira/browse/HDFS-16898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689648#comment-17689648 ] ASF GitHub Bot commented on HDFS-16898: --- hadoop-yetus commented on PR #5408: URL: https://github.com/apache/hadoop/pull/5408#issuecomment-1432800673 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 1m 2s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ branch-3.3 Compile Tests _ | | +1 :green_heart: | mvninstall | 45m 2s | | branch-3.3 passed | | +1 :green_heart: | compile | 1m 25s | | branch-3.3 passed | | +1 :green_heart: | checkstyle | 1m 1s | | branch-3.3 passed | | +1 :green_heart: | mvnsite | 1m 36s | | branch-3.3 passed | | +1 :green_heart: | javadoc | 1m 42s | | branch-3.3 passed | | +1 :green_heart: | spotbugs | 3m 51s | | branch-3.3 passed | | +1 :green_heart: | shadedclient | 32m 48s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 37s | | the patch passed | | +1 :green_heart: | compile | 1m 20s | | the patch passed | | +1 :green_heart: | javac | 1m 20s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 45s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 26s | | the patch passed | | +1 :green_heart: | javadoc | 1m 24s | | the patch passed | | +1 :green_heart: | spotbugs | 3m 40s | | the patch passed | | +1 :green_heart: | shadedclient | 32m 59s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 233m 6s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5408/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 43s | | The patch does not generate ASF License warnings. 
| | | | 362m 56s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.namenode.TestAuditLogger | | | hadoop.hdfs.server.namenode.TestFsck | | | hadoop.hdfs.tools.TestDFSAdmin | | | hadoop.hdfs.server.mover.TestMover | | | hadoop.hdfs.server.namenode.TestAuditLogs | | | hadoop.hdfs.TestRollingUpgrade | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5408/2/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5408 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 71030960ce81 4.15.0-197-generic #208-Ubuntu SMP Tue Nov 1 17:23:37 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | branch-3.3 / db600f1a3ef142390860ef96e2ae1a8f017031b2 | | Default Java | Private Build-1.8.0_352-8u352-ga-1~18.04-b08 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5408/2/testReport/ | | Max. process+thread count | 2025 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5408/2/console | | versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org | This message was automatically generated. > Remove write lock for processCommandFromActor of DataNode to reduce impact on > heartbeat > --- > > Key: HDFS-16898 > URL: https://issues.apache.org/jira/browse/HDFS-16898 > Project: Hadoop HDFS > Issue Type: Impro
[jira] [Commented] (HDFS-16918) Optionally shut down datanode if it does not stay connected to active namenode
[ https://issues.apache.org/jira/browse/HDFS-16918?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689640#comment-17689640 ] ASF GitHub Bot commented on HDFS-16918: --- hadoop-yetus commented on PR #5396: URL: https://github.com/apache/hadoop/pull/5396#issuecomment-1432794232 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 1m 14s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +0 :ok: | xmllint | 0m 0s | | xmllint was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 2 new or modified test files. | _ trunk Compile Tests _ | | +1 :green_heart: | mvninstall | 47m 21s | | trunk passed | | +1 :green_heart: | compile | 1m 35s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 1m 21s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 6s | | trunk passed | | +1 :green_heart: | mvnsite | 1m 33s | | trunk passed | | +1 :green_heart: | javadoc | 1m 7s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 1m 31s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 36s | | trunk passed | | +1 :green_heart: | shadedclient | 29m 15s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 30s | | the patch passed | | +1 :green_heart: | compile | 1m 22s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 1m 22s | | the patch passed | | +1 :green_heart: | compile | 1m 16s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 1m 16s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 55s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 21s | | the patch passed | | -1 :x: | javadoc | 0m 54s | [/patch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5396/3/artifact/out/patch-javadoc-hadoop-hdfs-project_hadoop-hdfs-jdkUbuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04.txt) | hadoop-hdfs in the patch failed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04. | | +1 :green_heart: | javadoc | 1m 28s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 3m 30s | | the patch passed | | +1 :green_heart: | shadedclient | 24m 14s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 241m 33s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5396/3/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 45s | | The patch does not generate ASF License warnings. 
| | | | 366m 1s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.namenode.TestAuditLogger | | | hadoop.hdfs.server.namenode.TestFSNamesystemLockReport | | | hadoop.hdfs.server.datanode.TestDirectoryScanner | | | hadoop.hdfs.server.namenode.TestAuditLogs | | | hadoop.hdfs.server.namenode.TestFsck | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5396/3/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5396 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint | | uname | Linux 3a3903bc636b 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | trunk / 91e7a726426d754ab4bfc548d1e9d51
[jira] [Commented] (HDFS-16898) Remove write lock for processCommandFromActor of DataNode to reduce impact on heartbeat
[ https://issues.apache.org/jira/browse/HDFS-16898?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689634#comment-17689634 ] ASF GitHub Bot commented on HDFS-16898: --- hadoop-yetus commented on PR #5408: URL: https://github.com/apache/hadoop/pull/5408#issuecomment-1432771976 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 51s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ branch-3.3 Compile Tests _ | | +1 :green_heart: | mvninstall | 43m 11s | | branch-3.3 passed | | +1 :green_heart: | compile | 1m 13s | | branch-3.3 passed | | +1 :green_heart: | checkstyle | 0m 53s | | branch-3.3 passed | | +1 :green_heart: | mvnsite | 1m 22s | | branch-3.3 passed | | +1 :green_heart: | javadoc | 1m 33s | | branch-3.3 passed | | +1 :green_heart: | spotbugs | 3m 21s | | branch-3.3 passed | | +1 :green_heart: | shadedclient | 30m 41s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +1 :green_heart: | mvninstall | 1m 21s | | the patch passed | | +1 :green_heart: | compile | 1m 9s | | the patch passed | | +1 :green_heart: | javac | 1m 9s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | +1 :green_heart: | checkstyle | 0m 41s | | the patch passed | | +1 :green_heart: | mvnsite | 1m 18s | | the patch passed | | +1 :green_heart: | javadoc | 1m 20s | | the patch passed | | +1 :green_heart: | spotbugs | 3m 19s | | the patch passed | | +1 :green_heart: | shadedclient | 30m 13s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | -1 :x: | unit | 221m 19s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5408/1/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 41s | | The patch does not generate ASF License warnings. 
| | | | 342m 3s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.namenode.TestAuditLogs | | | hadoop.hdfs.TestFileCreation | | | hadoop.hdfs.server.balancer.TestBalancer | | | hadoop.hdfs.server.balancer.TestBalancerWithHANameNodes | | | hadoop.hdfs.TestRollingUpgrade | | | hadoop.hdfs.server.namenode.TestFsck | | | hadoop.hdfs.server.namenode.TestAuditLogger | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5408/1/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5408 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 2e9e1e2e533d 4.15.0-200-generic #211-Ubuntu SMP Thu Nov 24 18:16:04 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux | | Build tool | maven | | Personality | dev-support/bin/hadoop.sh | | git revision | branch-3.3 / db600f1a3ef142390860ef96e2ae1a8f017031b2 | | Default Java | Private Build-1.8.0_352-8u352-ga-1~18.04-b08 | | Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5408/1/testReport/ | | Max. process+thread count | 2360 (vs. ulimit of 5500) | | modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs | | Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5408/1/console | | versions | git=2.17.1 maven=3.6.0 spotbugs=4.2.2 | | Powered by | Apache Yetus 0.14.0 https://yetus.apache.org | This message was automatically generated. > Remove write lock for processCommandFromActor of DataNode to reduce impact on > heartbeat > --- > > Key: HDFS-16898 > URL: https://issues.apache.org/jira/browse/HD
[jira] [Commented] (HDFS-16917) Add transfer rate quantile metrics for DataNode reads
[ https://issues.apache.org/jira/browse/HDFS-16917?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689621#comment-17689621 ] ASF GitHub Bot commented on HDFS-16917: --- hadoop-yetus commented on PR #5397: URL: https://github.com/apache/hadoop/pull/5397#issuecomment-1432717633 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 50s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +0 :ok: | markdownlint | 0m 0s | | markdownlint was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 15m 29s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 33m 57s | | trunk passed | | +1 :green_heart: | compile | 25m 21s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 21m 50s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 4m 5s | | trunk passed | | +1 :green_heart: | mvnsite | 3m 23s | | trunk passed | | +1 :green_heart: | javadoc | 2m 15s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 2m 23s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 6m 15s | | trunk passed | | +1 :green_heart: | shadedclient | 29m 13s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 23s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 2m 37s | | the patch passed | | +1 :green_heart: | compile | 24m 32s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 24m 32s | | the patch passed | | +1 :green_heart: | compile | 21m 50s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 21m 50s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 3m 55s | [/results-checkstyle-root.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5397/7/artifact/out/results-checkstyle-root.txt) | root: The patch generated 2 new + 130 unchanged - 0 fixed = 132 total (was 130) | | +1 :green_heart: | mvnsite | 3m 17s | | the patch passed | | +1 :green_heart: | javadoc | 2m 9s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 2m 29s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 6m 25s | | the patch passed | | +1 :green_heart: | shadedclient | 29m 43s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 18m 22s | | hadoop-common in the patch passed. 
| | -1 :x: | unit | 228m 44s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5397/7/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. | | +1 :green_heart: | asflicense | 0m 58s | | The patch does not generate ASF License warnings. | | | | 488m 54s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.namenode.TestAuditLogs | | | hadoop.hdfs.server.namenode.TestFSNamesystemLockReport | | | hadoop.hdfs.server.namenode.TestFsck | | | hadoop.hdfs.server.namenode.ha.TestObserverNode | | | hadoop.hdfs.TestRollingUpgrade | | | hadoop.hdfs.server.namenode.TestAuditLogger | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5397/7/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5397 | | Optional Tests | dupname asflicense mvnsite codespell detsecrets
[jira] [Commented] (HDFS-16896) HDFS Client hedged read has increased failure rate than without hedged read
[ https://issues.apache.org/jira/browse/HDFS-16896?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689611#comment-17689611 ] ASF GitHub Bot commented on HDFS-16896: --- hadoop-yetus commented on PR #5322: URL: https://github.com/apache/hadoop/pull/5322#issuecomment-1432703908 :broken_heart: **-1 overall** | Vote | Subsystem | Runtime | Logfile | Comment | |::|--:|:|::|:---:| | +0 :ok: | reexec | 0m 52s | | Docker mode activated. | _ Prechecks _ | | +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. | | +0 :ok: | codespell | 0m 0s | | codespell was not available. | | +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. | | +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. | | +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. | _ trunk Compile Tests _ | | +0 :ok: | mvndep | 15m 19s | | Maven dependency ordering for branch | | +1 :green_heart: | mvninstall | 38m 29s | | trunk passed | | +1 :green_heart: | compile | 6m 43s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | compile | 6m 12s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | checkstyle | 1m 18s | | trunk passed | | +1 :green_heart: | mvnsite | 2m 26s | | trunk passed | | +1 :green_heart: | javadoc | 1m 47s | | trunk passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 2m 4s | | trunk passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 6m 11s | | trunk passed | | +1 :green_heart: | shadedclient | 28m 55s | | branch has no errors when building and testing our client artifacts. | _ Patch Compile Tests _ | | +0 :ok: | mvndep | 0m 22s | | Maven dependency ordering for patch | | +1 :green_heart: | mvninstall | 2m 22s | | the patch passed | | +1 :green_heart: | compile | 6m 32s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javac | 6m 32s | | the patch passed | | +1 :green_heart: | compile | 6m 26s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | javac | 6m 26s | | the patch passed | | +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. | | -0 :warning: | checkstyle | 1m 12s | [/results-checkstyle-hadoop-hdfs-project.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5322/5/artifact/out/results-checkstyle-hadoop-hdfs-project.txt) | hadoop-hdfs-project: The patch generated 1 new + 31 unchanged - 0 fixed = 32 total (was 31) | | +1 :green_heart: | mvnsite | 2m 17s | | the patch passed | | +1 :green_heart: | javadoc | 1m 30s | | the patch passed with JDK Ubuntu-11.0.17+8-post-Ubuntu-1ubuntu220.04 | | +1 :green_heart: | javadoc | 2m 0s | | the patch passed with JDK Private Build-1.8.0_352-8u352-ga-1~20.04-b08 | | +1 :green_heart: | spotbugs | 6m 52s | | the patch passed | | +1 :green_heart: | shadedclient | 30m 40s | | patch has no errors when building and testing our client artifacts. | _ Other Tests _ | | +1 :green_heart: | unit | 2m 20s | | hadoop-hdfs-client in the patch passed. | | -1 :x: | unit | 241m 59s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5322/5/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. 
| | +1 :green_heart: | asflicense | 0m 59s | | The patch does not generate ASF License warnings. | | | | 413m 31s | | | | Reason | Tests | |---:|:--| | Failed junit tests | hadoop.hdfs.server.namenode.TestAuditLogger | | | hadoop.hdfs.server.namenode.TestFSNamesystemLockReport | | | hadoop.hdfs.server.namenode.TestFsck | | | hadoop.hdfs.server.datanode.TestDirectoryScanner | | | hadoop.hdfs.server.namenode.TestAuditLogs | | | hadoop.hdfs.TestPread | | Subsystem | Report/Notes | |--:|:-| | Docker | ClientAPI=1.42 ServerAPI=1.42 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5322/5/artifact/out/Dockerfile | | GITHUB PR | https://github.com/apache/hadoop/pull/5322 | | Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets | | uname | Linux 62ba32aa466f 4.15.0-197-generic #208-Ubuntu SMP Tue Nov 1 17:23:37 UTC 2022