[jira] [Created] (HDFS-17057) Add DataNode maintenance states to Federation UI
Haiyang Hu created HDFS-17057:

             Summary: Add DataNode maintenance states to Federation UI
                 Key: HDFS-17057
                 URL: https://issues.apache.org/jira/browse/HDFS-17057
             Project: Hadoop HDFS
          Issue Type: Improvement
            Reporter: Haiyang Hu

Add DataNode maintenance states to Federation UI

--
This message was sent by Atlassian Jira (v8.20.10#820010)

To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
[jira] [Assigned] (HDFS-17057) Add DataNode maintenance states to Federation UI
[ https://issues.apache.org/jira/browse/HDFS-17057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Haiyang Hu reassigned HDFS-17057:

    Assignee: Haiyang Hu

> Add DataNode maintenance states to Federation UI
>
>                 Key: HDFS-17057
>                 URL: https://issues.apache.org/jira/browse/HDFS-17057
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: Haiyang Hu
>            Assignee: Haiyang Hu
>            Priority: Major
>
> Add DataNode maintenance states to Federation UI
[jira] [Commented] (HDFS-17055) Export HAState as a metric from Namenode for monitoring
[ https://issues.apache.org/jira/browse/HDFS-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736329#comment-17736329 ]

ASF GitHub Bot commented on HDFS-17055:

hadoop-yetus commented on PR #5764:
URL: https://github.com/apache/hadoop/pull/5764#issuecomment-1603488495

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 1m 17s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 5 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 43m 17s | | trunk passed |
| +1 :green_heart: | compile | 1m 36s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | compile | 1m 26s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | checkstyle | 1m 29s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 50s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 27s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 1m 51s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 4m 4s | | trunk passed |
| +1 :green_heart: | shadedclient | 31m 5s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 28s | | the patch passed |
| +1 :green_heart: | compile | 1m 27s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javac | 1m 27s | | the patch passed |
| +1 :green_heart: | compile | 1m 21s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | javac | 1m 21s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 9s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 29s | | the patch passed |
| +1 :green_heart: | javadoc | 1m 9s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 1m 39s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 3m 50s | | the patch passed |
| +1 :green_heart: | shadedclient | 30m 34s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| -1 :x: | unit | 248m 2s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5764/3/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 58s | | The patch does not generate ASF License warnings. |
| | | 381m 56s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.TestRollingUpgrade |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5764/3/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5764 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 383a9054d10f 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / ba9e87f2b294deb1cd67100e1c39e317ffb76295 |
| Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5764/3/testReport/ |
| Max. process+thread count | 2108 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output |
[jira] [Commented] (HDFS-17055) Export HAState as a metric from Namenode for monitoring
[ https://issues.apache.org/jira/browse/HDFS-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736314#comment-17736314 ]

ASF GitHub Bot commented on HDFS-17055:

hadoop-yetus commented on PR #5764:
URL: https://github.com/apache/hadoop/pull/5764#issuecomment-1603458394

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 0m 39s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 1s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 5 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 36m 59s | | trunk passed |
| +1 :green_heart: | compile | 1m 25s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | compile | 1m 20s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | checkstyle | 1m 18s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 31s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 15s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 1m 38s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 3m 26s | | trunk passed |
| +1 :green_heart: | shadedclient | 23m 53s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 15s | | the patch passed |
| +1 :green_heart: | compile | 1m 16s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javac | 1m 16s | | the patch passed |
| +1 :green_heart: | compile | 1m 11s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | javac | 1m 11s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 1m 5s | | the patch passed |
| +1 :green_heart: | mvnsite | 1m 14s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 56s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 1m 29s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 3m 12s | | the patch passed |
| +1 :green_heart: | shadedclient | 23m 42s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 219m 11s | | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 0m 59s | | The patch does not generate ASF License warnings. |
| | | 329m 14s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5764/4/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5764 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux f1712d7a704a 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / ba9e87f2b294deb1cd67100e1c39e317ffb76295 |
| Default Java | Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5764/4/testReport/ |
| Max. process+thread count | 3144 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs U: hadoop-hdfs-project/hadoop-hdfs |
| Console output | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5764/4/console |
| versions | git=2.25.1 maven=3.6.3 spotbugs=4.2.2 |
| Powered by | Apache Yetus 0.14.0 https://yetus.apache.org |

This message was automatically generated.

> Export HAState
[jira] [Commented] (HDFS-17055) Export HAState as a metric from Namenode for monitoring
[ https://issues.apache.org/jira/browse/HDFS-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736312#comment-17736312 ]

ASF GitHub Bot commented on HDFS-17055:

xinglin commented on code in PR #5764:
URL: https://github.com/apache/hadoop/pull/5764#discussion_r1239137137

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java:
## @@ -2051,6 +2056,26 @@ synchronized HAServiceState getServiceState() {
     return state.getServiceState();
   }

+  /**
+   * Emit Namenode HA service state as an integer so that one can monitor NN HA
+   * state based on this metric.
+   *
+   * @return  0 when not fully started
+   *          1 for active or standalone (non-HA) NN
+   *          2 for standby
+   *          3 for observer
+   *

Review Comment:
Searching the codebase, it seems a state is set to STOPPING only in YARN ResourceManager HA. We are not using that state in HDFS.

> Export HAState as a metric from Namenode for monitoring
>
>                 Key: HDFS-17055
>                 URL: https://issues.apache.org/jira/browse/HDFS-17055
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>          Components: hdfs
>    Affects Versions: 3.4.0, 3.3.9
>            Reporter: Xing Lin
>            Assignee: Xing Lin
>            Priority: Minor
>              Labels: pull-request-available
>
> We'd like to measure the uptime of Namenodes: the percentage of time when we have
> the active/standby/observer node available (up and running). We could monitor
> the namenode from an external service, such as ZKFC. But that would require
> the external service itself to be available 100% of the time. And when this
> third-party external monitoring service is down, we won't have info on whether our
> Namenodes are still up.
>
> We propose to take a different approach: we will emit the Namenode state directly
> from the namenode itself. Whenever we miss a data point for this metric, we
> consider the corresponding namenode to be down/not available. In other words,
> we assume the metric collection/monitoring infrastructure to be 100% reliable.
>
> One implementation detail: in hadoop, we have the _NameNodeMetrics_ class,
> which is currently used to emit all metrics for _NameNode.java_. However,
> we don't think that is a good place to emit the NameNode HAState. HAState is
> stored in NameNode.java and we should emit it directly from NameNode.java.
> Otherwise, we basically duplicate this info in two classes and we would have
> to keep them in sync. Besides, the _NameNodeMetrics_ class does not have a
> reference to the _NameNode_ object which it belongs to. A _NameNodeMetrics_
> is created by a _static_ function _initMetrics()_ in _NameNode.java_.
>
> We shouldn't emit HA state from FSNameSystem.java either, as it is
> initialized from NameNode.java and all state transitions are implemented in
> NameNode.java.
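The availability measure proposed in the issue description (a missing data point counts as downtime) could be computed on the monitoring side roughly as follows. This is a minimal sketch, not code from the PR: the class name `NameNodeUptimeSketch`, the method `uptimePercent`, and the fixed collection interval are all assumptions made for illustration.

```java
import java.util.BitSet;

/** Hypothetical helper; not part of Hadoop or of PR #5764. */
final class NameNodeUptimeSketch {

    /**
     * Splits the window [windowStartSec, windowEndSec) into whole collection
     * slots of intervalSec seconds and returns the percentage of slots in
     * which at least one data point for the HA state metric was received.
     * A slot without a data point is treated as "namenode down".
     */
    static double uptimePercent(long[] sampleTimesSec, long windowStartSec,
                                long windowEndSec, long intervalSec) {
        if (intervalSec <= 0 || windowEndSec <= windowStartSec) {
            throw new IllegalArgumentException("empty window or non-positive interval");
        }
        int slots = (int) ((windowEndSec - windowStartSec) / intervalSec);
        if (slots == 0) {
            throw new IllegalArgumentException("window shorter than one interval");
        }
        BitSet covered = new BitSet(slots);
        for (long t : sampleTimesSec) {
            if (t < windowStartSec) {
                continue; // sample is before the window
            }
            long slot = (t - windowStartSec) / intervalSec;
            if (slot < slots) {
                covered.set((int) slot); // mark this slot as "metric seen"
            }
        }
        return 100.0 * covered.cardinality() / slots;
    }

    public static void main(String[] args) {
        // Five 60-second slots; the data point for the slot starting at 180 is missing.
        long[] samples = {0, 60, 120, 240};
        System.out.println(uptimePercent(samples, 0, 300, 60)); // prints 80.0
    }
}
```

As the description notes, this only measures namenode uptime if the metric collection pipeline itself is assumed to be fully reliable; a gap caused by the collector is indistinguishable from a namenode outage.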
[jira] [Commented] (HDFS-17055) Export HAState as a metric from Namenode for monitoring
[ https://issues.apache.org/jira/browse/HDFS-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736309#comment-17736309 ]

ASF GitHub Bot commented on HDFS-17055:

melissayou commented on code in PR #5764:
URL: https://github.com/apache/hadoop/pull/5764#discussion_r1239128912

## hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/NameNode.java:
## @@ -2051,6 +2056,26 @@ synchronized HAServiceState getServiceState() {
     return state.getServiceState();
   }

+  /**
+   * Emit Namenode HA service state as an integer so that one can monitor NN HA
+   * state based on this metric.
+   *
+   * @return  0 when not fully started
+   *          1 for active or standalone (non-HA) NN
+   *          2 for standby
+   *          3 for observer
+   *

Review Comment:
I saw HAState has a stopping enum. We won't encounter that state?
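The return-value contract in the Javadoc under review can be illustrated with a small, self-contained sketch. The enum `HaState` below is a hypothetical stand-in for Hadoop's `HAServiceState`, and `toMetricValue` is not the method from the PR. Reporting STOPPING as 0 is an assumption made here only so the mapping is total; the Javadoc does not define a value for it.

```java
/** Hypothetical sketch of the HA-state-to-integer mapping; not the PR's code. */
final class HaStateMetricSketch {

    /** Stand-in for Hadoop's HAServiceState (names assumed for illustration). */
    enum HaState { INITIALIZING, ACTIVE, STANDBY, OBSERVER, STOPPING }

    /**
     * Returns 0 when the node is not fully started, 1 for active (or
     * standalone, non-HA), 2 for standby, and 3 for observer.
     */
    static int toMetricValue(boolean started, HaState state) {
        if (!started || state == null) {
            return 0; // not fully started
        }
        switch (state) {
            case ACTIVE:
                return 1;
            case STANDBY:
                return 2;
            case OBSERVER:
                return 3;
            default:
                // INITIALIZING and STOPPING: assumed to be reported as 0 in this sketch.
                return 0;
        }
    }

    public static void main(String[] args) {
        System.out.println(toMetricValue(true, HaState.STANDBY)); // prints 2
    }
}
```

An explicit `switch` keeps the emitted values stable even if the enum's constants are later reordered, which matters once dashboards and alerts depend on the numbers.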
[jira] [Commented] (HDFS-17055) Export HAState as a metric from Namenode for monitoring
[ https://issues.apache.org/jira/browse/HDFS-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736289#comment-17736289 ]

ASF GitHub Bot commented on HDFS-17055:

hadoop-yetus commented on PR #5764:
URL: https://github.com/apache/hadoop/pull/5764#issuecomment-1603360305

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:--------:|:-------:|
| +0 :ok: | reexec | 1m 8s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 39m 17s | | trunk passed |
| +1 :green_heart: | compile | 1m 27s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | compile | 1m 14s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | checkstyle | 1m 16s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 25s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 14s | | trunk passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 1m 37s | | trunk passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 3m 37s | | trunk passed |
| +1 :green_heart: | shadedclient | 26m 58s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 1m 13s | | the patch passed |
| +1 :green_heart: | compile | 1m 20s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javac | 1m 20s | | the patch passed |
| +1 :green_heart: | compile | 1m 8s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | javac | 1m 8s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 1m 1s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5764/2/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs-project/hadoop-hdfs: The patch generated 1 new + 72 unchanged - 0 fixed = 73 total (was 72) |
| +1 :green_heart: | mvnsite | 1m 16s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 56s | | the patch passed with JDK Ubuntu-11.0.19+7-post-Ubuntu-0ubuntu120.04.1 |
| +1 :green_heart: | javadoc | 1m 27s | | the patch passed with JDK Private Build-1.8.0_362-8u372-ga~us1-0ubuntu1~20.04-b09 |
| +1 :green_heart: | spotbugs | 3m 21s | | the patch passed |
| +1 :green_heart: | shadedclient | 26m 32s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| -1 :x: | unit | 260m 10s | [/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5764/2/artifact/out/patch-unit-hadoop-hdfs-project_hadoop-hdfs.txt) | hadoop-hdfs in the patch passed. |
| +1 :green_heart: | asflicense | 1m 3s | | The patch does not generate ASF License warnings. |
| | | 378m 1s | | |

| Reason | Tests |
|-------:|:------|
| Failed junit tests | hadoop.hdfs.TestDFSUpgrade |
| | hadoop.hdfs.server.namenode.TestBackupNode |
| | hadoop.hdfs.TestDFSRollback |
| | hadoop.hdfs.server.namenode.TestNameNodeMetricsLogger |
| | hadoop.hdfs.TestDFSFinalize |
| | hadoop.hdfs.TestHDFSServerPorts |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.43 ServerAPI=1.43 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-5764/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/5764 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets |
| uname | Linux 6ec9fcc6fcf3 4.15.0-212-generic #223-Ubuntu SMP Tue May 23 13:09:22 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 3ede4ef8d2206de23aa992ead2d19baeb4ebf88a |
| Default Java | Private
[jira] [Commented] (HDFS-17055) Export HAState as a metric from Namenode for monitoring
[ https://issues.apache.org/jira/browse/HDFS-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736248#comment-17736248 ]

ASF GitHub Bot commented on HDFS-17055:

xinglin commented on PR #5764:
URL: https://github.com/apache/hadoop/pull/5764#issuecomment-1603146650

Hi @goiri,

Summary of changes to fix the unit test failures:
- Made the two subclasses of NameNode metric sources as well by adding a `@Metrics` annotation, since we have made NameNode a metric source.
- Fixed a `getCurrentBlockPoolID()` bug in a couple of unit tests. We set the `NameNode.started` flag to `false` in the `NameNode.stop()` method. However, if we pass a `cluster` object to `getCurrentBlockPoolID()`, it checks that the NameNode is started. Fixed it by passing `null` to `getCurrentBlockPoolID()`, which is what the existing code does as well.

TestRollingUpgrade passed all tests on my laptop. Let's see how the build goes.
[jira] [Commented] (HDFS-17055) Export HAState as a metric from Namenode for monitoring
[ https://issues.apache.org/jira/browse/HDFS-17055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736241#comment-17736241 ]

ASF GitHub Bot commented on HDFS-17055:

xinglin commented on PR #5764:
URL: https://github.com/apache/hadoop/pull/5764#issuecomment-1603105542

Hi @goiri, thanks for taking a look at this PR and approving it. Please don't merge it yet. I am still working on fixing some unit test failures.
[jira] [Commented] (HDFS-17056) EC: Fix verifyClusterSetup output in case of an invalid param
[ https://issues.apache.org/jira/browse/HDFS-17056?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17736174#comment-17736174 ]

Ayush Saxena commented on HDFS-17056:

Found while trying the 3.3.6 RC. It is minor and should be present in trunk as well.

A typical fix would be to add a simple if check, like the other commands have, in the verifyClusterSetup command. Something like this, and things should work:
{code:java}
        throw e;
      }
    } else {
      if (args.size() > 0) {
        System.err.println(getName() + ": Too many arguments");
        return 1;
      }
      result = dfs.getECTopologyResultForPolicies();
{code}

> EC: Fix verifyClusterSetup output in case of an invalid param
>
>                 Key: HDFS-17056
>                 URL: https://issues.apache.org/jira/browse/HDFS-17056
>             Project: Hadoop HDFS
>          Issue Type: Bug
>          Components: ec
>            Reporter: Ayush Saxena
>            Priority: Major
>
> {code:java}
> bin/hdfs ec -verifyClusterSetup XOR-2-1-1024k
> 9 DataNodes are required for the erasure coding policies: RS-6-3-1024k,
> XOR-2-1-1024k. The number of DataNodes is only 3.
> {code}
> verifyClusterSetup requires -policy followed by the policy names; otherwise it
> defaults to all enabled policies.
> If there are additional invalid options, it silently ignores them, unlike
> other EC commands, which report a "Too many arguments" error.
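The argument check suggested in the comment above can be shown in isolation. The following is a self-contained sketch, not the actual Hadoop command class: `VerifyClusterSetupArgsSketch` and `checkArgs` are hypothetical names, the message for a missing policy name is invented for illustration, and only the "Too many arguments" branch mirrors the snippet in the comment.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-alone sketch of the proposed argument validation. */
final class VerifyClusterSetupArgsSketch {

    /**
     * Returns 0 if the arguments are acceptable and 1 otherwise, printing an
     * error to stderr in the same style as the snippet in the comment.
     */
    static int checkArgs(String commandName, List<String> args) {
        if (!args.isEmpty() && "-policy".equals(args.get(0))) {
            if (args.size() < 2) {
                // Illustrative message only; not taken from Hadoop.
                System.err.println(commandName + ": -policy requires at least one policy name");
                return 1;
            }
            return 0; // remaining arguments are treated as policy names
        }
        if (!args.isEmpty()) {
            // Stray arguments without -policy are rejected rather than ignored.
            System.err.println(commandName + ": Too many arguments");
            return 1;
        }
        return 0; // no arguments: default to all enabled policies
    }

    public static void main(String[] args) {
        // The invocation from the issue: a policy name without -policy.
        int rc = checkArgs("-verifyClusterSetup", Arrays.asList("XOR-2-1-1024k"));
        System.out.println(rc); // prints 1
    }
}
```

With a check of this shape, the invocation from the issue description would fail fast instead of silently falling back to all enabled policies.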
[jira] [Created] (HDFS-17056) EC: Fix verifyClusterSetup output in case of an invalid param
Ayush Saxena created HDFS-17056:

             Summary: EC: Fix verifyClusterSetup output in case of an invalid param
                 Key: HDFS-17056
                 URL: https://issues.apache.org/jira/browse/HDFS-17056
             Project: Hadoop HDFS
          Issue Type: Bug
          Components: ec
            Reporter: Ayush Saxena

{code:java}
bin/hdfs ec -verifyClusterSetup XOR-2-1-1024k
9 DataNodes are required for the erasure coding policies: RS-6-3-1024k, XOR-2-1-1024k. The number of DataNodes is only 3.
{code}
verifyClusterSetup requires -policy followed by the policy names; otherwise it defaults to all enabled policies.

If there are additional invalid options, it silently ignores them, unlike other EC commands, which report a "Too many arguments" error.