[jira] [Work logged] (HDFS-16678) RBF supports disable getNodeUsage() in RBFMetrics
[ https://issues.apache.org/jira/browse/HDFS-16678?focusedWorklogId=796559&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-796559 ]

ASF GitHub Bot logged work on HDFS-16678:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 30/Jul/22 02:30
    Start Date: 30/Jul/22 02:30
    Worklog Time Spent: 10m

Work Description: ZanderXu commented on code in PR #4606:
URL: https://github.com/apache/hadoop/pull/4606#discussion_r933715281

## hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/metrics/RBFMetrics.java:

@@ -537,35 +547,34 @@ public int getNumEnteringMaintenanceDataNodes() {
   @Override // NameNodeMXBean
   public String getNodeUsage() {
-    float median = 0;
-    float max = 0;
-    float min = 0;
-    float dev = 0;
+    double median = 0;
+    double max = 0;
+    double min = 0;
+    double dev = 0;

     final Map<String, Map<String, Object>> info = new HashMap<>();
     try {
-      RouterRpcServer rpcServer = this.router.getRpcServer();
-      DatanodeInfo[] live = rpcServer.getDatanodeReport(
-          DatanodeReportType.LIVE, false, timeOut);
+      DatanodeInfo[] live = null;
+      if (this.enableGetDNUsage) {
+        RouterRpcServer rpcServer = this.router.getRpcServer();
+        live = rpcServer.getDatanodeReport(DatanodeReportType.LIVE, false, timeOut);
+      } else {
+        LOG.debug("Getting node usage is disabled.");
+      }

-      if (live.length > 0) {
-        float totalDfsUsed = 0;
-        float[] usages = new float[live.length];
+      if (live != null && live.length > 0) {
+        double[] usages = new double[live.length];
         int i = 0;
         for (DatanodeInfo dn : live) {
           usages[i++] = dn.getDfsUsedPercent();
-          totalDfsUsed += dn.getDfsUsedPercent();
         }
-        totalDfsUsed /= live.length;
         Arrays.sort(usages);
         median = usages[usages.length / 2];
         max = usages[usages.length - 1];
         min = usages[0];

-        for (i = 0; i < usages.length; i++) {
-          dev += (usages[i] - totalDfsUsed) * (usages[i] - totalDfsUsed);
-        }
-        dev = (float) Math.sqrt(dev / usages.length);
+        StandardDeviation deviation = new StandardDeviation();
+        dev = deviation.evaluate(usages);
       }
     } catch (IOException e) {
        LOG.error("Cannot get the live nodes: {}", e.getMessage());

Review Comment: Thanks @slfan1989 @goiri for your review. I think `e.getMessage()` is enough. @slfan1989 Do you have any cases that need the full stack?

Issue Time Tracking
-------------------
    Worklog Id: (was: 796559)
    Time Spent: 2h (was: 1h 50m)

> RBF supports disable getNodeUsage() in RBFMetrics
> -------------------------------------------------
>
>                 Key: HDFS-16678
>                 URL: https://issues.apache.org/jira/browse/HDFS-16678
>             Project: Hadoop HDFS
>          Issue Type: Improvement
>            Reporter: ZanderXu
>            Assignee: ZanderXu
>            Priority: Major
>              Labels: pull-request-available
>          Time Spent: 2h
>  Remaining Estimate: 0h
>
> In our prod environment, we collect RBF metrics every 15s through jmx_exporter, and the collection task often failed.
> After tracing, we found that the collection task is blocked at getNodeUsage() in RBFMetrics, because it collects every datanode's usage from the downstream nameservices. This is a very expensive and almost useless operation: in most scenarios each nameservice contains almost the same DNs, so the data usage can be obtained from any one nameservice instead of from RBF.
> So RBF should support disabling getNodeUsage() in RBFMetrics.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org
For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org
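The statistics in the hunk above are simple enough to check in isolation. The following is a minimal, standalone sketch (the class name `NodeUsageStats` and method `summarize` are invented for illustration and are not part of RBFMetrics): it computes the median, max, min, and the population standard deviation that the pre-patch loop computed by hand; the patched code instead delegates the deviation to commons-math's `StandardDeviation`.

```java
import java.util.Arrays;

public class NodeUsageStats {

  /** Returns {median, max, min, stddev} over DFS-used percentages; assumes a non-empty input. */
  static double[] summarize(double[] usages) {
    double[] sorted = usages.clone();
    Arrays.sort(sorted);
    double median = sorted[sorted.length / 2];
    double max = sorted[sorted.length - 1];
    double min = sorted[0];

    // Mean, then population standard deviation (divide by n), mirroring the
    // pre-patch manual loop in getNodeUsage().
    double mean = 0;
    for (double u : sorted) {
      mean += u;
    }
    mean /= sorted.length;
    double dev = 0;
    for (double u : sorted) {
      dev += (u - mean) * (u - mean);
    }
    dev = Math.sqrt(dev / sorted.length);

    return new double[] {median, max, min, dev};
  }

  public static void main(String[] args) {
    double[] s = summarize(new double[] {10.0, 20.0, 30.0, 40.0});
    // median = 30.0 (upper of the two middle elements), max = 40.0, min = 10.0, dev ≈ 11.18
    System.out.printf("median=%.1f max=%.1f min=%.1f dev=%.2f%n", s[0], s[1], s[2], s[3]);
  }
}
```

Note that with an even number of datanodes, `usages[usages.length / 2]` takes the upper of the two middle values rather than their average; the patch keeps that behavior unchanged.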
[jira] [Work logged] (HDFS-16678) RBF supports disable getNodeUsage() in RBFMetrics
[ https://issues.apache.org/jira/browse/HDFS-16678?focusedWorklogId=796166&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-796166 ]

ASF GitHub Bot logged work on HDFS-16678:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 28/Jul/22 18:52
    Start Date: 28/Jul/22 18:52
    Worklog Time Spent: 10m

Work Description: goiri commented on code in PR #4606:
URL: https://github.com/apache/hadoop/pull/4606#discussion_r932570164

## hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/metrics/RBFMetrics.java:
        LOG.error("Cannot get the live nodes: {}", e.getMessage());

Review Comment:
> I feel it would be better this way.
>
> ```
> LOG.error("Cannot get the live nodes.", e);
> ```

Do we want to have the full stack trace? I think it is pretty clear what the error is here without it.

Issue Time Tracking
-------------------
    Worklog Id: (was: 796166)
    Time Spent: 1h 50m (was: 1h 40m)
[jira] [Work logged] (HDFS-16678) RBF supports disable getNodeUsage() in RBFMetrics
[ https://issues.apache.org/jira/browse/HDFS-16678?focusedWorklogId=795936&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-795936 ]

ASF GitHub Bot logged work on HDFS-16678:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 28/Jul/22 05:31
    Start Date: 28/Jul/22 05:31
    Worklog Time Spent: 10m

Work Description: slfan1989 commented on code in PR #4606:
URL: https://github.com/apache/hadoop/pull/4606#discussion_r931795987

## hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/metrics/RBFMetrics.java:
        LOG.error("Cannot get the live nodes: {}", e.getMessage());

Review Comment: I feel it would be better this way.

```
LOG.error("Cannot get the live nodes.", e);
```

Issue Time Tracking
-------------------
    Worklog Id: (was: 795936)
    Time Spent: 1h 40m (was: 1.5h)
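The point under debate in this thread changes what actually lands in the log: with SLF4J (which Hadoop uses), an exception passed as the last argument gets its full stack trace appended, while interpolating `e.getMessage()` records only the message. A small sketch with the JDK's own `java.util.logging` (used here only to keep the example dependency-free; the class name `LogDemo` is invented) demonstrates the same distinction:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.PrintStream;
import java.util.logging.Level;
import java.util.logging.Logger;
import java.util.logging.SimpleFormatter;
import java.util.logging.StreamHandler;

public class LogDemo {

  /** Logs one error, with or without the Throwable, and returns the captured log text. */
  static String logAndCapture(boolean withStack) {
    ByteArrayOutputStream buf = new ByteArrayOutputStream();
    StreamHandler handler = new StreamHandler(new PrintStream(buf), new SimpleFormatter());
    Logger log = Logger.getLogger("demo-" + withStack);
    log.setUseParentHandlers(false);
    log.addHandler(handler);

    IOException e = new IOException("connection timed out");
    if (withStack) {
      // Passing the Throwable: the formatter appends the exception and stack trace.
      log.log(Level.SEVERE, "Cannot get the live nodes.", e);
    } else {
      // Message only: the exception class and stack trace are lost.
      log.severe("Cannot get the live nodes: " + e.getMessage());
    }
    handler.flush();
    return buf.toString();
  }

  public static void main(String[] args) {
    System.out.println(logAndCapture(false).contains("java.io.IOException")); // message only
    System.out.println(logAndCapture(true).contains("java.io.IOException"));  // with stack trace
  }
}
```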
[jira] [Work logged] (HDFS-16678) RBF supports disable getNodeUsage() in RBFMetrics
[ https://issues.apache.org/jira/browse/HDFS-16678?focusedWorklogId=795860&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-795860 ]

ASF GitHub Bot logged work on HDFS-16678:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 28/Jul/22 00:45
    Start Date: 28/Jul/22 00:45
    Worklog Time Spent: 10m

Work Description: ZanderXu commented on PR #4606:
URL: https://github.com/apache/hadoop/pull/4606#issuecomment-1197525467

@goiri Hi, master, can you help me merge it into the trunk?

Issue Time Tracking
-------------------
    Worklog Id: (was: 795860)
    Time Spent: 1.5h (was: 1h 20m)
[jira] [Work logged] (HDFS-16678) RBF supports disable getNodeUsage() in RBFMetrics
[ https://issues.apache.org/jira/browse/HDFS-16678?focusedWorklogId=794614&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-794614 ]

ASF GitHub Bot logged work on HDFS-16678:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 24/Jul/22 08:54
    Start Date: 24/Jul/22 08:54
    Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on PR #4606:
URL: https://github.com/apache/hadoop/pull/4606#issuecomment-1193275454

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 36s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 0s | | xmllint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 38m 51s | | trunk passed |
| +1 :green_heart: | compile | 1m 0s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | compile | 0m 56s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 0m 48s | | trunk passed |
| +1 :green_heart: | mvnsite | 1m 0s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 7s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 17s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 1m 48s | | trunk passed |
| +1 :green_heart: | shadedclient | 21m 3s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 42s | | the patch passed |
| +1 :green_heart: | compile | 0m 44s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javac | 0m 44s | | the patch passed |
| +1 :green_heart: | compile | 0m 40s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 0m 40s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 26s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 42s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 40s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 0m 59s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 1m 26s | | the patch passed |
| +1 :green_heart: | shadedclient | 20m 39s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 22m 16s | | hadoop-hdfs-rbf in the patch passed. |
| +1 :green_heart: | asflicense | 0m 52s | | The patch does not generate ASF License warnings. |
| | | 120m 20s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4606/3/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4606 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint |
| uname | Linux fadcae59afbb 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 287868a54232c45a03b5f32b9d9ecc084419d585 |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4606/3/testReport/ |
| Max. process+thread count | 2808 (vs. ulimit of 5500) |
| modules | C: hadoop-hdfs-project/hadoop-hdfs-rbf U: hadoop-hdfs-project/hadoop-hdfs-rbf |
| Console output |
[jira] [Work logged] (HDFS-16678) RBF supports disable getNodeUsage() in RBFMetrics
[ https://issues.apache.org/jira/browse/HDFS-16678?focusedWorklogId=794595&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-794595 ]

ASF GitHub Bot logged work on HDFS-16678:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 24/Jul/22 04:27
    Start Date: 24/Jul/22 04:27
    Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on PR #4606:
URL: https://github.com/apache/hadoop/pull/4606#issuecomment-1193244409

:confetti_ball: **+1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|--------:|:-------:|:-------:|
| +0 :ok: | reexec | 0m 58s | | Docker mode activated. |
|||| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 0s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 0s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 0s | | xmllint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| +1 :green_heart: | test4tests | 0m 0s | | The patch appears to include 1 new or modified test files. |
|||| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 48m 37s | | trunk passed |
| +1 :green_heart: | compile | 0m 54s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | compile | 0m 51s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 0m 44s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 56s | | trunk passed |
| +1 :green_heart: | javadoc | 1m 3s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 11s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 1m 50s | | trunk passed |
| +1 :green_heart: | shadedclient | 25m 16s | | branch has no errors when building and testing our client artifacts. |
|||| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 40s | | the patch passed |
| +1 :green_heart: | compile | 0m 46s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javac | 0m 46s | | the patch passed |
| +1 :green_heart: | compile | 0m 38s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 0m 38s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| -0 :warning: | checkstyle | 0m 25s | [/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt](https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4606/2/artifact/out/results-checkstyle-hadoop-hdfs-project_hadoop-hdfs-rbf.txt) | hadoop-hdfs-project/hadoop-hdfs-rbf: The patch generated 2 new + 0 unchanged - 0 fixed = 2 total (was 0) |
| +1 :green_heart: | mvnsite | 0m 40s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 39s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 0m 56s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 1m 35s | | the patch passed |
| +1 :green_heart: | shadedclient | 24m 15s | | patch has no errors when building and testing our client artifacts. |
|||| _ Other Tests _ |
| +1 :green_heart: | unit | 41m 5s | | hadoop-hdfs-rbf in the patch passed. |
| +1 :green_heart: | asflicense | 0m 48s | | The patch does not generate ASF License warnings. |
| | | 156m 21s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4606/2/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4606 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint |
| uname | Linux 911f56c609dd 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / e71c527a2f1dae800980e1da5694914a5fd0a93a |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private
[jira] [Work logged] (HDFS-16678) RBF supports disable getNodeUsage() in RBFMetrics
[ https://issues.apache.org/jira/browse/HDFS-16678?focusedWorklogId=794584&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-794584 ]

ASF GitHub Bot logged work on HDFS-16678:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 24/Jul/22 01:50
    Start Date: 24/Jul/22 01:50
    Worklog Time Spent: 10m

Work Description: ZanderXu commented on PR #4606:
URL: https://github.com/apache/hadoop/pull/4606#issuecomment-1193226247

@goiri Sir, I have updated the patch and added some UTs, please help me review it again. Thanks

Issue Time Tracking
-------------------
    Worklog Id: (was: 794584)
    Time Spent: 1h (was: 50m)
[jira] [Work logged] (HDFS-16678) RBF supports disable getNodeUsage() in RBFMetrics
[ https://issues.apache.org/jira/browse/HDFS-16678?focusedWorklogId=793961&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-793961 ]

ASF GitHub Bot logged work on HDFS-16678:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 21/Jul/22 22:37
    Start Date: 21/Jul/22 22:37
    Worklog Time Spent: 10m

Work Description: ZanderXu commented on code in PR #4606:
URL: https://github.com/apache/hadoop/pull/4606#discussion_r927154231

## hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/metrics/RBFMetrics.java:

@@ -544,28 +548,30 @@ public String getNodeUsage() {
     final Map<String, Map<String, Object>> info = new HashMap<>();
     try {
-      RouterRpcServer rpcServer = this.router.getRpcServer();
-      DatanodeInfo[] live = rpcServer.getDatanodeReport(
-          DatanodeReportType.LIVE, false, timeOut);
-
-      if (live.length > 0) {
-        float totalDfsUsed = 0;
-        float[] usages = new float[live.length];
-        int i = 0;
-        for (DatanodeInfo dn : live) {
-          usages[i++] = dn.getDfsUsedPercent();
-          totalDfsUsed += dn.getDfsUsedPercent();
-        }
-        totalDfsUsed /= live.length;
-        Arrays.sort(usages);
-        median = usages[usages.length / 2];
-        max = usages[usages.length - 1];
-        min = usages[0];
-
-        for (i = 0; i < usages.length; i++) {
-          dev += (usages[i] - totalDfsUsed) * (usages[i] - totalDfsUsed);
+      if (this.enableGetDNUsage) {
+        RouterRpcServer rpcServer = this.router.getRpcServer();
+        DatanodeInfo[] live = rpcServer.getDatanodeReport(
+            DatanodeReportType.LIVE, false, timeOut);
+
+        if (live.length > 0) {
+          float totalDfsUsed = 0;
+          float[] usages = new float[live.length];
+          int i = 0;
+          for (DatanodeInfo dn : live) {
+            usages[i++] = dn.getDfsUsedPercent();

Review Comment: Yes, `rpcServer.getDatanodeReport()` is expensive. As the number of DNs or downstream nameservices in the cluster increases, it becomes more and more expensive, e.g. 1w+ DNs, 5w+ DNs, 20+ NSs, 50+ NSs.
Issue Time Tracking
-------------------
    Worklog Id: (was: 793961)
    Time Spent: 50m (was: 40m)
[jira] [Work logged] (HDFS-16678) RBF supports disable getNodeUsage() in RBFMetrics
[ https://issues.apache.org/jira/browse/HDFS-16678?focusedWorklogId=793959&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-793959 ]

ASF GitHub Bot logged work on HDFS-16678:
-----------------------------------------
    Author: ASF GitHub Bot
    Created on: 21/Jul/22 22:37
    Start Date: 21/Jul/22 22:37
    Worklog Time Spent: 10m

Work Description: ZanderXu commented on code in PR #4606:
URL: https://github.com/apache/hadoop/pull/4606#discussion_r927153919

## hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/router/RBFConfigKeys.java:

@@ -315,6 +315,9 @@ public class RBFConfigKeys extends CommonConfigurationKeysPublic {
       FEDERATION_ROUTER_PREFIX + "dn-report.cache-expire";
   public static final long DN_REPORT_CACHE_EXPIRE_MS_DEFAULT =
       TimeUnit.SECONDS.toMillis(10);
+  public static final String DFS_ROUTER_ENABLE_GET_DN_USAGE_KEY =
+      FEDERATION_ROUTER_PREFIX + "enable.get.dn.usage";
+  public static final boolean DFS_ROUTER_ENABLE_GET_DN_USAGE_DEFAULT = true;

Review Comment: Copy, I will do it.

Issue Time Tracking
-------------------
    Worklog Id: (was: 793959)
    Time Spent: 40m (was: 0.5h)
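For operators, the new key from the RBFConfigKeys diff would be set in the Router's configuration (hdfs-rbf-site.xml). Assuming `FEDERATION_ROUTER_PREFIX` is `dfs.federation.router.`, as for the neighbouring keys in that class, disabling the datanode-usage collection would look like this sketch (the `description` wording is illustrative, not from the patch):

```xml
<!-- Hypothetical hdfs-rbf-site.xml fragment; the full property name assumes
     FEDERATION_ROUTER_PREFIX = "dfs.federation.router." -->
<property>
  <name>dfs.federation.router.enable.get.dn.usage</name>
  <value>false</value>
  <description>When false, RBFMetrics#getNodeUsage() skips the expensive
    getDatanodeReport() fan-out to the downstream nameservices.</description>
</property>
```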
[jira] [Work logged] (HDFS-16678) RBF supports disable getNodeUsage() in RBFMetrics
[ https://issues.apache.org/jira/browse/HDFS-16678?focusedWorklogId=793863=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-793863 ] ASF GitHub Bot logged work on HDFS-16678: - Author: ASF GitHub Bot Created on: 21/Jul/22 17:31 Start Date: 21/Jul/22 17:31 Worklog Time Spent: 10m Work Description: goiri commented on code in PR #4606: URL: https://github.com/apache/hadoop/pull/4606#discussion_r926936268 ## hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/metrics/RBFMetrics.java: ## @@ -544,28 +548,30 @@ public String getNodeUsage() { final Map> info = new HashMap<>(); try { - RouterRpcServer rpcServer = this.router.getRpcServer(); - DatanodeInfo[] live = rpcServer.getDatanodeReport( - DatanodeReportType.LIVE, false, timeOut); - - if (live.length > 0) { -float totalDfsUsed = 0; -float[] usages = new float[live.length]; -int i = 0; -for (DatanodeInfo dn : live) { - usages[i++] = dn.getDfsUsedPercent(); - totalDfsUsed += dn.getDfsUsedPercent(); -} -totalDfsUsed /= live.length; -Arrays.sort(usages); -median = usages[usages.length / 2]; -max = usages[usages.length - 1]; -min = usages[0]; - -for (i = 0; i < usages.length; i++) { - dev += (usages[i] - totalDfsUsed) * (usages[i] - totalDfsUsed); + if (this.enableGetDNUsage) { Review Comment: I would do: ``` DatanodeInfo[] live = null; if (this.enableGetDNUsage) { RouterRpcServer rpcServer = this.router.getRpcServer(); DatanodeInfo[] live = rpcServer.getDatanodeReport(DatanodeReportType.LIVE, false, timeOut); } else { LOG.debug("Getting information is disabled."); // similar message } if (live != null && live.length > 0) { ``` ## hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/metrics/RBFMetrics.java: ## @@ -544,28 +548,30 @@ public String getNodeUsage() { final Map> info = new HashMap<>(); try { - RouterRpcServer rpcServer = this.router.getRpcServer(); - DatanodeInfo[] live = rpcServer.getDatanodeReport( - 
## hadoop-hdfs-project/hadoop-hdfs-rbf/src/main/java/org/apache/hadoop/hdfs/server/federation/metrics/RBFMetrics.java:
## @@ -544,28 +548,30 @@ public String getNodeUsage() {
     final Map<String, Map<String, Object>> info = new HashMap<>();
     try {
-      RouterRpcServer rpcServer = this.router.getRpcServer();
-      DatanodeInfo[] live = rpcServer.getDatanodeReport(
-          DatanodeReportType.LIVE, false, timeOut);
-
-      if (live.length > 0) {
-        float totalDfsUsed = 0;
-        float[] usages = new float[live.length];
-        int i = 0;
-        for (DatanodeInfo dn : live) {
-          usages[i++] = dn.getDfsUsedPercent();
-          totalDfsUsed += dn.getDfsUsedPercent();
-        }
-        totalDfsUsed /= live.length;
-        Arrays.sort(usages);
-        median = usages[usages.length / 2];
-        max = usages[usages.length - 1];
-        min = usages[0];
-
-        for (i = 0; i < usages.length; i++) {
-          dev += (usages[i] - totalDfsUsed) * (usages[i] - totalDfsUsed);
+      if (this.enableGetDNUsage) {
+        RouterRpcServer rpcServer = this.router.getRpcServer();
+        DatanodeInfo[] live = rpcServer.getDatanodeReport(
+            DatanodeReportType.LIVE, false, timeOut);
+
+        if (live.length > 0) {
+          float totalDfsUsed = 0;
+          float[] usages = new float[live.length];
+          int i = 0;
+          for (DatanodeInfo dn : live) {
+            usages[i++] = dn.getDfsUsedPercent();

Review Comment:
   What is the expensive part of this whole block? this?
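The change under review boils down to one pattern: gate the expensive cross-nameservice getDatanodeReport() fan-out behind a configuration flag, so the metric degrades to zeros instead of blocking the JMX collection. A minimal, self-contained sketch of that pattern (the class and field names below are illustrative, not the actual RBFMetrics code):

```java
// Illustrative sketch of the config-guard pattern used in the patch:
// when the flag is off, skip the expensive RPC entirely and report zeros.
public class NodeUsageSketch {
  private final boolean enableGetDNUsage;  // stands in for the new config flag
  private final double[] reportedUsages;   // stands in for the datanode report

  public NodeUsageSketch(boolean enable, double[] usages) {
    this.enableGetDNUsage = enable;
    this.reportedUsages = usages;
  }

  /** Returns the max DFS-used percent, or 0 when collection is disabled. */
  public double maxUsage() {
    if (!enableGetDNUsage) {
      return 0;  // metric reports a default instead of blocking on the RPC
    }
    double max = 0;
    for (double u : reportedUsages) {
      max = Math.max(max, u);
    }
    return max;
  }

  public static void main(String[] args) {
    double[] usages = {10.0, 55.5, 30.0};
    System.out.println(new NodeUsageSketch(true, usages).maxUsage());   // 55.5
    System.out.println(new NodeUsageSketch(false, usages).maxUsage());  // 0.0
  }
}
```

The answer to the review question above is visible in this shape: the expensive part is not the arithmetic over the usage array but producing the datanode report itself, which the guard avoids entirely.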
[jira] [Work logged] (HDFS-16678) RBF supports disable getNodeUsage() in RBFMetrics
[ https://issues.apache.org/jira/browse/HDFS-16678?focusedWorklogId=793838=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-793838 ]

ASF GitHub Bot logged work on HDFS-16678:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 21/Jul/22 16:59
Start Date: 21/Jul/22 16:59
Worklog Time Spent: 10m

Work Description: hadoop-yetus commented on PR #4606:
URL: https://github.com/apache/hadoop/pull/4606#issuecomment-1191727346

:broken_heart: **-1 overall**

| Vote | Subsystem | Runtime | Logfile | Comment |
|:----:|----------:|:-------|:-------:|:-------:|
| +0 :ok: | reexec | 1m 21s | | Docker mode activated. |
| _ Prechecks _ |
| +1 :green_heart: | dupname | 0m 0s | | No case conflicting files found. |
| +0 :ok: | codespell | 0m 1s | | codespell was not available. |
| +0 :ok: | detsecrets | 0m 1s | | detect-secrets was not available. |
| +0 :ok: | xmllint | 0m 1s | | xmllint was not available. |
| +1 :green_heart: | @author | 0m 0s | | The patch does not contain any @author tags. |
| -1 :x: | test4tests | 0m 0s | | The patch doesn't appear to include any new or modified tests. Please justify why no new tests are needed for this patch. Also please list what manual steps were performed to verify this patch. |
| _ trunk Compile Tests _ |
| +1 :green_heart: | mvninstall | 43m 1s | | trunk passed |
| +1 :green_heart: | compile | 0m 53s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | compile | 0m 47s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | checkstyle | 0m 41s | | trunk passed |
| +1 :green_heart: | mvnsite | 0m 53s | | trunk passed |
| +1 :green_heart: | javadoc | 0m 59s | | trunk passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 1m 9s | | trunk passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 1m 40s | | trunk passed |
| +1 :green_heart: | shadedclient | 24m 13s | | branch has no errors when building and testing our client artifacts. |
| _ Patch Compile Tests _ |
| +1 :green_heart: | mvninstall | 0m 38s | | the patch passed |
| +1 :green_heart: | compile | 0m 41s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javac | 0m 41s | | the patch passed |
| +1 :green_heart: | compile | 0m 36s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | javac | 0m 36s | | the patch passed |
| +1 :green_heart: | blanks | 0m 0s | | The patch has no blanks issues. |
| +1 :green_heart: | checkstyle | 0m 22s | | the patch passed |
| +1 :green_heart: | mvnsite | 0m 39s | | the patch passed |
| +1 :green_heart: | javadoc | 0m 37s | | the patch passed with JDK Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 |
| +1 :green_heart: | javadoc | 0m 56s | | the patch passed with JDK Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| +1 :green_heart: | spotbugs | 1m 26s | | the patch passed |
| +1 :green_heart: | shadedclient | 23m 34s | | patch has no errors when building and testing our client artifacts. |
| _ Other Tests _ |
| +1 :green_heart: | unit | 39m 50s | | hadoop-hdfs-rbf in the patch passed. |
| +1 :green_heart: | asflicense | 0m 44s | | The patch does not generate ASF License warnings. |
| | | 147m 35s | | |

| Subsystem | Report/Notes |
|----------:|:-------------|
| Docker | ClientAPI=1.41 ServerAPI=1.41 base: https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4606/1/artifact/out/Dockerfile |
| GITHUB PR | https://github.com/apache/hadoop/pull/4606 |
| Optional Tests | dupname asflicense compile javac javadoc mvninstall mvnsite unit shadedclient spotbugs checkstyle codespell detsecrets xmllint |
| uname | Linux 843009c68092 4.15.0-175-generic #184-Ubuntu SMP Thu Mar 24 17:48:36 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux |
| Build tool | maven |
| Personality | dev-support/bin/hadoop.sh |
| git revision | trunk / 4de60fb87f2f25087acab7b90d75cd0e20be622d |
| Default Java | Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Multi-JDK versions | /usr/lib/jvm/java-11-openjdk-amd64:Private Build-11.0.15+10-Ubuntu-0ubuntu0.20.04.1 /usr/lib/jvm/java-8-openjdk-amd64:Private Build-1.8.0_312-8u312-b07-0ubuntu1~20.04-b07 |
| Test Results | https://ci-hadoop.apache.org/job/hadoop-multibranch/job/PR-4606/1/testReport/ |
| Max. process+thread count | 2025 (vs. ulimit of
[jira] [Work logged] (HDFS-16678) RBF supports disable getNodeUsage() in RBFMetrics
[ https://issues.apache.org/jira/browse/HDFS-16678?focusedWorklogId=793769=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-793769 ]

ASF GitHub Bot logged work on HDFS-16678:
-----------------------------------------
Author: ASF GitHub Bot
Created on: 21/Jul/22 14:31
Start Date: 21/Jul/22 14:31
Worklog Time Spent: 10m

Work Description: ZanderXu opened a new pull request, #4606:
URL: https://github.com/apache/hadoop/pull/4606

### Description of PR
In our prod environment we collect RBF metrics every 15s through jmx_exporter, and the collection task often failed. Tracing showed that the task was blocked in getNodeUsage() in RBFMetrics, because it collects every datanode's usage from the downstream nameservices. This is a very expensive and almost useless operation: in most scenarios each downstream nameservice contains almost the same DNs, so the data usage can be fetched from any one nameservice when needed, rather than from RBF. RBF should therefore support disabling getNodeUsage() in RBFMetrics.

Issue Time Tracking
-------------------
Worklog Id: (was: 793769)
Remaining Estimate: 0h
Time Spent: 10m

> RBF supports disable getNodeUsage() in RBFMetrics
> -------------------------------------------------
>
> Key: HDFS-16678
> URL: https://issues.apache.org/jira/browse/HDFS-16678
> Project: Hadoop HDFS
> Issue Type: Improvement
> Reporter: ZanderXu
> Assignee: ZanderXu
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In our prod environment we collect RBF metrics every 15s through jmx_exporter, and the collection task often failed. Tracing showed that the task was blocked in getNodeUsage() in RBFMetrics, because it collects every datanode's usage from the downstream nameservices. This is a very expensive and almost useless operation: in most scenarios each nameservice contains almost the same DNs, so the data usage can be fetched from any one nameservice, rather than from RBF. RBF should therefore support disabling getNodeUsage() in RBFMetrics.
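When collection is enabled, getNodeUsage() reduces the per-datanode DFS-used percentages to four statistics: median (middle element of the sorted array), max, min, and standard deviation. The arithmetic can be reproduced with plain JDK code; the sketch below follows the original hand-rolled loop (a population standard deviation, dividing by n) and is an illustration, not the RBFMetrics implementation, which in the patch delegates the deviation to commons-math's StandardDeviation:

```java
import java.util.Arrays;

// Plain-JDK sketch of the statistics getNodeUsage() computes over
// per-datanode DFS-used percentages.
public class UsageStats {

  /**
   * Returns {median, max, min, stddev}. Standard deviation divides by n,
   * matching the hand-rolled loop the patch replaces.
   */
  public static double[] summarize(double[] usages) {
    double[] sorted = usages.clone();
    Arrays.sort(sorted);
    double median = sorted[sorted.length / 2];
    double max = sorted[sorted.length - 1];
    double min = sorted[0];

    // Mean first, then sum of squared deviations from it.
    double mean = 0;
    for (double u : sorted) {
      mean += u;
    }
    mean /= sorted.length;

    double dev = 0;
    for (double u : sorted) {
      dev += (u - mean) * (u - mean);
    }
    dev = Math.sqrt(dev / sorted.length);

    return new double[] {median, max, min, dev};
  }

  public static void main(String[] args) {
    double[] stats = summarize(new double[] {10.0, 20.0, 30.0});
    System.out.printf("median=%.1f max=%.1f min=%.1f dev=%.3f%n",
        stats[0], stats[1], stats[2], stats[3]);
    // prints: median=20.0 max=30.0 min=10.0 dev=8.165
  }
}
```

One subtlety worth noting for reviewers: commons-math's StandardDeviation is bias-corrected by default (divides by n-1), so it does not produce exactly the same value as the loop above for small n.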
-- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org