neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1002290670


##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -168,6 +225,43 @@ public int getTrackedNodeCount() {
     return trackedNodes.size();
   }
 
+  synchronized void setMetricsToGauge() {
+    metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
+    metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
+    metrics.setTrackedDecommissioningMaintenanceNodesTotal(
+            trackedDecomMaintenance);
+    metrics.setTrackedContainersUnderReplicatedTotal(
+            underReplicatedContainers);
+    metrics.setTrackedContainersSufficientlyReplicatedTotal(
+            sufficientlyReplicatedContainers);
+    metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
+    for (Map.Entry<String, Long> e :
+            pipelinesWaitingToCloseByHost.entrySet()) {
+      metrics.metricRecordPipelineWaitingToCloseByHost(e.getKey(),
+              e.getValue());
+    }
+    for (Map.Entry<String, ContainerStateInWorkflow> e :

Review Comment:
   I've modified the code to add metrics to the collector dynamically 
(without using the helper `MetricsRegistry` class to add gauges), as is done 
similarly for the namenode top-metrics collection.  See 
https://github.com/apache/hadoop/blob/eefa664fea1119a9c6e3ae2d2ad3069019fbd4ef/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/namenode/top/metrics/TopMetrics.java#L167.
   Here the metrics are collected dynamically per host while the host node is 
in the workflow.  When the node exits the workflow, the metrics for that host 
are no longer collected, and in JMX the node's metrics no longer appear in the 
output.  Note this is true for JMX; the Prometheus endpoint _seems to retain_ 
the last value pushed.  See the following metrics pushed out to JMX for 
`NodeDecommissionMetrics` when `datanode-2` is decommissioned:
   
   ```
   before
   
    "name" : "Hadoop:service=StorageContainerManager,name=NodeDecommissionMetrics",
       "modelerType" : "NodeDecommissionMetrics",
       "tag.Hostname" : "0d207b6cbbf1",
       "TrackedDecommissioningMaintenanceNodesTotal" : 0,
       "TrackedRecommissionNodesTotal" : 0,
       "TrackedPipelinesWaitingToCloseTotal" : 0,
       "TrackedContainersUnderReplicatedTotal" : 0,
       "TrackedContainersUnhealthyTotal" : 0,
       "TrackedContainersSufficientlyReplicatedTotal" : 0
     }, {
   
   during
       "name" : "Hadoop:service=StorageContainerManager,name=NodeDecommissionMetrics",
       "modelerType" : "NodeDecommissionMetrics",
       "tag.Hostname" : "0d207b6cbbf1",
       "TrackedDecommissioningMaintenanceNodesTotal" : 1,
       "TrackedRecommissionNodesTotal" : 0,
       "TrackedPipelinesWaitingToCloseTotal" : 2,
       "TrackedContainersUnderReplicatedTotal" : 0,
       "TrackedContainersUnhealthyTotal" : 0,
       "TrackedContainersSufficientlyReplicatedTotal" : 0,
       "TrackedUnhealthyContainers-ozone-datanode-2.ozone_default" : 0,
       "TrackedSufficientlyReplicated-ozone-datanode-2.ozone_default" : 0,
       "TrackedPipelinesWaitingToClose-ozone-datanode-2.ozone_default" : 2,
       "TrackedUnderReplicated-ozone-datanode-2.ozone_default" : 0
     }, {
   
     }, {
       "name" : "Hadoop:service=StorageContainerManager,name=NodeDecommissionMetrics",
       "modelerType" : "NodeDecommissionMetrics",
       "tag.Hostname" : "0d207b6cbbf1",
       "TrackedDecommissioningMaintenanceNodesTotal" : 1,
       "TrackedRecommissionNodesTotal" : 0,
       "TrackedPipelinesWaitingToCloseTotal" : 0,
       "TrackedContainersUnderReplicatedTotal" : 1,
       "TrackedContainersUnhealthyTotal" : 0,
       "TrackedContainersSufficientlyReplicatedTotal" : 0,
       "TrackedUnhealthyContainers-ozone-datanode-2.ozone_default" : 0,
       "TrackedSufficientlyReplicated-ozone-datanode-2.ozone_default" : 0,
       "TrackedPipelinesWaitingToClose-ozone-datanode-2.ozone_default" : 0,
       "TrackedUnderReplicated-ozone-datanode-2.ozone_default" : 1
     }, {
   
   after
    }, {
       "name" : "Hadoop:service=StorageContainerManager,name=NodeDecommissionMetrics",
       "modelerType" : "NodeDecommissionMetrics",
       "tag.Hostname" : "0d207b6cbbf1",
       "TrackedDecommissioningMaintenanceNodesTotal" : 0,
       "TrackedRecommissionNodesTotal" : 0,
       "TrackedPipelinesWaitingToCloseTotal" : 0,
       "TrackedContainersUnderReplicatedTotal" : 0,
       "TrackedContainersUnhealthyTotal" : 0,
       "TrackedContainersSufficientlyReplicatedTotal" : 0
     }, {
   
   ```
   The host `datanode-2` metrics are no longer visible once the node exits 
the workflow.
   
   This seems to follow how Hadoop handles dynamically collected metrics; 
however, the Prometheus endpoint seems to _**retain**_ the last pushed value 
for some reason.  Is this what we should expect when collecting metrics for 
hosts as they go in and out of the workflow?
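   As a standalone illustration of the pattern (not the PR's actual code, and 
with illustrative names): per-host values live in a map and are only emitted 
while the host is tracked, so when a host leaves the workflow its entry is 
removed and it simply disappears from the next snapshot, which matches the 
JMX behavior shown above.

   ```java
   import java.util.Map;
   import java.util.concurrent.ConcurrentHashMap;

   // Hypothetical sketch of dynamic per-host gauges: values are emitted from
   // a map snapshot instead of pre-registered MetricsRegistry gauges, so a
   // host that exits the workflow vanishes from subsequent collections.
   public class DynamicHostMetrics {
     private final Map<String, Long> pipelinesWaitingToCloseByHost =
         new ConcurrentHashMap<>();

     // Called while a node is in the decommission/maintenance workflow.
     public void recordPipelinesWaitingToClose(String host, long value) {
       pipelinesWaitingToCloseByHost.put(host, value);
     }

     // Called when a node exits the workflow; its gauge stops being emitted.
     public void stopTracking(String host) {
       pipelinesWaitingToCloseByHost.remove(host);
     }

     // Analogue of MetricsSource#getMetrics: only hosts currently in the map
     // contribute a "TrackedPipelinesWaitingToClose-<host>" entry.
     public Map<String, Long> snapshot() {
       Map<String, Long> out = new ConcurrentHashMap<>();
       pipelinesWaitingToCloseByHost.forEach((host, v) ->
           out.put("TrackedPipelinesWaitingToClose-" + host, v));
       return out;
     }

     public static void main(String[] args) {
       DynamicHostMetrics m = new DynamicHostMetrics();
       m.recordPipelinesWaitingToClose("ozone-datanode-2.ozone_default", 2L);
       System.out.println(m.snapshot().size()); // host tracked: 1 entry
       m.stopTracking("ozone-datanode-2.ozone_default");
       System.out.println(m.snapshot().size()); // host exited: 0 entries
     }
   }
   ```

   A scrape backend that caches the last seen series (as the Prometheus 
endpoint appears to) would still show the final value even after the entry is 
dropped here.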
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

