neils-dev commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1004946930
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -168,6 +232,55 @@ public int getTrackedNodeCount() {
return trackedNodes.size();
}
+ synchronized void setMetricsToGauge() {
+ metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
+ metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
+ metrics.setTrackedDecommissioningMaintenanceNodesTotal(
+ trackedDecomMaintenance);
+ metrics.setTrackedContainersUnderReplicatedTotal(
+ underReplicatedContainers);
+ metrics.setTrackedContainersSufficientlyReplicatedTotal(
+ sufficientlyReplicatedContainers);
+ metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
+ for (Map.Entry<String, ContainerStateInWorkflow> e :
+ containerStateByHost.entrySet()) {
+ metrics.metricRecordOfContainerStateByHost(e.getKey(),
+ e.getValue().sufficientlyReplicated,
+ e.getValue().underReplicatedContainers,
+ e.getValue().unhealthyContainers,
+ e.getValue().pipelinesWaitingToClose);
+ }
+ }
+
+ void resetContainerMetrics() {
+ pipelinesWaitingToClose = 0;
+ sufficientlyReplicatedContainers = 0;
+ unhealthyContainers = 0;
+ underReplicatedContainers = 0;
+
+ for (Map.Entry<String, ContainerStateInWorkflow> e :
Review Comment:
Yes, thanks, I was looking to couple the `ContainerStateInWorkflow` for use
both in the `DatanodeAdminMonitorImpl` and in the `NodeDecommissionMetrics`;
however, there are a few issues with that, which is why it is implemented this
way. Those issues are:
i.) There need to be separate stores: one for the numbers being collected by
the monitor, and one for the numbers held in the `NodeDecommissionMetrics`.
This is so that we do not report incomplete intermediate numbers to the
`NodeDecommissionMetrics` when the metrics are periodically pulled through
`getMetrics()`. On each run of the monitor thread, once all the numbers have
been collected, we flush them to the `NodeDecommissionMetrics` (the calls to
`metricRecordOfContainerStateByHost()`). For this reason the `Map<String,
ContainerStateInWorkflow>` cannot be used directly in the
`NodeDecommissionMetrics`.
ii.) With the two separate stores, we need to know which stored hosts are
currently in the workflow and which have left it and are now stale. Hence the
check in the monitor code that collects the stale hosts and reports them to
`NodeDecommissionMetrics.metricRemoveRecordOfContainerStateByHost()`. For the
same reason, it looks like clearing the map on each run of the monitor,
instead of iterating to reset the counts to 0 as suggested, is not possible:
we need to know which nodes (hosts) have become stale since the last run of
the monitor.
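To illustrate the two points above, here is a minimal, self-contained sketch of the double-store pattern being described (the class and method names here are hypothetical, not the actual Ozone code): the monitor accumulates counts in a working map, then flushes a complete snapshot to the published store at the end of each run, dropping any hosts that have left the workflow since the previous run.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; names do not match the Ozone classes.
class WorkflowMetricsSketch {
    // Store 1: working counts, mutated while the monitor thread runs.
    private final Map<String, Integer> workingCounts = new HashMap<>();
    // Store 2: published counts, replaced only with complete snapshots,
    // so a concurrent metrics pull never sees half-updated numbers.
    private final Map<String, Integer> publishedCounts = new HashMap<>();

    // Called repeatedly during a single monitor run.
    void record(String host, int count) {
        workingCounts.merge(host, count, Integer::sum);
    }

    // Called once per run, after all counts for the run are collected.
    synchronized void flush() {
        // Hosts published last run but absent from this run are stale.
        Set<String> stale = new HashSet<>(publishedCounts.keySet());
        stale.removeAll(workingCounts.keySet());
        for (String host : stale) {
            publishedCounts.remove(host);
        }
        publishedCounts.putAll(workingCounts);
        workingCounts.clear(); // reset the working store for the next run
    }

    // What a periodic metrics pull would read: always a full snapshot.
    synchronized Map<String, Integer> getPublished() {
        return new HashMap<>(publishedCounts);
    }
}
```

Clearing only the working map (rather than the published one) is what preserves the knowledge of which hosts were reported last run, which is why the stale-host diff in `flush()` is possible at all.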
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]