sodonnel commented on code in PR #3781:
URL: https://github.com/apache/ozone/pull/3781#discussion_r1005334878
##########
hadoop-hdds/server-scm/src/main/java/org/apache/hadoop/hdds/scm/node/DatanodeAdminMonitorImpl.java:
##########
@@ -168,6 +232,55 @@ public int getTrackedNodeCount() {
return trackedNodes.size();
}
+ synchronized void setMetricsToGauge() {
+ metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
+ metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
+ metrics.setTrackedDecommissioningMaintenanceNodesTotal(
+ trackedDecomMaintenance);
+ metrics.setTrackedContainersUnderReplicatedTotal(
+ underReplicatedContainers);
+ metrics.setTrackedContainersSufficientlyReplicatedTotal(
+ sufficientlyReplicatedContainers);
+ metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
+ for (Map.Entry<String, ContainerStateInWorkflow> e :
+ containerStateByHost.entrySet()) {
+ metrics.metricRecordOfContainerStateByHost(e.getKey(),
+ e.getValue().sufficientlyReplicated,
+ e.getValue().underReplicatedContainers,
+ e.getValue().unhealthyContainers,
+ e.getValue().pipelinesWaitingToClose);
+ }
+ }
+
+ void resetContainerMetrics() {
+ pipelinesWaitingToClose = 0;
+ sufficientlyReplicatedContainers = 0;
+ unhealthyContainers = 0;
+ underReplicatedContainers = 0;
+
+ for (Map.Entry<String, ContainerStateInWorkflow> e :
Review Comment:
> On each scheduled run of the monitor, the implementation captures the
current workflow state completely prior to flushing the metric update to the
NodeDecommissionMetric object (DatanodeAdminMonitorImpl.setMetricsToGauge). In
doing so, it tries to keep each metric refresh pull from getMetrics display a
full snapshot of the last captured run of the monitor.
The code in `DatanodeAdminMonitorImpl.setMetricsToGauge` is:
```
synchronized void setMetricsToGauge() {
metrics.setTrackedContainersUnhealthyTotal(unhealthyContainers);
metrics.setTrackedRecommissionNodesTotal(trackedRecommission);
metrics.setTrackedDecommissioningMaintenanceNodesTotal(
trackedDecomMaintenance);
metrics.setTrackedContainersUnderReplicatedTotal(
underReplicatedContainers);
metrics.setTrackedContainersSufficientlyReplicatedTotal(
sufficientlyReplicatedContainers);
metrics.setTrackedPipelinesWaitingToCloseTotal(pipelinesWaitingToClose);
for (Map.Entry<String, ContainerStateInWorkflow> e :
...
```
It makes multiple calls to `metrics` and there is nothing stopping
getMetrics() being called on the metrics object by another thread half way
through the execution of `setMetricsToGauge`. This means the metrics can still
be snapshot inconsistently.
To make it consistent, you need to synchronize in the metrics object and set
ALL the metrics in a single synchronized call.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]