kfaraz commented on code in PR #18598:
URL: https://github.com/apache/druid/pull/18598#discussion_r2409716522


##########
extensions-contrib/prometheus-emitter/src/main/java/org/apache/druid/emitter/prometheus/PrometheusEmitter.java:
##########
@@ -256,4 +279,21 @@ public void setPushGateway(PushGateway pushGateway)
   {
     this.pushGateway = pushGateway;
   }
+
+  private void cleanUpStaleMetrics()
+  {
+    if (config.getFlushPeriod() == null) {
+      return;
+    }
+
+    Map<String, DimensionsAndCollector> map = metrics.getRegisteredMetrics();
+    for (Map.Entry<String, DimensionsAndCollector> entry : map.entrySet()) {
+      if (entry.getValue().isExpired(config.getFlushPeriod())) {
+        log.debug("Metric [%s] has expired (last updated %d ms ago)",
+                 entry.getKey(),
+                 entry.getValue().getTimeSinceLastUpdate());
+        entry.getValue().getCollector().clear();

Review Comment:
   Hmm, I suppose removing the children for specific labels would be a more 
complete solution.
   But it could blow up the memory usage pretty fast since we would now be 
tracking the last updated time for every unique set of label values for every 
metric.
   
   I think the issue this PR is trying to address is that if a certain metric 
stops being reported by a service
   altogether (irrespective of label values), it should not remain in the 
collector anymore. e.g. in leadership changes.
   
   Also, the staleness concern should really only apply to gauge metrics. So if 
timer (histogram) metrics like `task/run/time` and `query/wait/time` are still 
being updated by the current service (for any label), we should continue 
emitting them. Getting rid of the expired labels becomes more of a mem usage 
concern than anything else.
   
   But with gauge metrics like say `segment/size` (which represents bytes of 
segment available on a historical for a given datasource), I agree that the 
story is different.
   If all segments of a datasource are removed from a historical, we should 
definitely stop reporting `segment/size` for that datasource.
   
   But I would suggest we add a comment and postpone the full solution for 
gauge metrics for a later PR.
   Ideally, we would want to write up an embedded test (using Prometheus 
testcontainer) and verify the behaviour of gauge metrics in such cases and then 
proceed with a solution.
   
   What are your thoughts, @abhishekrb19 , @aho135 ?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to