Hey all, I've got a problem I'm trying to tackle and I would appreciate any 
ideas or feedback.

I'm developing a set of recording rules that do different calculations over 
various durations (e.g. rate_2m, rate_30m, rate_1h, rate_6h, rate_12h, 
rate_1d, rate_3d):

# Example for 1 hour lookback
record: my_rule:rate_1h 
expr: sum(rate(my_large_metric[1h]))

When the underlying metric has many labels and very high cardinality, the 
cost of re-aggregating it for every duration becomes significant. I'm trying 
to offset this cost by building the longer-duration rules from the samples 
of a shorter-duration recording rule, e.g.:

# Aggregate + rate large metric 
record: my_rule:rate_1h 
expr: sum(rate(my_large_metric[1h])) 
# Combine 1h samples together, avoiding cost of sum() 
record: my_rule:rate_1d 
expr: avg_over_time(my_rule:rate_1h[24h:1h])

This leads to a discrepancy between the recording rule values and the 
equivalent rate()-based expression, because each rate() evaluation only sees 
the samples inside its own window and so misses increases that happen 
between the windows of consecutive evaluations (effectively the problem 
described here: 
https://stackoverflow.com/questions/70829895/using-sum-over-time-for-a-promql-increase-function-recorded-using-recording-rule).
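
To make the boundary effect concrete, here is a minimal Python sketch. It is 
not how Prometheus computes rate(): it deliberately skips extrapolation, uses 
a made-up counter scraped every 10 minutes, and assumes windows of the form 
(start, end]. It only shows that per-hour increases, each computed from the 
samples inside its own window, cannot see the increments that fall between 
one window's last sample and the next window's first sample.

```python
def raw_increase(samples, start, end):
    """Last-minus-first value over samples with start < t <= end.

    Simplification: no extrapolation and no counter-reset handling.
    """
    window = [v for t, v in samples if start < t <= end]
    return window[-1] - window[0] if len(window) >= 2 else 0.0


# Hypothetical counter: scraped every 10 minutes, growing by 10 per scrape.
samples = [(t, float(t)) for t in range(0, 181, 10)]

# Three consecutive 1h windows versus one 3h window over the same data.
hourly = [raw_increase(samples, end - 60, end) for end in (60, 120, 180)]
direct = raw_increase(samples, 0, 180)

print("sum of hourly increases:", sum(hourly))
print("direct 3h increase:     ", direct)
```

In this sketch the hourly windows add up to 150 while the single 3-hour 
window sees 170; the missing 20 is the two increments that straddle the hour 
boundaries. Real rate() extrapolates towards the window edges, so the actual 
numbers will differ from this simplified model.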

Is there a way to avoid the performance hit and maintain accuracy? I'm 
hoping that pre-aggregating the counter values by instance might do it:

# Tracks the total count of events per scrape target 
record: instance:my_rule:sum 
expr: sum by (instance)(my_large_metric) 
# Use this count of events to calculate the rate of change over any duration 
# with greatly reduced aggregation costs 
record: my_rule:rate_1h 
expr: sum(rate(instance:my_rule:sum[1h])) 
record: my_rule:rate_1d 
expr: sum(rate(instance:my_rule:sum[1d]))
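
As a sanity check on the per-instance idea, here is a small Python sketch 
comparing rate-then-sum with sum-then-rate. The series values are invented, 
and increase_with_resets is a simplified stand-in for Prometheus's reset 
handling (any decrease is treated as a restart from zero, with no 
extrapolation). It assumes that all series of one instance reset at the same 
scrape, as they would on a process restart.

```python
def increase_with_resets(values):
    """Total increase, treating any decrease as a counter reset to zero."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def column_sum(*series):
    """Element-wise sum of equally long series (what sum() does per timestamp)."""
    return [sum(points) for points in zip(*series)]


# Hypothetical data: instance 1 restarts before the third scrape,
# so both of its series reset together; instance 2 never restarts.
inst1_a = [100.0, 110.0, 5.0, 15.0]
inst1_b = [200.0, 220.0, 10.0, 30.0]
inst2_c = [1000.0, 1100.0, 1200.0, 1300.0]

# Reference: increase per raw series, then sum.
rate_then_sum = sum(increase_with_resets(s) for s in (inst1_a, inst1_b, inst2_c))

# Proposed approach: sum by (instance), then increase, then sum.
per_instance = (
    increase_with_resets(column_sum(inst1_a, inst1_b))
    + increase_with_resets(inst2_c)
)

# For contrast: sum across all instances first, then increase.
global_sum = increase_with_resets(column_sum(inst1_a, inst1_b, inst2_c))

print(rate_then_sum, per_instance, global_sum)
```

With these numbers the per-instance variant matches the reference (375 in 
both cases), while summing across instances first yields 1475, because the 
dip caused by one instance restarting is treated as a reset of the whole 
sum.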

This does violate the principle outlined 
in https://www.robustperception.io/rate-then-sum-never-sum-then-rate, but I 
believe that aggregating per instance avoids the counter-reset problems 
described there. Other potential issues I can see with this:
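- A labeled series disappearing lowers the per-instance sum, which rate() 
  might treat as a counter reset
- Slightly increased risk of losing precision if summed values exceed 2^53 
  (beyond that, float64 can no longer represent every integer exactly)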
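
Both concerns can be reproduced in a few lines of Python. This is only a 
sketch under stated assumptions: the sample values are invented, and 
increase_with_resets is a simplified model of counter-reset handling (any 
decrease is treated as a restart from zero, with no extrapolation).

```python
def increase_with_resets(values):
    """Total increase, treating any decrease as a counter reset to zero."""
    total = 0.0
    for prev, cur in zip(values, values[1:]):
        total += cur - prev if cur >= prev else cur
    return total


# Hypothetical instance with two labeled series; series b stops being
# exported after the second scrape (None = no sample at that scrape).
series_a = [1000.0, 1005.0, 1010.0, 1015.0]
series_b = [500.0, 505.0, None, None]

# sum by (instance): an absent series simply drops out of the sum.
summed = [a + (b if b is not None else 0.0) for a, b in zip(series_a, series_b)]

# Increase per raw series versus increase of the pre-aggregated sum.
true_increase = increase_with_resets(series_a) + increase_with_resets([500.0, 505.0])
summed_increase = increase_with_resets(summed)
print(true_increase, summed_increase)

# float64 precision: above 2^53 a +1 increment can be lost entirely.
limit = 2.0 ** 53
precision_lost = (limit + 1.0 == limit)
print(precision_lost)
```

Here the raw series increase by 20 in total, but the summed series reports 
1025: the drop from 1510 to 1010 is read as a reset, so the whole 
post-drop value is counted as new increase. The last line prints True, 
showing that 2^53 + 1 is not representable as a float64.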

Curious to know what people's thoughts on this are.
