Hey all, I've got a problem I'm trying to tackle and I would appreciate any ideas or feedback.
I'm developing a set of recording rules that do different calculations over various durations (e.g. rate_2m, rate_30m, rate_1h, rate_6h, rate_12h, rate_1d, rate_3d):

    # Example for 1-hour lookback
    record: my_rule:rate_1h
    expr: sum(rate(my_large_metric[1h]))

When the underlying metric has many labels and very high cardinality, the cost of re-aggregating the metric becomes significant. I'm trying to offset this cost with an approach where I aggregate recording rules of a shorter duration over time, e.g.:

    # Aggregate + rate the large metric
    record: my_rule:rate_1h
    expr: sum(rate(my_large_metric[1h]))

    # Combine 1h samples together, avoiding the cost of sum()
    record: my_rule:rate_1d
    expr: avg_over_time(my_rule:rate_1h[24h:1h])

This leads to inaccuracy between the recording-rule values and the equivalent rate()-based expression, because rate() misses increases that happen between prior invocations (effectively the problem described here: https://stackoverflow.com/questions/70829895/using-sum-over-time-for-a-promql-increase-function-recorded-using-recording-rule).

Is there a way to avoid the performance hit while maintaining accuracy? I'm hoping that pre-aggregating the counter values by instance might do it:

    # Track the total count of events per scrape target
    record: instance:my_rule:sum
    expr: sum by (instance)(my_large_metric)

    # Use this count of events to calculate the rate of change over any
    # duration, with greatly reduced aggregation cost
    record: my_rule:rate_1h
    expr: sum(rate(instance:my_rule:sum[1h]))

    record: my_rule:rate_1d
    expr: sum(rate(instance:my_rule:sum[1d]))

Now this does violate the principles outlined in https://www.robustperception.io/rate-then-sum-never-sum-then-rate, but I believe it avoids counter resets causing issues, because aggregation is per-instance: all series belonging to one scrape target reset together when that process restarts.
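To convince myself, I put together a toy Python sketch (hypothetical sample values; reset handling simplified to "any drop is a reset", ignoring rate()'s extrapolation). It compares rate-then-sum on the raw series, per-instance sum-then-rate, and a naive global sum-then-rate, when one instance restarts mid-way:

```python
# Simplified Prometheus-style increase(): any drop is treated as a counter
# reset, so the post-reset value is counted from zero. Real rate()/increase()
# also extrapolate to the window boundaries; that is omitted here.
def increase(samples):
    total = 0.0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total

# Two counter series per instance; instance B restarts at t=2, so both of
# its counters reset to zero together.
a1 = [0, 10, 20, 30]
a2 = [0, 10, 20, 30]
b1 = [0, 10, 0, 10]
b2 = [0, 10, 0, 10]

# Ground truth (rate-then-sum): increase per raw series, then sum.
truth = sum(increase(s) for s in (a1, a2, b1, b2))          # 100

# sum by (instance) first, then increase: B's restart is still a clean
# drop-to-zero within its per-instance sum, so the reset is detected.
per_instance = increase([x + y for x, y in zip(a1, a2)]) \
             + increase([x + y for x, y in zip(b1, b2)])    # 100

# Naive global sum-then-rate: B's reset only partially lowers the total,
# no drop is visible, and the increase is undercounted.
global_sum = [w + x + y + z for w, x, y, z in zip(a1, a2, b1, b2)]
undercounted = increase(global_sum)                          # 80
```

In this sketch the per-instance variant matches the per-series ground truth because both of instance B's counters drop to zero together, whereas the global sum only dips partway and the reset goes undetected.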
Other potential issues I can see with this:
- Removing a labeled series would lower the per-instance sum and might be misread as a counter reset
- Slightly increased risk of precision loss if summed values exceed 2^53 (above that, float64 can no longer represent every integer, so small increments may be dropped)

Curious to hear people's thoughts on this.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/764be600-0409-43c0-8a69-69a79efb7e61n%40googlegroups.com.