[prometheus-users] Re: better way to get notified about (true) single scrape failures?

Brian Candler Wed, 10 May 2023 00:03:41 -0700

> Not sure if I'm right, but I think if one places both rules in the same
> group (and I think even the order shouldn't matter?), then the original:
>     expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
>     for: 5m
> with 5m being the "for:"-time of the long-alert should be guaranteed to
> work... in the sense that if the above doesn't fire... the long-alert
> does.


It depends on the exact semantics of "for". For example, take the simple
case of a 1-minute rule evaluation interval. If you apply "for: 1m" then I
guess that means the alert expression must be true on two successive
evaluations (otherwise, "for: 1m" would have no effect).

If so, then "for: 5m" means the expression must be true on six successive
evaluations.

But up[5m] only looks at samples wholly contained within a 5-minute window,
and therefore (with a 1-minute scrape interval) will normally only look at
5 samples. (If there is jitter in the sampling time, then occasionally it
might look at 4 or 6 samples.)

If what I've written above is correct (and it may well not be!), then

expr: up == 0
for: 5m

will fire if "up" is zero for 6 cycles, whereas

... unless max_over_time(up[5m]) == 0

will suppress an alert if "up" is zero for (usually) 5 cycles.
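
For reference, this is roughly how I picture the two rules sitting in one group. It's only a sketch: the group name and the alert names (TargetDown, SingleScrapeFailure) are made up by me, and the expressions and "for" values are just the ones quoted in this thread.

```yaml
# Sketch only: group and alert names are hypothetical.
groups:
  - name: scrape_failures
    rules:
      # Long alert: "up" is 0 on every evaluation across the "for" period.
      - alert: TargetDown
        expr: up == 0
        for: 5m
      # Short alert: at least one sample in the last 5m is 0,
      # unless every sample in the last 5m is 0.
      - alert: SingleScrapeFailure
        expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
        for: 5m
```

The question is then whether there is any run of failed scrapes for which neither rule fires, which comes down to the 5-versus-6 counting above.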

If you want to get to the bottom of this with certainty, you can write unit
tests
<https://prometheus.io/docs/prometheus/latest/configuration/unit_testing_rules/>
that try out these scenarios.
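
As a starting point, here is a minimal sketch of such a test for the "six evaluations" guess. It assumes a rule file I've called alerts.yml (hypothetical name) containing an alert named TargetDown (also my name) with "expr: up == 0" and "for: 5m", and the job/instance label values are invented. One series is zero for six samples, the other for five.

```yaml
# Sketch of a promtool unit test; run with: promtool test rules test.yml
rule_files:
  - alerts.yml          # hypothetical file holding the TargetDown rule

evaluation_interval: 1m

tests:
  - interval: 1m        # one input sample per minute, starting at 0m
    input_series:
      # "up" is 0 at 3m..8m (six samples)
      - series: 'up{job="node", instance="six-zeros"}'
        values: '1 1 1 0 0 0 0 0 0 1 1 1'
      # "up" is 0 at 3m..7m (five samples)
      - series: 'up{job="node", instance="five-zeros"}'
        values: '1 1 1 0 0 0 0 0 1 1 1 1'
    alert_rule_test:
      # After five zero evaluations, nothing is expected to be firing yet.
      - eval_time: 7m
        alertname: TargetDown
      # On the sixth zero evaluation, only the six-zeros target should fire.
      - eval_time: 8m
        alertname: TargetDown
        exp_alerts:
          - exp_labels:
              job: node
              instance: six-zeros
```

The same input series can be reused with extra alert_rule_test entries for the "unless max_over_time" rule. I haven't written down expected results for that one: the test samples land exactly on the evaluation timestamps, so whether a sample sitting right on the edge of the 5m window gets counted is precisely the thing the test would reveal.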
