Hey there. I eventually got back to this and I'm still fighting this problem.
As a reminder, my goal was:

- If scrapes fail for e.g. 1m, a target-down alert shall fire (similar to how Icinga would put a host into the down state after pings have failed for a number of seconds).
- But even a single failed scrape (which alone wouldn't trigger the above alert) should give me a notification (telling me that something might be fishy with the networking or so), UNLESS that single failed scrape is part of a sequence of failed scrapes that also caused / will cause the above target-down alert.

Assuming in the following that each number is a sample value of the `up` metric of a single host, with ~10s distance between samples and the most recent one being the right-most:

- 1 1 1 1 1 1 1 => should give nothing
- 1 1 1 1 1 1 0 => should NOT YET give anything (might be just a single failure, or develop into the target-down alert)
- 1 1 1 1 1 0 0 => same as above, not clear yet
- ...
- 1 0 0 0 0 0 0 => here it's clear: this is a target-down alert

The following:

- 1 1 1 1 1 0 1
- 1 1 1 1 0 0 1
- 1 1 1 0 0 0 1

... should eventually (though not necessarily right after the right-most 1) all give a "single-scrape-failure" alert (even though it's more than just one failed scrape, it's not a target-down), simply because there are 0s, but for a time span of less than 1m.

Further:

- 1 0 1 0 0 0 0 0 0 should give both a single-scrape-failure alert (for the left-most single 0) AND a target-down alert (for the 6 consecutive zeros)
- 1 0 1 0 1 0 0 0 should give at least 2x a single-scrape-failure alert; for the right-most zeros it's not yet clear what they'll become
- 0 0 0 0 0 0 0 0 0 0 0 0 (= 2x six zeros) should give only 1 target-down alert
- 0 0 0 0 0 0 1 0 0 0 0 0 0 (= 2x six zeros, separated by a 1) should give 2 target-down alerts

Whether each of such alerts (e.g. in the 1 0 1 0 1 0
case) actually results in a notification (mail) is of course a different matter and depends on the Alertmanager configuration, but at least the alert should fire, and with the right Alertmanager config one should actually get a notification for each single failed scrape.

Now, Brian has already given me some pretty good ideas how to do this. Basically the ideas were (assuming that 1m makes the target down, and a scrape interval of 10s):

For the target-down alert:

a)  expr: 'up == 0'
    for: 1m

b)  expr: 'max_over_time(up[1m]) == 0'
    for: 0s

=> Here (b) was probably better, as it uses the same condition as the alert below, and there can be no weird timing effects depending on the for: and when these are actually evaluated.

For the single-scrape-failure alert:

A)  expr: min_over_time(up[1m20s]) == 0 unless max_over_time(up[1m]) == 0
    for: 1m10s
    (numbers a bit modified from Brian's example, but I think the idea is the same)

B)  expr: min_over_time(up[1m10s]) == 0 unless max_over_time(up[1m10s]) == 0
    for: 1m

=> I did test (B) quite a lot, but there was still at least one case where it failed, namely two consecutive but distinct target-down errors, that is:
     0 0 0 0 0 0 1 0 0 0 0 0 0 (= 2x six zeros, separated by a 1)
   which would eventually look like e.g.
     0 1 0 0 0 0 0 0
   or
     0 0 1 0 0 0 0 0
   in the above check, and thus trigger (via the left-most zeros) a false single-scrape-failure alert.

=> I'm not so sure whether I truly understand (A), especially with respect to niche cases when there's jitter or so (plus, IIRC, it also failed in the case described for (B)).

One approach I tried in the meantime was to use sum_over_time, and the idea was simply to check how many 1s there are in each case. But it turns out that even when everything runs normally, the sum is not stable: over [1m] I sometimes got only 5, whereas most of the time it was 6.
Not really sure how that happens, because the printed timestamps of the samples seem to be super accurate (all the time), but the sum wasn't.

So I tried a different approach now, based on the above from Brian, which at least in tests looks promising so far... but I'd like to hear what the experts think about it.

- Both alerts have to be in the same alert group (I assume this ensures they're evaluated in the same thread and at the "same time", that is, with respect to the same reference timestamp).
- In my example I assume a scrape interval of 10s and an evaluation interval of 7s (not really sure whether the latter matters, or whether it could be changed while the rules stay the same and things would still work).
- for: is always 0s. I think that's good, because at least to me it's unclear how things are evaluated if the two alerts have different values for for:, especially in border cases.

  rules:
  - alert: target-down
    expr: 'max_over_time( up[1m0s] ) == 0'
    for: 0s
  - alert: single-scrape-failure
    expr: 'min_over_time(up[15s] offset 1m) == 0
             unless max_over_time(up[1m0s]) == 0
             unless max_over_time(up[1m0s] offset 1m10s) == 0
             unless max_over_time(up[1m0s] offset 1m) == 0
             unless max_over_time(up[1m0s] offset 50s) == 0
             unless max_over_time(up[1m0s] offset 40s) == 0
             unless max_over_time(up[1m0s] offset 30s) == 0
             unless max_over_time(up[1m0s] offset 20s) == 0
             unless max_over_time(up[1m0s] offset 10s) == 0'
    for: 0m

I think the intended working of target-down is obvious, so let me explain the ideas behind single-scrape-failure. I divide the time span I look at into 10s chunks:

-130s -120s -110s -100s -90s -80s -70s -60s -50s -40s -30s -20s -10s 0s/now
|     |     |     |     |    |    |  0  |    |    |    |    |    |    |  case 1
|     |     |     |     |    |    |  0  |  0 |  0 |  0 |  0 |  0 |  0 |  case 2
|     |     |     |     |    |    |  0  |  1 |  0 |  0 |  0 |  0 |  0 |  case 3
|     |     |     |     |    |    |  0  |  1 |  1 |  1 |  1 |  1 |  1 |  case 4
|     |     |     |     |    |  1 |  0  |  1 |  0 |  0 |  0 |  0 |  0 |  case 5
|     |     |     |     |    |  1 |  0  |  1 |  1 |  1 |  1 |  1 |  1 |  case 6
|  1  |  0  |  0  |  0  |  0 |  0 |  0  |  1 |  0 |  0 |  0 |  0 |  0 |  case 7
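Spelled out as a complete, loadable rule group, this is what I currently have (the group name is just a placeholder; the 7s interval is the evaluation-interval assumption from above):

```yaml
groups:
  - name: scrape-health   # placeholder name; both alerts deliberately in the same group
    interval: 7s          # assumed evaluation interval (scrape interval is 10s)
    rules:
      - alert: target-down
        expr: 'max_over_time( up[1m0s] ) == 0'
        for: 0s
      - alert: single-scrape-failure
        expr: >-
          min_over_time(up[15s] offset 1m) == 0
          unless max_over_time(up[1m0s]) == 0
          unless max_over_time(up[1m0s] offset 1m10s) == 0
          unless max_over_time(up[1m0s] offset 1m) == 0
          unless max_over_time(up[1m0s] offset 50s) == 0
          unless max_over_time(up[1m0s] offset 40s) == 0
          unless max_over_time(up[1m0s] offset 30s) == 0
          unless max_over_time(up[1m0s] offset 20s) == 0
          unless max_over_time(up[1m0s] offset 10s) == 0
        for: 0m
```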
1: Having a 0 somewhere between -70s and -60s is mandatory for a single-scrape-failure. For every 0 further to the right it's not yet clear which case it will end up as (well, actually it may already be clear, if there's a 1 even further right, but that's too complex to check and not really needed). Every 0 further to the left would, if it caused an alert at all, already have fired when it was between -70s and -60s.

So I check this via:
  min_over_time(up[15s] offset 1m) == 0
Not really sure about the 15s... the idea is to account for jitter, i.e. if there was only one 0 and it came a bit early, already before -70s.
I guess the question here is: what happens if I do
  min_over_time(up[10s] offset 1m)
and there is NO sample between -70s and -60s? Does it take the next older one? Or the next newer?

2: Should not be a single-scrape-failure, but a target-down failure. This I get via the:
  unless max_over_time(up[1m0s]) == 0

3, 4: These are actually undefined, because I didn't fill in the older numbers, so maybe there was another 1m full of 0s just before the leftmost 0 (which would then have been its own target-down alert).

5, 6: Here it's clear: the 0 between -70s and -60s must be a single-scrape-failure and should alert, which it already does if the rule were just:
  expr: min_over_time(up[15s] offset 1m) == 0 unless max_over_time(up[1m0s]) == 0

7: This case fails if we had just:
  expr: min_over_time(up[15s] offset 1m) == 0 unless max_over_time(up[1m0s]) == 0
because the 0 between -70s and -60s is actually NOT a single-scrape-failure, but part of a target-down alert. This is where the
  unless max_over_time(up[1m0s] offset 1m10s) == 0
  unless max_over_time(up[1m0s] offset 1m  ) == 0
  unless max_over_time(up[1m0s] offset 50s ) == 0
  unless max_over_time(up[1m0s] offset 40s ) == 0
  unless max_over_time(up[1m0s] offset 30s ) == 0
  unless max_over_time(up[1m0s] offset 20s ) == 0
  unless max_over_time(up[1m0s] offset 10s ) == 0
come into play.
The idea is that I make a number of excluding conditions, which are the same as the expr for target-down, just shifted around the important interval from -70s to -60s:

-130s -120s -110s -100s -90s -80s -70s -60s -50s -40s -30s -20s -10s 0s/now
|     |     |     |     |    |    |  0  |  X |  X |  X |  X |  X |  X |  unless max_over_time(up[1m0s]             ) == 0
|     |     |     |     |    |    | 0/X |  X |  X |  X |  X |  X |    |  unless max_over_time(up[1m0s] offset 10s  ) == 0
|     |     |     |     |    |  X | 0/X |  X |  X |  X |  X |    |    |  unless max_over_time(up[1m0s] offset 20s  ) == 0
|     |     |     |     |  X |  X | 0/X |  X |  X |  X |    |    |    |  unless max_over_time(up[1m0s] offset 30s  ) == 0
|     |     |     |  X  |  X |  X | 0/X |  X |  X |    |    |    |    |  unless max_over_time(up[1m0s] offset 40s  ) == 0
|     |     |  X  |  X  |  X |  X | 0/X |  X |    |    |    |    |    |  unless max_over_time(up[1m0s] offset 50s  ) == 0
|     |  X  |  X  |  X  |  X |  X | 0/X |    |    |    |    |    |    |  unless max_over_time(up[1m0s] offset 1m   ) == 0
|  X  |  X  |  X  |  X  |  X |  X |  0  |    |    |    |    |    |    |  unless max_over_time(up[1m0s] offset 1m10s) == 0

X simply denotes that the 10s chunk is part of the respective 1m interval. 0/X marks where the important interval from -70s to -60s is also part of it, which doesn't matter: it's anyway 0, and we use max_over_time.

So, *if* the important interval from -70s to -60s contains a 0, the rule looks at the shifted 1m intervals to see whether any of those was a target-down alert, and if so, does not fire.

Now there are still many open questions.

First, and perhaps more rhetorically: Why is this so hard to do in Prometheus? I know Prometheus isn't Icinga/Nagios, but there a failed probe would immediately cause the check to go into the UNKNOWN state. For Prometheus, whose main purpose is scraping metrics, one would assume that people at least have a simple way to get notified when these scrapes fail.

But the more concrete questions:

1) Does the above solution sound reasonable?

2) What about my up[15s] offset 1m ... should it be only [10s]? Or something else?
(btw: The 10+5s is obviously one scrape interval plus less than one scrape interval — I took half.)

3) Should the more or less corresponding
     unless max_over_time(up[1m0s] offset 1m10s) == 0
   rather be
     unless max_over_time(up[1m5s] offset 1m10s) == 0
   ?

4) The question from above:
   > what happens if I do:
   > min_over_time(up[10s] offset 1m)
   > and there is NO sample between -70 and -60 ?? Does it take the next older one? Or the next newer?

5) I split the time spans up into chunks of 10s, which is my scrape interval. Is that even reasonable? Or should it rather be split into evaluation intervals?

6) How do the above alerts depend on the evaluation interval? I mean, will they still work as expected if I use e.g. the scrape interval (10s)? Or could this cause the two intervals to be overlaid in just the wrong manner? Same if I'd use any divisor of the scrape interval, like 5s, 2s or 1s. And what if I'd use an evaluation interval *bigger* than the scrape interval?

7) In all my above 10s intervals:
   -130s -120s -110s -100s -90s -80s -70s -60s -50s -40s -30s -20s -10s 0s/now
   |     |     |     |     |    |    |    |    |    |    |    |    |    |
   the query is always inclusive on both ends, right? So if a sample lay e.g. exactly on -70s, it would count for both intervals, the one from -80s to -70s and the one from -70s to -60s. I'm a bit unsure whether or not that matters for my alerts. Intuitively not, because my expressions all look at intervals (there is no for: Xs or so), and if the sample is right at the border, well, that simply means both intervals have that value. And if there's another sample in the same interval, the max_ and min_ functions should just do the right thing (I... kinda guess ^^).
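For questions 2 and 3, it may help to spell out the pure window bookkeeping (not an answer, just arithmetic, relative to the evaluation timestamp "now" and assuming the inclusivity question 7 doesn't bite):

```promql
# selector                 window covered (relative to now)
up[15s]  offset 1m      #   -75s .. -60s   (the -70s..-60s chunk plus 5s of jitter headroom)
up[1m0s] offset 1m10s   #  -130s .. -70s
up[1m5s] offset 1m10s   #  -135s .. -70s   (5s more at the older end)
```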
8) I also thought about what would happen if there are multiple samples in one interval, e.g.:

   -130s -120s -110s -100s -90s -80s -70s -60s -50s -40s -30s -20s -10s 0s/now
   |  1  |  1  |  1  |  1  |  1 |  1 | 0 1 |  0 |  0 |  0 |  0 |  0 |  0 |  case 8a
   |  1  |  1  |  1  |  1  |  1 |  1 | 1 0 |  0 |  0 |  0 |  0 |  0 |  0 |  case 8b

   8a, 8b: min_over_time for the -70s to -60s interval would be 0 in both cases, but in 8a that would mean the single-scrape-failure is lost. No idea how one could solve this; I guess not at all. :-( Perhaps by using an evaluation interval that mostly prevents this, e.g. a 7s evaluation interval for a 10s scrape interval. Or could one solve it by using count_over_time or last_over_time?

*If* that approach of mine (largely based on Brian's ideas) would indeed work as intended, there's still one problem left: if one wants a longer period after which target-down fires (e.g. 5m rather than 1m) but still keeps the short scrape interval of 10s, one gets an awfully big expression (which probably doesn't execute faster the longer it gets). Any ideas how to make that better?

Thanks,
Chris.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/791bc259-05cf-40e9-b1bf-2177a467aa4cn%40googlegroups.com.