> As a reminder, my goal was:
> - if e.g. scrapes fail for 1m, a target-down alert shall fire (similar to
>   how Icinga would put the host into the down state after pings failed for
>   a number of seconds)
> - but even if a single scrape fails (which alone wouldn't trigger the
>   above alert) I'd like to get a notification (telling me that something
>   might be fishy with the networking or so), UNLESS that single failed
>   scrape is part of a sequence of failed scrapes that also caused / will
>   cause the above target-down alert
>
> Assuming in the following, each number is a sample value of the `up`
> metric of a single host, with ~10s between samples and the most recent
> one being the right-most:
> - 1 1 1 1 1 1 1 => should give nothing
> - 1 1 1 1 1 1 0 => should NOT YET give anything (might be just a single
>   failure, or develop into the target-down alert)
> - 1 1 1 1 1 0 0 => same as above, not clear yet
> ...
> - 1 0 0 0 0 0 0 => here it's clear, this is a target-down alert
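(For reference, the 'target-down' half of the above is normally just an `up == 0` alert with a `for:` duration. A minimal sketch, where the group and alert names are placeholders of mine:)

```yaml
groups:
  - name: availability        # placeholder group name
    rules:
      - alert: TargetDown     # placeholder alert name
        expr: up == 0
        for: 1m               # every scrape in the last minute must have failed
```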
One thing you can look into here for detecting and counting failed scrapes is resets(). This works perfectly well when applied to a gauge that is 1 or 0, and in that case it counts the number of times the metric went from 1 to 0 in a particular time interval. You can similarly use changes() to count the total number of transitions (either 1 -> 0 scrape failures, or 0 -> 1 scrapes starting to succeed again after failures).

It may also be useful to multiply the result of this by the current value of the metric. For example:

	resets(up{..}[1m]) * up{..}

will be non-zero if there have been some scrape failures over the past minute *but* the most recent scrape succeeded (if that scrape failed, you're multiplying resets() by zero and getting zero). You can then wrap this in an '(...) > 0' to get something you can maybe use as an alert rule for the 'scrapes failed' notification. You might need to make the range for resets() one step larger than the range you use for the 'target-down' alert, since resets() will also be zero if up{...} was zero all through its range. (At this point you may also want to look at the alert 'keep_firing_for' setting.)

However, my other suggestion here would be that this notification or count of failed scrapes may be better handled as a dashboard or a periodic report (from a script) instead of through an alert, especially a fast-firing alert. I think it will be relatively difficult to make an alert give you an accurate count of how many times this happened; if you want such a count in order to make decisions, a dashboard (possibly visualizing the up/down blips) or a report could be better. A program is also in the position to extract the raw up{...} metrics (with timestamps) and then readily analyze them for things like how long failed scrapes tend to last, how frequently they happen, and so on.

	- cks

PS: This is not my clever set of tricks; I got it from other people.
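(Putting the resets() suggestion above into rule form, a sketch might look like the following; the alert name and the 2m range are my assumptions, with the range chosen one step larger than the 1m target-down window as discussed:)

```yaml
- alert: ScrapesFailedButRecovered   # hypothetical alert name
  # resets(up[2m]) counts 1 -> 0 transitions over the window; multiplying
  # by up makes the result zero while the target is still down, so this
  # can only fire once scrapes have started succeeding again.
  expr: (resets(up[2m]) * up) > 0
```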
--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group.
To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/3652072.1710729628%40apps0.cs.toronto.edu.