[prometheus-users] Re: what to do about flapping alerts?
On Monday, April 8, 2024 at 11:05:41 PM UTC+2 Brian Candler wrote:

> On Monday 8 April 2024 at 20:57:34 UTC+1 Christoph Anton Mitterer wrote:
>> But for Prometheus, with keep_firing_for, it will be like the same alert.
>
> If the alerts have the exact same set of labels (e.g. the alert is at the level of the RAID controller, not at the level of individual drives) then yes.

Which will still be quite often the case, I guess. Sometimes it may not matter, i.e. when a "new" alert (which has the same label set) is "missed" because of keep_firing_for, but sometimes it may.

> It failed, it fixed, it failed again within keep_firing_for: then you only get a single alert, with no additional notification. But that's not the problem you originally asked for: "When the target goes down, the alert clears and as soon as it's back, it pops up again, sending a fresh alert notification."

Sure, and this can be avoided with keep_firing_for, but as far as I can see only in some cases (since one wants to keep keep_firing_for shortish), and at the cost of losing information about when the alert condition actually went away (which Prometheus can in principle know) and came back while still firing.

> keep_firing_for can be set differently for different alerts. So you can set it to 10m for the "up == 0" alert, and not set it at all for the RAID alert, if that's what you want.

If there were no other way than the current keep_firing_for - i.e. if my idea of an alternative keep_firing_for that considers the up/down state of the queried metrics isn't possible and/or reasonable - then rather than being able to set keep_firing_for per alert, I'd wish to be able to set it per queried instance.

For some cases of what I'm working on at the university, it might have been nice to (automatically) query the status of an alert and take action if it fires - but then I'd also rather like to stop that action soon after the alert (actually) stops.
If I have to use a longer keep_firing_for because of a set of unstable nodes, then either I get the penalty of unnecessarily long-firing alerts for all nodes, or I maintain different sets of alerts, which would be possible but also quite ugly.

> Surely that delay is essential for the de-flapping scenario you describe: you can't send the alert resolved message until you are *sure* the alert has resolved (i.e. after keep_firing_for). Conversely: if you sent the alert resolved message immediately (before keep_firing_for had expired), and the problem recurred, then you'd have to send out a new alert firing message - which is the flap noise I think you are asking to suppress.

Okay, maybe we have a misunderstanding here, or better said, I guess there are two kinds of flapping alerts. For example, assume an alert that monitors the utilised disk space on the root fs, and fires whenever that's above 80%.

Type 1 flapping:
- The scraping of the metrics works all the time (i.e. `up` is 1 all the time).
- But IO is happening that causes the 80% to be exceeded and then fallen below every few seconds.

Type 2 flapping:
- There is IO, but the utilisation is always above 80%, say it's already at ~90% all the time.
- My scrapes fail every now and then[0].

I honestly haven't even thought about type 1 yet. But I think these are the ones which would be perfectly solved by keep_firing_for. Even there, I'd still like to be able to have the keep_firing_for applied only to a given label set, e.g. something like:

keep_firing_for: 10m on {alertname=~"regex-for-my-known-flapping-alerts"}

Type 2 is the one that causes me headaches right now. That is why I thought before that it could be solved by something like keep_firing_for that also takes into account whether any of the metrics it queries were from a target that is "currently" down - and only then lets keep_firing_for take effect.

Thanks, Chris.
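Brian's point that keep_firing_for can be set per alert might look like this in a rule file (a sketch only; the alert names, durations, and the RAID expression are made up for illustration):

```yaml
groups:
  - name: example
    rules:
      # De-flap only the target-down alert: a brief successful scrape
      # in between does not resolve it for 10 minutes.
      - alert: TargetDown
        expr: up == 0
        keep_firing_for: 10m
      # The RAID alert gets no keep_firing_for, so a replaced drive
      # resolves (and a newly failed one fires) without delay.
      - alert: RaidDriveFailed
        expr: node_md_disks{state="failed"} > 0
```

This avoids the "one keep_firing_for for everything" penalty, though not the per-instance granularity asked for above.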
[0] I do have a number of hosts where this constantly happens; not really sure why, TBH, but even with a niceness of -20 and an IO-niceness of 0 (though in the best-effort class) it happens quite often. The node is under high load (it's one of our compute nodes for the LHC Computing Grid)... so I guess maybe it's just "overloaded". So I don't think this will go away, and I somehow have to get it working with the scrapes failing every now and then.

What actually puzzled me more is this:

[image: Screenshot from 2024-04-09 00-24-59.png]

That's some of the graphs from the Node Exporter Full Grafana dashboard, all for one node (which is one of the flapping ones). As you can see, Memory Basic and Disk Space Used Basic have a gap where scraping failed. My assumption was that - for a given target - either scraping fails for all metrics or succeeds for all. But here, only the right-side plots have gaps; the left-side ones don't. Maybe that's just some consequence of those using counters and rate().
[prometheus-users] Re: what to do about flapping alerts?
Hey Brian.

On Saturday, April 6, 2024 at 9:33:27 AM UTC+2 Brian Candler wrote:

>> but AFAIU that would simply affect all alerts, i.e. it wouldn't just keep firing, when the scraping failed, but also when it actually goes back to an ok state, right?
>
> It affects all alerts individually, and I believe it's exactly what you want. A brief flip from "failing" to "OK" doesn't resolve the alert; it only resolves if it has remained in the "OK" state for the keep_firing_for duration. Therefore you won't get a fresh alert until it's been OK for at least keep_firing_for and *then* fails again.

I'm still thinking about whether it is what I want - or not ;-)

Assume the following (arguably a bit made-up) example: One has a metric that counts the number of failed drives in a RAID. One drive fails, so some alert starts firing. Eventually the computing centre replaces the drive and it starts rebuilding (I guess it doesn't matter whether the rebuilding is still considered to cause an alert or not). Eventually it finishes and the alert should go away (and I should e.g. get a resolved message). But because of keep_firing_for, it doesn't stop straight away. Now, before it does, yet another disk fails. But for Prometheus, with keep_firing_for, it will look like the same alert.

As said, this example is a bit made up, because even without keep_firing_for, I wouldn't see the next device failing *while* the first one is still failing. But the point is, I will lose follow-up alerts that are close to a previous one when I use keep_firing_for to solve the flapping problem. Also, depending on how large I have to set keep_firing_for, I will also get resolved messages later... which, depending on what one does with the alerts, may also be less desirable.

> As you correctly surmise, an alert isn't really a boolean condition, it's a presence/absence condition: the expr returns a vector of 0 or more alerts, each with a unique combination of labels.
> "keep_firing_for" retains a particular labelled value in the vector for a period of time even if it's no longer being generated by the alerting "expr". Hence if it does reappear in the expr output during that time, it's just a continuation of the previous alert.

I think the main problem behind this may rather be a conceptual one, namely that Prometheus uses "no data" for "no alert", which happens as well when there is no data because of e.g. scrape failures, so it can't really differentiate between the two conditions.

What one would IMO need is a keep_firing_for that works only while the target is down. But as soon as it goes up again (even if just for one scrape), the effect would be gone and the alert would stop firing immediately (unless, of course, there's still a value that comes out). Wouldn't that make sense?

>> Similarly, when a node goes completely down (maintenance or so) and then up again, all alerts would then start again to fire (and even a generous keep_firing_for would have been exceeded)... and send new notifications.
>
> I don't understand what you're saying here. Can you give some specific examples?

Well, what I meant is basically the same as above, just outside of the flapping scenario (in which, I guess, the scrape failures never last longer than perhaps 1-10 mins):

- Imagine I have a node with several alerts firing (e.g. again that some upgrades aren't installed yet, or the root fs has too much utilisation - things which typically last unless there's some manual intervention).
- Also, I have e.g. set my Alertmanager to repeat these alerts say once a week (to nag the admin to finally do something about it).

What I'd expect should happen is e.g. the following:

- I already got the mails from the above alerts, so unless something changes, they should only be re-sent in a week.
- If one of those alerts resolves (e.g. someone frees up disk space), but disk space runs over my threshold again later, I'd like a new notification - now, not just in a week.
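The weekly-nag behaviour described here lives on the Alertmanager side; a minimal sketch (the receiver name and address are made up):

```yaml
route:
  receiver: admin-mail
  # Re-send notifications for still-firing alerts once a week
  # as a reminder.
  repeat_interval: 168h
receivers:
  - name: admin-mail
    email_configs:
      - to: admin@example.org
```

Independently of repeat_interval, Alertmanager sends a fresh notification when a previously resolved alert starts firing again, which matches the second expectation above - provided Prometheus actually resolved it in between.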
(But back now to the situation where the alert is still firing from the first time and only one mail has been sent.)

What if e.g. I reboot the system? Maybe the admin upgraded the packages with security updates and also did some firmware upgrades, which easily may take a while (we have servers where that runs for an hour or so... o.O). So the system is down for one hour, in which scraping fails (and the alert condition would be gone), and any reasonable keep_firing_for: (at least reasonable with its current semantics) will also have run out already.

The system comes up again, but the over-utilisation of the root fs is still there, and the alert that had already fired before begins again *respectively* continues to do so. At that point, we cannot really know whether it's the same alert (i.e. the alert condition never resolved) or whether it's a new one (it did resolve but came back again). (Well, in my example we can be pretty sure it's the same one, since I rebooted - but generally
[prometheus-users] what to do about flapping alerts?
Hey.

I have some simple alerts like:

- alert: node_upgrades_non-security_apt
  expr: 'sum by (instance,job) ( apt_upgrades_pending{origin!~"(?i)^.*-security(?:\\PL.*)?$"} )'
- alert: node_upgrades_security_apt
  expr: 'sum by (instance,job) ( apt_upgrades_pending{origin=~"(?i)^.*-security(?:\\PL.*)?$"} )'

If there are no upgrades, these give no value. Similarly for all other simple alerts, like free disk space:

1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="rootfs", instance!~"(?i)^.*\\.garching\\.physik\\.uni-muenchen\\.de$"} / node_filesystem_size_bytes > 0.80

No value => all ok, some value => alert.

I do have some instances which are pretty unstable (i.e. scraping fails every now and then - or more often than that), which are however mostly out of my control, so I cannot do anything about that. When the target goes down, the alert clears, and as soon as it's back, it pops up again, sending a fresh alert notification.

Now I've seen: https://github.com/prometheus/prometheus/pull/11827 which describes keep_firing_for as "the minimum amount of time that an alert should remain firing, after the expression does not return any results", respectively in https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#rule :

# How long an alert will continue firing after the condition that triggered it
# has cleared.
[ keep_firing_for: <duration> | default = 0s ]

but AFAIU that would simply affect all alerts, i.e. it wouldn't just keep firing when the scraping failed, but also when it actually goes back to an OK state, right? That's IMO rather undesirable. Similarly, when a node goes completely down (maintenance or so) and then up again, all alerts would then start again to fire (and even a generous keep_firing_for would have been exceeded)... and send new notifications.

Is there any way to solve this? Especially so that one doesn't get new notifications sent when the alert never really stopped?
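Written out as a rule file, the disk-space check above would look roughly like this (a sketch; the alert name and the for: duration are made up, and the instance filter is dropped for brevity):

```yaml
groups:
  - name: node-alerts
    rules:
      - alert: node_rootfs_usage
        # Fires while less than 20% of the root fs is available;
        # no result (no value) means everything is OK.
        expr: |
          1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="rootfs"}
            / node_filesystem_size_bytes{mountpoint="/", fstype!="rootfs"}
          > 0.80
        for: 5m
```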
At least I wouldn't understand how keep_firing_for would do this. Thanks, Chris. -- You received this message because you are subscribed to the Google Groups "Prometheus Users" group. To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/df6e33c7-b621-4f93-b265-7aad0802n%40googlegroups.com.
Re: [prometheus-users] query for time series misses samples (that should be there), but not when offset is used
Hey.

On Friday, April 5, 2024 at 7:10:29 AM UTC+2 Ben Kochie wrote:
> If the jitter is > 0.002, the real value is stored.

Interesting... though I guess that's bad for my solution in the other thread, where I make the assumption that it's guaranteed that samples are always exactly on point, with the same interval in between. Haven't checked it yet, but I'd guess that blows up the approach in the other thread.

Is there some metric to see whether such non-aligned samples occurred?

Also, what would happen if e.g. there was a first scrape which gets delayed > 0.002 s... and before that first scrape arrives, there's yet another (later) scrape which has no jitter and is on time? Are they going to be properly ordered?

Cheers, Chris.
Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?
Hey Chris.

On Thursday, April 4, 2024 at 8:41:02 PM UTC+2 Chris Siebenmann wrote:

>> - The evaluation interval is sufficiently less than the scrape
>>   interval, so that it's guaranteed that none of the `up`-samples are
>>   being missed.

I assume you were referring to the above specific point? Maybe there is a misunderstanding: with the above I merely meant that my solution requires the alert rule evaluation interval to be small enough that, when it looks at resets(up[20s] offset 60s) (which is the window from -70s to -50s PLUS an additional shift by 10s, so effectively -80s to -60s), the evaluations happen often enough that no sample can "jump over" that time window. I.e. if the scrape interval was 10s, but the evaluation interval only 20s, it would surely miss some.

> I don't believe this assumption about up{} is correct. My understanding
> is that up{} is not merely an indication that Prometheus has connected
> to the target exporter, but an indication that it has successfully
> scraped said exporter. Prometheus can only know this after all samples
> from the scrape target have been received and ingested and there are no
> unexpected errors, which means that just like other metrics from the
> scrape, up{} can only be visible after the scrape has finished (and
> Prometheus knows whether it succeeded or not).

Yes, I'd have assumed so as well. Therefore I generally shifted both alerts by 10s, hoping that 10s is enough for all that.

> How long scrapes take is variable and can be up to almost their timeout
> interval. You may wish to check 'scrape_duration_seconds'. Our metrics
> suggest that this can go right up to the timeout (possibly in the case
> of failed scrapes).

Interesting. I see the same (I mean entries that go up to and even a bit above the timeout). It would be interesting to know whether these are ones that still made it "just in time" (despite actually being a bit longer than the timeout)... or whether they are only ones that timed out and were discarded.
Because the name scrape_duration_seconds would kind of imply that it's the former, but I guess it's actually the latter.

So what do you think that means for me and my solution now? That I should shift all my checks even further - at least by scrape_timeout plus some extra time for the data getting into the TSDB?

Thanks, Chris.
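Shifting the evaluation window past the scrape timeout, as contemplated above, could look like this (a sketch; the 15s figure is an assumption: a 10s scrape_timeout plus roughly 5s of ingestion slack):

```promql
# Alert on a full minute of failed scrapes, but look at a window that
# ends 15s in the past, so every sample in the range should already
# have been ingested by the time the rule is evaluated.
max_over_time(up[1m] offset 15s) == 0
```

The trade-off is that every alert then fires (and resolves) that much later.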
Re: [prometheus-users] query for time series misses samples (that should be there), but not when offset is used
Hey Chris, Brian.

Thanks for your replies/confirmations.

On Sunday, March 24, 2024 at 8:16:14 AM UTC+1 Ben Kochie wrote:
> Yup, this is correct. Prometheus sets the timestamp of the sample at the start of the scrape. But since it's an ACID compliant database, the data is not queryable until after it's been fully ingested. This is intentional, because the idea is that whatever atomicity is desired by the target is handled by the target. Any locks taken are done when the target receives the GET /metrics. The exposition formatting, compression, and wire transfer time should not impact the time when the sample was gathered.

Does make sense, yes... is that documented somewhere? I think it would be helpful if e.g. the page about the querying basics stated these two properties:
- that data is only returned once it has fully arrived, and thus may not be returned even if the query time is after the sample time
- that Prometheus "adjusts" the timestamps within a certain range

> And yes, the timing is a tiny bit faked. There are some hidden flags that control this behavior.
>
> --scrape.adjust-timestamps
> --scrape.timestamp-tolerance
>
> The default allows up to 2ms (+-0.002) of timing jitter to be ignored. This was added in 2020 due to a regression in the accuracy of the Go internal timer functions. See: https://github.com/prometheus/prometheus/issues/7846

Makes sense, too. And it's actually vital for what I do over in https://groups.google.com/g/prometheus-users/c/BwJNsWi1LhI/m/ik2OiRa2AAAJ

Just out of curiosity: what happens if the jitter is more than the +-0.002?

Thanks, Chris.
Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?
Hey.

On Friday, March 22, 2024 at 9:20:45 AM UTC+1 Brian Candler wrote:
> You want to "capture" single scrape failures? Sure - it's already being captured. Make yourself a dashboard.

Well, as I've said before, the dashboard always has the problem that someone actually needs to look at it.

> But do you really want to be *alerted* on every individual one-time scrape failure? That goes against the whole philosophy of alerting <https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit>, where alerts should be "urgent, important, actionable, and real". A single scrape failure is none of those.

I guess in the end I'll see whether or not I'm annoyed by it. ;-)

> How often do you get hosts where: (1) occasional scrape failures occur; and (2) there are enough of them to make you investigate further, but not enough to trigger any alerts?

So far I've seen two kinds of nodes: those where I never get scrape errors, and those where they happen regularly - and probably need investigation.

Anyway... I think I might have found a solution which - if some assumptions I've made are correct - I'm somewhat confident works, even in the strange cases.

The assumptions I've made are basically these:

- Prometheus does that "faking" of sample times, and thus these are always on point with exactly the scrape interval between each. This in turn should mean that if I have e.g. a scrape interval of 10s, and I do up[20s], then regardless of when this is done, I get at least 2 samples, and in some rare cases (when the evaluation happens exactly on a scrape time), 3 samples. Never more, never less. Which for `up` I think should be true, as Prometheus itself generates it, right - and not the exporter that is scraped.
- The evaluation interval is sufficiently less than the scrape interval, so that it's guaranteed that none of the `up`-samples are being missed.
- After some small time (e.g. 10s) it's guaranteed that all samples are in the TSDB and a query will return them (basically, to counter the observation I've made in https://groups.google.com/g/prometheus-users/c/mXk3HPtqLsg ).
- Both alerts run in the same alert group, which means (I hope) that each query in them is evaluated with respect to the very same time.

With that, my final solution would be:

- alert: general_target-down (TD below)
  expr: 'max_over_time(up[1m] offset 10s) == 0'
  for: 0s
- alert: general_target-down_single-scrapes (TDSS below)
  expr: 'resets(up[20s] offset 60s) >= 1 unless max_over_time(up[50s] offset 10s) == 0'
  for: 0s

And that seems to actually work, at least for practical cases (of course it's difficult to simulate the cases where the evaluation happens right at the time of a scrape).

For anyone who'd ever be interested in the details, and why I think it works in all cases, I've attached the git logs where I describe the changes in my config git below.

Thanks to everyone for helping me with that :-)

Best wishes, Chris.

(needs a mono-spaced font to work out nicely)

TL/DR:

- commit f31f3c656cae4aeb79ce4bfd1782a624784c1c43
  Author: Christoph Anton Mitterer
  Date: Mon Mar 25 02:01:57 2024 +0100

  alerts: overhauled the `general_target-down_single-scrapes`-alert

  This is a major overhaul of the `general_target-down_single-scrapes`-alert, which turned out to have been quite an effort that went over several months.

  Before this branch was merged, the `general_target-down_single-scrapes`-alert (from now on called "TDSS") had various issues. While the alert did stop firing when the `general_target-down`-alert (from now on called "TD") started to do so, it would still also fire when scrapes failed that eventually turned out to be an actual TD. For example, the first few (< ≈7) `0`s would have caused TDSS to fire, which would seamlessly be replaced by a firing TD (unless any `1`s came in between).

  Assumptions made below:
  • The scraping interval is `10s`.
  • If a (single) time series for the `up`-metric is given like `0 1 0 0 1`, time goes from left (farther back in time) to right (less far back in time).

  I) Goals

  There should be two alerts:

  • TD
    Is for general use and similar to Icinga's concept of a host being `UP` or `DOWN` (with the minor difference that an unreachable Prometheus target does not necessarily mean that a host is `DOWN` in that sense). It should fire after scraping has failed for some time, for example one minute (which is assumed from now on).

  • TDSS
    Since Prometheus is all about monitoring metrics, it's of interest whether the scraping fails, even if it's only every now and then for very short amounts of time, because in that ca
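For clarity, the two final rules quoted in this mail, written out as a complete rule-group sketch (the group name is mine; both rules must sit in the same group per the assumptions above):

```yaml
groups:
  - name: scrape-health
    rules:
      # TD: fires when every scrape in the shifted last minute failed.
      - alert: general_target-down
        expr: 'max_over_time(up[1m] offset 10s) == 0'
        for: 0s
      # TDSS: fires on a 1->0 transition in the -80s..-60s window,
      # unless the more recent samples show an (emerging) full outage.
      - alert: general_target-down_single-scrapes
        expr: 'resets(up[20s] offset 60s) >= 1 unless max_over_time(up[50s] offset 10s) == 0'
        for: 0s
```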
[prometheus-users] query for time series misses samples (that should be there), but not when offset is used
Hey.

I noticed a somewhat unexpected behaviour; perhaps someone can explain why this happens.

- on a Prometheus instance, with a scrape interval of 10s
- doing the following queries via curl from the same node where Prometheus runs (so there cannot be any different system times or so)

Looking at the sample times via e.g.:

$ while true; do curl -g 'http://localhost:9090/api/v1/query?query=up[1m]' 2> /dev/null | jq '.data.result[0].values' | grep '[".]' | paste - - ; echo; sleep 1 ; done

the timings look super tight:

1711148768.175,"1"
1711148778.175,"1"
1711148788.175,"1"
1711148798.175,"1"
1711148808.175,"1"
1711148818.175,"1"

1711148768.175,"1"
1711148778.175,"1"
1711148788.175,"1"
1711148798.175,"1"
1711148808.175,"1"
1711148818.175,"1"

I.e. it's *always* .175. I guess in reality it may not actually be that tight, and Prometheus just sets the timestamps artificially... but that doesn't matter for me.

When now doing a query like (in a while loop with no delay):

up[1m]

and counting the number of samples, I'd expect to always get either 6 samples or perhaps 7 (if my query happened exactly at a .175 time). But since the sample times are so super tight, I'd never expect to get less than 6. Yet that's just what happens:

1711148408.137921942
1711148408.179789148
1711148408.190407865
1711148408.239896472
    1711148358.175,"1"
    1711148368.175,"1"
    1711148378.175,"1"
    1711148388.175,"1"
    1711148398.175,"1"
1711148408.249002352
1711148408.287031384
    1711148358.175,"1"
    1711148368.175,"1"
    1711148378.175,"1"
    1711148388.175,"1"
    1711148398.175,"1"
1711148408.294628944
1711148408.342150984
1711148408.351871893
1711148408.405270701

Here, the non-indented times are timestamps from before and after the whole curl .. | .. pipe. The indented lines are the samples + timestamps in those cases where != 6 samples are returned, done via something similarly hacky like:

$ while true; do f="$( date +%s.%N >&2; curl -g 'http://localhost:9090/api/v1/query?query=up[1m]' 2> /dev/null | jq '.data.result[0].values' | grep '[".]' | paste - - ; date +%s.%N >&2)"; if [ "$( printf '%s\n' "$f" | wc -l)" != 8 ]; then printf '\n%s\n\n' "$f"; fi ; done

One sees that both times, before and after the curl, are already past the .175, yet the most recent sample (which should already be there - and which in fact shows up at 1711148408.175 in later queries) is still missing.

Interestingly, when doing these queries with offset 10s (beware that curl requires %20 as space)... none of this happens and I basically always get 6 samples - as more or less expected. [I say more or less, because I wonder whether it's possible to get 7... should it be?]

Any ideas why? And especially, why not with an offset?

Thanks, Chris.
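The "always 6, sometimes 7" expectation follows from simple arithmetic over the aligned timestamps; a quick self-contained check (pure Python, no Prometheus involved; the function name and the closed-interval window `[T-60, T]` are my own modelling assumptions for Prometheus 2.x range selectors):

```python
from fractions import Fraction
from math import floor, ceil

def samples_in_window(T, interval=Fraction(10), phase=Fraction(7, 40),
                      window=Fraction(60)):
    """Count aligned sample timestamps k*interval + phase that fall
    inside the closed window [T - window, T]."""
    k_hi = floor((T - phase) / interval)          # newest sample <= T
    k_lo = ceil((T - window - phase) / interval)  # oldest sample >= T - window
    return k_hi - k_lo + 1

# Evaluate at query times from 60s to 120s in 25 ms steps (Fractions
# avoid float rounding; some of these steps hit the .175 alignment).
counts = {samples_in_window(Fraction(i, 40)) for i in range(40 * 60, 40 * 120)}
print(sorted(counts))  # prints [6, 7]
```

So, under perfectly aligned timestamps, fewer than 6 samples should indeed never occur; the observed 5 must come from the most recent sample not yet being queryable.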
Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?
I've been looking into possible alternatives, based on the ideas given here.

I) First, one completely different approach might be:

- alert: target-down
  expr: 'max_over_time( up[1m0s] ) == 0'
  for: 0s

and one of:

- alert: single-scrape-failure
  expr: 'min_over_time( up[2m0s] ) == 0'
  for: 1m

or

- alert: single-scrape-failure
  expr: 'resets( up[2m0s] ) > 0'
  for: 1m

or perhaps even

- alert: single-scrape-failure
  expr: 'changes( up[2m0s] ) >= 2'
  for: 1m

(which would however behave a bit differently, I guess)

plus an inhibit rule that silences single-scrape-failure when target-down fires. The for: 1m is needed so that target-down has a chance to fire (and inhibit) before single-scrape-failure does.

I'm not really sure whether that works in all cases, though, especially since I look back much further (and the additional time span further back may undesirably trigger again).

Using for: > 0 seems generally a bit fragile for my use-case: because I want to capture even single scrape failures, with for: > 0 I need at least two evaluations to actually trigger, so my evaluation period must be small enough that the rule is evaluated >= 2 times during the scrape interval. Also, I guess the scrape intervals and the evaluation intervals are not synced, so with for: 0s, when I look back e.g. [1m] and assume a certain number of samples in that range, there may actually be more or fewer.
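The inhibit rule mentioned in approach I would live in the Alertmanager config; a sketch (assuming the alert names above, and that both alerts carry matching instance/job labels):

```yaml
inhibit_rules:
  # While target-down fires for an instance, suppress the
  # single-scrape-failure alert for that same instance.
  - source_matchers:
      - alertname = "target-down"
    target_matchers:
      - alertname = "single-scrape-failure"
    equal: ['instance', 'job']
```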
If I forget about the above approach with inhibiting, then I need to consider cases like (time goes left to right):

- 0 1 0 0 0 0 0 0
  The first zero should be a single-scrape-failure, the last 6 however a target-down.
- 1 0 0 0 0 0 1 0 0 0 0 0 0
  Same here: the first 5 should be a single-scrape-failure, the last 6 however a target-down.
- 1 0 0 0 0 0 0 1 0 0 0 0 0 0
  Here however, both should be target-down.
- 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 or 1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
  Here: 2x target-down, 1x single-scrape-failure.

II) Using the original {min,max}_over_time approach:

- min_over_time(up[1m]) == 0 tells me there was at least one missing scrape in the last 1m. But that alone would already be the case for the first zero: . . . . . 0
- So for: 1m was added (and the [1m] was enlarged). But this would still fire with 0 0 0 0 0 0 0, which should however be a target-down.
- So unless max_over_time(up[1m]) == 0 was added to silence it then. But that would still fail e.g. in the case when a previous target-down runs out: 0 0 0 0 0 0 -> target-down; then the next sample is a 1: 0 0 0 0 0 0 1 -> single-scrape-failure. And some similar cases.

Plus the usage of for: >0s is - in my special case - IMO fragile.
III) So in my previous mail I came up with the idea of using:

- alert: target-down
  expr: 'max_over_time( up[1m0s] ) == 0'
  for: 0s
- alert: single-scrape-failure
  expr: 'min_over_time(up[15s] offset 1m) == 0
         unless max_over_time(up[1m0s]) == 0
         unless max_over_time(up[1m0s] offset 1m10s) == 0
         unless max_over_time(up[1m0s] offset 1m) == 0
         unless max_over_time(up[1m0s] offset 50s) == 0
         unless max_over_time(up[1m0s] offset 40s) == 0
         unless max_over_time(up[1m0s] offset 30s) == 0
         unless max_over_time(up[1m0s] offset 20s) == 0
         unless max_over_time(up[1m0s] offset 10s) == 0'
  for: 0m

The idea was that, when I don't use for: >0s, the first time window where one can be really sure (in all cases) whether it's a single-scrape-failure or a target-down is a 0 in -70s to -60s:

-130s -120s -110s -100s -90s -80s -70s -60s -50s -40s -30s -20s -10s 0s/now
    |     |     |     |    |    |  0 |    |    |    |    |    |    |
    |     |     |     |    |    |    |    |    |    |  1 |  0 |  1 |  case 1
    |     |     |     |    |    |  0 |  0 |  0 |  0 |  0 |  0 |  0 |  case 2
    |     |     |  1  |  0 |  0 |  0 |  0 |  0 |  0 |  0 |  1 |  1 |  case 3

In case 1 it would already be clear when the zero is between -20s and -10s. But if there's a sequence of zeros, it takes until -70s to -60s before it becomes clear. Now, the zero in that time span could also be part of a target-down sequence of zeros like in case 3. For these cases, I had the shifted silencers that each looked over 1m.

It looked good at first, though there were some open questions. But there is at least one main problem, namely that it would fail in e.g. this case:

-130s -120s -110s -100s -90s -80s -70s -60s -50s -40s -30s -20s -10s 0s/now
    |  1  |  1  |  1  |  1 |  1 |  1 | 0 1|  0 |  0 |  0 |  0 |  0 |  0 |  case 8a

The zero between -70s and -60s would be noticed, but still be silenced, because the 1 would not.

Chris Siebenmann suggested using resets()... and keep_firing_for:, which Ben Kochie suggested, too. At first I didn't quite understand how the latter would help me?
Maybe I have the wrong mindset for it, so could you guys please explain what your idea was with keep_firing_for:?

IV) resets() sounded promising at first, but while I tried quite some variations, I wasn't able to get anything working.

First, something like resets(up[1m]) >= 1 alone (with or without a for: >0s) would already fire in case of:

1 0

which could still become a target-down, but also in case of:

1 0 0 0 0 0 0

which is a target-down. And I think even
Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?
Hey Chris.

On Sun, 2024-03-17 at 22:40 -0400, Chris Siebenmann wrote:
> One thing you can look into here for detecting and counting failed
> scrapes is resets(). This works perfectly well when applied to a gauge

Though it is documented as only to be used with counters... :-/

> that is 1 or 0, and in this case it will count the number of times the
> metric went from 1 to 0 in a particular time interval. You can
> similarly use changes() to count the total number of transitions
> (either 1->0 scrape failures or 0->1 scrapes starting to succeed after
> failures).

The idea sounds promising... especially to also catch cases like 8a, which I mentioned in my previous mail and where the {min,max}_over_time approach seems to fail.

> It may also be useful to multiply the result of this by the current
> value of the metric, so for example:
>
> resets(up{..}[1m]) * up{..}
>
> will be non-zero if there have been some number of scrape failures over
> the past minute *but* the most recent scrape succeeded (if that scrape
> failed, you're multiplying resets() by zero and getting zero). You can
> then wrap this in an '(...) > 0' to get something you can maybe use as
> an alert rule for the 'scrapes failed' notification. You might need to
> make the range for resets() one step larger than you use for the
> 'target-down' alert, since resets() will also be zero if up{...} was
> zero all through its range.
>
> (At this point you may also want to look at the alert 'keep_firing_for'
> setting.)

I will give that some more thinking and reply back if I find some way to make an alert out of this. Well, and probably also if I fail to ^^ ... at least at a first glance I wasn't able to use that to create an alert that would behave as desired.
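For reference, Chris's resets()-times-up suggestion expressed as a rule (a sketch of his expression; the alert name and the 1m window are illustrative, not from his mail):

```yaml
- alert: single-scrape-failure
  # Non-zero (and hence firing) only if at least one 1->0 transition
  # happened in the last minute *and* the most recent scrape succeeded.
  expr: '(resets(up[1m]) * up) > 0'
  for: 0s
```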
:/

> However, my other suggestion here would be that this notification or
> count of failed scrapes may be better handled as a dashboard or a
> periodic report (from a script) instead of through an alert, especially
> a fast-firing alert.

Well, the problem with a dashboard would IMO be that someone must actually look at it, or otherwise it would be pointless. ;-) Not really sure how to do that with a script (which I guess would be conceptually similar to an alert... just that it's sent e.g. weekly).

I guess I'm not so much interested in the exact times when single scrapes fail (I cannot correct it retrospectively anyway), but just *that* it happens and that I have to look into it. My assumption kind of is that normally scrapes aren't lost. So I would really only get an alert mail if something's wrong. And even if the alert is flaky, like in 1 0 1 0 1 0, I think the mail could still be reduced at the alertmanager level?

> I think it will be relatively difficult to make an alert give you an
> accurate count of how many times this happened; if you want such a
> count to make decisions, a dashboard (possibly visualizing the up/down
> blips) or a report could be better. A program is also in the position
> to extract the raw up{...} metrics (with timestamps) and then readily
> analyze them for things like how long the failed scrapes tend to last
> for, how frequently they happen, etc etc.

Well, that sounds like quite some effort... and I already think that my current approaches required far too much effort (and still don't fully work ^^). As said... despite not really being comparable to Prometheus: in Icinga a failed sensor probe would be immediately noticeable.

Thanks, Chris.

--
You received this message because you are subscribed to the Google Groups "Prometheus Users" group. To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/f3549318aaa5f5b4fa0a01fb20c44e30769f540a.camel%40gmail.com.
[prometheus-users] Re: better way to get notified about (true) single scrape failures?
Hey there.

I eventually got back to this and I'm still fighting this problem. As a reminder, my goal was:

- if e.g. scrapes fail for 1m, a target-down alert shall fire (similar to how Icinga would put the host into down state after pings have failed for a number of seconds)
- but even if only a single scrape fails (which alone wouldn't trigger the above alert) I'd like to get a notification (telling me that something might be fishy with the networking or so), UNLESS that single failed scrape is part of a sequence of failed scrapes that also caused / will cause the above target-down alert.

Assuming in the following that each number is a sample value (~10s apart) of the `up` metric of a single host, with the most recent one being the right-most:

- 1 1 1 1 1 1 1 => should give nothing
- 1 1 1 1 1 1 0 => should NOT YET give anything (might be just a single failure, or develop into the target-down alert)
- 1 1 1 1 1 0 0 => same as above, not clear yet
- ...
- 1 0 0 0 0 0 0 => here it's clear, this is a target-down alert

In the following:

- 1 1 1 1 1 0 1
- 1 1 1 1 0 0 1
- 1 1 1 0 0 0 1

... should eventually (not necessarily right after the right-most 1, though) all give a "single-scrape-failure" alert (even though it's more than just one scrape, it's not a target-down), simply because there are 0s, but over a time span of less than 1m.

- 1 0 1 0 0 0 0 0 0 should give both a single-scrape-failure alert (the left-most single 0) AND a target-down alert (the 6 consecutive zeros)
- 1 0 1 0 1 0 0 0 should give at least 2x a single-scrape-failure alert; for the right-most zeros it's not yet clear what they'll become
- 0 0 0 0 0 0 0 0 0 0 0 0 (= 2x six zeros) should give only 1 target-down alert
- 0 0 0 0 0 0 1 0 0 0 0 0 0 (= 2x six zeros, separated by a 1) should give 2 target-down alerts

Whether each of such alerts (e.g. in the 1 0 1 0 1 0 ...)
case actually results in a notification (mail) is of course a different matter and depends on the alertmanager configuration, but at least the alert should fire, and with the right alertmanager config one should actually get a notification for each single failed scrape.

Now, Brian has already given me some pretty good ideas how to do this; basically the ideas were (assuming that 1m makes the target down, and a scrape interval of 10s):

For the target-down alert:

a) expr: 'up == 0'
   for: 1m
b) expr: 'max_over_time(up[1m]) == 0'
   for: 0s

=> here (b) was probably better, as it would use the same condition as is also used in the alert below, and there can be no weird timing effects depending on the for: and when these are actually evaluated.

For the single-scrape-failure alert:

A) expr: min_over_time(up[1m20s]) == 0 unless max_over_time(up[1m]) == 0
   for: 1m10s
   (numbers a bit modified from Brian's example, but I think the idea is the same)
B) expr: min_over_time(up[1m10s]) == 0 unless max_over_time(up[1m10s]) == 0
   for: 1m

=> I did test (B) quite a lot, but there was at least still one case where it failed, and that was when there were two consecutive but distinct target-down errors, that is:

0 0 0 0 0 0 1 0 0 0 0 0 0 (= 2x six zeros, separated by a 1)

which would eventually look like e.g. 0 1 0 0 0 0 0 0 or 0 0 1 0 0 0 0 0 in the above check, and thus trigger (via the left-most zeros) a false single-scrape-failure alert.

=> I'm not so sure whether I truly understand (A)... especially with respect to any niche cases when there's jitter or so (plus, IIRC, it also failed in the case described for (B)).

One approach I tried in the meantime was to use sum_over_time... and then the idea was simply to check how many 1s there are for each case. But it turns out that even if everything runs normally, the sum is not stable... sometimes, over [1m], I got only 5, whereas most of the time it was 6.
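Variants (b) and (B) from above, assembled into one rule group sketch (the group name and 10s interval are illustrative; note this still has the failure case with two back-to-back outages described above):

```yaml
groups:
  - name: alerts_general_single-scrapes
    interval: 10s
    rules:
      # "slow" alert: every scrape in the window failed
      - alert: target-down
        expr: 'max_over_time(up[1m]) == 0'
        for: 0s
      # "fast" alert: at least one scrape in the window failed,
      # unless ALL of them failed (then target-down covers it)
      - alert: single-scrape-failure
        expr: 'min_over_time(up[1m10s]) == 0 unless max_over_time(up[1m10s]) == 0'
        for: 1m
```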
Not really sure how that comes, because the printed timestamps for each sample seem to be super accurate (all the time), but the sum wasn't.

So I tried a different approach now, based on the above from Brian... which at least in tests looks promising so far... but I'd like to hear what experts think about it:

- both alerts have to be in the same alert group (I assume this ensures they're then evaluated in the same thread and at the "same time", that is, with respect to the same reference timestamp)
- in my example I assume a scrape interval of 10s and an evaluation interval of 7s (not really sure whether the latter matters, or whether it could be changed while the rules stay the same and it would still work)
- for: is always 0s ... I think that's good, because at least to me it's unclear how things are evaluated if the two alerts have different values for for:, especially in border cases.
- rules:
  - alert: target-down
    expr: 'max_over_time( up[1m0s] ) == 0'
    for: 0s
  - alert: single-scrape-failure
    expr: 'min_over_time(up[15s] offset 1m) == 0 unless
[prometheus-users] Re: better way to get notified about (true) single scrape failures?
Hey Brian.

On Wednesday, May 10, 2023 at 9:03:36 AM UTC+2 Brian Candler wrote:

> It depends on the exact semantics of "for". e.g. take a simple case of
> 1 minute rule evaluation interval. If you apply "for: 1m" then I guess
> that means the alert must be firing for two successive evaluations
> (otherwise, "for: 1m" would have no effect).

Seems you're right. I did quite some testing meanwhile with the following alertmanager route (note that I didn't use 5m, but 1m... simply in order to not have to wait so long):

routes:
  - match_re:
      alertname: 'td.*'
    receiver: admins_monitoring
    group_by: [alertname]
    group_wait: 0s
    group_interval: 1s

and the following rules:

groups:
  - name: alerts_general_single-scrapes
    interval: 15s
    rules:
      - alert: td-fast
        expr: 'min_over_time(up[75s]) == 0 unless max_over_time(up[75s]) == 0'
        for: 1m
      - alert: td
        expr: 'up == 0'
        for: 1m

My understanding is, correct me if wrong, that basically Prometheus would run a thread for the scrape job (which in my case would have an interval of 15s) and another one that evaluates the alert rules (above, every 15s), which then sends the alert to the alertmanager (if firing).

It felt a bit brittle to have the rules evaluated with the same period as the scrapes, so I did all tests once with 15s for the rules interval, and once with 10s. But it seems as if this wouldn't change the behaviour.

> But up[5m] only looks at samples wholly contained within a 5 minute
> window, and therefore will normally only look at 5 samples.

As you can see above... I had already noticed that you were indeed right before, and if my for: is e.g. 4 * evaluation_interval(15s) = 1m ... I need to look back 5 * evaluation_interval(15s) = 75s.

At least in my tests, that seemed to cause the desired behaviour, except for one case: when my "slow" td fires (i.e. after 5 consecutive "0"s) and then there is, within (less than?) 1m, another sequence of "0"s that eventually causes a "slow" td.
In that case, td-fast fires for a while, until it directly switches over to td firing. Was your idea above with something like:

> expr: min_over_time(up[8m]) == 0 unless max_over_time(up[6m]) == 0
> for: 7m

intended to fix that issue? Or could one perhaps use

ALERTS{alertname="td",instance="lcg-lrz-ext.grid.lrz.de",job="node"}[??s] == 1

somehow, to check whether it did fire... and then silence the false positive.

> (If there is jitter in the sampling time, then occasionally it might
> look at 4 or 6 samples)

Jitter in the sense that the samples are taken at slightly different times? Do you think that could affect the desired behaviour? I would intuitively expect that it rather only causes the "base duration" not to be exactly e.g. 1m ... so e.g. instead of taking 1m for the "slow" td to fire, it would happen +/- 15s earlier (and conversely for td-slow).

Another point I basically don't understand... how does all that relate to the scrape intervals?

> The plain up == 0 simply looks at the most recent sample (going back up
> to 5m as you've said in the other thread). The series up[Ns] looks back
> N seconds, giving whichever samples are within there and now.

AFAIU, it doesn't go "automatically" back any further there (like the 5m above), right? In order for the for: to work I need at least two samples... so doesn't that mean that as soon as any scrape interval is larger than for:-time(1m) / 2 = ~30s (in the above example), the above two alerts will never fire, even if the target is down?

So if I had e.g. some jobs scraping only every 10m ... I'd need another pair of td/td-fast alerts which then filter on the job (up{job="longRunning"}), and either only have td (if that makes sense)... or a td-fast for when one of the every-10m scrapes fails, and an even longer "slow" td for when that fails for e.g. 1h.

> If what I've written above is correct (and it may well not be!), then
>
> expr: up == 0
> for: 5m
>
> will fire if "up" is zero for 6 cycles, whereas

As far as I understand you... 6 cycles of the rule evaluation interval...
with at least two samples within that interval, right?

> ... unless max_over_time(up[5m]) will suppress an alert if "up" is zero
> for (usually) 5 cycles.

Last but not least, an (only) partially related question: once an alert fires (in Prometheus), even if just for one evaluation interval cycle, and there is no inhibition rule or so in alertmanager... is it expected that a notification is sent out for sure, regardless of alertmanager's grouping settings?

Like when the alert fires for one short 15s evaluation interval and clears again afterwards... but group_wait: is set to some 7d ... is it expected to send that single firing event after 7d, even if it has resolved already once the 7d are over and there was e.g. no further firing in between?

Thanks a lot :-)
Chris.
[prometheus-users] Re: better way to get notified about (true) single scrape failures?
Hey Brian.

On Tuesday, May 9, 2023 at 9:55:22 AM UTC+2 Brian Candler wrote:

> That's tricky to get exactly right. You could try something like this
> (untested):
>
> expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
> for: 5m
>
> - min_over_time will be 0 if any single scrape failed in the past 5 minutes
> - max_over_time will be 0 if all scrapes failed (which means the
>   'standard' failure alert should have triggered)
>
> Therefore, this should alert if any scrape failed over 5 minutes,
> unless all scrapes failed over 5 minutes.

Ah, that seems a pretty smart idea. And the for: is needed to make it actually "count", as the [5m] only looks back 5m; but there, max_over_time(up[5m]) would likely have still been 1 while min_over_time(up[5m]) would already be 0, and if one had then e.g. for: 0s, it would fire immediately.

> There is a boundary condition where if the scraping fails for
> approximately 5 minutes you're not sure if the standard failure alert
> would have triggered.

You mean like the above one wouldn't fire because it thinks it's the long-term alert, while that wouldn't fire either, because it has just resolved then?

> Hence it might need a bit of tweaking for robustness. To start with,
> just make it over 6 minutes:
>
> expr: min_over_time(up[6m]) == 0 unless max_over_time(up[6m]) == 0
> for: 6m
>
> That is, if max_over_time[6m] is zero, we're pretty sure that a
> standard alert will have been triggered by then.

That one I don't quite understand. What if e.g.
the following scenario happens (with each line giving the state 1m after the one before):

m:   -5 -4 -3 -2 -1  0 | for | min[6m] | max[6m] | result/short (for=6) | result/long (for=5)
up:   1  1  1  1  1  0 |  1  |    0    |    1    | pending              | pending
up:   1  1  1  1  0  0 |  2  |    0    |    1    | pending              | pending
up:   1  1  1  0  0  0 |  3  |    0    |    1    | pending              | pending
up:   1  1  0  0  0  0 |  4  |    0    |    1    | pending              | pending
up:   1  0  0  0  0  0 |  5  |    0    |    1    | pending              | fire
up:   0  0  0  0  0  1 |  6  |    0    |    1    | fire                 | clear

After 5m, the long-term alert would fire; after that, the scraping would succeed again, but AFAIU the "special" alert for the short outages would still be true at that point and then start to fire, despite all the previous 5 zeros having actually been reported as part of a long-down alert.

> I'm still not quite convinced about the "for: 6m" and whether we might
> lose an alert if there were a single failed scrape. Maybe this would be
> more sensitive:
>
> expr: min_over_time(up[8m]) == 0 unless max_over_time(up[6m]) == 0
> for: 7m
>
> but I think you might get some spurious alerts at the *end* of a period
> of downtime.

That also seems quite complex. And I guess it might have the same possible issue as above? The same should be the case if one would do:

expr: min_over_time(up[6m]) == 0 unless max_over_time(up[5m]) == 0
for: 6m

It may be just 6m ago that there was a "0" (from a long alert) and in the last 5m there would have been "1"s. So the short-alert would fire, despite it being unclear whether the "0" 6m ago was really just a lonely one or the end of a long-alert period.

Actually, I think any case where the min_over_time goes further back than the long-alert's for:-time should have that problem.

expr: min_over_time(up[5m]) == 0 unless max_over_time(up[6m]) == 0
for: 5m

would also be broken, IMO, because if 6m ago there was a "1", only the min_over_time(up[5m]) == 0 would remain (and nothing would silence the alert if needed)... and if 6m ago there was a "0", it should effectively be the same as using [5m]?

Isn't the problem from the very above already solved by placing both alerts in the same rule group?
https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/ says: "Recording and alerting rules exist in a rule group. Rules within a group are run sequentially at a regular interval, with the same evaluation time." ... which I guess applies also to alert rules.

Not sure if I'm right, but I think if one places both rules in the same group (and I think even the order shouldn't matter?), then the original:

expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
for: 5m

with 5m being the "for:"-time of the long-alert, should be guaranteed to work... in the sense that if the above doesn't fire, the long-alert does. Unless of course the grouping settings at alertmanager cause trouble... which I don't quite understand: especially, once an alert fires, even if just briefly... is it guaranteed that a notification is sent? Because as I wrote before, that didn't seem to be the case.

Last but not least, if my assumption is true and your 1st version would work if both alerts are in the same group... how would the interval then matter? Would it still need to be the smallest scrape time (I guess so)?

Thanks, Chris.
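The same-group setup discussed here, sketched out in full (group and alert names are illustrative; the 5m values mirror the example):

```yaml
groups:
  - name: target_down_alerts
    rules:
      # long-term alert: target continuously down for 5m
      - alert: general_target-down
        expr: 'up == 0'
        for: 5m
      # short-outage alert: some scrape failed in the last 5m, unless
      # ALL scrapes failed (then the long-term alert covers it). Being
      # in the same group, both rules share the same evaluation time.
      - alert: general_target-down_single-scrapes
        expr: 'min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0'
        for: 5m
```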
[prometheus-users] better way to get notified about (true) single scrape failures?
Hey.

I have an alert rule like this:

groups:
  - name: alerts_general
    rules:
      - alert: general_target-down
        expr: 'up == 0'
        for: 5m

which is intended to notify about a target instance (respectively a specific exporter on that instance) being down. There are also routes in alertmanager.yml which have some "higher" periods for group_wait and group_interval and also distribute the resulting alerts to the various receivers (e.g. depending on the instance that is affected).

By chance I've noticed that some of our instances (or the networking) seem to be a bit unstable, and every now and so often a single scrape, or some few, fail. Since this typically does not mean that the exporter is down (in the above sense), I wouldn't want that to cause a notification to be sent to the people responsible for the respective instances. But I would want one sent, even if only a single scrape fails, to the local Prometheus admin (me ^^), so that I can look further into what causes the scrape failures.

My (working) solution for that is:

a) another alert rule like:

groups:
  - name: alerts_general_single-scrapes
    interval: 15s
    rules:
      - alert: general_target-down_single-scrapes
        expr: 'up{instance!~"(?i)^.*\\.garching\\.physik\\.uni-muenchen\\.de$"} == 0'
        for: 0s

(With 15s being the smallest scrape interval used by any job.)

And a corresponding alertmanager route like:

- match:
    alertname: general_target-down_single-scrapes
  receiver: admins_monitoring_no-resolved
  group_by: [alertname]
  group_wait: 0s
  group_interval: 1s

The group_wait: 0s and group_interval: 1s seemed necessary because, despite the for: 0s, it seems that alertmanager kind of checks again before actually sending a notification... and when the alert is gone by then (because there was e.g. only one single missing scrape), it wouldn't send anything (despite the alert actually having fired).

That works so far...
that is, admins_monitoring_no-resolved get a notification for every single failed scrape, while all others only get them when scrapes fail for at least 5m.

I even improved the above a bit, by clearing the alert for single failed scrapes when the one for long-term down starts firing, via something like:

expr: '( up{instance!~"(?i)^.*\\.ignored\\.hosts\\.example\\.org$"} == 0 ) unless on (instance,job) ( ALERTS{alertname="general_target-down", alertstate="firing"} == 1 )'

I wondered whether this can be done better? Ideally I'd like to get notifications for general_target-down_single-scrapes only sent if there would be none for general_target-down. That is, I don't care if the notification comes in late (by the above ~5m), it just *needs* to come, unless - of course - the target is "really" down (that is, when general_target-down fires), in which case no notification should go out for general_target-down_single-scrapes.

I couldn't think of an easy way to get that. Any ideas?

Thanks, Chris.
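The improved single-scrape rule described above, written out as a full rule (the ignored-hosts regex is the placeholder from the example):

```yaml
groups:
  - name: alerts_general_single-scrapes
    interval: 15s
    rules:
      # Fires on any failed scrape, but is suppressed once the long-term
      # general_target-down alert is firing for the same instance/job
      # (via the built-in ALERTS metric).
      - alert: general_target-down_single-scrapes
        expr: |
          ( up{instance!~"(?i)^.*\\.ignored\\.hosts\\.example\\.org$"} == 0 )
            unless on (instance, job)
          ( ALERTS{alertname="general_target-down", alertstate="firing"} == 1 )
        for: 0s
```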
[prometheus-users] Re: how to make sure a metric is to be checked is "there"
Hey again.

On Wednesday, April 26, 2023 at 9:35:32 AM UTC+2 Brian Candler wrote:

> > expr: up{job="myjob"} == 1 unless my_metric
>
> Beware with that, that it will only work if the labels on both 'up' and
> 'my_metric' match exactly. If they don't, then you can either use
> on(...) to specify the set of labels which match, or ignoring(...) to
> specify the ones which don't. You could start with:
>
> expr: up{job="myjob"} == 1 unless on (instance) my_metric

Ah, I see. I guess one should use on(...) rather than ignoring(...), because one doesn't really know which labels may get added, right?

Also, wouldn't it be better to also consider the "job" label?

expr: up{job="myjob"} == 1 unless on (instance, job) my_metric

because AFAIU, job is set by Prometheus itself, so if I operate on it as well, I can make sure that my_metric is really from the desired job - and not perhaps from some other job that wrongly exports a metric of that name. Does that make sense?

> but I believe this will break if there are multiple instances of
> my_metric for the same host. I'd probably do:
>
> expr: up{job="myjob"} == 1 unless on (instance) count by (instance) (my_metric)

So with job that would be:

expr: up{job="myjob"} == 1 unless on (instance,job) count by (instance,job) (my_metric)

but I don't quite understand why it's needed in the first place?! If I do the previous:

expr: up{job="myjob"} == 1 unless on (instance) my_metric

then even if, for one given instance value (and optionally one given job value), there are multiple results for my_metric (just differing in other labels), like:

node_filesystem_free_bytes{device="/dev/vda1",fstype="vfat",mountpoint="/boot/efi"} 5.34147072e+08
node_filesystem_free_bytes{device="/dev/vda2",fstype="btrfs",mountpoint="/"} 1.2846592e+10
node_filesystem_free_bytes{device="/dev/vda2",fstype="btrfs",mountpoint="/data/btrfs-top-level-subvolumes/system"} 1.2846592e+10

(all with the same instance/job), shouldn't the "unless on (instance)" still work?
I mean, it wouldn't notice if only one time series were gone (like e.g. only device="/dev/vda1" above), but it should if all were gone? But the count by would also only notice it if all were gone, because only then does it give back no data for the respective instance (and not just 0 as value)?

> Also, if a scrape does not contain a particular timeseries, but the
> previous scrape *did* contain that timeseries, then the timeseries is
> marked "stale" by storing a staleness marker.

Is there a way to test for that marker in expressions?

> So if you do see a value, it means:
> - it was in the last scrape
> - it was in the last 5 minutes
> - there has not been a subsequent scrape where the timeseries was missing

Ah, good to know.

> > Is this with absent() also needed when I have all my targets/jobs
> > statically configured?
>
> Use absent() when you need to write an expression which you can't do as
> a join against another existing timeseries.

Okay... but AFAIU I couldn't use absent() to reproduce the effect of the above:

up{job="myjob"} == 1 unless on (instance) my_metric

because if I'd do something like:

absent(my_metric)

it would be empty as soon as there was at least one time series for the metric. With that I could really only check for a specific time series to be missing, like:

absent(my_metric{instance="somehost",job="node"})

and would have to make one alert with a different expression for e.g. every instance. Or is there any way to use absent() for the general case which I just don't see?

> If you want to fire when foo exists now but did not exist 5 minutes ago
> (i.e. alert whenever a new metric is created), then
>
> expr: foo unless foo offset 5m

No, I think I'd only want alerts if something vanishes.

> And yes, it will silence after 5 minutes. You don't want to send
> recovery messages on such alerts.

Sounds reasonable. I wonder whether the expression is ideal: the above form would already fire even if the value was missing just once, exactly 5m ago. Wouldn't it be better to do something like:
expr: foo unless foo offset 15s
for: 5m

assuming a scrape interval of 15s? With offset, I cannot just specify the "previous" sample, right?

Is it somehow possible to do the above automatically for all metrics (and not just foo) from one expression? And I guess one would again need to link that somehow with `up` to avoid useless errors?

> > How does that work via smartmon?
>
> Sorry, that was my brainfart. It's "storcli.py" that you want.
> (Although collecting smartmon info is a good idea too).

Ah... I even saw that too, but had totally forgotten that they've renamed megacli.

Is there a list of some generally useful alerts, things like up == 0, or like the above idea of checking for metrics that have vanished? Ideally with how to use them properly ;-)

Thanks, Chris.
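Putting the pieces from this exchange together, a sketch of the missing-metric alert (alert name, for: duration, and my_metric are placeholders from the discussion):

```yaml
groups:
  - name: missing_metric_alerts
    rules:
      # Fires when the target is up but my_metric is absent from it.
      # count by (...) collapses multiple series (e.g. one per device)
      # into one per instance/job, so extra labels don't break the join.
      - alert: my_metric_missing
        expr: |
          up{job="myjob"} == 1
            unless on (instance, job)
          count by (instance, job) (my_metric)
        for: 15m
```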
[prometheus-users] Re: restrict (respectively silence) alert rules to/for certain instances
On Wednesday, April 26, 2023 at 9:14:35 AM UTC+2 Brian Candler wrote:

> > I guess with (2) you also meant having a route which is then
> > permanently muted?
>
> I'd use a route with a null receiver (i.e. a receiver which has no
> _configs under it)

Ah, interesting. It wasn't even clear to me from the documentation that this works, but as you say - it does. Nevertheless, it only suppresses the alert notifications; e.g. within the Alertmanager they would still show up as firing (as expected).

> > b) The idea that I had above:
> > - using alert_relabel_configs to filter on the instances and add a
> >   label if it should be silenced
> > - use only that label in the expr instead of the full regex
> > But would that even work?
>
> No, because as far as I know alert_relabel_configs is done *after* the
> alert is generated from the alerting rule.

I had already assumed so from the documentation... thanks for the confirmation.

> It's only used to add extra labels before sending the generated alert
> to alertmanager. (It occurs to me that it *might* be possible to use
> 'drop' rules here to discard alerts; that would be a very confusing
> config IMO)

What do you mean by drop rules?

> > For me it's really like this:
> > My Prometheus instance monitors:
> > - my "own" instances, where I need to react on things like >85% usage
> >   on the root filesystem (and thus want to get an alert)
> > - "foreign" instances, where I just get the node exporter data and
> >   show e.g. CPU usage, IO usage, and so on as a convenience to users
> >   of our cluster - but any alert conditions wouldn't cause any further
> >   action on my side (and the guys in charge of those servers have
> >   their own monitoring)
>
> In this situation, and if you are using static_configs or
> file_sd_configs to identify the hosts, then I would simply use a target
> label (e.g.
"owner") to distinguish which targets are yours and which are foreign; or I would use two different scrape jobs for self and foreign (which means the "job" label can be used to distinguish them) I had thought about that too, but the downside of it would be that I have to "hardcode" this into the labels within the TDSB. Even if storage is not a concern, what might happen sometimes is that a formerly "foreign" server moves into my responsibility. Then I think things would get messy. In general, TBH, to me its also not really clear what the best practise is in terms of scrape jobs: At one time I planned to use them to "group" servers that somehow belong together, e.g. in the case of a job for data from the node exporter, I would have made node_storage_servers, node_compute_servers or something like that. But then I felt this could actually cause troubles later on, when I want to e.g. filter time series based on the job (or as above: when a server moves its roles). So right now I put everything (from one exporter) in one job. Not really sure whether this is stupid or not ;-) The storage cost of having extra labels in the TSDB is essentially zero, because it's the unique combination of labels that identifies the timeseries - the bag of labels is mapped to an integer ID I believe. So the only problem is if this label changes often, and to me it sounds like a 'local' or 'foreign' instance remains this way indefinitely. Arguably, for the above particular use case, it would be rather quite rare that it changes. But for the node_storage_servers vs. node_compute_servers case... it would actually happen quite often in my environment. If you really want to keep these labels out of the metrics, then having a separate timeseries with metadata for each instance is the next-best option. Suppose you have a bunch of metrics with an 'instance' label, e.g. node_filesystem_free_bytes(instance="bar", } node_filesystem_size_bytes(instance="bar", } ... 
> as the actual metrics you're monitoring, then you create one extra
> static timeseries per host (instance) like this:
>
> meta{instance="bar",owner="self",site="london"} 1
>
> (aside: TSDB storage for this will be almost zero, because of the
> delta-encoding used). These can be created by scraping a static
> webserver, or by using recording rules. Then your alerting rules can be
> like this:
>
> expr: |
>   ( ... normal rule here ... )
>   * on(instance) group_left(site)
>   meta{owner="self"}
>
> The join will:
> * Limit alerting to those hosts which have a corresponding 'meta'
>   timeseries (matched on 'instance') and which has label owner="self"
> * Add the "site" label to the generated alerts
>
> Beware that:
> 1. this will suppress alerts for any host which does not have a
>    corresponding 'meta' timeseries. It's possible to work around this
>    to default to sending rather than not sending alerts, but it makes
>    the expressions more complex:
>    https://www.robustperception.io/left-joins-in-promql
> 2. the "instance" labels must match exactly. So for example, if you're
>    currently scraping with the default label instance="foo:9100" then
>    you'll need to change this to instance="foo" (which is good practice
>    anyway). See
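Brian's meta-timeseries join, sketched end to end. The recording rule here is just one way to create the static 'meta' series (a scraped static file works equally well); all names, labels, and the 85%-style threshold are illustrative assumptions, not a definitive implementation:

```yaml
groups:
  - name: instance_metadata
    rules:
      # One static 'meta' sample per instance: vector(1) has no labels,
      # so the rule's 'labels' block attaches instance/owner/site.
      - record: meta
        expr: 'vector(1)'
        labels:
          instance: bar
          owner: self
          site: london
  - name: alerts_self_only
    rules:
      # The join limits alerting to hosts with meta{owner="self"} and
      # copies the 'site' label onto the generated alert.
      - alert: root_fs_nearly_full
        expr: |
          ( node_filesystem_avail_bytes{mountpoint="/"}
              / node_filesystem_size_bytes{mountpoint="/"} < 0.15 )
          * on (instance) group_left(site)
          meta{owner="self"}
        for: 15m
```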
[prometheus-users] Re: how to make sure a metric is to be checked is "there"
On Tuesday, April 25, 2023 at 9:32:25 AM UTC+2 Brian Candler wrote:

> I think you would have basically the same problem with Icinga unless
> you have configured Icinga with a list of RAID controllers which should
> be present on a given device, or a list of drives which should be
> present in a particular RAID array.

Well, true, you still depend on the RAID tool to actually detect the controller and any RAIDs managed by it. But Icinga would likely catch most real-world issues that may happen by accident:
- RAID tool not installed
- some wrong parameters used when invoking the tool (e.g. a new version that might have changed command names)
- permission issues (like the tool not run as root, broken sudo rules)

> I'm not sure if you realise this, but the expression "up == 0" is not a
> boolean, it's a filter. The metric "up" has many different timeseries,
> each with a different label set, and each with a value. The PromQL
> expression "up" returns all of those timeseries. The expression
> "up == 0" filters it down to a subset: just those timeseries where the
> value is 0. Hence this expression could return 0, 1 or more timeseries.
> When used as an alerting expression, the alert triggers if the
> expression returns one or more timeseries (and regardless of the
> *value* of those timeseries). When you understand this, then using
> PromQL for alerting makes much more sense.

Well, I think that's clear... I have one (scalar) value in up for each target I scrape, e.g. if I have just the node exporter running, I'd get one (scalar) value for the scraped node exporter of every instance.

But the problem is that this does not necessarily tell me whether e.g. my RAID status result was contained in that scraped data, does it? It depends on the exporter... if I had a separate exporter just for the RAID metrics, then I'd be fine. But if it's part of a larger one, like the node exporter, it would depend on whether that errors out just because the RAID data couldn't be determined.
And I guess most exporters would per default just work fine if e.g. there were simply no RAID tools installed (which does make sense in a way). But it would also mean that I wouldn't notice the error if e.g. I forgot to install the tool.

In Icinga I'd notice this, because I have the configured check per host. If that runs and doesn't find e.g. MegaCli... it would error out. Prometheus OTOH knows just about the target (i.e. the host) and the exporter (e.g. node)... so it cannot really tell "ah... the RAID tool is missing"... unless the node exporter had an option telling it to insist on RAID tool xyz being executed, and to fail otherwise. That's basically what I'd like to do manually.

> However, if the RAID controller card were to simply vanish, then yes,
> the corresponding metrics would vanish - similarly, if a drive were to
> vanish from an array, its status would vanish.

Well, but that would usually also go unnoticed in the Icinga setup... and it's also something that I think never really happens - and if it does, one probably sees other errors like broken filesystems.

> You can create alert expressions which check for a specific sentinel
> metric being present with absent(...), and you can do things like
> joining with the 'up' metric, so you can say "if any target is being
> scraped, then alert me if that target doesn't return metric X". It *is*
> a bit trickier to understand than a simple alerting condition, but it
> can be done.

I guess that sounds like what I'd like to do. Thanks for the pointers below :-)

> https://www.robustperception.io/absent-alerting-for-scraped-metrics/
>
> expr: up{job="myjob"} == 1 unless my_metric

So my_metric would return "something" as soon as it was contained (in the most recent scrape!)... and if it wasn't, up{job="myjob"} == 1 would silence the "extra" error in case the target is NOT up anyway.

So in that case one should always do both:
- in general, check for any targets/jobs that are not up
- in specific cases (for e.g.
very important metrics), additionally check for the specific metric. Right? In general, when I get the value of some time series like node_cpu_seconds_total... when that is missing for e.g. one instance, I would get nothing, right? I.e. there is no special value, the vector of scalars just has one element less. But if I do get a value, it's for sure the one from the most recent scrape?! https://www.robustperception.io/absent-alerting-for-jobs/ Is this with absent() also needed when I have all my targets/jobs statically configured? I guess not, because Prometheus should know about them and reflect it in `up` if any of them couldn't be scraped, right? As for drives vanishing from an array, you can write expressions using count() to check the number of drives. If you have lots of machines and don't want separate rules per controller, then it's possible to use another timeseries as a threshold; again this is a bit more complex: https://www.robustperception.io/using-time-series-as-alert-thresholds
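As a sketch of the "unless" pattern from the robustperception article as a complete rule (my_metric is a placeholder; the rule name and duration are made up):

```yaml
groups:
  - name: metric_presence
    rules:
      # Fires when a target of the job was scraped successfully
      # (up == 1) but its most recent scrape did not contain
      # my_metric. Targets that are down are excluded, so this does
      # not duplicate a generic "up == 0" alert. If my_metric carries
      # extra labels beyond instance/job, an "unless on (instance, job)"
      # matching modifier would be needed.
      - alert: ExpectedMetricMissing
        expr: up{job="myjob"} == 1 unless my_metric
        for: 10m
```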
[prometheus-users] Re: restrict (respectively silence) alert rules to/for certain instances
Hey Brian On Tuesday, April 25, 2023 at 9:59:12 AM UTC+2 Brian Candler wrote: So really I'd divide the possibilities 3 ways:
a. Prevent the alert being generated from prometheus in the first place, by writing the expr in such a way that it filters out conditions that you don't want to alert on
b. Let the alert arrive at alertmanager, but permanently prevent it from sending out notifications for certain instances
c. Apply a temporary silence in alertmanager for certain alerts or groups of alerts
(a) is done by writing your 'expr' to match only specific instances or to exclude specific instances. (b) is done by matching on labels in your alertmanager routing rules (and if necessary, by adding extra labels in your 'expr'). I think in my case (where I want to simply get no alerts at all for a certain group of instances) it would be (a) or (b), with (a) probably being the cleaner one. I guess with (b) you also meant having a route which is then permanently muted? If you want to apply a threshold to only certain filesystems, and/or to have different thresholds per filesystem, then it's possible to put the thresholds in their own set of static timeseries: https://www.robustperception.io/using-time-series-as-alert-thresholds But I don't recommend this; I find such alerts are brittle. Would also sound like a solution that's a bit over-engineered to me. It helps to rethink exactly what you should be alerting on: https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit For the majority of cases: "alert on symptoms, rather than causes". That is, alert when a service isn't *working* (which you always need to know about), and in those alerts you can include potential cause-based information (e.g. CPU load is high, RAM is full, database is down etc). Now, there are also some things you want to know about *before* they become a problem, like "disk is nearly full". But the trouble with static alerts is, they are a pain to manage.
Suppose you have a threshold at 85%, and you have one server which is consistently at 86% but not growing - you know this is the case, you have no need to grow the filesystem, so you end up tweaking thresholds per instance. I would suggest two alternatives:
1. Check dashboards daily. If you want automatic notifications then don't send the sort of alert which gets someone out of bed, but an "FYI" notification to something like Slack or Teams.
2. Write dynamic alerts, e.g. have alerting rules which identify disk usage which is growing rapidly and likely to fill in the next few hours or days.

- name: DiskRate10m
  interval: 1m
  rules:
  # Warn if rate of growth over last 10 minutes means filesystem will fill in 2 hours
  - alert: DiskFilling10m
    expr: |
      node_filesystem_avail_bytes
        / (node_filesystem_avail_bytes
           - (predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[10m], 7200) < 0))
        * 7200
    for: 20m
    labels:
      severity: critical
    annotations:
      summary: 'Filesystem will be full in {{ $value | humanizeDuration }} at current 10m growth rate'
- name: DiskRate3h
  interval: 10m
  rules:
  # Warn if rate of growth over last 3 hours means filesystem will fill in 2 days
  - alert: DiskFilling3h
    expr: |
      node_filesystem_avail_bytes
        / (node_filesystem_avail_bytes
           - (predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[3h], 172800) < 0))
        * 172800
    for: 6h
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem will be full in {{ $value | humanizeDuration }} at current 3h growth rate'

Thanks, but I'm not sure whether the above applies to my scenario. For me it's really like this: My Prometheus instance monitors:
- my "own" instances, where I need to react on things like >85% usage on the root filesystem (and thus want to get an alert)
- "foreign" instances, where I just get the node exporter data and show e.g.
CPU usage, IO usage, and so on as a convenience to users of our cluster - but any alert conditions wouldn't cause any further action on my side (and the guys in charge of those servers have their own monitoring). So in the end it just boils down to my desire to keep my alert rules small/simple/readable.

expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} * 100) / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"}) >= 85
=> would fire for all nodes, bad

expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRightHosts"} * 100) / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRightHosts"}) >= 85
=> would work, I guess, but seems really ugly to read/maintain

Not sure whether anything can be done better via adding labels at some stage. As well as target labels, you can set labels in the alerting rules themselves, for when an alert fires. That
[prometheus-users] how to make sure a metric that is to be checked is "there"
Hey there. What I'm trying to do is basically replace Icinga with Prometheus (well, not really replacing, but integrating it into the latter, which I anyway need for other purposes). So I'll have e.g. some metric that shows me the RAID status on instances, and I want to get an alert when an HDD is broken. I guess it's obvious that it could turn out badly if I don't get an alert just because the metric data isn't there (for some reason). In Icinga, this would have been simple: the system knows about every host and every service it needs to check. If there's no result (like RAID is OK or FAILED) anymore (e.g. because the RAID CLI tool is not installed), the check's status would at least go into UNKNOWN. I wonder how this is / can be handled in Prometheus? I mean, I can of course check e.g. expr: up == 0 in some alert. But AFAIU this actually just tells me whether there are any scrape targets that couldn't be scraped (in the last run, based on the scrape interval), right? If my important checks were all their own exporters, e.g. one exporter just for the RAID status, then - AFAIU - this would already work and notify me for sure, even if there's no result at all. But what if it's part of some larger exporter, like e.g. the mdadm data in node exporter? up wouldn't become 0 just because node_md_disks was not part of the metrics. Even if I'd say it's the duty of the exporter to make sure that there is a result even on failure to read the status... what if e.g. some tool is already needed just to determine whether that metric makes sense to be collected at all? That would be typical for most hardware RAID controllers... you need the respective RAID tool just to see whether any RAIDs are present. So in principle I'd like a simple way to check, for a certain group of hosts, the availability of a certain time series, so that I can set up e.g. an alert that fires if any node where I have e.g. some MegaCLI-based RAID lacks megacli_some_metric.
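One way to express that last idea as a rule sketch - megacli_some_metric and the instance regex are hypothetical placeholders for whatever the MegaCLI hosts actually expose:

```yaml
groups:
  - name: raid_metric_presence
    rules:
      # For the hosts known to have a MegaCLI-managed controller,
      # alert when the node job scrapes fine but the RAID metric is
      # absent from the scrape. "on (instance)" ignores any extra
      # labels that megacli_some_metric might carry.
      - alert: MegaCliMetricMissing
        expr: |
          up{job="node", instance=~"raidhost-.*"} == 1
            unless on (instance) megacli_some_metric
        for: 15m
```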
Or is there some other/better way this is done in practice? Thanks, Chris. -- You received this message because you are subscribed to the Google Groups "Prometheus Users" group. To unsubscribe from this group and stop receiving emails from it, send an email to prometheus-users+unsubscr...@googlegroups.com. To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/fbfd8ca1c671830a3ce428a54a60aebed2ea596e.camel%40gmail.com.
[prometheus-users] restrict (respectively silence) alert rules to/for certain instances
Hey. I have some trouble understanding how to do things right™ with respect to alerting. In principle I'd like to do two things:
a) have certain alert rules run only for certain instances (though that may in practice actually be less needed, when only the respective nodes would generate the respective metrics - not sure yet whether this will be the case)
b) silence certain (or all) alerts for a given set of instances; e.g. these may be nodes where I'm not an admin who can take action on an incident, but just view the time series graphs to see what's going on
As an example I'll take an alert that fires when the root fs has >85% usage:

groups:
- name: node_alerts
  rules:
  - alert: node_free_fs_space
    expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} * 100) / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"}) >= 85

With respect to (a): I could of course add yet another label matcher like instance=~"someRegexThatDescribesMyInstances" to each time series, but when that regex gets more complex, everything becomes quite unreadable, and it's quite error-prone to forget about a place (assuming one has many alerts) when the regex changes. Is there some way like defining host groups or so? Where I have a central place in which I could define the list of hosts, respectively a regex for that... and just use the name of that definition in the actual alert rules? With respect to (b): Similarly to above... if I had various instances for which I never wanted to see any alerts, I could of course add a regex to all my alerts. But it seems quite ugly to clutter up all the rules just for a potentially long list/regex of things I don't want to see anyway. Another idea I had was that I do the filtering/silencing in the alertmanager config at route level: like by adding an "ignore" route that matches via regex on all the instances I'd like to silence (and has a mute_time_interval set to 24/7), before any other routes match.
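That "ignore" route idea could be sketched roughly like this in alertmanager.yml (receiver names and the instance regex are made up; instead of a 24/7 mute_time_interval, routing to a receiver that has no notification configs at all also sends nothing):

```yaml
route:
  receiver: default-mail
  routes:
    # Catches alerts from the to-be-silenced instances first and
    # routes them to a receiver with no notifiers, so no mail is
    # ever sent for them.
    - matchers:
        - instance =~ "foreign-node-.*"
      receiver: blackhole
receivers:
  - name: default-mail
    email_configs:
      - to: 'admin@example.org'
  - name: blackhole
```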
But AFAIU this would only suppress the message (e.g. mail), but the alert would still show up in the alertmanager webpages/etc. as firing. Not sure whether anything can be done better via adding labels at some stage.
- Doing external_labels: in the prometheus config doesn't seem to help here (only static values?)
- Same for labels: in the prometheus config.
- Setting some "noalerts" label via the prometheus config would also set that in the DB, right? This I'd rather not want.
- Maybe using: alerting: alert_relabel_configs: - would work? Like matching hostnames on instance and replacing with e.g. "yes" in some "noalerts" label? And then somehow using that in the alert rules... But that also sounds a bit ugly, TBH.
So... what's the proper way to do this? :-) Thanks, Chris. btw: Is there any difference between: 1) alerting: alert_relabel_configs: - and 2) the relabel_configs: in To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/03619286babc6b2ee9d3295e235016b4e3b383ca.camel%40gmail.com.
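For completeness, the alert_relabel_configs idea could be sketched like this in prometheus.yml (the regex and label name are made up). Labels added here appear on the alerts sent to alertmanager but are not written to the TSDB:

```yaml
alerting:
  alert_relabel_configs:
    # Tag alerts from certain instances with a "noalerts" label,
    # which an alertmanager route could then match on and mute.
    - source_labels: [instance]
      regex: 'foreign-node-.*'
      target_label: noalerts
      replacement: 'yes'
```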
Re: [prometheus-users] fading out sample resolution for samples from longer ago possible?
On Tue, 2023-02-28 at 10:25 +0100, Ben Kochie wrote:
> Debian release cycles are too slow for the pace of Prometheus
> development.
It's rather simple to pull the version from Debian unstable, if one needs to, and that seems pretty current.
> You'd be better off running Prometheus using podman, or deploying
> official binaries with Ansible[0].
Well, I guess views on how software should be distributed differ. The "traditional" system of having distributions has many advantages and is IMO a core reason for the success of Linux and open source. All "modern" alternatives like flatpaks, snaps, and similar repos are IMO, especially security-wise, completely inadequate (especially because there is no trusted intermediate (like the distribution) which does some basic maintenance). It's anyway not possible here because of security policy reasons.
> No, but it depends on your queries. Without seeing what you're
> graphing there's no way to tell. Your queries could be complex or
> inefficient. Kinda like writing slow SQL queries.
As mentioned already in the other thread, so far I merely do only what https://grafana.com/grafana/dashboards/1860-node-exporter-full/ does.
> There are ways to speed up graphs for specific things, for example
> you can use recording rules to pre-render parts of the queries.
>
> For example, if you want to graph node CPU utilization you can have a
> recording rule like this:
>
> groups:
> - name: node_exporter
>   interval: 60s
>   rules:
>   - record: instance:node_cpu_utilization:ratio_rate1m
>     expr: >
>       avg without (cpu) (
>         sum without (mode) (
>           rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[1m])
>         )
>       )
>
> This will give you a single metric per node that will be faster to
> render over longer periods of time. It also effectively down-samples
> by only recording one point per minute.
But will dashboards like Node Exporter Full automatically use such? And if so...
will they (or rather Prometheus) use the real time series (with full resolution) when needed? If so, then the idea would be to create such a rule for every metric I'm interested in and that is slow, right?
> Also "Medium sized VM" doesn't give us any indication of how much CPU
> or memory you have. Prometheus uses page cache for database access.
> So maybe your system is lacking enough memory to effectively cache
> the data you're accessing.
Right now it's 2 (virtual) CPUs with 4.5 GB RAM... I'd guess it might need more CPU? Previously I suspected IO to be the reason, and while in fact IO is slow (the backend seems to deliver only ~100 MB/s)... there seems to be nearly no IO at all while waiting for the "slow graph" (which is Node Exporter Full's "CPU Basic" panel), e.g. when selecting the last 30 days. Kinda surprising... does Prometheus read its TSDB really that efficiently? Could it be a problem when Grafana runs on another VM? Though there didn't seem to be any network bottleneck... and I guess Grafana always accesses Prometheus via TCP anyway, so there should be no further positive caching effect when both run on the same node?
> No, we've talked about having variable retention times, but nobody
> has implemented this. It's possible to script this via the DELETE
> endpoint[1]. It would be easy enough to write a cron job that deletes
> specific metrics older than X, but I haven't seen this packaged into
> a simple tool. I would love to see something like this created.
>
> [1]: https://prometheus.io/docs/prometheus/latest/querying/api/#delete-series
Does it make sense to open a feature request ticket for that? I mean, it would solve at least my storage "issue" (well, it's not really a showstopper... as was mentioned, one could simply buy a big cheap HDD/SSD). And could something be made, via the same route, that downsamples data from longer ago? Both together would really give quite some flexibility.
For metrics where old data is "boring" one could just delete everything older than e.g. 2 weeks, while keeping full details for that time. For metrics where one is interested in larger time ranges, but where sample resolution doesn't matter so much, one could downsample it... like everything older than 2 weeks... then even more for everything older than 6 months, then even more for everything older than 1 year... and so on. For the few metrics where full-resolution data is interesting over a really long time span, one could just keep it.
> > Seem at least quite big to me... that would - assuming all days can
> > be compressed roughly to that (which isn't sure of course) - mean for
> > one year one needs ~ 250 GB for that 40 nodes or about 6,25 GB per node
> > (just for the data for node exporter with a 15s interval).
> Without seeing a full meta.json and the size of the files in one dir,
> it's hard to say exactly if this is good or bad. It depends a bit on
> how
Re: [prometheus-users] fading out sample resolution for samples from longer ago possible?
Hey Brian On Tue, 2023-02-28 at 00:27 -0800, Brian Candler wrote:
> I can offer a couple more options:
>
> (1) Use two servers with federation.
> - server 1 does the scraping and keeps the detailed data for 2 weeks
> - server 2 scrapes server 1 at a lower interval, using the federation
> endpoint
I had thought about that as well. Though it feels a bit "ugly".
> (2) Use recording rules to generate lower-resolution copies of the
> primary timeseries - but then you'd still have to remote-write them
> to a second server to get the longer retention, since this can't be
> set at timeseries level.
I had (very briefly) read about the recording rules (merely just that they exist ^^)... but wouldn't these give me a new name for the metric? If so, I'd need to adapt e.g. https://grafana.com/grafana/dashboards/1860-node-exporter-full/ to use the metrics generated by the recording rules... which again seems quite some maintenance effort. Plus, as you even wrote below, I'd need users to use different dashboards, AFAIU: one where the detailed data is used, one where the downsampled data is used. Sure, that would work as a workaround, but it is of course not really a good solution, as one would rather want to "seamlessly" move from the detailed to the less-detailed data.
> Either case makes the querying more awkward. If you don't want
> separate dashboards for near-term and long-term data, then it might
> work to stick promxy in front of them.
Which would however make the setup more complex again.
> Apart from saving disk space (and disks are really, really cheap
> these days), I suspect the main benefit you're looking for is to get
> faster queries when running over long time periods. Indeed, I believe
> Thanos creates downsampled timeseries for exactly this reason, whilst
> still continuing to retain all the full-resolution data as well.
I guess I may have to look into that and how complex its setup would be.
> That depends. What PromQL query does your graph use?
> How many timeseries does it touch? What's your scrape interval?
So far I've just been playing with the ones from https://grafana.com/grafana/dashboards/1860-node-exporter-full/ - so all queries in that, and all time series it uses. The interval is 15s.
> Is your VM backed by SSDs?
I think it's a Ceph cluster that the supercomputing centre uses for that, but I have no idea what that runs upon. Probably HDDs.
> Another suggestion: running netdata within the VM will give you
> performance metrics at 1 second intervals, which can help identify
> what's happening during those 10-15 seconds: e.g. are you
> bottlenecked on CPU, or disk I/O, or something else.
Good idea, thanks. Thanks, Chris. To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/e35d617dbaab44de43da049414103ff1e9102e61.camel%40gmail.com.
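For reference, the two-server federation idea from option (1) earlier in this thread could look roughly like this in the second server's prometheus.yml (hostname and the match[] selector are placeholders):

```yaml
scrape_configs:
  - job_name: 'federate'
    # Scrape the detailed server's federation endpoint at a coarse interval
    scrape_interval: 5m
    honor_labels: true
    metrics_path: '/federate'
    params:
      'match[]':
        - '{job="node"}'
    static_configs:
      - targets:
          - 'server1.example.org:9090'
```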
Re: [prometheus-users] fading out sample resolution for samples from longer ago possible?
Hi Stuart, Julien and Ben, hope you don't mind that I answer all three replies in one... don't wanna spam the list ;-)
On Tue, 2023-02-21 at 07:31 +0000, Stuart Clark wrote:
> Prometheus itself cannot do downsampling, but other related projects
> such as Cortex & Thanos have such features.
Uhm, I see. Unfortunately neither is packaged for Debian. Plus it seems to make the overall system even more complex. I want Prometheus merely for monitoring a few hundred nodes at the university (thus it seems a bit overkill to have something like Cortex, which sounds like a system for a really large number of nodes), though as indicated before, we'd need both:
- detailed data for like the last week or perhaps two
- far less detailed data for much longer terms (like several years)
Right now my Prometheus server runs in a medium-sized VM, but when I visualise via Grafana and select a time span of a month, it already takes considerable time (like 10-15s) to render the graph. Is this expected?
On Tue, 2023-02-21 at 11:45 +0100, Julien Pivotto wrote:
> We would love to have this in the future but it would require careful
> planning and a design document.
So native support is nothing on the near horizon? And I guess it's really not possible to "simply" ( ;-) ) have different retention times for different metrics?
On Tue, 2023-02-21 at 15:52 +0100, Ben Kochie wrote:
> This is mostly unnecessary in Prometheus because it uses compression
> in the TSDB samples. What would take up a lot of space in an RRD file
> takes up very little space in Prometheus.
Well, right now I scrape only the node exporter data from 40 hosts at a 15s interval, plus the metrics from Prometheus itself. I've been doing this on a test install since the 21st of February. Retention time is still at its default.
That gives me:

# du --apparent-size -l -c -s --si /var/lib/prometheus/metrics2/*
68M	/var/lib/prometheus/metrics2/01GSST2X0KDHZ0VM2WEX0FPS2H
481M	/var/lib/prometheus/metrics2/01GSVQWH7BB6TDCEWXV4QFC9V2
501M	/var/lib/prometheus/metrics2/01GSXNP1T77WCEM44CGD7E95QH
485M	/var/lib/prometheus/metrics2/01GSZKFK53BQRXFAJ7RK9EDHQX
490M	/var/lib/prometheus/metrics2/01GT1H90WKAHYGSFED5W2BW49Q
487M	/var/lib/prometheus/metrics2/01GT3F2SJ6X22HFFPFKMV6DB3B
498M	/var/lib/prometheus/metrics2/01GT5CW8HNJSGFJH2D3ADGC9HH
490M	/var/lib/prometheus/metrics2/01GT7ANS5KDVHVQZJ7RTVNQQGH
501M	/var/lib/prometheus/metrics2/01GT98FETDR3PN34ZP59Y0KNXT
172M	/var/lib/prometheus/metrics2/01GT9X2BPN51JGB6QVK2X8R3BR
60M	/var/lib/prometheus/metrics2/01GTAASP91FSFGBBH8BBN2SQDJ
60M	/var/lib/prometheus/metrics2/01GTAHNDG070WXY8WGDVS22D2Y
171M	/var/lib/prometheus/metrics2/01GTAHNHQ587CQVGWVDAN26V8S
102M	/var/lib/prometheus/metrics2/chunks_head
21k	/var/lib/prometheus/metrics2/queries.active
427M	/var/lib/prometheus/metrics2/wal
5,0G	total

Not sure whether I understood meta.json correctly (haven't found documentation for minTime/maxTime), but I guess that the big ones correspond to 64800s? Seems at least quite big to me... that would - assuming all days can be compressed roughly to that (which isn't sure of course) - mean for one year one needs ~ 250 GB for those 40 nodes, or about 6,25 GB per node (just for the data for node exporter with a 15s interval). Does that sound reasonable/expected?
> What's actually more
> difficult is doing all the index loads for this long period of time.
> But Prometheus uses mmap to opportunistically access the data on
> disk.
And is there anything that can be done to improve that? Other than simply using some fast NVMe or so? Thanks, Chris.
To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/45f21aedf2412705809fc69522055ca82b2f95f2.camel%40gmail.com.
[prometheus-users] fading out sample resolution for samples from longer ago possible?
Hey. I wondered whether one can do with Prometheus something similar to what is possible with systems using RRD (e.g. Ganglia). Depending on the kind of metrics, like for those from the node exporter, one may want a very high sample resolution (and thus a short scraping interval) for like the last 2 days... but the further one goes back, the less interesting that data becomes, at least in that resolution (ever looked at how much IO a server had 2 years ago, per 15s?). What one may however want is a rough overview of these metrics for those time periods longer ago, e.g. in order to see some trends. For other values, e.g. the total used disk space on a shared filesystem or maybe a tape library, one may not need such high resolution for the last 2 days, but therefore want the data (with low sample resolution, e.g. 1 sample per day) going back much longer, like the last 10 years. With Ganglia/RRD one would then simply use multiple RRDs, each for different time spans and with different resolutions... and RRD would interpolate its samples accordingly. Can anything like this be done with Prometheus? Or is that completely out of scope? I saw that one can set the retention period, but that seems to affect everything. So even if I have e.g. my low-resolution tape library total size, which I could scrape only every hour or so... it wouldn't really help me. In order to keep that data for like the last 10 years, I'd need to set the retention time to that. But then the high-resolution samples, like those from the node exporter, would also be kept that long (with full resolution). Thanks, Chris. To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/36e3506c-1fba-48e4-b3d9-ead908767cf2n%40googlegroups.com.
Re: [prometheus-users] collect non-metrics data
Hey Ben. On Saturday, February 11, 2023 at 11:18:44 AM UTC+1 Ben Kochie wrote: You combine this with an "info" metric that tells you about the rest of the device. Ah... and I assume that one could just also export these info metrics alongside e.g. node_md_state? Thanks :-) Chris. To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/d3abe98c-6b2b-4471-a981-8b99936b7dd4n%40googlegroups.com.
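As a sketch of how such an info metric could then be used in an alert, to carry the textual details onto the alert labels (node_md_info and its serial label are hypothetical here; the failure condition assumes node exporter's node_md_disks metric with its state label):

```yaml
groups:
  - name: raid
    rules:
      # The group_left join copies descriptive labels (e.g. a serial
      # number) from the hypothetical node_md_info metric onto the
      # alert for the failing device.
      - alert: MdDiskFailed
        expr: |
          node_md_disks{state="failed"} > 0
            * on (instance, device) group_left (serial)
              node_md_info
        annotations:
          summary: 'Failed disk on {{ $labels.instance }} ({{ $labels.device }}, serial {{ $labels.serial }})'
```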
[prometheus-users] collect non-metrics data
Hey. I wondered whether the following is possible with Prometheus. I basically think about possibly phasing out Icinga and doing any alerting in Prometheus. For checks that are clearly metrics-based (like load or free disk space) this seems rather easy. But what about checks that are not really based on metrics? Like e.g. check_raid, which gives an error if any RAID has lost a disk or similar. Of course one could always just try to make a metric out of it - above, one could make e.g. the number of non-consistent RAIDs the metric. But what one actually wants from such checks is additional (typically purely textual) information, like, in the above example, which HDD (enclosure, bay number... or the serial number) has failed. Also, I have numerous other checks which test for things that are not really related to a number, but where the output is a string. Is there any (good) way to get that done with Prometheus, or is it simply not meant for that specific use case? Thanks, Chris. To view this discussion on the web visit https://groups.google.com/d/msgid/prometheus-users/8fe84502-eca5-4e53-8a9c-35e7a9dd6113n%40googlegroups.com.