[prometheus-users] Re: what to do about flapping alerts?

2024-04-08 Thread Christoph Anton Mitterer


On Monday, April 8, 2024 at 11:05:41 PM UTC+2 Brian Candler wrote:

On Monday 8 April 2024 at 20:57:34 UTC+1 Christoph Anton Mitterer wrote:

But for Prometheus, with keep_firing_for, it will be like the same alert.


If the alerts have the exact same set of labels (e.g. the alert is at the 
level of the RAID controller, not at the level of individual drives) then 
yes.


Which will still be quite often the case, I guess. Sometimes it may not 
matter, i.e. when a "new" alert (which has the same label set) is "missed" 
because of keep_firing_for, but sometimes it may.
 

It failed, it fixed, it failed again within keep_firing_for: then you only 
get a single alert, with no additional notification.
But that's not the problem you originally asked for:
"When the target goes down, the alert clears and as soon as it's back, it 
pops up again, sending a fresh alert notification."


Sure, and this can be avoided with keep_firing_for, but as far as I can see 
only in some cases (since one wants to keep keep_firing_for shortish), and 
at the cost of losing information about when the alert condition actually 
went away (which Prometheus can in principle know) and came back while 
still firing.

 

keep_firing_for can be set differently for different alerts.  So you can 
set it to 10m for the "up == 0" alert, and not set it at all for the RAID 
alert, if that's what you want.
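
I.e., if I understand correctly, something like this (just an untested 
sketch - the RAID metric name and the 10m value are made up):

groups:
  - name: example
    rules:
    # flappy "up" target: hold the alert for a while after the expr stops returning results
    - alert: target_down
      expr: 'up == 0'
      for:  1m
      keep_firing_for: 10m
    # RAID alert: no keep_firing_for, so it resolves as soon as the expr clears
    - alert: raid_degraded
      expr: 'some_raid_failed_drives_metric > 0'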


If there were no other way than the current keep_firing_for - i.e. if my 
idea of an alternative keep_firing_for that considers the up/down state of 
the queried metrics isn't possible and/or reasonable - then, rather than 
being able to set keep_firing_for per alert, I'd wish to be able to set it 
per queried instance.

For some of the things I'm working on at the university it might be a nice 
approach to (automatically) query the status of an alert and take action if 
it fires - but then I'd also want that action to stop rather soon after the 
alert (actually) stops. If I have to use a longer keep_firing_for because of 
a set of unstable nodes, then either I get the penalty of unnecessarily 
long-firing alerts for all nodes, or I maintain different sets of alerts, 
which would be possible but also quite ugly.


  

Surely that delay is essential for the de-flapping scenario you describe: 
you can't send the alert resolved message until you are *sure* the alert 
has resolved (i.e. after keep_firing_for).

Conversely: if you sent the alert resolved message immediately (before 
keep_firing_for had expired), and the problem recurred, then you'd have 
to send out a new alert failing message - which is the flap noise I think 
you are asking to suppress.


Okay maybe we have a misunderstanding here, or better said, I guess there 
are two kinds of flapping alerts:

For example, assume an alert that monitors the utilised disk space on the 
root fs, and fires whenever that's above 80%.

Type 1 Flapping:
- The scraping of the metrics works all the time (i.e. `up` is all the time 
1).
- But IO is happening that causes the 80% threshold to be exceeded and then 
fallen below again every few seconds.

Type 2 Flapping:
- There is IO, but the utilisation is always above 80%, say it's already at 
~90% all the time.
- My scrapes fail every now and then [0].

I honestly hadn't even thought about type 1 yet. But I think those are the 
ones which would be perfectly solved by keep_firing_for.
Well, even there I'd still like to be able to have keep_firing_for applied 
only to a given label set, e.g. something like: keep_firing_for: 10m 
on {alertname=~"regex-for-my-known-flapping-alerts"}

Type 2 is the one that causes me headaches right now.

That is why I thought before that it could be solved by something like 
keep_firing_for that also takes into account whether any of the metrics 
it queries were from a target that is "currently" down - and only then lets 
keep_firing_for take effect.
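
Lacking that, I guess one could approximate it per rule in PromQL - a rough, 
untested sketch based on the root-fs example (the 15m lookback, the alert 
name and the mountpoint filter are arbitrary):

- alert: node_rootfs_usage
  expr: |
    ( 1 - node_filesystem_avail_bytes{mountpoint="/"} / node_filesystem_size_bytes{mountpoint="/"} > 0.80 )
    or
    (
      # re-evaluate the condition on the last known samples, but only while the target is currently down
      ( 1 -   last_over_time(node_filesystem_avail_bytes{mountpoint="/"}[15m])
            / last_over_time(node_filesystem_size_bytes{mountpoint="/"}[15m]) > 0.80 )
      and on (instance, job) ( up == 0 )
    )

While up == 0, the last samples scraped before the outage keep the condition 
alive; as soon as the target is scraped again, the right-hand branch vanishes 
immediately.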


Thanks,
Chris.


[0] I do have a number of hosts where this constantly happens, not really 
sure why TBH, but even with a niceness of -20 and an IO-niceness of 0 
(though in the best-effort class) it happens quite often. The node is under 
high load (it's one of our compute nodes for the LHC Computing Grid)... so I 
guess maybe it's just "overloaded". So I don't think this will go away and I 
somehow have to get it working with the scrapes failing every now and then.

What actually puzzled me more is this:
[image: Screenshot from 2024-04-09 00-24-59.png]
That's some of the graphs from the Node Full Exporter Grafana dashboard, 
all for one node (which is one of the flapping ones).
As you can see, Memory Basic and Disc Space Used Basic have a gap, where 
scraping failed.
My assumption was that - for a given target - scraping either 
fails for all metrics or succeeds for all.
But here, only the right side plots have gaps, the left side ones don't.

Maybe that's just some consequence of these using counters and rate() 

[prometheus-users] Re: what to do about flapping alerts?

2024-04-08 Thread Christoph Anton Mitterer
Hey Brian.

On Saturday, April 6, 2024 at 9:33:27 AM UTC+2 Brian Candler wrote:

> but AFAIU that would simply affect all alerts, i.e. it wouldn't just keep 
firing, when the scraping failed, but also when it actually goes back to an 
ok state, right?

It affects all alerts individually, and I believe it's exactly what you 
want. A brief flip from "failing" to "OK" doesn't resolve the alert; it 
only resolves if it has remained in the "OK" state for the keep_firing_for 
duration. Therefore you won't get a fresh alert until it's been OK for at 
least keep_firing_for and *then* fails again.


I'm still thinking whether it is what I want - or not ;-)

Assume the following (arguably a bit made up) example:
One has a metric that counts the number of failed drives in a RAID. One 
drive fails so some alert starts firing. Eventually the computing centre 
replaces the drive and it starts rebuilding (guess it doesn't matter 
whether the rebuilding is still considered to cause an alert or not). 
Eventually it finishes and the alert should go away (and I should e.g. get 
a resolved message).
But because of keep_firing_for, it doesn't stop straight away.
Now before it does, yet another disk fails.
But for Prometheus, with keep_firing_for, it will be like the same alert.

As said, this example is a bit made up, because even without 
keep_firing_for, I wouldn't see the next device if it fails *while* the 
first one is still failing.
But the point is, I will lose follow-up alerts that are close to a 
previous one when I use keep_firing_for to solve the flapping problem.
Also, depending on how large I have to set keep_firing_for, I will get 
resolved messages later... which, depending on what one does with the 
alerts, may also be less desirable.


 

As you correctly surmise, an alert isn't really a boolean condition, it's a 
presence/absence condition: the expr returns a vector of 0 or more alerts, 
each with a unique combination of labels.  "keep_firing_for" retains a 
particular labelled value in the vector for a period of time even if it's 
no longer being generated by the alerting "expr".  Hence if it does 
reappear in the expr output during that time, it's just a continuation of 
the previous alert.


I think the main problem behind this may rather be a conceptual one, namely 
that Prometheus uses "no data" for "no alert" - which is also what you get 
when there is no data because of e.g. scrape failures - so it can't really 
differentiate between the two conditions.

What one would IMO need is a keep_firing_for that works only while the 
target is down. But as soon as it goes up again (even if just for one 
scrape), the effect would be gone and the alert would stop firing 
immediately (unless, of course, there's still a value coming out of the 
expression).
Wouldn't that make sense?
 

> Similarly, when a node goes completely down (maintenance or so) and then 
up again, all alerts would then start again to fire (and even a generous 
keep_firing_for would have been exceeded)... and send new notifications.
I don't understand what you're saying here. Can you give some specific 
examples?


Well, what I meant is basically the same as above, just outside of the 
flapping scenario (in which, I guess, the scrape failures never last longer 
than perhaps 1-10 mins):
- Imagine I have a node with several alerts firing (e.g. again that some 
upgrades aren't installed yet, or root fs has too much utilisation, things 
which typically last unless there's some manual intervention).
- Also, I have e.g. set my alert manager, to repeat these alerts say once a 
week (to nag the admin to finally do something about it).
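
That "once a week" being e.g. a repeat_interval on the corresponding route 
in alertmanager.yml - just to make the setup concrete; the receiver name is 
made up:

route:
  receiver: admins
  repeat_interval: 7d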

What I'd expect should happen is e.g. the following:
- I already got the mails from the above alerts, so unless something 
changes, they should only be re-sent in a week.
- If one of those alerts resolves (e.g. someone frees up disk space), but 
disk space runs over my threshold again later, I'd like a new notification 
- now, not just in a week.
(but back now to the situation, where the alert is still running from the 
first time and only one mail has been sent)

Now say I e.g. reboot the system. Maybe the admin upgraded the packages with 
security updates and also did some firmware upgrades, which can easily take 
a while (we have servers where that runs for an hour or so... o.O).
So the system is down for one hour, during which scraping fails (and the 
alert condition would be gone), and any reasonable keep_firing_for (at least 
reasonable with its current semantics) will also have run out already.

The system comes up again, but the over-utilisation of the root fs is still 
there, and the alert that had already fired before starts again 
*respectively* continues to do so.

At that point, we cannot really know whether it's the same alert (i.e. the 
alert condition never resolved) or whether it's a new one (it did resolve 
but came back again).
(Well, in my example we can be pretty sure it's the same one, since I 
rebooted - but generally 

[prometheus-users] what to do about flapping alerts?

2024-04-05 Thread Christoph Anton Mitterer
Hey.

I have some simple alerts like:
- alert: node_upgrades_non-security_apt
  expr: 'sum by (instance,job) ( apt_upgrades_pending{origin!~"(?i)^.*-security(?:\\PL.*)?$"} )'
- alert: node_upgrades_security_apt
  expr: 'sum by (instance,job) ( apt_upgrades_pending{origin=~"(?i)^.*-security(?:\\PL.*)?$"} )'

If there are no upgrades, these give no value.
Similarly, for all other simple alerts, like free disk space:
1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="rootfs", instance!~"(?i)^.*\\.garching\\.physik\\.uni-muenchen\\.de$"} / node_filesystem_size_bytes  >  0.80

No value => all ok, some value => alert.

I do have some instances which are pretty unstable (i.e. scraping fails 
every now and then - or more often than that), which are however mostly 
out of my control, so I cannot do anything about that.

When the target goes down, the alert clears and as soon as it's back, it 
pops up again, sending a fresh alert notification.

Now I've seen:
https://github.com/prometheus/prometheus/pull/11827
which describes keep_firing_for as "the minimum amount of time that an 
alert should remain firing, after the expression does not return any 
results", respectively in 
https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/#rule :

# How long an alert will continue firing after the condition that triggered it
# has cleared.
[ keep_firing_for: <duration> | default = 0s ]
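
For reference, that setting goes into the alerting rule itself, e.g. (just a 
sketch - the alert name and the 10m value are made up):

- alert: node_rootfs_usage
  expr: '1 - node_filesystem_avail_bytes{mountpoint="/", fstype!="rootfs"} / node_filesystem_size_bytes > 0.80'
  keep_firing_for: 10m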

but AFAIU that would simply affect all alerts, i.e. it wouldn't just keep 
firing, when the scraping failed, but also when it actually goes back to an 
ok state, right?
That's IMO however rather undesirable.

Similarly, when a node goes completely down (maintenance or so) and then up 
again, all alerts would then start again to fire (and even a generous 
keep_firing_for would have been exceeded)... and send new notifications.


Is there any way to solve this? Especially that one doesn't get new 
notifications sent, when the alert never really stopped?

At least I wouldn't understand how keep_firing_for would do this.

Thanks,
Chris.



Re: [prometheus-users] query for time series misses samples (that should be there), but not when offset is used

2024-04-05 Thread Christoph Anton Mitterer
Hey.


On Friday, April 5, 2024 at 7:10:29 AM UTC+2 Ben Kochie wrote:

If the jitter is > 0.002, the real value is stored. 

 
Interesting... though I guess bad for my solution in the other thread, 
where I make the assumption that it's guaranteed that samples are always 
exactly on point with the same interval in-between.
Haven't checked it yet, but I'd guess that blows the approach in the other 
thread.

Is there some metric to see whether such non-aligned samples occurred?


Also, what would happen if e.g. there was a first scrape which gets delayed 
by > 0.002 s... and before that first scrape arrives, there's yet another 
(later) scrape which has no jitter and is on time?
Are they going to be properly ordered?

Cheers
Chris.



Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-04-05 Thread Christoph Anton Mitterer
Hey Chris.

On Thursday, April 4, 2024 at 8:41:02 PM UTC+2 Chris Siebenmann wrote:

> - The evaluation interval is sufficiently less than the scrape 
> interval, so that it's guaranteed that none of the `up`-samples are 
> being missed. 


I assume you were referring to the above specific point?

Maybe there is a misunderstanding:

With the above I merely meant that my solution requires the alert rule 
evaluation interval to be small enough, so that when it looks at 
resets(up[20s] offset 60s) (which is the window from -70s to -50s PLUS an 
additional shift by 10s, so effectively -80s to -60s), the evaluations 
happen often enough that no sample can "jump over" that time window.

I.e. if the scrape interval was 10s but the evaluation interval only 20s, 
it would surely miss some.
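
Concretely, that's just the relationship between these two settings in 
prometheus.yml (respectively the rule group's interval:) - made-up values:

global:
  scrape_interval: 10s
  # needs to be sufficiently smaller than the scrape interval for this approach,
  # so that no up-sample can fall between two evaluations of the 20s window
  evaluation_interval: 5s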
 

I don't believe this assumption about up{} is correct. My understanding 
is that up{} is not merely an indication that Prometheus has connected 
to the target exporter, but an indication that it has successfully 
scraped said exporter. Prometheus can only know this after all samples 
from the scrape target have been received and ingested and there are no 
unexpected errors, which means that just like other metrics from the 
scrape, up{} can only be visible after the scrape has finished (and 
Prometheus knows whether it succeeded or not). 


Yes, I'd have assumed so as well. Therefore I generally shifted both alerts 
by 10s, hoping that 10s is enough for all that.

 

How long scrapes take is variable and can be up to almost their timeout 
interval. You may wish to check 'scrape_duration_seconds'. Our metrics 
suggest that this can go right up to the timeout (possibly in the case 
of failed scrapes). 


Interesting. 

I see the same (I mean entries that go right up to, and even a bit above, 
the timeout). It would be interesting to know whether these are ones that 
still made it "just in time" (despite actually taking a bit longer than the 
timeout)... or whether these are only ones that timed out and were 
discarded.
Because the name scrape_duration_seconds would kind of imply that it's the 
former, but I guess it's actually the latter.

So what do you think that means for me and my solution now? That I should 
shift all my checks even further? That is, by at least the scrape_timeout + 
some extra time for the data getting into the TSDB?


Thanks,
Chris.



Re: [prometheus-users] query for time series misses samples (that should be there), but not when offset is used

2024-04-04 Thread Christoph Anton Mitterer
Hey Chris, Brian.

Thanks for your replies/confirmations.


On Sunday, March 24, 2024 at 8:16:14 AM UTC+1 Ben Kochie wrote:

Yup, this is correct. Prometheus sets the timestamp of the sample at the 
start of the scrape. But since it's an ACID compliant database, the data is 
not queryable until after it's been fully ingested.

This is intentional, because the idea is that whatever atomicity is desired 
by the target is handled by the target. Any locks taken are done when the 
target receives the GET /metrics. The exposition formatting, compression, 
and wire transfer time should not impact the "time when the sample was 
gathered".


Does make sense, yes... but was that documented somewhere? I think it would 
be helpful if e.g. the page about the querying basics mentioned these two 
properties:
- that data is only returned once it has fully arrived, and thus may not be 
returned even if the query is after the sample time
- that Prometheus "adjusts" the timestamps within a certain range
 

And yes, the timing is a tiny bit faked. There are some hidden flags that 
control this behavior.

--scrape.adjust-timestamps
--scrape.timestamp-tolerance 

The default allows up to 2ms (+-0.002) of timing jitter to be ignored. This 
was added in 2020 due to a regression in the accuracy of the Go internal 
timer functions.

See: https://github.com/prometheus/prometheus/issues/7846


Makes sense, too. And is actually vital for what I do over in 
https://groups.google.com/g/prometheus-users/c/BwJNsWi1LhI/m/ik2OiRa2AAAJ 

Just out of curiosity, what happens, if the jitter is more than the +-0.002?

Thanks,
Chris.



Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-04-04 Thread Christoph Anton Mitterer
Hey.

On Friday, March 22, 2024 at 9:20:45 AM UTC+1 Brian Candler wrote:

You want to "capture" single scrape failures?  Sure - it's already being 
captured.  Make yourself a dashboard.


Well as I've said before, the dashboard always has the problem that someone 
actually needs to look at it.
 

But do you really want to be *alerted* on every individual one-time scrape 
failure?  That goes against the whole philosophy of alerting 
<https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit>,
 
where alerts should be "urgent, important, actionable, and real".  A single 
scrape failure is none of those.


I guess in the end I'll see whether or not I'm annoyed by it. ;-)
 

How often do you get hosts where:
(1) occasional scrape failures occur; and
(2) there are enough of them to make you investigate further, but not 
enough to trigger any alerts?


So far I've seen two kinds of nodes, those where I never get scrape errors, 
and those where they happen regularly - and probably need investigation.


Anyway... I think I might have found a solution, which - if some
assumptions I've made are correct - I'm somewhat confident works,
even in the strange cases.


The assumptions I've made are basically these:
- Prometheus does that "faking" of sample times, and thus these are
  always on point with exactly the scrape interval between each.
  This in turn should mean, that if I have e.g. a scrape interval of
  10s, and I do up[20s], then regardless of when this is done, I get
  at least 2 samples, and in some rare cases (when the evaluation
  happens exactly on a scrape time), 3 samples.
  Never more, never less.
  Which for `up` I think should be true, as Prometheus itself
  generates it, right, and not the exporter that is scraped.
- The evaluation interval is sufficiently less than the scrape
  interval, so that it's guaranteed that none of the `up`-samples are
  being missed.
- After some small time (e.g. 10s) it's guaranteed that all samples
  are in the TSDB and a query will return them.
  (basically, to counter the observation I've made in
  https://groups.google.com/g/prometheus-users/c/mXk3HPtqLsg )
- Both alerts run in the same alert group, and that means (I hope) that
  each query in them is evaluated with respect to the very same time.

With that, my final solution would be:
- alert: general_target-down   (TD below)
  expr: 'max_over_time(up[1m] offset 10s) == 0'
  for:  0s
- alert: general_target-down_single-scrapes   (TDSS below)
  expr: 'resets(up[20s] offset 60s) >= 1  unless  max_over_time(up[50s] offset 10s) == 0'
  for:  0s

And that seems to actually work for at least practical cases (of
course it's difficult to simulate the cases where the evaluation
happens right on time of a scrape).

For anyone who'd ever be interested in the details, and why I think that 
works in all cases,
I've attached the git logs where I describe the changes in my config git 
below.

Thanks to everyone for helping me with that :-)

Best wishes,
Chris.


(needs a mono-spaced font to work out nicely)
TL/DR:
-
commit f31f3c656cae4aeb79ce4bfd1782a624784c1c43
Author: Christoph Anton Mitterer 
Date:   Mon Mar 25 02:01:57 2024 +0100

alerts: overhauled the `general_target-down_single-scrapes`-alert

This is a major overhaul of the 
`general_target-down_single-scrapes`-alert,
which turned out to have been quite an effort that went over several 
months.

Before this branch was merged, the 
`general_target-down_single-scrapes`-alert
(from now on called “TDSS”) had various issues.
While the alert did stop to fire, when the `general_target-down`-alert 
(from now
on called “TD”) started to do so, that alone meant that it would still 
also fire
when scrapes failed which eventually turned out to be an actual TD.
For example the first few (< ≈7) `0`s would have caused TDSS to fire 
which would
seamlessly be replaced by a firing TD (unless any `1`s came in between).

Assumptions made below:
• The scraping interval is `10s`.
• If a (single) time series for the `up`-metric is given like `0 1 0 0 
1`, the
  time goes from left (farther back in time) to right (less farther 
back in
  time).

I) Goals

There should be two alerts:
• TD
  Is for general use and similar to Icinga’s concept of host being `UP` 
or
  `DOWN` (with the minor difference, that an unreachable Prometheus 
target does
  not necessarily mean that a host is `DOWN` in that sense).
  It should fire after scraping has failed for some time, for example 
one
  minute (which is assumed form now on).
• TDSS
  Since Prometheus is all about monitoring metrics, it’s of interest 
whether the
  scraping fails, even if it’s only every now and then for very short 
amount of
  times, because in that ca

[prometheus-users] query for time series misses samples (that should be there), but not when offset is used

2024-03-22 Thread Christoph Anton Mitterer
Hey.

I noticed a somewhat unexpected behaviour, perhaps someone can explain why 
this happens.

- on a Prometheus instance, with a scrape interval of 10s
- doing the following queries via curl from the same node where Prometheus 
runs (so there cannot be any different system times or so)

Looking at the sample times via e.g.:
$ while true; do curl -g 'http://localhost:9090/api/v1/query?query=up[1m]' 2> /dev/null | jq '.data.result[0].values' | grep '[".]' | paste - - ; echo; sleep 1 ; done

the timings look super tight:
1711148768.175,"1"
1711148778.175,"1"
1711148788.175,"1"
1711148798.175,"1"
1711148808.175,"1"
1711148818.175,"1"

1711148768.175,"1"
1711148778.175,"1"
1711148788.175,"1"
1711148798.175,"1"
1711148808.175,"1"
1711148818.175,"1"

I.e. it's *always* .175. I guess in reality it may not actually be that 
tight, and Prometheus just sets the timestamps artificially... but that 
doesn't matter for me.

When now doing a query like (in a while loop with no delay):
up[1m]
and counting the number of samples, I'd expect to always get either 6 
samples or perhaps 7 (if my query happened exactly at a .175 time).

But since the sample times are so super tight, I'd not expect to ever get 
less than 6.
But that's just what happens:

1711148408.137921942
1711148408.179789148
1711148408.190407865
1711148408.239896472

1711148358.175,"1"
1711148368.175,"1"
1711148378.175,"1"
1711148388.175,"1"
1711148398.175,"1"

1711148408.249002352
1711148408.287031384

1711148358.175,"1"
1711148368.175,"1"
1711148378.175,"1"
1711148388.175,"1"
1711148398.175,"1"

1711148408.294628944
1711148408.342150984
1711148408.351871893
1711148408.405270701

Here, the non indented times, are timestamps from before and after the 
whole curl .. | .. pipe.
The indented lines are the samples + timestamps in those cases, where != 6 
are returned, done via something similar hacky like:
$ while true; do f="$( date +%s.%N >&2; curl -g 'http://localhost:9090/api/v1/query?query=up[1m]' 2> /dev/null | jq '.data.result[0].values' | grep '[".]' | paste - - ; date +%s.%N >&2)"; if [ "$( printf '%s\n' "$f" | wc -l)" != 8 ]; then printf '\n%s\n\n' "$f"; fi ; done

One sees that both timestamps, before and after the curl, are already past 
the .175, yet the most recent sample (which should already be there - and 
which in fact shows up at 1711148408.175 in later queries) is still missing.

Interestingly, when doing these queries offset 10s (beware that curl 
requires %20 as space)... none of this happens and I basically always get 6 
samples - as more or less expected.

[I say more or less, because I wonder,... whether it's possible to get 7 
... should it be?]

Any ideas why? And especially also why not with an offset?


Thanks,
Chris.



Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-03-21 Thread Christoph Anton Mitterer

I've been looking into possible alternatives, based on the ideas given here.

I) First, one completely different approach might be:

- alert: target-down
  expr: 'max_over_time( up[1m0s] ) == 0'
  for: 0s

combined ("and") with one of:

- alert: single-scrape-failure
  expr: 'min_over_time( up[2m0s] ) == 0'
  for: 1m
or
- alert: single-scrape-failure
  expr: 'resets( up[2m0s] ) > 0'
  for: 1m
or perhaps even
- alert: single-scrape-failure
  expr: 'changes( up[2m0s] ) >= 2'
  for: 1m
(which would however behave a bit differently, I guess)

plus an inhibit rule, that silences single-scrape-failure when
target-down fires.
The for: 1m is needed, so that target-down has a chance to fire
(and inhibit) before single-scrape-failure does.
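
Roughly like this in alertmanager.yml, I'd assume (a sketch, assuming the 
alert names from above and that instance/job are the labels to match on):

inhibit_rules:
  - source_matchers: [ 'alertname="target-down"' ]
    target_matchers: [ 'alertname="single-scrape-failure"' ]
    # only inhibit when both alerts are about the same target
    equal: [ 'instance', 'job' ]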

I'm not really sure whether that works in all cases, though,
especially since I look back much further (and the additional time
span further back may undesirably trigger again).


Using for: > 0 seems generally a bit fragile for my use-case (because I 
want to capture even single scrape failures, but with for: > 0 it needs at 
least two evaluations to actually trigger, so my evaluation interval must be 
small enough that the rule is evaluated >= 2 times during the scrape 
interval).

Also, I guess the scrape intervals and the evaluation intervals are not 
synced, so with for: 0s, when I look back e.g. [1m] and assume a certain 
number of samples in that range, there may actually be more or fewer.


If I forget about the above approach with inhibiting, then I need to 
consider cases like:
time>
- 0 1 0 0 0 0 0 0
first zero should be a single-scrape-failure, the last 6 however a
target-down
- 1 0 0 0 0 0 1 0 0 0 0 0 0
same here, the first 5 should be a single-scrape-failure, the last 6
however a target-down
- 1 0 0 0 0 0 0 1 0 0 0 0 0 0
here however, both should be target-down
- 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0
or
1 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0
here, 2x target-down, 1x single-scrape-failure




II) Using the original {min,max}_over_time approach:
- min_over_time(up[1m]) == 0
tells me, there was at least one missing scrape in the last 1m.
but that alone would already be the case for the first zero:
. . . . . 0
so:
- for: 1m
was added (and the [1m] was enlarged)
but this would still fire with
0 0 0 0 0 0 0
which should however be a target-down
so:
- unless max_over_time(up[1m]) == 0
was added to silence it then
but that would still fail in e.g. the case when a previous
target-down runs out:
0 0 0 0 0 0 -> target down
the next is a 1
0 0 0 0 0 0 1 -> single-scrape-failure
and some similar cases,

Plus the usage of for: >0s is - in my special case - IMO fragile.



III) So in my previous mail I came up with the idea of using:
- alert: target-down
  expr: 'max_over_time( up[1m0s] ) == 0'
  for: 0s
- alert: single-scrape-failure
  expr: 'min_over_time(up[15s] offset 1m) == 0 unless max_over_time(up[1m0s]) == 0 unless max_over_time(up[1m0s] offset 1m10s) == 0 unless max_over_time(up[1m0s] offset 1m) == 0 unless max_over_time(up[1m0s] offset 50s) == 0 unless max_over_time(up[1m0s] offset 40s) == 0 unless max_over_time(up[1m0s] offset 30s) == 0 unless max_over_time(up[1m0s] offset 20s) == 0 unless max_over_time(up[1m0s] offset 10s) == 0'
  for: 0m

The idea was that, when I don't use for: >0s, the first time window where
one can be really sure (in all cases) whether it's a single-scrape-failure
or a target-down is a 0 in the -70s to -60s slot.
For example (10s slots going from -130s on the left to 0s/now on the right,
"." = no sample shown for that slot):

case 1:  .  .  .  .  .  .  .  .  .  .  1  0  1
case 2:  .  .  .  .  .  .  0  0  0  0  0  0  0
case 3:  .  .  .  1  0  0  0  0  0  0  0  1  1

In case 1 it would already be clear when the zero is between -20 and -10.
But if there's a sequence of zeros, it takes up to -70s to -60s before it
becomes clear.

Now the zero in that time span could also be that of a target-down
sequence of zeros like in case 3.
For these cases, I had the shifted silencers that each looked over
1m.

Looked good at first, though there were some open questions.
At least one main problem, namely it would fail in e.g. this case
(same 10s slots as above; the -70s to -60s slot contains both a 0 and a 1):

case 8a:  1  1  1  1  1  1  (0 1)  0  0  0  0  0  0

The zero between -70s and -60s would be noticed, but still be
silenced, because the one would not.




Chris Siebenmann suggested to use resets()... and keep_firing_for:, which 
Ben Kochie suggested, too.

First, I didn't quite understand how the latter would help me. Maybe I have 
the wrong mindset for it, so could you guys please explain what your idea 
was with keep_firing_for:?




IV) resets() sounded promising at first, but while I tried quite some
variations, I wasn't able to get anything working.
First, something like
resets(up[1m]) >= 1
alone (with or without a for: >0s) would already fire in case of:
time>
1 0
which still could become a target-down but also in case of:
1 0 0 0 0 0 0
which is a target down.
And I think even 

Re: [prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-03-17 Thread Christoph Anton Mitterer
Hey Chris.

On Sun, 2024-03-17 at 22:40 -0400, Chris Siebenmann wrote:
> 
> One thing you can look into here for detecting and counting failed
> scrapes is resets(). This works perfectly well when applied to a
> gauge

Though it is documented as to be only used with counters... :-/


> that is 1 or 0, and in this case it will count the number of times
> the
> metric went from 1 to 0 in a particular time interval. You can
> similarly
> use changes() to count the total number of transitions (either 1->0
> scrape failures or 0->1 scrapes starting to succeed after failures).

The idea sounds promising... especially to also catch cases like that
8a, I've mentioned in my previous mail and where the
{min,max}_over_time approach seems to fail.


> It may also be useful to multiply the result of this by the current
> value of the metric, so for example:
> 
>   resets(up{..}[1m]) * up{..}
> 
> will be non-zero if there have been some number of scrape failures
> over
> the past minute *but* the most recent scrape succeeded (if that
> scrape
> failed, you're multiplying resets() by zero and getting zero). You
> can
> then wrap this in an '(...) > 0' to get something you can maybe use
> as
> an alert rule for the 'scrapes failed' notification. You might need
> to
> make the range for resets() one step larger than you use for the
> 'target-down' alert, since resets() will also be zero if up{...} was
> zero all through its range.
> 
> (At this point you may also want to look at the alert
> 'keep_firing_for'
> setting.)

I will give that some more thinking and reply back if I should find
some way to make an alert out of this.

Well, and probably also if I fail to ^^ ... at least at first glance I
wasn't able to use that to create an alert that would behave as
desired. :/
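
The most literal translation of that suggestion into a rule would be 
something like this (untested, and as said it doesn't behave as desired in 
all cases):

- alert: single-scrape-failure
  expr: '( resets(up[1m]) * up ) > 0'
  for:  0s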


> However, my other suggestion here would be that this notification or
> count of failed scrapes may be better handled as a dashboard or a
> periodic report (from a script) instead of through an alert,
> especially
> a fast-firing alert.

Well the problem with a dashboard would IMO be, that someone must
actually look at it or otherwise it would be pointless. ;-)

Not really sure how to do that with a script (which I guess would be
conceptually similar to an alert... just that it's sent e.g. weekly).

I guess I'm not so much interested in the exact times, when single
scrapes fail (I cannot correct it retrospectively anyway) but just
*that* it happens and that I have to look into it.

My assumption kinda is, that normally scrapes aren't lost. So I would
really only get an alert mail if something's wrong.
And even if the alert is flaky, like in 1 0 1 0 1 0, I think the mail
volume could still be reduced, but on the alertmanager level?


> I think it will be relatively difficult to make an
> alert give you an accurate count of how many times this happened; if
> you
> want such a count to make decisions, a dashboard (possibly
> visualizing
> the up/down blips) or a report could be better. A program is also in
> the
> position to extract the raw up{...} metrics (with timestamps) and
> then
> readily analyze them for things like how long the failed scrapes tend
> to
> last for, how frequently they happen, etc etc.

Well, that sounds like quite some effort... and I already think that my
current approaches required far too much of an effort (and still don't
fully work ^^).
As said... despite not really being comparable to Prometheus: in
Icinga a failed sensor probe would be immediately noticeable.


Thanks,
Chris.



[prometheus-users] Re: better way to get notified about (true) single scrape failures?

2024-03-17 Thread Christoph Anton Mitterer
Hey there.

I eventually got back to this and I'm still fighting this problem.

As a reminder, my goal was:
- if e.g. scrapes fail for 1m, a target-down alert shall fire (similar to
  how Icinga would put the host into down state, after pings failed or a
  number of seconds)
- but even if a single scrape fails (which alone wouldn't trigger the above
  alert) I'd like to get a notification (telling me, that something might be
  fishy with the networking or so), that is UNLESS that single failed scrape
  is part of a sequence of failed scrapes that also caused / will cause the
  above target-down alert

Assuming in the following, each number is a sample value with ~10s distance 
for
the `up` metric of a single host, with the most recent one being the 
right-most:
- 1 1 1 1 1 1 1 => should give nothing
- 1 1 1 1 1 1 0 => should NOT YET give anything (might be just a single 
failure,
   or develop into the target-down alert)
- 1 1 1 1 1 0 0 => same as above, not clear yet
...
- 1 0 0 0 0 0 0 => here it's clear, this is a target-down alert

In the following:
- 1 1 1 1 1 0 1
- 1 1 1 1 0 0 1
- 1 1 1 0 0 0 1
...
should eventually (not necessarily after the right-most 1, though) all give 
a
"single-scrape-failure" (even though it's more than just one - it's not a
target-down), simply because there's a 0s but for a time span less than 1m.

- 1 0 1 0 0 0 0 0 0
should give both, a single-scrape-failure alert (the left-most single 0) 
AND a
target-down alert (the 6 consecutive zeros)

-   1 0 1 0 1 0 0 0
should give at least 2x a single-scrape-failure alert, and for the leftmost
zeros, it's not yet clear what they'll become.
-   0 0 0 0 0 0 0 0 0 0 0 0  (= 2x six zeros)
should give only 1 target-down alert
- 0 0 0 0 0 0 1 0 0 0 0 0 0  (= 2x six zeros, separated by a 1)
should give 2 target-down alerts

Whether each of such alerts (e.g. in the 1 0 1 0 1 0 ...) case actually 
results
in a notification (mail) is of course a different matter, and depends on the
alertmanager configuration, but at least the alert should fire and with the 
right
alert-manager config one should actually get a notification for each single 
failed
scrape.


Now, Brian has already given me some pretty good ideas how to do that; 
basically the ideas were:
(assuming that 1m makes the target down, and a scrape interval of 10s)

For the target-down alert:
a) expr: 'up == 0'
   for:  1m
b) expr: 'max_over_time(up[1m]) == 0'
   for:  0s
=> here (b) was probably better, as it would use the same condition as is also used
   in the alert below, and there can be no weird timing effects depending on the
   for: and when these are actually evaluated.

For the single-scrape-failiure alert:
A) expr: min_over_time(up[1m20s]) == 0 unless max_over_time(up[1m]) == 0
   for: 1m10s
   (numbers a bit modified from Brian's example, but I think the idea is 
the same)
B) expr: min_over_time(up[1m10s]) == 0 unless max_over_time(up[1m10s]) == 0
   for: 1m

=> I did test (B) quite a lot, but there was at least still one case where 
it failed
   and that was when there were two consecutive but distinct target-down 
errors, that
   is:
   0 0 0 0 0 0 1 0 0 0 0 0 0  (= 2x six zeros, separated by a 1)
   which would eventually look like e.g. 
   0 1 0 0 0 0 0 0   or   0 0 1 0 0 0 0 0
   in the above check, and thus trigger (via the left-most zeros) a false
   single-scrape-failiure alert.

=> I'm not so sure whether I truly understand (A)... especially with respect to any
   niche cases, when there's jitter or so (plus, IIRC, it also failed in the case
   described for (B)).


One approach I tried in the meantime was to use sum_over_time... the idea was
simply to check how many ones there are for each case. But it turns out that even if
everything runs normally, the sum is not stable... sometimes, over [1m], I got only 5,
whereas most times it was 6.
Not really sure how that comes, because the printed timestamps for each sample seem to
be super accurate (all the time), but the sum wasn't.
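
As an aside on counting within such a window: sum_over_time adds up the 
values (i.e. counts the successful scrapes), while count_over_time counts 
how many samples are present at all -

sum_over_time(up[1m])     # number of 1s, i.e. successful scrapes in the window
count_over_time(up[1m])   # number of samples in the window, whether 0 or 1

- when all scrapes succeed the two are equal, so the instability is really 
in how many samples fall into the window.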


So I tried a different approach now, based on the above from Brian,... 
which at least in
tests looks promising so far... but I'd like to hear what experts think 
about it.

- both alerts have to be in the same alert groups (I assume this assures 
they're then
  evaluated in the same thread and at the "same time" (that is, with 
respect to the same
  reference timestamp).
- in my example I assume a scrape time of 10s and evaluation interval of 7s 
(not really
  sure whether the latter matters or could be changed while the rules stay 
the same - and
  it would still work or not)
- for: is always 0s ... I think that's good, because at least to me it's 
unclear, how
  things are evaluated if the two alerts have different values for for:, 
especially in
  border cases.
- rules:
- alert: target-down
  expr: 'max_over_time( up[1m0s] )  ==  0'
  for:  0s
- alert: single-scrape-failure
  expr: 'min_over_time(up[15s] offset 1m) == 0 unless 

[prometheus-users] Re: better way to get notified about (true) single scrape failures?

2023-05-12 Thread Christoph Anton Mitterer
Hey Brian

On Wednesday, May 10, 2023 at 9:03:36 AM UTC+2 Brian Candler wrote:

It depends on the exact semantics of "for". e.g. take a simple case of 1 
minute rule evaluation interval. If you apply "for: 1m" then I guess that 
means the alert must be firing for two successive evaluations (otherwise, 
"for: 1m" would have no effect).


Seems you're right.

I did quite some testing meanwhile with the following alertmanager route 
(note, that I didn't use 5m, but 1m... simply in order to not have to wait 
so long):
  routes:
  - match_re:
      alertname: 'td.*'
    receiver:   admins_monitoring
    group_by:   [alertname]
    group_wait: 0s
    group_interval: 1s

and the following rules:
groups:
  - name: alerts_general_single-scrapes
    interval: 15s
    rules:
    - alert: td-fast
      expr: 'min_over_time(up[75s]) == 0 unless max_over_time(up[75s]) == 0'
      for:  1m
    - alert: td
      expr: 'up == 0'
      for:  1m

My understanding is, correct me if wrong, that basically prometheus would 
run a thread for the scrape job (which in my case would have an interval of 
15s) and another one that evaluates the alert rules (above every 15s) which 
then sends the alert to the alertmanager (if firing).

It felt a bit brittle to have the rules evaluated with the same period as 
the scrapes, so I did all tests once with 15s for the rules interval, and 
once with 10s. But it seems as if this wouldn't change the behaviour.


But up[5m] only looks at samples wholly contained within a 5 minute window, 
and therefore will normally only look at 5 samples.


As you can see above,... I had already noticed that you were indeed right 
before, and if my for: is e.g. 4 * evaluation_interval(15s) = 1m ... I need 
to look back 5 * evaluation_interval(15s) = 75s

At least in my tests, that seemed to cause the desired behaviour, except 
for one case:
When my "slow" td fires (i.e. after 5 consecutive "0"s) and then there 
is... within (less than?) 1m, another sequence of "0"s that eventually 
cause a "slow" td. In that case, td-fast fires for a while, until it 
directly switches over to td firing.

Was your idea above with something like:
>expr: min_over_time(up[8m]) == 0 unless max_over_time(up[6m]) == 0
>for: 7m
intended to fix that issue?

Or could one perhaps use 
ALERTS{alertname="td",instance="lcg-lrz-ext.grid.lrz.de",job="node"}[??s] 
== 1 somehow, to check whether it did fire... and then silence the false 
positive.

 

  (If there is jitter in the sampling time, then occasionally it might look 
at 4 or 6 samples)


Jitter in the sense that the samples are taken at slightly different times?
Do you think that could affect the desired behaviour? I would intuitively 
expect that it rather only causes the "base duration" not to be exactly e.g. 
1m... so e.g. instead of taking 1m for the "slow" td to fire, it would 
happen +/- 15s earlier (and conversely for td-slow).


Another point I basically don't understand... how does all that relate to 
the scrape intervals?
The plain up == 0 simply looks at the most recent sample (going back up to 
5m as you've said in the other thread).

The series up[Ns] looks back N seconds, giving whichever samples are within 
there and now. AFAIU, there it doesn't go "automatically" back any further 
(like the 5m above), right?

In order for the for: to work I need at least two samples... so doesn't 
that mean that as soon as any scrape interval is > for:-time (1m) / 2 = ~30s 
(in the above example), the above two alerts will never fire, even if it's down?

So if I had e.g. some jobs scraping only every 10m ... I'd need another 
pair of td/td-fast alerts, which then filter on the job 
(up{job="longRunning"}) and either only have the td (if that makes sense)... 
or a td-fast for when one of the every-10m scrapes fails, and an even longer 
"slow" td for when that fails for e.g. 1h.


If what I've written above is correct (and it may well not be!), then

expr: up == 0
for: 5m

will fire if "up" is zero for 6 cycles, whereas


As far as I understand you... 6 cycles of rule evaluation interval... with 
at least two samples within that interval, right?
 

... unless max_over_time(up[5m])

will suppress an alert if "up" is zero for (usually) 5 cycles.



 Last but not least an (only) partially related question:

Once an alert fires (in prometheus), even if just for one evaluation 
interval cycle, and there is no inhibition rule or so in alertmanager... 
is it expected that a notification is sent out for sure... regardless of 
alertmanager's grouping settings?
Like when the alert fires for one short 15s evaluation interval and clears 
again afterwards... but group_wait: is set to some 7d... is it expected 
to send that single firing event after 7d, even if it has resolved already 
once the 7d are over and there was e.g. no further firing in between?


Thanks a lot :-)
Chris.


[prometheus-users] Re: better way to get notified about (true) single scrape failures?

2023-05-09 Thread Christoph Anton Mitterer
Hey Brian.

On Tuesday, May 9, 2023 at 9:55:22 AM UTC+2 Brian Candler wrote:

That's tricky to get exactly right. You could try something like this 
(untested):

expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
for: 5m

- min_over_time will be 0 if any single scrape failed in the past 5 minutes
- max_over_time will be 0 if all scrapes failed (which means the 'standard' 
failure alert should have triggered)

Therefore, this should alert if any scrape failed over 5 minutes, unless 
all scrapes failed over 5 minutes.


Ah that seems a pretty smart idea.

And the for: is needed to make it actually "count", as the [5m] only looks 
back 5m, but there, max_over_time(up[5m]) would have likely been still 1 
while min_over_time(up[5m]) would already be 0, and if one had then e.g. 
for: 0s, it would fire immediately.
 

There is a boundary condition where if the scraping fails for approximately 
5 minutes you're not sure if the standard failure alert would have 
triggered.


You mean like the above one wouldn't fire cause it thinks it's the 
long-term alert, while that wouldn't fire either, because it has just 
resolved then?
 
 

Hence it might need a bit of tweaking for robustness. To start with, just 
make it over 6 minutes:

expr: min_over_time(up[6m]) == 0 unless max_over_time(up[6m]) == 0
for: 6m

That is, if max_over_time[6m] is zero, we're pretty sure that a standard 
alert will have been triggered by then.


That one I don't quite understand.
What if e.g. the following scenario happens (with each line giving the 
state 1m after the one before):

                                                   for=6          for=5
m    -5 -4 -3 -2 -1  0    for  min[6m]  max[6m]  result/short  result/long
up:   1  1  1  1  1  0     1      0        1      pending       pending
up:   1  1  1  1  0  0     2      0        1      pending       pending
up:   1  1  1  0  0  0     3      0        1      pending       pending
up:   1  1  0  0  0  0     4      0        1      pending       pending
up:   1  0  0  0  0  0     5      0        1      pending       fire
up:   0  0  0  0  0  1     6      0        1      fire          clear

After 5m, the long-term alert would fire; after that the scraping would 
succeed again, but AFAIU the "special" alert for the short ones would still 
be true at that point and then start to fire, even though all the previous 5 
zeros have actually been reported as part of a long-down alert.


I'm still not quite convinced about the "for: 6m" and whether we might lose 
an alert if there were a single failed scrape. Maybe this would be more 
sensitive:

expr: min_over_time(up[8m]) == 0 unless max_over_time(up[6m]) == 0
for: 7m

but I think you might get some spurious alerts at the *end* of a period of 
downtime.


That also seems quite complex. And I guess it might have the same possible 
issue from above?

The same should be the case if one would do:
expr: min_over_time(up[6m]) == 0 unless max_over_time(up[5m]) == 0
for: 6m
It may be just 6m ago that there was a "0" (from a long alert) and the last 
5m there would have been "1"s. So the short-alert would fire, despite it's 
unclear whether the "0" 6m ago was really just a lonely one or the end of a 
long-alert period.

Actually, I think, any case where the min_over_time goes further back than 
the long-alert's for:-time should have that.


expr: min_over_time(up[5m]) == 0 unless max_over_time(up[6m]) == 0
for: 5m
would also be broken, IMO, because if 6m ago there was a "1", only the 
min_over_time(up[5m]) == 0 would remain (and nothing would silence the 
alert if needed)... if 6m ago there was a "0", it should effectively be the 
same as using [5m]?


Isn't the problem from the very above already solved by placing both alerts 
in the same rule group?

https://prometheus.io/docs/prometheus/latest/configuration/recording_rules/ 
says:
"Recording and alerting rules exist in a rule group. Rules within a group 
are run sequentially at a regular interval, with the same evaluation time."
which I guess applies also to alert rules.

Not sure if I'm right, but I think if one places both rules in the same 
group (and I think even the order shouldn't matter?), then the original:
expr: min_over_time(up[5m]) == 0 unless max_over_time(up[5m]) == 0
for: 5m
with 5m being the "for:"-time of the long-alert should be guaranteed to 
work... in the sense that if the above doesn't fire... the long-alert does.

Unless of course the grouping settings at alertmanager cause trouble... 
which I don't quite understand - especially, once an alert fires, even if 
just briefly... is it guaranteed that a notification is sent?
Because as I wrote before, that didn't seem to be the case.

Last but not least, if my assumption is true and your 1st version would 
work if both alerts are in the same group... how would the interval then 
matter? Would it still need to be the smallest scrape time (I guess so)?


Thanks,
Chris.


[prometheus-users] better way to get notified about (true) single scrape failures?

2023-05-08 Thread Christoph Anton Mitterer
Hey.

I have an alert rule like this:

groups:
  - name: alerts_general
    rules:
    - alert: general_target-down
      expr: 'up == 0'
      for:  5m

which is intended to notify about a target instance (respectively a 
specific exporter on that) being down.

There are also routes in alertmanager.yml which have some "higher" periods 
for group_wait and group_interval and also distribute that resulting alerts 
to the various receivers (e.g. depending on the instance that is affected).


By chance I've noticed that some of our instances (or the networking) seem 
to be a bit unstable, and every now and then a single scrape or a few fail.

Since this does typically not mean that the exporter is down (in the above 
sense) I wouldn't want that to cause a notification to be sent to people 
responsible for the respective instances.
But I would want to get one sent, even if only a single scrape fails, to 
the local prometheus admin (me ^^), so that I can look further, what causes 
the scrape failures.



My (working) solution for that is:
a) another alert rule like:
groups:
  - name: alerts_general_single-scrapes
    interval: 15s
    rules:
    - alert: general_target-down_single-scrapes
      expr: 'up{instance!~"(?i)^.*\\.garching\\.physik\\.uni-muenchen\\.de$"} == 0'
      for:  0s

(With 15s being the smallest scrape time used by any jobs.)

And a corresponding alertmanager route like:
  - match:
      alertname: general_target-down_single-scrapes
    receiver:   admins_monitoring_no-resolved
    group_by:   [alertname]
    group_wait: 0s
    group_interval: 1s


The group_wait: 0s and group_interval: 1s seemed necessary, because despite 
the for: 0s, it seems that alertmanager kind of checks again before 
actually sending a notification... and when the alert is gone by then 
(because there was e.g. only one single missing scrape) it wouldn't send 
anything (despite the alert actually having fired).


That works so far... that is admins_monitoring_no-resolved get a 
notification for every single failed scrape while all others only get them 
when they fail for at least 5m.

I even improved the above a bit, by clearing the alert for single failed 
scrapes, when the one for long-term down starts firing via something like:
  expr: '( up{instance!~"(?i)^.*\\.ignored\\.hosts\\.example\\.org$"} == 0 )  unless on (instance,job)  ( ALERTS{alertname="general_target-down", alertstate="firing"} == 1 )'


I wondered wheter this can be done better?

Ideally I'd like to get notification for general_target-down_single-scrapes 
only sent, if there would be no one for general_target-down.

That is, I don't care if the notification comes in late (by the above ~ 
5m), it just *needs* to come, unless - of course - the target is "really" 
down (that is when general_target-down fires), in which case no 
notification should go out for general_target-down_single-scrapes.


I couldn't think of an easy way to get that. Any ideas?


Thanks,
Chris.



[prometheus-users] Re: how to make sure a metric is to be checked is "there"

2023-04-27 Thread Christoph Anton Mitterer
Hey again.

On Wednesday, April 26, 2023 at 9:35:32 AM UTC+2 Brian Candler wrote:

> expr: up{job="myjob"} == 1 unless my_metric

Beware with that, that it will only work if the labels on both 'up' and 
'my_metric' match exactly.  If they don't, then you can either use on(...) 
to specify the set of labels which match, or ignoring(...) to specify the 
ones which don't.

You could start with:

expr: up{job="myjob"} == 1 unless on (instance) my_metric


Ah. I see.
I guess one should use on(...) rather than ignoring(...) because one 
doesn't really know which labels may get added, right?

Also, wouldn't it be better to also consider the "job" label?
   expr: up{job="myjob"} == 1 unless on (instance, job) my_metric
because AFAIU, job is set by Prometheus itself, so if I operate on it as 
well, I can make sure that my_metric is really from the desired job - and 
not perhaps from some other job that wrongly exports a metric of that name.
Does that make sense?

 

but I believe this will break if there are multiple instances of my_metric 
for the same host. I'd probably do:

expr: up{job="myjob"} == 1 unless on (instance) count by (instance) 
(my_metric)


So with job that would be:
   expr: up{job="myjob"} == 1 unless on (instance,job) count by 
(instance,job) (my_metric)
 
but I don't quite understand why it's needed in the first place?!

If I do the previous:
  expr: up{job="myjob"} == 1 unless on (instance) my_metric
then even if for one given instance value (and optionally one given job 
value) there are multiple results for my_metric (just differing in other 
labels), like:
   
node_filesystem_free_bytes{device="/dev/vda1",fstype="vfat",mountpoint="/boot/efi"}  5.34147072e+08
node_filesystem_free_bytes{device="/dev/vda2",fstype="btrfs",mountpoint="/"}  1.2846592e+10
node_filesystem_free_bytes{device="/dev/vda2",fstype="btrfs",mountpoint="/data/btrfs-top-level-subvolumes/system"}  1.2846592e+10
(all with the same instance/job)

shouldn't the "unless on (instance)" still work? I mean, it wouldn't notice 
if only one time series were gone (like e.g. only device="/dev/vda1" 
above), but it should if all of them were gone?
But the count by would also only notice it if all were gone, because only 
then does it give back no data for the respective instance (and not just 0 
as value)?
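Anyway, the full rule I'd presumably end up with (untested sketch, using the 
megacli metric from my original mail as the example) would be roughly:

   - alert: node_raid-metrics-missing
     # the target itself is scraped fine, but the expected RAID metric is not there
     expr: up{job="node"} == 1 unless on (instance, job) count by (instance, job) (megacli_some_metric)
     for: 15m
     labels:
       severity: warning
     annotations:
       summary: 'megacli_some_metric missing on {{ $labels.instance }} although the target is up'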


Also, if a scrape does not contain a particular timeseries, but the 
previous scrape *did* contain that timeseries, then the timeseries is 
marked "stale" by storing a staleness marker.


 Is there a way to test for that marker in expressions?

 

So if you do see a value, it means:
- it was in the last scrape
- it was in the last 5 minutes 

- there has not been a subsequent scrape where the timeseries was missing


Ah, good to know.
 

> Is this with absent() also needed when I have all my targets/jobs 
statically configured?

Use absent() when you need to write an expression which you can't do as a 
join against another existing timeseries.


Okay, ... but AFAIU I couldn't use absent() to reproduce the effect of the 
above:
   up{job="myjob"} == 1 unless on (instance) my_metric
because if I'd do something like:
   absent(my_metric)
it would be empty as soon as there was at least one time series for the 
metric.
With that I could really only check for a specific time series to be 
missing like:
   absent(my_metric{instance="somehost",job="node"})
and would have to make one alert with a different expression for e.g. every 
instance.

Or is there any way to use absent() for the general case which I just don't 
see?


If you want to fire when foo exists now but did not exist 5 minutes ago 
(i.e. alert whenever a new metric is created), then

expr: foo unless foo offset 5m


No I think I'd only want alerts if something vanishes.
 

And yes, it will silence after 5 minutes. You don't want to send recovery 
messages on such alerts.


Sounds reasonable.

I wonder whether the expression is ideal:
The above form would already fire even if the value was missing just once, 
exactly 5m ago.
Wouldn't it be better to do something like:
   expr: foo unless foo offset 15s
   for: 5m
assuming a scrape interval of 15s?

With offset I cannot just specify the "previous" sample, right?

Is it somehow possible to do the above automatically for all metrics 
(and not just foo) from one expression?
And I guess one would again need to link that somehow with `up` to avoid 
useless errors?
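(What I have vaguely in mind for the single-metric case - completely 
untested, metric name made up and the 1h offset chosen arbitrarily - is 
something in this direction:)

   # fire for series that existed 1h ago but are gone now,
   # unless the whole target is down anyway
   expr: my_metric offset 1h unless my_metric unless on (instance, job) up == 0
   for: 5m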

 

> How does that work via smartmon?
Sorry, that was my brainfart. It's "storcli.py" that you want.  (Although 
collecting smartmon info is a good idea too).


Ah... I even saw that too, but had totally forgotten that they've renamed 
megacli.


Is there a list of some generally useful alerts, things like:
   up == 0
or like the above idea of checking for metrics that have vanished? Ideally 
with how to use them properly ;-)


 

Thanks,
Chris.


[prometheus-users] Re: restrict (respectively silence) alert rules to/for certain instances

2023-04-27 Thread Christoph Anton Mitterer
On Wednesday, April 26, 2023 at 9:14:35 AM UTC+2 Brian Candler wrote:

> I guess with (2) you also meant having a route which is then permanently 
muted?

I'd use a route with a null receiver (i.e. a receiver which has no 
*_configs under it)


Ah, interesting. It wasn't even clear to me from the documentation that 
this works, but as you say - it does.
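(For reference, what I've ended up with is roughly the following - receiver 
names, mail address and regex are made up here, and it assumes a recent 
Alertmanager with the matchers syntax:)

  receivers:
    - name: admins_monitoring
      email_configs:
        - to: 'monitoring@example.org'
    - name: 'null'   # no *_configs at all, so notifications are simply dropped

  route:
    receiver: admins_monitoring
    routes:
      - receiver: 'null'
        matchers:
          - instance =~ ".*\\.foreign\\.example\\.org"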

Nevertheless, it only suppresses the alert notifications, but e.g. within 
the AlertManager they would still show up as firing (as expected).

 

> b) The idea that I had above:
> - using alert_relabel_configs to filter on the instances and add a 
label if it should be silenced
> - use only that label in the expr instead of the full regex
> But would that even work?

No, because as far as I know alert_relabel_configs is done *after* the 
alert is generated from the alerting rule.


I've already assumed so from the documentation... thanks for the confirmation.

 

It's only used to add extra labels before sending the generated alert to 
alertmanager. (It occurs to me that it *might* be possible to use 'drop' 
rules here to discard alerts; that would be a very confusing config IMO)


What do you mean by drop rules?

 

> For me it's really like this:
> My Prometheus instance monitors:
> - my "own" instances, where I need to react on things like >85% usage on 
root filesystem (and thus want to get an alert)
> - "foreign" instances, where I just get the node exporter data and show 
e.g. CPU usage, IO usage, and so on as a convenience to users of our 
cluster - but any alert conditions wouldn't cause any further action on my 
side (and the guys in charge of those servers have their own monitoring)

In this situation, and if you are using static_configs or file_sd_configs 
to identify the hosts, then I would simply use a target label (e.g. 
"owner") to distinguish which targets are yours and which are foreign; or I 
would use two different scrape jobs for self and foreign (which means the 
"job" label can be used to distinguish them)


I had thought about that too, but the downside of it would be that I have 
to "hardcode" this into the labels within the TSDB. Even if storage is not 
a concern, what might happen sometimes is that a formerly "foreign" server 
moves into my responsibility.
Then I think things would get messy.

In general, TBH, it's also not really clear to me what the best practice is 
in terms of scrape jobs:

At one time I planned to use them to "group" servers that somehow belong 
together, e.g. in the case of a job for data from the node exporter, I 
would have made node_storage_servers, node_compute_servers or something 
like that.
But then I felt this could actually cause troubles later on, when I want to 
e.g. filter time series based on the job (or as above: when a server moves 
its roles).

So right now I put everything (from one exporter) in one job.
Not really sure whether this is stupid or not ;-)

 

The storage cost of having extra labels in the TSDB is essentially zero, 
because it's the unique combination of labels that identifies the 
timeseries - the bag of labels is mapped to an integer ID I believe.  So 
the only problem is if this label changes often, and to me it sounds like a 
'local' or 'foreign' instance remains this way indefinitely.


Arguably, for the above particular use case, it would be rather rare 
that it changes.
But for the node_storage_servers vs. node_compute_servers case... it would 
actually happen quite often in my environment.
 

If you really want to keep these labels out of the metrics, then having a 
separate timeseries with metadata for each instance is the next-best 
option. Suppose you have a bunch of metrics with an 'instance' label, e.g.

node_filesystem_free_bytes{instance="bar", ...}
node_filesystem_size_bytes{instance="bar", ...}
...

as the actual metrics you're monitoring, then you create one extra static 
timeseries per host (instance) like this:

meta{instance="bar",owner="self",site="london"} 1

(aside: TSDB storage for this will be almost zero, because of the 
delta-encoding used). These can be created by scraping a static webserver, 
or by using recording rules.

Then your alerting rules can be like this:

expr: |
  (
 ... normal rule here ...
  ) * on(instance) group_left(site) meta{owner="self"}

The join will:
* Limit alerting to those hosts which have a corresponding 'meta' 
timeseries (matched on 'instance') and which has label owner="self"
* Add the "site" label to the generated alerts

Beware that:

1. this will suppress alerts for any host which does not have a 
corresponding 'meta' timeseries. It's possible to work around this to 
default to sending rather than not sending alerts, but makes the 
expressions more complex:
https://www.robustperception.io/left-joins-in-promql

2.  the "instance" labels must match exactly. So for example, if you're 
currently scraping with the default label instance="foo:9100" then you'll 
need to change this to instance="foo" (which is good practice anyway).  See

[prometheus-users] Re: how to make sure a metric is to be checked is "there"

2023-04-25 Thread Christoph Anton Mitterer


On Tuesday, April 25, 2023 at 9:32:25 AM UTC+2 Brian Candler wrote:

I think you would have basically the same problem with Icinga unless you 
have configured Icinga with a list of RAID controllers which should be 
present on a given device, or a list of drives which should be present in a 
particular RAID array.


Well true, you still depend on the RAID tool to actually detect the 
controller and any RAIDs managed by that.

But Icinga would likely catch most real world issues that may happen by 
accident:
- raid tool not installed
- some wrong parameters used when invoking the tool (e.g. a new version 
that might have changed command names)
- permission issues (like the tool not run as root, broken sudo rules)
 

I'm not sure if you realise this, but the expression "up == 0" is not a 
boolean, it's a filter.  The metric "up" has many different timeseries, 
each with a different label set, and each with a value.  The PromQL 
expression "up" returns all of those timeseries.  The expression "up == 0" 
filters it down to a subset: just those timeseries where the value is 0.  
Hence this expression could return 0, 1 or more timeseries.  When used as 
an alerting expression, the alert triggers if the expression returns one or 
more timeseries (and regardless of the *value* of those timeseries).  When 
you understand this, then using PromQL for alerting makes much more sense.


Well I think that's clear... I have one (scalar) value in up for each 
target I scrape, e.g. if I have just node exporter running, I'd get one 
(scalar) value for the scraped node exporter of every instance.

But the problem is that this does not necessarily tell me if e.g. my raid 
status result was contained in that scraped data, does it?

It depends on the exporter... if I had a separate exporter just for the 
RAID metrics, then I'd be fine. But if it's part of a larger one, like node 
exporter, it would depend on whether that errors out just because the RAID 
data couldn't be determined. And I guess most exporters would by default 
just work fine if e.g. there were simply no RAID tools installed (which does 
make sense in a way).

But it would also mean that I wouldn't notice the error if e.g. I forgot 
to install the tool.
In Icinga I'd notice this, because I have the configured check per host. If 
that runs and doesn't find e.g. MegaCli... it would error out.

Prometheus OTOH knows just about the target (i.e. the host) and the 
exporter (e.g. node)... so it cannot really tell "ah... the RAID tool is 
missing"... unless node exporter had an option that would tell it to insist 
on RAID tool xyz being executed and fail otherwise.
That's basically what I'd like to do manually.


However, if the RAID controller card were to simply vanish, then yes the 
corresponding metrics would vanish - similarly if a drive were to vanish 
from an array, its status would vanish.


Well but that would usually also be unnoticed in the Icinga setup...  but 
it's also something that I think never really happens - and if it does one 
probably sees other errors like broken filesystems.

 

You can create alert expressions which check for a specific sentinel metric 
being present with absent(...), and you can do things like joining with the 
'up' metric, so you can say "if any target is being scraped, then alert me 
if that target doesn't return metric X".  It *is* a bit trickier to 
understand than a simple alerting condition, but it can be done.


I guess that sounds what I'd like to do. Thanks for the below pointers :-)

https://www.robustperception.io/absent-alerting-for-scraped-metrics/

expr: up{job="myjob"} == 1 unless my_metric

So my_metric would return "something" as soon as it was contained (in the 
most recent scrape!)... and if it wasn't, up{job="myjob"} == 1 would 
silence the "extra" error, in case it is NOT up anyway.

So in that case one should do always both:
- in general, check for any targets/jobs that are not up
- in specific (for e.g. very important metrics), additionally check for the 
specific metric.
 Right?

In general, when I get the value of some time series like 
node_cpu_seconds_total ... when that is missing for e.g. one instance I 
would get nothing, right? I.e. there is no special value, the vector of 
scalars just has one element less. But if I do get a value, it's for sure 
the one from the most recent scrape?!
  

https://www.robustperception.io/absent-alerting-for-jobs/

Is this with absent() also needed when I have all my targets/jobs 
statically configured? I guess not because Prometheus should know about it 
and reflect it in `up` if any of them couldn't be scraped, right?

 

As for drives vanishing from an array, you can write expressions using 
count() to check the number of drives.  If you have lots of machines and 
don't want separate rules per controller, then it's possible to use another 
timeseries as a threshold, again this is a bit more complex:
https://www.robustperception.io/using-time-series-as-alert-thresholds
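For example (metric names entirely made up, just to show the shape of such 
a rule):

  expr: |
    sum by (instance, controller) (my_raid_active_disks)
      < on (instance, controller)
    my_raid_expected_disks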



[prometheus-users] Re: restrict (respectively silence) alert rules to/for certain instances

2023-04-25 Thread Christoph Anton Mitterer
Hey Brian

On Tuesday, April 25, 2023 at 9:59:12 AM UTC+2 Brian Candler wrote:

So really I'd divide the possibilities 3 ways:

1. Prevent the alert being generated from prometheus in the first place, by 
writing the expr in such a way that it filters out conditions that you 
don't want to alert on

2. Let the alert arrive at alertmanager, but permanently prevent it from 
sending out notifications for certain instances

3. Apply a temporary silence in alertmanager for certain alerts or groups 
of alerts

(1) is done by writing your 'expr' to match only specific instances or to 
exclude specific instances

(2) is done by matching on labels in your alertmanager routing rules (and 
if necessary, by adding extra labels in your 'expr')


I think in my case (where I want to simply get no alerts at all for a 
certain group of instances) it would be (1) or (2), with (1) probably being 
the cleaner one.

I guess with (2) you also meant having a route which is then permanently 
muted?


If you want to apply a threshold to only certain filesystems, and/or to 
have different thresholds per filesystem, then it's possible to put the 
thresholds in their own set of static timeseries:

https://www.robustperception.io/using-time-series-as-alert-thresholds

But I don't recommend this, and I find such alerts are brittle.


Would also sound like a solution that's a bit over-engineered to me.
 

  It helps to rethink exactly what you should be alerting on:
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/edit

For the majority of cases: "alert on symptoms, rather than causes".  That 
is, alert when a service isn't *working* (which you always need to know 
about), and in those alerts you can include potential cause-based 
information (e.g. CPU load is high, RAM is full, database is down etc).

Now, there are also some things you want to know about *before* they become 
a problem, like "disk is nearly full".  But the trouble with static alerts 
is, they are a pain to manage.  Suppose you have a threshold at 85%, and 
you have one server which is consistently at 86% but not growing - you know 
this is the case, you have no need to grow the filesystem, so you end up 
tweaking thresholds per instance.

I would suggest two alternatives:

1. Check dashboards daily.  If you want automatic notifications then don't 
send the sort of alert which gets someone out of bed, but a "FYI" 
notification to something like Slack or Teams.

2. Write dynamic alerts, e.g. have alerting rules which identify disk usage 
which is growing rapidly and likely to fill in the next few hours or days.

- name: DiskRate10m
  interval: 1m
  rules:
  # Warn if rate of growth over last 10 minutes means filesystem will fill in 2 hours
  - alert: DiskFilling10m
    expr: |
      node_filesystem_avail_bytes / (node_filesystem_avail_bytes -
        (predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[10m], 7200) < 0)) * 7200
    for: 20m
    labels:
      severity: critical
    annotations:
      summary: 'Filesystem will be full in {{ $value | humanizeDuration }} at current 10m growth rate'

- name: DiskRate3h
  interval: 10m
  rules:
  # Warn if rate of growth over last 3 hours means filesystem will fill in 2 days
  - alert: DiskFilling3h
    expr: |
      node_filesystem_avail_bytes / (node_filesystem_avail_bytes -
        (predict_linear(node_filesystem_avail_bytes{fstype!~"fuse.*|nfs.*"}[3h], 172800) < 0)) * 172800
    for: 6h
    labels:
      severity: warning
    annotations:
      summary: 'Filesystem will be full in {{ $value | humanizeDuration }} at current 3h growth rate'


Thanks, but I'm not sure whether the above applies to my scenario.

For me it's really like this:
My Prometheus instance monitors:
- my "own" instances, where I need to react on things like >85% usage on 
root filesystem (and thus want to get an alert)
- "foreign" instances, where I just get the node exporter data and show 
e.g. CPU usage, IO usage, and so on as a convenience to users of our 
cluster - but any alert conditions wouldn't cause any further action on my 
side (and the guys in charge of those servers have their own monitoring)

So in the end it just boils down to my desire to keep my alert rules 
small/simple/readable.
   expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} * 100) / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"}) >= 85
=> would fire for all nodes, bad

   expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRightHosts"} * 100) / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs",instance=~"someRegexThatMatchesTheRightHosts"}) >= 85
=> would work, I guess, but seems really ugly to read/maintain


 

Not sure whether anything can be done better via adding labels at some 
stage.


As well as target labels, you can set labels in the alerting rules 
themselves, for when an alert fires. That 

[prometheus-users] how to make sure a metric is to be checked is "there"

2023-04-24 Thread Christoph Anton Mitterer
Hey there.

What I'm trying to do is basically replace Icinga with Prometheus (or
well, not really replacing it, but integrating it into the latter, which I
need anyway for other purposes).

So I'll have e.g. some metric that shows me the RAID status on
instances, and I want to get an alert, when a HDD is broken.


I guess it's obvious that it could turn out bad if I don't get an
alert, just because the metric data isn't there (for some reason).


In Icinga, this would have been simple:
The system knows about every host and every service it needs to check.
If there's no result (like RAID is OK or FAILED) anymore (e.g. because
the RAID CLI tool is not installed), the check's status would at least
go into UNKNOWN.



I wonder how this is / can be handled in Prometheus?


I mean I can of course check e.g.
   expr: up == 0
in some alert.
But AFAIU this actually just tells me whether there are any scrape
targets that couldn't be scraped (in the last run, based on the scrape
interval), right?

If my important checks were all their own exporters, e.g. one exporter
just for the RAID status, then - AFAIU - this would already work and
notify me for sure, even if there's no result at all.

But what if it's part of some larger exporter, like e.g. the mdadm data
in node exporter.

up wouldn't become 0 just because node_md_disks is not part of
the metrics.


Even if I'd say it's the duty of the exporter to make sure that there
is a result even on failure to read the status... what e.g. if some
tool is already needed just to determine whether that metric makes sense
to be collected at all.
That would be typical for most hardware RAID controllers... you need
the respective RAID tool just to see whether any RAIDs are present.


So in principle I'd like a simple way to check for a certain group of
hosts on the availability of a certain time series, so that I can set
up e.g. an alert that fires if any node where I have e.g. some MegaCLI
based RAID, lacks megacli_some_metric.

Or is there some other/better way this is done in practise?


Thanks,
Chris.



[prometheus-users] restrict (respectively silence) alert rules to/for certain instances

2023-04-24 Thread Christoph Anton Mitterer
Hey.

I have some troubles understanding how to do things right™ with respect
to alerting.

In principle I'd like to do two things:

a) have certain alert rules run only for certain instances
   (though that may in practice actually be less needed, when only the
   respective nodes would generate the respective metrics - not sure
   yet whether this will be the case)
b) silence certain (or all) alerts for a given set of instances
   e.g. these may be nodes where I'm not an admin who can take action
   on an incident, but just view the time series graphs to see what's
   going on


As example I'll take an alert that fires when the root fs has >85%
usage:
   groups:
     - name: node_alerts
       rules:
         - alert: node_free_fs_space
           expr: 100 - ((node_filesystem_avail_bytes{mountpoint="/",fstype!="rootfs"} * 100) / node_filesystem_size_bytes{mountpoint="/",fstype!="rootfs"}) >= 85



With respect to (a):
I could of course add yet another matcher like:
   instance=~"someRegexThatDescribesMyInstances"
to each time series selector, but when that regex gets more complex,
everything becomes quite unreadable and it's quite error prone to forget
about a place (assuming one has many alerts) when the regex changes.

Is there some way like defining host groups or so? Where I have a
central place where I could define the list of hosts respectively a
regex for that... and just use the name of that definition in the
actual alert rules?


With respect to (b):
Similarly to above,... if I had various instances for which I'd never
wanted to see any alerts, I could of course add a regex to all my
alerts.
But it seems quite ugly to clutter up all the rules just for a potentially
long list/regex of things which I don't want to see anyway.

Another idea I had was that I do the filtering/silencing in the
alertmanager config at route level:
Like by adding a "ignore" route, that matches via regex on all the
instances I'd like to silence (and have a mute_time_interval set to
24/7), before any other routes match.

But AFAIU this would only suppress the message (e.g. mail), but the
alert would still show up in the alertmanager webpages/etc. as firing.



Not sure whether anything can be done better via adding labels at some
stage.
- Doing external_labels: in the prometheus config doesn't seem to help
  here (only static values?)
- Same for labels: in <static_config> in the prometheus config.
- Setting some "noalerts" label via <relabel_config> in the prometheus
  config would also set that in the DB, right?
  This I rather wouldn't want.

- Maybe using:
  alerting:
alert_relabel_configs:
      - <relabel_config>
  would work? Like matching hostnames on instance and replacing with
  e.g. "yes" in some "noalerts" target?
  And then somehow using that in the alert rules...
  
  But also sounds a bit ugly, TBH.


So... what's the proper way to do this? :-)


Thanks,
Chris.


btw: Is there any difference between:
1) alerting:
 alert_relabel_configs:
       - <relabel_config>
and
2) the relabel_configs: in <scrape_config>?



Re: [prometheus-users] fading out sample resolution for samples from longer ago possible?

2023-03-01 Thread Christoph Anton Mitterer
On Tue, 2023-02-28 at 10:25 +0100, Ben Kochie wrote:
> 
> Debian release cycles are too slow for the pace of Prometheus
> development.

It's rather simple to pull the version from Debian unstable, if one
needs to, and that seems pretty current.


> You'd be better off running Prometheus using podman, or deploying
> official binaries with Ansible[0].

Well I guess view on how software should be distributed differ.

The "traditional" system of having distributions has many advantages
and is IMO a core reason for the success of Linux and OpenSource.

All "modern" alternatives like flatpaks, snaps, and similar repos are
IMO especially security wise completely inadequate (especially the fact
that there is no trusted intermediate (like the distribution) which
does some basic maintenance.

It's anyway not possible here because of security policy reasons.


> 
> No, but It depends on your queries. Without seeing what you're
> graphing there's no way to tell. Your queries could be complex or
> inefficient. Kinda like writing slow SQL queries.

As mentioned already in the other thread, so far I merely do only what:
https://grafana.com/grafana/dashboards/1860-node-exporter-full/
does.


> There are ways to speed up graphs for specific things, for example
> you can use recording rules to pre-render parts of the queries.
> 
> For example, if you want to graph node CPU utilization you can have a
> recording rule like this:
> 
> groups:
>   - name: node_exporter
>     interval: 60s
>     rules:
>       - record: instance:node_cpu_utilization:ratio_rate1m
>         expr: >
>           avg without (cpu) (
>             sum without (mode) (
>               rate(node_cpu_seconds_total{mode!="idle",mode!="iowait",mode!="steal"}[1m])
>             )
>           )
> 
> This will give you a single metric per node that will be faster to
> render over longer periods of time. It also effectively down-samples
> by only recording one point per minute.

But will dashboards like Node Exporter Full automatically use such rules?
And if so... will they (or rather Prometheus) use the real time series
(with full resolution) when needed?

If so, then the idea would be to create such a rule for every metric
I'm interested in and that is slow, right?
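(E.g., if I got the naming convention right, something like this for memory
in addition to the CPU one - untested:)

   groups:
     - name: node_exporter
       interval: 60s
       rules:
         - record: instance:node_memory_utilization:ratio
           expr: 1 - (node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes)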



> Also "Medium sized VM" doesn't give us any indication of how much CPU
> or memory you have. Prometheus uses page cache for database access.
> So maybe your system is lacking enough memory to effectively cache
> the data you're accessing.

Right now it's 2 virtual CPUs with 4.5 GB RAM... I'd guess it might
need more CPU?

Previously I suspected IO to be the reason, and while in fact IO is
slow (the backend seems to deliver only ~100MB/s)... there seems to be
nearly no IO at all while waiting for the "slow graph" (which is Node
Exporter Full's "CPU Basic" panel), e.g. when selecting the last 30 days.

Kinda surprising... does Prometheus read its TSDB really that
efficiently?


Could it be a problem when Grafana runs on another VM? Though
there didn't seem to be any network bottleneck... and I guess Grafana
just always accesses Prometheus via TCP, so there should be no further
positive caching effect when both run on the same node?


> No, we've talked about having variable retention times, but nobody
> has implemented this. It's possible to script this via the DELETE
> endpoint[1]. It would be easy enough to write a cron job that deletes
> specific metrics older than X, but I haven't seen this packaged into
> a simple tool. I would love to see something like this created.
> 
> [1]: 
> https://prometheus.io/docs/prometheus/latest/querying/api/#delete-
> series 

Does it make sense to open a feature request ticket for that?

I mean it would solve at least my storage "issue" (well it's not really
a showstopper... as was mentioned, one could simply buy a big cheap
HDD/SSD).
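(If I read the API docs correctly, such a cron job would essentially boil
down to something like the following - metric name and date just as an
example, and the admin API has to be enabled via --web.enable-admin-api:)

  # delete all node_cpu_seconds_total samples before a given date...
  curl -X POST -g 'http://localhost:9090/api/v1/admin/tsdb/delete_series?match[]=node_cpu_seconds_total&end=2023-01-01T00:00:00Z'
  # ...and actually free the space afterwards
  curl -X POST 'http://localhost:9090/api/v1/admin/tsdb/clean_tombstones'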

And could something be made in the same way that downsamples data from
longer ago?


Both together would really give quite some flexibility.

For metrics where old data is "boring" one could just delete
everything older than e.g. 2 weeks, while keeping full details for that
time.

For metrics where one is interested in larger time ranges, but where
sample resolution doesn't matter so much, one could downsample it...
like everything older than 2 weeks... then even more for everything 
older than 6 months, then even more for everything older than 1 year...
and so on.

For few metrics where full resolution data is interesting over a really
long time span, one could just keep it.



> > Seem at least quite big to me... that would - assuming all days can
> > be
> > compressed roughly to that (which isn't sure of course) - mean for
> > one
> > year one needs ~ 250 GB for that 40 nodes or about 6,25 GB per node
> > (just for the data for node exporter with a 15s interval).
> 
> Without seeing a full meta.json and the size of the files in one dir,
> it's hard to say exactly if this is good or bad. It depends a bit on
> how 

Re: [prometheus-users] fading out sample resolution for samples from longer ago possible?

2023-03-01 Thread Christoph Anton Mitterer
Hey Brian

On Tue, 2023-02-28 at 00:27 -0800, Brian Candler wrote:
> 
> I can offer a couple more options:
> 
> (1) Use two servers with federation.
> - server 1 does the scraping and keeps the detailed data for 2 weeks
> - server 2 scrapes server 1 at lower interval, using the federation
> endpoint

I had thought about that as well. Though it feels a bit "ugly".


> (2) Use recording rules to generate lower-resolution copies of the
> primary timeseries - but then you'd still have to remote-write them
> to a second server to get the longer retention, since this can't be
> set at timeseries level.

I had (very briefly) read about the recording rules (merely just that
they exist ^^) ... but wouldn't these give me a new name for the
metric?

If so, I'd need to adapt e.g.
https://grafana.com/grafana/dashboards/1860-node-exporter-full/ to use
the metrics generated by the recording rules,... which again seems
quite some maintenance effort.

Plus, as you even wrote below, I'd need users to use different
dashboards, AFAIU, one where the detailed data is used, one where the
downsampled data is used.
Sure that would work as a workaround, but is of course not really a
good solution, as one would rather want to "seamlessly" move from the
detailed to less-detailed data.


> Either case makes the querying more awkward.  If you don't want
> separate dashboards for near-term and long-term data, then it might
> work to stick promxy in front of them.

Which would however make the setup more complex again.


> Apart from saving disk space (and disks are really, really cheap
> these days), I suspect the main benefit you're looking for is to get
> faster queries when running over long time periods.  Indeed, I
> believe Thanos creates downsampled timeseries for exactly this
> reason, whilst still continuing to retain all the full-resolution
> data as well.

I guess I may have to look into that, and how complex its setup would be.



> That depends.  What PromQL query does your graph use? How many
> timeseries does it touch? What's your scrape interval?

So far I've just been playing with the ones from:
https://grafana.com/grafana/dashboards/1860-node-exporter-full/
So all queries in that and all time series that uses.

Interval is 15s.


> Is your VM backed by SSDs?

I think it's a Ceph cluster that the supercomputing centre uses for
that, but I have no idea what that runs on. Probably HDDs.


> Another suggestion: running netdata within the VM will give you
> performance metrics at 1 second intervals, which can help identify
> what's happening during those 10-15 seconds: e.g. are you
> bottlenecked on CPU, or disk I/O, or something else.

Good idea, thanks.


Thanks,
Chris.



Re: [prometheus-users] fading out sample resolution for samples from longer ago possible?

2023-02-27 Thread Christoph Anton Mitterer
Hi Stuart, Julien and Ben,

Hope you don't mind that I answer all three replies in one... don't
wanna spam the list ;-)



On Tue, 2023-02-21 at 07:31 +, Stuart Clark wrote:
> Prometheus itself cannot do downsampling, but other related projects 
> such as Cortex & Thanos have such features.

Uhm, I see. Unfortunately neither is packaged for Debian. Plus it seems
to make the overall system even more complex.

I want to use Prometheus merely for monitoring a few hundred nodes (thus it
seems a bit overkill to have something like Cortex, which sounds like a
system for a really large number of nodes) at the university, though as
indicated before, we'd need both:
- detailed data for like the last week or perhaps two
- far less detailed data for much longer terms (like several years)

Right now my Prometheus server runs in a medium sized VM, but when I
visualise via Grafana and select a time span of a month, it already
takes considerable time (like 10-15s) to render the graph.

Is this expected?




On Tue, 2023-02-21 at 11:45 +0100, Julien Pivotto wrote:
> We would love to have this in the future but it would require careful
> planning and design document.

So native support is nothing on the near horizon?

And I guess it's really not possible to "simply" ( ;-) ) have different
retention times for different metrics?




On Tue, 2023-02-21 at 15:52 +0100, Ben Kochie wrote:
> This is mostly unnecessary in Prometheus because it uses compression
> in the TSDB samples. What would take up a lot of space in an RRD file
> takes up very little space in Prometheus.

Well right now I scrape only the node-exporter data from 40 hosts at a
15s interval, plus the metrics from Prometheus itself.
I'm doing this on a test install since the 21st of February.
Retention time is still at its default.

That gives me:
# du --apparent-size -l -c -s --si /var/lib/prometheus/metrics2/*
68M   /var/lib/prometheus/metrics2/01GSST2X0KDHZ0VM2WEX0FPS2H
481M  /var/lib/prometheus/metrics2/01GSVQWH7BB6TDCEWXV4QFC9V2
501M  /var/lib/prometheus/metrics2/01GSXNP1T77WCEM44CGD7E95QH
485M  /var/lib/prometheus/metrics2/01GSZKFK53BQRXFAJ7RK9EDHQX
490M  /var/lib/prometheus/metrics2/01GT1H90WKAHYGSFED5W2BW49Q
487M  /var/lib/prometheus/metrics2/01GT3F2SJ6X22HFFPFKMV6DB3B
498M  /var/lib/prometheus/metrics2/01GT5CW8HNJSGFJH2D3ADGC9HH
490M  /var/lib/prometheus/metrics2/01GT7ANS5KDVHVQZJ7RTVNQQGH
501M  /var/lib/prometheus/metrics2/01GT98FETDR3PN34ZP59Y0KNXT
172M  /var/lib/prometheus/metrics2/01GT9X2BPN51JGB6QVK2X8R3BR
60M   /var/lib/prometheus/metrics2/01GTAASP91FSFGBBH8BBN2SQDJ
60M   /var/lib/prometheus/metrics2/01GTAHNDG070WXY8WGDVS22D2Y
171M  /var/lib/prometheus/metrics2/01GTAHNHQ587CQVGWVDAN26V8S
102M  /var/lib/prometheus/metrics2/chunks_head
21k   /var/lib/prometheus/metrics2/queries.active
427M  /var/lib/prometheus/metrics2/wal
5,0G  total

Not sure whether I understood meta.json correctly (haven't found
documentation for minTime/maxTime) but I guess that the big ones
correspond to 64800s?

Seems at least quite big to me... that would - assuming all days can be
compressed roughly to that (which isn't sure of course) - mean for one
year one needs ~ 250 GB for those 40 nodes, or about 6,25 GB per node
(just for the data for node exporter with a 15s interval).

Does that sound reasonable/expected?



> What's actually more
> difficult is doing all the index loads for this long period of time.
> But Prometheus uses mmap to opportunistically access the data on
> disk.

And is there anything that can be done to improve that? Other than
simply using some fast NVMe or so?



Thanks,
Chris.



[prometheus-users] fading out sample resolution for samples from longer ago possible?

2023-02-20 Thread Christoph Anton Mitterer
Hey.

I wondered whether one can do with Prometheus something similar to what is 
possible with systems using RRD (e.g. Ganglia).

Depending on the kind of metrics, like for those from the node exporter, 
one may want a very high sample resolution (and thus a short scraping 
interval) for like the last 2 days... but the further one goes back, the 
less interesting that data becomes, at least in that resolution (ever 
looked at how much IO a server had 2 years ago, per 15s)?

What one may however want is a rough overview of these metrics for those 
time periods longer ago, e.g. in order to see some trends.


For other values, e.g. the total used disk space on a shared filesystem or 
maybe a tape library, one may not need such high resolution for the last 2 
days, but instead want the data (with low sample resolution, e.g. 1 
sample per day) going back much longer, like the last 10 years.


With Ganglia/RRD one would then simply use multiple RRDs, each for 
different time spans and with different resolutions... and RRD would 
interpolate its samples accordingly.


Can anything like this be done with Prometheus? Or is that completely out 
of scope?


I saw that one can set the retention period, but that seems to affect 
everything.

So even if I have e.g. my low resolution tape library total size, which I 
could scrape only every hour or so, ... it wouldn't really help me.
In order to keep data for that like the last 10 years, I'd need to set the 
retention time to that.

But then the high resolution samples like from the node exporter would also 
be kept that long (with full resolution).


Thanks,
Chris.



Re: [prometheus-users] collect non-metrics data

2023-02-13 Thread Christoph Anton Mitterer
Hey Ben.

On Saturday, February 11, 2023 at 11:18:44 AM UTC+1 Ben Kochie wrote:

You combine this with an "info" metric that tells you about the rest of the 
device.

Ah,... and I assume that one could just also export these info metrics 
alongside e.g. node_md_state?
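(I.e. something like this in the exporter output, I suppose - metric and 
label names made up:

   my_raid_disk_info{controller="0",enclosure="32",slot="4",serial="WD-XYZ123"} 1
   my_raid_disk_state{controller="0",enclosure="32",slot="4",state="failed"} 1

so the purely textual details live in the labels of the info metric.)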

Thanks :-)
Chris.



[prometheus-users] collect non-metrics data

2023-02-11 Thread Christoph Anton Mitterer
Hey.

I wondered whether the following is possible with Prometheus. I'm basically 
thinking about possibly phasing out Icinga and doing any alerting in Prometheus.

For checks that are clearly metrics based (like load or free disk space) 
this seems rather easy.

But what about any checks that are not really based on metrics?
Like e.g. check_raid, which gives an error if any RAID has lost a disk or 
similar.

Of course one could always just try to make a metric out of it - above one 
could make e.g. the number of non-consistent RAIDs the metric.

But what one actually wants from such checks is additional (typically 
purely textual) information, like in the above example which HDD 
(enclosure, bay number,... or the serial number) has failed.
Also I have numerous other checks which test for things which are not 
really related to a number but where the output are strings.

Is there any (good) way to get that done with Prometheus, or is it simply 
not meant for that specific use case.

Thanks,
Chris.
