While I believe we should be open to different solutions to limit the damage 
of Prometheus going OOM, I just want to point out that metrics use cases are 
wider than visibility and alerting. For example, missing metrics can lead to 
missed autoscaling, and hence be dangerous. I’d be curious whether we can be 
selective about partial failures, and have more control over which priority 
metrics to privilege for success.


--
Jérôme

From: <prometheus-developers@googlegroups.com> on behalf of "l.mi...@gmail.com" 
<l.mier...@gmail.com>
Date: Monday, November 28, 2022 at 7:09 AM
To: Prometheus Developers <prometheus-developers@googlegroups.com>
Subject: RE: [EXTERNAL][prometheus-developers] Preventing Prometheus from 
running out of memory


Let's take a step back.
The fundamental problem this thread is about is trying to prevent Prometheus 
OOM kills. When Prometheus crashes due to running out of memory it stops 
scraping any metrics, and it will take some time to restart (due to WAL 
replay). During that time you lose all your metrics and alerting.
This is the core of the issue.
There seems to be no disagreement that losing all metrics and alerting is 
considered bad.

The disagreement seems to be purely about how to handle it.
At the moment Prometheus's behaviour is to fail hard and lose all monitoring.
What the PR I've linked to proposes is graceful degradation. So instead of 
losing everything you're just losing a subset. And "losing" here mostly means: 
there's a flood of new time series and there's no space for them, so they don't 
get scraped. All existing time series already stored in the TSDB are unaffected.
For my use cases that's much better than "a flood of time series comes and 
takes Prometheus offline". Especially since after the flood you end up with a 
huge WAL that might continue to crash Prometheus on replay until you remove all 
WAL files.

When I say "Nothing will break if metrics are missing" I mean that my services 
will run regardless of whether I scrape metrics or not. I might not know if 
they are broken if I'm missing metrics, but that's all the impact. What is 
acceptable to lose and what isn't will differ between users, so sure, it might 
be too much for some.

On Monday, 28 November 2022 at 14:40:56 UTC sup...@gmail.com wrote:
To add to "Nothing will break if things are missing": Prometheus alerts depend 
on the existence of data in order to work. If you're "just missing some 
data", Prometheus alerts will fail to fire. This is unacceptable, and we make 
specific design trade-offs in order to avoid this situation.
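
As a purely illustrative sketch (the metric name and threshold below are made 
up): a typical rule like this relies on the error series being present, and if 
those happen to be exactly the series a partial scrape drops, the expression 
evaluates to nothing and the alert silently never fires.

groups:
  - name: example
    rules:
      - alert: HighErrorRate
        # hypothetical metric name and threshold; if the errors_total series
        # are among the ones dropped by a partial scrape, this expression
        # evaluates to an empty result and the alert never fires
        expr: sum(rate(errors_total[5m])) > 10
        for: 5m

Note that an absent(errors_total) guard doesn't help either: it only fires when 
no series with that name exist at all, not when some of them are dropped.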

On Mon, Nov 28, 2022 at 3:38 PM Ben Kochie <sup...@gmail.com> wrote:
On Mon, Nov 28, 2022 at 1:40 PM l.mi...@gmail.com <l.mi...@gmail.com> wrote:
I agree that this is not side-effect free, with the worst possible outcome 
being sending bogus alerts.
But.
We're talking about metrics, so from my PoV the consequences range not from 
"annoying to dangerous" but from "annoying to more annoying".

Actually, we're NOT talking about metrics. We're talking about monitoring. 
Prometheus is a monitoring system that also happens to be based on metrics.

There's no real danger in missing metrics. Nothing will break if they are 
missing.

Absolutely not true. We depend on Prometheus for our critical monitoring 
behavior. Without Prometheus we don't know if our service is down. Without 
Prometheus we could have problems go undetected for millions of active users. 
If we're not serving our users this affects our company's income.

To repeat, Prometheus is not some generic TSDB. It's a monitoring platform 
designed to be as robust as possible for base level alerting for everything it 
watches.

And if someone receives a bogus alert, it's going to arrive alongside another 
alert saying "you're over your limit, some metrics might be missing", which 
does help to limit any potential confusion.

Again, the motivation is to prevent cardinality issues piling up to a point 
where Prometheus runs out of memory and crashes.
Maybe that's an issue that's more common in the environments I work with than 
in others, and so I'm more willing to make trade-offs to fix it.
I do expect us to run in production with the HEAD & soft limit patches (as we 
already do), since running without them requires too much firefighting, so I'm 
happy to share more of that experience later.
On Monday, 28 November 2022 at 11:46:09 UTC Julius Volz wrote:
My reaction is similar to that of Ben and Julien: if metrics within a target go 
partially missing, all kinds of expectations around aggregations, histograms, 
and other metric correlations are off in ways that you wouldn't expect in a 
normal Prometheus setup. Everyone expects full scrape failures, as they are 
common in Prometheus, and you usually have an alert on the "up" metric. Now you 
can say that if you explicitly choose to turn on that feature, you can expect a 
user to deal with the consequences of that. But the consequences are so 
unpredictable and can range from annoying to dangerous:

* Incorrectly firing alerts (e.g. because some series are missing in an 
aggregation or a histogram)
* Silently broken alerts that should be firing (same but the other way around)
* Graphs seemingly working for an instance, but showing wrong data because some 
series of a metric are missing
* Queries that try to correlate different sets of metrics from a target 
breaking in subtle ways

Essentially it removes a previously strong invariant from Prometheus-based 
monitoring and introduces a whole new way of having to think about things like 
the above. Even if you have alerts to catch the underlying situation, you may 
get a bunch of incorrect alert notifications from the now-broken rules in the 
meantime.
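
For reference, the "full scrape failure" case is the one that standard alerting 
does catch, typically with a rule roughly like this (names and durations are 
illustrative):

groups:
  - name: scrape-health
    rules:
      - alert: TargetDown
        # a fully failed scrape sets up to 0 for the target, so this fires;
        # a scrape that is only partially dropped may still report up == 1
        # and stays invisible here
        expr: up == 0
        for: 5m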

On Mon, Nov 28, 2022 at 10:58 AM l.mi...@gmail.com <l.mi...@gmail.com> wrote:

On Sunday, 27 November 2022 at 19:25:12 UTC sup...@gmail.com wrote:
Soft / partial failure modes can be very hard problems to deal with. You have 
to be a lot more careful to not end up missing partial failures.

Sure, partial failure modes can be *in general* a hard problem, but with 
Prometheus the worst case is that some metrics will be missing and some won't, 
which is, at least in my opinion, not that problematic.
What do you mean by "missing partial failures"?

While it may seem like a soft sample_limit is good and a hard sample_limit is 
bad, "fail fast" will serve you better in the long run. Most of the Prometheus 
monitoring design assumes fail fast. Partial results are too hard to reason 
about from a monitoring perspective. With fail fast you will know quickly and 
decisively that you've hit a problem. If you treat monitoring
outages as just as bad as an actual service outage, you'll end up with a 
healthier system overall.

I think I've seen the statement that partial scrapes are "too hard to reason 
about" a number of times.
From my experience they are not. A hard fail means "all or nothing", a partial 
fail means "all or some". It's just a different level of granularity, and 
that's not that hard to get your head around IMHO.
I think that it's a hard problem from a Prometheus developer's point of view, 
because the inner logic and handling of errors is where the (potential) 
complication is. But from the end user's perspective it's not that hard at all. 
Most of the users I talked to asked me to implement a partial failure mode, 
because that makes it both safer and easier for them to make changes.
With a hard failure any mistake costs you every metric you scrape; there's no 
forgiveness. Treating it all like a service outage works best for 
microservices, where you make small deployments and it's easy to roll back or 
roll forward; it's harder for bigger services, especially ones worked on by 
multiple teams.

Most cardinality issues we see are not because someone made a bad release, 
although that happens often too, but because a good chunk of all exported 
metrics are somehow dynamically tied to the workload of each service. And 
although we work hard to avoid any obviously explosive metrics, like using the 
request path as a label, most services will export a different set of time 
series if they receive more traffic, or different traffic, or if there's an 
incident with a dependent service. So the number of time series is always 
changing just because traffic fluctuates, and trying to set a hard limit that 
won't trigger simply because there happens to be more traffic at 5am on a 
Sunday is difficult. So what teams will do is give themselves a huge buffer to 
allow for any spikes, and then you repeat that for each service and the sum of 
all hard limits is 2-5x your maximum capacity and it doesn't prevent any OOM 
kill. It will work at the individual service level, eventually, but it does 
nothing to stop Prometheus as a whole from running out of memory.
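
Just to illustrate the sizing problem (a sketch; the rule name is made up, 
while scrape_samples_post_metric_relabeling is one of the built-in per-target 
metrics): before picking any per-service limit you'd typically have to watch 
something like this for weeks to see how much it moves with traffic.

groups:
  - name: capacity
    rules:
      # recording the per-job maximum of the automatically generated
      # scrape_samples_post_metric_relabeling metric shows how much headroom
      # a static limit would need to survive normal traffic spikes
      - record: job:scrape_samples_post_metric_relabeling:max
        expr: max by (job) (scrape_samples_post_metric_relabeling)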

This is all about sample_limit, which is just one level of protection.
Part of the patch we run is also a TSDB HEAD limit that stops it from 
appending more time series than the configured limit.


For the case of label explosions, there are some good meta metrics[0] that can 
help you. The "scrape_series_added" metric can allow you to softly detect label 
leaks.
In addition, there is a new feature flag[1] that adds additional target metrics 
for monitoring targets nearing their limits.

[0]: 
https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series
[1]: 
https://prometheus.io/docs/prometheus/latest/feature_flags/#extra-scrape-metrics
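
A rough sketch of what that can look like in practice (the thresholds are 
arbitrary, and the second rule assumes the extra-scrape-metrics feature flag is 
enabled):

groups:
  - name: cardinality
    rules:
      - alert: HighSeriesChurn
        # scrape_series_added approximates how many new series each scrape
        # introduced; a sustained high value suggests a label leak
        # (the 1000 / 30m numbers are arbitrary)
        expr: sum by (job, instance) (sum_over_time(scrape_series_added[1h])) > 1000
        for: 30m
      - alert: ApproachingSampleLimit
        # scrape_sample_limit is only exposed with
        # --enable-feature=extra-scrape-metrics; the > 0 filter skips targets
        # that have no limit configured
        expr: scrape_samples_post_metric_relabeling / (scrape_sample_limit > 0) > 0.8
        for: 30m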


We have plenty of monitoring for the health of all scrapes. Unfortunately 
monitoring alone doesn't prevent Prometheus from running out of memory.
For a long time we had quotas for all scrapes, which were implemented as alerts 
for scrape owners. It caused a lot of noise and fatigue, and it didn't do 
anything useful for our capacity management. A common request based on that 
experience was to add soft limits.

On Fri, Nov 25, 2022 at 1:27 PM l.mi...@gmail.com <l.mi...@gmail.com> wrote:
Hello,

One of the biggest challenges we have when trying to run Prometheus with a 
constantly growing number of scraped services is keeping resource usage under 
control.
This usually means memory usage.
Cardinality is often a huge problem and we often end up with services 
accidentally exposing labels that are risky. One silly mistake we see every now 
and then is putting raw errors as labels, which then leads to time series with 
{error="connection from $ip:$port to $ip:$port timed out"} and so on.
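
(For context, this is the kind of per-label firefighting you can do with 
vanilla relabelling once you know the offending label; a sketch, with the job 
name and target made up and the "error" label taken from the example above, 
and it only helps after the fact:)

scrape_configs:
  - job_name: example                      # illustrative job
    static_configs:
      - targets: ["app.example.com:9100"]  # illustrative target
    metric_relabel_configs:
      # drop any sample carrying a non-empty "error" label before ingestion
      - source_labels: [error]
        regex: .+
        action: drop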

We have tried a lot of ways of dealing with this using vanilla Prometheus 
features, but none of them really works well for us.
Obviously there is sample_limit that one might use here, but the biggest 
problem with it is the fact that once you hit the sample_limit threshold you 
lose all metrics, and that's just not acceptable for us.
If I have a service that exports 999 time series and it suddenly goes to 1001 
(with sample_limit=1000), I really don't want to lose all metrics just because 
of that, because losing all monitoring is a bigger problem than having a few 
extra time series in Prometheus. It's just too risky.
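
For reference, that's the vanilla behaviour of a config like this (target and 
numbers are illustrative):

scrape_configs:
  - job_name: example
    # once a scrape returns more than 1000 samples (after metric relabelling)
    # the whole scrape is treated as failed and all of the target's metrics
    # for that scrape are dropped
    sample_limit: 1000
    static_configs:
      - targets: ["app.example.com:9100"]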

We're currently running Prometheus with patches from:
https://github.com/prometheus/prometheus/pull/11124

This gives us 2 levels of protection:
- a global HEAD limit - Prometheus is not allowed to have more than M time 
series in the TSDB
- a per-scrape sample_limit - but patched so that if you exceed sample_limit it 
will start rejecting time series that aren't already in the TSDB
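
Purely to illustrate the shape of it (these option names are hypothetical; see 
the PR for the actual flags and configuration), the combination looks roughly 
like:

# hypothetical sketch only - these option names are made up to show the shape
# of the configuration, not what the PR actually implements
global:
  # upper bound on the number of series allowed in the TSDB head
  head_series_limit: 10000000
scrape_configs:
  - job_name: example
    # with the patch, exceeding this rejects only series that are not already
    # present in the TSDB, instead of failing the whole scrape
    sample_limit: 1000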

This works well for us and gives us a system where:
- we have reassurance that Prometheus won't start getting OOM killed overnight
- service owners can add new metrics without fear that a typo will cost them 
all their metrics

But comments on that PR suggest that it's a highly controversial feature.
I wanted to probe this community to see what the overall feeling is and how 
likely it is that vanilla Prometheus will have something like this.
It's a small patch, so I'm happy to just maintain it for our internal 
deployments, but it feels like a common problem to me, so a baked-in solution 
would be great.

Lukasz