I agree that this is not side-effect free, with the worst possible outcome 
being that we send bogus alerts.
But.
We're talking about metrics, so from my PoV the consequences don't range 
from "annoying" to "dangerous" but from "annoying" to "more annoying".
There's no real danger in missing metrics. Nothing will break if they are 
missing.
And if someone receives a bogus alert, it's going to arrive alongside 
another alert saying "you're over your limit, some metrics might be 
missing", which helps to limit any potential confusion.

Again, the motivation is to prevent cardinality issues piling up to a 
point where Prometheus runs out of memory and crashes.
Maybe that's an issue that's more common in the environments I work with 
than in others, and so I'm more willing to make trade-offs to fix it.
I do expect us to run in production with the HEAD & soft limit patches (as 
we already do), since running without them requires too much firefighting, 
so I'm happy to share more experience of that later.
On Monday, 28 November 2022 at 11:46:09 UTC Julius Volz wrote:

> My reaction is similar to that of Ben and Julien: if metrics within a 
> target go partially missing, all kinds of expectations around aggregations, 
> histograms, and other metric correlations are off in ways that you wouldn't 
> expect in a normal Prometheus setup. Everyone expects full scrape failures, 
> as they are common in Prometheus, and you usually have an alert on the "up" 
> metric. Now you can say that if you explicitly choose to turn on that 
> feature, you can expect a user to deal with the consequences of that. But 
> the consequences are so unpredictable and can range from annoying to 
> dangerous:
>
> * Incorrectly firing alerts (e.g. because some series are missing in an 
> aggregation or a histogram)
> * Silently broken alerts that should be firing (same but the other way 
> around)
> * Graphs seemingly working for an instance, but showing wrong data because 
> some series of a metric are missing
> * Queries that try to correlate different sets of metrics from a target 
> breaking in subtle ways
>
> Essentially it removes a previously strong invariant from Prometheus-based 
> monitoring and introduces a whole new way of having to think about things 
> like the above. Even if you have alerts to catch the underlying situation, 
> you may get a bunch of incorrect alert notifications from the now-broken 
> rules in the meantime.
>
> On Mon, Nov 28, 2022 at 10:58 AM l.mi...@gmail.com <l.mi...@gmail.com> 
> wrote:
>
>>
>>
>> On Sunday, 27 November 2022 at 19:25:12 UTC sup...@gmail.com wrote:
>>
>>> Soft / partial failure modes can be very hard problems to deal with. You 
>>> have to be a lot more careful to not end up missing partial failures.
>>>
>>
>> Sure, partial failure modes can be a hard problem *in general*, but with 
>> Prometheus the worst case is that some metrics will be missing and some 
>> won't, which is, at least in my opinion, not that problematic.
>> What do you mean by "missing partial failures"?
>>  
>>
>>> While it seems like having a soft sample_limit is good and the hard 
>>> sample_limit is bad, "fail fast" will serve you better in the long run. 
>>> Most of the Prometheus monitoring design assumes fail fast. Partial results 
>>> are too hard to reason about from a monitoring perspective. With fail fast 
>>> you will know quickly and decisively that you've hit a problem. If you 
>>> treat monitoring outages as just as bad as an actual service outage, you'll 
>>> end up with a healthier system overall.
>>>
>>
>> I think I've seen the statement that partial scrapes are "too hard to 
>> reason about" a number of times. From my experience they are not. A hard 
>> fail means "all or nothing", a partial fail means "all or some". It's just 
>> a different level of granularity, which is not that hard to get your head 
>> around IMHO.
>> I think it's a hard problem from a Prometheus developer's point of view, 
>> because the inner logic and error handling is where the (potential) 
>> complication is. But from the end user's perspective it's not that hard at 
>> all. Most of the users I talked to asked me to implement a partial failure 
>> mode, because it makes things both safer and easier for them when making 
>> changes.
>> With a hard failure any mistake costs you every metric you scrape; there's 
>> no forgiveness. Treating it all like a service outage works best for 
>> microservices, where you make small deployments and it's easy to roll back 
>> or roll forward; it's harder for bigger services, especially ones worked on 
>> by multiple teams.
>>
>> Most cardinality issues we see are not because someone made a bad release, 
>> although that happens often too, but because a good chunk of all exported 
>> metrics is somehow dynamically tied to the workload of each service. And 
>> although we work hard not to have any obviously explosive metrics, like 
>> using the request path as a label etc., most services will export a 
>> different set of time series if they receive more traffic, or different 
>> traffic, or there's an incident with a dependent service, etc. So the 
>> number of time series is always changing just because traffic fluctuates, 
>> and trying to set a hard limit that won't trigger simply because there 
>> happens to be more traffic at 5am on a Sunday is difficult. So what teams 
>> will do is give themselves a huge buffer to allow for any spikes, and when 
>> you repeat that for each service, the sum of all hard limits is 2-5x your 
>> maximum capacity and it doesn't prevent any OOM kill. It will work at the 
>> individual service level, eventually, but it does nothing to stop 
>> Prometheus as a whole from running out of memory.
>>
>> This is all about sample_limit, which is just one level of protection.
>> Part of the patch we run is also a TSDB HEAD limit, which stops Prometheus 
>> from appending more time series than the configured limit.
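>>
>> As an illustration, an alert tracking how close the HEAD is to such a 
>> global limit could look roughly like this (the 8000000 threshold is just a 
>> placeholder for whatever limit is configured; prometheus_tsdb_head_series 
>> is Prometheus' own self-monitoring metric):
>>
>>   groups:
>>     - name: tsdb-capacity
>>       rules:
>>         - alert: TSDBHeadSeriesNearLimit
>>           # warn when the TSDB HEAD holds more than 90% of the allowed series
>>           expr: prometheus_tsdb_head_series > 0.9 * 8000000
>>           for: 30m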
>>  
>>
>>>
>>> For the case of label explosions, there are some good meta metrics[0] 
>>> that can help you. The "scrape_series_added" metric can allow you to soft 
>>> detect label leaks.
>>> In addition, there is a new feature flag[1] that adds additional target 
>>> metrics for monitoring for targets nearing their limits.
>>>
>>> [0]: 
>>> https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series
>>> [1]: 
>>> https://prometheus.io/docs/prometheus/latest/feature_flags/#extra-scrape-metrics
>>>
>>>
>> We have plenty of monitoring for the health of all scrapes. Unfortunately 
>> monitoring alone doesn't prevent Prometheus from running out of memory.
>> For a long time we had quotas for all scrapes, implemented as alerts for 
>> scrape owners. That caused a lot of noise and fatigue, and it didn't do 
>> anything useful for our capacity management. A common request based on 
>> that experience was to add soft limits.
>>  
>>
>>> On Fri, Nov 25, 2022 at 1:27 PM l.mi...@gmail.com <l.mi...@gmail.com> 
>>> wrote:
>>>
>>>> Hello,
>>>>
>>>> One of the biggest challenges we have when trying to run Prometheus 
>>>> with a constantly growing number of scraped services is keeping resource 
>>>> usage under control.
>>>> This usually means memory usage.
>>>> Cardinality is often a huge problem and we often end up with services 
>>>> accidentally exposing labels that are risky. One silly mistake we see 
>>>> every 
>>>> now and then is putting raw errors as labels, which then leads to time 
>>>> series with {error="connection from $ip:$port to $ip:$port timed out"} and 
>>>> so on.
>>>>
>>>> We've tried a number of ways of dealing with this using vanilla 
>>>> Prometheus features, but none of them really works well for us.
>>>> Obviously there is sample_limit that one might use here, but the biggest 
>>>> problem with it is that once you hit the sample_limit threshold you lose 
>>>> all metrics, and that's just not acceptable for us.
>>>> If I have a service that exports 999 time series and it suddenly goes to 
>>>> 1001 (with sample_limit=1000), I really don't want to lose all metrics 
>>>> just because of that, because losing all monitoring is a bigger problem 
>>>> than having a few extra time series in Prometheus. It's just too risky.
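>>>>
>>>> For reference, the hard limit in that example is just a per-job setting 
>>>> in the scrape config, roughly like this (job name and target are made 
>>>> up):
>>>>
>>>>   scrape_configs:
>>>>     - job_name: myservice
>>>>       # hard limit: going over 1000 samples fails the whole scrape
>>>>       sample_limit: 1000
>>>>       static_configs:
>>>>         - targets: ["myservice:9090"]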
>>>>
>>>> We're currently running Prometheus with patches from:
>>>> https://github.com/prometheus/prometheus/pull/11124
>>>>
>>>> This gives us two levels of protection (a rough config sketch follows 
>>>> below the list):
>>>> - a global HEAD limit - Prometheus is not allowed to have more than M 
>>>> time series in TSDB
>>>> - a per-scrape sample_limit - patched so that if you exceed sample_limit 
>>>> it starts rejecting time series that aren't already in TSDB
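>>>>
>>>> Purely hypothetical option names below - the real ones are whatever the 
>>>> patch defines - but the shape is roughly:
>>>>
>>>>   global:
>>>>     # level 1 (hypothetical key): cap on the total number of series in the TSDB HEAD
>>>>     head_series_limit: 8000000
>>>>   scrape_configs:
>>>>     - job_name: myservice
>>>>       # level 2: the existing sample_limit key, but with "soft" semantics:
>>>>       # series already in TSDB keep being appended, only new ones are rejected
>>>>       sample_limit: 1000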
>>>>
>>>> This works well for us and gives us a system where:
>>>> - we have reassurance that Prometheus won't start getting OOM killed 
>>>> overnight
>>>> - service owners can add new metrics without fear that a typo will cost 
>>>> them all their metrics
>>>>
>>>> But comments on that PR suggest that it's a highly controversial 
>>>> feature.
>>>> I wanted to probe this community to see what the overall feeling is and 
>>>> how likely it is that vanilla Prometheus will have something like this.
>>>> It's a small patch so I'm happy to just maintain it for our internal 
>>>> deployments, but it feels like a common problem to me, so a baked-in 
>>>> solution would be great.
>>>>
>>>> Lukasz
>>>>

