Soft / partial failure modes can be very hard to deal with. You have to be
a lot more careful not to end up missing partial failures.

While it may seem like the soft sample_limit is good and the hard
sample_limit is bad, "fail fast" will serve you better in the long run.
Most of the Prometheus monitoring design assumes fail fast. Partial results
are too hard to reason about from a monitoring perspective. With fail fast
you will know quickly and decisively that you've hit a problem. If you
treat a monitoring outage as just as bad as an actual service outage,
you'll end up with a healthier system overall.
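
For reference, the hard limit being discussed here is just the sample_limit
field on a scrape config. A minimal sketch, with a made-up job name, target
and threshold:

scrape_configs:
  - job_name: my-service              # hypothetical job name
    sample_limit: 1000                # whole scrape fails (up == 0) above 1000 samples
    static_configs:
      - targets: ["my-service:9090"]  # hypothetical target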

For the case of label explosions, there are some useful meta-metrics[0] that
can help you. The "scrape_series_added" metric can help you detect label
leaks early, before they become a hard failure.
In addition, there is a new feature flag[1] that adds extra per-target
metrics for monitoring targets that are nearing their limits. A rough sketch
of alert rules built on these follows the links below.

[0]: https://prometheus.io/docs/concepts/jobs_instances/#automatically-generated-labels-and-time-series
[1]: https://prometheus.io/docs/prometheus/latest/feature_flags/#extra-scrape-metrics
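
To make that concrete, here is a rough sketch of alerting rules built on
those metrics. Alert names and thresholds are made up, and the second rule
assumes Prometheus is running with --enable-feature=extra-scrape-metrics so
that scrape_sample_limit is exposed:

groups:
  - name: scrape-cardinality          # hypothetical group name
    rules:
      # A sustained stream of brand-new series per scrape often means a label leak.
      - alert: TargetAddingManyNewSeries
        expr: sum_over_time(scrape_series_added[1h]) > 1000
        for: 30m
      # Warn before a target hits its hard sample_limit.
      # The (scrape_sample_limit > 0) filter skips targets with no limit configured.
      - alert: TargetNearSampleLimit
        expr: scrape_samples_scraped / (scrape_sample_limit > 0) > 0.9
        for: 15m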

On Fri, Nov 25, 2022 at 1:27 PM l.mi...@gmail.com <l.mier...@gmail.com>
wrote:

> Hello,
>
> One of the biggest challenges we have when trying to run Prometheus with a
> constantly growing number of scraped services is keeping resource usage
> under control.
> This usually means memory usage.
> Cardinality is often a huge problem and we often end up with services
> accidentally exposing labels that are risky. One silly mistake we see every
> now and then is putting raw errors as labels, which then leads to time
> series with {error="connection from $ip:$port to $ip:$port timed out"} and
> so on.
>
> We have tried a number of ways of dealing with this using vanilla
> Prometheus features, but none of them really works well for us.
> Obviously there is sample_limit that one might use here, but the biggest
> problem with it is that once you hit the sample_limit threshold you lose
> all metrics, and that's just not acceptable for us.
> If I have a service that exports 999 time series and it suddenly goes to
> 1001 (with sample_limit=1000) I really don't want to lose all metrics just
> because of that, since losing all monitoring is a bigger problem than
> having a few extra time series in Prometheus. It's just too risky.
>
> We're currently running Prometheus with patches from:
> https://github.com/prometheus/prometheus/pull/11124
>
> This gives us two levels of protection:
> - a global HEAD limit - Prometheus is not allowed to have more than M time
> series in TSDB
> - a per-scrape sample_limit - but patched so that if you exceed sample_limit
> it will start rejecting time series that aren't already in TSDB
>
> This works well for us and gives us a system where:
> - we can be reassured that Prometheus won't start getting OOM-killed
> overnight
> - service owners can add new metrics without fear that a typo will cost
> them all of their metrics
>
> But comments on that PR suggest that it's a highly controversial feature.
> I wanted to probe this community to see what the overall feeling is and
> how likely it is that vanilla Prometheus will have something like this.
> It's a small patch, so I'm happy to just maintain it for our internal
> deployments, but it feels like a common problem to me, so a baked-in
> solution would be great.
>
> Lukasz
>
