On 06.06.21 09:56, Christian Galsterer wrote:
> There are metrics for the actual scrape duration, but currently there are no
> metrics for the scrape timeouts. Adding a metric for the scrape timeout
> would make it possible to monitor and alert on scrape timeouts without
> hard-coding the timeouts in the PromQL queries; instead, the new metric
> could be used.
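(For concreteness, this is the kind of query the proposal would enable. The metric name `scrape_timeout_seconds` is hypothetical, not an existing Prometheus metric:)

```
# Today: the timeout must be hard-coded in the expression (here 10s).
scrape_duration_seconds > 0.8 * 10

# With the proposed (hypothetical) metric, the threshold follows the config:
scrape_duration_seconds > 0.8 * scrape_timeout_seconds
```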
Sounds like a good idea at first glance, but note that this would be
yet another metric that gets automatically added to every single
target. I think we have to be careful when doing so.
Your proposal mirrors a part of the configuration into metrics. That
is sometimes a neat thing to do, but it has to be enjoyed responsibly.
In this case, you want to specifically alert on scrape timeouts (or, I
guess, approaching them). The same argument could be made to alert on
exceeding (or approaching) the sample limit. So we need a new scrape
metric for the `sample_limit` configuration setting, too. The same is
true for all the other limits: `label_limit`,
`label_name_length_limit`, `label_value_length_limit`,
`target_limit`. So we have to add _six_ new metrics. Also, I had a
bunch of situations where I would have liked to know the intended
scrape interval of a series (rather than guessing it from the spacing
I could see in the samples of the series). So yet another metric for
the configured scrape interval. Things are getting out of control
here...
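(As an aside, the interval can be roughly estimated from existing data, though that's a workaround rather than a substitute for a real metric. A PromQL sketch, which assumes no failed or missing scrapes in the window:)

```
# Approximate scrape interval in seconds: window length divided by
# the number of samples of `up` in that window.
3600 / count_over_time(up[1h])
```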
The question is, of course, why you would like to alert on scrape
timeout specifically. There are many possible reasons why a scrape
fails. Generally, I would recommend just alerting on `up` being zero
too often. If that alert fires, you can then check out the Prometheus
server in question and investigate _why_ the scrapes are failing.
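A minimal alerting rule along those lines (the thresholds are illustrative, not a recommendation):

```yaml
groups:
  - name: scrape-health
    rules:
      - alert: TargetDown
        # Fires if a target failed more than 10% of its scrapes over
        # the last 10m, regardless of whether the cause was a timeout
        # or something else.
        expr: avg_over_time(up[10m]) < 0.9
        for: 5m
        labels:
          severity: warning
```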
Interestingly, we have a metric
`prometheus_rule_group_interval_seconds` for the configured evaluation
interval of a rule group. Note, however, that this is not a synthetic
metric injected alongside the evaluation result of the rule, but only
exposed by the `/metrics` endpoint of Prometheus itself. That's only
one metric per rule group, and it's exposed for meta-monitoring, which
could be on a separate server, so it doesn't "pollute" the normal
metrics.
In summary, I'm pretty sure we shouldn't add half a dozen synthetic
metrics for each target to mirror its configuration into metrics. But
perhaps we could add more metrics for meta-monitoring. Have a look at
the already existing metrics beginning with
`prometheus_target_...`. There is for example
`prometheus_target_scrapes_exceeded_sample_limit_total`, but note that
this is just one metric for the whole server. It's mostly meant to trigger
a specific alert if _any_ target runs into the sample limit. Perhaps
the same could be done for timeouts as
`prometheus_target_scrapes_exceeded_scrape_timeout_total`.
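Such a server-wide counter could then drive a meta-monitoring alert, analogous to what people do with the existing sample-limit counter. A sketch; note that the timeout metric used here is the proposed one, not an existing metric:

```yaml
groups:
  - name: meta-monitoring
    rules:
      - alert: ScrapeTimeoutsOccurring
        # Fires if any scrape on this server recently exceeded its
        # timeout. `prometheus_target_scrapes_exceeded_scrape_timeout_total`
        # is the proposed metric name, not one that exists today.
        expr: rate(prometheus_target_scrapes_exceeded_scrape_timeout_total[5m]) > 0
        for: 10m
        labels:
          severity: warning
```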
--
Björn Rabenstein
[PGP-ID] 0x851C3DA17D748D03
[email] bjo...@rabenste.in
To view this discussion on the web visit
https://groups.google.com/d/msgid/prometheus-developers/20210609162547.GO3670%40jahnn.