Thanks for your answers,

In my current setup, running Prometheus in HA, I have one instance that can't 
scrape the apps, while the other one can. I want to find out which instance is 
failing so I can restart it. I don't see anything in the logs that reflects 
the problem. It would be nice if we could 'translate' the output of the 
/targets page into some kind of metric, if that makes sense.
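
To illustrate what I am after, here is a rough sketch built on the `up` 
metric mentioned earlier in the thread. It assumes the same rule file is 
loaded by both replicas; the group name and alert name are made up for the 
example. Since each replica evaluates `up` from its own scrapes, only the 
replica that cannot reach a target would fire.

```yaml
# Sketch only: rule file loaded by BOTH Prometheus replicas.
# Each replica evaluates `up` from its own scrape results, so the alert
# fires only on the replica that fails to scrape the target.
groups:
  - name: scrape-health        # hypothetical group name
    rules:
      - alert: TargetDown      # hypothetical alert name
        expr: up == 0
        for: 5m
        labels:
          severity: high
        annotations:
          summary: "{{ $labels.job }}/{{ $labels.instance }} is not being scraped successfully"
```

To tell from the alert which replica fired it, each replica would need a 
distinct label under `global.external_labels` (for example a `replica` label, 
again an assumed name), and that label must not be dropped by 
`alert_relabel_configs`.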

On Sunday, 7 March 2021 at 06:38:26 UTC+1, Evelyn Pereira Souza wrote:

> On 06.03.21 11:45, Ben Kochie wrote:
> > Yes, this is what the `up` metric provides. There's also 
> > `scrape_duration_seconds` that provides the time it took to perform the 
> > scrape. This makes it easier to see timeouts
> Hi
>
> a few additions from 
> https://www.omerlh.info/2019/03/04/keeping-prometheus-in-shape/
>
> - Use scrape_duration_seconds for monitoring
> - Use sample_limit to drop problematic targets
> - Use scrape_samples_scraped to monitor the number of samples exposed by 
> a specific target
>
> alert: ScrapeDuration
> expr: max(scrape_duration_seconds) > 15
> for: 5m
> labels:
>   severity: high
> annotations:
>   summary: "Prometheus Scrape Duration is getting near the limit"
>
>
> alert: TeamAwesomeScrapeSampleSize
> expr: max(scrape_samples_scraped{kubernetes_namespace="awesome"}) > 1000
> for: 5m
> labels:
>   severity: high
> annotations:
>   summary: "Oh No! One of our services is exposing too many metrics!"
>
> kind regards
> Evelyn
>
