Thanks for your answers. In my current setup I run Prometheus in HA, and one instance can't scrape the apps while the other one can. I want to find out which instance is unable to scrape the apps so I can restart it. I don't see anything in the logs that reflects the issue. It would be nice if we could 'translate' the output of the /targets page into some kind of metric, if that makes sense.
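For illustration, here is a rough sketch of the comparison I have in mind, assuming each replica answers the standard HTTP API endpoint `/api/v1/query` and that every target carries `job` and `instance` labels. The replica names and URLs are placeholders, not my real setup, and the helper names are made up for this example.

```python
import json
import urllib.parse
import urllib.request


def parse_up(payload):
    """Turn an instant-query response for `up` into {(job, instance): value}."""
    result = {}
    for sample in payload["data"]["result"]:
        labels = sample["metric"]
        key = (labels.get("job", ""), labels.get("instance", ""))
        # Instant-vector samples are [timestamp, "value-as-string"].
        result[key] = float(sample["value"][1])
    return result


def fetch_up(base_url, timeout=10):
    """Ask one Prometheus server for its own view of the `up` metric."""
    query = urllib.parse.urlencode({"query": "up"})
    url = base_url.rstrip("/") + "/api/v1/query?" + query
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_up(json.load(resp))


def failing_targets(up_by_replica):
    """Map each replica name to the sorted targets it reports with up == 0."""
    return {
        replica: sorted(target for target, value in ups.items() if value == 0)
        for replica, ups in up_by_replica.items()
    }


def main():
    # Hypothetical replica URLs; replace with the real HA pair.
    replicas = {
        "prometheus-a": "http://prometheus-a:9090",
        "prometheus-b": "http://prometheus-b:9090",
    }
    up_by_replica = {name: fetch_up(url) for name, url in replicas.items()}
    for name, down in failing_targets(up_by_replica).items():
        print(name, "cannot scrape:", down)
```

Calling `main()` after editing the URLs should list, per replica, the targets that replica sees as down. Since each server records `up` from its own scrapes, the replica with the long list is presumably the one to restart.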
On Sunday, 7 March 2021 at 06:38:26 UTC+1, Evelyn Pereira Souza wrote:
> On 06.03.21 11:45, Ben Kochie wrote:
> > Yes, this is what the `up` metric provides. There's also
> > `scrape_duration_seconds` that provides the time it took to perform the
> > scrape. This makes it easier to see timeouts
>
> Hi
>
> a few additions from
> https://www.omerlh.info/2019/03/04/keeping-prometheus-in-shape/
>
> - Use scrape_duration for monitoring
> - Use scrape_limit to drop problematic targets
> - Use scrape_samples_scraped to monitor the size of metrics exposed by
>   specific target
>
> alert: ScrapeDuration
> expr: max(scrape_duration_seconds) > 15
> for: 5m
> labels:
>   severity: high
> annotations:
>   summary: "Prometheus Scrape Duration is getting near the limit"
>
> alert: TeamAwesomeScrapeSampleSize
> expr: max(scrape_samples_scraped{kubernetes_namespace='awesome'}) > 1000
> for: 5m
> labels:
>   severity: high
> annotations:
>   summary: "Oh No! One of our services is exposing too much metrics!"
>
> kind regards
> Evelyn

