Hello. I'm trying to get to the bottom of two issues, one of which may be related to the other. I'm relatively new to Prometheus, alert rules, and the like.
We have a Prometheus server running version 2.17.2 with cAdvisor 0.35, and an older Prometheus server running version 2.8.0 with cAdvisor 0.33. Both monitor containers running on instances in AWS. The older server has no issues and has been running steadily for months.

The new server, which I set up last week, is randomly firing critical alerts for our containers, at least one per hour, each time for a different container. But whenever I check or refresh the targets page in Prometheus, the targets are never down, and the containers themselves have been up for days without restarting.

I have tried everything I know, including lengthening evaluation_interval to 60s and adjusting rules.alert.yml so an alert only triggers if a target has been down for 45 seconds, evaluated over a length of 10 seconds. The random alerts keep coming regardless.

I thought adjusting scrape_timeout might fix this, but every time I put a different value for scrape_timeout in the prometheus.yml file, the Prometheus service starts and then immediately crashes.

Can anyone offer any help or suggestions? I'm running out of ideas on what to tweak. Thanks.
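For reference, here's roughly the shape of the config section I've been editing (a sketch from memory with placeholder job and target names, not a verbatim copy of my file). One thing I've read, which might explain the crashes, is that Prometheus refuses to start if scrape_timeout is set greater than scrape_interval:

```yaml
# Sketch of the relevant parts of prometheus.yml (placeholder names/targets)
global:
  scrape_interval: 15s      # how often targets are scraped
  scrape_timeout: 10s       # the value I've been changing; must be <= scrape_interval
  evaluation_interval: 60s  # lengthened from the default while troubleshooting

scrape_configs:
  - job_name: 'cadvisor'            # placeholder job name
    static_configs:
      - targets: ['10.0.0.1:8080']  # placeholder instance address
```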