Hello.

I am trying to get to the bottom of two issues, one of which may be related to 
the other. I'm relatively new to Prometheus, alert rules, and the like.

We have a Prometheus server running version 2.17.2 with cAdvisor version 0.35.

We have an older Prometheus server running version 2.8.0 with cAdvisor 
version 0.33.

They are monitoring containers running on instances in AWS.  

The older prometheus server doesn't have an issue and has been running 
steadily for months.

The new server, which I set up last week, keeps firing critical alerts for our 
containers at random times, at least once an hour and for a different 
container each time.

I have tried everything I know to try, including lengthening the 
evaluation_interval to 60s and adjusting rules.alert.yml so the alert only 
triggers once a target has been down for 45 seconds rather than 10 seconds.
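For reference, the alert is along these lines; the alert name, job label, and 
annotation text below are placeholders I'm typing from memory rather than my 
exact file:

groups:
  - name: container.rules
    rules:
      - alert: ContainerDown          # placeholder name
        expr: up{job="cadvisor"} == 0 # fires when a scrape target stops answering
        for: 45s                      # how long the condition must hold before firing
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} has been down for more than 45 seconds"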

But the random alerts keep coming in, even though whenever I watch or refresh 
the Targets page in Prometheus, nothing ever shows as down.
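I assume a query along these lines in the expression browser would show 
whether up actually dipped during the last hour (the job label here is a 
guess, not necessarily how my config names it):

min_over_time(up{job="cadvisor"}[1h])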

The containers themselves have been up for days and haven't restarted. 

I thought adjusting scrape_timeout might fix this, but every time I try to set 
a different value for scrape_timeout in the prometheus.yml file, the 
Prometheus service starts and then immediately crashes.
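Roughly, the section I am editing looks like this; the values and target below 
are illustrative rather than my exact ones, and the scrape_timeout line is the 
one that breaks the restart whenever I touch it:

global:
  scrape_interval: 15s      # how often targets are scraped
  scrape_timeout: 10s       # the setting I am trying to change
  evaluation_interval: 60s  # lengthened from the default while troubleshooting

scrape_configs:
  - job_name: cadvisor
    static_configs:
      - targets: ['10.0.0.1:8080']  # placeholder, not a real instance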

Can anyone offer any help or suggestions? I am running out of ideas on what 
to tweak.

Thanks.
