Hi guys,

I have Prometheus in HA mode - 4 nodes, with Prometheus + Alertmanager on each.
Everything generally works fine, but quite often an alert fires again even
though the event has already been resolved by Alertmanager.
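
For context, the Alertmanagers are clustered with each other over gossip; the
startup flags boil down to something like this (a simplified sketch - hostnames
other than prometheus-01 are placeholders and ports are the defaults):

  # sketch for prometheus-01; the other three nodes are configured analogously
  alertmanager \
    --config.file=/etc/alertmanager/alertmanager.yml \
    --cluster.listen-address=0.0.0.0:9094 \
    --cluster.peer=prometheus-02:9094 \
    --cluster.peer=prometheus-03:9094 \
    --cluster.peer=prometheus-04:9094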

Below are the logs for an example event (Chrony_Service_Down) as recorded by Alertmanager:

############################################################################################################
(1) Jan 10 10:48:48 prometheus-01 alertmanager[1213219]: 
ts=2023-01-10T10:48:48.746Z caller=dispatch.go:165 level=debug 
component=dispatcher msg="Received alert" 
alert=Chrony_Service_Down[d8c020a][active]

(2) Jan 10 10:49:19 prometheus-01 alertmanager[1213219]: 
ts=2023-01-10T10:49:19.460Z caller=nflog.go:553 level=debug component=nflog 
msg="gossiping new entry" 
entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
 
datacenter=\\\"dc01\\\", exporter=\\\"node\\\", 
fqdn=\\\"server.example.com\\\", instance=\\\"server.example.com\\\", 
job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\" 
integration:\"opsgenie\" > timestamp:<seconds:1673347759 nanos:262824014 > 
firing_alerts:10151928354614242630 > expires_at:<seconds:1673779759 
nanos:262824014 > "

(3) Jan 10 10:50:48 prometheus-01 alertmanager[1213219]: 
ts=2023-01-10T10:50:48.745Z caller=dispatch.go:165 level=debug 
component=dispatcher msg="Received alert" 
alert=Chrony_Service_Down[d8c020a][resolved]

(4) Jan 10 10:51:29 prometheus-01 alertmanager[1213219]: 
ts=2023-01-10T10:51:29.183Z caller=nflog.go:553 level=debug component=nflog 
msg="gossiping new entry" 
entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
 
datacenter=\\\"dc01\\\", exporter=\\\"node\\\", 
fqdn=\\\"server.example.com\\\", instance=\\\"server.example.com\\\", 
job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\" 
integration:\"opsgenie\" > timestamp:<seconds:1673347888 nanos:897562679 > 
resolved_alerts:10151928354614242630 > expires_at:<seconds:1673779888 
nanos:897562679 > "

(5) Jan 10 10:51:49 prometheus-01 alertmanager[1213219]: 
ts=2023-01-10T10:51:49.745Z caller=nflog.go:553 level=debug component=nflog 
msg="gossiping new entry" 
entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
 
datacenter=\\\"dc01\\\", exporter=\\\"node\\\", 
fqdn=\\\"server.example.com\\\", instance=\\\"server.example.com\\\", 
job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\" 
integration:\"opsgenie\" > timestamp:<seconds:1673347909 nanos:649205670 > 
firing_alerts:10151928354614242630 > expires_at:<seconds:1673779909 
nanos:649205670 > "

(6) Jan 10 10:51:59 prometheus-01 alertmanager[1213219]: 
ts=2023-01-10T10:51:59.312Z caller=nflog.go:553 level=debug component=nflog 
msg="gossiping new entry" 
entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
 
datacenter=\\\"dc01\\\", exporter=\\\"node\\\", 
fqdn=\\\"server.example.com\\\", instance=\\\"server.example.com\\\", 
job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\" 
integration:\"opsgenie\" > timestamp:<seconds:1673347919 nanos:137020780 > 
resolved_alerts:10151928354614242630 > expires_at:<seconds:1673779919 
nanos:137020780 > "

(7) Jan 10 10:54:58 prometheus-01 alertmanager[1213219]: 
ts=2023-01-10T10:54:58.744Z caller=dispatch.go:165 level=debug 
component=dispatcher msg="Received alert" 
alert=Chrony_Service_Down[d8c020a][resolved]
#############################################################################################################

The interesting part is entry (5) (Jan 10 10:51:49), where Alertmanager fired
the alert a second time, even though a minute earlier (Jan 10 10:50:48) the
alert had already been marked as resolved.
Such behavior generates duplicate alerts in our system, which is quite annoying
at our scale.
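
To make the timeline easier to follow, the timestamps inside the gossiped nflog
entries above decode to the following (GNU date, UTC):

  date -u -d @1673347759   # entry (2): firing_alerts   -> Tue Jan 10 10:49:19 UTC 2023
  date -u -d @1673347888   # entry (4): resolved_alerts -> Tue Jan 10 10:51:28 UTC 2023
  date -u -d @1673347909   # entry (5): firing_alerts   -> Tue Jan 10 10:51:49 UTC 2023
  date -u -d @1673347919   # entry (6): resolved_alerts -> Tue Jan 10 10:51:59 UTC 2023

So only ~20 seconds after the resolved notification entry (4) was gossiped, a
new firing entry (5) was gossiped for the same alert group.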

What is worth mentioning:
- For test purposes the event is scraped by all 4 Prometheus servers (the 
default), but the alert rule is evaluated by only one Prometheus - see the 
config sketch after this list.
- The event occurred only once, so there is no flapping that could cause the 
alert to fire again.
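
For reference, the alerting block in prometheus.yml on the evaluating Prometheus
follows the standard HA wiring, i.e. all Alertmanagers listed as targets
(simplified sketch, not the exact config - hostnames other than prometheus-01
are placeholders):

  # simplified sketch of the alerting targets - hostnames are placeholders
  alerting:
    alertmanagers:
      - static_configs:
          - targets:
              - prometheus-01:9093
              - prometheus-02:9093
              - prometheus-03:9093
              - prometheus-04:9093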

Thanks
