That's a lot of alertmanagers.  Are they all fully meshed?  (Although I'd say 
2 or 3 would be better, spread over different regions.)
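
If you haven't already, check that each alertmanager is started with 
--cluster.peer flags pointing at the other members; unpeered instances will 
each notify independently, which looks exactly like the duplication you 
describe. As a rough sketch (assuming the default gossip port 9094 and your 
existing hostnames - adjust to taste):

  # on alertmanager1; repeat on each node, peering with the other members
  alertmanager \
    --config.file=/etc/alertmanager/alertmanager.yml \
    --cluster.listen-address=0.0.0.0:9094 \
    --cluster.peer=alertmanager2:9094 \
    --cluster.peer=alertmanager3:9094

Each instance's /api/v2/status endpoint shows which peers it has actually 
joined, so you can confirm the mesh from there.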

On Friday, 13 January 2023 at 14:16:27 UTC LukaszSz wrote:

> Yes. The Prometheus server is configured to communicate with all 
> alertmanagers (sorry, there are 8 alertmanagers):
>
> alerting:
>   alert_relabel_configs:
>   - action: labeldrop
>     regex: "^prometheus_server$"
>   alertmanagers:
>   - static_configs:
>     - targets:
>       - alertmanager1:9093
>       - alertmanager2:9093
>       - alertmanager3:9093
>       - alertmanager4:9093
>       - alertmanager5:9093
>       - alertmanager6:9093
>       - alertmanager7:9093
>       - alertmanager8:9093 
>
> On Friday, January 13, 2023 at 2:02:14 PM UTC+1 Brian Candler wrote:
>
>> Yes, but have you configured the Prometheus server (the one with the 
>> alerting rules) to have all four alertmanagers as its destinations?
>>
>> On Friday, 13 January 2023 at 12:55:49 UTC LukaszSz wrote:
>>
>>> Yes Brian. As I mentioned in my post, the Alertmanagers are in a cluster 
>>> and this event is visible on my 4 alertmanagers.
>>> The problem I described is that alerts are firing twice, which generates 
>>> duplicates.
>>>
>>> On Friday, January 13, 2023 at 1:34:52 PM UTC+1 Brian Candler wrote:
>>>
>>>> Are the alertmanagers clustered?  If so, you should configure Prometheus 
>>>> to deliver the alerts to *all* alertmanagers.
>>>>
>>>> On Friday, 13 January 2023 at 11:08:37 UTC LukaszSz wrote:
>>>>
>>>>> Hi guys,
>>>>>
>>>>> I have Prometheus in HA mode - 4 nodes - Prometheus+Alertmanager on 
>>>>> each.
>>>>> Everything works fine, but quite often an alert fires again even though 
>>>>> the event has already been resolved by the alertmanager.
>>>>>
>>>>> Below are logs from an example event (Chrony_Service_Down) recorded by 
>>>>> the alertmanager:
>>>>>
>>>>>
>>>>> ############################################################################################################
>>>>> (1) Jan 10 10:48:48 prometheus-01 alertmanager[1213219]: 
>>>>> ts=2023-01-10T10:48:48.746Z caller=dispatch.go:165 level=debug 
>>>>> component=dispatcher msg="Received alert" 
>>>>> alert=Chrony_Service_Down[d8c020a][active]
>>>>>
>>>>> (2) Jan 10 10:49:19 prometheus-01 alertmanager[1213219]: 
>>>>> ts=2023-01-10T10:49:19.460Z caller=nflog.go:553 level=debug 
>>>>> component=nflog 
>>>>> msg="gossiping new entry" 
>>>>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>>>>>  
>>>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>>>> server.example.com\\\", instance=\\\"server.example.com\\\", 
>>>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
>>>>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
>>>>> severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\" 
>>>>> integration:\"opsgenie\" > timestamp:<seconds:1673347759 nanos:262824014 
>>>>> > 
>>>>> firing_alerts:10151928354614242630 > expires_at:<seconds:1673779759 
>>>>> nanos:262824014 > "
>>>>>
>>>>> (3) Jan 10 10:50:48 prometheus-01 alertmanager[1213219]: 
>>>>> ts=2023-01-10T10:50:48.745Z caller=dispatch.go:165 level=debug 
>>>>> component=dispatcher msg="Received alert" 
>>>>> alert=Chrony_Service_Down[d8c020a][resolved]
>>>>>
>>>>> (4) Jan 10 10:51:29 prometheus-01 alertmanager[1213219]: 
>>>>> ts=2023-01-10T10:51:29.183Z caller=nflog.go:553 level=debug 
>>>>> component=nflog 
>>>>> msg="gossiping new entry" 
>>>>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>>>>>  
>>>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>>>> server.example.com\\\", instance=\\\"server.example.com\\\", 
>>>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
>>>>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
>>>>> severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\" 
>>>>> integration:\"opsgenie\" > timestamp:<seconds:1673347888 nanos:897562679 
>>>>> > 
>>>>> resolved_alerts:10151928354614242630 > expires_at:<seconds:1673779888 
>>>>> nanos:897562679 > "
>>>>>
>>>>> (5) Jan 10 10:51:49 prometheus-01 alertmanager[1213219]: 
>>>>> ts=2023-01-10T10:51:49.745Z caller=nflog.go:553 level=debug 
>>>>> component=nflog 
>>>>> msg="gossiping new entry" 
>>>>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>>>>>  
>>>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>>>> server.example.com\\\", instance=\\\"server.example.com\\\", 
>>>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
>>>>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
>>>>> severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\" 
>>>>> integration:\"opsgenie\" > timestamp:<seconds:1673347909 nanos:649205670 
>>>>> > 
>>>>> firing_alerts:10151928354614242630 > expires_at:<seconds:1673779909 
>>>>> nanos:649205670 > "
>>>>>
>>>>> (6) Jan 10 10:51:59 prometheus-01 alertmanager[1213219]: 
>>>>> ts=2023-01-10T10:51:59.312Z caller=nflog.go:553 level=debug 
>>>>> component=nflog 
>>>>> msg="gossiping new entry" 
>>>>> entry="entry:<group_key:\"{}/{severity=\\\"page\\\"}:{alertname=\\\"Chrony_Service_Down\\\",
>>>>>  
>>>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>>>> server.example.com\\\", instance=\\\"server.example.com\\\", 
>>>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
>>>>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
>>>>> severity=\\\"page\\\"}\" receiver:<group_name:\"opsgenie\" 
>>>>> integration:\"opsgenie\" > timestamp:<seconds:1673347919 nanos:137020780 
>>>>> > 
>>>>> resolved_alerts:10151928354614242630 > expires_at:<seconds:1673779919 
>>>>> nanos:137020780 > "
>>>>>
>>>>> (7) Jan 10 10:54:58 prometheus-01 alertmanager[1213219]: 
>>>>> ts=2023-01-10T10:54:58.744Z caller=dispatch.go:165 level=debug 
>>>>> component=dispatcher msg="Received alert" 
>>>>> alert=Chrony_Service_Down[d8c020a][resolved]
>>>>>
>>>>> #############################################################################################################
>>>>>
>>>>> The interesting part is entry (5) (Jan 10 10:51:49), where the 
>>>>> alertmanager fired the alert a second time even though a minute earlier 
>>>>> (Jan 10 10:50:48) the alert had been marked as resolved.
>>>>> Such behavior generates duplicate alerts in our system, which is quite 
>>>>> annoying at our scale.
>>>>>
>>>>> Worth mentioning:
>>>>> - For test purposes the event is scraped by 4 Prometheus servers 
>>>>> (default), but the alert rule is evaluated by only one Prometheus.
>>>>> - The event occurs only once, so there is no flapping that might cause 
>>>>> the alert to fire again.
>>>>>
>>>>> Thanks
>>>>>
>>>>
