[prometheus-users] Re: An alert fires twice even though an event occurs only once
Interesting. It seems the alertmanagers are spread over 3 different regions (2x Asia, 2x USA, 4x Europe). Maybe there is some latency problem between them, such as latency in the gossip messages?

On Friday, January 13, 2023 at 3:28:57 PM UTC+1 Brian Candler wrote:
> That's a lot of alertmanagers. Are they all fully meshed? (But I'd say 2
> or 3 would be better - spread over different regions)
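One way to check whether the mesh is actually complete is to alert on the cluster-membership metrics that Alertmanager itself exposes (alertmanager_cluster_members, alertmanager_cluster_failed_peers). A rough sketch of such a rule, assuming those metrics are scraped and that 8 is the intended cluster size described above:

groups:
  - name: alertmanager-cluster
    rules:
      - alert: AlertmanagerClusterIncomplete
        # alertmanager_cluster_members is the number of peers this instance
        # currently sees, including itself; 8 is the cluster size in this thread.
        expr: alertmanager_cluster_members != 8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} sees {{ $value }} of 8 Alertmanager cluster members"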
[prometheus-users] Re: An alert fires twice even though an event occurs only once
That's a lot of alertmanagers. Are they all fully meshed? (But I'd say 2 or 3 would be better - spread over different regions)

On Friday, 13 January 2023 at 14:16:27 UTC LukaszSz wrote:
> Yes. The prometheus server is configured to communicate with all
> alertmanagers (sorry, there are 8 alertmanagers):
>
> alerting:
>   alert_relabel_configs:
>     - action: labeldrop
>       regex: "^prometheus_server$"
>   alertmanagers:
>     - static_configs:
>         - targets:
>             - alertmanager1:9093
>             - alertmanager2:9093
>             - alertmanager3:9093
>             - alertmanager4:9093
>             - alertmanager5:9093
>             - alertmanager6:9093
>             - alertmanager7:9093
>             - alertmanager8:9093
[prometheus-users] Re: An alert fires twice even though an event occurs only once
Yes. The prometheus server is configured to communicate with all alertmanagers (sorry, there are 8 alertmanagers):

alerting:
  alert_relabel_configs:
    - action: labeldrop
      regex: "^prometheus_server$"
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager1:9093
            - alertmanager2:9093
            - alertmanager3:9093
            - alertmanager4:9093
            - alertmanager5:9093
            - alertmanager6:9093
            - alertmanager7:9093
            - alertmanager8:9093

On Friday, January 13, 2023 at 2:02:14 PM UTC+1 Brian Candler wrote:
> Yes, but have you configured the prometheus (the one which has alerting
> rules) to have all four alertmanagers as its destination?
[prometheus-users] Re: An alert fires twice even though an event occurs only once
Yes, but have you configured the prometheus (the one which has alerting rules) to have all four alertmanagers as its destination?

On Friday, 13 January 2023 at 12:55:49 UTC LukaszSz wrote:
> Yes Brian. As I mentioned in my post, the Alertmanagers are in a cluster and
> this event is visible on my 4 alertmanagers.
> The problem I described is that alerts are firing twice, which generates
> duplication.
[prometheus-users] Re: An alert fires twice even though an event occurs only once
Yes Brian. As I mentioned in my post, the Alertmanagers are in a cluster and this event is visible on my 4 alertmanagers.
The problem I described is that alerts are firing twice, which generates duplication.

On Friday, January 13, 2023 at 1:34:52 PM UTC+1 Brian Candler wrote:
> Are the alertmanagers clustered? Then you should configure prometheus to
> deliver the alert to *all* alertmanagers.
[prometheus-users] Re: An alert fires twice even though an event occurs only once
Are the alertmanagers clustered? Then you should configure prometheus to deliver the alert to *all* alertmanagers.

On Friday, 13 January 2023 at 11:08:37 UTC LukaszSz wrote:
> Hi guys,
>
> I have Prometheus in HA mode - 4 nodes, with Prometheus+Alertmanager on each.
> Everything works fine, but very often I experience an issue where an alert
> fires again even though the event has already been resolved by alertmanager.
Re: [prometheus-users] AlertManager rules examples
On 11/01/2023 19:58, Eulogio Apelin wrote:
> I'm looking for information, primarily examples, of various ways to
> configure alert rules.

I think it is best not to think of alerts as moment-in-time events but as a time period during which a certain condition is true. Separate from the actual alert firing are the rules (in Alertmanager) for how to route it (e.g. to Slack, email, etc.), what to send (the email body template) and how often to remind people that the alert is still happening.

So, for example, with your TLS expiry scenario you might have an alert which starts firing once a certificate is within 60 days of expiry. It would continue to fire until either the certificate is renewed (i.e. it is over 60 days again) or it stops existing (because you've reconfigured Prometheus to no longer monitor that certificate). Then within Alertmanager you can set rules to send you a message every 10 days while that alert is firing, meaning you'd get a message at 60, 50, 40, etc. days until expiry.

More complex alert routing decisions are generally out of scope for Alertmanager and would be expected to be handled by a more capable system (such as PagerDuty, OpsGenie, Grafana OnCall, etc.). This would cover your example of wanting to escalate an alert after a period of time, but would also cover things like on-call rotas where different people are contacted based on a rota calendar.

-- 
Stuart Clark
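As an illustration of that pattern, here is a rough sketch: a Prometheus rule that starts firing once a certificate is within 60 days of expiry (this assumes the blackbox_exporter's probe_ssl_earliest_cert_expiry metric is being scraped), plus an Alertmanager route that repeats the notification every 10 days while the alert keeps firing. All names and thresholds are illustrative, not taken from anyone's actual setup.

# Prometheus rule file (sketch)
groups:
  - name: tls-expiry
    rules:
      - alert: TLSCertExpiringSoon
        # Fires for as long as the certificate is within 60 days of expiry.
        expr: probe_ssl_earliest_cert_expiry - time() < 60 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate on {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"

# Alertmanager route (sketch); SMTP details such as smarthost are omitted.
route:
  receiver: email-team
  routes:
    - matchers:
        - alertname = "TLSCertExpiringSoon"
      receiver: email-team
      repeat_interval: 10d   # re-send the reminder every 10 days while still firing
receivers:
  - name: email-team
    email_configs:
      - to: team@example.com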
[prometheus-users] AlertManager rules examples
I'm looking for information, primarily examples, of various ways to configure alert rules. Specifically, scenarios like the following, in a single rule group:

- an expression that determines a TLS cert expires in 60 days: send 1 alert
- an expression that determines a TLS cert expires in 40 days: send 1 alert
- an expression that determines a TLS cert expires in 30 days: send 1 alert
- an expression that determines a TLS cert expires in 20 days: send 1 alert
- an expression that determines a TLS cert expires in 10 days: send 1 alert
- an expression that determines a TLS cert expires in 5 days: send 1 alert
- an expression that determines a TLS cert expires in 0 days: send 1 alert

Another scenario is to send an alert once a day to an email address; if it's the 3rd day in a row, send the alert to another set of addresses and stop alerting.

Can alertmanager send alerts to Teams like it does Slack?

And any other general examples of alertmanager rules.

Thanks!
[prometheus-users] the query text is cut off within postgres_exporter
The query text is cut off within postgres_exporter / Grafana. This makes it hard or even impossible to inspect suspect queries.

Question: is there any way to get the whole query text from there?
[prometheus-users] Resolved alerts are grouped together with firing alerts
I got this strange behavior where resolved alerts are sent alongside firing ones.

So I have this rule: kube_pod_container_status_ready{namespace="default"} == 0.

What happens: when a pod is down, an alert is sent and everything is fine; then the pod comes up and the alert is resolved. But if the pod fails again within a short period and gets recreated by the deployment with a different name, the alert fires mentioning both the previous pod and the new one. I also noticed that if you wait about 20 minutes after the alert is resolved and kill a pod again, there is only one pod in the alert.

This is expected:

first alert 12:24
Container ubuntu in pod test-ubuntu-5579c5f49c-rsb8v is not ready for 30 seconds.
Prometheus Alert (Firing)
summary: Container is not ready for too long.
alertname: KubeContainerNotReady
container: ubuntu
endpoint: http
instance: 10.233.74.200:8080
job: kube-state-metrics
pod: test-ubuntu-5579c5f49c-rsb8v
prometheus: prometheus/prometheus-kube-prometheus-prometheus
service: prometheus-kube-state-metrics
severity: warning
uid: 85f61574-2559-4f1a-8a14-f08ee4e34b8a

second alert 12:27
Container ubuntu in pod test-ubuntu-5579c5f49c-rsb8v is not ready for 30 seconds.
Prometheus Alert (Resolved)
summary: Container is not ready for too long.
alertname: KubeContainerNotReady
container: ubuntu
endpoint: http
instance: 10.233.74.200:8080
job: kube-state-metrics
pod: test-ubuntu-5579c5f49c-rsb8v
prometheus: prometheus/prometheus-kube-prometheus-prometheus
service: prometheus-kube-state-metrics
severity: warning
uid: 85f61574-2559-4f1a-8a14-f08ee4e34b8a

Then I kill the pod and this happens (it's a single alert):

third alert 12:32
Container ubuntu in pod test-ubuntu-5579c5f49c-rsb8v is not ready for 30 seconds.
Prometheus Alert (Firing)
summary: Container is not ready for too long.
alertname: KubeContainerNotReady
container: ubuntu
endpoint: http
instance: 10.233.74.200:8080
job: kube-state-metrics
pod: test-ubuntu-5579c5f49c-rsb8v
prometheus: prometheus/prometheus-kube-prometheus-prometheus
service: prometheus-kube-state-metrics
severity: warning
uid: 85f61574-2559-4f1a-8a14-f08ee4e34b8a

Container ubuntu in pod test-ubuntu-5579c5f49c-sjlrk is not ready for 30 seconds.
summary: Container is not ready for too long.
alertname: KubeContainerNotReady
container: ubuntu
endpoint: http
instance: 10.233.74.200:8080
job: kube-state-metrics
pod: test-ubuntu-5579c5f49c-sjlrk
prometheus: prometheus/prometheus-kube-prometheus-prometheus
service: prometheus-kube-state-metrics
severity: warning
uid: aba2b39c-a4b1-4d02-a532-4ca39ef8c0da

Here's my config:

config:
  global:
    resolve_timeout: 5m
  route:
    group_by: ['alertname']
    group_interval: 30s
    repeat_interval: 24h
    group_wait: 30s
    receiver: 'prometheus-msteams'
  receivers:
    - name: 'prometheus-msteams'
      webhook_configs:
        # https://prometheus.io/docs/alerting/configuration/#webhook_config
        - send_resolved: true
          url: "http://prometheus-msteams:2000/prometheus-msteams"

Now, I know I can just group them by pod or some other labels, or even turn off grouping, but I want to figure out what exactly happens here. Also, I can't figure out what will happen to an alert that has no label by which I am grouping. For example, if I group by pod name, how will alerts without a pod label be treated?
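For what it's worth, as far as I understand Alertmanager's grouping, alerts are grouped by the values of the group_by labels they actually carry; an alert that lacks one of those labels is treated as having an empty value for it, so all alerts missing that label (and matching on the remaining group labels) land in the same group. A sketch of grouping by pod as well, with the label name taken from the alerts above and the timing values from the config above:

route:
  receiver: 'prometheus-msteams'
  # Adding 'pod' means the recreated pod (with its new name) starts its own
  # group instead of being appended to the previous pod's notification.
  group_by: ['alertname', 'pod']
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 24h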
[prometheus-users] An alert fires twice even though an event occurs only once
Hi guys,

I have Prometheus in HA mode - 4 nodes, with Prometheus+Alertmanager on each.
Everything works fine, but very often I experience an issue where an alert fires again even though the event has already been resolved by alertmanager.

Below are logs from an example event (Chrony_Service_Down) recorded by alertmanager:

(1) Jan 10 10:48:48 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:48:48.746Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=Chrony_Service_Down[d8c020a][active]

(2) Jan 10 10:49:19 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:49:19.460Z caller=nflog.go:553 level=debug component=nflog msg="gossiping new entry" entry="entry: datacenter="dc01", exporter="node", fqdn="server.example.com", instance="server.example.com", job="node", monitoring_infra="prometheus-mon", puppet_certname="server.example.com", service="chrony", severity="page"}" receiver: integration:"opsgenie" timestamp: firing_alerts:10151928354614242630 > expires_at: nanos:262824014 > "

(3) Jan 10 10:50:48 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:50:48.745Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=Chrony_Service_Down[d8c020a][resolved]

(4) Jan 10 10:51:29 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:51:29.183Z caller=nflog.go:553 level=debug component=nflog msg="gossiping new entry" entry="entry: datacenter="dc01", exporter="node", fqdn="server.example.com", instance="server.example.com", job="node", monitoring_infra="prometheus-mon", puppet_certname="server.example.com", service="chrony", severity="page"}" receiver: integration:"opsgenie" timestamp: resolved_alerts:10151928354614242630 > expires_at: nanos:897562679 > "

(5) Jan 10 10:51:49 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:51:49.745Z caller=nflog.go:553 level=debug component=nflog msg="gossiping new entry" entry="entry: datacenter="dc01", exporter="node", fqdn="server.example.com", instance="server.example.com", job="node", monitoring_infra="prometheus-mon", puppet_certname="server.example.com", service="chrony", severity="page"}" receiver: integration:"opsgenie" timestamp: firing_alerts:10151928354614242630 > expires_at: nanos:649205670 > "

(6) Jan 10 10:51:59 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:51:59.312Z caller=nflog.go:553 level=debug component=nflog msg="gossiping new entry" entry="entry: datacenter="dc01", exporter="node", fqdn="server.example.com", instance="server.example.com", job="node", monitoring_infra="prometheus-mon", puppet_certname="server.example.com", service="chrony", severity="page"}" receiver: integration:"opsgenie" timestamp: resolved_alerts:10151928354614242630 > expires_at: nanos:137020780 > "

(7) Jan 10 10:54:58 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:54:58.744Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=Chrony_Service_Down[d8c020a][resolved]

#

The interesting part is entry number 5 (Jan 10 10:51:49), where alertmanager fired the alert a second time even though a minute earlier (Jan 10 10:50:48) the alert had been marked as resolved.
Such behavior generates duplicate alerts in our system, which is quite annoying at our scale.

Worth mentioning:
- For test purposes the event is scraped by 4 Prometheus servers (default), but the alert rule is evaluated by only one Prometheus.
- The event occurs only once, so there is no flapping that might cause another alert to fire.

Thanks
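For context on the HA side: the alert_relabel_configs shown earlier in this thread (labeldrop on "^prometheus_server$") is the usual way to let several Prometheus replicas evaluate the same rules without producing distinct alerts - each replica carries an identifying external label, and dropping it just before the alert is sent makes the replicas' copies identical so Alertmanager deduplicates them. A minimal sketch of that pattern; the label name prometheus_server is taken from the config quoted above, while the external_labels placement and the example targets are assumptions:

# prometheus.yml on each replica (sketch)
global:
  external_labels:
    prometheus_server: prometheus-01   # unique per replica (prometheus-02, ... on the others)

alerting:
  alert_relabel_configs:
    # Drop the replica-identifying label so alerts from all replicas look
    # identical to Alertmanager and collapse into a single notification.
    - action: labeldrop
      regex: "^prometheus_server$"
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager1:9093
            - alertmanager2:9093
            - alertmanager3:9093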