[prometheus-users] Re: An alert fires twice even though an event occurs only once

2023-01-13 Thread LukaszSz
Interesting. It seems that the alertmanagers are spread over 3 different
regions (2x Asia, 2x USA, 4x Europe).
Maybe there is some latency problem between them, such as latency in gossip
messages?
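
For reference, cross-region latency interacts directly with Alertmanager's
cluster timing: each instance waits roughly its peer position multiplied by
--cluster.peer-timeout before sending a notification, and only stays silent
if the notification-log entry has already been gossiped to it. A rough
sketch of the relevant flags and their usual defaults (worth verifying
against the running version):

  --cluster.listen-address=0.0.0.0:9094
  --cluster.peer=<other-alertmanager>:9094   (repeated once per peer)
  --cluster.peer-timeout=15s                 (per-position wait before an instance notifies on its own)
  --cluster.gossip-interval=200ms            (how often gossip messages are propagated)
  --cluster.pushpull-interval=1m0s           (full state synchronisation interval)

If WAN latency or packet loss delays that gossip beyond the wait window, a
second instance sends its own notification, which looks exactly like the
duplicate described in this thread.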

On Friday, January 13, 2023 at 3:28:57 PM UTC+1 Brian Candler wrote:

> That's a lot of alertmanagers.  Are they all fully meshed?  (But I'd say 2 
> or 3 would be better - spread over different regions)
>
> On Friday, 13 January 2023 at 14:16:27 UTC LukaszSz wrote:
>
>> Yes. The Prometheus server is configured to communicate with all
>> alertmanagers (sorry, there are 8 alertmanagers):
>>
>> alerting:
>>   alert_relabel_configs:
>>   - action: labeldrop
>>     regex: "^prometheus_server$"
>>   alertmanagers:
>>   - static_configs:
>>     - targets:
>>       - alertmanager1:9093
>>       - alertmanager2:9093
>>       - alertmanager3:9093
>>       - alertmanager4:9093
>>       - alertmanager5:9093
>>       - alertmanager6:9093
>>       - alertmanager7:9093
>>       - alertmanager8:9093
>>
>> On Friday, January 13, 2023 at 2:02:14 PM UTC+1 Brian Candler wrote:
>>
>>> Yes, but have you configured the prometheus (the one which has alerting 
>>> rules) to have all four alertmanagers as its destination?
>>>
>>> On Friday, 13 January 2023 at 12:55:49 UTC LukaszSz wrote:
>>>
 Yes, Brian. As I mentioned in my post, the Alertmanagers are in a cluster
 and this event is visible on my 4 alertmanagers.
 The problem I described is that alerts are firing twice, which generates
 duplication.

 On Friday, January 13, 2023 at 1:34:52 PM UTC+1 Brian Candler wrote:

> Are the alertmanagers clustered?  Then you should configure prometheus 
> to deliver the alert to *all* alertmanagers.
>
> On Friday, 13 January 2023 at 11:08:37 UTC LukaszSz wrote:
>
>> Hi guys,
>>
>> I have Prometheus in HA mode - 4 nodes - Prometheus+Alertmanager on each.
>> Everything works fine, but very often I see an alert firing again even
>> though the event has already been resolved by Alertmanager.
>>
>> Below are logs from an example event (Chrony_Service_Down) recorded by
>> Alertmanager:
>>
>>
>> 
>> (1) Jan 10 10:48:48 prometheus-01 alertmanager[1213219]: 
>> ts=2023-01-10T10:48:48.746Z caller=dispatch.go:165 level=debug 
>> component=dispatcher msg="Received alert" 
>> alert=Chrony_Service_Down[d8c020a][active]
>>
>> (2) Jan 10 10:49:19 prometheus-01 alertmanager[1213219]: 
>> ts=2023-01-10T10:49:19.460Z caller=nflog.go:553 level=debug 
>> component=nflog 
>> msg="gossiping new entry" 
>> entry="entry:>  
>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>> server.example.com\\\", instance=\\\"server.example.com\\\", 
>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
>> severity=\\\"page\\\"}\" receiver:> integration:\"opsgenie\" > timestamp:> > 
>> firing_alerts:10151928354614242630 > expires_at:> nanos:262824014 > "
>>
>> (3) Jan 10 10:50:48 prometheus-01 alertmanager[1213219]: 
>> ts=2023-01-10T10:50:48.745Z caller=dispatch.go:165 level=debug 
>> component=dispatcher msg="Received alert" 
>> alert=Chrony_Service_Down[d8c020a][resolved]
>>
>> (4) Jan 10 10:51:29 prometheus-01 alertmanager[1213219]: 
>> ts=2023-01-10T10:51:29.183Z caller=nflog.go:553 level=debug 
>> component=nflog 
>> msg="gossiping new entry" 
>> entry="entry:>  
>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>> server.example.com\\\", instance=\\\"server.example.com\\\", 
>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
>> severity=\\\"page\\\"}\" receiver:> integration:\"opsgenie\" > timestamp:> > 
>> resolved_alerts:10151928354614242630 > expires_at:> nanos:897562679 > "
>>
>> (5) Jan 10 10:51:49 prometheus-01 alertmanager[1213219]: 
>> ts=2023-01-10T10:51:49.745Z caller=nflog.go:553 level=debug 
>> component=nflog 
>> msg="gossiping new entry" 
>> entry="entry:>  
>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>> server.example.com\\\", instance=\\\"server.example.com\\\", 
>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
>> severity=\\\"page\\\"}\" receiver:> integration:\"opsgenie\" > timestamp:> > 
>> firing_alerts:10151928354614242630 > expires_at:> nanos:649205670 > "
>>
>> (6) Jan 10 10:51:59 prometheus-01 alertmanager[1213219]: 
>> ts=2023-01-10T10:51:59.312Z caller=nflog.go:553 level=debug 
>> component=nflog 
>> msg="gossiping new 

[prometheus-users] Re: An alert fires twice even though an event occurs only once

2023-01-13 Thread Brian Candler
That's a lot of alertmanagers.  Are they all fully meshed?  (But I'd say 2 
or 3 would be better - spread over different regions)
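
One quick way to verify the mesh, assuming the default HTTP API is enabled,
is to ask each instance for its own view of the cluster and compare the
peer lists, for example:

  curl -s http://alertmanager1:9093/api/v2/status

The response contains a cluster section with a status field and the list of
peers; every instance should report the same, complete set of peers. The
alertmanager_cluster_members metric exposes the same count if you would
rather graph it.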

On Friday, 13 January 2023 at 14:16:27 UTC LukaszSz wrote:

> Yes. The Prometheus server is configured to communicate with all
> alertmanagers (sorry, there are 8 alertmanagers):
>
> alerting:
>   alert_relabel_configs:
>   - action: labeldrop
>     regex: "^prometheus_server$"
>   alertmanagers:
>   - static_configs:
>     - targets:
>       - alertmanager1:9093
>       - alertmanager2:9093
>       - alertmanager3:9093
>       - alertmanager4:9093
>       - alertmanager5:9093
>       - alertmanager6:9093
>       - alertmanager7:9093
>       - alertmanager8:9093
>
> On Friday, January 13, 2023 at 2:02:14 PM UTC+1 Brian Candler wrote:
>
>> Yes, but have you configured the prometheus (the one which has alerting 
>> rules) to have all four alertmanagers as its destination?
>>
>> On Friday, 13 January 2023 at 12:55:49 UTC LukaszSz wrote:
>>
>>> Yes, Brian. As I mentioned in my post, the Alertmanagers are in a cluster and
>>> this event is visible on my 4 alertmanagers.
>>> The problem I described is that alerts are firing twice, which generates
>>> duplication.
>>>
>>> On Friday, January 13, 2023 at 1:34:52 PM UTC+1 Brian Candler wrote:
>>>
 Are the alertmanagers clustered?  Then you should configure prometheus 
 to deliver the alert to *all* alertmanagers.

 On Friday, 13 January 2023 at 11:08:37 UTC LukaszSz wrote:

> Hi guys,
>
> I have Prometheus in HA mode - 4 nodes - Prometheus+Alertmanager on each.
> Everything works fine, but very often I see an alert firing again even
> though the event has already been resolved by Alertmanager.
>
> Below are logs from an example event (Chrony_Service_Down) recorded by
> Alertmanager:
>
>
> 
> (1) Jan 10 10:48:48 prometheus-01 alertmanager[1213219]: 
> ts=2023-01-10T10:48:48.746Z caller=dispatch.go:165 level=debug 
> component=dispatcher msg="Received alert" 
> alert=Chrony_Service_Down[d8c020a][active]
>
> (2) Jan 10 10:49:19 prometheus-01 alertmanager[1213219]: 
> ts=2023-01-10T10:49:19.460Z caller=nflog.go:553 level=debug 
> component=nflog 
> msg="gossiping new entry" 
> entry="entry:  
> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
> server.example.com\\\", instance=\\\"server.example.com\\\", 
> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
> severity=\\\"page\\\"}\" receiver: integration:\"opsgenie\" > timestamp: > 
> firing_alerts:10151928354614242630 > expires_at: nanos:262824014 > "
>
> (3) Jan 10 10:50:48 prometheus-01 alertmanager[1213219]: 
> ts=2023-01-10T10:50:48.745Z caller=dispatch.go:165 level=debug 
> component=dispatcher msg="Received alert" 
> alert=Chrony_Service_Down[d8c020a][resolved]
>
> (4) Jan 10 10:51:29 prometheus-01 alertmanager[1213219]: 
> ts=2023-01-10T10:51:29.183Z caller=nflog.go:553 level=debug 
> component=nflog 
> msg="gossiping new entry" 
> entry="entry:  
> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
> server.example.com\\\", instance=\\\"server.example.com\\\", 
> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
> severity=\\\"page\\\"}\" receiver: integration:\"opsgenie\" > timestamp: > 
> resolved_alerts:10151928354614242630 > expires_at: nanos:897562679 > "
>
> (5) Jan 10 10:51:49 prometheus-01 alertmanager[1213219]: 
> ts=2023-01-10T10:51:49.745Z caller=nflog.go:553 level=debug 
> component=nflog 
> msg="gossiping new entry" 
> entry="entry:  
> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
> server.example.com\\\", instance=\\\"server.example.com\\\", 
> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
> severity=\\\"page\\\"}\" receiver: integration:\"opsgenie\" > timestamp: > 
> firing_alerts:10151928354614242630 > expires_at: nanos:649205670 > "
>
> (6) Jan 10 10:51:59 prometheus-01 alertmanager[1213219]: 
> ts=2023-01-10T10:51:59.312Z caller=nflog.go:553 level=debug 
> component=nflog 
> msg="gossiping new entry" 
> entry="entry:  
> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
> server.example.com\\\", instance=\\\"server.example.com\\\", 
> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
> severity=\\\"page\\\"}\" receiver: integration:\"opsgenie\" > 

[prometheus-users] Re: An alert fires twice even though an event occurs only once

2023-01-13 Thread LukaszSz
Yes. The Prometheus server is configured to communicate with all
alertmanagers (sorry, there are 8 alertmanagers):

alerting:
  alert_relabel_configs:
  - action: labeldrop
    regex: "^prometheus_server$"
  alertmanagers:
  - static_configs:
    - targets:
      - alertmanager1:9093
      - alertmanager2:9093
      - alertmanager3:9093
      - alertmanager4:9093
      - alertmanager5:9093
      - alertmanager6:9093
      - alertmanager7:9093
      - alertmanager8:9093
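
For context, a sketch of the pattern the labeldrop above belongs to (the
label name is taken from that rule; the value shown is hypothetical): each
HA Prometheus replica stamps its alerts with a distinct prometheus_server
external label, and the relabel rule strips it again so that the copies
arriving from all replicas look identical to Alertmanager and deduplicate:

global:
  external_labels:
    prometheus_server: prometheus-01   # hypothetical; must differ on every replica

alerting:
  alert_relabel_configs:
  - action: labeldrop                  # drop the per-replica label before sending
    regex: "^prometheus_server$"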

On Friday, January 13, 2023 at 2:02:14 PM UTC+1 Brian Candler wrote:

> Yes, but have you configured the prometheus (the one which has alerting 
> rules) to have all four alertmanagers as its destination?
>
> On Friday, 13 January 2023 at 12:55:49 UTC LukaszSz wrote:
>
>> Yes, Brian. As I mentioned in my post, the Alertmanagers are in a cluster and
>> this event is visible on my 4 alertmanagers.
>> The problem I described is that alerts are firing twice, which generates
>> duplication.
>>
>> On Friday, January 13, 2023 at 1:34:52 PM UTC+1 Brian Candler wrote:
>>
>>> Are the alertmanagers clustered?  Then you should configure prometheus 
>>> to deliver the alert to *all* alertmanagers.
>>>
>>> On Friday, 13 January 2023 at 11:08:37 UTC LukaszSz wrote:
>>>
 Hi guys,

 I have Prometheus in HA mode - 4 nodes - Prometheus+Alertmanager on each.
 Everything works fine, but very often I see an alert firing again even
 though the event has already been resolved by Alertmanager.

 Below are logs from an example event (Chrony_Service_Down) recorded by
 Alertmanager:


 
 (1) Jan 10 10:48:48 prometheus-01 alertmanager[1213219]: 
 ts=2023-01-10T10:48:48.746Z caller=dispatch.go:165 level=debug 
 component=dispatcher msg="Received alert" 
 alert=Chrony_Service_Down[d8c020a][active]

 (2) Jan 10 10:49:19 prometheus-01 alertmanager[1213219]: 
 ts=2023-01-10T10:49:19.460Z caller=nflog.go:553 level=debug 
 component=nflog 
 msg="gossiping new entry" 
 entry="entry:>>>  
 datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
 server.example.com\\\", instance=\\\"server.example.com\\\", 
 job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
 puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
 severity=\\\"page\\\"}\" receiver:>>> integration:\"opsgenie\" > timestamp: 
 firing_alerts:10151928354614242630 > expires_at:>>> nanos:262824014 > "

 (3) Jan 10 10:50:48 prometheus-01 alertmanager[1213219]: 
 ts=2023-01-10T10:50:48.745Z caller=dispatch.go:165 level=debug 
 component=dispatcher msg="Received alert" 
 alert=Chrony_Service_Down[d8c020a][resolved]

 (4) Jan 10 10:51:29 prometheus-01 alertmanager[1213219]: 
 ts=2023-01-10T10:51:29.183Z caller=nflog.go:553 level=debug 
 component=nflog 
 msg="gossiping new entry" 
 entry="entry:>>>  
 datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
 server.example.com\\\", instance=\\\"server.example.com\\\", 
 job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
 puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
 severity=\\\"page\\\"}\" receiver:>>> integration:\"opsgenie\" > timestamp: 
 resolved_alerts:10151928354614242630 > expires_at:>>> nanos:897562679 > "

 (5) Jan 10 10:51:49 prometheus-01 alertmanager[1213219]: 
 ts=2023-01-10T10:51:49.745Z caller=nflog.go:553 level=debug 
 component=nflog 
 msg="gossiping new entry" 
 entry="entry:>>>  
 datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
 server.example.com\\\", instance=\\\"server.example.com\\\", 
 job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
 puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
 severity=\\\"page\\\"}\" receiver:>>> integration:\"opsgenie\" > timestamp: 
 firing_alerts:10151928354614242630 > expires_at:>>> nanos:649205670 > "

 (6) Jan 10 10:51:59 prometheus-01 alertmanager[1213219]: 
 ts=2023-01-10T10:51:59.312Z caller=nflog.go:553 level=debug 
 component=nflog 
 msg="gossiping new entry" 
 entry="entry:>>>  
 datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
 server.example.com\\\", instance=\\\"server.example.com\\\", 
 job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
 puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
 severity=\\\"page\\\"}\" receiver:>>> integration:\"opsgenie\" > timestamp: 
 resolved_alerts:10151928354614242630 > expires_at:>>> nanos:137020780 > "

 (7) Jan 10 10:54:58 prometheus-01 alertmanager[1213219]: 
 ts=2023-01-10T10:54:58.744Z caller=dispatch.go:165 level=debug 
 component=dispatcher msg="Received alert" 
 alert=Chrony_Service_Down[d8c020a][resolved]

 

[prometheus-users] Re: An alert fires twice even though an event occurs only once

2023-01-13 Thread Brian Candler
Yes, but have you configured the prometheus (the one which has alerting 
rules) to have all four alertmanagers as its destination?

On Friday, 13 January 2023 at 12:55:49 UTC LukaszSz wrote:

> Yes, Brian. As I mentioned in my post, the Alertmanagers are in a cluster and
> this event is visible on my 4 alertmanagers.
> The problem I described is that alerts are firing twice, which generates
> duplication.
>
> On Friday, January 13, 2023 at 1:34:52 PM UTC+1 Brian Candler wrote:
>
>> Are the alertmanagers clustered?  Then you should configure prometheus to 
>> deliver the alert to *all* alertmanagers.
>>
>> On Friday, 13 January 2023 at 11:08:37 UTC LukaszSz wrote:
>>
>>> Hi guys,
>>>
>>> I have Prometheus in HA mode - 4 nodes - Prometheus+Alertmanager on each.
>>> Everything works fine, but very often I see an alert firing again even
>>> though the event has already been resolved by Alertmanager.
>>>
>>> Below are logs from an example event (Chrony_Service_Down) recorded by
>>> Alertmanager:
>>>
>>>
>>> 
>>> (1) Jan 10 10:48:48 prometheus-01 alertmanager[1213219]: 
>>> ts=2023-01-10T10:48:48.746Z caller=dispatch.go:165 level=debug 
>>> component=dispatcher msg="Received alert" 
>>> alert=Chrony_Service_Down[d8c020a][active]
>>>
>>> (2) Jan 10 10:49:19 prometheus-01 alertmanager[1213219]: 
>>> ts=2023-01-10T10:49:19.460Z caller=nflog.go:553 level=debug component=nflog 
>>> msg="gossiping new entry" 
>>> entry="entry:>>  
>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>> server.example.com\\\", instance=\\\"server.example.com\\\", 
>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
>>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
>>> severity=\\\"page\\\"}\" receiver:>> integration:\"opsgenie\" > timestamp: 
>>> firing_alerts:10151928354614242630 > expires_at:>> nanos:262824014 > "
>>>
>>> (3) Jan 10 10:50:48 prometheus-01 alertmanager[1213219]: 
>>> ts=2023-01-10T10:50:48.745Z caller=dispatch.go:165 level=debug 
>>> component=dispatcher msg="Received alert" 
>>> alert=Chrony_Service_Down[d8c020a][resolved]
>>>
>>> (4) Jan 10 10:51:29 prometheus-01 alertmanager[1213219]: 
>>> ts=2023-01-10T10:51:29.183Z caller=nflog.go:553 level=debug component=nflog 
>>> msg="gossiping new entry" 
>>> entry="entry:>>  
>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>> server.example.com\\\", instance=\\\"server.example.com\\\", 
>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
>>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
>>> severity=\\\"page\\\"}\" receiver:>> integration:\"opsgenie\" > timestamp: 
>>> resolved_alerts:10151928354614242630 > expires_at:>> nanos:897562679 > "
>>>
>>> (5) Jan 10 10:51:49 prometheus-01 alertmanager[1213219]: 
>>> ts=2023-01-10T10:51:49.745Z caller=nflog.go:553 level=debug component=nflog 
>>> msg="gossiping new entry" 
>>> entry="entry:>>  
>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>> server.example.com\\\", instance=\\\"server.example.com\\\", 
>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
>>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
>>> severity=\\\"page\\\"}\" receiver:>> integration:\"opsgenie\" > timestamp: 
>>> firing_alerts:10151928354614242630 > expires_at:>> nanos:649205670 > "
>>>
>>> (6) Jan 10 10:51:59 prometheus-01 alertmanager[1213219]: 
>>> ts=2023-01-10T10:51:59.312Z caller=nflog.go:553 level=debug component=nflog 
>>> msg="gossiping new entry" 
>>> entry="entry:>>  
>>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>>> server.example.com\\\", instance=\\\"server.example.com\\\", 
>>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
>>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
>>> severity=\\\"page\\\"}\" receiver:>> integration:\"opsgenie\" > timestamp: 
>>> resolved_alerts:10151928354614242630 > expires_at:>> nanos:137020780 > "
>>>
>>> (7) Jan 10 10:54:58 prometheus-01 alertmanager[1213219]: 
>>> ts=2023-01-10T10:54:58.744Z caller=dispatch.go:165 level=debug 
>>> component=dispatcher msg="Received alert" 
>>> alert=Chrony_Service_Down[d8c020a][resolved]
>>>
>>> #
>>>
>>> The interesting part is entry number 5 (Jan 10 10:51:49), where Alertmanager
>>> fired the alert a second time even though a minute earlier (Jan 10 10:50:48)
>>> the alert had already been marked as resolved.
>>> Such behavior generates duplicate alerts in our system, which is quite
>>> annoying at our scale.
>>>
>>> Worth mentioning:
>>> - For test purposes the event is scraped by 4 Prometheus servers (the
>>> default), but the alert rule is evaluated by only one Prometheus.
>>> - The event occurs only once, so there is no flapping that might cause
>>> another alert to fire.
>>>
>>> Thanks
>>>
>>


[prometheus-users] Re: An alert fires twice even though an event occurs only once

2023-01-13 Thread LukaszSz
Yes, Brian. As I mentioned in my post, the Alertmanagers are in a cluster and
this event is visible on my 4 alertmanagers.
The problem I described is that alerts are firing twice, which generates
duplication.

On Friday, January 13, 2023 at 1:34:52 PM UTC+1 Brian Candler wrote:

> Are the alertmanagers clustered?  Then you should configure prometheus to 
> deliver the alert to *all* alertmanagers.
>
> On Friday, 13 January 2023 at 11:08:37 UTC LukaszSz wrote:
>
>> Hi guys,
>>
>> I have Prometheus in HA mode - 4 nodes - Prometheus+Alertmanager on each.
>> Everything works fine, but very often I see an alert firing again even
>> though the event has already been resolved by Alertmanager.
>>
>> Below are logs from an example event (Chrony_Service_Down) recorded by
>> Alertmanager:
>>
>>
>> 
>> (1) Jan 10 10:48:48 prometheus-01 alertmanager[1213219]: 
>> ts=2023-01-10T10:48:48.746Z caller=dispatch.go:165 level=debug 
>> component=dispatcher msg="Received alert" 
>> alert=Chrony_Service_Down[d8c020a][active]
>>
>> (2) Jan 10 10:49:19 prometheus-01 alertmanager[1213219]: 
>> ts=2023-01-10T10:49:19.460Z caller=nflog.go:553 level=debug component=nflog 
>> msg="gossiping new entry" 
>> entry="entry:>  
>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>> server.example.com\\\", instance=\\\"server.example.com\\\", 
>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
>> severity=\\\"page\\\"}\" receiver:> integration:\"opsgenie\" > timestamp: 
>> firing_alerts:10151928354614242630 > expires_at:> nanos:262824014 > "
>>
>> (3) Jan 10 10:50:48 prometheus-01 alertmanager[1213219]: 
>> ts=2023-01-10T10:50:48.745Z caller=dispatch.go:165 level=debug 
>> component=dispatcher msg="Received alert" 
>> alert=Chrony_Service_Down[d8c020a][resolved]
>>
>> (4) Jan 10 10:51:29 prometheus-01 alertmanager[1213219]: 
>> ts=2023-01-10T10:51:29.183Z caller=nflog.go:553 level=debug component=nflog 
>> msg="gossiping new entry" 
>> entry="entry:>  
>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>> server.example.com\\\", instance=\\\"server.example.com\\\", 
>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
>> severity=\\\"page\\\"}\" receiver:> integration:\"opsgenie\" > timestamp: 
>> resolved_alerts:10151928354614242630 > expires_at:> nanos:897562679 > "
>>
>> (5) Jan 10 10:51:49 prometheus-01 alertmanager[1213219]: 
>> ts=2023-01-10T10:51:49.745Z caller=nflog.go:553 level=debug component=nflog 
>> msg="gossiping new entry" 
>> entry="entry:>  
>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>> server.example.com\\\", instance=\\\"server.example.com\\\", 
>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
>> severity=\\\"page\\\"}\" receiver:> integration:\"opsgenie\" > timestamp: 
>> firing_alerts:10151928354614242630 > expires_at:> nanos:649205670 > "
>>
>> (6) Jan 10 10:51:59 prometheus-01 alertmanager[1213219]: 
>> ts=2023-01-10T10:51:59.312Z caller=nflog.go:553 level=debug component=nflog 
>> msg="gossiping new entry" 
>> entry="entry:>  
>> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
>> server.example.com\\\", instance=\\\"server.example.com\\\", 
>> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
>> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
>> severity=\\\"page\\\"}\" receiver:> integration:\"opsgenie\" > timestamp: 
>> resolved_alerts:10151928354614242630 > expires_at:> nanos:137020780 > "
>>
>> (7) Jan 10 10:54:58 prometheus-01 alertmanager[1213219]: 
>> ts=2023-01-10T10:54:58.744Z caller=dispatch.go:165 level=debug 
>> component=dispatcher msg="Received alert" 
>> alert=Chrony_Service_Down[d8c020a][resolved]
>>
>> #
>>
>> The interesting part is entry number 5 (Jan 10 10:51:49), where Alertmanager
>> fired the alert a second time even though a minute earlier (Jan 10 10:50:48)
>> the alert had already been marked as resolved.
>> Such behavior generates duplicate alerts in our system, which is quite
>> annoying at our scale.
>>
>> Worth mentioning:
>> - For test purposes the event is scraped by 4 Prometheus servers (the
>> default), but the alert rule is evaluated by only one Prometheus.
>> - The event occurs only once, so there is no flapping that might cause
>> another alert to fire.
>>
>> Thanks
>>
>


[prometheus-users] Re: An alert fires twice even though an event occurs only once

2023-01-13 Thread Brian Candler
Are the alertmanagers clustered?  Then you should configure prometheus to 
deliver the alert to *all* alertmanagers.

On Friday, 13 January 2023 at 11:08:37 UTC LukaszSz wrote:

> Hi guys,
>
> I have Prometheus in HA mode - 4 nodes - Prometheus+Alertmanager on each.
> Everything works fine, but very often I see an alert firing again even
> though the event has already been resolved by Alertmanager.
>
> Below are logs from an example event (Chrony_Service_Down) recorded by
> Alertmanager:
>
>
> 
> (1) Jan 10 10:48:48 prometheus-01 alertmanager[1213219]: 
> ts=2023-01-10T10:48:48.746Z caller=dispatch.go:165 level=debug 
> component=dispatcher msg="Received alert" 
> alert=Chrony_Service_Down[d8c020a][active]
>
> (2) Jan 10 10:49:19 prometheus-01 alertmanager[1213219]: 
> ts=2023-01-10T10:49:19.460Z caller=nflog.go:553 level=debug component=nflog 
> msg="gossiping new entry" 
> entry="entry:  
> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
> server.example.com\\\", instance=\\\"server.example.com\\\", 
> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
> severity=\\\"page\\\"}\" receiver: integration:\"opsgenie\" > timestamp: 
> firing_alerts:10151928354614242630 > expires_at: nanos:262824014 > "
>
> (3) Jan 10 10:50:48 prometheus-01 alertmanager[1213219]: 
> ts=2023-01-10T10:50:48.745Z caller=dispatch.go:165 level=debug 
> component=dispatcher msg="Received alert" 
> alert=Chrony_Service_Down[d8c020a][resolved]
>
> (4) Jan 10 10:51:29 prometheus-01 alertmanager[1213219]: 
> ts=2023-01-10T10:51:29.183Z caller=nflog.go:553 level=debug component=nflog 
> msg="gossiping new entry" 
> entry="entry:  
> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
> server.example.com\\\", instance=\\\"server.example.com\\\", 
> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
> severity=\\\"page\\\"}\" receiver: integration:\"opsgenie\" > timestamp: 
> resolved_alerts:10151928354614242630 > expires_at: nanos:897562679 > "
>
> (5) Jan 10 10:51:49 prometheus-01 alertmanager[1213219]: 
> ts=2023-01-10T10:51:49.745Z caller=nflog.go:553 level=debug component=nflog 
> msg="gossiping new entry" 
> entry="entry:  
> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
> server.example.com\\\", instance=\\\"server.example.com\\\", 
> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
> severity=\\\"page\\\"}\" receiver: integration:\"opsgenie\" > timestamp: 
> firing_alerts:10151928354614242630 > expires_at: nanos:649205670 > "
>
> (6) Jan 10 10:51:59 prometheus-01 alertmanager[1213219]: 
> ts=2023-01-10T10:51:59.312Z caller=nflog.go:553 level=debug component=nflog 
> msg="gossiping new entry" 
> entry="entry:  
> datacenter=\\\"dc01\\\", exporter=\\\"node\\\", fqdn=\\\"
> server.example.com\\\", instance=\\\"server.example.com\\\", 
> job=\\\"node\\\", monitoring_infra=\\\"prometheus-mon\\\", 
> puppet_certname=\\\"server.example.com\\\", service=\\\"chrony\\\", 
> severity=\\\"page\\\"}\" receiver: integration:\"opsgenie\" > timestamp: 
> resolved_alerts:10151928354614242630 > expires_at: nanos:137020780 > "
>
> (7) Jan 10 10:54:58 prometheus-01 alertmanager[1213219]: 
> ts=2023-01-10T10:54:58.744Z caller=dispatch.go:165 level=debug 
> component=dispatcher msg="Received alert" 
> alert=Chrony_Service_Down[d8c020a][resolved]
>
> #
>
> The interesting part is entry number 5 (Jan 10 10:51:49), where Alertmanager
> fired the alert a second time even though a minute earlier (Jan 10 10:50:48)
> the alert had already been marked as resolved.
> Such behavior generates duplicate alerts in our system, which is quite
> annoying at our scale.
>
> Worth mentioning:
> - For test purposes the event is scraped by 4 Prometheus servers (the
> default), but the alert rule is evaluated by only one Prometheus.
> - The event occurs only once, so there is no flapping that might cause
> another alert to fire.
>
> Thanks
>



Re: [prometheus-users] AlertManager rules examples

2023-01-13 Thread Stuart Clark

On 11/01/2023 19:58, Eulogio Apelin wrote:
I'm looking for information, primarily examples, of various ways to 
configure alert rules.


Specifically, scenarios like:

In a single rule group:
Regular expression that determines a TLS cert expires in 60 days, send 1 alert
Regular expression that determines a TLS cert expires in 40 days, send 1 alert
Regular expression that determines a TLS cert expires in 30 days, send 1 alert
Regular expression that determines a TLS cert expires in 20 days, send 1 alert
Regular expression that determines a TLS cert expires in 10 days, send 1 alert
Regular expression that determines a TLS cert expires in 5 days, send 1 alert
Regular expression that determines a TLS cert expires in 0 days, send 1 alert


Another scenario is to:
- send an alert once a day to an email address;
- if it's the 3rd day in a row, send the alert to another set of
addresses, and stop alerting.


Can Alertmanager send alerts to Microsoft Teams like it does Slack?

And any other general examples of Alertmanager rules.

I think it is best not to think of alerts as moment-in-time events but
as a time period during which a certain condition is true. Separate from
the actual alert firing are the rules (in Alertmanager) for how to route
it (e.g. to Slack, email, etc.), what to send (email body template) and
how often to remind people that the alert is still happening.


So, for your TLS expiry example, you might have an alert which starts
firing once a certificate is within 60 days of expiry. It would continue
to fire until either the certificate is renewed (i.e. it is more than 60
days from expiry again) or it stops existing (because you've reconfigured
Prometheus to no longer monitor that certificate). Then within
Alertmanager you can set rules to send you a message every 10 days while
that alert is firing, meaning you'd get a message at 60, 50, 40, etc.
days until expiry.
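
A minimal sketch of that pattern, assuming the expiry timestamp is exposed
via blackbox_exporter's probe_ssl_earliest_cert_expiry metric (any metric
holding the expiry time works the same way):

groups:
- name: tls
  rules:
  - alert: TLSCertExpiringSoon
    expr: (probe_ssl_earliest_cert_expiry - time()) / 86400 < 60
    for: 1h
    labels:
      severity: warning
    annotations:
      summary: "TLS certificate on {{ $labels.instance }} expires in under 60 days"

and, on the Alertmanager side, a route fragment that repeats the
notification every 10 days while the alert keeps firing (the receiver name
is made up for the example):

route:
  routes:
  - matchers:
    - alertname = "TLSCertExpiringSoon"
    receiver: email
    repeat_interval: 240h   # 10 days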


More complex alert routing decisions are generally out of scope for
Alertmanager and would be expected to be managed by a more capable
system (such as PagerDuty, OpsGenie, Grafana OnCall, etc.). This would
cover your example of wanting to escalate an alert after a period of
time, but would also cover things like on-call rotas where different
people are contacted based on a rota calendar.


--
Stuart Clark



[prometheus-users] AlertManager rules examples

2023-01-13 Thread Eulogio Apelin
I'm looking for information, primarily examples, of various ways to 
configure alert rules.

Specifically, scenarios like:

In a single rule group:
Regular expression that determines a TLS cert expires in 60 days, send 1 alert
Regular expression that determines a TLS cert expires in 40 days, send 1 alert
Regular expression that determines a TLS cert expires in 30 days, send 1 alert
Regular expression that determines a TLS cert expires in 20 days, send 1 alert
Regular expression that determines a TLS cert expires in 10 days, send 1 alert
Regular expression that determines a TLS cert expires in 5 days, send 1 alert
Regular expression that determines a TLS cert expires in 0 days, send 1 alert

Another scenario is to:
- send an alert once a day to an email address;
- if it's the 3rd day in a row, send the alert to another set of
addresses, and stop alerting.

Can Alertmanager send alerts to Microsoft Teams like it does Slack?

And any other general examples of Alertmanager rules.

Thanks!
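
On the Teams question: Alertmanager (in the versions current at the time of
writing) has no built-in Microsoft Teams receiver the way it has Slack, but
the usual workaround is a small webhook bridge such as prometheus-msteams.
A hedged sketch of the receiver side, with the bridge URL and receiver name
assumed:

receivers:
- name: 'msteams'
  webhook_configs:
  - send_resolved: true
    url: "http://prometheus-msteams:2000/prometheus-msteams"   # the bridge, not Teams itself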



[prometheus-users] the query text is cut off within postgres_exporter

2023-01-13 Thread Markus Zwettler


The query text is cut off within postgres_exporter / Grafana.

This makes it hard or even impossible to inspect suspect queries.

Question: is there any way to get the whole query text from there?
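
If the metric behind that panel is built from pg_stat_activity (or an older
pg_stat_statements setup), the truncation usually happens in PostgreSQL
itself rather than in postgres_exporter: the server only keeps the first
track_activity_query_size bytes of each query text, 1024 by default. A
possible fix, assuming that is the cause, is to raise the limit in
postgresql.conf and restart the server:

  track_activity_query_size = 16384   # bytes of query text kept per backend (default 1024)

It is also worth checking whether the Grafana panel itself shortens long
label values, since the truncation can happen at either end.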



[prometheus-users] Resolved alerts are grouped into firing alerts

2023-01-13 Thread Ivan
I am seeing strange behavior where resolved alerts are sent alongside
firing ones. I have this rule:
kube_pod_container_status_ready{namespace="default"} == 0. What happens:
when a pod is down the alert is sent and everything is fine; then the pod
comes back up and the alert is resolved. But if the pod fails again within
a short period and gets recreated by the deployment with a different name,
the alert fires mentioning both the previous pod and the new one. I also
noticed that if I wait about 20 minutes after the alert is resolved and
kill a pod again, there is only one pod in the alert.

This is expected:
first alert 12:24
Container ubuntu in pod test-ubuntu-5579c5f49c-rsb8v is not ready for 30 
seconds. 

Prometheus Alert (Firing)

  summary: Container is not ready for too long.
  alertname: KubeContainerNotReady
  container: ubuntu
  endpoint: http
  instance: 10.233.74.200:8080
  job: kube-state-metrics
  pod: test-ubuntu-5579c5f49c-rsb8v
  prometheus: prometheus/prometheus-kube-prometheus-prometheus
  service: prometheus-kube-state-metrics
  severity: warning
  uid: 85f61574-2559-4f1a-8a14-f08ee4e34b8a

second alert 12:27

Container ubuntu in pod test-ubuntu-5579c5f49c-rsb8v is not ready for 30 
seconds.

Prometheus Alert (Resolved)

  summary: Container is not ready for too long.
  alertname: KubeContainerNotReady
  container: ubuntu
  endpoint: http
  instance: 10.233.74.200:8080
  job: kube-state-metrics
  pod: test-ubuntu-5579c5f49c-rsb8v
  prometheus: prometheus/prometheus-kube-prometheus-prometheus
  service: prometheus-kube-state-metrics
  severity: warning
  uid: 85f61574-2559-4f1a-8a14-f08ee4e34b8a

Then I kill the pod and this happens (it's a single alert):

third alert: 12:32

Container ubuntu in pod test-ubuntu-5579c5f49c-rsb8v is not ready for 30 
seconds. 12:32

Prometheus Alert (Firing)

  summary: Container is not ready for too long.
  alertname: KubeContainerNotReady
  container: ubuntu
  endpoint: http
  instance: 10.233.74.200:8080
  job: kube-state-metrics
  pod: test-ubuntu-5579c5f49c-rsb8v
  prometheus: prometheus/prometheus-kube-prometheus-prometheus
  service: prometheus-kube-state-metrics
  severity: warning
  uid: 85f61574-2559-4f1a-8a14-f08ee4e34b8a

Container ubuntu in pod test-ubuntu-5579c5f49c-sjlrk is not ready for 30 
seconds.

  summary: Container is not ready for too long.
  alertname: KubeContainerNotReady
  container: ubuntu
  endpoint: http
  instance: 10.233.74.200:8080
  job: kube-state-metrics
  pod: test-ubuntu-5579c5f49c-sjlrk
  prometheus: prometheus/prometheus-kube-prometheus-prometheus
  service: prometheus-kube-state-metrics
  severity: warning
  uid: aba2b39c-a4b1-4d02-a532-4ca39ef8c0da

here's my config:

  config:
    global:
      resolve_timeout: 5m
    route:
      group_by: ['alertname']
      group_interval: 30s
      repeat_interval: 24h
      group_wait: 30s
      receiver: 'prometheus-msteams'
    receivers:
    - name: 'prometheus-msteams'
      webhook_configs: # https://prometheus.io/docs/alerting/configuration/#webhook_config
      - send_resolved: true
        url: "http://prometheus-msteams:2000/prometheus-msteams"

Now, I know I can just group them by pod or some other labels, or even
turn off grouping, but I want to figure out what exactly happens here. I
also can't figure out what will happen to an alert that has none of the
labels I am grouping by. For example, if I group by pod name, how will
alerts without a pod label be treated?
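
For what it is worth, what you describe matches how grouping works: with
group_by: ['alertname'] both pods' alerts land in the same group, and every
notification for that group carries all alerts currently in it, firing and
recently resolved alike (you have send_resolved: true), until the resolved
entries age out of Alertmanager. A sketch of per-pod grouping, with the
label name taken from the alerts above:

route:
  group_by: ['alertname', 'pod']
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 24h
  receiver: 'prometheus-msteams'

As for alerts that are missing one of the group_by labels: grouping treats
the missing label as empty, so all alerts without a pod label end up
together in their own group.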



[prometheus-users] An alert fires twice even though an event occurs only once

2023-01-13 Thread LukaszSz
Hi guys,

I have Prometheus in HA mode - 4 nodes - Prometheus+Alertmanager on each.
Everything works fine, but very often I see an alert firing again even
though the event has already been resolved by Alertmanager.

Below are logs from an example event (Chrony_Service_Down) recorded by Alertmanager:


(1) Jan 10 10:48:48 prometheus-01 alertmanager[1213219]: 
ts=2023-01-10T10:48:48.746Z caller=dispatch.go:165 level=debug 
component=dispatcher msg="Received alert" 
alert=Chrony_Service_Down[d8c020a][active]

(2) Jan 10 10:49:19 prometheus-01 alertmanager[1213219]: 
ts=2023-01-10T10:49:19.460Z caller=nflog.go:553 level=debug component=nflog 
msg="gossiping new entry" 
entry="entry: timestamp: 
firing_alerts:10151928354614242630 > expires_at: "

(3) Jan 10 10:50:48 prometheus-01 alertmanager[1213219]: 
ts=2023-01-10T10:50:48.745Z caller=dispatch.go:165 level=debug 
component=dispatcher msg="Received alert" 
alert=Chrony_Service_Down[d8c020a][resolved]

(4) Jan 10 10:51:29 prometheus-01 alertmanager[1213219]: 
ts=2023-01-10T10:51:29.183Z caller=nflog.go:553 level=debug component=nflog 
msg="gossiping new entry" 
entry="entry: timestamp: 
resolved_alerts:10151928354614242630 > expires_at: "

(5) Jan 10 10:51:49 prometheus-01 alertmanager[1213219]: 
ts=2023-01-10T10:51:49.745Z caller=nflog.go:553 level=debug component=nflog 
msg="gossiping new entry" 
entry="entry: timestamp: 
firing_alerts:10151928354614242630 > expires_at: "

(6) Jan 10 10:51:59 prometheus-01 alertmanager[1213219]: 
ts=2023-01-10T10:51:59.312Z caller=nflog.go:553 level=debug component=nflog 
msg="gossiping new entry" 
entry="entry: timestamp: 
resolved_alerts:10151928354614242630 > expires_at: "

(7) Jan 10 10:54:58 prometheus-01 alertmanager[1213219]: 
ts=2023-01-10T10:54:58.744Z caller=dispatch.go:165 level=debug 
component=dispatcher msg="Received alert" 
alert=Chrony_Service_Down[d8c020a][resolved]
#

The interesting part is entry number 5 (Jan 10 10:51:49), where Alertmanager
fired the alert a second time even though a minute earlier (Jan 10 10:50:48)
the alert had already been marked as resolved.
Such behavior generates duplicate alerts in our system, which is quite
annoying at our scale.

Worth mentioning:
- For test purposes the event is scraped by 4 Prometheus servers (the
default), but the alert rule is evaluated by only one Prometheus.
- The event occurs only once, so there is no flapping that might cause
another alert to fire.

Thanks
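
If the duplicate really comes from the notification log not reaching a peer
in time, the cluster metrics usually make that visible. A few example
queries (metric names as exposed by recent Alertmanager versions; worth
confirming against the running build):

  max(alertmanager_cluster_members) != min(alertmanager_cluster_members)   # instances disagree about cluster size
  alertmanager_cluster_failed_peers > 0                                    # an instance has lost contact with peers
  alertmanager_peer_position                                               # which instance notifies first (position 0)

Correlating changes in these with the duplicated notifications should be
enough to confirm or rule out gossip latency as the cause.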
