[prometheus-users] Re: An alert fires twice even though an event occurs only once
Interesting. It seems the alertmanagers are spread over 3 different regions (2x Asia, 2x USA, 4x Europe). Maybe there is some latency problem between them, such as latency in the gossip messages?

On Friday, January 13, 2023 at 3:28:57 PM UTC+1 Brian Candler wrote:
> That's a lot of alertmanagers. Are they all fully meshed? (But I'd say 2
> or 3 would be better - spread over different regions)
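One way to check whether the mesh is actually complete is to alert on the cluster-membership metrics that Alertmanager itself exposes (alertmanager_cluster_members, alertmanager_cluster_failed_peers). A rough sketch of such a rule, assuming those metrics are scraped and that 8 is the intended cluster size described above:

groups:
  - name: alertmanager-cluster
    rules:
      - alert: AlertmanagerClusterIncomplete
        # alertmanager_cluster_members is the number of peers this instance
        # currently sees, including itself; 8 is the cluster size in this thread.
        expr: alertmanager_cluster_members != 8
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "{{ $labels.instance }} sees {{ $value }} of 8 Alertmanager cluster members"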
[prometheus-users] Re: An alert fires twice even though an event occurs only once
That's a lot of alertmanagers. Are they all fully meshed? (But I'd say 2 or 3 would be better - spread over different regions)

On Friday, 13 January 2023 at 14:16:27 UTC LukaszSz wrote:
> Yes. The prometheus server is configured to communicate with all
> alertmanagers (sorry, there are 8 alertmanagers):
>
> alerting:
>   alert_relabel_configs:
>     - action: labeldrop
>       regex: "^prometheus_server$"
>   alertmanagers:
>     - static_configs:
>         - targets:
>             - alertmanager1:9093
>             - alertmanager2:9093
>             - alertmanager3:9093
>             - alertmanager4:9093
>             - alertmanager5:9093
>             - alertmanager6:9093
>             - alertmanager7:9093
>             - alertmanager8:9093
[prometheus-users] Re: An alert fires twice even though an event occurs only once
Yes. The prometheus server is configured to communicate with all alertmanagers (sorry, there are 8 alertmanagers):

alerting:
  alert_relabel_configs:
    - action: labeldrop
      regex: "^prometheus_server$"
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager1:9093
            - alertmanager2:9093
            - alertmanager3:9093
            - alertmanager4:9093
            - alertmanager5:9093
            - alertmanager6:9093
            - alertmanager7:9093
            - alertmanager8:9093

On Friday, January 13, 2023 at 2:02:14 PM UTC+1 Brian Candler wrote:
> Yes, but have you configured the prometheus (the one which has alerting
> rules) to have all four alertmanagers as its destination?
[prometheus-users] Re: An alert fires twice even though an event occurs only once
Yes, but have you configured the prometheus (the one which has alerting rules) to have all four alertmanagers as its destination?

On Friday, 13 January 2023 at 12:55:49 UTC LukaszSz wrote:
> Yes Brian. As I mentioned in my post, the Alertmanagers are in a cluster and
> this event is visible on my 4 alertmanagers.
> The problem I described is that alerts are firing twice, which generates
> duplication.
[prometheus-users] Re: An alert fires twice even though an event occurs only once
Yes Brian. As I mentioned in my post, the Alertmanagers are in a cluster and this event is visible on my 4 alertmanagers.
The problem I described is that alerts are firing twice, which generates duplication.

On Friday, January 13, 2023 at 1:34:52 PM UTC+1 Brian Candler wrote:
> Are the alertmanagers clustered? Then you should configure prometheus to
> deliver the alert to *all* alertmanagers.
[prometheus-users] Re: An alert fires twice even though an event occurs only once
Are the alertmanagers clustered? Then you should configure prometheus to deliver the alert to *all* alertmanagers.

On Friday, 13 January 2023 at 11:08:37 UTC LukaszSz wrote:
> Hi guys,
>
> I have Prometheus in HA mode - 4 nodes, with Prometheus+Alertmanager on each.
> Everything works fine, but very often I experience an issue where an alert
> fires again even though the event has already been resolved by alertmanager.
Re: [prometheus-users] AlertManager rules examples
On 11/01/2023 19:58, Eulogio Apelin wrote:
> I'm looking for information, primarily examples, of various ways to
> configure alert rules.

I think it is best not to think of alerts as moment-in-time events but as a time period during which a certain condition is true. Separate from the actual alert firing are the rules (in Alertmanager) for how to route it (e.g. to Slack, email, etc.), what to send (the email body template) and how often to remind people that the alert is still happening.

So, for example, with your TLS expiry scenario you might have an alert which starts firing once a certificate is within 60 days of expiry. It would continue to fire until either the certificate is renewed (i.e. it is over 60 days again) or it stops existing (because you've reconfigured Prometheus to no longer monitor that certificate). Then within Alertmanager you can set rules to send you a message every 10 days while that alert is firing, meaning you'd get a message at 60, 50, 40, etc. days until expiry.

More complex alert routing decisions are generally out of scope for Alertmanager and would be expected to be handled by a more capable system (such as PagerDuty, OpsGenie, Grafana OnCall, etc.). This would cover your example of wanting to escalate an alert after a period of time, but would also cover things like on-call rotas where different people are contacted based on a rota calendar.

-- 
Stuart Clark
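As an illustration of that pattern, here is a rough sketch: a Prometheus rule that starts firing once a certificate is within 60 days of expiry (this assumes the blackbox_exporter's probe_ssl_earliest_cert_expiry metric is being scraped), plus an Alertmanager route that repeats the notification every 10 days while the alert keeps firing. All names and thresholds are illustrative, not taken from anyone's actual setup.

# Prometheus rule file (sketch)
groups:
  - name: tls-expiry
    rules:
      - alert: TLSCertExpiringSoon
        # Fires for as long as the certificate is within 60 days of expiry.
        expr: probe_ssl_earliest_cert_expiry - time() < 60 * 24 * 3600
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Certificate on {{ $labels.instance }} expires in {{ $value | humanizeDuration }}"

# Alertmanager route (sketch); SMTP details such as smarthost are omitted.
route:
  receiver: email-team
  routes:
    - matchers:
        - alertname = "TLSCertExpiringSoon"
      receiver: email-team
      repeat_interval: 10d   # re-send the reminder every 10 days while still firing
receivers:
  - name: email-team
    email_configs:
      - to: team@example.com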
[prometheus-users] AlertManager rules examples
I'm looking for information, primarily examples, of various ways to configure alert rules. Specifically, scenarios like the following, in a single rule group:

- an expression that determines a TLS cert expires in 60 days: send 1 alert
- an expression that determines a TLS cert expires in 40 days: send 1 alert
- an expression that determines a TLS cert expires in 30 days: send 1 alert
- an expression that determines a TLS cert expires in 20 days: send 1 alert
- an expression that determines a TLS cert expires in 10 days: send 1 alert
- an expression that determines a TLS cert expires in 5 days: send 1 alert
- an expression that determines a TLS cert expires in 0 days: send 1 alert

Another scenario is to send an alert once a day to an email address; if it's the 3rd day in a row, send the alert to another set of addresses and stop alerting.

Can alertmanager send alerts to Teams like it does Slack?

And any other general examples of alertmanager rules.

Thanks!
[prometheus-users] the query text is cut off within postgres_exporter
The query text is cut off within postgres_exporter / Grafana. This makes it hard or even impossible to inspect suspect queries.

Question: is there any way to get the whole query text from there?
[prometheus-users] Resolved alerts are grouped together with firing alerts
I got this strange behavior where resolved alerts are sent alongside firing ones.

So I have this rule: kube_pod_container_status_ready{namespace="default"} == 0.

What happens: when a pod is down, an alert is sent and everything is fine; then the pod comes up and the alert is resolved. But if the pod fails again within a short period and gets recreated by the deployment with a different name, the alert fires mentioning both the previous pod and the new one. I also noticed that if you wait about 20 minutes after the alert is resolved and kill a pod again, there is only one pod in the alert.

This is expected:

first alert 12:24
Container ubuntu in pod test-ubuntu-5579c5f49c-rsb8v is not ready for 30 seconds.
Prometheus Alert (Firing)
summary: Container is not ready for too long.
alertname: KubeContainerNotReady
container: ubuntu
endpoint: http
instance: 10.233.74.200:8080
job: kube-state-metrics
pod: test-ubuntu-5579c5f49c-rsb8v
prometheus: prometheus/prometheus-kube-prometheus-prometheus
service: prometheus-kube-state-metrics
severity: warning
uid: 85f61574-2559-4f1a-8a14-f08ee4e34b8a

second alert 12:27
Container ubuntu in pod test-ubuntu-5579c5f49c-rsb8v is not ready for 30 seconds.
Prometheus Alert (Resolved)
summary: Container is not ready for too long.
alertname: KubeContainerNotReady
container: ubuntu
endpoint: http
instance: 10.233.74.200:8080
job: kube-state-metrics
pod: test-ubuntu-5579c5f49c-rsb8v
prometheus: prometheus/prometheus-kube-prometheus-prometheus
service: prometheus-kube-state-metrics
severity: warning
uid: 85f61574-2559-4f1a-8a14-f08ee4e34b8a

Then I kill the pod and this happens (it's a single alert):

third alert 12:32
Container ubuntu in pod test-ubuntu-5579c5f49c-rsb8v is not ready for 30 seconds.
Prometheus Alert (Firing)
summary: Container is not ready for too long.
alertname: KubeContainerNotReady
container: ubuntu
endpoint: http
instance: 10.233.74.200:8080
job: kube-state-metrics
pod: test-ubuntu-5579c5f49c-rsb8v
prometheus: prometheus/prometheus-kube-prometheus-prometheus
service: prometheus-kube-state-metrics
severity: warning
uid: 85f61574-2559-4f1a-8a14-f08ee4e34b8a

Container ubuntu in pod test-ubuntu-5579c5f49c-sjlrk is not ready for 30 seconds.
summary: Container is not ready for too long.
alertname: KubeContainerNotReady
container: ubuntu
endpoint: http
instance: 10.233.74.200:8080
job: kube-state-metrics
pod: test-ubuntu-5579c5f49c-sjlrk
prometheus: prometheus/prometheus-kube-prometheus-prometheus
service: prometheus-kube-state-metrics
severity: warning
uid: aba2b39c-a4b1-4d02-a532-4ca39ef8c0da

Here's my config:

config:
  global:
    resolve_timeout: 5m
  route:
    group_by: ['alertname']
    group_interval: 30s
    repeat_interval: 24h
    group_wait: 30s
    receiver: 'prometheus-msteams'
  receivers:
    - name: 'prometheus-msteams'
      webhook_configs:
        # https://prometheus.io/docs/alerting/configuration/#webhook_config
        - send_resolved: true
          url: "http://prometheus-msteams:2000/prometheus-msteams"

Now, I know I can just group them by pod or some other labels, or even turn off grouping, but I want to figure out what exactly happens here. Also, I can't figure out what will happen to an alert that has no label by which I am grouping. For example, if I group by pod name, how will alerts without a pod label be treated?
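For what it's worth, as far as I understand Alertmanager's grouping, alerts are grouped by the values of the group_by labels they actually carry; an alert that lacks one of those labels is treated as having an empty value for it, so all alerts missing that label (and matching on the remaining group labels) land in the same group. A sketch of grouping by pod as well, with the label name taken from the alerts above and the timing values from the config above:

route:
  receiver: 'prometheus-msteams'
  # Adding 'pod' means the recreated pod (with its new name) starts its own
  # group instead of being appended to the previous pod's notification.
  group_by: ['alertname', 'pod']
  group_wait: 30s
  group_interval: 30s
  repeat_interval: 24h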
[prometheus-users] An alert fires twice even though an event occurs only once
Hi guys,

I have Prometheus in HA mode - 4 nodes, with Prometheus+Alertmanager on each.
Everything works fine, but very often I experience an issue where an alert fires again even though the event has already been resolved by alertmanager.

Below are logs from an example event (Chrony_Service_Down) recorded by alertmanager:

(1) Jan 10 10:48:48 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:48:48.746Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=Chrony_Service_Down[d8c020a][active]

(2) Jan 10 10:49:19 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:49:19.460Z caller=nflog.go:553 level=debug component=nflog msg="gossiping new entry" entry="entry: datacenter="dc01", exporter="node", fqdn="server.example.com", instance="server.example.com", job="node", monitoring_infra="prometheus-mon", puppet_certname="server.example.com", service="chrony", severity="page"}" receiver: integration:"opsgenie" timestamp: firing_alerts:10151928354614242630 > expires_at: nanos:262824014 > "

(3) Jan 10 10:50:48 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:50:48.745Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=Chrony_Service_Down[d8c020a][resolved]

(4) Jan 10 10:51:29 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:51:29.183Z caller=nflog.go:553 level=debug component=nflog msg="gossiping new entry" entry="entry: datacenter="dc01", exporter="node", fqdn="server.example.com", instance="server.example.com", job="node", monitoring_infra="prometheus-mon", puppet_certname="server.example.com", service="chrony", severity="page"}" receiver: integration:"opsgenie" timestamp: resolved_alerts:10151928354614242630 > expires_at: nanos:897562679 > "

(5) Jan 10 10:51:49 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:51:49.745Z caller=nflog.go:553 level=debug component=nflog msg="gossiping new entry" entry="entry: datacenter="dc01", exporter="node", fqdn="server.example.com", instance="server.example.com", job="node", monitoring_infra="prometheus-mon", puppet_certname="server.example.com", service="chrony", severity="page"}" receiver: integration:"opsgenie" timestamp: firing_alerts:10151928354614242630 > expires_at: nanos:649205670 > "

(6) Jan 10 10:51:59 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:51:59.312Z caller=nflog.go:553 level=debug component=nflog msg="gossiping new entry" entry="entry: datacenter="dc01", exporter="node", fqdn="server.example.com", instance="server.example.com", job="node", monitoring_infra="prometheus-mon", puppet_certname="server.example.com", service="chrony", severity="page"}" receiver: integration:"opsgenie" timestamp: resolved_alerts:10151928354614242630 > expires_at: nanos:137020780 > "

(7) Jan 10 10:54:58 prometheus-01 alertmanager[1213219]: ts=2023-01-10T10:54:58.744Z caller=dispatch.go:165 level=debug component=dispatcher msg="Received alert" alert=Chrony_Service_Down[d8c020a][resolved]

#

The interesting part is entry number 5 (Jan 10 10:51:49), where alertmanager fired the alert a second time even though a minute earlier (Jan 10 10:50:48) the alert had been marked as resolved.
Such behavior generates duplicate alerts in our system, which is quite annoying at our scale.

Worth mentioning:
- For test purposes the event is scraped by 4 Prometheus servers (default), but the alert rule is evaluated by only one Prometheus.
- The event occurs only once, so there is no flapping that might cause another alert to fire.

Thanks
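For context on the HA side: the alert_relabel_configs shown earlier in this thread (labeldrop on "^prometheus_server$") is the usual way to let several Prometheus replicas evaluate the same rules without producing distinct alerts - each replica carries an identifying external label, and dropping it just before the alert is sent makes the replicas' copies identical so Alertmanager deduplicates them. A minimal sketch of that pattern; the label name prometheus_server is taken from the config quoted above, while the external_labels placement and the example targets are assumptions:

# prometheus.yml on each replica (sketch)
global:
  external_labels:
    prometheus_server: prometheus-01   # unique per replica (prometheus-02, ... on the others)

alerting:
  alert_relabel_configs:
    # Drop the replica-identifying label so alerts from all replicas look
    # identical to Alertmanager and collapse into a single notification.
    - action: labeldrop
      regex: "^prometheus_server$"
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager1:9093
            - alertmanager2:9093
            - alertmanager3:9093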