It's correct for Prometheus to send alerts to both alertmanagers, but I 
suspect you haven't got the alertmanagers clustered together correctly.
See: 
https://prometheus.io/docs/alerting/latest/alertmanager/#high-availability

Make sure you've configured the cluster flags, and check your alertmanager 
container logs for messages relating to clustering or "gossip".
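
For example, the pods need to be started with flags along these lines (the 
peer addresses below are placeholders - use whatever DNS names your two pods 
resolve to):

    alertmanager \
      --config.file=/etc/alertmanager/alertmanager.yml \
      --cluster.listen-address=0.0.0.0:9094 \
      --cluster.peer=prometheus-alertmanager-0:9094 \
      --cluster.peer=prometheus-alertmanager-1:9094

Once the peers have joined up, the alertmanager_cluster_members metric should 
report 2 on both instances, and only one of them should actually send each 
notification.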

On Tuesday, 28 June 2022 at 16:12:53 UTC+1 ionel...@crunch.co.uk wrote:

> Hi Brian,
>
> So my previous assumption proved to be correct - it was in fact the 
> alertmanager settings that weren't getting properly applied on the fly. 
> Today I ensured they were applied reliably & I can see the 
> alerts firing every 6 minutes now, with these settings:
>     group_wait: 30s
>     group_interval: 2m
>     repeat_interval: 5m
>
> Now I'm trying to sort out the fact that the alerts fire twice each time. 
> We have some form of HA in place, where we spawn 2 pods for the 
> alertmanager & looking at their logs, I can see that each container fires 
> the alert, which explains why I see 2 of them:
>
>
> prometheus-alertmanager-0 level=debug ts=2022-06-28T14:27:40.121Z caller=notify.go:735 component=dispatcher receiver=pager integration=slack[0] msg="Notify success" attempts=1
> prometheus-alertmanager-1 level=debug ts=2022-06-28T14:27:40.418Z caller=notify.go:735 component=dispatcher receiver=pager integration=slack[0] msg="Notify success" attempts=1
>
> Any idea why that is?
>
> Thank you!
> On Monday, 27 June 2022 at 17:20:29 UTC+1 Brian Candler wrote:
>
>> Look at container logs then.
>>
>> Metrics include things like the number of notifications attempted, 
>> succeeded and failed.  Those would be the obvious first place to look.  
>> (For example: is it actually trying to send a mail? If so, is it succeeding 
>> or failing?)
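>>
>> For example, something along these lines (the exact metric set can vary a 
>> little between versions, but alertmanager_notifications_total and 
>> alertmanager_notifications_failed_total, broken down per integration, are 
>> the ones to compare):
>>
>>     curl -s localhost:9093/metrics | grep alertmanager_notifications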
>>
>> Aside: vector(0) and vector(1) are the same for generating alerts. It's 
>> only the presence of a value that triggers an alert; the actual value 
>> itself can be anything.
>>
>> On Monday, 27 June 2022 at 16:28:46 UTC+1 ionel...@crunch.co.uk wrote:
>>
>>> Ok, added a rule with an expression of *vector(1)*, went live at 12:31, 
>>> when it fired 2 alerts  (?!), but then went completely silent until 15:36, 
>>> when it fired again 2x (so more than 3 h in). The alert has been stuck in 
>>> the *FIRING* state the whole time, as expected.
>>> Unfortunately the logs don't shed any light - there's nothing logged 
>>> aside from the bootstrap logs. It isn't a systemd process - it's run in a 
>>> container & there seems to be just a big executable in there.
>>> The meta-metrics contain quite a lot of data - are there any particulars 
>>> I should be looking for?
>>>
>>> Either way, I'm now inclined to believe that this is definitely an 
>>> *alertmanager* settings matter. As I mentioned in my initial email, 
>>> I've already tweaked *group_wait*, *group_interval* & *repeat_interval*, 
>>> but they probably didn't take effect as I thought they would. So maybe 
>>> that's something I need to sort out. And better logging should help me 
>>> understand all of that, which I still need to figure out how to do.
>>>
>>> Thank you very much for your help Brian!
>>>
>>> On Monday, 27 June 2022 at 09:59:59 UTC+1 Brian Candler wrote:
>>>
>>>> I suspect the easiest way to debug this is to focus on "*repeat_interval: 
>>>> 2m*".  Even if a single alert is statically firing, you should get the 
>>>> same notification resent every 2 minutes.  So don't worry about catching 
>>>> second instances of the same expr; just set a simple alerting expression 
>>>> which fires continuously, say just "expr: vector(0)", to find out why it's 
>>>> not resending.
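>>>>
>>>> e.g. a minimal rule file like this (group and alert names are arbitrary):
>>>>
>>>>     groups:
>>>>     - name: debug
>>>>       rules:
>>>>       - alert: AlwaysFiring
>>>>         expr: vector(0)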
>>>>
>>>> You can then look at logs from alertmanager (e.g. "journalctl -eu 
>>>> alertmanager" if running under systemd). You can also look at the metrics 
>>>> alertmanager itself generates:
>>>>
>>>>     curl localhost:9093/metrics | grep alertmanager
>>>>
>>>> Hopefully, one of these may give you a clue as to what's happening 
>>>> (e.g. maybe your mail system or other notification endpoint has some sort 
>>>> of rate limiting??).
>>>>
>>>> However, if the vector(0) expression *does* send repeated alerts 
>>>> successfully, then your problem is most likely something to do with your 
>>>> actual alerting expr, and you'll need to break it down into simpler pieces 
>>>> to debug it.
>>>>
>>>> Apart from that, all I can say is "it works for me™": if an alerting 
>>>> expression subsequently generates a second alert in its result vector, 
>>>> then 
>>>> I get another alert after group_interval.
>>>>
>>>> On Monday, 27 June 2022 at 09:39:45 UTC+1 ionel...@crunch.co.uk wrote:
>>>>
>>>>> Hi Brian,
>>>>>
>>>>> Thanks for your reply! To be honest, you can pretty much ignore the 
>>>>> first part of the expression; it doesn't change anything in the "repeat" 
>>>>> behaviour. In fact, we don't even have that bit at the moment - it's just 
>>>>> something I've been playing with in order to capture the very first 
>>>>> springing into existence of the metric, which isn't covered by the 
>>>>> current expression, 
>>>>> *sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0*.
>>>>> Also, I've already done the PromQL graphing that you suggested; I could 
>>>>> see those multiple lines that you were talking about, but then there 
>>>>> was no alert firing... 🤷‍♂️
>>>>>
>>>>> Any other pointers?
>>>>>
>>>>> Thanks,
>>>>> Ionel
>>>>>
>>>>> On Saturday, 25 June 2022 at 16:52:17 UTC+1 Brian Candler wrote:
>>>>>
>>>>>> Try putting the whole alerting "expr" into the PromQL query browser, 
>>>>>> and switching to graph view.
>>>>>>
>>>>>> This will show you the alert vector graphically, with a separate line 
>>>>>> for each alert instance.  If this isn't showing multiple lines, then you 
>>>>>> won't receive multiple alerts.  Then you can break your query down into 
>>>>>> parts and try them individually, to understand why it's not working as 
>>>>>> you expect.
>>>>>>
>>>>>> Looking at just part of your expression:
>>>>>>
>>>>>> *sum(error_counter{service="myservice",other="labels"} unless 
>>>>>> error_counter{service="myservice",other="labels"} offset 1m) > 0*
>>>>>>
>>>>>> And taking just the part inside sum():
>>>>>>
>>>>>> *error_counter{service="myservice",other="labels"} unless 
>>>>>> error_counter{service="myservice",other="labels"} offset 1m*
>>>>>>
>>>>>> This expression is weird. It will only generate a value when the 
>>>>>> error counter first springs into existence.  As soon as it has existed 
>>>>>> for 
>>>>>> more than 1 minute - even with value zero - then the "unless" clause will 
>>>>>> suppress the expression completely, i.e. it will be an empty instance 
>>>>>> vector.
>>>>>>
>>>>>> I think this is probably not what you want.  In any case it's not a 
>>>>>> good idea to have timeseries which come and go; it's very awkward to 
>>>>>> alert 
>>>>>> on a timeseries appearing or disappearing, and you may have problems 
>>>>>> with 
>>>>>> staleness, i.e. the timeseries may continue to exist for 5 minutes after 
>>>>>> you've stopped generating points in it.
>>>>>>
>>>>>> It's much better to have a timeseries which continues to exist.  That 
>>>>>> is, "error_counter" should spring into existence with value 0, and 
>>>>>> increment when errors occur, and stop incrementing when errors don't 
>>>>>> occur 
>>>>>> - but continue to keep the value it had before.
>>>>>>
>>>>>> If your error_counter timeseries *does* exist continuously, then this 
>>>>>> 'unless' clause is probably not what you want.
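>>>>>>
>>>>>> In that case, something like just the rate() half of your existing expr 
>>>>>> is probably all you need:
>>>>>>
>>>>>>     expr: sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0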
>>>>>>
>>>>>> On Saturday, 25 June 2022 at 15:42:08 UTC+1 ionel...@crunch.co.uk 
>>>>>> wrote:
>>>>>>
>>>>>>> Hello,
>>>>>>>
>>>>>>> I'm trying to set up some alerts that fire on critical errors, so 
>>>>>>> I'm aiming for immediate & consistent reporting for as much as possible.
>>>>>>>
>>>>>>> So for that matter, I defined the alert rule without a *for* clause:
>>>>>>>
>>>>>>> groups:
>>>>>>> - name: Test alerts
>>>>>>>   rules:
>>>>>>>   - alert: MyService Test Alert
>>>>>>>     expr: 'sum(error_counter{service="myservice",other="labels"} unless error_counter{service="myservice",other="labels"} offset 1m) > 0 or sum(rate(error_counter{service="myservice",other="labels"}[1m])) > 0'
>>>>>>>
>>>>>>> Prometheus is configured to scrape & evaluate at 10 s:
>>>>>>>
>>>>>>> global:
>>>>>>>   scrape_interval: 10s
>>>>>>>   scrape_timeout: 10s
>>>>>>>   evaluation_interval: 10s
>>>>>>>
>>>>>>> And the alert manager (docker image 
>>>>>>> *quay.io/prometheus/alertmanager:v0.23.0*) is configured with these 
>>>>>>> parameters:
>>>>>>>
>>>>>>> route:
>>>>>>>   group_by: ['alertname', 'node_name']
>>>>>>>   group_wait: 30s
>>>>>>>   group_interval: 1m # used to be 5m
>>>>>>>   repeat_interval: 2m # used to be 3h
>>>>>>>
>>>>>>> Now what happens when testing is this:
>>>>>>> - on the very first metric generated, the alert fires as expected;
>>>>>>> - on subsequent tests it stops firing;
>>>>>>> - *I kept on running a new test each minute for 20 minutes, but no 
>>>>>>> alert fired again*;
>>>>>>> - I can see the alert state going into *FIRING* in the alerts view 
>>>>>>> in the Prometheus UI;
>>>>>>> - I can see the metric values getting generated when executing the 
>>>>>>> expression query in the Prometheus UI;
>>>>>>>
>>>>>>> I redid the same test suite after a 2-hour break & exactly the same 
>>>>>>> thing happened, including the fact that *the alert fired on the 
>>>>>>> first test!*
>>>>>>>
>>>>>>> What am I missing here? How can I make the alert manager fire that 
>>>>>>> alert on repeated error metric hits? Ok, it doesn't have to be as soon 
>>>>>>> as 
>>>>>>> 2m, but let's consider that for testing's sake.
>>>>>>>
>>>>>>> Pretty please, any advice is much appreciated!
>>>>>>>
>>>>>>> Kind regards,
>>>>>>> Ionel
>>>>>>>
>>>>>>
