Yes, you've got it.  It's easy to test your hypothesis: simply paste the 
alert rule expression

    100 - (avg by(instance,cluster) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 95

into the PromQL query browser in the prometheus web interface, and you'll 
see all the results - including their labels.

I believe you'll get results like

{instance="foo",cluster="bar"} 98.4

There won't be any "env" label there because you've aggregated it away.

Try using avg by(instance,cluster,env) instead.
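
The full rule expression would then look something like this (same metric 
and threshold as before, just keeping "env" in the aggregation so it is 
still there for alertmanager to route on):

    100 - (avg by(instance,cluster,env) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 95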

Or you could have separate alerting rules per environment, and re-apply the 
label in your rule:

    expr: 100 - (avg by(instance,cluster) (rate(node_cpu_seconds_total{env="dev",mode="idle"}[2m])) * 100) > 98
    labels:
      env: dev
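
A hypothetical 'prod' counterpart (the alert name and threshold here are 
just placeholders, pick whatever suits your environments) would then be a 
separate rule along these lines:

    - alert: High_Cpu_Load_Prod
      expr: 100 - (avg by(instance,cluster) (rate(node_cpu_seconds_total{env="prod",mode="idle"}[2m])) * 100) > 95
      for: 0m
      labels:
        env: prod
        severity: warning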

On Monday, 22 August 2022 at 21:21:51 UTC+1 rs wrote:

> Thanks Brian, I am in the midst of setting up a slack receiver (to weed 
> out the alerts going to the wrong channel). One thing I have noticed is 
> that the alerts being routed incorrectly may actually have to do with this 
> rule:
>
> - alert: High_Cpu_Load
>   expr: 100 - (avg by(instance,cluster) (rate(node_cpu_seconds_total{mode="idle"}[2m])) * 100) > 95
>   for: 0m
>   labels:
>     severity: warning
>   annotations:
>     summary: Host high CPU load (instance {{ $labels.instance }})
>     description: "CPU load is > 95%\n INSTANCE = {{ $labels.instance }}\n VALUE = %{{ $value | humanize }}\n LABELS = {{ $labels }}"
>
> I believe the issue may be that I'm not passing 'env' into the 
> expression, and that is what's causing the alerts to be routed 
> incorrectly. Just a hunch, but I appreciate you pointing me in the right 
> direction!
>
> On Monday, August 22, 2022 at 3:06:47 PM UTC-4 Brian Candler wrote:
>
>> "Looks correct but still doesn't work how I expect"
>>
>> What you've shown is a target configuration, not an alert arriving at 
>> alertmanager.
>>
>> Therefore, I'm suggesting you take a divide-and-conquer approach.  First, 
>> work out which of your receivers is actually handling these alerts (is it 
>> the 'production' receiver, or is it the default 'slack' receiver?) by 
>> making them distinguishable.  That will tell you which routing rule is or 
>> isn't being triggered.  And then you can work out why.
>>
>> There are all sorts of reasons it might not work, other than the config 
>> you've shown.  For example: if you have any target or metric relabelling 
>> rules which set the "env" label; if the exporter itself exposes "env" and 
>> you have honor_labels set; and so on.
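>>
>> Purely as an illustration (the job name, file path and forced value here 
>> are made up), this is the kind of scrape config that would silently 
>> override the "env" you set in the file_sd target file:
>>
>>     scrape_configs:
>>       - job_name: node
>>         honor_labels: true     # labels exposed by the exporter win over target labels
>>         file_sd_configs:
>>           - files: ['targets/*.json']
>>         relabel_configs:
>>           - target_label: env
>>             replacement: prod  # hypothetical: forces env="prod" on every target
>>
>> So it's worth grepping prometheus.yml for anything that touches "env".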
>>
>> Hence the first step is to find out from real alert events: is the alert 
>> being generated without the "env: dev" label?  If so, the alert routing is 
>> just fine, and you need to work out why that label is missing or wrong 
>> (you're looking at the prometheus side).  Or is the alert actually arriving 
>> at alertmanager with the "env: dev" label?  In that case you're looking at 
>> the alertmanager side to find out why it's not being routed as expected.
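>>
>> Concretely, the thing to look at is the label set on the alert as 
>> alertmanager receives it - roughly this shape (the values below are just 
>> illustrative, based on the target file you posted):
>>
>>     {
>>       "labels": {
>>         "alertname": "High_Cpu_Load",
>>         "instance": "x:9100",
>>         "cluster": "X Stage Servers",
>>         "severity": "warning"
>>       },
>>       "annotations": {
>>         "summary": "Host high CPU load (instance x:9100)"
>>       },
>>       "startsAt": "2022-08-22T20:00:00Z"
>>     }
>>
>> If "env" is missing from that label set (as above), the alert matches 
>> neither the 'production' nor the 'staging' route and falls through to the 
>> default 'slack' receiver, which posts to #prod-channel.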
>>
>> On Monday, 22 August 2022 at 18:45:25 UTC+1 rs wrote:
>>
>>> I checked the json file and the tagging was correct. Here's an example:
>>>
>>>    {
>>>        "labels": {
>>>            "cluster": "X Stage Servers",
>>>            "env": "dev"
>>>        },
>>>        "targets": [
>>>            "x:9100",
>>>            "y:9100",
>>>            "z:9100"
>>>        ]
>>>    },
>>>
>>> This is being sent to the production/default channel.
>>>
>>> On Friday, August 12, 2022 at 11:29:34 AM UTC-4 Brian Candler wrote:
>>>
>>>> Firstly, I'd drop the "continue: true" lines. They are not required, 
>>>> and are just going to cause confusion.
>>>>
>>>> The 'slack' and 'production' receivers are both sending to 
>>>> #prod-channel, so any alert whose env label is not exactly "dev" will 
>>>> end up there.  I suggest you look in detail at the alerts themselves: 
>>>> maybe they're tagged with "Dev" or "dev " (with a hidden trailing space).
>>>>
>>>> If you change the default 'slack' receiver to go to a different 
>>>> channel, or use a different title/text template, it will be easier to see 
>>>> if this is the problem or not.
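>>>>
>>>> For example (the channel name here is made up), pointing the fallback 
>>>> receiver at a scratch channel and tagging its title makes it obvious 
>>>> whenever the default route is the one firing:
>>>>
>>>>     - name: 'slack'
>>>>       slack_configs:
>>>>       - api_url: 'api url'
>>>>         channel: '#alert-debug-channel'   # hypothetical scratch channel
>>>>         send_resolved: true
>>>>         title: '[DEFAULT ROUTE] {{ template "slack.title" . }}'
>>>>         text: '{{ template "slack.text" . }}'
>>>>
>>>> Once you can see which receiver actually handled a given alert, it's 
>>>> much easier to work backwards to the route (or missing label) that sent 
>>>> it there.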
>>>>
>>>>
>>>> On Friday, 12 August 2022 at 09:36:22 UTC+1 rs wrote:
>>>>
>>>>> Hi everyone! I am configuring alertmanager to send alerts to a prod 
>>>>> slack channel and a dev slack channel. I have checked with the routing 
>>>>> tree editor and everything should be working correctly. 
>>>>> However, I am seeing some (not all) alerts that are tagged with 'env: 
>>>>> dev' being sent to the prod slack channel. Is there some sort of old 
>>>>> configuration caching happening? Is there a way to flush this out?
>>>>>
>>>>> --- Alertmanager.yml ---
>>>>> global:
>>>>>   http_config:
>>>>>     proxy_url: 'xyz'
>>>>> templates:
>>>>>   - templates/*.tmpl
>>>>> route:
>>>>>   group_by: [cluster,alertname]
>>>>>   group_wait: 10s
>>>>>   group_interval: 30m
>>>>>   repeat_interval: 24h
>>>>>   receiver: 'slack'
>>>>>   routes:
>>>>>   - receiver: 'production'
>>>>>     match:
>>>>>       env: 'prod'
>>>>>     continue: true
>>>>>   - receiver: 'staging'
>>>>>     match:
>>>>>       env: 'dev'
>>>>>     continue: true
>>>>> receivers:
>>>>> #Fallback option - Default set to production server
>>>>> - name: 'slack'
>>>>>   slack_configs:
>>>>>   - api_url: 'api url'
>>>>>     channel: '#prod-channel'
>>>>>     send_resolved: true
>>>>>     color: '{{ template "slack.color" . }}'
>>>>>     title: '{{ template "slack.title" . }}'
>>>>>     text: '{{ template "slack.text" . }}'
>>>>>     actions:
>>>>>       - type: button
>>>>>         text: 'Query :mag:'
>>>>>         url: '{{ (index .Alerts 0).GeneratorURL }}'
>>>>>       - type: button
>>>>>         text: 'Silence :no_bell:'
>>>>>         url: '{{ template "__alert_silence_link" . }}'
>>>>>       - type: button
>>>>>         text: 'Dashboard :grafana:'
>>>>>         url: '{{ (index .Alerts 0).Annotations.dashboard }}'
>>>>> - name: 'staging'
>>>>>   slack_configs:
>>>>>   - api_url: 'api url'
>>>>>     channel: '#staging-channel'
>>>>>     send_resolved: true
>>>>>     color: '{{ template "slack.color" . }}'
>>>>>     title: '{{ template "slack.title" . }}'
>>>>>     text: '{{ template "slack.text" . }}'
>>>>>     actions:
>>>>>       - type: button
>>>>>         text: 'Query :mag:'
>>>>>         url: '{{ (index .Alerts 0).GeneratorURL }}'
>>>>>       - type: button
>>>>>         text: 'Silence :no_bell:'
>>>>>         url: '{{ template "__alert_silence_link" . }}'
>>>>>       - type: button
>>>>>         text: 'Dashboard :grafana:'
>>>>>         url: '{{ (index .Alerts 0).Annotations.dashboard }}'
>>>>> - name: 'production'
>>>>>   slack_configs:
>>>>>   - api_url: 'api url'
>>>>>     channel: '#prod-channel'
>>>>>     send_resolved: true
>>>>>     color: '{{ template "slack.color" . }}'
>>>>>     title: '{{ template "slack.title" . }}'
>>>>>     text: '{{ template "slack.text" . }}'
>>>>>     actions:
>>>>>       - type: button
>>>>>         text: 'Query :mag:'
>>>>>         url: '{{ (index .Alerts 0).GeneratorURL }}'
>>>>>       - type: button
>>>>>         text: 'Silence :no_bell:'
>>>>>         url: '{{ template "__alert_silence_link" . }}'
>>>>>       - type: button
>>>>>         text: 'Dashboard :grafana:'
>>>>>         url: '{{ (index .Alerts 0).Annotations.dashboard }}'
>>>>>
>>>>
