[prometheus-users] Re: Alerts are getting fire after every minute

'Brian Candler' via Prometheus Users Wed, 05 Mar 2025 10:29:14 -0800

I notice that your "up == 0" graph shows lots of green which are values 
where up == 0. These are legitimately generating alerts, in my opinion. If 
you have set evaluation_interval to 5m, and "for:" to be less than 5m, then 
a single instance of up == 0 will send an alert, because that's what you 
asked for.


*> I want alerts to be triggered after 5 min and only if the condition is true.*

Then you want:

evaluation_interval: 15s  # globally in prometheus.yml (or "interval: 15s" on the rule group)
for: 5m   # on the individual alerting rule(s)

Then an alert will only be sent if the alert condition has been present 
continuously for the whole 5 minutes (i.e. 20 consecutive cycles of 15s each).
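
As a minimal sketch of how those two settings could sit together in a rules file (the group and rule are taken from the configuration quoted further down; only the "interval" and "for" values are changed, and the description text is adjusted to match):

```yaml
groups:
  - name: instance_alerts
    interval: 15s        # per-group evaluation interval; omit it to inherit the global evaluation_interval
    rules:
      - alert: "Instance Down"
        expr: up == 0
        for: 5m          # condition must hold at every evaluation for 5 minutes before the alert fires
        labels:
          severity: "Critical"
        annotations:
          summary: "Endpoint {{ $labels.instance }} down"
          description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes."
```

With these values the alert sits in the "pending" state while the condition holds and only moves to "firing" once it has held for the full 5 minutes. If up goes back to 1 at any evaluation in between, the pending timer starts over.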

Finally: you may find it helpful to include {{ $value }} in an annotation 
on each alerting rule, so you can tell the value which triggered the alert. 
I can see you've done this already in one of your alerts:

    - alert: "Total Messages > 10k in last 1 min"
      expr: rabbitmq_queue_messages > 10000
...
      annotations:
        summary: "'{{ $labels.queue }}' has total '*{{ $value }}*' messages for more than 1 min."

And this is reflected in the alert:

      description: 'Queue QUEUE_NAME in RabbitMQ has total *1.110738e+06* messages\n' +
        'for more than 1 minutes.\n',
      summary: "RabbitMQ Queue 'QUEUE_NAME' has more than 10L messages"

rabbitmq_queue_messages is a vector containing zero or more instances of 
that metric.

rabbitmq_queue_messages > 10000 is a reduced vector, containing only those 
instances of the metric with a value greater than 10000.

You can see that the $value at the time the alert was generated 
was 1.110738e+06, which is 1,110,738, and that's clearly a lot more than 
10,000. Hence you get an alert. It's what you asked for.

If you want a more readable string in the annotation, you can use {{ $value 
| humanize }}, but it will lose some precision.
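
A sketch of what that could look like on the rule from your file (only the annotations change; the rendered values mentioned in the comments are what I would expect for 1.110738e+06, not output copied from a running server):

```yaml
    - alert: "Total Messages > 10k in last 1 min"
      expr: rabbitmq_queue_messages > 10000
      for: 5m
      annotations:
        # humanize uses SI prefixes and about four significant digits, so roughly "1.111M"
        summary: "'{{ $labels.queue }}' has total '{{ $value | humanize }}' messages."
        # printf keeps the full integer instead, so "1110738"
        description: 'Queue {{ $labels.queue }} has {{ $value | printf "%.0f" }} messages.'
```

Using humanize in one annotation and printf in another lets you keep a short summary while still having the exact count available.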

On Wednesday, 5 March 2025 at 10:28:15 UTC Amol Nagotkar wrote:

> As you can see in the images below, the last trigger was at 15:31:29,
> but I received emails after that time as well, for example at 15:35, 
> 15:37, etc. 
> [image: IMG-20250305-WA0061.jpg]
>
> [image: IMG-20250305-WA0060.jpg]
> On Wednesday, March 5, 2025 at 3:28:20 PM UTC+5:30 Amol Nagotkar wrote:
>
>>
>> Thank you for the quick reply.
>>
>> So, as I told you, I am not using Alertmanager. I am getting alerts based 
>> on this config in the prometheus.yml file:
>>
>> alerting:
>>   alertmanagers:
>>     - static_configs:
>>         - targets:
>>           - IP_ADDRESS_OF_EMAIL_APPLICATION:PORT
>>
>> Below is the alert payload (an array of objects) I am receiving from 
>> Prometheus:
>>
>>
>> [
>>   {
>>     annotations: {
>>       description: 'Queue QUEUE_NAME in RabbitMQ has total 1.110738e+06 messages\n' +
>>         'for more than 1 minutes.\n',
>>       summary: "RabbitMQ Queue 'QUEUE_NAME' has more than 10L messages"
>>     },
>>     endsAt: '2025-02-03T06:33:31.893Z',
>>     startsAt: '2025-02-03T06:28:31.893Z',
>>     generatorURL: 'http://helo-container-pr:9091/graph?g0.expr=rabbitmq_queue_messages+%3E+1e%2B06&g0.tab=1',
>>     labels: {
>>       alertname: 'Total Messages > 10L in last 1 min',
>>       instance: 'IP_ADDRESS:15692',
>>       job: 'rabbitmq-rcs',
>>       queue: 'QUEUE_NAME',
>>       severity: 'critical',
>>       vhost: 'webhook'
>>     }
>>   }
>> ]
>>
>>
>>
>> *If I keep evaluation_interval: 15s, it starts triggering every minute.*
>>
>> *I want alerts to be triggered after 5 min and only if the condition is true.*
>> On Wednesday, March 5, 2025 at 2:18:34 PM UTC+5:30 Brian Candler wrote:
>>
>>> You still haven't shown an example of the actual alert you're concerned 
>>> about (for example, the E-mail containing all the labels and the 
>>> annotations)
>>>
>>> alertmanager cannot generate any alert unless Prometheus triggers it. 
>>> Please go into the PromQL web interface, switch to the "Graph" tab with the 
>>> default 1 hour time window (or less), and enter the following queries:
>>>
>>> up == 0
>>> rabbitmq_queue_consumers == 0
>>> rabbitmq_queue_messages > 10000
>>>
>>> Show the graphs.  If they are not blank, then alerts will be generated. 
>>>
>>> "*for: 30s*" has no effect when you have "*evaluation_interval: 5m*". I 
>>> suggest you use *evaluation_interval: 15s* (to match your scrape 
>>> interval), and then "for: 30s" will have some benefit; it will only send an 
>>> alert if the alerting condition has been true for two successive cycles.
>>>
>>> On Wednesday, 5 March 2025 at 07:50:23 UTC Amol Nagotkar wrote:
>>>
>>>> Thank you for the reply.
>>>>
>>>>
>>>> Answers to the points above:
>>>>
>>>> 1. I checked: the expression "up == 0" fires rarely, and all my targets 
>>>> are being scraped.
>>>>
>>>> 2. To avoid getting alerts every minute, I have now set *evaluation_interval 
>>>> to 5m*.
>>>>
>>>> 3. I have removed keep_firing_for as it is not suitable for my use case.
>>>>
>>>>
>>>> Updated:
>>>>
>>>> I am using prometheus alerting for rabbitmq. Below is the configuration 
>>>> I am using.
>>>>
>>>>
>>>> *prometheus.yml file*
>>>>
>>>> global:
>>>>   scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
>>>>   evaluation_interval: 5m # Evaluate rules every 5 minutes. The default is every 1 minute.
>>>>   # scrape_timeout is set to the global default (10s).
>>>>
>>>> alerting:
>>>>    alertmanagers:
>>>>        - static_configs:
>>>>            - targets:
>>>>                - ip:port
>>>>
>>>> rule_files:
>>>> - "alerts_rules.yml"
>>>>
>>>> scrape_configs:
>>>> - job_name: "prometheus"
>>>>   static_configs:
>>>>   - targets: ["ip:port"]
>>>>
>>>>
>>>> *alerts_rules.yml file*
>>>>
>>>> groups:
>>>> - name: instance_alerts
>>>>   rules:
>>>>   - alert: "Instance Down"
>>>>     expr: up == 0
>>>>     for: 30s
>>>>     # keep_firing_for: 30s
>>>>     labels:
>>>>       severity: "Critical"
>>>>     annotations:
>>>>       summary: "Endpoint {{ $labels.instance }} down"
>>>>       description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 sec."
>>>>
>>>> - name: rabbitmq_alerts
>>>>   rules:
>>>>     - alert: "Consumer down for last 1 min"
>>>>       expr: rabbitmq_queue_consumers == 0
>>>>       for: 30s
>>>>       # keep_firing_for: 30s
>>>>       labels:
>>>>         severity: Critical
>>>>       annotations:
>>>>         summary: "shortify | '{{ $labels.queue }}' has no consumers"
>>>>         description: "The queue '{{ $labels.queue }}' in vhost '{{ $labels.vhost }}' has zero consumers for more than 30 sec. Immediate attention is required."
>>>>
>>>>     - alert: "Total Messages > 10k in last 1 min"
>>>>       expr: rabbitmq_queue_messages > 10000
>>>>       for: 30s
>>>>       # keep_firing_for: 30s
>>>>       labels:
>>>>         severity: Critical
>>>>       annotations:
>>>>         summary: "'{{ $labels.queue }}' has total '{{ $value }}' messages for more than 1 min."
>>>>         description: |
>>>>           Queue {{ $labels.queue }} in RabbitMQ has total {{ $value }} messages for more than 1 min.
>>>>
>>>>
>>>> Even if there is no data in the queue, it sends me alerts. I have set 
>>>> *evaluation_interval: 5m* (Prometheus evaluates alert rules every 5 minutes) 
>>>> and *for: 30s* (ensures the alert fires only if the condition persists 
>>>> for 30s).
>>>>
>>>> I guess *for* is not working for me.
>>>>
>>>> By the way, *I am not using alertmanager* (
>>>> https://github.com/prometheus/alertmanager/releases/latest/download/alertmanager-0.28.0.linux-amd64.tar.gz
>>>> ).
>>>>
>>>> I am just using *prometheus* (
>>>> https://github.com/prometheus/prometheus/releases/download/v3.1.0/prometheus-3.1.0.linux-amd64.tar.gz
>>>> ) from https://prometheus.io/download/
>>>>
>>>> How can I solve this? Thank you in advance.
>>>>
>>>> On Saturday, February 15, 2025 at 12:13:01 AM UTC+5:30 Brian Candler 
>>>> wrote:
>>>>
>>>>> > Even if the application is not down, it sends alerts every 1 min. How do I 
>>>>> > debug this? I am using the expression below: alert: "Instance Down" expr: up == 0
>>>>>
>>>>> You need to show the actual alerts, from the Prometheus web interface 
>>>>> and/or the notifications, and then describe how these are different from 
>>>>> what you expect.
>>>>>
>>>>> I very much doubt that the expression "up == 0" is firing unless there 
>>>>> is at least one target which is not being scraped, and therefore the "up" 
>>>>> metric has a value of 0 for a particular timeseries (metric with a given 
>>>>> set of labels).
>>>>>
>>>>> > If the threshold is crossed and the value changes, it fires multiple alerts 
>>>>> > for the same alert rule; that's fine. But with the same '{{ $value }}' it 
>>>>> > should fire alerts only after 5 min. The same alert rule with the same value 
>>>>> > should not fire again for the next 5 min. How do I get this?
>>>>>
>>>>> I cannot work out what problem you are trying to describe. As long as 
>>>>> you only use '{{ $value }}' in annotations, not labels, then the same 
>>>>> alert will just continue firing.
>>>>>
>>>>> Whether you get repeated *notifications* about that ongoing alert is a 
>>>>> different matter. With "repeat_interval: 15m" you should get them every 
>>>>> 15 minutes at least. You may get additional notifications if a new alert is 
>>>>> added into the same alert group, or one is resolved from the alert group.
>>>>>
>>>>> > What are for, keep_firing_for and evaluation_interval?
>>>>>
>>>>> keep_firing_for is debouncing: once the alert condition has gone away, 
>>>>> it will continue firing for this period of time. This is so that if the 
>>>>> alert condition vanishes briefly but reappears, it doesn't cause the 
>>>>> alert to be resolved and then retriggered.
>>>>>
>>>>> evaluation_interval is how often the alerting expression is evaluated.
>>>>>
>>>>>
>>>>> On Friday, 14 February 2025 at 15:53:24 UTC Amol Nagotkar wrote:
>>>>>
>>>>>> Hi all,
>>>>>> I want the same alert (alert rule) to fire only after 5 min; currently I am 
>>>>>> getting the same alert (alert rule) every minute for the same 
>>>>>> '{{ $value }}'.
>>>>>> If the threshold is crossed and the value changes, it fires multiple alerts 
>>>>>> for the same alert rule; that's fine. But with the same '{{ $value }}' it 
>>>>>> should fire alerts only after 5 min. The same alert rule with the same value 
>>>>>> should not fire again for the next 5 min. How do I get this?
>>>>>> Even if the application is not down, it sends alerts every 1 min. How do I 
>>>>>> debug this? I am using the expression below: alert: "Instance Down" expr: up == 0
>>>>>> What are for, keep_firing_for and evaluation_interval?
>>>>>> prometheus.yml
>>>>>>
>>>>>> global:
>>>>>>   scrape_interval: 15s # Set the scrape interval to every 15 seconds. Default is every 1 minute.
>>>>>>   evaluation_interval: 15s # Evaluate rules every 15 seconds. The default is every 1 minute.
>>>>>>
>>>>>> alerting:
>>>>>>   alertmanagers:
>>>>>>     - static_configs:
>>>>>>         - targets:
>>>>>>             - ip:port
>>>>>>
>>>>>> rule_files:
>>>>>>   - "alerts_rules.yml"
>>>>>>
>>>>>> scrape_configs:
>>>>>>   - job_name: "prometheus"
>>>>>>     static_configs:
>>>>>>       - targets: ["ip:port"]
>>>>>>
>>>>>> alertmanager.yml
>>>>>> global:
>>>>>>   resolve_timeout: 5m
>>>>>> route:
>>>>>>   group_wait: 5s
>>>>>>   group_interval: 5m
>>>>>>   repeat_interval: 15m
>>>>>>   receiver: webhook_receiver
>>>>>> receivers:
>>>>>>   - name: webhook_receiver
>>>>>>     webhook_configs:
>>>>>>       - url: 'http://ip:port'
>>>>>>         send_resolved: false
>>>>>>
>>>>>> alerts_rules.yml
>>>>>>
>>>>>>
>>>>>> groups:
>>>>>> - name: instance_alerts
>>>>>>   rules:
>>>>>>   - alert: "Instance Down"
>>>>>>     expr: up == 0
>>>>>>     # for: 30s
>>>>>>     # keep_firing_for: 30s
>>>>>>     labels:
>>>>>>       severity: "Critical"
>>>>>>     annotations:
>>>>>>       summary: "Endpoint {{ $labels.instance }} down"
>>>>>>       description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 sec."
>>>>>>
>>>>>> - name: rabbitmq_alerts
>>>>>>   rules:
>>>>>>     - alert: "Consumer down for last 1 min"
>>>>>>       expr: rabbitmq_queue_consumers == 0
>>>>>>       # for: 1m
>>>>>>       # keep_firing_for: 30s
>>>>>>       labels:
>>>>>>         severity: Critical
>>>>>>       annotations:
>>>>>>         summary: "shortify | '{{ $labels.queue }}' has no consumers"
>>>>>>         description: "The queue '{{ $labels.queue }}' in vhost '{{ $labels.vhost }}' has zero consumers for more than 30 sec. Immediate attention is required."
>>>>>>
>>>>>>
>>>>>>     - alert: "Total Messages > 10k in last 1 min"
>>>>>>       expr: rabbitmq_queue_messages > 10000
>>>>>>       # for: 1m
>>>>>>       # keep_firing_for: 30s
>>>>>>       labels:
>>>>>>         severity: Critical
>>>>>>       annotations:
>>>>>>         summary: "'{{ $labels.queue }}' has total '{{ $value }}' messages for more than 1 min."
>>>>>>         description: |
>>>>>>           Queue {{ $labels.queue }} in RabbitMQ has total {{ $value }} messages for more than 1 min.
>>>>>>
>>>>>>
>>>>>> Thank you in advance.
>>>>>>
>>>>>

