Thanks for the reply.
1. When I keep evaluation_interval: 5m and for: 30s, I get alerts every 5
min. (Those alerts get stored in Prometheus and trigger every 5 min; I mean
even when the condition is no longer matching, I still get alerts every
5 min.)
Now I am changing the config to the below (see the sketch after this point):
evaluation_interval: 15s  # on the rule group, or globally
for: 5m                   # on the individual alerting rule(s)
I will update you about this soon.
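For reference, here is a minimal sketch of how I understand the two settings
would fit together in my rule file (this assumes the per-group "interval"
field is used instead of the global evaluation_interval; the rule shown is
just my existing RabbitMQ rule, shortened):

# alerts_rules.yml -- sketch only, not deployed yet
groups:
  - name: rabbitmq_alerts
    interval: 15s    # evaluate this group every 15 seconds
    rules:
      - alert: "Total Messages > 10k"
        expr: rabbitmq_queue_messages > 10000
        for: 5m      # must stay true for 20 consecutive 15s evaluations before firing
        labels:
          severity: Critical
        annotations:
          summary: "'{{ $labels.queue }}' has {{ $value }} messages for more than 5 min."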
2. "If you want a more readable string in the annotation, you can use
{{ $value | humanize }}, but it will lose some precision."
This is a serious concern for us. How do we solve this? (One idea is
sketched below.)
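One option I am thinking of trying (a sketch only, not tested yet): format
the value with printf instead of humanize. Since queue message counts are
integers, printf "%.0f" should avoid the scientific notation without
rounding, and the raw {{ $value }} can still be kept alongside:

annotations:
  summary: '{{ $labels.queue }} has {{ printf "%.0f" $value }} messages'
  # printf "%.0f" renders 1110738 rather than 1.110738e+06
  description: 'Exact value: {{ $value }} (humanized: {{ $value | humanize }})'

Does that sound like a reasonable approach, or is there a better way?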
On Wednesday, March 5, 2025 at 11:43:02 PM UTC+5:30 Brian Candler wrote:
> I notice that your "up == 0" graph shows lots of green which are values
> where up == 0. These are legitimately generating alerts, in my opinion. If
> you have set evaluation_interval to 5m, and "for:" to be less than 5m, then
> a single instance of up == 0 will send an alert, because that's what you
> asked for.
>
> *> I want alerts to be triggered after 5 min and only if the condition is true.*
>
> Then you want:
>
> evaluation_interval: 15s # on the rule group, or globally
> for: 5m # on the individual alerting rule(s)
>
> Then an alert will only be sent if the alert condition has been present
> consecutively for the whole 5 minutes (i.e. 20 cycles).
>
> Finally: you may find it helpful to include {{ $value }} in an annotation
> on each alerting rule, so you can tell the value which triggered the alert.
> I can see you've done this already in one of your alerts:
>
> - alert: "Total Messages > 10k in last 1 min"
> expr: rabbitmq_queue_messages > 10000
> ...
>
> annotations:
> summary: "'{{ $labels.queue }}' has total '*{{ $value }}*'
> messages for more than 1 min."
>
> And this is reflected in the alert:
>
> description: 'Queue QUEUE_NAME in RabbitMQ has total *1.110738e+06*
> messages\n' +
>
> 'for more than 1 minutes.\n',
>
> summary: "RabbitMQ Queue 'QUEUE_NAME' has more than 10L messages"
>
> rabbitmq_queue_messages is a vector containing zero or more instances of
> that metric.
>
> rabbitmq_queue_messages > 10000 is a reduced vector, containing only those
> instances of the metric with a value greater than 10000.
>
> You can see that the $value at the time the alert was generated
> was 1.110738e+06, which is 1,110,738, and that's clearly a lot more than
> 10,000. Hence you get an alert. It's what you asked for.
>
> If you want a more readable string in the annotation, you can use {{
> $value | humanize }}, but it will lose some precision.
>
> On Wednesday, 5 March 2025 at 10:28:15 UTC Amol Nagotkar wrote:
>
>> As you can see in the below images,
>> the last trigger was at 15:31:29,
>> and I receive emails after that time also, for example at 15:35,
>> 15:37, etc.
>> [image: IMG-20250305-WA0061.jpg]
>>
>> [image: IMG-20250305-WA0060.jpg]
>> On Wednesday, March 5, 2025 at 3:28:20 PM UTC+5:30 Amol Nagotkar wrote:
>>
>>>
>>> Thank you for the quick reply.
>>>
>>> So, as I told you, I am not using Alertmanager. I am getting alerts
>>> based on the config below:
>>>
>>> alerting:
>>>   alertmanagers:
>>>     - static_configs:
>>>         - targets:
>>>             - IP_ADDRESS_OF_EMAIL_APPLICATION:PORT
>>>
>>>
>>> written in the prometheus.yml file. Below is the alert response (an array
>>> of objects) I am receiving from Prometheus.
>>>
>>>
>>> [
>>>   {
>>>     annotations: {
>>>       description: 'Queue QUEUE_NAME in RabbitMQ has total 1.110738e+06 messages\n' +
>>>         'for more than 1 minutes.\n',
>>>       summary: "RabbitMQ Queue 'QUEUE_NAME' has more than 10L messages"
>>>     },
>>>     endsAt: '2025-02-03T06:33:31.893Z',
>>>     startsAt: '2025-02-03T06:28:31.893Z',
>>>     generatorURL: 'http://helo-container-pr:9091/graph?g0.expr=rabbitmq_queue_messages+%3E+1e%2B06&g0.tab=1',
>>>     labels: {
>>>       alertname: 'Total Messages > 10L in last 1 min',
>>>       instance: 'IP_ADDRESS:15692',
>>>       job: 'rabbitmq-rcs',
>>>       queue: 'QUEUE_NAME',
>>>       severity: 'critical',
>>>       vhost: 'webhook'
>>>     }
>>>   }
>>> ]
>>>
>>>
>>>
>>> *If I keep evaluation_interval: 15s, it starts triggering every minute.*
>>>
>>> *I want alerts to be triggered after 5 min and only if the condition is true.*
>>> On Wednesday, March 5, 2025 at 2:18:34 PM UTC+5:30 Brian Candler wrote:
>>>
>>>> You still haven't shown an example of the actual alert you're concerned
>>>> about (for example, the E-mail containing all the labels and the
>>>> annotations).
>>>>
>>>> alertmanager cannot generate any alert unless Prometheus triggers it.
>>>> Please go into the PromQL web interface, switch to the "Graph" tab with
>>>> the
>>>> default 1 hour time window (or less), and enter the following queries:
>>>>
>>>> up == 0
>>>> rabbitmq_queue_consumers == 0
>>>> rabbitmq_queue_messages > 10000
>>>>
>>>> Show the graphs. If they are not blank, then alerts will be generated.
>>>>
>>>> "*for: 30s" *has no effect when you have "*evaluation_interval: 5m".* I
>>>> suggest you use *evaluation_internal: 15s* (to match your scrape
>>>> internal), and then "for: 30s" will have some benefit; it will only send
>>>> an
>>>> alert if the alerting condition has been true for two successive cycles.
>>>>
>>>> On Wednesday, 5 March 2025 at 07:50:23 UTC Amol Nagotkar wrote:
>>>>
>>>>> Thank you for the reply.
>>>>>
>>>>>
>>>>> answers for above points-
>>>>>
>>>>> 1. I checked: the expression "up == 0" fires rarely, and all my targets
>>>>> are being scraped.
>>>>>
>>>>> 2. To avoid getting alerts every minute, I have now kept *evaluation_interval
>>>>> as 5m*.
>>>>>
>>>>> 3. I have removed keep_firing_for, as it is not suitable for my use
>>>>> case.
>>>>>
>>>>>
>>>>> Updated:
>>>>>
>>>>> I am using Prometheus alerting for RabbitMQ. Below is the
>>>>> configuration I am using.
>>>>>
>>>>>
>>>>> *prometheus.yml file*
>>>>>
>>>>> global:
>>>>>   scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
>>>>>   evaluation_interval: 5m  # Evaluate rules every 5 minutes. The default is every 1 minute.
>>>>>   # scrape_timeout is set to the global default (10s).
>>>>>
>>>>> alerting:
>>>>>   alertmanagers:
>>>>>     - static_configs:
>>>>>         - targets:
>>>>>             - ip:port
>>>>>
>>>>> rule_files:
>>>>>   - "alerts_rules.yml"
>>>>>
>>>>> scrape_configs:
>>>>>   - job_name: "prometheus"
>>>>>     static_configs:
>>>>>       - targets: ["ip:port"]
>>>>>
>>>>>
>>>>> *alerts_rules.yml file*
>>>>>
>>>>> groups:
>>>>>   - name: instance_alerts
>>>>>     rules:
>>>>>       - alert: "Instance Down"
>>>>>         expr: up == 0
>>>>>         for: 30s
>>>>>         # keep_firing_for: 30s
>>>>>         labels:
>>>>>           severity: "Critical"
>>>>>         annotations:
>>>>>           summary: "Endpoint {{ $labels.instance }} down"
>>>>>           description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 sec."
>>>>>
>>>>>   - name: rabbitmq_alerts
>>>>>     rules:
>>>>>       - alert: "Consumer down for last 1 min"
>>>>>         expr: rabbitmq_queue_consumers == 0
>>>>>         for: 30s
>>>>>         # keep_firing_for: 30s
>>>>>         labels:
>>>>>           severity: Critical
>>>>>         annotations:
>>>>>           summary: "shortify | '{{ $labels.queue }}' has no consumers"
>>>>>           description: "The queue '{{ $labels.queue }}' in vhost '{{ $labels.vhost }}' has zero consumers for more than 30 sec. Immediate attention is required."
>>>>>
>>>>>       - alert: "Total Messages > 10k in last 1 min"
>>>>>         expr: rabbitmq_queue_messages > 10000
>>>>>         for: 30s
>>>>>         # keep_firing_for: 30s
>>>>>         labels:
>>>>>           severity: Critical
>>>>>         annotations:
>>>>>           summary: "'{{ $labels.queue }}' has total '{{ $value }}' messages for more than 1 min."
>>>>>           description: |
>>>>>             Queue {{ $labels.queue }} in RabbitMQ has total {{ $value }} messages for more than 1 min.
>>>>>
>>>>>
>>>>> Even if there is no data in the queue, it sends me alerts. I have kept
>>>>> *evaluation_interval: 5m* (Prometheus evaluates alert rules every 5
>>>>> minutes) and *for: 30s* (ensures the alert fires only if the condition
>>>>> persists for 30s).
>>>>>
>>>>> I guess *for* is not working for me.
>>>>>
>>>>> By the way, *I am not using Alertmanager* (
>>>>> https://github.com/prometheus/alertmanager/releases/latest/download/alertmanager-0.28.0.linux-amd64.tar.gz
>>>>> ).
>>>>>
>>>>> I am just using *Prometheus* (
>>>>> https://github.com/prometheus/prometheus/releases/download/v3.1.0/prometheus-3.1.0.linux-amd64.tar.gz
>>>>> ).
>>>>>
>>>>> https://prometheus.io/download/
>>>>>
>>>>> How can I solve this? Thank you in advance.
>>>>>
>>>>> On Saturday, February 15, 2025 at 12:13:01 AM UTC+5:30 Brian Candler
>>>>> wrote:
>>>>>
>>>>>> > Even if the application is not down, it sends alerts every 1 min. How
>>>>>> do I debug this? I am using the below expression: alert: "Instance Down" expr: up == 0
>>>>>>
>>>>>> You need to show the actual alerts, from the Prometheus web interface
>>>>>> and/or the notifications, and then describe how these are different from
>>>>>> what you expect.
>>>>>>
>>>>>> I very much doubt that the expression "up == 0" is firing unless
>>>>>> there is at least one target which is not being scraped, and therefore
>>>>>> the
>>>>>> "up" metric has a value of 0 for a particular timeseries (metric with a
>>>>>> given set of labels).
>>>>>>
>>>>>> > If the threshold is crossed and the value changes, it fires multiple
>>>>>> alerts having the same alert rule; that's fine. But with the same
>>>>>> '{{ $value }}' it should fire alerts only after 5 min. The same alert
>>>>>> rule with the same value should not fire for the next 5 min. How do I
>>>>>> get this?
>>>>>>
>>>>>> I cannot work out what problem you are trying to describe. As long as
>>>>>> you only use '{{ $value }}' in annotations, not labels, then the same
>>>>>> alert
>>>>>> will just continue firing.
>>>>>>
>>>>>> Whether you get repeated *notifications* about that ongoing alert is
>>>>>> a different matter. With "repeat_interval: 15m" you should get them
>>>>>> every
>>>>>> 15 minutes at least. You may get additional notifications if a new alert
>>>>>> is
>>>>>> added into the same alert group, or one is resolved from the alert group.
>>>>>>
>>>>>> > What are for, keep_firing_for and evaluation_interval?
>>>>>>
>>>>>> keep_firing_for is debouncing: once the alert condition has gone
>>>>>> away, it will continue firing for this period of time. This is so that
>>>>>> if
>>>>>> the alert condition vanishes briefly but reappears, it doesn't cause the
>>>>>> alert to be resolved and then retriggered.
>>>>>>
>>>>>> evaluation_interval is how often the alerting expression is evaluated.
>>>>>>
>>>>>>
>>>>>> On Friday, 14 February 2025 at 15:53:24 UTC Amol Nagotkar wrote:
>>>>>>
>>>>>>> Hi all,
>>>>>>> I want the same alert (alert rule) to fire only after 5 min; currently
>>>>>>> I am getting the same alert (alert rule) every one minute for the same
>>>>>>> '{{ $value }}'.
>>>>>>> If the threshold is crossed and the value changes, it fires multiple
>>>>>>> alerts having the same alert rule; that's fine. But with the same
>>>>>>> '{{ $value }}' it should fire alerts only after 5 min. The same alert
>>>>>>> rule with the same value should not fire for the next 5 min. How do I
>>>>>>> get this?
>>>>>>> Even if the application is not down, it sends alerts every 1 min. How
>>>>>>> do I debug this? I am using the below expression: alert: "Instance Down" expr: up == 0
>>>>>>> What are for, keep_firing_for and evaluation_interval?
>>>>>>> prometheus.yml
>>>>>>>
>>>>>>> global:
>>>>>>>   scrape_interval: 15s      # Set the scrape interval to every 15 seconds. Default is every 1 minute.
>>>>>>>   evaluation_interval: 15s  # Evaluate rules every 15 seconds. The default is every 1 minute.
>>>>>>>
>>>>>>> alerting:
>>>>>>>   alertmanagers:
>>>>>>>     - static_configs:
>>>>>>>         - targets:
>>>>>>>             - ip:port
>>>>>>>
>>>>>>> rule_files:
>>>>>>>   - "alerts_rules.yml"
>>>>>>>
>>>>>>> scrape_configs:
>>>>>>>   - job_name: "prometheus"
>>>>>>>     static_configs:
>>>>>>>       - targets: ["ip:port"]
>>>>>>>
>>>>>>> alertmanager.yml
>>>>>>> global:
>>>>>>>   resolve_timeout: 5m
>>>>>>> route:
>>>>>>>   group_wait: 5s
>>>>>>>   group_interval: 5m
>>>>>>>   repeat_interval: 15m
>>>>>>>   receiver: webhook_receiver
>>>>>>> receivers:
>>>>>>>   - name: webhook_receiver
>>>>>>>     webhook_configs:
>>>>>>>       - url: 'http://ip:port'
>>>>>>>         send_resolved: false
>>>>>>>
>>>>>>> alerts_rules.yml
>>>>>>>
>>>>>>>
>>>>>>> groups:
>>>>>>>   - name: instance_alerts
>>>>>>>     rules:
>>>>>>>       - alert: "Instance Down"
>>>>>>>         expr: up == 0
>>>>>>>         # for: 30s
>>>>>>>         # keep_firing_for: 30s
>>>>>>>         labels:
>>>>>>>           severity: "Critical"
>>>>>>>         annotations:
>>>>>>>           summary: "Endpoint {{ $labels.instance }} down"
>>>>>>>           description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 sec."
>>>>>>>
>>>>>>>   - name: rabbitmq_alerts
>>>>>>>     rules:
>>>>>>>       - alert: "Consumer down for last 1 min"
>>>>>>>         expr: rabbitmq_queue_consumers == 0
>>>>>>>         # for: 1m
>>>>>>>         # keep_firing_for: 30s
>>>>>>>         labels:
>>>>>>>           severity: Critical
>>>>>>>         annotations:
>>>>>>>           summary: "shortify | '{{ $labels.queue }}' has no consumers"
>>>>>>>           description: "The queue '{{ $labels.queue }}' in vhost '{{ $labels.vhost }}' has zero consumers for more than 30 sec. Immediate attention is required."
>>>>>>>
>>>>>>>       - alert: "Total Messages > 10k in last 1 min"
>>>>>>>         expr: rabbitmq_queue_messages > 10000
>>>>>>>         # for: 1m
>>>>>>>         # keep_firing_for: 30s
>>>>>>>         labels:
>>>>>>>           severity: Critical
>>>>>>>         annotations:
>>>>>>>           summary: "'{{ $labels.queue }}' has total '{{ $value }}' messages for more than 1 min."
>>>>>>>           description: |
>>>>>>>             Queue {{ $labels.queue }} in RabbitMQ has total {{ $value }} messages for more than 1 min.
>>>>>>>
>>>>>>>
>>>>>>> Thank you in advance.
>>>>>>>
>>>>>>