One more important thing:
why do I receive alerts with the *same {{ $value }}* again and again? In RabbitMQ
it is possible to get different values, but it is unlikely that the value is exactly
the same *every* time. Yet I receive alerts carrying the same value many times.
On Thursday, March 6, 2025 at 10:53:02 AM UTC+5:30 Amol Nagotkar wrote:
> Thanks for the reply.
>
> 1. When I keep evaluation_interval: 5m and for: 30s, I get alerts every
> 5 min. (Those alerts get stored in Prometheus and trigger every 5 min; I
> mean even when the condition no longer matches, I still get alerts every
> 5 min.)
>
>
> Now I am changing the config to the following:
>
> evaluation_interval: 15s *# on the rule group, or globally*
>
> for: 5m *# on the individual alerting rule(s)*
>
> I will update you about this soon.
>
>
> 2. If you want a more readable string in the annotation, you can use {{
> $value | humanize }}, but it will lose some precision.
>
> This is a serious concern for us. How do we solve this?
>
> On Wednesday, March 5, 2025 at 11:43:02 PM UTC+5:30 Brian Candler wrote:
>
>> I notice that your "up == 0" graph shows lots of green which are values
>> where up == 0. These are legitimately generating alerts, in my opinion. If
>> you have set evaluation_interval to 5m, and "for:" to be less than 5m, then
>> a single instance of up == 0 will send an alert, because that's what you
>> asked for.
>>
>> *> I want alerts to be triggered after 5 min and only if the condition is true.*
>>
>> Then you want:
>>
>> evaluation_interval: 15s # on the rule group, or globally
>> for: 5m # on the individual alerting rule(s)
>>
>> Then an alert will only be sent if the alert condition has been present
>> consecutively for the whole 5 minutes (i.e. 20 cycles).
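>>
>> As a minimal sketch of where those two settings live (the group and alert
>> names below are just placeholders): the global setting in prometheus.yml is
>> spelled evaluation_interval, while the per-group override in the rule file
>> is spelled interval.
>>
>> # prometheus.yml
>> global:
>>   scrape_interval: 15s
>>   evaluation_interval: 15s
>>
>> # alerts_rules.yml
>> groups:
>>   - name: rabbitmq_alerts
>>     interval: 15s            # optional per-group override of evaluation_interval
>>     rules:
>>       - alert: "Total Messages > 10k"
>>         expr: rabbitmq_queue_messages > 10000
>>         for: 5m              # must stay true for 20 consecutive 15s evaluations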
>>
>> Finally: you may find it helpful to include {{ $value }} in an annotation
>> on each alerting rule, so you can tell the value which triggered the alert.
>> I can see you've done this already in one of your alerts:
>>
>> - alert: "Total Messages > 10k in last 1 min"
>> expr: rabbitmq_queue_messages > 10000
>> ...
>>
>> annotations:
>> summary: "'{{ $labels.queue }}' has total '*{{ $value }}*'
>> messages for more than 1 min."
>>
>> And this is reflected in the alert:
>>
>>     description: 'Queue QUEUE_NAME in RabbitMQ has total *1.110738e+06* messages\n' +
>>       'for more than 1 minutes.\n',
>>     summary: "RabbitMQ Queue 'QUEUE_NAME' has more than 10L messages"
>>
>> rabbitmq_queue_messages is a vector containing zero or more instances of
>> that metric.
>>
>> rabbitmq_queue_messages > 10000 is a reduced vector, containing only
>> those instances of the metric with a value greater than 10000.
>>
>> You can see that the $value at the time the alert was generated
>> was 1.110738e+06, which is 1,110,738, and that's clearly a lot more than
>> 10,000. Hence you get an alert. It's what you asked for.
>>
>> If you want a more readable string in the annotation, you can use {{
>> $value | humanize }}, but it will lose some precision.
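>>
>> As a sketch only (the queue label is from your rule; the wording is just an
>> example), the annotation could keep both forms, since humanize would render
>> 1110738 as approximately 1.111M:
>>
>> annotations:
>>   summary: >-
>>     '{{ $labels.queue }}' has {{ $value | humanize }} messages
>>     ({{ $value }} exactly) for more than 1 min.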
>>
>> On Wednesday, 5 March 2025 at 10:28:15 UTC Amol Nagotkar wrote:
>>
>>> As you can see in the images below,
>>> the last trigger was at 15:31:29,
>>> and I receive emails after that time as well, for example at 15:35,
>>> 15:37, etc.
>>> [image: IMG-20250305-WA0061.jpg]
>>>
>>> [image: IMG-20250305-WA0060.jpg]
>>> On Wednesday, March 5, 2025 at 3:28:20 PM UTC+5:30 Amol Nagotkar wrote:
>>>
>>>>
>>>> Thank you for the quick reply.
>>>>
>>>> So, as I told you, I am not using Alertmanager. I am getting alerts
>>>> based on the config below:
>>>>
>>>> alerting:
>>>>   alertmanagers:
>>>>     - static_configs:
>>>>         - targets:
>>>>             - IP_ADDRESS_OF_EMAIL_APPLICATION:PORT
>>>>
>>>>
>>>> written in the prometheus.yml file. Below is the alert response (an array
>>>> of objects) I am receiving from Prometheus.
>>>>
>>>>
>>>> [
>>>>   {
>>>>     annotations: {
>>>>       description: 'Queue QUEUE_NAME in RabbitMQ has total 1.110738e+06 messages\n' +
>>>>         'for more than 1 minutes.\n',
>>>>       summary: "RabbitMQ Queue 'QUEUE_NAME' has more than 10L messages"
>>>>     },
>>>>     endsAt: '2025-02-03T06:33:31.893Z',
>>>>     startsAt: '2025-02-03T06:28:31.893Z',
>>>>     generatorURL: 'http://helo-container-pr:9091/graph?g0.expr=rabbitmq_queue_messages+%3E+1e%2B06&g0.tab=1',
>>>>     labels: {
>>>>       alertname: 'Total Messages > 10L in last 1 min',
>>>>       instance: 'IP_ADDRESS:15692',
>>>>       job: 'rabbitmq-rcs',
>>>>       queue: 'QUEUE_NAME',
>>>>       severity: 'critical',
>>>>       vhost: 'webhook'
>>>>     }
>>>>   }
>>>> ]
>>>>
>>>>
>>>>
>>>> *If I keep evaluation_interval: 15s, it starts triggering every minute.*
>>>>
>>>> *I want alerts to be triggered after 5 min and only if the condition is true.*
>>>> On Wednesday, March 5, 2025 at 2:18:34 PM UTC+5:30 Brian Candler wrote:
>>>>
>>>>> You still haven't shown an example of the actual alert you're
>>>>> concerned about (for example, the E-mail containing all the labels and
>>>>> the annotations)
>>>>>
>>>>> alertmanager cannot generate any alert unless Prometheus triggers it.
>>>>> Please go into the PromQL web interface, switch to the "Graph" tab with
>>>>> the default 1 hour time window (or less), and enter the following queries:
>>>>>
>>>>> up == 0
>>>>> rabbitmq_queue_consumers == 0
>>>>> rabbitmq_queue_messages > 10000
>>>>>
>>>>> Show the graphs. If they are not blank, then alerts will be
>>>>> generated.
>>>>>
>>>>> "*for: 30s" *has no effect when you have "*evaluation_interval: 5m".* I
>>>>> suggest you use *evaluation_internal: 15s* (to match your scrape
>>>>> internal), and then "for: 30s" will have some benefit; it will only send
>>>>> an
>>>>> alert if the alerting condition has been true for two successive cycles.
>>>>>
>>>>> On Wednesday, 5 March 2025 at 07:50:23 UTC Amol Nagotkar wrote:
>>>>>
>>>>>> Thank you for the reply.
>>>>>>
>>>>>>
>>>>>> Answers for the above points:
>>>>>>
>>>>>> 1. I checked: the expression "up == 0" fires rarely and all my targets
>>>>>> are being scraped.
>>>>>>
>>>>>> 2. To avoid getting alerts every minute, I have now set *evaluation_interval
>>>>>> to 5m*.
>>>>>>
>>>>>> 3. I have removed keep_firing_for as it is not suitable for my use
>>>>>> case.
>>>>>>
>>>>>>
>>>>>> Updated:
>>>>>>
>>>>>> I am using Prometheus alerting for RabbitMQ. Below is the
>>>>>> configuration I am using.
>>>>>>
>>>>>>
>>>>>> *prometheus.yml file*
>>>>>>
>>>>>> global:
>>>>>>   scrape_interval: 15s     # Set the scrape interval to every 15 seconds. Default is every 1 minute.
>>>>>>   evaluation_interval: 5m  # Evaluate rules every 5 minutes. The default is every 1 minute.
>>>>>>   # scrape_timeout is set to the global default (10s).
>>>>>>
>>>>>> alerting:
>>>>>>   alertmanagers:
>>>>>>     - static_configs:
>>>>>>         - targets:
>>>>>>             - ip:port
>>>>>>
>>>>>> rule_files:
>>>>>>   - "alerts_rules.yml"
>>>>>>
>>>>>> scrape_configs:
>>>>>>   - job_name: "prometheus"
>>>>>>     static_configs:
>>>>>>       - targets: ["ip:port"]
>>>>>>
>>>>>> *alerts_rules.yml file*
>>>>>>
>>>>>> groups:
>>>>>>   - name: instance_alerts
>>>>>>     rules:
>>>>>>       - alert: "Instance Down"
>>>>>>         expr: up == 0
>>>>>>         for: 30s
>>>>>>         # keep_firing_for: 30s
>>>>>>         labels:
>>>>>>           severity: "Critical"
>>>>>>         annotations:
>>>>>>           summary: "Endpoint {{ $labels.instance }} down"
>>>>>>           description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 sec."
>>>>>>
>>>>>>   - name: rabbitmq_alerts
>>>>>>     rules:
>>>>>>       - alert: "Consumer down for last 1 min"
>>>>>>         expr: rabbitmq_queue_consumers == 0
>>>>>>         for: 30s
>>>>>>         # keep_firing_for: 30s
>>>>>>         labels:
>>>>>>           severity: Critical
>>>>>>         annotations:
>>>>>>           summary: "shortify | '{{ $labels.queue }}' has no consumers"
>>>>>>           description: "The queue '{{ $labels.queue }}' in vhost '{{ $labels.vhost }}' has zero consumers for more than 30 sec. Immediate attention is required."
>>>>>>
>>>>>>       - alert: "Total Messages > 10k in last 1 min"
>>>>>>         expr: rabbitmq_queue_messages > 10000
>>>>>>         for: 30s
>>>>>>         # keep_firing_for: 30s
>>>>>>         labels:
>>>>>>           severity: Critical
>>>>>>         annotations:
>>>>>>           summary: "'{{ $labels.queue }}' has total '{{ $value }}' messages for more than 1 min."
>>>>>>           description: |
>>>>>>             Queue {{ $labels.queue }} in RabbitMQ has total {{ $value }} messages for more than 1 min.
>>>>>>
>>>>>>
>>>>>> Even if there is no data in the queue, it sends me alerts. I have kept
>>>>>> *evaluation_interval: 5m* (Prometheus evaluates alert rules every 5 minutes)
>>>>>> and *for: 30s* (so the alert fires only if the condition persists for 30s).
>>>>>>
>>>>>> I guess *for* is not working for me.
>>>>>>
>>>>>> By the way, *I am not using Alertmanager* (
>>>>>> https://github.com/prometheus/alertmanager/releases/latest/download/alertmanager-0.28.0.linux-amd64.tar.gz
>>>>>> ).
>>>>>>
>>>>>> I am just using *Prometheus* (
>>>>>> https://github.com/prometheus/prometheus/releases/download/v3.1.0/prometheus-3.1.0.linux-amd64.tar.gz
>>>>>> ).
>>>>>>
>>>>>> https://prometheus.io/download/
>>>>>>
>>>>>> How can I solve this? Thank you in advance.
>>>>>>
>>>>>> On Saturday, February 15, 2025 at 12:13:01 AM UTC+5:30 Brian Candler
>>>>>> wrote:
>>>>>>
>>>>>>> > even if the application is not down, it sends alerts every 1 min. How do I
>>>>>>> debug this? I am using the below expression: alert: "Instance Down" expr: up == 0
>>>>>>>
>>>>>>> You need to show the actual alerts, from the Prometheus web
>>>>>>> interface and/or the notifications, and then describe how these are
>>>>>>> different from what you expect.
>>>>>>>
>>>>>>> I very much doubt that the expression "up == 0" is firing unless
>>>>>>> there is at least one target which is not being scraped, and therefore
>>>>>>> the "up" metric has a value of 0 for a particular timeseries (metric
>>>>>>> with a given set of labels).
>>>>>>>
>>>>>>> > if the threshold is crossed and the value changes, it fires multiple
>>>>>>> alerts for the same alert rule; that's fine. But with the same '{{ $value }}'
>>>>>>> it should fire alerts only after 5 min. The same alert rule with the same
>>>>>>> value should not fire for the next 5 min. How do I get this?
>>>>>>>
>>>>>>> I cannot work out what problem you are trying to describe. As long
>>>>>>> as you only use '{{ $value }}' in annotations, not labels, then the
>>>>>>> same alert will just continue firing.
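>>>>>>>
>>>>>>> A minimal sketch of the difference (the rule fragment and the commented-out
>>>>>>> "value" label are only illustrative):
>>>>>>>
>>>>>>>       - alert: "Total Messages > 10k in last 1 min"
>>>>>>>         expr: rabbitmq_queue_messages > 10000
>>>>>>>         labels:
>>>>>>>           severity: Critical
>>>>>>>           # value: "{{ $value }}"   # avoid: each change of value produces a
>>>>>>>           #                         # new label set, i.e. a brand-new alert
>>>>>>>         annotations:
>>>>>>>           summary: "{{ $labels.queue }} has {{ $value }} messages"   # safe here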
>>>>>>>
>>>>>>> Whether you get repeated *notifications* about that ongoing alert is
>>>>>>> a different matter. With "repeat_interval: 15m" you should get them
>>>>>>> every 15 minutes at least. You may get additional notifications if a
>>>>>>> new alert is added into the same alert group, or one is resolved from
>>>>>>> the alert group.
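>>>>>>>
>>>>>>> If you do route through Alertmanager, those notification knobs sit on the
>>>>>>> route; a sketch based on the alertmanager.yml you posted below (durations
>>>>>>> are only examples):
>>>>>>>
>>>>>>> route:
>>>>>>>   receiver: webhook_receiver
>>>>>>>   group_wait: 30s        # delay before the first notification for a new group
>>>>>>>   group_interval: 5m     # delay before notifying about changes within a group
>>>>>>>   repeat_interval: 15m   # minimum gap before re-sending a still-firing group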
>>>>>>>
>>>>>>> > what are for, keep_firing_for and evaluation_interval?
>>>>>>>
>>>>>>> keep_firing_for is debouncing: once the alert condition has gone
>>>>>>> away, it will continue firing for this period of time. This is so that
>>>>>>> if the alert condition vanishes briefly but reappears, it doesn't
>>>>>>> cause the alert to be resolved and then retriggered.
>>>>>>>
>>>>>>> evaluation_interval is how often the alerting expression is
>>>>>>> evaluated.
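>>>>>>>
>>>>>>> A minimal sketch of how "for" and "keep_firing_for" combine on a single rule
>>>>>>> (the durations are only examples):
>>>>>>>
>>>>>>>       - alert: "Consumer down"
>>>>>>>         expr: rabbitmq_queue_consumers == 0
>>>>>>>         for: 1m              # condition must be continuously true for 1m before firing
>>>>>>>         keep_firing_for: 5m  # once firing, stays firing for 5m after the condition clears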
>>>>>>>
>>>>>>>
>>>>>>> On Friday, 14 February 2025 at 15:53:24 UTC Amol Nagotkar wrote:
>>>>>>>
>>>>>>>> Hi all,
>>>>>>>> I want the same alert (alert rule) to fire again only after 5 min; currently
>>>>>>>> I am getting the same alert (alert rule) every minute with the same
>>>>>>>> '{{ $value }}'.
>>>>>>>> If the threshold is crossed and the value changes, it fires multiple alerts
>>>>>>>> for the same alert rule; that's fine. But with the same '{{ $value }}' it
>>>>>>>> should fire alerts only after 5 min. The same alert rule with the same value
>>>>>>>> should not fire for the next 5 min. How do I get this?
>>>>>>>> Even if the application is not down, it sends alerts every 1 min. How do I
>>>>>>>> debug this? I am using the below expression: alert: "Instance Down" expr: up == 0
>>>>>>>> What are for, keep_firing_for and evaluation_interval?
>>>>>>>> prometheus.yml
>>>>>>>>
>>>>>>>> global:
>>>>>>>>   scrape_interval: 15s      # Set the scrape interval to every 15 seconds. Default is every 1 minute.
>>>>>>>>   evaluation_interval: 15s  # Evaluate rules every 15 seconds. The default is every 1 minute.
>>>>>>>>
>>>>>>>> alerting:
>>>>>>>>   alertmanagers:
>>>>>>>>     - static_configs:
>>>>>>>>         - targets:
>>>>>>>>             - ip:port
>>>>>>>>
>>>>>>>> rule_files:
>>>>>>>>   - "alerts_rules.yml"
>>>>>>>>
>>>>>>>> scrape_configs:
>>>>>>>>   - job_name: "prometheus"
>>>>>>>>     static_configs:
>>>>>>>>       - targets: ["ip:port"]
>>>>>>>>
>>>>>>>> alertmanager.yml
>>>>>>>>
>>>>>>>> global:
>>>>>>>>   resolve_timeout: 5m
>>>>>>>> route:
>>>>>>>>   group_wait: 5s
>>>>>>>>   group_interval: 5m
>>>>>>>>   repeat_interval: 15m
>>>>>>>>   receiver: webhook_receiver
>>>>>>>> receivers:
>>>>>>>>   - name: webhook_receiver
>>>>>>>>     webhook_configs:
>>>>>>>>       - url: 'http://ip:port'
>>>>>>>>         send_resolved: false
>>>>>>>>
>>>>>>>> alerts_rules.yml
>>>>>>>>
>>>>>>>>
>>>>>>>> groups:
>>>>>>>>   - name: instance_alerts
>>>>>>>>     rules:
>>>>>>>>       - alert: "Instance Down"
>>>>>>>>         expr: up == 0
>>>>>>>>         # for: 30s
>>>>>>>>         # keep_firing_for: 30s
>>>>>>>>         labels:
>>>>>>>>           severity: "Critical"
>>>>>>>>         annotations:
>>>>>>>>           summary: "Endpoint {{ $labels.instance }} down"
>>>>>>>>           description: "{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 30 sec."
>>>>>>>>
>>>>>>>>   - name: rabbitmq_alerts
>>>>>>>>     rules:
>>>>>>>>       - alert: "Consumer down for last 1 min"
>>>>>>>>         expr: rabbitmq_queue_consumers == 0
>>>>>>>>         # for: 1m
>>>>>>>>         # keep_firing_for: 30s
>>>>>>>>         labels:
>>>>>>>>           severity: Critical
>>>>>>>>         annotations:
>>>>>>>>           summary: "shortify | '{{ $labels.queue }}' has no consumers"
>>>>>>>>           description: "The queue '{{ $labels.queue }}' in vhost '{{ $labels.vhost }}' has zero consumers for more than 30 sec. Immediate attention is required."
>>>>>>>>
>>>>>>>>       - alert: "Total Messages > 10k in last 1 min"
>>>>>>>>         expr: rabbitmq_queue_messages > 10000
>>>>>>>>         # for: 1m
>>>>>>>>         # keep_firing_for: 30s
>>>>>>>>         labels:
>>>>>>>>           severity: Critical
>>>>>>>>         annotations:
>>>>>>>>           summary: "'{{ $labels.queue }}' has total '{{ $value }}' messages for more than 1 min."
>>>>>>>>           description: |
>>>>>>>>             Queue {{ $labels.queue }} in RabbitMQ has total {{ $value }} messages for more than 1 min.
>>>>>>>>
>>>>>>>>
>>>>>>>> Thank you in advance.
>>>>>>>>
>>>>>>>