Re: [prometheus-users] Re: Alerts are getting auto resolved automatically

2022-07-05 Thread Venkatraman Natarajan
Hi Stuart,

Yes, I can see both cluster peers, and the status page shows that the cluster
is ready.

[image: image.png]

Thanks,
Venkatraman N

On Tue, Jul 5, 2022 at 1:19 PM Stuart Clark wrote:

> Getting two alerts suggests that the two instances aren't talking to each
> other. How have you configured them? Does the UI show the "other" instance?
>
> On 5 July 2022 08:34:45 BST, Venkatraman Natarajan wrote:
>>
>> Thanks Brian. I have used a last_over_time query in our alert expression
>> instead of turning off resolved notifications.
>>
>> Also, we have two Alertmanagers in our environment. Both are up and
>> running, but we are now getting two alerts, one from each Alertmanager.
>> Could you please help me sort out this issue as well?
>>
>> Please find the Alertmanager configuration below.
>>
>>   alertmanager0:
>>     image: prom/alertmanager
>>     container_name: alertmanager0
>>     user: rootuser
>>     volumes:
>>       - ../data:/data
>>       - ../config/alertmanager.yml:/etc/alertmanager/alertmanager.yml
>>     command:
>>       - '--config.file=/etc/alertmanager/alertmanager.yml'
>>       - '--storage.path=/data/alert0'
>>       - '--cluster.listen-address=0.0.0.0:6783'
>>       - '--cluster.peer={{ IP Address }}:6783'
>>       - '--cluster.peer={{ IP Address }}:6783'
>>     restart: unless-stopped
>>     logging:
>>       driver: "json-file"
>>       options:
>>         max-size: "10m"
>>         max-file: "2"
>>     ports:
>>       - 9093:9093
>>       - 6783:6783
>>     networks:
>>       - network
>>
>> Regards,
>> Venkatraman N
>>
>>
>>
>> On Sat, Jun 25, 2022 at 9:05 PM Brian Candler wrote:
>>
>>> If probe_success becomes non-zero, even for a single evaluation
>>> interval, then the alert will be immediately resolved.  There is no delay
>>> on resolving, like there is for pending->firing ("for: 5m").
>>>
>>> I suggest you enter the alerting expression, e.g. "probe_success == 0",
>>> into the PromQL web interface (query browser), and switch to Graph view,
>>> and zoom in.  If you see any gaps in the graph, then the alert was resolved
>>> at that instant.
>>>
>>> Conversely, use the query
>>> probe_success{instance="xxx"} != 0
>>> to look at a particular timeseries, as identified by the label(s), and
>>> see if there are any dots shown where the value is non-zero.
>>>
>>> To make your alerts more robust you may need to use queries with range
>>> vectors, e.g. min_over_time(foo[5m]) or max_over_time(foo[5m]) or whatever.
>>>
>>> As a general rule though: you should consider carefully whether you want
>>> to send *any* notification for resolved alerts.  Personally, I have
>>> switched to send_resolved = false.  There are some good explanations here:
>>>
>>> https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped
>>>
>>> https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/
>>>
>>> You don't want to build a culture where people ignore alerts because the
>>> alert cleared itself - or is expected to clear itself.
>>>
>>> You want the alert condition to trigger a *process*, which is an
>>> investigation of *why* the alert happened, *what* caused it, whether the
>>> underlying cause has been fixed, and whether the alerting rule itself was
>>> wrong.  When all that has been investigated, manually close the ticket.
>>> The fact that the alert has gone below threshold doesn't mean that this
>>> work no longer needs to be done.
>>>
>>> On Saturday, 25 June 2022 at 13:27:22 UTC+1 v.ra...@gmail.com wrote:
>>>
 Hi Team,

 We have two Prometheus servers and two Alertmanagers running as
 containers on separate VMs.

 Alerts are getting auto-resolved even though the underlying issue is
 still present according to the threshold.

 For example, we have an alert rule with the expression probe_success == 0.
 It triggers an alert, but after some time the alert gets auto-resolved
 because we have enabled send_resolved = true, even though probe_success is
 still 0. We don't want the alerts to auto-resolve in that case.

 Could you please help us with this?

 Thanks,
 Venkatraman N



Re: [prometheus-users] Re: Alerts are getting auto resolved automatically

2022-07-05 Thread Stuart Clark
Getting two alerts suggests that the two instances aren't talking to each
other. How have you configured them? Does the UI show the "other" instance?
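
For the pair to deduplicate notifications, each Prometheus normally needs to
send alerts to *both* Alertmanagers, and the Alertmanagers need to gossip with
each other. A minimal sketch of the Prometheus side, with placeholder
hostnames rather than your real addresses:

  alerting:
    alertmanagers:
      - static_configs:
          - targets:
              # every Prometheus delivers every alert to both instances;
              # the cluster then deduplicates the notifications
              - alertmanager0.example.internal:9093
              - alertmanager1.example.internal:9093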

On 5 July 2022 08:34:45 BST, Venkatraman Natarajan  wrote:
>Thanks Brian. I have used a last_over_time query in our alert expression
>instead of turning off resolved notifications.
>
>Also, we have two Alertmanagers in our environment. Both are up and
>running, but we are now getting two alerts, one from each Alertmanager.
>Could you please help me sort out this issue as well?
>
>Please find the Alertmanager configuration below.
>
>  alertmanager0:
>    image: prom/alertmanager
>    container_name: alertmanager0
>    user: rootuser
>    volumes:
>      - ../data:/data
>      - ../config/alertmanager.yml:/etc/alertmanager/alertmanager.yml
>    command:
>      - '--config.file=/etc/alertmanager/alertmanager.yml'
>      - '--storage.path=/data/alert0'
>      - '--cluster.listen-address=0.0.0.0:6783'
>      - '--cluster.peer={{ IP Address }}:6783'
>      - '--cluster.peer={{ IP Address }}:6783'
>    restart: unless-stopped
>    logging:
>      driver: "json-file"
>      options:
>        max-size: "10m"
>        max-file: "2"
>    ports:
>      - 9093:9093
>      - 6783:6783
>    networks:
>      - network
>
>Regards,
>Venkatraman N
>
>
>
>On Sat, Jun 25, 2022 at 9:05 PM Brian Candler  wrote:
>
>> If probe_success becomes non-zero, even for a single evaluation interval,
>> then the alert will be immediately resolved.  There is no delay on
>> resolving, like there is for pending->firing ("for: 5m").
>>
>> I suggest you enter the alerting expression, e.g. "probe_success == 0",
>> into the PromQL web interface (query browser), and switch to Graph view,
>> and zoom in.  If you see any gaps in the graph, then the alert was resolved
>> at that instant.
>>
>> Conversely, use the query
>> probe_success{instance="xxx"} != 0
>> to look at a particular timeseries, as identified by the label(s), and see
>> if there are any dots shown where the value is non-zero.
>>
>> To make your alerts more robust you may need to use queries with range
>> vectors, e.g. min_over_time(foo[5m]) or max_over_time(foo[5m]) or whatever.
>>
>> As a general rule though: you should consider carefully whether you want
>> to send *any* notification for resolved alerts.  Personally, I have
>> switched to send_resolved = false.  There are some good explanations here:
>>
>> https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped
>>
>> https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/
>>
>> You don't want to build a culture where people ignore alerts because the
>> alert cleared itself - or is expected to clear itself.
>>
>> You want the alert condition to trigger a *process*, which is an
>> investigation of *why* the alert happened, *what* caused it, whether the
>> underlying cause has been fixed, and whether the alerting rule itself was
>> wrong.  When all that has been investigated, manually close the ticket.
>> The fact that the alert has gone below threshold doesn't mean that this
>> work no longer needs to be done.
>>
>> On Saturday, 25 June 2022 at 13:27:22 UTC+1 v.ra...@gmail.com wrote:
>>
>>> Hi Team,
>>>
>>> We have two Prometheus servers and two Alertmanagers running as
>>> containers on separate VMs.
>>>
>>> Alerts are getting auto-resolved even though the underlying issue is
>>> still present according to the threshold.
>>>
>>> For example, we have an alert rule with the expression probe_success == 0.
>>> It triggers an alert, but after some time the alert gets auto-resolved
>>> because we have enabled send_resolved = true, even though probe_success is
>>> still 0. We don't want the alerts to auto-resolve in that case.
>>>
>>> Could you please help us with this?
>>>
>>> Thanks,
>>> Venkatraman N
>>>
>

-- 
Sent from my Android device with K-9 Mail. Please excuse my brevity.


Re: [prometheus-users] Re: Alerts are getting auto resolved automatically

2022-07-05 Thread Venkatraman Natarajan
Thanks Brian. I have used a last_over_time query in our alert expression
instead of turning off resolved notifications.
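
For illustration, the kind of expression I mean is sketched below; the metric
and the 10m lookback window are placeholders rather than our exact rule:

  # keep firing as long as the most recent sample within the last 10m is 0,
  # so a short scrape gap no longer resolves the alert
  last_over_time(probe_success[10m]) == 0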

Also, we have two Alertmanagers in our environment. Both are up and
running, but we are now getting two alerts, one from each Alertmanager.
Could you please help me sort out this issue as well?

Please find the Alertmanager configuration below.

  alertmanager0:
    image: prom/alertmanager
    container_name: alertmanager0
    user: rootuser
    volumes:
      - ../data:/data
      - ../config/alertmanager.yml:/etc/alertmanager/alertmanager.yml
    command:
      - '--config.file=/etc/alertmanager/alertmanager.yml'
      - '--storage.path=/data/alert0'
      - '--cluster.listen-address=0.0.0.0:6783'
      - '--cluster.peer={{ IP Address }}:6783'
      - '--cluster.peer={{ IP Address }}:6783'
    restart: unless-stopped
    logging:
      driver: "json-file"
      options:
        max-size: "10m"
        max-file: "2"
    ports:
      - 9093:9093
      - 6783:6783
    networks:
      - network
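
For reference, my understanding is that the peer flags should point each
instance at the other one, roughly as sketched below (the hostnames are
placeholders, not our real addresses):

  # on the VM running alertmanager0
  - '--cluster.listen-address=0.0.0.0:6783'
  - '--cluster.peer=alertmanager1.example.internal:6783'

  # on the VM running alertmanager1
  - '--cluster.listen-address=0.0.0.0:6783'
  - '--cluster.peer=alertmanager0.example.internal:6783'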

Regards,
Venkatraman N



On Sat, Jun 25, 2022 at 9:05 PM Brian Candler  wrote:

> If probe_success becomes non-zero, even for a single evaluation interval,
> then the alert will be immediately resolved.  There is no delay on
> resolving, like there is for pending->firing ("for: 5m").
>
> I suggest you enter the alerting expression, e.g. "probe_success == 0",
> into the PromQL web interface (query browser), and switch to Graph view,
> and zoom in.  If you see any gaps in the graph, then the alert was resolved
> at that instant.
>
> Conversely, use the query
> probe_success{instance="xxx"} != 0
> to look at a particular timeseries, as identified by the label(s), and see
> if there are any dots shown where the value is non-zero.
>
> To make your alerts more robust you may need to use queries with range
> vectors, e.g. min_over_time(foo[5m]) or max_over_time(foo[5m]) or whatever.
>
> As a general rule though: you should consider carefully whether you want
> to send *any* notification for resolved alerts.  Personally, I have
> switched to send_resolved = false.  There are some good explanations here:
>
> https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped
>
> https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/
>
> You don't want to build a culture where people ignore alerts because the
> alert cleared itself - or is expected to clear itself.
>
> You want the alert condition to trigger a *process*, which is an
> investigation of *why* the alert happened, *what* caused it, whether the
> underlying cause has been fixed, and whether the alerting rule itself was
> wrong.  When all that has been investigated, manually close the ticket.
> The fact that the alert has gone below threshold doesn't mean that this
> work no longer needs to be done.
>
> On Saturday, 25 June 2022 at 13:27:22 UTC+1 v.ra...@gmail.com wrote:
>
>> Hi Team,
>>
>> We have two Prometheus servers and two Alertmanagers running as
>> containers on separate VMs.
>>
>> Alerts are getting auto-resolved even though the underlying issue is
>> still present according to the threshold.
>>
>> For example, we have an alert rule with the expression probe_success == 0.
>> It triggers an alert, but after some time the alert gets auto-resolved
>> because we have enabled send_resolved = true, even though probe_success is
>> still 0. We don't want the alerts to auto-resolve in that case.
>>
>> Could you please help us with this?
>>
>> Thanks,
>> Venkatraman N
>>



[prometheus-users] Re: Alerts are getting auto resolved automatically

2022-06-25 Thread Brian Candler
If probe_success becomes non-zero, even for a single evaluation interval, 
then the alert will be immediately resolved.  There is no delay on 
resolving, like there is for pending->firing ("for: 5m").
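
To make that concrete, a minimal rule of the kind being discussed might look
like the sketch below (the rule name, labels and annotation are invented for
the example). The "for: 5m" delays firing, but nothing delays resolving:

  groups:
    - name: blackbox
      rules:
        - alert: ProbeFailed
          # must evaluate true for 5 consecutive minutes before firing,
          # but a single non-zero sample resolves it immediately
          expr: probe_success == 0
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "Probe to {{ $labels.instance }} is failing"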

I suggest you enter the alerting expression, e.g. "probe_success == 0", 
into the PromQL web interface (query browser), and switch to Graph view, 
and zoom in.  If you see any gaps in the graph, then the alert was resolved 
at that instant.

Conversely, use the query
probe_success{instance="xxx"} != 0
to look at a particular timeseries, as identified by the label(s), and see 
if there are any dots shown where the value is non-zero.

To make your alerts more robust you may need to use queries with range 
vectors, e.g. min_over_time(foo[5m]) or max_over_time(foo[5m]) or whatever.
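
For example, something along these lines only resolves once the probe has
actually succeeded at some point within the trailing window (the 5m window is
just an illustration):

  # fires while every sample in the last 5m is 0; a single transient
  # success no longer clears the alert instantly
  max_over_time(probe_success[5m]) == 0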

As a general rule though: you should consider carefully whether you want to 
send *any* notification for resolved alerts.  Personally, I have switched 
to send_resolved = false.  There are some good explanations here:
https://www.robustperception.io/running-into-burning-buildings-because-the-fire-alarm-stopped
https://docs.google.com/document/d/199PqyG3UsyXlwieHaqbGiWVa8eMWi8zzAn0YfcApr8Q/
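
In alertmanager.yml, send_resolved is set per notification integration inside
a receiver; a minimal sketch with a hypothetical webhook receiver:

  receivers:
    - name: 'ticketing'      # hypothetical receiver name
      webhook_configs:
        - url: 'http://ticketing.example.internal/alert'   # placeholder URL
          send_resolved: false   # never send "resolved" notifications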

You don't want to build a culture where people ignore alerts because the 
alert cleared itself - or is expected to clear itself.

You want the alert condition to trigger a *process*, which is an 
investigation of *why* the alert happened, *what* caused it, whether the 
underlying cause has been fixed, and whether the alerting rule itself was 
wrong.  When all that has been investigated, manually close the ticket.  
The fact that the alert has gone below threshold doesn't mean that this 
work no longer needs to be done.

On Saturday, 25 June 2022 at 13:27:22 UTC+1 v.ra...@gmail.com wrote:

> Hi Team,
>
> We have two Prometheus servers and two Alertmanagers running as
> containers on separate VMs.
>
> Alerts are getting auto-resolved even though the underlying issue is
> still present according to the threshold.
>
> For example, we have an alert rule with the expression probe_success == 0.
> It triggers an alert, but after some time the alert gets auto-resolved
> because we have enabled send_resolved = true, even though probe_success is
> still 0. We don't want the alerts to auto-resolve in that case.
>
> Could you please help us with this?
>
> Thanks,
> Venkatraman N
>
