Re: [prometheus-users] Re: Discrepancy in Alert Rule Evaluation.

2020-11-08 Thread Brian Candler
On Sunday, 8 November 2020 12:10:54 UTC, Yagyansh S. Kumar wrote:
> I'll try and get a backtrace and post it here.
>
> But still the question remains: if BBE is returning probe_success 0, why
> is it doing so only for 2.20.1 🙄.
It could be that 2.12 is missing the data point (scrape) entirely.

Re: [prometheus-users] Re: Discrepancy in Alert Rule Evaluation.

2020-11-08 Thread Yagyansh S. Kumar
I'll try and get a backtrace and post it here. But still the question remains: if BBE is returning probe_success 0, why is it doing so only for 2.20.1 🙄.
On Sat, 7 Nov, 2020, 11:33 pm Brian Candler, wrote:
> I don't think it's a false alert. If it's the rule you showed, then the
> only way you

Re: [prometheus-users] Re: Discrepancy in Alert Rule Evaluation.

2020-11-07 Thread Brian Candler
I don't think it's a false alert. If it's the rule you showed, then the only way you can get an alert is if the metric probe_success has value zero. You should try to understand *why* BBE is returning zero; if necessary use tcpdump or wireshark to capture the HTTP traffic to and from it. But

Re: [prometheus-users] Re: Discrepancy in Alert Rule Evaluation.

2020-11-07 Thread Brian Candler
On Saturday, 7 November 2020 13:35:47 UTC, Yagyansh S. Kumar wrote:
>> Try looking at scrape_duration_seconds{job="Ping-All-Servers"}. Maybe
>> it's borderline to the scrape interval.
>
> That's interesting. Here are the top 20 scrape_duration_seconds maxed
> for last 1 hour by instance. Close

Re: [prometheus-users] Re: Discrepancy in Alert Rule Evaluation.

2020-11-07 Thread Yagyansh S. Kumar
> Try looking at scrape_duration_seconds{job="Ping-All-Servers"}. Maybe
> it's borderline to the scrape interval.
That's interesting. Here are the top 20 scrape_duration_seconds maxed for last 1 hour by instance. Close to 5 seconds. Can this lead to some issue? But again the thing comes why no

Re: [prometheus-users] Re: Discrepancy in Alert Rule Evaluation.

2020-11-07 Thread Brian Candler
Try looking at scrape_duration_seconds{job="Ping-All-Servers"}. Maybe it's borderline to the scrape interval. What does min_over_time(up{job="Ping-All-Servers"}[5m]) show? In other words, is it the scrape to BBE which is failing, or the BBE probe? (I think the latter). Is there a different n
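
The distinction drawn above (is the scrape to BBE failing, or the BBE probe?) can be sketched in Python. This is only an illustration: `classify_failure` is a hypothetical helper, not part of Prometheus or BBE, and it assumes you have already read the latest `up` and `probe_success` values for one target (`None` meaning no sample was found).

```python
from typing import Optional


def classify_failure(up: Optional[float], probe_success: Optional[float]) -> str:
    """Hypothetical helper: decide which layer failed for one target.

    up            -- latest value of the `up` series (None if no sample exists)
    probe_success -- latest value of `probe_success` (None if no sample exists)
    """
    if up is None:
        # No `up` sample at all: Prometheus has no record of a scrape attempt.
        return "no data"
    if up == 0:
        # Prometheus could not scrape the exporter itself.
        return "scrape failed"
    if probe_success is None:
        # The scrape worked but the exporter exposed no probe_success sample.
        return "scrape ok, probe_success missing"
    if probe_success == 0:
        # The exporter answered, and reported that its probe failed.
        return "probe failed"
    return "ok"
```

If `min_over_time(up{job="Ping-All-Servers"}[5m])` stays at 1 while `probe_success` drops to 0, this sketch would report "probe failed", which is the case suspected in the message above.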

Re: [prometheus-users] Re: Discrepancy in Alert Rule Evaluation.

2020-11-07 Thread Yagyansh S. Kumar
Yes, both the Prometheus instances are talking to the same BBE indeed. In fact, both have the exact same configuration file and are scraping the exact same targets. Here is the graph for the modified query. Failures are visible for 2.20.1 but none for 2.12.0. 2.12.0 [image: image.png] 2.20.1 [image: imag

Re: [prometheus-users] Re: Discrepancy in Alert Rule Evaluation.

2020-11-07 Thread Brian Candler
You won't necessarily see all the failures on that graph. With a 5-second scrape interval, a 6 hour window contains 4,320 scrapes - more than the number of points fetched. Many of the points will be skipped over. I suggest you graph this instead: min_over_time(probe_success[5m]) (Otherwise,
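
The arithmetic above can be checked with a small Python sketch. It is a simplified model, not Prometheus code: the graph resolution of 250 points is an assumed, illustrative value, the sample data is synthetic, and the range window is modelled as the samples in (t - range, t].

```python
def scrapes_in_window(window_seconds: int, scrape_interval_seconds: int) -> int:
    """Number of scrapes that fall inside a graph window."""
    return window_seconds // scrape_interval_seconds


def instant_values(samples, interval, step):
    """Value a graph plots at each step: the latest sample at or before t."""
    end = (len(samples) - 1) * interval
    return [samples[t // interval] for t in range(0, end + 1, step)]


def min_over_time_values(samples, interval, step, range_seconds):
    """Simplified min_over_time: minimum of the samples in (t - range, t]."""
    end = (len(samples) - 1) * interval
    values = []
    for t in range(0, end + 1, step):
        first = max(0, (t - range_seconds) // interval + 1)
        last = t // interval
        values.append(min(samples[first:last + 1]))
    return values


WINDOW = 6 * 3600          # 6 hour graph window, from the message above
INTERVAL = 5               # 5-second scrape interval, from the message above
MAX_POINTS = 250           # ASSUMPTION: illustrative graph resolution
STEP = WINDOW // MAX_POINTS

# Synthetic data: every probe succeeds except a single failed scrape.
samples = [1] * scrapes_in_window(WINDOW, INTERVAL)
samples[1001] = 0

print(scrapes_in_window(WINDOW, INTERVAL))                       # 4320
print(0 in instant_values(samples, INTERVAL, STEP))              # False
print(0 in min_over_time_values(samples, INTERVAL, STEP, 300))   # True
```

With these numbers the graph steps are 86 seconds apart, so the single failed scrape falls between two plotted points and never appears on the plain graph, while the 5-minute minimum still reaches zero.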

[prometheus-users] Re: Discrepancy in Alert Rule Evaluation.

2020-11-07 Thread Brian Candler
On Saturday, 7 November 2020 08:49:15 UTC, yagyans...@gmail.com wrote:
> My Blackbox exporter is already running with Debug Log Mode and still, I
> don't see any probe failed logs for that period.
But is this the same blackbox exporter which is also showing panics in its logs? https://groups

[prometheus-users] Re: Discrepancy in Alert Rule Evaluation.

2020-11-07 Thread Brian Candler
The PromQL query probe_success{job=~"Ping-All-Servers"} == 0 is a filter. It returns the set of timeseries where the job label matches "Ping-All-Servers" *and* the value is zero. It cannot return a non-empty set of results unless those conditions are met. What's your rule evaluation interv
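
The filter behaviour described above can be modelled in a few lines of Python. This is a sketch of the semantics only, not how Prometheus implements it: `filter_eq_zero` and the sample series are made up for illustration, and the `=~` matcher is modelled as a fully anchored regular expression.

```python
import re


def filter_eq_zero(series, job_pattern):
    """Model of `metric{job=~"<job_pattern>"} == 0`.

    series is a list of (labels, value) pairs. A pair is kept only if its
    job label fully matches the pattern AND its value is exactly zero.
    """
    pattern = re.compile(job_pattern)
    return [
        (labels, value)
        for labels, value in series
        if pattern.fullmatch(labels.get("job", "")) and value == 0
    ]


# Made-up sample data for illustration only.
series = [
    ({"job": "Ping-All-Servers", "instance": "host-a"}, 1),
    ({"job": "Ping-All-Servers", "instance": "host-b"}, 0),
    ({"job": "Other-Job", "instance": "host-c"}, 0),
]

print(filter_eq_zero(series, "Ping-All-Servers"))
```

Only host-b survives the filter here. If every matching series has value 1, the result is empty, which is why the alert cannot enter PENDING unless some sample was really zero at evaluation time.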

[prometheus-users] Re: Discrepancy in Alert Rule Evaluation.

2020-11-07 Thread yagyans...@gmail.com
Hi Brian, My Blackbox exporter is already running with Debug Log Mode and still, I don't see any probe failed logs for that period. Also, I have run the query for some of the instances that I saw in PENDING state, but I do not see any failures there either; probe_success is 1 for them constantly

[prometheus-users] Re: Discrepancy in Alert Rule Evaluation.

2020-11-07 Thread Brian Candler
Go into the Prometheus query browser (front page in the web interface, normally port 9090), and enter the query: probe_success{job=~"Ping-All-Servers"} and switch to graph mode. Is the line going up and down? Then probes are failing. If you want to see logs of these failures, then on the bla

[prometheus-users] Re: Discrepancy in Alert Rule Evaluation.

2020-11-07 Thread yagyans...@gmail.com
Prometheus Version - 2.20.1
On Saturday, November 7, 2020 at 1:46:31 PM UTC+5:30 yagyans...@gmail.com wrote:
> Hi. I am using Blackbox Exporter v 0.18.0 for generating Host Down Alerts.
> Below is the configured rule.
> - alert: HostDown
>   expr: probe_success{job=~"Ping-All-Servers"} ==