On Tuesday, 28 November 2023 at 04:15:41 UTC Chris Siebenmann wrote:

The Blackbox exporter is a bit tricky to understand in relation to up{}, 
because unlike many exporters you create multiple scrape targets against 
(or through) the same exporter. This generally means you want to ignore 
the up{} metric for any particular blackbox probe and instead scrape 
Blackbox's metric endpoint and pay attention to its up{} (for alerts, 
for example).


I think that's worded in a misleading way.

Blackbox exporter does have a /metrics endpoint, but this is only for 
metrics internal to the operation of blackbox_exporter itself (e.g. memory 
stats, software version). You don't need to scrape this, but it gives you a 
little bit of extra info about how your exporter is performing.

Blackbox exporter's main interface is the /probe endpoint, where you tell 
it to run individual tests: /probe?target=xxx&module=yyy
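
For example, the scrape job for /probe usually looks something like this
(a sketch only: it assumes blackbox_exporter is listening on
127.0.0.1:9115 and that you have an 'http_2xx' module defined in
blackbox.yml):

scrape_configs:
- job_name: blackbox_http
  metrics_path: /probe
  params:
    module: [http_2xx]           # module name defined in blackbox.yml
  static_configs:
  - targets:
    - https://example.com        # the thing you actually want to probe
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target # becomes ?target=... on the /probe request
  - source_labels: [__param_target]
    target_label: instance       # keep the probed target as the instance label
  - target_label: __address__
    replacement: 127.0.0.1:9115  # the address Prometheus actually scrapes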

The 'up' metric is generated by Prometheus itself, and only tells you that 
Prometheus was able to communicate with the exporter and get some results 
(without a 4xx/5xx error, for example).  So it's correct to say that you're 
not interested in the 'up' metric for scrapes to /probe, since it will 
always be 1 unless blackbox_exporter itself is badly broken, and you're 
interested in probe_success instead.
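
To make the distinction concrete (the job name here is just a placeholder):

    # 1 as long as Prometheus could reach blackbox_exporter and get a scrape back
    up{job="blackbox_http"}

    # 1 only if the probe of the real target behind /probe?target=... succeeded
    probe_success{job="blackbox_http"}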

This is pretty easy to arrange in alerting rules. Here's a starting point:

groups:
- name: UpDown
  rules:
  - alert: UpDown
    expr: up == 0
    for: 3m
    keep_firing_for: 3m
    labels:
      severity: critical
    annotations:
      summary: 'Scrape failed: host is down or scrape endpoint down/unreachable'
- name: BlackboxRules
  rules:
  - alert: ProbeFail
    expr: probe_success == 0
    for: 3m
    keep_firing_for: 3m
    labels:
      severity: critical
    annotations:
      description: |
        {{ $labels.instance }} ({{ $labels.module }}) probe is failing
      summary: Probed service is down

For Grafana I'd probably just make two dashboards, but if you really want a 
grand summary of all "problems" then you can simply use a PromQL expression 
like this:

    up == 0 or probe_success == 0

The "or" operator 
<https://prometheus.io/docs/prometheus/latest/querying/operators/#logical-set-binary-operators>
 
in PromQL is not a boolean: it's more like a set union operator.  It will 
give you all the values of the "up" vector where the value is 0, along with 
all values of the "probe_success" vector where the value is 0 (except for 
values of probe_success == 0 which have *exactly* the same labels as up == 
0, but those are unlikely anyway)

The consumer of this query is going to see a mixture of up{...} and 
probe_success{...} metrics, all with value 0.
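
For example (label values invented purely for illustration), the result 
vector could contain entries like:

    up{job="node", instance="host1:9100"}  0
    probe_success{job="blackbox_http", instance="https://example.com"}  0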

(There are other multi-target 
indirect exporters like Blackbox. I believe that the SNMP exporter is 
another one where you often have one exporter separately scraping a lot 
of targets, and each target will have its own up{} metric that you 
probably want to ignore.)


The first part of that is correct: SNMP exporter uses 
/snmp?target=xxx&module=yyy&auth=zzz.

But the second part is wrong: if SNMP exporter fails to talk to the target 
then it returns an empty scrape with a 4xx/5xx error code, which Prometheus 
turns into up==0.  So you definitely *do* want to alert on up==0 in this 
case, as that's how you detect a device which is failing to respond to SNMP.
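
The scrape configuration follows the same indirect pattern as blackbox. 
As a sketch (the module/auth names and the 127.0.0.1:9116 address are 
assumptions that depend on your snmp.yml):

scrape_configs:
- job_name: snmp
  metrics_path: /snmp
  params:
    module: [if_mib]             # assumed module name from your snmp.yml
    auth: [public_v2]            # assumed auth name (newer snmp_exporter releases)
  static_configs:
  - targets:
    - 192.0.2.1                  # the device to poll over SNMP
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target
  - source_labels: [__param_target]
    target_label: instance
  - target_label: __address__
    replacement: 127.0.0.1:9116  # snmp_exporter address

With that layout, the UpDown rule above already covers unreachable SNMP 
devices, because a failed walk comes back as up{job="snmp"} == 0.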

In our environment, it's useful for us to have a granular view of what 
has failed. That a device has stopped pinging is a different issue than 
its node_exporter not being up, so our dashboards (and alerts) reflect 
that.


I agree with that. Different metrics inherently have different meanings, 
and although 'up' and 'probe_success' have similar 0/1 semantics, there's 
other information you can get from blackbox_exporter when probe_success==0 
which can tell you more about the nature of the problem (e.g. failure to 
connect, failure to resolve a DNS name, TLS certificate validation failure, 
etc.).
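
For example (metric names as exposed by blackbox_exporter; which ones 
appear depends on the module you probe with):

    probe_http_status_code          # last HTTP status code returned
    probe_dns_lookup_time_seconds   # time spent resolving the target's name
    probe_ssl_earliest_cert_expiry  # Unix timestamp of the earliest cert expiry in the chain
    probe_failed_due_to_regex       # 1 if a configured body regexp check failed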
