One more thing worth mentioning is that the Prometheus ecosystem assumes and
follows the "Fail Fast" principle[0].

Best practice[1] in Prometheus is to fail the whole scrape and return a 5xx
error if any part of the data collection fails. For simple exporters this is
the typical approach. The reason is that partial failure can be hard to
reason about and write alerts for: either you get all the data you expect,
or you get an error.

But for more complex exporters that gather a lot of data, or cases where you
are OK with partial results and will handle them with more complex alerts,
proxy "up" metrics are used instead.

For example, the mysqld_exporter exposes a `mysql_up` metric that records
whether it was able to establish a basic connection to the server. Similarly,
the node_exporter exposes a per-collector `node_scrape_collector_success`.
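
If you rely on those proxy metrics, you typically alert on them much like
you would on up. A rough, untested sketch (the metric names are real ones
from those exporters; the group name, durations and severities are just
placeholders):

groups:
- name: ProxyUp
  rules:
  - alert: MySQLDown
    expr: mysql_up == 0
    for: 3m
    labels:
      severity: critical
    annotations:
      summary: 'mysqld_exporter cannot reach MySQL on {{ $labels.instance }}'
  - alert: NodeCollectorFailing
    expr: node_scrape_collector_success == 0
    for: 15m
    labels:
      severity: warning
    annotations:
      summary: 'collector {{ $labels.collector }} failing on {{ $labels.instance }}'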

[0]: https://en.wikipedia.org/wiki/Fail-fast
[1]: https://prometheus.io/docs/instrumenting/writing_exporters/#failed-scrapes

On Tue, Nov 28, 2023 at 11:18 AM 'Brian Candler' via Prometheus Users <
prometheus-users@googlegroups.com> wrote:

> On Tuesday, 28 November 2023 at 04:15:41 UTC Chris Siebenmann wrote:
>
> The Blackbox exporter is a bit tricky to understand in relation to up{},
> because unlike many exporters you create multiple scrape targets against
> (or through) the same exporter. This generally means you want to ignore
> the up{} metric for any particular blackbox probe and instead scrape
> Blackbox's metric endpoint and pay attention to its up{} (for alerts,
> for example).
>
>
> I think that's worded in a misleading way.
>
> Blackbox exporter does have a /metrics endpoint, but this is only for
> metrics internal to the operation of blackbox_exporter itself (e.g. memory
> stats, software version). You don't need to scrape this, but it gives you a
> little bit of extra info about how your exporter is performing.
>
> Blackbox exporter's main interface is the /probe endpoint, where you tell
> it to run individual tests: /probe?target=xxx&module=yyy
>
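For reference, the usual way to point Prometheus at that /probe endpoint is
via relabelling in the scrape config. A rough sketch (the job name, module,
example target and the 127.0.0.1:9115 exporter address are all placeholders
for your own setup):

scrape_configs:
- job_name: blackbox-http
  metrics_path: /probe
  params:
    module: [http_2xx]          # must match a module defined in blackbox.yml
  static_configs:
  - targets:
    - https://example.org       # the thing you want probed
  relabel_configs:
  - source_labels: [__address__]
    target_label: __param_target   # becomes ?target=... on the /probe request
  - source_labels: [__param_target]
    target_label: instance         # so the probed target is the instance label
  - target_label: __address__
    replacement: 127.0.0.1:9115    # the blackbox_exporter itself gets scraped
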
> The 'up' metric is generated by Prometheus itself, and only tells you that
> it was successfully able to communicate with the exporter and get some
> results (without a 4xx / 5xx error for example).  So it's correct to say
> that you're not interested in the 'up' metric for scrapes to /probe, since
> it will always be 1 unless blackbox_exporter itself is badly broken, and
> you're interested in probe_success instead.
>
> This is pretty easy to arrange in alerting rules. Here's a starting point:
>
> groups:
> - name: UpDown
>   rules:
>   - alert: UpDown
>     expr: up == 0
>     for: 3m
>     keep_firing_for: 3m
>     labels:
>       severity: critical
>     annotations:
>       summary: 'Scrape failed: host is down or scrape endpoint
> down/unreachable'
> - name: BlackboxRules
>   rules:
>   - alert: ProbeFail
>     expr: probe_success == 0
>     for: 3m
>     keep_firing_for: 3m
>     labels:
>       severity: critical
>     annotations:
>       description: |
>         {{ $labels.instance }} ({{ $labels.module }}) probe is failing
>       summary: Probed service is down
>
> For Grafana I'd probably just make two dashboards, but if you really want
> a grand summary of all "problems" then you can simply use a PromQL
> expression like this:
>
>     up == 0 or probe_success == 0
>
> The "or" operator
> <https://prometheus.io/docs/prometheus/latest/querying/operators/#logical-set-binary-operators>
> in PromQL is not a boolean: it's more like a set union operator.  It will
> give you all the values of the "up" vector where the value is 0, along with
> all values of the "probe_success" vector where the value is 0 (except for
> values of probe_success == 0 which have *exactly* the same labels as up ==
> 0, but those are unlikely anyway)
>
> The consumer of this query is going to see a mixture of up{...} and
> probe_success{...} metrics, all with value 0.
>
>  there are other multi-target
> indirect exporters like Blackbox. I believe that the SNMP exporter is
> another one where you often have one exporter separately scraping a lot
> of targets, and each target will have its own up{} metric that you
> probably want to ignore.)
>
>
> The first part of that is correct: SNMP exporter uses
> /snmp?target=xxx&module=yyy&auth=zzz.
>
> But the second part is wrong: if SNMP exporter fails to talk to the target
> then it returns an empty scrape with a 4xx/5xx error code, which prometheus
> turns into up==0.  So you definitely *do* want to alert on up==0 in this
> case, as that's how you detect a device which is failing to respond to SNMP.
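
For that case a job-scoped version of the up alert might look like this
(the job label "snmp" is a placeholder for whatever your SNMP scrape job is
actually called):

- alert: SNMPDeviceDown
  expr: up{job="snmp"} == 0
  for: 5m
  labels:
    severity: critical
  annotations:
    summary: '{{ $labels.instance }} is not responding to SNMP'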
>
> In our environment, it's useful for us to have a granular view of what
> has failed. That a device has stopped pinging is a different issue than
> its node_exporter not being up, so our dashboards (and alerts) reflect
> that.
>
>
> I agree with that. Different metrics inherently have different meanings,
> and although 'up' and 'probe_success' have similar 0/1 semantics, there's
> other information you can get from blackbox_exporter when probe_success==0
> which can tell you more about the nature of the problem (e.g. failure to
> connect, failure to resolve DNS name, TLS certificate validation failure
> etc)
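
As a concrete example of that extra information, the HTTP prober also
exposes metrics like probe_http_status_code and probe_ssl_earliest_cert_expiry,
so you can even alert on a certificate that is about to expire before the
probe starts failing outright; something like (14 days is an arbitrary
threshold):

    probe_ssl_earliest_cert_expiry - time() < 14 * 86400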
>
