Hi,
I'm trying to figure out why data is sometimes missing from a dashboard backed
by Prometheus. Our setup is a more or less standard prometheus-operator Helm
chart. It defines the following recording rule:
record: instance:node_cpu_utilisation:rate1m
expr: 1 - avg without(cpu, mode)
(rate(node_cpu_seconds_total{job="node-exporter",mode="idle"}[1m]))
There are 9 nodes in the cluster, but the dashboard that displays this
metric only shows 7 of them. Switching the dashboard to use the expression
directly shows all data as expected. Noteworthy things:
- there are no exceptions in the log and no failed rule evaluations
- the issue has shown (almost) consistently for more than 2 hours now
- on two occasions in this period one of the missing nodes appeared in
the recorded series for what seems to be one scrape interval and was
dropped again immediately
- the issue persists after a Prometheus restart
- other rules defined within the same group seem to be impacted in the
same way (e.g. *instance:node_network_receive_bytes_excluding_lo:rate1m*
that calculates network usage in the same fashion)
This cluster suffered some performance issues in the past and had the
scrape/evaluation interval extended to 90s. During this period
*instance:node_cpu_utilisation:rate1m* didn't record any data (because its 1m
range was shorter than the actual scrape/evaluation interval). The problem
became apparent after switching back to the original 30s scrape/evaluation
interval. At that point all 9 nodes should have had their CPU usage displayed
correctly, but only 7 appeared.
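
To make the reasoning about the range and the scrape interval concrete, here is a minimal sketch in plain Python (not Prometheus code). It assumes perfectly regular scrapes starting at t=0 and a window that excludes its left edge; the function names are hypothetical. rate() needs at least two samples inside the range to return a value, so the sketch just counts them:

```python
def samples_in_window(scrape_interval_s, window_s, eval_time_s):
    """Count scrape timestamps t with eval_time - window < t <= eval_time.

    Assumption: scrapes happen at 0, interval, 2*interval, ... with no
    jitter and no missed scrapes.
    """
    count = 0
    t = 0
    while t <= eval_time_s:
        if t > eval_time_s - window_s:
            count += 1
        t += scrape_interval_s
    return count


def rate_can_return_value(scrape_interval_s, window_s, eval_time_s):
    # rate() needs at least two samples in the range to compute a slope.
    return samples_in_window(scrape_interval_s, window_s, eval_time_s) >= 2


# 90s scrape interval with a 1m range: never two samples in the window.
never_with_90s = not any(rate_can_return_value(90, 60, t) for t in range(60, 1000))

# 30s scrape interval with a 1m range: two samples at every evaluation time.
always_with_30s = all(rate_can_return_value(30, 60, t) for t in range(60, 1000))
```

Under these idealised assumptions a 30s interval leaves exactly two samples in a 1m window, which is the minimum rate() needs.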
Has anybody encountered a similar situation?
Thanks,
Vojta