[ceph-users] Odd RBD stats metrics

Christopher James Mon, 28 Jul 2025 04:40:17 -0700

Hi,

We have encountered some issues regarding the RBD stats metrics.


We have some Grafana panels that show per-pool RBD IOPS and throughput,
computed from the RBD counters.

We use these queries for IOPS:
round(sum(irate(ceph_rbd_write_ops[1m])) by (pool))
round(sum(irate(ceph_rbd_read_ops[1m])) by (pool))

and these queries for throughput:
round(sum(irate(ceph_rbd_write_bytes[1m])) by (pool))
round(sum(irate(ceph_rbd_read_bytes[1m])) by (pool))

The problem is that the output is sometimes odd: the graphs show a lot of
spikes. At many points in time the counters do not appear to change at all,
so irate() returns zero, and the next data point then shows a value roughly
double what it was before the flat interval.
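
To illustrate the pattern, here is a minimal sketch of how irate() behaves
when one scrape repeats the previous counter value. It is not Prometheus
code; the irate() function below is a simplified stand-in that only looks at
the last two samples in the window, and the sample values (a steady 100 ops/s
scraped every 15 seconds) are made up for illustration.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def irate(samples: List[Sample]) -> Optional[float]:
    """Per-second rate from the last two samples, similar to PromQL irate()."""
    if len(samples) < 2:
        return None
    (t0, v0), (t1, v1) = samples[-2], samples[-1]
    if v1 < v0:
        # Counter reset: treat the previous value as zero.
        v0 = 0.0
    return (v1 - v0) / (t1 - t0)


# Hypothetical counter: the real load is a constant 100 ops/s, but the
# sample at t=30 repeats the value from t=15 (a stale scrape).
samples = [(0, 0), (15, 1500), (30, 1500), (45, 4500), (60, 6000)]

# Evaluate irate() after each new sample arrives.
rates = [irate(samples[:i]) for i in range(2, len(samples) + 1)]
print(rates)  # [100.0, 0.0, 200.0, 100.0]
```

The repeated sample gives a zero, and the following sample has to account
for two intervals' worth of counter growth, so it shows up as double the
real rate. This matches the zero-then-double shape in our panels.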

I investigated the mgr/prometheus module to see whether its cache mechanism
is involved. That does not seem to be the cause: metrics collection finishes
in less than 10 seconds most of the time, and there is no "collecting data
took more than ..." message in the log.

I also looked at the raw time series data and saw that two consecutive
data points are exactly the same whenever irate() returns zero.
I have put some screenshots of the panels in this read-only Google Doc:
https://docs.google.com/document/d/1Rf0dl4qAWnOtG80BsxVY9cjljQqliG2JHxgzVxDjGo8/edit?usp=sharing
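
The check on the raw samples can be sketched roughly as follows. This is
only an illustration: flat_intervals() is a hypothetical helper, and the
sample list is invented data, not an export from our cluster.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def flat_intervals(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Return (start, end) timestamps where the counter did not change."""
    return [
        (t0, t1)
        for (t0, v0), (t1, v1) in zip(samples, samples[1:])
        if v1 == v0
    ]


# Hypothetical raw samples for one counter, scraped every 15 seconds.
raw = [(0, 0), (15, 1500), (30, 1500), (45, 4500), (60, 6000)]
print(flat_intervals(raw))  # [(15, 30)]
```

Every interval this reports lines up with a point where irate() would
return zero.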

It is somewhat odd, to be honest, that at those points no image in any pool
shows a change in IOPS or throughput.

Is this the expected way for these metrics to be exposed? Should I not use
irate()? Has anybody else encountered this issue?

Cheers,
Chris
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io