Hi,

We have encountered some issues with the RBD stats metrics.
We have some Grafana panels that show the per-pool rate of change of RBD IOPS and throughput. We use these queries for IOPS:

    round(sum(irate(ceph_rbd_write_ops[1m])) by (pool))
    round(sum(irate(ceph_rbd_read_ops[1m])) by (pool))

and these queries for throughput:

    round(sum(irate(ceph_rbd_write_bytes[1m])) by (pool))
    round(sum(irate(ceph_rbd_read_bytes[1m])) by (pool))

The problem is that at some points in time the output looks odd: it has a lot of spikes. At many points in time there seems to be no change in the underlying values, so irate() returns zero, and the next data point then shows that the value has somehow doubled compared with the previous one.

I investigated the mgr/prometheus module to see whether this is related to its cache mechanism. It doesn't seem to be, because metrics collection finishes in less than 10 seconds most of the time and there is no "collecting data took more than ..." log message. I also looked at the raw time series data and saw that the two data points are exactly the same at the times when irate() returns zero.

I have put some screenshots of the panels in this read-only Google Doc:
https://docs.google.com/document/d/1Rf0dl4qAWnOtG80BsxVY9cjljQqliG2JHxgzVxDjGo8/edit?usp=sharing

To be honest, it is somewhat odd that not a single image in any pool shows a change in IOPS or throughput at those times. Is this the natural way the metrics are exposed? Should I not use irate()? Has anybody else encountered this issue?

Cheers,
Chris

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
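
P.S. To make the symptom concrete, here is a minimal Python sketch of the arithmetic, not the actual Prometheus or mgr/prometheus code. It assumes a hypothetical counter that really grows by 100 ops/s, is scraped every 15 seconds, and whose exported value is not refreshed for one scrape. The names simple_irate, simple_avg_rate and scraped, and all the numbers, are made up for illustration. The sketch only models the idea that irate() looks at the last two samples in the window.

```python
from typing import List, Tuple

# (timestamp in seconds, counter value); hypothetical data, not real Ceph output
Sample = Tuple[float, float]


def simple_irate(samples: List[Sample]) -> float:
    """Per-second rate from the last two samples (simplified irate-like model)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t_prev, v_prev), (t_last, v_last) = samples[-2], samples[-1]
    # A decrease is treated as a counter reset, so the delta is the new value.
    delta = v_last if v_last < v_prev else v_last - v_prev
    return delta / (t_last - t_prev)


def simple_avg_rate(samples: List[Sample]) -> float:
    """Average per-second rate over the whole span (no extrapolation, no resets)."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t_first, v_first), (t_last, v_last) = samples[0], samples[-1]
    return (v_last - v_first) / (t_last - t_first)


# The true counter grows by 1500 every 15 s, but the scrape at t=30 returns
# the same value as at t=15 (assumed stale), and t=45 catches up.
scraped: List[Sample] = [(0, 0), (15, 1500), (30, 1500), (45, 4500), (60, 6000)]

# Evaluate the irate-like value at every scrape after the first one.
irates = [simple_irate(scraped[: i + 1]) for i in range(1, len(scraped))]
print(irates)                    # [100.0, 0.0, 200.0, 100.0]
print(simple_avg_rate(scraped))  # 100.0
```

In this toy series one repeated sample is enough to produce a zero followed by a doubled value, while the average over the whole span stays at 100. Whether the real exporter behaves like this is exactly what I am unsure about.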