Hi Joshua,

I don't think that is the cause here. The Prometheus scrape interval is 15 seconds, not 1 minute. I also checked the debug logs for the mgr/prometheus module, and metrics collection took less than 10 seconds most of the time.
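
To make the symptom concrete, here is a minimal Python sketch of how irate() behaves when two consecutive scraped samples are identical. It is a simplified re-implementation for illustration, not Prometheus code. The 100 ops/s rate and the 30-second refresh of the exported value are hypothetical numbers chosen only to reproduce the zero-then-doubled pattern; I am not claiming this is what the module actually does.

```python
def irate(samples, now, window):
    """Simplified stand-in for PromQL irate(): per-second rate computed
    from the last two samples whose timestamps fall in (now - window, now]."""
    in_range = [(t, v) for t, v in samples if now - window < t <= now]
    if len(in_range) < 2:
        return None  # not enough points to compute a rate
    (t0, v0), (t1, v1) = in_range[-2:]
    if v1 < v0:
        # Counter reset: use the last value as the increase.
        return v1 / (t1 - t0)
    return (v1 - v0) / (t1 - t0)


TRUE_RATE = 100        # hypothetical steady workload, ops per second
SCRAPE_INTERVAL = 15   # seconds, as in this setup
REFRESH_INTERVAL = 30  # ASSUMPTION: exported counter only changes every 30 s


def exported_value(t):
    """Counter value visible to the scraper at time t (stale between refreshes)."""
    last_refresh = (t // REFRESH_INTERVAL) * REFRESH_INTERVAL
    return TRUE_RATE * last_refresh


samples = [(t, exported_value(t)) for t in range(0, 121, SCRAPE_INTERVAL)]

# Evaluate irate(...[1m]) at every scrape timestamp from t=30 onwards.
rates = [irate(samples, now, 60) for now in range(30, 121, SCRAPE_INTERVAL)]
print(rates)

# Widening the range to [2m] does not change the result, because only
# the two most recent samples in the range are used.
print(irate(samples, 120, 120) == irate(samples, 120, 60))
```

With these made-up numbers the computed rate alternates between 200 and 0 while the real rate is a steady 100, which is the same shape as the spikes in the panels. It also shows that a wider range such as [2m] would not smooth this out on its own.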
Chris

On Mon, Jul 28, 2025 at 5:54 PM Joshua Baergen <jbaer...@digitalocean.com> wrote:
> Hi Chris,
>
> Assuming that the scrape period for prom is set to 1 minute, you could
> simply be racing against the scrape. Usually it's not a good idea to
> create range vectors with the same time range as the scrape period.
> Given that you're using irate(), you could increase that to [2m] or
> higher and still get the effect that you want, since it'll always pick
> the two most recent points within the range on which to perform the
> rate calculation.
>
> Josh
>
> On Mon, Jul 28, 2025 at 5:40 AM Christopher James
> <christopher.jamesjr2...@gmail.com> wrote:
> >
> > Hi,
> >
> > We have encountered some issues regarding the RBD stats metrics.
> >
> > We have some Grafana panels that show the per-pool rate of changes in RBD
> > IOPS and Throughput.
> >
> > We use these queries for IOPS
> > round(sum(irate(ceph_rbd_write_ops[1m])) by (pool))
> > round(sum(irate(ceph_rbd_read_ops[1m])) by (pool))
> >
> > and these queries for throughput
> > round(sum(irate(ceph_rbd_write_bytes[1m])) by (pool))
> > round(sum(irate(ceph_rbd_read_bytes[1m])) by (pool))
> >
> > Now, the problem we encounter here is that at some points in time, there
> > are some odd outputs.
> > In a way that the output has a lot of spikes, and at many points in time,
> > there seems to be no change in the mentioned values, hence the irate()'s
> > output would become zero, and the next data point would show that the data
> > has somehow doubled from the last time.
> >
> > I investigated the mgr/prometheus module to see if it is related to the
> > cache mechanism, and it doesn't seem to be about that because the metrics
> > collection method is done in less than 10 seconds most of the time and
> > there is no "collecting data took more than ..." log.
> >
> > I also investigated the raw time series data and saw that the two
> > datapoints are exactly the same at the times that the irate() returns zero.
> >
> > I have put some screenshots of the panels in this read-only Google Doc:
> > https://docs.google.com/document/d/1Rf0dl4qAWnOtG80BsxVY9cjljQqliG2JHxgzVxDjGo8/edit?usp=sharing
> >
> > It is somewhat odd, to be honest, that every image in every pool does not
> > change in IOPS or throughput.
> >
> > Is this the natural way the metrics are exposed? Should I not use irate()?
> > Has anybody else encountered this issue as well?
> >
> > Cheers,
> > Chris
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io