[ceph-users] Re: Odd RBD stats metrics

Christopher James Tue, 29 Jul 2025 01:17:13 -0700

Hi Joshua

I don't think this is the case. The scrape interval for prom is 15 seconds,
not 1 minute, so a [1m] range already covers about four samples.
I also checked the debug logs for the prom module, and the metrics
collection took less than 10 seconds most of the time.
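
To illustrate what I mean, here is a rough Python sketch of how I understand
irate() to behave: it takes the last two samples inside the range and divides
their difference by the time between them. The function name, the sample
values and the stale sample at t=45 are all made up for illustration, and the
exact range-boundary handling may differ between Prometheus versions.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (timestamp in seconds, counter value)


def irate(samples: List[Sample], now: float, window: float) -> Optional[float]:
    """Approximate PromQL irate(): per-second rate from the last two
    samples whose timestamps fall in (now - window, now]."""
    in_range = [s for s in samples if now - window < s[0] <= now]
    if len(in_range) < 2:
        return None  # not enough points to compute a rate
    (t0, v0), (t1, v1) = in_range[-2], in_range[-1]
    # On a counter reset (value went down), use the new value as the delta.
    delta = v1 if v1 < v0 else v1 - v0
    return delta / (t1 - t0)


# Hypothetical counter scraped every 15s while the real rate is 100 ops/s.
# The scrape at t=45 returns the same value as t=30 (a stale sample), and
# the scrape at t=60 then catches up with 30 seconds' worth of increase.
samples = [(0, 0), (15, 1500), (30, 3000), (45, 3000), (60, 6000), (75, 7500)]

print(irate(samples, now=30, window=60))   # 100.0
print(irate(samples, now=45, window=60))   # 0.0   (two identical points)
print(irate(samples, now=60, window=60))   # 200.0 (looks "doubled")
print(irate(samples, now=45, window=120))  # 0.0   (wider range, same result)
```

If this model is right, a wider range would not hide a repeated sample,
because irate() still only looks at the two most recent points.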


Chris.

On Mon, Jul 28, 2025 at 5:54 PM Joshua Baergen <jbaer...@digitalocean.com>
wrote:

> Hi Chris,
>
> Assuming that the scrape period for prom is set to 1 minute, you could
> simply be racing against the scrape. Usually it's not a good idea to
> create range vectors with the same time range as the scrape period.
> Given that you're using irate(), you could increase that to [2m] or
> higher and still get the effect that you want, since it'll always pick
> the two most recent points within the range on which to perform the
> rate calculation.
>
> Josh
>
> On Mon, Jul 28, 2025 at 5:40 AM Christopher James
> <christopher.jamesjr2...@gmail.com> wrote:
> >
> > Hi,
> >
> > We have encountered some issues regarding the RBD stats metrics.
> >
> > We have some Grafana panels that show the per-pool RBD IOPS and
> > throughput.
> >
> > We use these queries for IOPS
> > round(sum(irate(ceph_rbd_write_ops[1m])) by (pool))
> > round(sum(irate(ceph_rbd_read_ops[1m])) by (pool))
> >
> > and these queries for throughput
> > round(sum(irate(ceph_rbd_write_bytes[1m])) by (pool))
> > round(sum(irate(ceph_rbd_read_bytes[1m])) by (pool))
> >
> > Now, the problem we encounter is that at some points in time the
> > output looks odd. It has a lot of spikes: at many points there seems
> > to be no change in the underlying counters, so irate() returns zero,
> > and the next data point then shows a value that is roughly double the
> > one before the zero.
> >
> > I investigated the mgr/prometheus module to see if it is related to the
> > cache mechanism, and it doesn't seem to be about that because the metrics
> > collection method is done in less than 10 seconds most of the time and
> > there is no "collecting data took more than ..." log.
> >
> > I also investigated the raw time series data and saw that the two
> > datapoints are exactly the same at the times that the irate() returns
> > zero.
> > I have put some screenshots of the panels in this read-only Google Doc:
> >
> > https://docs.google.com/document/d/1Rf0dl4qAWnOtG80BsxVY9cjljQqliG2JHxgzVxDjGo8/edit?usp=sharing
> >
> > It is somewhat odd, to be honest, that every image in every pool shows
> > no change in IOPS or throughput at those points.
> >
> > Is this the natural way the metrics are exposed? Should I not use
> > irate()?
> > Has anybody else encountered this issue as well?
> >
> > Cheers,
> > Chris
> > _______________________________________________
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>