Thanks a lot Eugen, I dumbly forgot about the rbd block prefix!

I'll try that this afternoon and let you know how it goes.

On Wed, Feb 23, 2022 at 11:41, Eugen Block <ebl...@nde.ag> wrote:

> Hi,
>
> > How can I identify which operation this OSD is trying to perform, as
> > the osd_op() line is a bit large ^^?
>
> I would start by querying the OSD for historic_slow_ops to see which
> operation it is:
>
> ceph daemon osd.<OSD> dump_historic_slow_ops
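>
> To skim the recorded ops quickly, here's a minimal sketch (assuming the
> output is JSON with an "ops" array, like dump_historic_ops, and that jq
> is installed):
>
> # print duration and description of each recorded slow op
> ceph daemon osd.<OSD> dump_historic_slow_ops | \
>      jq -r '.ops[] | "\(.duration)s \(.description)"'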
>
> > How can I identify the images related to this data chunk?
>
> You could go through all rbd images and check for the line containing
> block_name_prefix; this could take some time depending on how many
> images you have:
>
>          block_name_prefix: rbd_data.ca69416b8b4567
>
> I sometimes do that with this for loop:
>
> for i in `rbd -p <POOL> ls`; do if [ $(rbd info <POOL>/$i | grep -c
> <PREFIX>) -gt 0 ]; then echo "image: $i"; break; fi; done
>
> So in your case it would look something like this:
>
> for i in `rbd -p <POOL> ls`; do if [ $(rbd info <POOL>/$i | grep -c
> 89a4a940aba90b) -gt 0 ]; then echo "image: $i"; break; fi; done
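>
> If you're not sure which pool the image lives in, a rough sketch that
> scans every pool (assuming pools can be listed with ceph osd pool ls;
> errors from non-RBD pools are discarded):
>
> # search all pools for the image owning this block_name_prefix
> for p in $(ceph osd pool ls); do
>    for i in $(rbd -p $p ls 2>/dev/null); do
>      rbd info $p/$i | grep -q 89a4a940aba90b && echo "pool: $p image: $i"
>    done
> done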
>
> To see which clients are connected you can check the mon daemon:
>
> ceph daemon mon.<MON> sessions
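>
> For example, to check whether the client ID from the slow op message is
> currently connected (a sketch; replace <CLIENT_ID> with the numeric ID
> you saw in the osd_op line):
>
> ceph daemon mon.<MON> sessions | grep <CLIENT_ID>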
>
> The mon daemon also has a history of slow ops:
>
> ceph daemon mon.<MON> dump_historic_slow_ops
>
> Regards,
> Eugen
>
>
> Quoting Gaël THEROND <gael.ther...@bitswalk.com>:
>
> > Hi everyone, I've been having a really nasty issue for around two days:
> > our cluster reports a bunch of SLOW_OPS on one of our OSDs, as shown here:
> >
> > https://paste.openstack.org/show/b3DkgnJDVx05vL5o4OmY/
> >
> > Here is the cluster specification:
> >   * Used to store OpenStack-related data (VMs/Snapshots/Volumes/Swift).
> >   * Based on Ceph Nautilus 14.2.8, installed using ceph-ansible.
> >   * Uses an EC-based storage profile.
> >   * We have separate, dedicated 10 Gbps frontend and backend networks.
> >   * We don't have any network issues observed or reported by our
> > monitoring system.
> >
> > Here is our current cluster status:
> > https://paste.openstack.org/show/biVnkm9Yyog3lmSUn0UK/
> > Here is a detailed view of our cluster status:
> > https://paste.openstack.org/show/bgKCSVuow0JUZITo2Ndj/
> >
> > My main issue here is that this health alert is starting to fill the
> > monitor's disk and thus triggers a MON_DISK_BIG alert.
> >
> > I'm worried, as I'm having a hard time identifying which OSD operation
> > is actually slow and, especially, which image it concerns and which
> > client is using it.
> >
> > So far I've tried:
> >   * To match this client ID with any watcher of our stored
> > volumes/vms/snapshots by extracting the whole list and then using the
> > following command: *rbd status <pool>/<image>*
> >      Unfortunately, none of the watchers matches the client reported by
> > the OSD on any pool.
> >
> >   * To map this reported chunk of data to any of our stored images
> > using: *ceph osd map <pool>/rbd_data.5.89a4a940aba90b.00000000000000a0*
> >      Unfortunately, every pool name existing within our cluster gives me
> > back an answer with no image information and a different watcher client
> > ID.
> >
> > So my questions are:
> >
> > How can I identify which operation this OSD is trying to perform, as
> > the osd_op() line is a bit large ^^?
> > Does the *snapc* information in the log relate to snapshots, or is it
> > something totally different?
> > How can I identify the images related to this data chunk?
> > Is there official documentation about SLOW_OPS explaining how to read
> > these log entries, e.g. which field is the PG number, which is a
> > client ID, etc.?
> >
> > Thanks a lot everyone and feel free to ask for additional information!
> > G.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
