Hi all,

Thanks for the great responses. Confirming that this was the issue (feature). No idea why this was set differently for us in Nautilus.
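For anyone who finds this thread later, here's how to inspect and lower the knob. This is just a sketch: the 60-second value below is purely an example for benchmark turnaround, not a recommendation for production.

[ceph: root@ /]# ceph config get mon mon_osd_down_out_interval    # defaults to 600 seconds, i.e. the 10-minute delay
[ceph: root@ /]# ceph config set mon mon_osd_down_out_interval 60 # illustrative value for faster benchmarking only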
This should make the recovery benchmarking a bit faster now. :)

Cheers,
Sean

> On 6/12/2022, at 3:09 PM, Wesley Dillingham <w...@wesdillingham.com> wrote:
>
> I think you are experiencing mon_osd_down_out_interval:
>
> https://docs.ceph.com/en/latest/rados/configuration/mon-osd-interaction/#confval-mon_osd_down_out_interval
>
> Ceph waits 10 minutes before marking a down OSD as out, for the reasons you
> mention, but this would have been the case in Nautilus as well.
>
> Respectfully,
>
> Wes Dillingham
> w...@wesdillingham.com
> LinkedIn: http://www.linkedin.com/in/wesleydillingham
>
>
> On Mon, Dec 5, 2022 at 5:20 PM Sean Matheny <sean.math...@nesi.org.nz> wrote:
>>
>> Hi all,
>>
>> New Quincy cluster here that I'm running some benchmarks against:
>>
>> ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)
>> 11 nodes of 24x 18TB HDD OSDs, 2x 2.9TB SSD OSDs
>>
>> I'm seeing a delay of almost exactly 10 minutes from when I remove an OSD or
>> node from the cluster until actual recovery IO begins. This is much different
>> behaviour than what I was used to previously in Nautilus, where recovery IO
>> would commence within seconds. Downed OSDs are reflected in ceph health
>> within a few seconds (as expected), and affected PGs show as undersized a few
>> seconds later (as expected). I guess this 10-minute delay may even be a
>> feature: it would prevent rebalancing if a node were accidentally rebooted
>> before setting recovery flags, for example. Just thought it was worth asking
>> in case it's a bug or something to look into more deeply.
>>
>> I've read through the OSD config and all of my recovery tunables look OK,
>> for example:
>> https://docs.ceph.com/en/latest/rados/configuration/osd-config-ref/
>>
>> [ceph: root@ /]# ceph config get osd osd_recovery_delay_start
>> 0.000000
>> [ceph: root@ /]# ceph config get osd osd_recovery_sleep
>> 0.000000
>> [ceph: root@ /]# ceph config get osd osd_recovery_sleep_hdd
>> 0.100000
>> [ceph: root@ /]# ceph config get osd osd_recovery_sleep_ssd
>> 0.000000
>> [ceph: root@ /]# ceph config get osd osd_recovery_sleep_hybrid
>> 0.025000
>>
>> Thanks in advance.
>>
>> Ngā mihi,
>>
>> Sean Matheny
>> HPC Cloud Platform DevOps Lead
>> New Zealand eScience Infrastructure (NeSI)
>>
>> e: sean.math...@nesi.org.nz
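A note for anyone reproducing this benchmark: you don't have to lower the interval cluster-wide. Marking the downed OSD out by hand starts recovery immediately for a single test (the OSD id 12 below is illustrative):

[ceph: root@ /]# ceph osd out 12   # mark the already-down OSD out; recovery begins without the 10-minute wait
[ceph: root@ /]# ceph osd in 12    # revert once the test is done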
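And for the accidental-reboot scenario mentioned above, the usual approach is the noout flag, which stops down OSDs from being marked out regardless of the interval:

[ceph: root@ /]# ceph osd set noout     # planned maintenance: down OSDs stay in, so no rebalancing
[ceph: root@ /]# ceph osd unset noout   # clear the flag once the node is back up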