I will assume you are running the default mclock scheduler and not wpq. I'm not too familiar with tuning mclock settings, but these are the docs to look at:
https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#recovery-backfill-options
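In case it helps, here is roughly how I would check and adjust the settings I mention below, from the CLI. This is only a sketch, assuming a recent Quincy/Reef-era release with the centralized config; double-check the option names against the docs above, and treat the values as examples, not recommendations:

  # confirm which scheduler the OSDs are actually using
  # (recent releases default to mclock_scheduler)
  ceph config get osd osd_op_queue

  # try the recovery-heavy mclock profile; revert later with
  # "ceph config rm osd osd_mclock_profile"
  ceph config set osd osd_mclock_profile high_recovery_ops

  # with mclock, raising osd_max_backfills is ignored unless you also
  # allow overriding the recovery limits (see the mclock docs above)
  ceph config set osd osd_mclock_override_recovery_settings true
  ceph config set osd osd_max_backfills 4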
osd_max_backfills is set to 1 by default, and it is the first thing I would tune if you want faster backfilling. I would look at this setting before digging into the various knobs mclock provides:
https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#confval-osd_mclock_profile

I use wpq, and there I have two main levers for backfilling:

  - osd_max_backfills: higher means faster backfilling
  - osd_recovery_sleep (and the other sleep settings): throttles recovery ops

mclock does not use the sleep configs, so I'm not sure which of its knobs to reach for instead, but the docs above have some good options to tweak. I would experiment with the different mclock profiles to see whether one of them speeds up backfilling, for example high_recovery_ops:
https://docs.ceph.com/en/latest/rados/configuration/mclock-config-ref/#high-recovery-ops

On Tue, Oct 7, 2025 at 2:20 AM Jan Kasprzak <[email protected]> wrote:

> Hello, Ceph users,
>
> on my new cluster, which I filled with testing data two weeks ago,
> there are many remapped PGs in the backfill_wait state, probably as a
> result of autoscaling the number of PGs per pool. But the recovery speed
> is quite low, on the order of a few MB/s and < 10 obj/s according to ceph -s.
>
> The cluster is otherwise idle, with no client traffic after the initial
> import, so I wonder why the backfill does not progress faster. Also, it
> seems like more PGs are getting remapped as existing ones are successfully
> backfilled - the percentage of misplaced objects has stayed at around 6 %
> for the last two weeks.
>
> The PGs waiting for backfill all belong to the biggest pool I have,
> according to "ceph pg dump | grep backfill", no surprise here.
> The pool has 229 TB of data and currently 128 PGs. It is erasure-coded
> with k=4, m=2. The second biggest pool has only 23 TB of data:
>
> rados df
> POOL_NAME            USED     OBJECTS   CLONES  COPIES    MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS  RD   WR_OPS    WR      USED COMPR  UNDER COMPR
> pool_with_backfill   229 TiB  10086940  0       60521640  0                   0        0         0       0 B  72545009  54 TiB  0 B         0 B
> second_biggest_pool  23 TiB   1153174   0       6919044   0                   0        0         0       0 B  38506397  16 TiB  0 B         0 B
> [...]
>
> I tried "ceph osd pool force-backfill $pool"; it helped to speed things
> up a bit, but it still runs at 50-200 MB/s and 4-20 obj/s.
> The initial data import ran at around 600 MB/s.
>
> Is this normal, or can I speed up the recovery somehow?
>
> Output of ceph -s:
>
>   cluster:
>     id:     ...
>     health: HEALTH_WARN
>             2 large omap objects
>
>   services:
>     mon: 3 daemons, quorum istor11,istor21,istor31 (age 13d)
>     mgr: istor31(active, since 3w), standbys: istor21, istor11
>     osd: 36 osds: 36 up (since 2w), 36 in (since 3w); 14 remapped pgs
>
>   data:
>     pools:   45 pools, 1505 pgs
>     objects: 13.39M objects, 198 TiB
>     usage:   303 TiB used, 421 TiB / 724 TiB avail
>     pgs:     5335074/80345832 objects misplaced (6.640%)
>              1449 active+clean
>                34 active+clean+scrubbing
>                11 active+remapped+backfill_wait+forced_backfill
>                 8 active+clean+scrubbing+deep
>                 2 active+remapped+forced_backfill
>                 1 active+remapped+backfilling+forced_backfill
>
>   io:
>     recovery: 69 MiB/s, 4 objects/s
>
> The OSDs are HDD-based with metadata on NVMe, 4 OSDs per node,
> and all the nodes have a load average somewhere between 0.3 and 0.6.
>
> Thanks!
>
> -Yenya
>
> --
> | Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
> | https://www.fi.muni.cz/~kas/                       GPG: 4096R/A45477D5 |
> We all agree on the necessity of compromise. We just can't agree on
> when it's necessary to compromise.
>                                                  --Larry Wall
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
