Perhaps not the root cause, but might be worth looking at if you haven't already:
* How is the host power profile configured? E.g., I'm running HPE: System Configuration > BIOS/Platform Configuration (RBSU) > Power Management > Power Profile > Maximum Performance

* In which C-state are your cores running? Use linux-cpupower or a similar tool to verify (see the cpupower sketch below). In my experience, if I don't configure anything, 99% of the time the cores sit in C6, and I want them in C0/C1.

* I noticed in our cluster that getting both bullet points above right had a significant impact on network latency: it brought the average down from 0.124 ms to 0.043 ms (measured over 12 s).

What about CPU wait states? Do you see any? To visualize them and correlate them with the HDDs, I personally like nmon (http://kb.ictbanking.net/article.php?id=550&oid=1): press lower-case 'l' to get a long-term graph of CPU usage. In my experience, blue blocks ('W' if color isn't enabled) are wait states, and ideally you want to see none at all. A very occasional blue (W) block might be acceptable, but anything more than that very likely means hardware (HDDs would be my main suspect) is noticeably dragging down performance.

Pressing 'c' in nmon toggles a per-core overview, which gives a bit more "visual" insight into how much time the cores spend in user/system/wait. You can also press 'd' to toggle disk stats, i.e. a graph of R/W activity on each disk, to correlate CPU wait with disk activity ('h' shows the help). nmon can also record to a file instead of running interactively (see the nmon sketch below).

Then maybe the mclock scheduler might help (see the config sketch below)? Although I doubt it'll be of much use if the cluster is totally idle like you said.
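To check the C-states mentioned above, something along these lines should work with linux-cpupower. This is a rough sketch, not a recipe: the available states and a sensible latency threshold depend on your CPU and distro, and idle-set needs root.

  # List the idle states (C-states) the driver exposes, plus usage counters
  cpupower idle-info

  # Per-core residency in each C-state over a short sampling window
  cpupower monitor

  # Optionally keep the cores out of deep C-states at runtime by disabling every
  # idle state whose wakeup latency exceeds a threshold (10 us is just an example)
  cpupower idle-set --disable-by-latency 10

  # The same information is available via sysfs
  grep . /sys/devices/system/cpu/cpu0/cpuidle/state*/name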
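If you'd rather let nmon record data over a longer window instead of watching it interactively, it also has a capture mode (interval and count below are just example values; the output lands in a <hostname>_<date>_<time>.nmon file you can post-process later):

  # One sample every 10 seconds, 360 samples (~1 hour), written to a .nmon file
  nmon -f -s 10 -c 360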
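On the mclock side, a hedged sketch, assuming you're on a release where mclock is the default OSD scheduler (Quincy or later) - please check the option names against the docs for your exact version:

  # Confirm the OSDs are actually running the mclock scheduler
  ceph config show osd.0 osd_op_queue

  # Bias the QoS profile towards recovery/backfill while there is no client load
  ceph config set osd osd_mclock_profile high_recovery_ops

  # Drop the override again once the backfill has finished
  ceph config rm osd osd_mclock_profile

If I remember correctly, while mclock is active the classic knobs like osd_max_backfills and osd_recovery_max_active cannot be changed at runtime unless osd_mclock_override_recovery_settings is set to true.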
________________________________
From: Jan Kasprzak <[email protected]>
Sent: Tuesday, October 7, 2025 11:19
To: [email protected] <[email protected]>
Subject: [ceph-users] How to speed up the backfill on replicated pool?

Hello, Ceph users,

on my new cluster, which I filled with testing data two weeks ago, there are many remapped PGs in backfill_wait state, probably as a result of autoscaling the number of PGs per pool. But the recovery speed is quite low, on the order of a few MB/s and < 10 obj/s according to ceph -s. The cluster is otherwise idle, with no client traffic after the initial import, so I wonder why the backfill does not progress faster. Also, it seems that more PGs are getting remapped as existing ones get successfully backfilled - the percentage of misplaced objects has been steady at around 6 % for the last two weeks.

The PGs waiting for backfill all belong to the biggest pool I have according to "ceph pg dump | grep backfill", no surprise here. The pool has 229 TB of data and currently 128 PGs. It is replicated with k=4 m=2. The second biggest pool has only 23 TB of data:

rados df
POOL_NAME            USED     OBJECTS   CLONES  COPIES    MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS  RD   WR_OPS    WR      USED COMPR  UNDER COMPR
pool_with_backfill   229 TiB  10086940  0       60521640  0                   0        0         0       0 B  72545009  54 TiB  0 B         0 B
second_biggest_pool   23 TiB   1153174  0        6919044  0                   0        0         0       0 B  38506397  16 TiB  0 B         0 B
[...]

I tried "ceph osd pool force-backfill $pool"; it helped to speed things up a bit, but it still runs at 50-200 MB/s and 4-20 obj/s. The initial data import ran at around 600 MB/s. Is this normal, or can I speed the recovery up somehow?

Output of ceph -s:

  cluster:
    id:     ...
    health: HEALTH_WARN
            2 large omap objects

  services:
    mon: 3 daemons, quorum istor11,istor21,istor31 (age 13d)
    mgr: istor31(active, since 3w), standbys: istor21, istor11
    osd: 36 osds: 36 up (since 2w), 36 in (since 3w); 14 remapped pgs

  data:
    pools:   45 pools, 1505 pgs
    objects: 13.39M objects, 198 TiB
    usage:   303 TiB used, 421 TiB / 724 TiB avail
    pgs:     5335074/80345832 objects misplaced (6.640%)
             1449 active+clean
             34   active+clean+scrubbing
             11   active+remapped+backfill_wait+forced_backfill
             8    active+clean+scrubbing+deep
             2    active+remapped+forced_backfill
             1    active+remapped+backfilling+forced_backfill

  io:
    recovery: 69 MiB/s, 4 objects/s

The OSDs are HDD-based with metadata on NVMe, 4 OSDs per node, and all the nodes have a load average somewhere between 0.3 and 0.6.

Thanks!

-Yenya

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| https://www.fi.muni.cz/~kas/                    GPG: 4096R/A45477D5    |
We all agree on the necessity of compromise. We just can't agree on
when it's necessary to compromise. --Larry Wall
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
