Perhaps not the root cause, but might be worth looking at if you haven't 
already:


  * How is the host power profile configured? E.g., I'm running HPE, where it is
    set under: System Configuration > BIOS/Platform Configuration (RBSU) >
    Power Management > Power Profile > Maximum Performance
  * In which C-state are your cores running? Use linux-cpupower or a similar
    tool to verify (see the example commands after this list). In my experience,
    if I don't configure anything, the cores sit in C6 99% of the time, whereas
    I want them in C0/C1.
  * In our cluster, correctly configuring both of the points above had a
    significant impact on network latency: it dropped from 0.124 ms to 0.043 ms
    (averaged over 12 s).

What about CPU wait states? Do you see any? To visualize them and correlate
with the HDDs, I personally like nmon
(http://kb.ictbanking.net/article.php?id=550&oid=1): press lower-case 'l' to
get a long-term graph of CPU usage. In my experience, blue blocks ('W' if
colour isn't enabled) are wait states, and ideally you want to see none at all.
A very occasional blue (W) block might be acceptable, but if there is more than
that, there is very likely hardware (HDDs would be my main suspect) noticeably
dragging down performance.

Pressing 'c' in nmon toggles a per-core overview, which gives a bit more
"visual" insight into how much time the cores spend in user/system/wait.
To correlate with disk activity, press 'd' to toggle a graph of read/write
activity on each disk ('h' shows the help).
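For reference, this is roughly how I use it (the interactive keys are the ones
above; the capture-mode flags are from memory, so verify with nmon -h):

    nmon                   # interactive mode; then press:
                           #   l - long-term CPU graph (watch for blue / 'W' wait blocks)
                           #   c - per-core user/system/wait breakdown
                           #   d - per-disk read/write activity
                           #   h - help
    nmon -f -s 2 -c 300    # or record ~10 minutes of samples to a .nmon file for later analysis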

Then maybe the mclock scheduler might help? Although I doubt it will be of much
help if the cluster is totally idle, as you said.
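If you want to try it anyway, here is a rough sketch, assuming a release where
mClock is the default scheduler (the option names below are the usual ones, but
please verify them against your version with "ceph config help <option>"):

    ceph config get osd osd_op_queue                           # confirm the scheduler (mclock_scheduler vs wpq)
    ceph config set osd osd_mclock_profile high_recovery_ops   # bias mClock towards recovery/backfill
    # ... and once the backfill has caught up, switch back (e.g. to the balanced profile):
    ceph config set osd osd_mclock_profile balanced

If the cluster is still on wpq, the classic knob would be something like
"ceph config set osd osd_max_backfills 4" instead.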
________________________________
From: Jan Kasprzak <[email protected]>
Sent: Tuesday, October 7, 2025 11:19
To: [email protected] <[email protected]>
Subject: [ceph-users] How to speed up the backfill on replicated pool?

        Hello, Ceph users,

on my new cluster, which I filled with testing data two weeks ago,
there are many remapped PGs in the backfill_wait state, probably as a result
of autoscaling the number of PGs per pool. But the recovery speed
is quite low, on the order of a few MB/s and < 10 obj/s according to ceph -s.

The cluster is otherwise idle, with no client traffic after the initial import,
so I wonder why the backfill does not progress faster. Also, it seems like
more PGs are getting remapped as existing ones get successfully backfilled
- the percentage of misplaced objects has been steady at around 6 % for the
last two weeks.

The PGs waiting for backfill all belong to the biggest pool I have
according to "ceph pg dump | grep backfill", no surprise here.
The pool has 229 TB of data and currently 128 PGs. It is replicated
with k=4 m=2. The second biggest pool has only 23 TB of data:

rados df
POOL_NAME               USED   OBJECTS  CLONES    COPIES  MISSING_ON_PRIMARY  UNFOUND  DEGRADED  RD_OPS       RD     WR_OPS       WR  USED COMPR  UNDER COMPR
pool_with_backfill   229 TiB  10086940       0  60521640                   0        0         0       0      0 B   72545009   54 TiB         0 B          0 B
second_biggest_pool   23 TiB   1153174       0   6919044                   0        0         0       0      0 B   38506397   16 TiB         0 B          0 B
[...]

I tried "ceph osd pool force-backfill $pool", which helped to speed
things up a bit, but it still runs at 50-200 MB/s and 4-20 obj/s.
The initial data import ran at around 600 MB/s.

Is this normal, or can I speed the recovery up somehow?

Output of ceph -s:

  cluster:
    id:     ...
    health: HEALTH_WARN
            2 large omap objects

  services:
    mon: 3 daemons, quorum istor11,istor21,istor31 (age 13d)
    mgr: istor31(active, since 3w), standbys: istor21, istor11
    osd: 36 osds: 36 up (since 2w), 36 in (since 3w); 14 remapped pgs

  data:
    pools:   45 pools, 1505 pgs
    objects: 13.39M objects, 198 TiB
    usage:   303 TiB used, 421 TiB / 724 TiB avail
    pgs:     5335074/80345832 objects misplaced (6.640%)
             1449 active+clean
             34   active+clean+scrubbing
             11   active+remapped+backfill_wait+forced_backfill
             8    active+clean+scrubbing+deep
             2    active+remapped+forced_backfill
             1    active+remapped+backfilling+forced_backfill

  io:
    recovery: 69 MiB/s, 4 objects/s

The OSDs are HDD-based with metadata on NVMe, 4 OSDs per node,
and all the nodes have load average somewhere between 0.3 and 0.6.

Thanks!

-Yenya

--
| Jan "Yenya" Kasprzak <kas at {fi.muni.cz - work | yenya.net - private}> |
| https://www.fi.muni.cz/~kas/                        GPG: 4096R/A45477D5 |
    We all agree on the necessity of compromise. We just can't agree on
    when it's necessary to compromise.                     --Larry Wall
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
