[ceph-users] Why does recovering objects take much longer than the outage that caused them?

Niklas Hambüchen Fri, 19 Sep 2025 04:24:23 -0700

I noticed that for my clusters, even a short 5-minute network outage or 
single-host reboot can cause


    pgs:     5586988/366684639 objects misplaced (1.524%)

which at the speed of

    recovery: 2.2 GiB/s, 676 objects/s

can take hours to recover.
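
As a rough sanity check of the figures above, here is a minimal sketch of the arithmetic. It assumes the recovery rate stays constant at the reported value, which it probably does not in practice, and the variable and function names are only illustrative:

```python
# Figures copied from the status output quoted above.
misplaced_objects = 5_586_988
total_objects = 366_684_639
recovery_objects_per_s = 676
recovery_gib_per_s = 2.2


def recovery_estimate_hours(objects, objects_per_s):
    """Estimated recovery time in hours, assuming a constant object rate."""
    return objects / objects_per_s / 3600


hours = recovery_estimate_hours(misplaced_objects, recovery_objects_per_s)
misplaced_pct = 100 * misplaced_objects / total_objects
# Average object size implied by the two reported recovery rates.
avg_object_mib = recovery_gib_per_s * 1024 / recovery_objects_per_s

print(f"misplaced: {misplaced_pct:.3f}%")
print(f"estimated recovery time: {hours:.1f} h")
print(f"implied average object size: {avg_object_mib:.1f} MiB")
```

With these numbers the estimate comes out at a bit over two hours for a 5-minute outage.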

I don't understand how this can be. If the outage is that short, how can 
rebalancing take this long?

I'm using Ceph 19.2.2 on HDDs, with SSDs as the BlueStore "db" devices.
Is this perhaps because new files are written to the HDDs linearly (fast), 
while recovery seeks around on the HDDs in random order (slow)?

In any case, this asymmetry is quite annoying.
Can anything be done about it?

Thanks!
Niklas
