On Wed, Feb 4, 2026 at 2:59 AM Anthony D'Atri <[email protected]>
wrote:

> Are these rear bay drives, hence the limit of 2? Or you might consider
> an M.2 AIC adapter card with bifurcation. M.2 enterprise SSDs are
> sunsetting, but for retrofits you should be able to find Micron 6450
> units.
>
> What’s your workload like?
>

On average 10-50 MB/s of writes, with spikes of up to a few hundred MB/s
in the evening/night; it went up to 1 GB/s during tests without a problem.
All of these are S3 workloads/tests.

I would have to check that on site; RM does not show it. However, we are
just about to migrate to new machines which have 4 NVMe slots, so I am
seriously considering moving the WAL/DB to NVMe. I am still a little
hesitant, though, since I am not sure this will solve the underlying
problem: why radosgw/S3 stops after some time when we set the crush
reweight of one failed disk to 0. We do the same thing on our HPC cluster,
where radosgw/S3 is not used, and we do not see this problem there. Also,
if we move the WAL/DB to NVMe and one NVMe fails, we would have to recover,
say, 10 OSDs instead of just one, which would take much longer (with users
unable to access S3 in the meantime).

---

My suspicion is that when we set the crush reweight of the failed disk to
0, writes on the other affected OSDs in that pool get blocked (because of
recovery) and some queue fills up, which then stalls/hangs radosgw...
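
Next time it happens I want to try to confirm that by capping recovery
before the reweight and looking at where the ops actually pile up; roughly
something like this (commands from memory, IDs/names are placeholders):

    # throttle recovery/backfill before setting the reweight to 0
    # (on newer releases with the mclock scheduler these values may be
    #  ignored unless recovery overrides are explicitly enabled)
    ceph config set osd osd_max_backfills 1
    ceph config set osd osd_recovery_max_active 1

    # when radosgw hangs: look for slow/blocked requests on the OSDs
    ceph health detail
    ceph daemon osd.<id> dump_ops_in_flight          # on the OSD's host

    # and see which RADOS ops the rgw itself is waiting on
    ceph daemon client.rgw.<name> objecter_requests  # on the rgw host

If the objecter shows requests stuck on PGs that are backfilling, that
would at least confirm the queue theory.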

Rok
