[ceph-users] Re: Erasure Code with Autoscaler and Backfill_toofull

2024-03-27 Thread Alexander E. Patrakov
Hello Daniel, The situation is not as bad as you described. It is just PG_BACKFILL_FULL, which means that if the backfills proceed, one OSD will become backfillfull (i.e., over 90% full by default). This is definitely something that the balancer should be able to resolve if it were allowed to act.
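To make the PG_BACKFILL_FULL condition concrete, here is a minimal sketch of the arithmetic, not Ceph's actual implementation. The OSD names, sizes, and the helper `would_be_backfillfull` are hypothetical; only the 0.90 default threshold comes from the message above.

```python
# Simplified illustration: an OSD is flagged when its current usage plus the
# data still to be backfilled onto it would exceed the backfillfull ratio.
BACKFILLFULL_RATIO = 0.90  # default threshold mentioned in the thread


def would_be_backfillfull(used, incoming, size, ratio=BACKFILLFULL_RATIO):
    """Return True if (used + incoming) / size would exceed the ratio."""
    return (used + incoming) / size > ratio


# Hypothetical OSDs: (used, pending backfill, capacity), all in GiB.
osds = {
    "osd.0": (700, 100, 1000),
    "osd.1": (850, 80, 1000),
    "osd.2": (600, 50, 1000),
}

flagged = [
    name
    for name, (used, incoming, size) in osds.items()
    if would_be_backfillfull(used, incoming, size)
]
print(flagged)
```

In this made-up example only osd.1 is flagged, since it would end up at 93% once its backfill completes. No OSD is over the threshold yet, which is why the warning is less severe than an OSD that is already backfillfull.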

2024-03-27 Thread Daniel Williams
The backfilling was caused by decommissioning an old host and moving a bunch of OSDs to new machines. The balancer has not been activated since the backfill started and the OSDs were moved between hosts. Busy OSD level? Do you mean fullness? The cluster is relatively unused in terms of busyness. # ceph

2024-03-27 Thread David C.
Hi Daniel, Changing pg_num when some OSD is almost full is not a good strategy, and can even be dangerous. What is causing this backfilling? Loss of an OSD? The balancer? Something else? What is the least busy OSD level (sort -nrk17)? Is the balancer activated (upmap)? Once the situation stabilizes, it becomes