> Hi,
> Thanks Anthony - as always a very useful and comprehensive response
Most welcome.
> yes, there were only 139 OSDs and, indeed, raw capacity increased
> I also noticed that the "max available" column from "ceph df" is getting
> higher (177 TB), so it seems the capacity is being added
Aye, as expected.
> Too many MDS daemons are due to the fact that I have over 150 CephFS clients
> and thought that deploying a daemon on each host for each filesystem would
> provide better performance - was I wrong ?
MDS strategy is nuanced, but I think deploying them redundantly on each
back-end node might not give you much. The workload AIUI is more a function
of the number and size of files.
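If you do decide to trim, the active MDS count is set per filesystem, and the
orchestrator placement follows from the mds service spec. A rough sketch, with
a hypothetical filesystem name:

# ceph fs set fs_hdd max_mds 2
# ceph fs status fs_hdd

`ceph fs status` will show which daemons remain active vs. standby.
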
> The high number of remapped PGs is due to improperly using the
> osd.all-available-devices unmanaged command, so adding the new drives
> triggered automatically detecting and adding them
Yep, I typically suggest setting OSD services to unmanaged when not actively
using them.
> I am not sure what I did wrong though - see below output from "ceph orch ls"
> before adding the drive
> Shouldn't setting it like that prevent automatic discovery ?
You do have three other OSD service specs; at least one of them likely matched
the new drives and took action.
> osd.all-available-devices   0  -        4w  <unmanaged>
> osd.hdd_osds                72  10m ago  5w  *
> osd.nvme_osds               25  10m ago  5w  *
> osd.ssd_osds                84  10m ago  3w  ceph-host-1
>
`ceph orch ls --export` will show the full YAML for each service spec, including
the placement and device filters it uses.
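If you want to freeze those specs so that future drives aren't grabbed
automatically, one approach (a sketch; adjust file and service names to yours)
is to export, edit, and re-apply:

# ceph orch ls osd --export > osd-specs.yaml
(add "unmanaged: true" to each spec you want frozen)
# ceph orch apply -i osd-specs.yaml
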
> Resources mentioned are very useful
> running upmap-remapped.py did bring the number of PGs to be remapped close
> to zero
It's a super, super useful tool. Letting the balancer do the work incrementally
helps avert unexpected OSD fullness issues, and if a problem arises, you're a
lot closer to HEALTH_OK.
> 2. upmap-remapped.py | sh
Sometimes 2-3 runs are needed for full effect.
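If you want to eyeball what it will do before committing (a minimal sketch,
assuming the script is in the current directory), capture the output first; it
should just be a list of `ceph osd pg-upmap-items` / `ceph osd rm-pg-upmap-items`
commands:

# ./upmap-remapped.py > upmaps.sh
# less upmaps.sh
# sh upmaps.sh
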
>
> 3. change target_max_misplaced_ratio to a higher number than default 0.005
> (since we want to rebalance faster and client performance is not a huge issue
> )
>
> 4. enable balancer
>
> 5.wait
>
> Doing it like this will, eventually, increase the number of misplaced PGs
> until it is higher than the ratio, when, I guess, the balancer stops
> ( "optimize_result": "Too many objects (0.115742 > 0.090000) are misplaced;
> try again later" )
Exactly.
> Should I repeat the process when the number of objects misplaced is higher
> than the ratio or what is the proper way of doing it ?
As the backfill progresses and the misplaced percentage drops, the balancer
will kick in with another increment.
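For reference, the knobs involved look something like this (0.05 is only an
example value):

# ceph config set mgr target_max_misplaced_ratio 0.05
# ceph balancer on
# ceph balancer status
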
>
> Steven
>
>
> On Sun, 31 Aug 2025 at 11:08, Anthony D'Atri <[email protected]
> <mailto:[email protected]>> wrote:
>>
>>
>>> On Aug 31, 2025, at 4:15 AM, Steven Vacaroaia <[email protected]
>>> <mailto:[email protected]>> wrote:
>>>
>>> Hi,
>>>
>>> I have added 42 x 18TB HDD disks ( 6 on each of the 7 servers )
>>
>> The ultimate answer to Ceph, the cluster, and everything!
>>
>>> My expectation was that the pools configured to use "hdd_class" would
>>> have their capacity increased ( e.g. default.rgw.buckets.data, which
>>> uses an EC 4+2 pool for data )
>>
>> First, did the raw capacity increase when you added these drives?
>>
>>> --- RAW STORAGE ---
>>> CLASS SIZE AVAIL USED RAW USED %RAW USED
>>> hdd_class 1.4 PiB 814 TiB 579 TiB 579 TiB 41.54
>>
>> Was the number of OSDs previously 139?
>>
>>> It seems it is not happening ...yet ?!
>>> Is it because the peering is still going ?
>>
>> Ceph nomenclature can be mystifying at first. And sometimes at thirteenth.
>>
>> Peering is daemons checking in with each other to ensure they’re in
>> agreement.
>>
>> I think you mean backfill / balancing.
>>
>> The available space reported by “ceph df” for a *pool* is a function of:
>>
>> * Raw space available in the associated CRUSH rule’s device class (or if the
>> rule isn’t ideal, all device classes)
>> * The cluster’s three full ratios # ceph osd dump | grep ratio
>> * The fullness of the single most-full OSD in the device class
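>>
>> For example, the full ratios (these values are the stock defaults; yours may
>> differ):
>>
>> # ceph osd dump | grep ratio
>> full_ratio 0.95
>> backfillfull_ratio 0.9
>> nearfull_ratio 0.85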
>>
>> BTW I learned only yesterday that you can restrict `ceph osd df` by
>> specifying a device class, so try running
>>
>> `ceph osd df hdd_class | tail -10`
>>
>> Notably, this will show you the min/max variance among OSDs of just that
>> device class, and the standard deviation.
>> When you have multiple OSD sizes, these figures are much less useful when
>> calculated across the whole cluster by “ceph osd df”.
>>
>> # ceph osd df hdd
>> ...
>> 318  hdd  18.53969  1.00000  19 TiB   15 TiB   15 TiB   15 KiB  67 GiB  3.6 TiB  80.79  1.04  127  up
>> 319  hdd  18.53969  1.00000  19 TiB   15 TiB   14 TiB  936 KiB  60 GiB  3.7 TiB  79.87  1.03  129  up
>> 320  hdd  18.53969  1.00000  19 TiB   15 TiB   14 TiB   33 KiB  72 GiB  3.7 TiB  79.99  1.03  129  up
>>  30  hdd  18.53969  1.00000  19 TiB  3.3 TiB  2.9 TiB  129 KiB  11 GiB   15 TiB  17.55  0.23   26  up
>>                      TOTAL   5.4 PiB  4.2 PiB  4.1 PiB  186 MiB  17 TiB  1.2 PiB  77.81
>> MIN/MAX VAR: 0.23/1.09  STDDEV: 4.39
>>
>> You can even run this for a specific OSD so you don’t have to get creative
>> with an egrep regex or exercise your pattern-matching skills, though the
>> summary values naturally aren’t useful.
>>
>> # ceph osd df osd.30
>> ID  CLASS  WEIGHT    REWEIGHT  SIZE    RAW USE  DATA     OMAP     META    AVAIL   %USE   VAR   PGS  STATUS
>> 30  hdd    18.53969   1.00000  19 TiB  3.3 TiB  2.9 TiB  129 KiB  12 GiB  15 TiB  17.55  1.00   26  up
>>                       TOTAL    19 TiB  3.3 TiB  2.9 TiB  130 KiB  12 GiB  15 TiB  17.55
>> MIN/MAX VAR: 1.00/1.00  STDDEV: 0
>>
>> Here there’s a wide variation among the hdd OSDs because osd.30 had been
>> down for a while and was recently restarted due to a host reboot, so it’s
>> slowly filling with data.
>>
>>
>>> ssd_class 6.98630
>>
>> That seems like an unusual size; what are these? Are they SAN LUNs?
>>
>>> Below are outputs from
>>> ceph -s
>>> ceph df
>>> ceph osd df tree
>>
>> Thanks for providing the needful up front.
>>
>>> cluster:
>>> id: 0cfa836d-68b5-11f0-90bf-7cc2558e5ce8
>>> health: HEALTH_WARN
>>> 1 OSD(s) experiencing slow operations in BlueStore
>>
>> This warning state by default persists for a long time after it clears; I’m
>> not sure why, but I like to set this lower:
>>
>> # ceph config dump | grep blue
>> global  advanced  bluestore_slow_ops_warn_lifetime  300
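>>
>> To set it, something along the lines of:
>>
>> # ceph config set global bluestore_slow_ops_warn_lifetime 300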
>>
>>
>>
>>> 1 failed cephadm daemon(s)
>>> 39 daemons have recently crashed
>>
>> That’s a bit worrisome, what happened?
>>
>> `ceph crash ls`
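>>
>> And for details on a given crash, plus clearing the warning once you've
>> reviewed them, something like:
>>
>> # ceph crash info <crash-id>
>> # ceph crash archive-all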
>>
>>
>>> 569 pgs not deep-scrubbed in time
>>> 2609 pgs not scrubbed in time
>>
>> Scrubs don’t happen during recovery; once it completes, these should catch up.
>>
>>> services:
>>> mon: 5 daemons, quorum
>>> ceph-host-1,ceph-host-2,ceph-host-3,ceph-host-7,ceph-host-6 (age 2m)
>>> mgr: ceph-host-1.lqlece(active, since 18h), standbys: ceph-host-2.suiuxi
>>
>> I’m paranoid and would suggest deploying at least one more mgr.
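>>
>> e.g. a count of three, letting the orchestrator pick the hosts:
>>
>> # ceph orch apply mgr 3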
>>
>>> mds: 19/19 daemons up, 7 standby
>>
>> Yikes why so many?
>>
>>> osd: 181 osds: 181 up (since 4d), 181 in (since 14h)
>>
>> What happened 14 hours ago? It seems unusual for these durations to vary so
>> much.
>>
>>> 2770 remapped pgs
>>
>> That’s an indication of balancing or backfill in progress.
>>
>>> flags noautoscale
>>>
>>> data:
>>> volumes: 4/4 healthy
>>> pools: 16 pools, 7137 pgs
>>> objects: 256.82M objects, 484 TiB
>>> usage: 742 TiB used, 1.5 PiB / 2.2 PiB avail
>>> pgs: 575889786/1468742421 objects misplaced (39.210%)
>>
>> 39% is a lot of misplaced objects; this is consistent with you having
>> successfully added those OSDs.
>> Here is where the factor of the most-full OSD comes in.
>>
>> Technically backfill is a subset of recovery, but in practice people usually
>> think of them in these terms:
>>
>> Recovery: PGs healing from OSDs having failed or been down
>> Backfill: Rebalancing of data due to topology changes, including adjusted
>> CRUSH rules, expansion, etc.
>>
>>
>>> 4247 active+clean
>>> 2763 active+remapped+backfill_wait
>>> 77 active+clean+scrubbing
>>> 43 active+clean+scrubbing+deep
>>> 7 active+remapped+backfilling
>>
>> Configuration options throttle how much backfill goes on in parallel to keep
>> the cluster from DoSing itself. Here I suspect that you’re running a recent
>> release with the notorious mclock op scheduling shortcomings, which is a
>> tangent.
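>>
>> If so, one lever for speeding backfill is the mclock profile, e.g.:
>>
>> # ceph config set osd osd_mclock_profile high_recovery_ops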
>>
>>
>> I suggest checking out these two resources re upmap-remapped.py:
>>
>> https://ceph.io/assets/pdfs/events/2024/ceph-days-nyc/Mastering%20Ceph%20Operations%20with%20Upmap.pdf
>> https://community.ibm.com/community/user/blogs/anthony-datri/2025/07/30/gracefully-expanding-your-ibm-storage-ceph
>>
>>
>>
>> This tool, in conjunction with the balancer module, does the backfill more
>> gracefully: data moves in controlled increments, so you avoid surprise OSD
>> fullness and stay closer to HEALTH_OK throughout.
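>>
>> You can see the upmap exceptions it installs (which the balancer then
>> gradually removes, moving data in increments) with something like:
>>
>> # ceph osd dump | grep pg_upmap_items | wc -l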
>>
>>
>>> --- RAW STORAGE ---
>>> CLASS SIZE AVAIL USED RAW USED %RAW USED
>>> hdd_class 1.4 PiB 814 TiB 579 TiB 579 TiB 41.54
>>
>> I hope the formatting below comes through; it makes a table a lot easier to
>> read.
>>
>>>
>>> --- POOLS ---
>>> POOL                        ID   PGS  STORED   OBJECTS   USED     %USED  MAX AVAIL
>>> .mgr                         1     1  277 MiB       71   831 MiB      0     93 TiB
>>> .rgw.root                    2    32  1.6 KiB        6    72 KiB      0     93 TiB
>>> default.rgw.log              3    32   63 KiB      210   972 KiB      0     93 TiB
>>> default.rgw.control          4    32      0 B        8       0 B      0     93 TiB
>>> default.rgw.meta             5    32  1.4 KiB        8    72 KiB      0     93 TiB
>>> default.rgw.buckets.data     6  2048  289 TiB  100.35M   434 TiB  68.04    136 TiB
>>> default.rgw.buckets.index    7  1024  5.4 GiB      521    16 GiB      0     93 TiB
>>> default.rgw.buckets.non-ec   8    32    551 B        1    13 KiB      0     93 TiB
>>> metadata_fs_ssd              9   128  6.1 GiB   15.69M    18 GiB      0     93 TiB
>>> ssd_ec_project              10  1024  108 TiB   44.46M   162 TiB  29.39    260 TiB
>>> metadata_fs_hdd             11   128  9.9 GiB    8.38M    30 GiB   0.01     93 TiB
>>> hdd_ec_archive              12  1024   90 TiB   87.94M   135 TiB  39.79    136 TiB
>>> metadata_fs_nvme            13    32  260 MiB      177   780 MiB      0     93 TiB
>>> metadata_fs_ssd_rep         14    32   17 MiB      103    51 MiB      0     93 TiB
>>> ssd_rep_projects            15  1024    132 B        1    12 KiB      0    130 TiB
>>> nvme_rep_projects           16   512  3.5 KiB       30   336 KiB      0     93 TiB
>>
>> Do you have multiple EC RBD pools and/or multiple CephFSes?
>>
>>
>>> ID   CLASS      WEIGHT      REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
>>>  -1             2272.93311         -  2.2 PiB  742 TiB  731 TiB   63 GiB  3.1 TiB  1.5 PiB  32.64  1.00    -          root default
>>>  -7              254.54175         -  255 TiB  104 TiB  102 TiB  9.6 GiB  455 GiB  151 TiB  40.78  1.25    -          host ceph-host-1
>>> ...
>>> 137  hdd_class    18.43300   1.00000   18 TiB   14 TiB   14 TiB    6 KiB   50 GiB  4.3 TiB  76.47  2.34  449  up      osd.137
>>> 152  hdd_class    18.19040   1.00000   18 TiB  241 GiB  239 GiB   10 KiB  1.8 GiB   18 TiB   1.29  0.04    7  up      osd.152
>>>                              TOTAL    2.2 PiB  742 TiB  731 TiB   63 GiB  3.1 TiB  1.5 PiB  32.64
>>> MIN/MAX VAR: 0.00/2.46  STDDEV: 26.17
>>
>> There ya go. osd.152 must be one of the new OSDs. Note that only 7 PGs are
>> currently resident and that it holds just 4% of the average amount of data
>> on the entire set of OSDs.
>> Run the focused `osd df` above and that number will change slightly.
>>
>> Here is your least full hdd_class OSD:
>>
>> 151  hdd_class  18.19040  1.00000  18 TiB   38 GiB   37 GiB   6 KiB  1.1 GiB  18 TiB  0.20  0.01  1  up  osd.151
>>
>> And the most full of the new ones:
>>
>> 180  hdd_class  18.19040  1.00000  18 TiB  198 GiB  197 GiB  10 KiB  1.7 GiB  18 TiB  1.07  0.03  5  up  osd.180
>>
>>
>> I suspect the spread between the least- and most-full of these is due to the
>> bolus of backfill and/or the balancer not yet being active. Using
>> upmap-remapped as described above can help avoid this kind of overload.
>>
>> In a nutshell, the available space will gradually increase as data is
>> backfilled, especially if you have the balancer enabled.
>>
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]