Thanks again, Anthony. I guess I wrongly assumed that "osd.all-available-devices" covers all available devices; it seems I should have targeted the specific services (like osd.hdd_osds) instead.
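For my own reference, I take it the safer pattern would have been an explicit unmanaged flag on every OSD spec, something roughly like the sketch below (the service_id and the rotational filter are only illustrative, not my actual spec):

    service_type: osd
    service_id: hdd_osds
    placement:
      host_pattern: '*'
    unmanaged: true          # keep cephadm from consuming newly visible disks under this spec
    spec:
      data_devices:
        rotational: 1        # spinning disks only

applied with something like "ceph orch apply -i hdd_osds.yml", and flipped back to unmanaged: false only while intentionally adding drives.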
Going back to the capacity increase: I am trying to estimate / calculate the final capacity. Looking at the output of "ceph df" below, it would be greatly appreciated if you could confirm whether this is the correct interpretation for hdd_class:

"Out of a total capacity of 1.4 PiB, 579 TiB is currently used. The 579 TiB used is the sum of the data used by the pools configured to use hdd_class, e.g. default.rgw.buckets.data (434 TiB), hdd_ec_archive (135 TiB), etc. Therefore, when the backfilling / balancing is done (and assuming no more data is added to the pools), the pools' "MAX AVAIL" should be 814 TB."

Is the above correct?

Note: the "MAX AVAIL" is increasing; it is at 202 TB now.

Steven

--- RAW STORAGE ---
CLASS        SIZE     AVAIL    USED     RAW USED  %RAW USED
hdd_class    1.4 PiB  814 TiB  579 TiB   579 TiB      41.54
nvme_class   293 TiB  293 TiB  216 GiB   216 GiB       0.07
ssd_class    587 TiB  424 TiB  163 TiB   163 TiB      27.80
TOTAL        2.2 PiB  1.5 PiB  742 TiB   742 TiB      32.64

--- POOLS ---
POOL                        ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
.mgr                         1     1  277 MiB       71  831 MiB      0     93 TiB
.rgw.root                    2    32  1.6 KiB        6   72 KiB      0     93 TiB
default.rgw.log              3    32   63 KiB      210  972 KiB      0     93 TiB
default.rgw.control          4    32      0 B        8      0 B      0     93 TiB
default.rgw.meta             5    32  1.4 KiB        8   72 KiB      0     93 TiB
default.rgw.buckets.data     6  2048  289 TiB  100.35M  434 TiB  68.04    136 TiB
default.rgw.buckets.index    7  1024  5.4 GiB      521   16 GiB      0     93 TiB
default.rgw.buckets.non-ec   8    32    551 B        1   13 KiB      0     93 TiB
metadata_fs_ssd              9   128  6.1 GiB   15.69M   18 GiB      0     93 TiB
ssd_ec_project              10  1024  108 TiB   44.46M  162 TiB  29.39    260 TiB
metadata_fs_hdd             11   128  9.9 GiB    8.38M   30 GiB   0.01     93 TiB
hdd_ec_archive              12  1024   90 TiB   87.94M  135 TiB  39.79    136 TiB
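To spell out the back-of-the-envelope arithmetic behind that interpretation (numbers taken from the RAW STORAGE line above; this is only my reading of it, so please correct me if the reasoning is off):

    hdd_class SIZE   = 1.4 PiB
    hdd_class USED   = 579 TiB   (roughly 434 TiB + 135 TiB for the two data pools, plus index/metadata)
    hdd_class AVAIL  = 814 TiB   (SIZE minus USED, allowing for rounding; this is the figure I am reading as the eventual "MAX AVAIL")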
On Mon, 1 Sept 2025 at 17:07, Anthony D'Atri <a...@dreamsnake.net> wrote:
>
>
> Hi,
> Thanks Anthony - as always a very useful and comprehensive response
>
>
> Most welcome.
>
> yes, there were only 139 OSDs and, indeed, raw capacity increased
> I also noticed that the "MAX AVAIL" column from "ceph df" is getting
> higher (177 TB), so it seems the capacity is being added
>
>
> Aye, as expected.
>
> Too many MDS daemons are due to the fact that I have over 150 CephFS clients
> and thought that deploying a daemon on each host for each filesystem is going
> to provide better performance - was I wrong?
>
>
> MDS strategy is nuanced, but I think deploying them redundantly on each
> back-end node might not give you much. The workload AIUI is more a
> function of the number and size of files.
>
> The high number of remapped PGs is due to improperly using the
> osd.all-available-devices unmanaged command,
> so adding the new drives triggered automatically detecting and adding them
>
>
> Yep, I typically suggest setting OSD services to unmanaged when not
> actively using them.
>
> I am not sure what I did wrong though - see below output from "ceph orch ls"
> before adding the drives.
> Shouldn't setting it like that prevent automatic discovery?
>
>
> You do have three other OSD rules, at least one of them likely matched and
> took action.
>
>
> osd.all-available-devices    0   -         4w  <unmanaged>
> osd.hdd_osds                72   10m ago   5w  *
> osd.nvme_osds               25   10m ago   5w  *
> osd.ssd_osds                84   10m ago   3w  ceph-host-1
>
>
> `ceph orch ls --export`
>
> The resources mentioned are very useful;
> running upmap-remapped.py did bring the number of PGs to be remapped
> close to zero
>
>
> It's a super, super useful tool. Letting the balancer do the work
> incrementally helps deter unexpected OSD fullness issues, and if a problem
> arises, you're a lot closer to HEALTH_OK.
>
> 2. upmap-remapped.py | sh
>
>
> Sometimes 2-3 runs are needed for full effect.
>
>
> 3. change target_max_misplaced_ratio to a higher number than the default 0.005
> (since we want to rebalance faster and client performance is not a huge issue)
>
> 4. enable the balancer
>
> 5. wait
>
> Doing it like this will, eventually, increase the number of misplaced PGs
> until it is higher than the ratio, at which point, I guess, the balancer stops
> ( "optimize_result": "Too many objects (0.115742 > 0.090000) are
> misplaced; try again later" )
>
>
> Exactly.
>
> Should I repeat the process when the number of objects misplaced is higher
> than the ratio, or what is the proper way of doing it?
>
>
> As the backfill progresses and the misplaced percentage drops, the
> balancer will kick in with another increment.
>
>
> Steven
>
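[ Noting the sequence here for my own records. This is just how I pieced it together from the exchange above and from the upmap-remapped resources, so the ordering and the ratio value are illustrative rather than authoritative:

    ceph balancer off
    ./upmap-remapped.py | sh          # repeat 2-3 times until few or no upmaps are emitted
    ceph config set mgr target_max_misplaced_ratio 0.09
    ceph balancer on
    ceph balancer status              # watch "optimize_result" as the backfill progresses
]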
>
> On Sun, 31 Aug 2025 at 11:08, Anthony D'Atri <a...@dreamsnake.net> wrote:
>
>>
>>
>> On Aug 31, 2025, at 4:15 AM, Steven Vacaroaia <ste...@gmail.com> wrote:
>>
>> Hi,
>>
>> I have added 42 x 18TB HDD disks ( 6 on each of the 7 servers )
>>
>>
>> The ultimate answer to Ceph, the cluster, and everything!
>>
>> My expectation was that the pools configured to use "hdd_class" will
>> have their capacity increased ( e.g. default.rgw.buckets.data, which
>> uses an EC 4+2 pool for data )
>>
>>
>> First, did the raw capacity increase when you added these drives?
>>
>> --- RAW STORAGE ---
>> CLASS      SIZE     AVAIL    USED     RAW USED  %RAW USED
>> hdd_class  1.4 PiB  814 TiB  579 TiB   579 TiB      41.54
>>
>>
>> Was the number of OSDs previously 139?
>>
>> It seems it is not happening ...yet ?!
>> Is it because the peering is still going ?
>>
>>
>> Ceph nomenclature can be mystifying at first. And sometimes at thirteenth.
>>
>> Peering is daemons checking in with each other to ensure they're in agreement.
>>
>> I think you mean backfill / balancing.
>>
>> The available space reported by "ceph df" for a *pool* is a function of:
>>
>> * Raw space available in the associated CRUSH rule's device class (or, if
>>   the rule isn't ideal, all device classes)
>> * The cluster's three full ratios   # ceph osd dump | grep ratio
>> * The fullness of the single most-full OSD in the device class
>>
>> BTW I learned only yesterday that you can restrict `ceph osd df` by
>> specifying a device class, so try running
>>
>> `ceph osd df hdd_class | tail -10`
>>
>> Notably, this will show you the min/max variance among OSDs of just that
>> device class, and the standard deviation.
>> When you have multiple OSD sizes, these figures are much less useful when
>> calculated across the whole cluster by "ceph osd df"
>>
>> # ceph osd df hdd
>> ...
>> 318  hdd  18.53969  1.00000   19 TiB   15 TiB   15 TiB   15 KiB  67 GiB  3.6 TiB  80.79  1.04  127  up
>> 319  hdd  18.53969  1.00000   19 TiB   15 TiB   14 TiB  936 KiB  60 GiB  3.7 TiB  79.87  1.03  129  up
>> 320  hdd  18.53969  1.00000   19 TiB   15 TiB   14 TiB   33 KiB  72 GiB  3.7 TiB  79.99  1.03  129  up
>>  30  hdd  18.53969  1.00000   19 TiB  3.3 TiB  2.9 TiB  129 KiB  11 GiB   15 TiB  17.55  0.23   26  up
>>           TOTAL              5.4 PiB  4.2 PiB  4.1 PiB  186 MiB  17 TiB  1.2 PiB  77.81
>> MIN/MAX VAR: 0.23/1.09  STDDEV: 4.39
>>
>> You can even run this for a specific OSD so you don't have to get
>> creative with an egrep regex or exercise your pattern-matching skills,
>> though the summary values naturally aren't useful.
>>
>> # ceph osd df osd.30
>> ID  CLASS  WEIGHT    REWEIGHT  SIZE    RAW USE  DATA     OMAP     META    AVAIL   %USE   VAR   PGS  STATUS
>> 30  hdd    18.53969   1.00000  19 TiB  3.3 TiB  2.9 TiB  129 KiB  12 GiB  15 TiB  17.55  1.00   26  up
>>            TOTAL               19 TiB  3.3 TiB  2.9 TiB  130 KiB  12 GiB  15 TiB  17.55
>> MIN/MAX VAR: 1.00/1.00  STDDEV: 0
>>
>> Here there's a wide variation among the hdd OSDs because osd.30 had been
>> down for a while and was recently restarted due to a host reboot, so it's
>> slowly filling with data.
>>
>>
>> ssd_class  6.98630
>>
>>
>> That seems like an unusual size, what are these? Are they SAN LUNs?
>>
>> Below are outputs from
>> ceph -s
>> ceph df
>> ceph osd df tree
>>
>>
>> Thanks for providing the needful up front.
>>
>>   cluster:
>>     id:     0cfa836d-68b5-11f0-90bf-7cc2558e5ce8
>>     health: HEALTH_WARN
>>             1 OSD(s) experiencing slow operations in BlueStore
>>
>>
>> This warning state by default persists for a long time after it clears,
>> I'm not sure why, but I like to set this lower:
>>
>> # ceph config dump | grep blue
>> global   advanced   bluestore_slow_ops_warn_lifetime   300
>>
>>
>>             1 failed cephadm daemon(s)
>>             39 daemons have recently crashed
>>
>>
>> That's a bit worrisome, what happened?
>>
>> `ceph crash ls`
>>
>>
>>             569 pgs not deep-scrubbed in time
>>             2609 pgs not scrubbed in time
>>
>>
>> Scrubs don't happen during recovery; when complete, these should catch up.
>>
>>   services:
>>     mon: 5 daemons, quorum ceph-host-1,ceph-host-2,ceph-host-3,ceph-host-7,ceph-host-6 (age 2m)
>>     mgr: ceph-host-1.lqlece(active, since 18h), standbys: ceph-host-2.suiuxi
>>
>>
>> I'm paranoid and would suggest deploying at least one more mgr.
>>
>>     mds: 19/19 daemons up, 7 standby
>>
>>
>> Yikes, why so many?
>>
>>     osd: 181 osds: 181 up (since 4d), 181 in (since 14h)
>>
>>
>> What happened 14 hours ago? It seems unusual for these durations to vary so much.
>>
>>          2770 remapped pgs
>>
>>
>> That's an indication of balancing or backfill in progress.
>>
>>     flags noautoscale
>>
>>   data:
>>     volumes: 4/4 healthy
>>     pools:   16 pools, 7137 pgs
>>     objects: 256.82M objects, 484 TiB
>>     usage:   742 TiB used, 1.5 PiB / 2.2 PiB avail
>>     pgs:     575889786/1468742421 objects misplaced (39.210%)
>>
>>
>> 39% is a lot of misplaced objects; this would be consistent with you
>> having successfully added those OSDs.
>> Here is where the factor of the most-full OSD comes in.
>>
>> Technically backfill is a subset of recovery, but in practice people
>> usually think in these terms:
>>
>> Recovery: PGs healing from OSDs having failed or been down
>> Backfill: rebalancing of data due to topology changes, including adjusted
>> CRUSH rules, expansion, etc.
>>
>>
>>              4247 active+clean
>>              2763 active+remapped+backfill_wait
>>                77 active+clean+scrubbing
>>                43 active+clean+scrubbing+deep
>>                 7 active+remapped+backfilling
>>
>>
>> Configuration options throttle how much backfill goes on in parallel to
>> keep the cluster from DoSing itself. Here I suspect that you're running a
>> recent release with the notorious mclock op scheduling shortcomings, which
>> is a tangent.
>>
>>
>> I suggest checking out these two resources re upmap-remapped.py :
>>
>> https://ceph.io/assets/pdfs/events/2024/ceph-days-nyc/Mastering Ceph Operations with Upmap.pdf
>>
>> https://community.ibm.com/community/user/blogs/anthony-datri/2025/07/30/gracefully-expanding-your-ibm-storage-ceph
>>
>>
>> This tool, in conjunction with the balancer module, will do the backfill
>> more elegantly with various benefits.
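[ Side note for my own records, since backfill throttling under mclock came up above: the knobs I have seen referenced for letting recovery / backfill run faster are roughly the following. The profile name and values are examples only, not something I have applied:

    ceph config set osd osd_mclock_profile high_recovery_ops
    # or, to allow the classic backfill limits to be raised while mclock is in use:
    ceph config set osd osd_mclock_override_recovery_settings true
    ceph config set osd osd_max_backfills 2
]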
>>
>>
>> --- RAW STORAGE ---
>> CLASS      SIZE     AVAIL    USED     RAW USED  %RAW USED
>> hdd_class  1.4 PiB  814 TiB  579 TiB   579 TiB      41.54
>>
>>
>> I hope the formatting below comes through, makes it a lot easier to read
>> a table.
>>
>>
>> --- POOLS ---
>> POOL                        ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
>> .mgr                         1     1  277 MiB       71  831 MiB      0     93 TiB
>> .rgw.root                    2    32  1.6 KiB        6   72 KiB      0     93 TiB
>> default.rgw.log              3    32   63 KiB      210  972 KiB      0     93 TiB
>> default.rgw.control          4    32      0 B        8      0 B      0     93 TiB
>> default.rgw.meta             5    32  1.4 KiB        8   72 KiB      0     93 TiB
>> default.rgw.buckets.data     6  2048  289 TiB  100.35M  434 TiB  68.04    136 TiB
>> default.rgw.buckets.index    7  1024  5.4 GiB      521   16 GiB      0     93 TiB
>> default.rgw.buckets.non-ec   8    32    551 B        1   13 KiB      0     93 TiB
>> metadata_fs_ssd              9   128  6.1 GiB   15.69M   18 GiB      0     93 TiB
>> ssd_ec_project              10  1024  108 TiB   44.46M  162 TiB  29.39    260 TiB
>> metadata_fs_hdd             11   128  9.9 GiB    8.38M   30 GiB   0.01     93 TiB
>> hdd_ec_archive              12  1024   90 TiB   87.94M  135 TiB  39.79    136 TiB
>> metadata_fs_nvme            13    32  260 MiB      177  780 MiB      0     93 TiB
>> metadata_fs_ssd_rep         14    32   17 MiB      103   51 MiB      0     93 TiB
>> ssd_rep_projects            15  1024    132 B        1   12 KiB      0    130 TiB
>> nvme_rep_projects           16   512  3.5 KiB       30  336 KiB      0     93 TiB
>>
>>
>> Do you have multiple EC RBD pools and/or multiple CephFSes?
>>
>>
>> ID   CLASS      WEIGHT      REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
>> -1              2272.93311        -   2.2 PiB  742 TiB  731 TiB   63 GiB  3.1 TiB  1.5 PiB  32.64  1.00    -          root default
>> -7               254.54175        -   255 TiB  104 TiB  102 TiB  9.6 GiB  455 GiB  151 TiB  40.78  1.25    -          host ceph-host-1
>>
>> ...
>>
>> 137  hdd_class    18.43300   1.00000   18 TiB   14 TiB   14 TiB    6 KiB   50 GiB  4.3 TiB  76.47  2.34  449  up      osd.137
>> 152  hdd_class    18.19040   1.00000   18 TiB  241 GiB  239 GiB   10 KiB  1.8 GiB   18 TiB   1.29  0.04    7  up      osd.152
>>                   TOTAL                                            ...    3.1 TiB  1.5 PiB  32.64
>> MIN/MAX VAR: 0.00/2.46  STDDEV: 26.17
>>
>>
>> There ya go. osd.152 must be one of the new OSDs. Note that only 7 PGs
>> are currently resident and that it holds just 4% of the average amount of
>> data on the entire set of OSDs.
>> Run the focused `osd df` above and that number will change slightly.
>>
>> Here is your least full hdd_class OSD:
>>
>> 151  hdd_class  18.19040  1.00000  18 TiB   38 GiB   37 GiB   6 KiB  1.1 GiB  18 TiB  0.20  0.01  1  up  osd.151
>>
>> And the most full:
>>
>> 180  hdd_class  18.19040  1.00000  18 TiB  198 GiB  197 GiB  10 KiB  1.7 GiB  18 TiB  1.07  0.03  5  up  osd.180
>>
>>
>> I suspect that the most-full is at 107% of average due to the bolus of
>> backfill and/or the balancer not being active. Using upmap-remapped as
>> described above can help avoid this kind of overload.
>>
>> In a nutshell, the available space will gradually increase as data is
>> backfilled, especially if you have the balancer enabled.
>>
>>
>
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io