> On Aug 31, 2025, at 4:15 AM, Steven Vacaroaia <ste...@gmail.com> wrote:
>
> Hi,
>
> I have added 42 x 18TB HDD disks ( 6 on each of the 7 servers )
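To make those factors concrete, here is a back-of-the-envelope sketch. This is
emphatically not Ceph's exact MAX AVAIL calculation; it just plugs in the
figures from your output and assumes the default full_ratio of 0.95:

  raw_avail_tib=814      # hdd_class AVAIL from --- RAW STORAGE ---
  class_size_tib=1434    # hdd_class SIZE (1.4 PiB)
  k=4; m=2               # EC profile of default.rgw.buckets.data
  full_ratio=0.95        # assumed default; confirm with ceph osd dump | grep ratio
  most_full_pct=76.47    # %USE of your most-full hdd_class OSD (osd.137)

  echo "$raw_avail_tib $class_size_tib $k $m $full_ratio $most_full_pct" | awk '{
      ec = $3 / ($3 + $4)    # usable fraction after EC overhead, 4/6 here
      printf "naive:       %.0f TiB usable if all free space were reachable\n", $1 * ec
      printf "constrained: %.0f TiB before the fullest OSD hits full_ratio\n", $2 * ($5 - $6 / 100) * ec
  }'

The naive figure is roughly what you were expecting to see; the constrained one
is much closer to the 136 TiB that "ceph df" reports for
default.rgw.buckets.data, because the projection stops when the first OSD would
hit full_ratio. As backfill spreads data onto the new OSDs, MAX AVAIL creeps up
toward the naive number.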
The ultimate answer to Ceph, the cluster, and everything!

> My expectation was that the pools configured to use "hdd_class" will
> have their capacity increased ( e.g. default.rgw.buckets.data which
> uses an EC 4+2 pool for data )

First, did the raw capacity increase when you added these drives?

> --- RAW STORAGE ---
> CLASS      SIZE     AVAIL    USED     RAW USED  %RAW USED
> hdd_class  1.4 PiB  814 TiB  579 TiB  579 TiB       41.54

Was the number of OSDs previously 139?

> It seems it is not happening ...yet ?!
> Is it because the peering is still going ?

Ceph nomenclature can be mystifying at first. And sometimes at thirteenth.
Peering is daemons checking in with each other to ensure they're in agreement.
I think you mean backfill / balancing.

The available space reported by "ceph df" for a *pool* is a function of:

* Raw space available in the associated CRUSH rule's device class
  (or, if the rule isn't ideal, all device classes)
* The cluster's three full ratios

  # ceph osd dump | grep ratio

* The fullness of the single most-full OSD in the device class
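For what it's worth, the workflow those two links walk through boils down to
roughly the sketch below. Treat it as a sketch, not a runbook, and verify
against the slides/blog and your release's documentation before running
anything:

  # Stop the balancer while you reset the backfill backlog
  ceph balancer off

  # upmap entries require that every client speaks luminous or later
  ceph osd set-require-min-compat-client luminous

  # upmap-remapped.py emits "ceph osd pg-upmap-items ..." commands that pin
  # remapped PGs to wherever their data currently lives; pipe to sh and
  # repeat until the misplaced count is near zero
  ./upmap-remapped.py | sh

  # Then let the balancer remove those upmaps gradually, moving data onto
  # the new OSDs in small, throttled increments instead of a 39% avalanche
  ceph balancer mode upmap
  ceph balancer on
  ceph balancer status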
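BTW I learned only yesterday that you can restrict `ceph osd df` by specifying
a device class, so try running

  # ceph osd df hdd_class | tail -10

Notably, this will show you the min/max variance among OSDs of just that
device class, and the standard deviation. When you have multiple OSD sizes,
these figures are much less useful when calculated across the whole cluster
by "ceph osd df".

# ceph osd df hdd
...
318   hdd  18.53969  1.00000   19 TiB   15 TiB   15 TiB   15 KiB   67 GiB  3.6 TiB  80.79  1.04  127  up
319   hdd  18.53969  1.00000   19 TiB   15 TiB   14 TiB  936 KiB   60 GiB  3.7 TiB  79.87  1.03  129  up
320   hdd  18.53969  1.00000   19 TiB   15 TiB   14 TiB   33 KiB   72 GiB  3.7 TiB  79.99  1.03  129  up
 30   hdd  18.53969  1.00000   19 TiB  3.3 TiB  2.9 TiB  129 KiB   11 GiB   15 TiB  17.55  0.23   26  up
                       TOTAL  5.4 PiB  4.2 PiB  4.1 PiB  186 MiB   17 TiB  1.2 PiB  77.81
MIN/MAX VAR: 0.23/1.09  STDDEV: 4.39

You can even run this for a specific OSD so you don't have to get creative
with an egrep regex or exercise your pattern-matching skills, though the
summary values naturally aren't useful.

# ceph osd df osd.30
ID  CLASS  WEIGHT    REWEIGHT  SIZE    RAW USE  DATA     OMAP     META    AVAIL   %USE   VAR   PGS  STATUS
30    hdd  18.53969   1.00000  19 TiB  3.3 TiB  2.9 TiB  129 KiB  12 GiB  15 TiB  17.55  1.00   26      up
                       TOTAL   19 TiB  3.3 TiB  2.9 TiB  130 KiB  12 GiB  15 TiB  17.55
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

Here there's a wide variation among the hdd OSDs because osd.30 had been down
for a while and was recently restarted due to a host reboot, so it's slowly
filling with data.

> ssd_class  6.98630

That seems like an unusual size, what are these? Are they SAN LUNs?

> Below are outputs from
> ceph -s
> ceph df
> ceph osd df tree

Thanks for providing the needful up front.

> cluster:
>   id:     0cfa836d-68b5-11f0-90bf-7cc2558e5ce8
>   health: HEALTH_WARN
>           1 OSD(s) experiencing slow operations in BlueStore

This warning state by default persists for a long time after it clears. I'm
not sure why, but I like to set this lower:

# ceph config dump | grep blue
global  advanced  bluestore_slow_ops_warn_lifetime  300

>           1 failed cephadm daemon(s)
>           39 daemons have recently crashed

That's a bit worrisome, what happened?  `ceph crash ls`

>           569 pgs not deep-scrubbed in time
>           2609 pgs not scrubbed in time

Scrubs don't happen during recovery; once it completes, these should catch up.

> services:
>   mon: 5 daemons, quorum ceph-host-1,ceph-host-2,ceph-host-3,ceph-host-7,ceph-host-6 (age 2m)
>   mgr: ceph-host-1.lqlece(active, since 18h), standbys: ceph-host-2.suiuxi

I'm paranoid and would suggest deploying at least one more mgr.

>   mds: 19/19 daemons up, 7 standby

Yikes, why so many?

>   osd: 181 osds: 181 up (since 4d), 181 in (since 14h)

What happened 14 hours ago? It seems unusual for these durations to vary so much.

>        2770 remapped pgs

That's an indication of balancing or backfill in progress.

> flags noautoscale
>
> data:
>   volumes: 4/4 healthy
>   pools:   16 pools, 7137 pgs
>   objects: 256.82M objects, 484 TiB
>   usage:   742 TiB used, 1.5 PiB / 2.2 PiB avail
>   pgs:     575889786/1468742421 objects misplaced (39.210%)

39% is a lot of misplaced objects; this would be consistent with you having
successfully added those OSDs. Here is where the factor of the most-full OSD
comes in.

Technically backfill is a subset of recovery, but in practice people usually
think in these terms:

Recovery: PGs healing from OSDs having failed or been down
Backfill: rebalancing of data due to topology changes, including adjusted
CRUSH rules, expansion, etc.

>   4247 active+clean
>   2763 active+remapped+backfill_wait
>     77 active+clean+scrubbing
>     43 active+clean+scrubbing+deep
>      7 active+remapped+backfilling

Configuration options throttle how much backfill goes on in parallel to keep
the cluster from DoSing itself. Here I suspect that you're running a recent
release with the notorious mclock op scheduling shortcomings, which is a
tangent.

I suggest checking out these two resources re upmap-remapped.py:

https://ceph.io/assets/pdfs/events/2024/ceph-days-nyc/Mastering%20Ceph%20Operations%20with%20Upmap.pdf
https://community.ibm.com/community/user/blogs/anthony-datri/2025/07/30/gracefully-expanding-your-ibm-storage-ceph

This tool, in conjunction with the balancer module, will do the backfill more
elegantly, with various benefits.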
> --- RAW STORAGE ---
> CLASS      SIZE     AVAIL    USED     RAW USED  %RAW USED
> hdd_class  1.4 PiB  814 TiB  579 TiB  579 TiB       41.54

I hope the formatting below comes through, makes it a lot easier to read a table.

> --- POOLS ---
> POOL                        ID  PGS   STORED   OBJECTS  USED     %USED  MAX AVAIL
> .mgr                         1     1  277 MiB       71  831 MiB      0     93 TiB
> .rgw.root                    2    32  1.6 KiB        6   72 KiB      0     93 TiB
> default.rgw.log              3    32   63 KiB      210  972 KiB      0     93 TiB
> default.rgw.control          4    32      0 B        8      0 B      0     93 TiB
> default.rgw.meta             5    32  1.4 KiB        8   72 KiB      0     93 TiB
> default.rgw.buckets.data     6  2048  289 TiB  100.35M  434 TiB  68.04    136 TiB
> default.rgw.buckets.index    7  1024  5.4 GiB      521   16 GiB      0     93 TiB
> default.rgw.buckets.non-ec   8    32    551 B        1   13 KiB      0     93 TiB
> metadata_fs_ssd              9   128  6.1 GiB   15.69M   18 GiB      0     93 TiB
> ssd_ec_project              10  1024  108 TiB   44.46M  162 TiB  29.39    260 TiB
> metadata_fs_hdd             11   128  9.9 GiB    8.38M   30 GiB   0.01     93 TiB
> hdd_ec_archive              12  1024   90 TiB   87.94M  135 TiB  39.79    136 TiB
> metadata_fs_nvme            13    32  260 MiB      177  780 MiB      0     93 TiB
> metadata_fs_ssd_rep         14    32   17 MiB      103   51 MiB      0     93 TiB
> ssd_rep_projects            15  1024    132 B        1   12 KiB      0    130 TiB
> nvme_rep_projects           16   512  3.5 KiB       30  336 KiB      0     93 TiB

Do you have multiple EC RBD pools and/or multiple CephFSes?
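Related: to confirm which pools actually draw from hdd_class, and therefore
which MAX AVAIL figures should grow as the backfill completes, something along
these lines works (the pool name is just one taken from your listing):

  # Which CRUSH rule does a given pool use?
  ceph osd pool get default.rgw.buckets.data crush_rule

  # Does that rule restrict itself to hdd_class? Look for a take step whose
  # item_name ends in "~hdd_class", e.g. "default~hdd_class"
  ceph osd crush rule dump

  # Rule, EC profile, and pg_num for every pool in one shot
  ceph osd pool ls detail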
> ID   CLASS      WEIGHT      REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
> -1              2272.93311         -  2.2 PiB  742 TiB  731 TiB   63 GiB  3.1 TiB  1.5 PiB  32.64  1.00    -          root default
> -7               254.54175         -  255 TiB  104 TiB  102 TiB  9.6 GiB  455 GiB  151 TiB  40.78  1.25    -          host ceph-host-1
> ...
> 137  hdd_class    18.43300   1.00000   18 TiB   14 TiB   14 TiB    6 KiB   50 GiB  4.3 TiB  76.47  2.34  449      up  osd.137
> 152  hdd_class    18.19040   1.00000   18 TiB  241 GiB  239 GiB   10 KiB  1.8 GiB   18 TiB   1.29  0.04    7      up  osd.152
> ...                                                                       3.1 TiB  1.5 PiB  32.64
> MIN/MAX VAR: 0.00/2.46  STDDEV: 26.17

There ya go. osd.152 must be one of the new OSDs. Note that only 7 PGs are
currently resident, and that it holds just 4% of the average amount of data
across the entire set of OSDs. Run the focused `osd df` above and that number
will change slightly.

Here is your least-full hdd_class OSD:

151  hdd_class  18.19040  1.00000  18 TiB   38 GiB   37 GiB   6 KiB  1.1 GiB  18 TiB  0.20  0.01  1  up  osd.151

And the most full of the newly added ones:

180  hdd_class  18.19040  1.00000  18 TiB  198 GiB  197 GiB  10 KiB  1.7 GiB  18 TiB  1.07  0.03  5  up  osd.180

I suspect the spread among these new OSDs is down to the bolus of backfill in
flight and/or the balancer not being active. Using upmap-remapped as described
above can help avoid that kind of overload.

In a nutshell, the available space will gradually increase as data is
backfilled, especially if you have the balancer enabled.

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io