Thanks again, Anthony.

I guess I wrongly assumed that "osd.all-available-devices" covers all
available devices.
It seems I should have targeted the specific services instead (like osd.hdd_osds).

Going back to the capacity increase:
I am trying to estimate / calculate the final capacity.

Looking at the output of "ceph df" below,
I would greatly appreciate it if you could confirm
whether this is the correct interpretation for hdd_class:

"Out of a total capacity of 1.4 PiB you currently use 579 TiB .

The 579 TiB used are the sum of the data used by pools configured to use
 hdd_class e.g osd.all-available-devices (434) ,  hdd_ec_archive
 (135)...etc

Therefore, when the backfilling / balancing is done
 (and assuming no more data is added to the pools)
the amount of pools "MAX Available" should be 814 TB

"

Is the above correct?
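
As a rough sanity check on that reading, using the "ceph df" figures below:

    434 TiB (default.rgw.buckets.data) + 135 TiB (hdd_ec_archive) = 569 TiB

which is close to the 579 TiB RAW USED reported for hdd_class - I am assuming
the remaining ~10 TiB is overhead plus data still in flight.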

Note:
The "MAX AVAIL" value is still increasing - 202 TB now.

Steven


--- RAW STORAGE ---
CLASS          SIZE    AVAIL     USED  RAW USED  %RAW USED
hdd_class   1.4 PiB  814 TiB  579 TiB   579 TiB      41.54
nvme_class  293 TiB  293 TiB  216 GiB   216 GiB       0.07
ssd_class   587 TiB  424 TiB  163 TiB   163 TiB      27.80
TOTAL       2.2 PiB  1.5 PiB  742 TiB   742 TiB      32.64

--- POOLS ---
POOL                        ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
.mgr                         1     1  277 MiB       71  831 MiB      0     93 TiB
.rgw.root                    2    32  1.6 KiB        6   72 KiB      0     93 TiB
default.rgw.log              3    32   63 KiB      210  972 KiB      0     93 TiB
default.rgw.control          4    32      0 B        8      0 B      0     93 TiB
default.rgw.meta             5    32  1.4 KiB        8   72 KiB      0     93 TiB
default.rgw.buckets.data     6  2048  289 TiB  100.35M  434 TiB  68.04    136 TiB
default.rgw.buckets.index    7  1024  5.4 GiB      521   16 GiB      0     93 TiB
default.rgw.buckets.non-ec   8    32    551 B        1   13 KiB      0     93 TiB
metadata_fs_ssd              9   128  6.1 GiB   15.69M   18 GiB      0     93 TiB
ssd_ec_project              10  1024  108 TiB   44.46M  162 TiB  29.39    260 TiB
metadata_fs_hdd             11   128  9.9 GiB    8.38M   30 GiB   0.01     93 TiB
hdd_ec_archive              12  1024   90 TiB   87.94M  135 TiB  39.79    136 TiB

On Mon, 1 Sept 2025 at 17:07, Anthony D'Atri <a...@dreamsnake.net> wrote:

>
>
> Hi,
> Thanks Anthony - as always a very useful and comprehensive response
>
>
> Most welcome.
>
> yes, there were only 139 OSDs and, indeed, raw capacity increased
> I also noticed that the "MAX AVAIL" column from "ceph df" is getting
> higher (177 TB) so it seems the capacity is being added
>
>
> Aye, as expected.
>
> The many MDS daemons are due to the fact that I have over 150 CephFS
> clients, and I thought that deploying a daemon on each host for each
> filesystem was going to provide better performance - was I wrong?
>
>
> MDS strategy is nuanced, but I think deploying them redundantly on each
> back-end node might not give you much.  The workload AIUI is  more a
> function of the number and size of files.
>
> The high number of remapped PGs is due to my improperly using the
> osd.all-available-devices unmanaged setting, so adding the new drives
> triggered them being automatically detected and added
>
>
> Yep, I typically suggest setting OSD services to unmanaged when not
> actively using them.
>
> I am not sure what I did wrong though - see the output from "ceph orch ls"
> below, taken before adding the drives
> Shouldn't setting it like that prevent automatic discovery?
>
>
> You do have three other OSD rules, at least one of them likely matched and
> took action.
>
>
>
> osd.all-available-devices    0  -        4w  <unmanaged>
> osd.hdd_osds                72  10m ago  5w  *
> osd.nvme_osds               25  10m ago  5w  *
> osd.ssd_osds                84  10m ago  3w  ceph-host-1
>
>
>
> `ceph orch ls --export`
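>
> As an illustration - this is a hypothetical spec, adjust the service_id,
> placement and device filters to your setup - a service can also be frozen
> explicitly in its spec and re-applied with `ceph orch apply -i <file>.yaml`:
>
> service_type: osd
> service_id: hdd_osds
> placement:
>   host_pattern: '*'
> spec:
>   data_devices:
>     rotational: 1
> unmanaged: true
>
> Flip unmanaged back to false and re-apply when you actually want new disks
> consumed.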
>
> The resources mentioned are very useful -
> running upmap-remapped.py did bring the number of remapped PGs
> close to zero
>
>
> It's a super, super useful tool. Letting the balancer do the work
> incrementally helps deter unexpected OSD fullness issues and if a problem
> arises, you're a lot closer to HEALTH_OK.
>
> 2. upmap-remapped.py | sh
>
>
> Sometimes 2-3 runs are needed for full effect.
>
>
> 3. change target_max_misplaced_ratio to a higher value than the default 0.05
>  (since we want to rebalance faster and client performance is not a huge
> issue)
>
> 4. enable balancer
>
> 5. wait
>
> Doing it like this will, eventually, increase the number of misplaced PGs
> until it is higher than the ratio, at which point, I guess, the balancer stops
> ( "optimize_result": "Too many objects (0.115742 > 0.090000) are
> misplaced; try again later" )
>
>
> Exactly.
>
> Should I repeat the process when the number of misplaced objects is higher
> than the ratio, or what is the proper way of doing it?
>
>
> As the backfill progresses and the misplaced percentage drops, the
> balancer will kick in with another increment.
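>
> For reference, the knobs involved look something like this (the 0.10 value
> is just an example):
>
> ceph config set mgr target_max_misplaced_ratio 0.10
> ceph balancer on
> ceph balancer status
>
> No need to kick it manually each time - the balancer re-evaluates on its own
> once the misplaced percentage drops below the target.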
>
>
> Steven
>
>
> On Sun, 31 Aug 2025 at 11:08, Anthony D'Atri <a...@dreamsnake.net> wrote:
>
>>
>>
>> On Aug 31, 2025, at 4:15 AM, Steven Vacaroaia <ste...@gmail.com> wrote:
>>
>> Hi,
>>
>> I have added 42 x 18 TB HDD disks (6 on each of the 7 servers)
>>
>>
>> The ultimate answer to Ceph, the cluster, and everything!
>>
>> My expectation was that the pools configured to use "hdd_class" would
>> have their capacity increased (e.g. default.rgw.buckets.data, which
>> uses an EC 4+2 pool for data)
>>
>>
>> First, did the raw capacity increase when you added these drives?
>>
>> --- RAW STORAGE ---
>> CLASS          SIZE    AVAIL     USED  RAW USED  %RAW USED
>> hdd_class   1.4 PiB  814 TiB  579 TiB   579 TiB      41.54
>>
>>
>> Was the number of OSDs previously 139?
>>
>> It seems it is not happening ...yet ?!
>> Is it because the peering is still going ?
>>
>>
>> Ceph nomenclature can be mystifying at first.  And sometimes at
>> thirteenth.
>>
>> Peering is daemons checking in with each other to ensure they’re in
>> agreement.
>>
>> I think you mean backfill / balancing.
>>
>> The available space reported by “ceph df” for a *pool* is a function of:
>>
>> * Raw space available in the associated CRUSH rule’s device class (or if
>> the rule isn’t ideal, all device classes)
>> * The cluster’s three full ratios (`ceph osd dump | grep ratio`)
>> * The fullness of the single most-full OSD in the device class
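>>
>> As very rough arithmetic for the first point (ignoring the full ratios and
>> the most-full-OSD skew): an EC k+m pool can store about raw x k/(k+m), so
>> the theoretical ceiling for a 4+2 pool on your hdd_class would be on the
>> order of 814 TiB x 4/6 ≈ 543 TiB. The 136 TiB currently reported as MAX
>> AVAIL is far lower, presumably because the most-full hdd OSD dominates the
>> projection while backfill is still in flight.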
>>
>> BTW I learned only yesterday that you can restrict `ceph osd df` by
>> specifying a device class, so try running
>>
>> `ceph osd df hdd_class | tail -10`
>>
>> Notably, this will show you the min/max variance among OSDs of just that
>> device class, and the standard deviation.
>> When you have multiple OSD sizes, these figures are much less useful when
>> calculated across the whole cluster by “ceph osd df”
>>
>> # ceph osd df hdd
>> ...
>> 318  hdd  18.53969  1.00000   19 TiB   15 TiB   15 TiB   15 KiB  67 GiB  3.6 TiB  80.79  1.04  127  up
>> 319  hdd  18.53969  1.00000   19 TiB   15 TiB   14 TiB  936 KiB  60 GiB  3.7 TiB  79.87  1.03  129  up
>> 320  hdd  18.53969  1.00000   19 TiB   15 TiB   14 TiB   33 KiB  72 GiB  3.7 TiB  79.99  1.03  129  up
>>  30  hdd  18.53969  1.00000   19 TiB  3.3 TiB  2.9 TiB  129 KiB  11 GiB   15 TiB  17.55  0.23   26  up
>>                       TOTAL  5.4 PiB  4.2 PiB  4.1 PiB  186 MiB  17 TiB  1.2 PiB  77.81
>> MIN/MAX VAR: 0.23/1.09  STDDEV: 4.39
>>
>> You can even run this for a specific OSD so you don’t have to get
>> creative with an egrep regex or exercise your pattern-matching skills,
>> though the summary values naturally aren’t useful.
>>
>> # ceph osd df osd.30
>> ID  CLASS  WEIGHT    REWEIGHT  SIZE    RAW USE  DATA     OMAP     META    AVAIL   %USE   VAR   PGS  STATUS
>> 30    hdd  18.53969   1.00000  19 TiB  3.3 TiB  2.9 TiB  129 KiB  12 GiB  15 TiB  17.55  1.00   26      up
>>                         TOTAL  19 TiB  3.3 TiB  2.9 TiB  130 KiB  12 GiB  15 TiB  17.55
>> MIN/MAX VAR: 1.00/1.00  STDDEV: 0
>>
>> Here there’s a wide variation among the hdd OSDs because osd.30 had been
>> down for a while and was recently restarted due to a host reboot, so it’s
>> slowly filling with data.
>>
>>
>> ssd_class     6.98630
>>
>>
>> That seems like an unusual size, what are these? Are they SAN LUNs?
>>
>> Below are outputs from
>> ceph -s
>> ceph df
>> ceph osd df tree
>>
>>
>> Thanks for providing the needful up front.
>>
>>  cluster:
>>    id:     0cfa836d-68b5-11f0-90bf-7cc2558e5ce8
>>    health: HEALTH_WARN
>>            1 OSD(s) experiencing slow operations in BlueStore
>>
>>
>> This warning state by default persists for a long time after it clears;
>> I’m not sure why, but I like to set this lower:
>>
>> # ceph config dump | grep blue
>> global  advanced  bluestore_slow_ops_warn_lifetime  300
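>>
>> (To change it, something like `ceph config set global
>> bluestore_slow_ops_warn_lifetime 300` does the trick - 300 seconds being
>> the value I happen to use here.)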
>>
>>
>>
>>            1 failed cephadm daemon(s)
>>
>>            39 daemons have recently crashed
>>
>>
>> That’s a bit worrisome, what happened?
>>
>> `ceph crash ls`
>>
>>
>>            569 pgs not deep-scrubbed in time
>>            2609 pgs not scrubbed in time
>>
>>
>> Scrubs don’t happen during recovery; when it completes, these should catch up.
>>
>>  services:
>>    mon: 5 daemons, quorum
>> ceph-host-1,ceph-host-2,ceph-host-3,ceph-host-7,ceph-host-6 (age 2m)
>>    mgr: ceph-host-1.lqlece(active, since 18h), standbys:
>> ceph-host-2.suiuxi
>>
>>
>> I’m paranoid and would suggest deploying at least one more mgr.
>>
>>    mds: 19/19 daemons up, 7 standby
>>
>>
>> Yikes why so many?
>>
>>    osd: 181 osds: 181 up (since 4d), 181 in (since 14h)
>>
>>
>> What happened 14 hours ago?  It seems unusual for these durations to vary
>> so much.
>>
>> 2770 remapped pgs
>>
>>
>> That’s an indication of balancing or backfill in progress.
>>
>>         flags noautoscale
>>
>>  data:
>>    volumes: 4/4 healthy
>>    pools:   16 pools, 7137 pgs
>>    objects: 256.82M objects, 484 TiB
>>    usage:   742 TiB used, 1.5 PiB / 2.2 PiB avail
>>    pgs:     575889786/1468742421 objects misplaced (39.210%)
>>
>>
>> 39% is a lot of misplaced objects, this would be consistent with you
>> having successfully added those OSDs.
>> Here is where the factor of the most-full OSD comes in.
>>
>> Technically backfill is a subset of recovery, but in practice people
>> usually think of them in these terms:
>>
>> Recovery: PGs healing from OSDs having failed or been down
>> Backfill: Rebalancing of data due to topology changes, including adjusted
>> CRUSH rules, expansion, etc.
>>
>>
>>             4247 active+clean
>>             2763 active+remapped+backfill_wait
>>             77   active+clean+scrubbing
>>             43   active+clean+scrubbing+deep
>>             7    active+remapped+backfilling
>>
>>
>> Configuration options throttle how much backfill goes on in parallel to
>> keep the cluster from DoSing itself.  Here I suspect that you’re running a
>> recent release with the notorious mclock op scheduling shortcomings, which
>> is a tangent.
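>>
>> (If mclock is indeed the bottleneck, the usual lever is the profile, e.g.
>> `ceph config set osd osd_mclock_profile high_recovery_ops` while the
>> backfill runs, then back to `balanced` afterwards - but that's the tangent.)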
>>
>>
>> I suggest checking out these two resources re upmap-remapped.py :
>>
>> https://ceph.io/assets/pdfs/events/2024/ceph-days-nyc/Mastering Ceph
>> Operations with Upmap.pdf
>>
>> https://community.ibm.com/community/user/blogs/anthony-datri/2025/07/30/gracefully-expanding-your-ibm-storage-ceph
>>
>>
>>
>> This tool, in conjunction with the balancer module, will do the backfill
>> more elegantly with various benefits.
>>
>>
>> --- RAW STORAGE ---
>> CLASS          SIZE    AVAIL     USED  RAW USED  %RAW USED
>> hdd_class   1.4 PiB  814 TiB  579 TiB   579 TiB      41.54
>>
>>
>> I hope the formatting below comes through, makes it a lot easier to read
>> a table.
>>
>>
>> --- POOLS ---
>> POOL                        ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
>> .mgr                         1     1  277 MiB       71  831 MiB      0     93 TiB
>> .rgw.root                    2    32  1.6 KiB        6   72 KiB      0     93 TiB
>> default.rgw.log              3    32   63 KiB      210  972 KiB      0     93 TiB
>> default.rgw.control          4    32      0 B        8      0 B      0     93 TiB
>> default.rgw.meta             5    32  1.4 KiB        8   72 KiB      0     93 TiB
>> default.rgw.buckets.data     6  2048  289 TiB  100.35M  434 TiB  68.04    136 TiB
>> default.rgw.buckets.index    7  1024  5.4 GiB      521   16 GiB      0     93 TiB
>> default.rgw.buckets.non-ec   8    32    551 B        1   13 KiB      0     93 TiB
>> metadata_fs_ssd              9   128  6.1 GiB   15.69M   18 GiB      0     93 TiB
>> ssd_ec_project              10  1024  108 TiB   44.46M  162 TiB  29.39    260 TiB
>> metadata_fs_hdd             11   128  9.9 GiB    8.38M   30 GiB   0.01     93 TiB
>> hdd_ec_archive              12  1024   90 TiB   87.94M  135 TiB  39.79    136 TiB
>> metadata_fs_nvme            13    32  260 MiB      177  780 MiB      0     93 TiB
>> metadata_fs_ssd_rep         14    32   17 MiB      103   51 MiB      0     93 TiB
>> ssd_rep_projects            15  1024    132 B        1   12 KiB      0    130 TiB
>> nvme_rep_projects           16   512  3.5 KiB       30  336 KiB      0     93 TiB
>>
>>
>> Do you have multiple EC RBD pools and/or multiple CephFSes?
>>
>>
>> ID   CLASS       WEIGHT      REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
>> -1               2272.93311         -  2.2 PiB  742 TiB  731 TiB   63 GiB  3.1 TiB  1.5 PiB  32.64  1.00    -          root default
>> -7                254.54175         -  255 TiB  104 TiB  102 TiB  9.6 GiB  455 GiB  151 TiB  40.78  1.25    -              host ceph-host-1
>>
>>
>> ...
>>
>>
>> 137   hdd_class    18.43300   1.00000   18 TiB   14 TiB   14 TiB    6 KiB   50 GiB  4.3 TiB  76.47  2.34  449      up          osd.137
>> 152   hdd_class    18.19040   1.00000   18 TiB  241 GiB  239 GiB   10 KiB  1.8 GiB   18 TiB   1.29  0.04    7      up          osd.152
>>                                  TOTAL  2.2 PiB  742 TiB  731 TiB   63 GiB  3.1 TiB  1.5 PiB  32.64
>> MIN/MAX VAR: 0.00/2.46  STDDEV: 26.17
>>
>>
>> There ya go.  osd.152 must be one of the new OSDs.  Note that only 7 PGs
>> are currently resident and that it holds just 4% of the average amount of
>> data on the entire set of OSDs.
>> Run the focused `osd df` above and that number will change slightly.
>>
>> Here is your least full hdd_class OSD:
>>
>> 151   hdd_class    18.19040   1.00000   18 TiB   38 GiB   37 GiB    6 KiB  1.1 GiB   18 TiB   0.20  0.01    1      up          osd.151
>>
>> And the most full:
>>
>> 180   hdd_class    18.19040   1.00000   18 TiB  198 GiB  197 GiB   10 KiB  1.7 GiB   18 TiB   1.07  0.03    5      up          osd.180
>>
>>
>> I suspect that the most-full is at 107% of average due to the bolus of
>> backfill and/or the balancer not being active.  Using upmap-remapped as
>> described above can help avoid this kind of overload.
>>
>> In a nutshell, the available space will gradually increase as data is
>> backfilled, especially if you have the balancer enabled.
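>>
>> To watch it converge, something along these lines:
>>
>> ceph -s | grep misplaced          # overall misplaced percentage
>> ceph balancer status              # mode/active and the last optimize_result
>> ceph osd df hdd_class | tail -10  # min/max variance for just the hdd OSDs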
>>
>>
>>
>>
>>
>>
>>
>>
>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
