> On Aug 31, 2025, at 4:15 AM, Steven Vacaroaia <ste...@gmail.com> wrote:
> 
> Hi,
> 
> I have added 42 x 18TB HDD disks ( 6 on each of the 7 servers )

The ultimate answer to Ceph, the cluster, and everything!

> My expectation was that the pools configured to use "hdd_class" will
> have their capacity  increased ( e.g. default.rgw.buckets.data which is
> uses an EC 4+2 pool  for data )

First, did the raw capacity increase when you added these drives?

> --- RAW STORAGE ---
> CLASS          SIZE    AVAIL     USED  RAW USED  %RAW USED
> hdd_class   1.4 PiB  814 TiB  579 TiB   579 TiB      41.54

Was the number of OSDs previously 139?

> It seems it is not happening ...yet ?!
> Is it because the peering is still going ?

Ceph nomenclature can be mystifying at first.  And sometimes at thirteenth.

Peering is daemons checking in with each other to ensure they’re in agreement.

I think you mean backfill / balancing.

The available space reported by “ceph df” for a *pool* is a function of:

* Raw space available in the associated CRUSH rule’s device class (or, if the rule isn’t ideal, all device classes)
* The cluster’s three full ratios (checked with `ceph osd dump | grep ratio`; see below)
* The fullness of the single most-full OSD in the device class
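For reference, the ratio check looks like this; the values shown are the shipped defaults, so yours may differ:

# ceph osd dump | grep ratio
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85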

BTW I learned only yesterday that you can restrict `ceph osd df` by specifying 
a device class, so try running

`ceph osd df hdd_class | tail -10`

Notably, this will show you the min/max variance among OSDs of just that device 
class, and the standard deviation.
When you have multiple OSD sizes, these figures are much less useful when 
calculated across the whole cluster by a plain “ceph osd df”.

# ceph osd df hdd
...
318    hdd  18.53969   1.00000   19 TiB   15 TiB   15 TiB   15 KiB  67 GiB  3.6 TiB  80.79  1.04  127      up
319    hdd  18.53969   1.00000   19 TiB   15 TiB   14 TiB  936 KiB  60 GiB  3.7 TiB  79.87  1.03  129      up
320    hdd  18.53969   1.00000   19 TiB   15 TiB   14 TiB   33 KiB  72 GiB  3.7 TiB  79.99  1.03  129      up
 30    hdd  18.53969   1.00000   19 TiB  3.3 TiB  2.9 TiB  129 KiB  11 GiB   15 TiB  17.55  0.23   26      up
                         TOTAL  5.4 PiB  4.2 PiB  4.1 PiB  186 MiB  17 TiB  1.2 PiB  77.81
MIN/MAX VAR: 0.23/1.09  STDDEV: 4.39

You can even run this for a specific OSD so you don’t have to get creative with 
an egrep regex or exercise your pattern-matching skills, though the summary 
values naturally aren’t useful.

# ceph osd df osd.30
ID  CLASS  WEIGHT    REWEIGHT  SIZE    RAW USE  DATA     OMAP     META    AVAIL   %USE   VAR   PGS  STATUS
30    hdd  18.53969   1.00000  19 TiB  3.3 TiB  2.9 TiB  129 KiB  12 GiB  15 TiB  17.55  1.00   26      up
                        TOTAL  19 TiB  3.3 TiB  2.9 TiB  130 KiB  12 GiB  15 TiB  17.55
MIN/MAX VAR: 1.00/1.00  STDDEV: 0

Here there’s a wide variation among the hdd OSDs because osd.30 had been down 
for a while and was recently restarted due to a host reboot, so it’s slowly 
filling with data.


> ssd_class     6.98630

That seems like an unusual size. What are these? Are they SAN LUNs?

> Below are outputs from
> ceph -s
> ceph df
> ceph osd df tree

Thanks for providing the needful up front.

>  cluster:
>    id:     0cfa836d-68b5-11f0-90bf-7cc2558e5ce8
>    health: HEALTH_WARN
>            1 OSD(s) experiencing slow operations in BlueStore

This warning state by default persists for a long time after the condition clears. I’m 
not sure why, but I like to set this lower:

# ceph config dump | grep blue
global    advanced    bluestore_slow_ops_warn_lifetime    300
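If you want to do the same, it’s a single config set (300 seconds is just the value I 
happen to use, not a recommendation):

# ceph config set global bluestore_slow_ops_warn_lifetime 300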



>            1 failed cephadm daemon(s)
>            39 daemons have recently crashed

That’s a bit worrisome. What happened?

`ceph crash ls`
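Once you’ve gone through them (`ceph crash info <id>` shows the backtrace for a given 
entry), archiving acknowledges them and clears that health warning:

# ceph crash archive-all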


>            569 pgs not deep-scrubbed in time
>            2609 pgs not scrubbed in time

Scrubs don’t happen during recovery; once it completes, these should catch up.
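(That behavior is governed by `osd_scrub_during_recovery`, which defaults to false; 
`ceph config get osd osd_scrub_during_recovery` will confirm. I wouldn’t enable it 
during a backfill of this size.)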

>  services:
>    mon: 5 daemons, quorum
> ceph-host-1,ceph-host-2,ceph-host-3,ceph-host-7,ceph-host-6 (age 2m)
>    mgr: ceph-host-1.lqlece(active, since 18h), standbys: ceph-host-2.suiuxi

I’m paranoid and would suggest deploying at least one more mgr.
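With cephadm that’s a one-liner; for example, to ask for three of them:

# ceph orch apply mgr 3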

>    mds: 19/19 daemons up, 7 standby

Yikes, why so many?
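`ceph fs status` will show how those 19 ranks are spread across your four filesystems; 
with that in hand you can decide whether you really need that many active MDS daemons 
(and that many standbys).

# ceph fs status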

>    osd: 181 osds: 181 up (since 4d), 181 in (since 14h)

What happened 14 hours ago?  It seems unusual for these durations to vary so 
much.

> 2770 remapped pgs

That’s an indication of balancing or backfill in progress.

>         flags noautoscale
> 
>  data:
>    volumes: 4/4 healthy
>    pools:   16 pools, 7137 pgs
>    objects: 256.82M objects, 484 TiB
>    usage:   742 TiB used, 1.5 PiB / 2.2 PiB avail
>    pgs:     575889786/1468742421 objects misplaced (39.210%)

39% is a lot of misplaced objects; this is consistent with your having successfully 
added those OSDs.
Here is where the factor of the most-full OSD comes in.

Technically backfill is a subset of recovery, but in practice people usually think of 
them in these terms:

Recovery: PGs healing from OSDs having failed or been down
Backfill: Rebalancing of data due to topology changes, including adjusted CRUSH 
rules, expansion, etc.
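Either way, you can watch the rebalance drain down with the usual suspects; on recent 
releases the mgr progress module also gives a rough percent-complete per event:

# ceph -s
# ceph osd pool stats
# ceph progress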


>             4247 active+clean
>             2763 active+remapped+backfill_wait
>             77   active+clean+scrubbing
>             43   active+clean+scrubbing+deep
>             7    active+remapped+backfilling

Configuration options throttle how much backfill goes on in parallel to keep 
the cluster from DoSing itself.  Here I suspect that you’re running a recent 
release with the notorious mclock op scheduling shortcomings, which is a 
tangent.
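If that does turn out to be the bottleneck and you are indeed on an mClock-based 
release (Quincy or later), the profile is the usual knob to give recovery/backfill a 
bigger share of the budget; I’d treat it as temporary rather than a permanent setting:

# ceph config set osd osd_mclock_profile high_recovery_ops
# ceph config rm osd osd_mclock_profile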


I suggest checking out these two resources re upmap-remapped.py:

https://ceph.io/assets/pdfs/events/2024/ceph-days-nyc/Mastering%20Ceph%20Operations%20with%20Upmap.pdf
https://community.ibm.com/community/user/blogs/anthony-datri/2025/07/30/gracefully-expanding-your-ibm-storage-ceph


This tool, in conjunction with the balancer module, will do the backfill more 
elegantly: remapped PGs go active+clean right away, and data then moves onto the new 
OSDs at a pace the balancer controls.
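Roughly, the pattern from those write-ups (assuming you’ve fetched the script onto an 
admin node; it emits `ceph osd pg-upmap-items` commands for you to pipe into a shell):

# ceph balancer off
# ./upmap-remapped.py | sh
# ceph balancer on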


> --- RAW STORAGE ---
> CLASS          SIZE    AVAIL     USED  RAW USED  %RAW USED
> hdd_class   1.4 PiB  814 TiB  579 TiB   579 TiB      41.54

I hope the formatting below comes through; it makes a table a lot easier to read.

> 
> --- POOLS ---
> POOL                        ID   PGS   STORED  OBJECTS     USED  %USED  MAX AVAIL
> .mgr                         1     1  277 MiB       71  831 MiB      0 93 TiB
> .rgw.root                    2    32  1.6 KiB        6   72 KiB      0 93 TiB
> default.rgw.log              3    32   63 KiB      210  972 KiB      0 93 TiB
> default.rgw.control          4    32      0 B        8      0 B      0 93 TiB
> default.rgw.meta             5    32  1.4 KiB        8   72 KiB      0 93 TiB
> default.rgw.buckets.data     6  2048  289 TiB  100.35M  434 TiB  68.04 136 TiB
> default.rgw.buckets.index    7  1024  5.4 GiB      521   16 GiB      0 93 TiB
> default.rgw.buckets.non-ec   8    32    551 B        1   13 KiB      0 93 TiB
> metadata_fs_ssd              9   128  6.1 GiB   15.69M   18 GiB      0 93 TiB
> ssd_ec_project              10  1024  108 TiB   44.46M  162 TiB  29.39 260 TiB
> metadata_fs_hdd             11   128  9.9 GiB    8.38M   30 GiB   0.01 93 TiB
> hdd_ec_archive              12  1024   90 TiB   87.94M  135 TiB  39.79 136 TiB
> metadata_fs_nvme            13    32  260 MiB      177  780 MiB      0 93 TiB
> metadata_fs_ssd_rep         14    32   17 MiB      103   51 MiB      0 93 TiB
> ssd_rep_projects            15  1024    132 B        1   12 KiB      0 130 TiB
> nvme_rep_projects           16   512  3.5 KiB       30  336 KiB      0 93 TiB

Do you have multiple EC RBD pools and/or multiple CephFSes?


> ID   CLASS       WEIGHT      REWEIGHT  SIZE     RAW USE  DATA     OMAP     META     AVAIL    %USE   VAR   PGS  STATUS  TYPE NAME
> -1              2272.93311         -  2.2 PiB  742 TiB  731 TiB   63 GiB  3.1 TiB  1.5 PiB  32.64  1.00    -          root default
> -7               254.54175         -  255 TiB  104 TiB  102 TiB  9.6 GiB  455 GiB  151 TiB  40.78  1.25    -              host ceph-host-1

> ...

> 137   hdd_class    18.43300   1.00000   18 TiB   14 TiB   14 TiB    6 KiB   50 GiB  4.3 TiB  76.47  2.34  449      up          osd.137
> 152   hdd_class    18.19040   1.00000   18 TiB  241 GiB  239 GiB   10 KiB  1.8 GiB   18 TiB   1.29  0.04    7      up          osd.152
>                                  TOTAL  2.2 PiB  742 TiB  731 TiB   63 GiB  3.1 TiB  1.5 PiB  32.64
> MIN/MAX VAR: 0.00/2.46  STDDEV: 26.17

There ya go.  osd.152 must be one of the new OSDs.  Note that only 7 PGs are 
currently resident and that it holds just 4% of the average amount of data on 
the entire set of OSDs.
Run the focused `osd df` above and that number will change slightly.

Here is your least full hdd_class OSD:

151   hdd_class    18.19040   1.00000   18 TiB   38 GiB   37 GiB    6 KiB  1.1 GiB   18 TiB   0.20  0.01    1      up          osd.151

And the most full of the new ones:

180   hdd_class    18.19040   1.00000   18 TiB  198 GiB  197 GiB   10 KiB  1.7 GiB   18 TiB   1.07  0.03    5      up          osd.180


I suspect that the most-full is at 107% of average due to the bolus of backfill 
and/or the balancer not being active.  Using upmap-remapped as described above 
can help avoid this kind of overload.

In a nutshell, the available space will gradually increase as data is 
backfilled, especially if you have the balancer enabled.
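A quick check that the balancer is actually on, and in upmap mode:

# ceph balancer status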







_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
