> The PG numbers are still very low in my opinion. You have 42 OSDs and only
> 614 PGs, that makes roughly 15 PG / OSD. That's quite far from the rule of
> thumb of 100 PG/OSD.
I've been trying to clear up this nuance in the docs when I can. The PGS
field in `ceph osd df` and the target value (aka the "PG ratio") count PG
replicas, not PGs, so one has to factor in replication. For EC, the
replication factor is k+m.

For a cluster with a single pool:

    pg_num = (#OSDs * target ratio) / replication
    ratio  = (pg_num * replication) / #OSDs

Round to the nearest power of 2; if in doubt, round up. When you have
multiple pools it gets more complicated: one can use pgcalc, or leverage
the PG autoscaler.

That said, the default target of 100 is way too low, especially since it
acts as a max, not a target as such:

    global advanced mon_max_pg_per_osd 600
    global advanced mon_target_pg_per_osd 300

> But maybe your problem is located in a different place. You may want to
> check whether all your `rbd-target-api` services are up and running.
> gwcli relies on them.
>
> Kind regards,
> Laszlo Budai
>
> On 9/30/25 10:31, Kardos László wrote:
>> Hello,
>> I apologize for sending the wrong pool details earlier.
>> We store the data in the following data pool: xxxx0-data
>>
>> pool 15 'xxxx0-data' erasure profile laurel_ec size 4 min_size 3 crush_rule

If this is a 3+1 pool, note that a value of m=1 is ... fraught. If this is
a 2+2 pool, note that with current releases, EC for RBD is usually a
significant latency liability. Tentacle's fast EC improves that dynamic.
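For the single-pool case, the sizing rule above can be sketched in a few
lines of Python. The 42-OSD and 3+1 (k+m = 4) figures are just the numbers
from this thread; with multiple pools you'd still want pgcalc or the
autoscaler rather than this back-of-the-envelope version:

```python
import math

def suggested_pg_num(osds: int, target_ratio: int, replication: int) -> int:
    """pg_num = (#OSDs * target ratio) / replication,
    rounded to the nearest power of 2 (in log space)."""
    raw = osds * target_ratio / replication
    return 2 ** round(math.log2(raw))

def pg_replicas_per_osd(pg_num: int, replication: int, osds: int) -> float:
    """The PGS figure `ceph osd df` reports: PG *replicas* per OSD."""
    return pg_num * replication / osds

# Hypothetical numbers from this thread: 42 OSDs, the 100-replica
# rule of thumb, one EC 3+1 pool (replication factor k+m = 4)
pg = suggested_pg_num(42, 100, 4)
print(pg)                                     # 1024
print(round(pg_replicas_per_osd(pg, 4, 42)))  # 98
```

A 614-PG cluster with mixed replicated and EC pools lands well below that,
which is the point of the reply above.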
>> 8 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode off last_change
>> 30830 lfor 0/0/30825 flags hashpspool,ec_overwrites,selfmanaged_snaps
>> stripe_width 12288 application rbd,rgw
>>
>>   cluster:
>>     id:     c404fafe-767c-11ee-bc37-0509d00921ba
>>     health: HEALTH_OK
>>
>>   services:
>>     mon:         5 daemons, quorum v188-ceph-mgr0,v188-ceph-mgr1,v188-ceph-iscsigw2,v188-ceph6,v188-ceph5 (age 5d)
>>     mgr:         v188-ceph-mgr0.rxcecw (active, since 11w), standbys: v188-ceph-mgr1.hmbuma
>>     mds:         1/1 daemons up, 1 standby
>>     osd:         42 osds: 42 up (since 2M), 42 in (since 3M)
>>     tcmu-runner: 10 portals active (4 hosts)
>>
>>   data:
>>     volumes: 1/1 healthy
>>     pools:   11 pools, 614 pgs
>>     objects: 13.63M objects, 51 TiB
>>     usage:   75 TiB used, 71 TiB / 147 TiB avail
>>     pgs:     613 active+clean
>>              1   active+clean+scrubbing+deep
>>
>>   io:
>>     client: 8.1 MiB/s rd, 105 MiB/s wr, 320 op/s rd, 2.31k op/s wr
>>
>> Best Regards,
>> Laszlo Kardos
>>
>> -----Original Message-----
>> From: Eugen Block <[email protected]>
>> Sent: Tuesday, September 30, 2025 9:03 AM
>> To: [email protected]
>> Subject: [ceph-users] Re: Ceph GWCLI issue
>>
>> Hi,
>>
>> I don't have an answer why the image is in unknown state, but I'd be
>> concerned about the pool's pg_num. You have terabytes in a pool with a
>> single PG? That's awful and should be increased to a more suitable
>> value. I can't say if that would fix anything regarding the unknown
>> issue, but that's definitely not good at all.
>>
>> What is the overall Ceph status (ceph -s)?
>>
>> Regards,
>> Eugen
>>
>> Zitat von Kardos László <[email protected]>:
>>
>>> Hello,
>>>
>>> We have encountered the following issue in our production environment:
>>>
>>> A new RBD image was created within an existing pool, and its status
>>> is reported as "unknown" in GWCLI. Based on our tests, this does not
>>> appear to cause operational issues, but we would like to investigate
>>> the root cause.
>>> No relevant information regarding this issue was found in the logs.
>>>
>>> GWCLI output:
>>>
>>> o- / ................................................................ [...]
>>>   o- cluster ................................................ [Clusters: 1]
>>>   | o- ceph ................................................... [HEALTH_OK]
>>>   | o- pools .................................................. [Pools: 11]
>>>   | | o- .mgr ............ [(x3), Commit: 0.00Y/15591725M (0%), Used: 194124K]
>>>   | | o- .nfs ............. [(x3), Commit: 0.00Y/15591725M (0%), Used: 16924b]
>>>   | | o- xxxx-test ........ [(2+1), Commit: 0.00Y/23727198M (0%), Used: 0.00Y]
>>>   | | o- xxxxx-erasure-0  [(2+1), Commit: 0.00Y/23727198M (0%), Used: 61519257668K]
>>>   | | o- xxxxxx-repl ..... [(x3), Commit: 0.00Y/15591725M (0%), Used: 130084b]
>>>   | | o- cephfs.cephfs-test.data  [(x3), Commit: 0.00Y/15591725M (0%), Used: 9090444K]
>>>   | | o- cephfs.cephfs-test.meta  [(x3), Commit: 0.00Y/15591725M (0%), Used: 516415713b]
>>>   | | o- xxxxx-data  [(3+1), Commit: 0.00Y/9604386M (0%), Used: 7547753556K]
>>>   | | o- xxxxx-rpl ..... [(x3), Commit: 12.0T/4268616M (294%), Used: 85265b]
>>>   | | o- xxxxx-data  [(3+1), Commit: 0.00Y/5011626M (0%), Used: 10955179612K]
>>>   | | o- replicated_xxxx  [(x3), Commit: 25.0T/2280846592K (1176%), Used: 46912b]
>>>   | o- topology ....................................... [OSDs: 42, MONs: 5]
>>>   o- disks .......................................... [37.0T, Disks: 3]
>>>   | o- xxxx-rpl ....................................... [xxxx-rpl (12.0T)]
>>>   | | o- xxxxx_lun0 ................. [xxxx-rpl/xxxxx_lun0 (Online, 12.0T)]
>>>   | o- replicated_xxxx ......................... [replicated_xxxx (25.0T)]
>>>   | | o- xxxx_lun0 ............ [replicated_xxxx/xxxx_lun0 (Online, 12.0T)]
>>>   | | o- xxxx_lun_new .... [replicated_xxxx/xxxx_lun_new (Unknown, 13.0T)]
>>>
>>> The image (xxxx_lun_new) is provisioned to multiple ESXi hosts,
>>> mounted, and formatted with VMFS6. The datastore is writable and
>>> readable by the hosts.
>>>
>>> There is a change in the block size of the RBD image: the older RBD
>>> images use a 4 MiB block size, while the new RBD image uses a 512 KiB
>>> block size.
>>>
>>> RBD Image Parameters:
>>>
>>> For replicated_xxxx / xxxx_lun0 (Online status in GWCLI):
>>>
>>> rbd image 'xxxx_lun0':
>>>     size 12 TiB in 3145728 objects
>>>     order 22 (4 MiB objects)
>>>     snapshot_count: 0
>>>     id: 5c1b5ecfdfa46
>>>     data_pool: xxxx0-data
>>>     block_name_prefix: rbd_data.14.5c1b5ecfdfa46
>>>     format: 2
>>>     features: exclusive-lock, data-pool
>>>     op_features:
>>>     flags:
>>>     create_timestamp: Tue Jul 8 13:02:11 2025
>>>     access_timestamp: Thu Sep 25 13:49:47 2025
>>>     modify_timestamp: Thu Sep 25 13:50:05 2025
>>>
>>> For replicated_xxxx / xxxx_lun_new (Unknown status in GWCLI):
>>>
>>> rbd image 'xxxx_lun_new':
>>>     size 13 TiB in 27262976 objects
>>>     order 19 (512 KiB objects)
>>>     snapshot_count: 0
>>>     id: 1945d9cf9f41ab
>>>     data_pool: xxxx0-data
>>>     block_name_prefix: rbd_data.14.1945d9cf9f41ab
>>>     format: 2
>>>     features: exclusive-lock, data-pool
>>>     op_features:
>>>     flags:
>>>     create_timestamp: Wed Sep 24 11:21:21 2025
>>>     access_timestamp: Thu Sep 25 13:50:42 2025
>>>     modify_timestamp: Thu Sep 25 13:49:48 2025
>>>
>>> Pool Parameters:
>>>
>>> pool 14 'replicated_xxxx' replicated size 3 min_size 2 crush_rule 7
>>> object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change
>>> 30743 flags hashpspool stripe_width 0 application rbd,rgw
>>>
>>> Ceph version:
>>>
>>> ceph version 17.2.6 (d7ff0d10654d2280e08f1ab989c7cdf3064446a5) quincy (stable)
>>>
>>> Question:
>>>
>>> What could be causing the RBD image (xxxx_lun_new) to appear in an
>>> "unknown" state in GWCLI?
>> _______________________________________________
>> ceph-users mailing list -- [email protected]
>> To unsubscribe send an email to [email protected]
