[ceph-users] cephadm configuration in git
Hello ceph community, We currently deploy our ceph clusters with ceph-ansible, and the whole configuration of each cluster (number of daemons, osd configurations, rgw configurations, crush configuration, ...) is stored in git as ansible variables, so we can recreate a cluster with ceph-ansible if we need to. To change the configuration of a cluster we change the appropriate Ansible variable, test it on a testing cluster, and if the new configuration works correctly we apply it on the prod cluster. Is this possible with cephadm? Is it possible to keep some config files in git and then apply the same cluster configuration on multiple clusters? Or is this approach not aligned with cephadm and should we do it a different way? Kamil Madac ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
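For reference, cephadm does accept declarative service specifications, so a git-based workflow is possible in principle: keep the spec YAML under version control and apply it with `ceph orch apply -i`. A minimal sketch, where the host pattern, service IDs and device filters are made up for illustration:

```shell
# cluster-spec.yaml would live in git; here we just create it inline:
cat > cluster-spec.yaml <<'EOF'
service_type: mon
placement:
  count: 3
---
service_type: osd
service_id: default_drive_group
placement:
  host_pattern: 'osd-*'
spec:
  data_devices:
    rotational: 1
  db_devices:
    rotational: 0
---
service_type: rgw
service_id: myrgw
placement:
  count: 2
EOF

# Apply the whole spec to the cluster; re-applying is idempotent,
# similar to re-running an ansible playbook:
ceph orch apply -i cluster-spec.yaml
```

`ceph orch ls --export` dumps the specs currently applied in the same YAML format, which is handy for seeding the git repository from an existing cluster.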
[ceph-users] rbd-mirror and DR test
One of our customers is currently facing a challenge in testing our disaster recovery (DR) procedures on a pair of Ceph clusters (Quincy, version 17.2.5). The issue revolves around the need to resynchronize data after conducting a DR procedure test. In small-scale scenarios this may not be a significant problem, but when dealing with terabytes of data it becomes a considerable challenge. In a typical DR procedure there are two sites, Site A and Site B. The process involves demoting Site A and promoting Site B, followed by the reverse operation to ensure data resynchronization. However, our specific challenge lies in the fact that, in our case:

- Site A is running and serving production traffic; Site B is just for DR purposes.
- Network connectivity between Site A and Site B is deliberately disrupted.
- A "promote" operation is enforced (--force) on Site B, creating a split-brain situation.
- Data access and modifications are performed on Site B during this state.
- To revert to the original configuration, we must demote Site B, but the only way to re-establish RBD mirroring is by forcing a full resynchronization, essentially recopying the entire dataset.

Given these circumstances, we are interested in how to address this challenge efficiently, especially when dealing with large datasets (TBs of data). Are there alternative approaches, best practices, or recommendations such that we won't need to fully resync Site A to Site B in order to re-establish rbd-mirror? Thank you very much for any advice. Kamil Madac
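For reference, the failback sequence described above maps roughly onto the following rbd commands (pool and image names are placeholders); in this split-brain scenario the demote on Site B currently has to be followed by a full resync:

```shell
# On site B, after the DR test (the image was force-promoted there):
rbd mirror image demote mypool/myimage

# On site A, which remains the production primary
# (only needed if it was demoted during the test):
rbd mirror image promote mypool/myimage

# Back on site B: flag the image for resynchronization from site A.
# This triggers a full re-copy of the image, which is exactly the
# pain point for TB-scale datasets described in the post.
rbd mirror image resync mypool/myimage
```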
[ceph-users] Re: rbd map: corrupt full osdmap (-22) when
Ilya, Thanks for clarification. On Thu, May 4, 2023 at 1:12 PM Ilya Dryomov wrote: > On Thu, May 4, 2023 at 11:27 AM Kamil Madac wrote: > > > > Thanks for the info. > > > > As a solution we used rbd-nbd which works fine without any issues. If we > will have time we will also try to disable ipv4 on the cluster and will try > kernel rbd mapping again. Are there any disadvantages when using NBD > instead of kernel driver? > > Ceph doesn't really support dual stack configurations. It's not > something that is tested: even if it happens to work for some use case > today, it can very well break tomorrow. The kernel client just makes > that very explicit ;) > > rbd-nbd is less performant and historically also less stable (although > that might have changed in recent kernels as a bunch of work went into > the NBD driver upstream). It's also heavier on resource usage but that > won't be noticeable/can be disregarded if you are not mapping dozens of > RBD images on a single node. > > Thanks, > > Ilya > -- Kamil Madac <https://kmadac.github.io/>
[ceph-users] Re: rbd map: corrupt full osdmap (-22) when
Thanks for the info. As a solution we used rbd-nbd which works fine without any issues. If we will have time we will also try to disable ipv4 on the cluster and will try kernel rbd mapping again. Are there any disadvantages when using NBD instead of kernel driver? Thanks On Wed, May 3, 2023 at 4:06 PM Ilya Dryomov wrote: > On Wed, May 3, 2023 at 11:24 AM Kamil Madac wrote: > > > > Hi, > > > > We deployed pacific cluster 16.2.12 with cephadm. We experience following > > error during rbd map: > > > > [Wed May 3 08:59:11 2023] libceph: mon2 (1)[2a00:da8:ffef:1433::]:6789 > > session established > > [Wed May 3 08:59:11 2023] libceph: another match of type 1 in addrvec > > [Wed May 3 08:59:11 2023] libceph: corrupt full osdmap (-22) epoch 200 > off > > 1042 (9876284d of 0cb24b58-80b70596) > > [Wed May 3 08:59:11 2023] osdmap: : 08 07 7d 10 00 00 09 01 5d > 09 > > 00 00 a2 22 3b 86 ..}.]";. > > [Wed May 3 08:59:11 2023] osdmap: 0010: e4 f5 11 ed 99 ee 47 75 ca > 3c > > ad 23 c8 00 00 00 ..Gu.<.# > > [Wed May 3 08:59:11 2023] osdmap: 0020: 21 68 4a 64 98 d2 5d 2e 84 > fd > > 50 64 d9 3a 48 26 !hJd..]...Pd.:H& > > [Wed May 3 08:59:11 2023] osdmap: 0030: 02 00 00 00 01 00 00 00 00 > 00 > > 00 00 1d 05 71 01 ..q. > > > > > > Linux Kernel is 6.1.13 and the important thing is that we are using ipv6 > > addresses for connection to ceph nodes. > > We were able to map rbd from client with kernel 5.10, but in prod > > environment we are not allowed to use that kernel. > > > > What could be the reason for such behavior on newer kernels and how to > > troubleshoot it? 
> > > > Here is output of ceph osd dump: > > > > # ceph osd dump > > epoch 200 > > fsid a2223b86-e4f5-11ed-99ee-4775ca3cad23 > > created 2023-04-27T12:18:41.777900+ > > modified 2023-05-02T12:09:40.642267+ > > flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit > > crush_version 34 > > full_ratio 0.95 > > backfillfull_ratio 0.9 > > nearfull_ratio 0.85 > > require_min_compat_client luminous > > min_compat_client jewel > > require_osd_release pacific > > stretch_mode_enabled false > > pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 > > object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 183 > > flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application > > mgr_devicehealth > > pool 2 'idp' replicated size 3 min_size 2 crush_rule 0 object_hash > rjenkins > > pg_num 32 pgp_num 32 autoscale_mode on last_change 48 flags > > hashpspool,selfmanaged_snaps stripe_width 0 application rbd > > max_osd 3 > > osd.0 up in weight 1 up_from 176 up_thru 182 down_at 172 > > last_clean_interval [170,171) > > > [v2:[2a00:da8:ffef:1431::]:6800/805023868,v1:[2a00:da8:ffef:1431::]:6801/805023868,v2: > > 0.0.0.0:6802/805023868,v1:0.0.0.0:6803/805023868] > > > [v2:[2a00:da8:ffef:1431::]:6804/805023868,v1:[2a00:da8:ffef:1431::]:6805/805023868,v2: > > 0.0.0.0:6806/805023868,v1:0.0.0.0:6807/805023868] exists,up > > e8fd0ee2-ea63-4d02-8f36-219d36869078 > > osd.1 up in weight 1 up_from 136 up_thru 182 down_at 0 > > last_clean_interval [0,0) > > > [v2:[2a00:da8:ffef:1432::]:6800/2172723816,v1:[2a00:da8:ffef:1432::]:6801/2172723816,v2: > > 0.0.0.0:6802/2172723816,v1:0.0.0.0:6803/2172723816] > > > [v2:[2a00:da8:ffef:1432::]:6804/2172723816,v1:[2a00:da8:ffef:1432::]:6805/2172723816,v2: > > 0.0.0.0:6806/2172723816,v1:0.0.0.0:6807/2172723816] exists,up > > 0b7b5628-9273-4757-85fb-9c16e8441895 > > osd.2 up in weight 1 up_from 182 up_thru 182 down_at 178 > > last_clean_interval [123,177) > > > 
[v2:[2a00:da8:ffef:1433::]:6800/887631330,v1:[2a00:da8:ffef:1433::]:6801/887631330,v2: > > 0.0.0.0:6802/887631330,v1:0.0.0.0:6803/887631330] > > > [v2:[2a00:da8:ffef:1433::]:6804/887631330,v1:[2a00:da8:ffef:1433::]:6805/887631330,v2: > > 0.0.0.0:6806/887631330,v1:0.0.0.0:6807/887631330] exists,up > > 21f8d0d5-6a3f-4f78-96c8-8ec4e4f78a01 > > Hi Kamil, > > The issue is bogus 0.0.0.0 addresses. This came up before, see [1] and > later messages from Stefan in the thread. You would need to ensure that > ms_bind_ipv4 is set to false and restart OSDs. > > [1] > https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/Q6VYRJBPHQI63OQTBJG2N3BJD2KBEZM4/ > > Thanks, > > Ilya > -- Kamil Madac <https://kmadac.github.io/> ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
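For anyone landing here with the same bogus 0.0.0.0 entries in the addrvecs, Ilya's suggestion translates roughly to the following (a sketch, assuming a cephadm-managed cluster; the OSD service name may differ, e.g. osd.<spec-name>):

```shell
# Check what the OSDs currently bind to; dual-stack is not really supported:
ceph config get osd ms_bind_ipv4

# Disable IPv4 binding so the OSDs stop advertising 0.0.0.0 addresses
# alongside their IPv6 ones:
ceph config set global ms_bind_ipv4 false

# Restart the OSDs so they re-register clean address vectors in the osdmap:
ceph orch restart osd
```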
[ceph-users] rbd map: corrupt full osdmap (-22) when
Hi, We deployed a pacific cluster 16.2.12 with cephadm and we experience the following error during rbd map:

[Wed May 3 08:59:11 2023] libceph: mon2 (1)[2a00:da8:ffef:1433::]:6789 session established
[Wed May 3 08:59:11 2023] libceph: another match of type 1 in addrvec
[Wed May 3 08:59:11 2023] libceph: corrupt full osdmap (-22) epoch 200 off 1042 (9876284d of 0cb24b58-80b70596)
[Wed May 3 08:59:11 2023] osdmap: : 08 07 7d 10 00 00 09 01 5d 09 00 00 a2 22 3b 86 ..}.]";.
[Wed May 3 08:59:11 2023] osdmap: 0010: e4 f5 11 ed 99 ee 47 75 ca 3c ad 23 c8 00 00 00 ..Gu.<.#
[Wed May 3 08:59:11 2023] osdmap: 0020: 21 68 4a 64 98 d2 5d 2e 84 fd 50 64 d9 3a 48 26 !hJd..]...Pd.:H&
[Wed May 3 08:59:11 2023] osdmap: 0030: 02 00 00 00 01 00 00 00 00 00 00 00 1d 05 71 01 ..q.

The Linux kernel is 6.1.13 and, importantly, we are using ipv6 addresses for the connection to the ceph nodes. We were able to map rbd from a client with kernel 5.10, but in the prod environment we are not allowed to use that kernel. What could be the reason for such behavior on newer kernels and how can we troubleshoot it?
Here is the output of ceph osd dump:

# ceph osd dump
epoch 200
fsid a2223b86-e4f5-11ed-99ee-4775ca3cad23
created 2023-04-27T12:18:41.777900+
modified 2023-05-02T12:09:40.642267+
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 34
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client luminous
min_compat_client jewel
require_osd_release pacific
stretch_mode_enabled false
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 183 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr_devicehealth
pool 2 'idp' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 48 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
max_osd 3
osd.0 up in weight 1 up_from 176 up_thru 182 down_at 172 last_clean_interval [170,171) [v2:[2a00:da8:ffef:1431::]:6800/805023868,v1:[2a00:da8:ffef:1431::]:6801/805023868,v2:0.0.0.0:6802/805023868,v1:0.0.0.0:6803/805023868] [v2:[2a00:da8:ffef:1431::]:6804/805023868,v1:[2a00:da8:ffef:1431::]:6805/805023868,v2:0.0.0.0:6806/805023868,v1:0.0.0.0:6807/805023868] exists,up e8fd0ee2-ea63-4d02-8f36-219d36869078
osd.1 up in weight 1 up_from 136 up_thru 182 down_at 0 last_clean_interval [0,0) [v2:[2a00:da8:ffef:1432::]:6800/2172723816,v1:[2a00:da8:ffef:1432::]:6801/2172723816,v2:0.0.0.0:6802/2172723816,v1:0.0.0.0:6803/2172723816] [v2:[2a00:da8:ffef:1432::]:6804/2172723816,v1:[2a00:da8:ffef:1432::]:6805/2172723816,v2:0.0.0.0:6806/2172723816,v1:0.0.0.0:6807/2172723816] exists,up 0b7b5628-9273-4757-85fb-9c16e8441895
osd.2 up in weight 1 up_from 182 up_thru 182 down_at 178 last_clean_interval [123,177) [v2:[2a00:da8:ffef:1433::]:6800/887631330,v1:[2a00:da8:ffef:1433::]:6801/887631330,v2:0.0.0.0:6802/887631330,v1:0.0.0.0:6803/887631330] [v2:[2a00:da8:ffef:1433::]:6804/887631330,v1:[2a00:da8:ffef:1433::]:6805/887631330,v2:0.0.0.0:6806/887631330,v1:0.0.0.0:6807/887631330] exists,up 21f8d0d5-6a3f-4f78-96c8-8ec4e4f78a01

Thank you. -- Kamil Madac
[ceph-users] Re: RGW can't create bucket
Hi Boris, The debug log showed that the customer had accidentally misconfigured placement_targets and default_placement in the zonegroup configuration, which caused the access-denied errors during bucket creation. This is what was found in the debug logs:

s3:create_bucket user not permitted to use placement rule default-placement
s3:create_bucket rgw_create_bucket returned ret=-1 bucket=

On Fri, Mar 31, 2023 at 11:12 AM Boris Behrens wrote: > Sounds like all users have the problem? > > so what I would do in my setup now: > - start a new rgw client with maximum logging (debug_rgw = 20) on a non > public port > - test against this endpoint and check logs > > This might give you more insight. > > On Fri, Mar 31, 2023 at 09:36, Kamil Madac < > kamil.ma...@gmail.com> wrote: >> We checked s3cmd --debug and endpoint is ok (Working with existing >> buckets works ok with same s3cmd config). From what I read, "max_buckets": >> 0 means that there is no quota for the number of buckets. There are also >> users who have "max_buckets": 1000, and those users have the same >> access_denied issue when creating a bucket. >> >> We also tried other bucket names and it is the same issue. >> >> On Thu, Mar 30, 2023 at 6:28 PM Boris Behrens wrote: >> >>> Hi Kamil, >>> is this with all new buckets or only the 'test' bucket? Maybe the name is >>> already taken? >>> Can you check s3cmd --debug if you are connecting to the correct >>> endpoint? >>> >>> Also I see that the user seems to not be allowed to create buckets >>> ... >>> "max_buckets": 0, >>> ... >>> >>> Cheers >>> Boris >>> >>> On Thu, Mar 30, 2023 at 17:43, Kamil Madac < >>> kamil.ma...@gmail.com> wrote: >>> >>> > Hi Eugen >>> > >>> > It is version 16.2.6, we checked quotas and we can't see any applied >>> quotas >>> > for users. As I wrote, every user is affected. Are there any non-user >>> or >>> > global quotas, which can cause that no user can create a bucket? 
>>> > >>> > Here is example output of newly created user which cannot create >>> buckets >>> > too: >>> > >>> > { >>> > "user_id": "user123", >>> > "display_name": "user123", >>> > "email": "", >>> > "suspended": 0, >>> > "max_buckets": 0, >>> > "subusers": [], >>> > "keys": [ >>> > { >>> > "user": "user123", >>> > "access_key": "ZIYY6XNSC06EU8YPL1AM", >>> > "secret_key": "xx" >>> > } >>> > ], >>> > "swift_keys": [], >>> > "caps": [ >>> > { >>> > "type": "buckets", >>> > "perm": "*" >>> > } >>> > ], >>> > "op_mask": "read, write, delete", >>> > "default_placement": "", >>> > "default_storage_class": "", >>> > "placement_tags": [], >>> > "bucket_quota": { >>> > "enabled": false, >>> > "check_on_raw": false, >>> > "max_size": -1, >>> > "max_size_kb": 0, >>> > "max_objects": -1 >>> > }, >>> > "user_quota": { >>> > "enabled": false, >>> > "check_on_raw": false, >>> > "max_size": -1, >>> > "max_size_kb": 0, >>> > "max_objects": -1 >>> > }, >>> > "temp_url_keys": [], >>> > "type": "rgw", >>> > "mfa_ids": [] >>> > } >>> > >>> > On Thu, Mar 30, 2023 at 1:25 PM Eugen Block wrote: >>> > >>> > > Hi, >>> > > >>> > > what ceph version is this? Could you have hit some quota? >>> > > >>> > > Zitat von Kamil Madac : >>> > > >>> > > > Hi, >>> > > > >>> > > > One of my customers had a correctly working RGW cluster with two >>> zones >>> > in >>> > > > one zonegroup and since a few days a
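For readers hitting the same "user not permitted to use placement rule" error, a rough way to compare the zonegroup's placement targets against what the user requests (zonegroup name and uid are placeholders):

```shell
# List the placement targets the zonegroup actually defines:
radosgw-admin zonegroup placement list --rgw-zonegroup=myzonegroup

# Compare with the user's default_placement / placement_tags:
radosgw-admin user info --uid=user123

# After correcting the zonegroup placement configuration, commit the
# period so all RGWs in the realm pick up the change:
radosgw-admin period update --commit
```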
[ceph-users] Re: RGW can't create bucket
We checked s3cmd --debug and the endpoint is ok (working with existing buckets works with the same s3cmd config). From what I read, "max_buckets": 0 means that there is no quota for the number of buckets. There are also users who have "max_buckets": 1000, and those users have the same access_denied issue when creating a bucket. We also tried other bucket names and it is the same issue. On Thu, Mar 30, 2023 at 6:28 PM Boris Behrens wrote: > Hi Kamil, > is this with all new buckets or only the 'test' bucket? Maybe the name is > already taken? > Can you check s3cmd --debug if you are connecting to the correct endpoint? > > Also I see that the user seems to not be allowed to create buckets > ... > "max_buckets": 0, > ... > > Cheers > Boris > > On Thu, Mar 30, 2023 at 17:43, Kamil Madac < > kamil.ma...@gmail.com> wrote: > > > Hi Eugen > > > > It is version 16.2.6, we checked quotas and we can't see any applied > quotas > > for users. As I wrote, every user is affected. Are there any non-user or > > global quotas, which can cause that no user can create a bucket? 
> > > > Here is example output of newly created user which cannot create buckets > > too: > > > > { > > "user_id": "user123", > > "display_name": "user123", > > "email": "", > > "suspended": 0, > > "max_buckets": 0, > > "subusers": [], > > "keys": [ > > { > > "user": "user123", > > "access_key": "ZIYY6XNSC06EU8YPL1AM", > > "secret_key": "xx" > > } > > ], > > "swift_keys": [], > > "caps": [ > > { > > "type": "buckets", > > "perm": "*" > > } > > ], > > "op_mask": "read, write, delete", > > "default_placement": "", > > "default_storage_class": "", > > "placement_tags": [], > > "bucket_quota": { > > "enabled": false, > > "check_on_raw": false, > > "max_size": -1, > > "max_size_kb": 0, > > "max_objects": -1 > > }, > > "user_quota": { > > "enabled": false, > > "check_on_raw": false, > > "max_size": -1, > > "max_size_kb": 0, > > "max_objects": -1 > > }, > > "temp_url_keys": [], > > "type": "rgw", > > "mfa_ids": [] > > } > > > > On Thu, Mar 30, 2023 at 1:25 PM Eugen Block wrote: > > > > > Hi, > > > > > > what ceph version is this? Could you have hit some quota? > > > > > > Zitat von Kamil Madac : > > > > > > > Hi, > > > > > > > > One of my customers had a correctly working RGW cluster with two > zones > > in > > > > one zonegroup and since a few days ago users are not able to create > > > buckets > > > > and are always getting Access denied. Working with existing buckets > > works > > > > (like listing/putting objects into existing bucket). The only > operation > > > > which is not working is bucket creation. We also tried to create a > new > > > > user, but the behavior is the same, and he is not able to create the > > > > bucket. We tried s3cmd, python script with boto library and also > > > Dashboard > > > > as admin user. We are always getting Access Denied. Zones are > in-sync. > > > > > > > > Has anyone experienced such behavior? 
> > > > > > > > Thanks in advance, here are some outputs: > > > > > > > > $ s3cmd -c .s3cfg_python_client mb s3://test > > > > ERROR: Access to bucket 'test' was denied > > > > ERROR: S3 error: 403 (AccessDenied) > > > > > > > > Zones are in-sync: > > > > > > > > Primary cluster: > > > > > > > > # radosgw-admin sync status > > > > realm 5429b434-6d43-4a18-8f19-a5720a89c621 (solargis-prod) > > > > zonegroup 00e4b3ff-1da8-4a86-9f52-4300c6d0f149 (solargis-prod-ba) > > > > zone 6067eec6-a930-45c7-af7d-a7ef2785a2d7 (solargis-prod-ba-dc) > > > > metadata sync no sync (zone is master) > > > > data sync source: e84fd242-dbae-466c-b4d9-545990590995 > > > (solargis-prod-ba-hq) > > > > syncing
[ceph-users] Re: RGW can't create bucket
Hi Eugen, It is version 16.2.6; we checked quotas and can't see any quotas applied for users. As I wrote, every user is affected. Are there any non-user or global quotas which could cause that no user can create a bucket? Here is example output for a newly created user which cannot create buckets either:

{
    "user_id": "user123",
    "display_name": "user123",
    "email": "",
    "suspended": 0,
    "max_buckets": 0,
    "subusers": [],
    "keys": [
        {
            "user": "user123",
            "access_key": "ZIYY6XNSC06EU8YPL1AM",
            "secret_key": "xx"
        }
    ],
    "swift_keys": [],
    "caps": [
        {
            "type": "buckets",
            "perm": "*"
        }
    ],
    "op_mask": "read, write, delete",
    "default_placement": "",
    "default_storage_class": "",
    "placement_tags": [],
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "user_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "temp_url_keys": [],
    "type": "rgw",
    "mfa_ids": []
}

On Thu, Mar 30, 2023 at 1:25 PM Eugen Block wrote: > Hi, > > what ceph version is this? Could you have hit some quota? > > Quoting Kamil Madac: > > > Hi, > > > > One of my customers had a correctly working RGW cluster with two zones in > > one zonegroup and since a few days ago users are not able to create buckets > > and are always getting Access denied. Working with existing buckets works > > (like listing/putting objects into existing bucket). The only operation > > which is not working is bucket creation. We also tried to create a new > > user, but the behavior is the same, and he is not able to create the > > bucket. We tried s3cmd, python script with boto library and also Dashboard > > as admin user. We are always getting Access Denied. Zones are in-sync. > > > > Has anyone experienced such behavior? 
> > > > Thanks in advance, here are some outputs: > > > > $ s3cmd -c .s3cfg_python_client mb s3://test > > ERROR: Access to bucket 'test' was denied > > ERROR: S3 error: 403 (AccessDenied) > > > > Zones are in-sync: > > > > Primary cluster: > > > > # radosgw-admin sync status > > realm 5429b434-6d43-4a18-8f19-a5720a89c621 (solargis-prod) > > zonegroup 00e4b3ff-1da8-4a86-9f52-4300c6d0f149 (solargis-prod-ba) > > zone 6067eec6-a930-45c7-af7d-a7ef2785a2d7 (solargis-prod-ba-dc) > > metadata sync no sync (zone is master) > > data sync source: e84fd242-dbae-466c-b4d9-545990590995 > (solargis-prod-ba-hq) > > syncing > > full sync: 0/128 shards > > incremental sync: 128/128 shards > > data is caught up with source > > > > > > Secondary cluster: > > > > # radosgw-admin sync status > > realm 5429b434-6d43-4a18-8f19-a5720a89c621 (solargis-prod) > > zonegroup 00e4b3ff-1da8-4a86-9f52-4300c6d0f149 (solargis-prod-ba) > > zone e84fd242-dbae-466c-b4d9-545990590995 (solargis-prod-ba-hq) > > metadata sync syncing > > full sync: 0/64 shards > > incremental sync: 64/64 shards > > metadata is caught up with master > > data sync source: 6067eec6-a930-45c7-af7d-a7ef2785a2d7 > (solargis-prod-ba-dc) > > syncing > > full sync: 0/128 shards > > incremental sync: 128/128 shards > > data is caught up with source > > > > -- > > Kamil Madac > > ___ > > ceph-users mailing list -- ceph-users@ceph.io > > To unsubscribe send an email to ceph-users-le...@ceph.io > > > ___ > ceph-users mailing list -- ceph-users@ceph.io > To unsubscribe send an email to ceph-users-le...@ceph.io > -- Kamil Madac <https://kmadac.github.io/> ___ ceph-users mailing list -- ceph-users@ceph.io To unsubscribe send an email to ceph-users-le...@ceph.io
[ceph-users] RGW can't create bucket
Hi, One of my customers had a correctly working RGW cluster with two zones in one zonegroup, and since a few days ago users are not able to create buckets and always get Access Denied. Working with existing buckets works (like listing/putting objects into an existing bucket). The only operation which is not working is bucket creation. We also tried to create a new user, but the behavior is the same and he is not able to create a bucket. We tried s3cmd, a python script with the boto library, and also the Dashboard as admin user. We always get Access Denied. The zones are in sync. Has anyone experienced such behavior? Thanks in advance, here are some outputs:

$ s3cmd -c .s3cfg_python_client mb s3://test
ERROR: Access to bucket 'test' was denied
ERROR: S3 error: 403 (AccessDenied)

Zones are in sync:

Primary cluster:

# radosgw-admin sync status
realm 5429b434-6d43-4a18-8f19-a5720a89c621 (solargis-prod)
zonegroup 00e4b3ff-1da8-4a86-9f52-4300c6d0f149 (solargis-prod-ba)
zone 6067eec6-a930-45c7-af7d-a7ef2785a2d7 (solargis-prod-ba-dc)
metadata sync no sync (zone is master)
data sync source: e84fd242-dbae-466c-b4d9-545990590995 (solargis-prod-ba-hq)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source

Secondary cluster:

# radosgw-admin sync status
realm 5429b434-6d43-4a18-8f19-a5720a89c621 (solargis-prod)
zonegroup 00e4b3ff-1da8-4a86-9f52-4300c6d0f149 (solargis-prod-ba)
zone e84fd242-dbae-466c-b4d9-545990590995 (solargis-prod-ba-hq)
metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
data sync source: 6067eec6-a930-45c7-af7d-a7ef2785a2d7 (solargis-prod-ba-dc)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source
[ceph-users] Re: RGW replication and multiple endpoints
Hi Christian, Thanks for the response and for sharing your experience. Those bugs look like quite an issue for me personally and for the customer, so we will replicate the data over LBs in front of the RGWs. I will regularly check the status of the bugs, and once they are resolved I will do another round of tests in our test lab. Kamil On Mon, Nov 14, 2022 at 3:05 PM Christian Rohmann < christian.rohm...@inovex.de> wrote: > Hey Kamil > > On 14/11/2022 13:54, Kamil Madac wrote: > > Hello, > > > > I'm trying to create a RGW Zonegroup with two zones, and to have data > > replicated between the zones. Each zone is separate Ceph cluster. There is > > a possibility to use list of endpoints in zone definitions (not just single > > endpoint) which will be then used for the replication between zones. so I > > tried to use it instead of using LB in front of clusters for the > > replication . > > > > [...] > > > > When node is back again, replication continue to work. > > > > What is the reason to have possibility to have multiple endpoints in the > > zone configuration when outage of one of them makes replication not > > working? > > We are running a similar setup and ran into similar issues before when > doing rolling restarts of the RGWs. > > 1) Mostly it's a single metadata shard never syncing up and requiring a > complete "metadata init". But this issue will likely be addressed via > https://tracker.ceph.com/issues/39657 > > 2) But we also observed issues with one RGW being unavailable or just > slow and as a result influencing the whole sync process. I suppose the > HTTP client used within the rgw syncer does not do a good job of tracking > which remote RGW is healthy, or a slow reading RGW could just be locking > all the shards ... > > 3) But as far as "cooperating" goes there are improvements being worked > on, see https://tracker.ceph.com/issues/41230 or > https://github.com/ceph/ceph/pull/45958 which then makes better use of > having multiple distinct RGWs in both zones. 
> > Regards > > Christian > -- Kamil Madac <https://kmadac.github.io/>
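For completeness, switching the replication to go over the load balancers would mean replacing the per-RGW endpoint list on each zone with the LB address, roughly like this (LB hostnames are placeholders; zone names taken from the original post):

```shell
# On the primary side, point the zone at site A's load balancer:
radosgw-admin zone modify --rgw-zone=sg-ba-pri \
    --endpoints=http://lb-site-a.example.com:80

# On the secondary side, point the zone at site B's load balancer:
radosgw-admin zone modify --rgw-zone=sg-ba-sec \
    --endpoints=http://lb-site-b.example.com:80

# Propagate the new endpoints to the whole realm:
radosgw-admin period update --commit
```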
[ceph-users] RGW replication and multiple endpoints
Hello, I'm trying to create an RGW zonegroup with two zones and have data replicated between the zones. Each zone is a separate Ceph cluster. There is a possibility to use a list of endpoints in the zone definitions (not just a single endpoint), which will then be used for replication between the zones, so I tried to use that instead of putting an LB in front of the clusters for replication. Here is how I create the zones:

radosgw-admin zone create --rgw-zone=sg-ba-pri --master --rgw-zonegroup=sg-ba --endpoints=http://192.168.121.157:80,http://192.168.121.5:80,http://192.168.121.93:80 --access-key=1234567 --secret=098765 --default

When I configure it on both sides, replication is working, but when one of the source rgw nodes is unavailable, replication stops working with an Input/output error:

[ceph: root@ceph2-node0 /]# radosgw-admin sync status
realm b131aff4-2e6f-4fb2-8b61-c895bf6be9f3 (sg)
zonegroup 9a2956bc-2ea3-4943-81c9-6350c7abd6d1 (sg-ba)
zone baa3b15c-36ce-4a74-9ca1-afb2e21fd809 (sg-ba-sec)
2022-11-14T08:32:50.069+ 7fa201d37500 0 ERROR: failed to fetch mdlog info
metadata sync syncing
full sync: 0/64 shards
failed to fetch master sync status: (5) Input/output error
2022-11-14T08:32:53.140+ 7fa201d37500 0 ERROR: failed to fetch datalog info
data sync source: 457539c6-995c-4116-8189-50490c126903 (sg-ba-pri)
failed to retrieve sync info: (5) Input/output error

When the node is back again, replication continues to work. What is the point of having multiple endpoints in the zone configuration when an outage of one of them makes replication stop working? Thank you. Kamil Madac
[ceph-users] Fwd: Active-Active MDS RAM consumption
Hi Ceph Community, One of my customers has an issue with their MDS cluster. The Ceph cluster is deployed with cephadm and runs version 16.2.7. As soon as the MDS cluster is switched from active-standby to active-active-standby, an MDS daemon starts to consume a lot of RAM. After some time it consumes 48 GB of RAM and the container engine kills it. The same thing then happens on the second node, which is killed after some time as well, and the situation repeats. When the MDS cluster is switched back to the active-standby configuration, the situation stabilizes. mds_cache_memory_limit is set to 4294967296 (the 4 GiB default). No health warning about high cache consumption is generated. Is this known behavior, and can it be solved by some reconfiguration? Can someone give us a hint on what to check, debug or tune? Thank you. Kamil Madac
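A few generic starting points for a case like this (the daemon name "mds.0" is a placeholder; under cephadm MDS daemons are usually named like mds.<fs>.<host>.<id>, see `ceph orch ps`):

```shell
# Confirm the cache limit the MDSs are actually running with:
ceph config get mds mds_cache_memory_limit

# Per-daemon view of cache usage versus the configured limit:
ceph tell mds.0 cache status

# See what might be pinning memory beyond the cache: in-flight ops
# and the number/size of client sessions (caps held per client):
ceph tell mds.0 dump_ops_in_flight
ceph tell mds.0 session ls
```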