[ceph-users] cephadm configuration in git

2023-10-11 Thread Kamil Madac
Hello ceph community,

Currently we have our Ceph clusters deployed with ceph-ansible, and the whole
configuration (number of daemons, OSD configuration, RGW configuration,
CRUSH configuration, ...) of each cluster is stored in git as Ansible
variables, so we can recreate a cluster with ceph-ansible whenever we need
to.
To change the configuration of a cluster we change the appropriate Ansible
variable, test it on a testing cluster, and if the new configuration works
correctly we apply it to the prod cluster.

Is it possible to do this with cephadm? Is it possible to keep some config
files in git and then apply the same cluster configuration to multiple
clusters? Or is this approach not aligned with cephadm and should we do it
a different way?
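
For context, this is roughly the workflow I am imagining with the cephadm
orchestrator (just a sketch on my side; the spec content below is a made-up
minimal example and the service names/placements are placeholders):

# dump the service specs the orchestrator currently applies; suitable for git
ceph orch ls --export > cluster-spec.yaml
# cluster-spec.yaml then contains plain YAML specs, e.g. something like:
#   service_type: rgw
#   service_id: main
#   placement:
#     count: 2
# edit the file in git, review it, test it on the testing cluster, then
# re-apply it; 'ceph orch apply -i' is idempotent, so re-running it with an
# unchanged file is a no-op
ceph orch apply -i cluster-spec.yaml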

Kamil Madac
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] rbd-mirror and DR test

2023-09-18 Thread Kamil Madac
One of our customers is currently facing a challenge in testing the
disaster recovery (DR) procedures on a pair of Ceph clusters (Quincy,
version 17.2.5).

Our issue revolves around the need to resynchronize data after
conducting a DR procedure test. In small-scale scenarios, this may not
be a significant problem. However, when dealing with terabytes of
data, it becomes a considerable challenge.

In a typical DR procedure, there are two sites, Site A and Site B. The
process involves demoting Site A and promoting Site B, followed by the
reverse operation to ensure data resynchronization. However, the specific
challenge in our case is that:

- Site A is running and serving production traffic; Site B is just for
DR purposes.
- Network connectivity between Site A and Site B is deliberately disrupted.
- A "promote" operation is enforced (--force) on Site B, creating a
split-brain situation.
- Data access and modifications are performed on Site B during this state.
- To revert to the original configuration, we must demote Site B, but
the only way to re-establish RBD mirroring is to force a full
resynchronization, essentially re-copying the entire dataset (the exact
command sequence we run is sketched below).
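
For clarity, the sequence we effectively run during the DR test looks
roughly like this (a sketch of our procedure; pool/image names are
placeholders):

# on Site B, while the replication link to Site A is deliberately cut:
rbd mirror image promote --force mypool/myimage   # forced promotion -> split-brain
# ... DR test workload reads and writes on Site B ...
# to return to the original direction afterwards:
rbd mirror image demote mypool/myimage            # demote Site B again
rbd mirror image resync mypool/myimage            # flags the image for resync; today this re-copies the whole image from Site A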

Given these circumstances, we are interested in how to address this
challenge efficiently, especially when dealing with large datasets
(TBs of data). Are there alternative approaches, best practices, or
recommendations so that we won't need to fully resync Site A to Site
B in order to re-establish rbd-mirror?

Thank you very much for any advice.

Kamil Madac
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd map: corrupt full osdmap (-22) when

2023-05-05 Thread Kamil Madac
Ilya, thanks for the clarification.

On Thu, May 4, 2023 at 1:12 PM Ilya Dryomov  wrote:

> On Thu, May 4, 2023 at 11:27 AM Kamil Madac  wrote:
> >
> > Thanks for the info.
> >
> > As a solution we used rbd-nbd, which works fine without any issues. If
> > we have time, we will also try to disable IPv4 on the cluster and try
> > kernel rbd mapping again. Are there any disadvantages to using NBD
> > instead of the kernel driver?
>
> Ceph doesn't really support dual stack configurations.  It's not
> something that is tested: even if it happens to work for some use case
> today, it can very well break tomorrow.  The kernel client just makes
> that very explicit ;)
>
> rbd-nbd is less performant and historically also less stable (although
> that might have changed in recent kernels as a bunch of work went into
> the NBD driver upstream).  It's also heavier on resource usage but that
> won't be noticeable/can be disregarded if you are not mapping dozens of
> RBD images on a single node.
>
> Thanks,
>
> Ilya
>


-- 
Kamil Madac <https://kmadac.github.io/>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: rbd map: corrupt full osdmap (-22) when

2023-05-04 Thread Kamil Madac
Thanks for the info.

As a solution we used rbd-nbd, which works fine without any issues. If we
have time, we will also try to disable IPv4 on the cluster and try kernel
rbd mapping again. Are there any disadvantages to using NBD instead of the
kernel driver?
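
For reference, this is all we had to change on the clients (the image spec
is a placeholder):

# map through the userspace NBD driver instead of the kernel RBD client
rbd device map -t nbd mypool/myimage     # equivalent to: rbd-nbd map mypool/myimage
rbd device list -t nbd                   # list current NBD mappings
rbd device unmap -t nbd mypool/myimage   # unmap when done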

Thanks

On Wed, May 3, 2023 at 4:06 PM Ilya Dryomov  wrote:

> On Wed, May 3, 2023 at 11:24 AM Kamil Madac  wrote:
> >
> > Hi,
> >
> > We deployed pacific cluster 16.2.12 with cephadm. We experience following
> > error during rbd map:
> >
> > [Wed May  3 08:59:11 2023] libceph: mon2 (1)[2a00:da8:ffef:1433::]:6789
> > session established
> > [Wed May  3 08:59:11 2023] libceph: another match of type 1 in addrvec
> > [Wed May  3 08:59:11 2023] libceph: corrupt full osdmap (-22) epoch 200
> off
> > 1042 (9876284d of 0cb24b58-80b70596)
> > [Wed May  3 08:59:11 2023] osdmap: : 08 07 7d 10 00 00 09 01 5d
> 09
> > 00 00 a2 22 3b 86  ..}.]";.
> > [Wed May  3 08:59:11 2023] osdmap: 0010: e4 f5 11 ed 99 ee 47 75 ca
> 3c
> > ad 23 c8 00 00 00  ..Gu.<.#
> > [Wed May  3 08:59:11 2023] osdmap: 0020: 21 68 4a 64 98 d2 5d 2e 84
> fd
> > 50 64 d9 3a 48 26  !hJd..]...Pd.:H&
> > [Wed May  3 08:59:11 2023] osdmap: 0030: 02 00 00 00 01 00 00 00 00
> 00
> > 00 00 1d 05 71 01  ..q.
> > 
> >
> > Linux Kernel is 6.1.13 and the important thing is that we are using ipv6
> > addresses for connection to ceph nodes.
> > We were able to map rbd from client with kernel 5.10, but in prod
> > environment we are not allowed to use that kernel.
> >
> > What could be the reason for such behavior on newer kernels and how to
> > troubleshoot it?
> >
> > Here is output of ceph osd dump:
> >
> > # ceph osd dump
> > epoch 200
> > fsid a2223b86-e4f5-11ed-99ee-4775ca3cad23
> > created 2023-04-27T12:18:41.777900+
> > modified 2023-05-02T12:09:40.642267+
> > flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
> > crush_version 34
> > full_ratio 0.95
> > backfillfull_ratio 0.9
> > nearfull_ratio 0.85
> > require_min_compat_client luminous
> > min_compat_client jewel
> > require_osd_release pacific
> > stretch_mode_enabled false
> > pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0
> > object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 183
> > flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application
> > mgr_devicehealth
> > pool 2 'idp' replicated size 3 min_size 2 crush_rule 0 object_hash
> rjenkins
> > pg_num 32 pgp_num 32 autoscale_mode on last_change 48 flags
> > hashpspool,selfmanaged_snaps stripe_width 0 application rbd
> > max_osd 3
> > osd.0 up   in  weight 1 up_from 176 up_thru 182 down_at 172
> > last_clean_interval [170,171)
> >
> [v2:[2a00:da8:ffef:1431::]:6800/805023868,v1:[2a00:da8:ffef:1431::]:6801/805023868,v2:
> > 0.0.0.0:6802/805023868,v1:0.0.0.0:6803/805023868]
> >
> [v2:[2a00:da8:ffef:1431::]:6804/805023868,v1:[2a00:da8:ffef:1431::]:6805/805023868,v2:
> > 0.0.0.0:6806/805023868,v1:0.0.0.0:6807/805023868] exists,up
> > e8fd0ee2-ea63-4d02-8f36-219d36869078
> > osd.1 up   in  weight 1 up_from 136 up_thru 182 down_at 0
> > last_clean_interval [0,0)
> >
> [v2:[2a00:da8:ffef:1432::]:6800/2172723816,v1:[2a00:da8:ffef:1432::]:6801/2172723816,v2:
> > 0.0.0.0:6802/2172723816,v1:0.0.0.0:6803/2172723816]
> >
> [v2:[2a00:da8:ffef:1432::]:6804/2172723816,v1:[2a00:da8:ffef:1432::]:6805/2172723816,v2:
> > 0.0.0.0:6806/2172723816,v1:0.0.0.0:6807/2172723816] exists,up
> > 0b7b5628-9273-4757-85fb-9c16e8441895
> > osd.2 up   in  weight 1 up_from 182 up_thru 182 down_at 178
> > last_clean_interval [123,177)
> >
> [v2:[2a00:da8:ffef:1433::]:6800/887631330,v1:[2a00:da8:ffef:1433::]:6801/887631330,v2:
> > 0.0.0.0:6802/887631330,v1:0.0.0.0:6803/887631330]
> >
> [v2:[2a00:da8:ffef:1433::]:6804/887631330,v1:[2a00:da8:ffef:1433::]:6805/887631330,v2:
> > 0.0.0.0:6806/887631330,v1:0.0.0.0:6807/887631330] exists,up
> > 21f8d0d5-6a3f-4f78-96c8-8ec4e4f78a01
>
> Hi Kamil,
>
> The issue is bogus 0.0.0.0 addresses.  This came up before, see [1] and
> later messages from Stefan in the thread.  You would need to ensure that
> ms_bind_ipv4 is set to false and restart OSDs.
>
> [1]
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/message/Q6VYRJBPHQI63OQTBJG2N3BJD2KBEZM4/
>
> Thanks,
>
> Ilya
>
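
For the archives, disabling IPv4 as suggested should boil down to roughly
the following on an IPv6-only cluster (a sketch; the OSD daemon names are
placeholders and depend on the deployment):

ceph config set global ms_bind_ipv4 false   # stop binding/advertising 0.0.0.0 IPv4 addresses
ceph config set global ms_bind_ipv6 true
# restart the OSDs so they re-register clean address vectors, e.g. with cephadm:
ceph orch daemon restart osd.0
ceph orch daemon restart osd.1
ceph orch daemon restart osd.2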


-- 
Kamil Madac <https://kmadac.github.io/>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] rbd map: corrupt full osdmap (-22) when

2023-05-03 Thread Kamil Madac
Hi,

We deployed a Pacific cluster (16.2.12) with cephadm. We experience the
following error during rbd map:

[Wed May  3 08:59:11 2023] libceph: mon2 (1)[2a00:da8:ffef:1433::]:6789 session established
[Wed May  3 08:59:11 2023] libceph: another match of type 1 in addrvec
[Wed May  3 08:59:11 2023] libceph: corrupt full osdmap (-22) epoch 200 off 1042 (9876284d of 0cb24b58-80b70596)
[Wed May  3 08:59:11 2023] osdmap: : 08 07 7d 10 00 00 09 01 5d 09 00 00 a2 22 3b 86  ..}.]";.
[Wed May  3 08:59:11 2023] osdmap: 0010: e4 f5 11 ed 99 ee 47 75 ca 3c ad 23 c8 00 00 00  ..Gu.<.#
[Wed May  3 08:59:11 2023] osdmap: 0020: 21 68 4a 64 98 d2 5d 2e 84 fd 50 64 d9 3a 48 26  !hJd..]...Pd.:H&
[Wed May  3 08:59:11 2023] osdmap: 0030: 02 00 00 00 01 00 00 00 00 00 00 00 1d 05 71 01  ..q.


The Linux kernel is 6.1.13, and the important detail is that we are using
IPv6 addresses for the connection to the Ceph nodes.
We were able to map the RBD image from a client with kernel 5.10, but in
the prod environment we are not allowed to use that kernel.

What could be the reason for such behavior on newer kernels and how to
troubleshoot it?

Here is output of ceph osd dump:

# ceph osd dump
epoch 200
fsid a2223b86-e4f5-11ed-99ee-4775ca3cad23
created 2023-04-27T12:18:41.777900+
modified 2023-05-02T12:09:40.642267+
flags sortbitwise,recovery_deletes,purged_snapdirs,pglog_hardlimit
crush_version 34
full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.85
require_min_compat_client luminous
min_compat_client jewel
require_osd_release pacific
stretch_mode_enabled false
pool 1 'device_health_metrics' replicated size 3 min_size 2 crush_rule 0
object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 183
flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application
mgr_devicehealth
pool 2 'idp' replicated size 3 min_size 2 crush_rule 0 object_hash rjenkins
pg_num 32 pgp_num 32 autoscale_mode on last_change 48 flags
hashpspool,selfmanaged_snaps stripe_width 0 application rbd
max_osd 3
osd.0 up   in  weight 1 up_from 176 up_thru 182 down_at 172
last_clean_interval [170,171)
[v2:[2a00:da8:ffef:1431::]:6800/805023868,v1:[2a00:da8:ffef:1431::]:6801/805023868,v2:
0.0.0.0:6802/805023868,v1:0.0.0.0:6803/805023868]
[v2:[2a00:da8:ffef:1431::]:6804/805023868,v1:[2a00:da8:ffef:1431::]:6805/805023868,v2:
0.0.0.0:6806/805023868,v1:0.0.0.0:6807/805023868] exists,up
e8fd0ee2-ea63-4d02-8f36-219d36869078
osd.1 up   in  weight 1 up_from 136 up_thru 182 down_at 0
last_clean_interval [0,0)
[v2:[2a00:da8:ffef:1432::]:6800/2172723816,v1:[2a00:da8:ffef:1432::]:6801/2172723816,v2:
0.0.0.0:6802/2172723816,v1:0.0.0.0:6803/2172723816]
[v2:[2a00:da8:ffef:1432::]:6804/2172723816,v1:[2a00:da8:ffef:1432::]:6805/2172723816,v2:
0.0.0.0:6806/2172723816,v1:0.0.0.0:6807/2172723816] exists,up
0b7b5628-9273-4757-85fb-9c16e8441895
osd.2 up   in  weight 1 up_from 182 up_thru 182 down_at 178
last_clean_interval [123,177)
[v2:[2a00:da8:ffef:1433::]:6800/887631330,v1:[2a00:da8:ffef:1433::]:6801/887631330,v2:
0.0.0.0:6802/887631330,v1:0.0.0.0:6803/887631330]
[v2:[2a00:da8:ffef:1433::]:6804/887631330,v1:[2a00:da8:ffef:1433::]:6805/887631330,v2:
0.0.0.0:6806/887631330,v1:0.0.0.0:6807/887631330] exists,up
21f8d0d5-6a3f-4f78-96c8-8ec4e4f78a01
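
(Side note: the v2/v1 entries bound to 0.0.0.0 in the dump above look
suspicious to us; a quick way to check how the OSDs bind, assuming the
standard ceph CLI, is:)

ceph config get osd ms_bind_ipv4      # is IPv4 binding enabled?
ceph config get osd ms_bind_ipv6      # is IPv6 binding enabled?
ceph osd dump | grep -c '0.0.0.0'     # count address-vector entries bound to 0.0.0.0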


Thank you.
-- 
Kamil Madac
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW can't create bucket

2023-04-05 Thread Kamil Madac
Hi Boris,

The debug log showed that the problem was that the customer had
accidentally misconfigured placement_targets and default_placement in the
zonegroup configuration, which caused the access-denied errors during
bucket creation.

This is what we found in the debug logs:

s3:create_bucket user not permitted to use placement rule default-placement
s3:create_bucket rgw_create_bucket returned ret=-1 bucket=
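
For anyone hitting the same thing, the way we tracked it down was roughly
the following (a sketch; the user and zonegroup names are from our setup,
the file name is a placeholder, and the debug option is the one Boris
suggested below):

# raise RGW logging to see why create_bucket is rejected (debug_rgw = 20)
ceph config set client.rgw debug_rgw 20
# compare the user's default_placement with the placement targets the zonegroup defines
radosgw-admin user info --uid=user123 | grep default_placement
radosgw-admin zonegroup placement list --rgw-zonegroup=solargis-prod-ba
# fix placement_targets / default_placement in the zonegroup and commit the period
radosgw-admin zonegroup get > zonegroup.json     # edit placement_targets / default_placement
radosgw-admin zonegroup set < zonegroup.json
radosgw-admin period update --commit             # so both zones pick the change up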

On Fri, Mar 31, 2023 at 11:12 AM Boris Behrens  wrote:

> Sounds like all user have the problem?
>
> so what I would do in my setup now:
> - start a new rgw client with maximum logging (debug_rgw = 20) on a non
> public port
> - test against this endpoint and check logs
>
> This might give you more insight.
>
> Am Fr., 31. März 2023 um 09:36 Uhr schrieb Kamil Madac <
> kamil.ma...@gmail.com>:
>
>> We checked s3cmd --debug and endpoint is ok (Working with existing
>> buckets works ok with same s3cmd config).  From what I read, "max_buckets":
>> 0 means that there is no quota for the number of buckets. There are also
>> users who have "max_buckets": 1000, and those users have the same
>> access_denied issue when creating a bucket.
>>
>> We also tried other bucket names and it is the same issue.
>>
>> On Thu, Mar 30, 2023 at 6:28 PM Boris Behrens  wrote:
>>
>>> Hi Kamil,
>>> is this with all new buckets or only the 'test' bucket? Maybe the name is
>>> already taken?
>>> Can you check s3cmd --debug if you are connecting to the correct
>>> endpoint?
>>>
>>> Also I see that the user seems to not be allowed to create bukets
>>> ...
>>> "max_buckets": 0,
>>> ...
>>>
>>> Cheers
>>>  Boris
>>>
>>> Am Do., 30. März 2023 um 17:43 Uhr schrieb Kamil Madac <
>>> kamil.ma...@gmail.com>:
>>>
>>> > Hi Eugen
>>> >
>>> > It is version 16.2.6, we checked quotas and we can't see any applied
>>> quotas
>>> > for users. As I wrote, every user is affected. Are there any non-user
>>> or
>>> > global quotas, which can cause that no user can create a bucket?
>>> >
>>> > Here is example output of newly created user which cannot create
>>> buckets
>>> > too:
>>> >
>>> > {
>>> > "user_id": "user123",
>>> > "display_name": "user123",
>>> > "email": "",
>>> > "suspended": 0,
>>> > "max_buckets": 0,
>>> > "subusers": [],
>>> > "keys": [
>>> > {
>>> > "user": "user123",
>>> > "access_key": "ZIYY6XNSC06EU8YPL1AM",
>>> > "secret_key": "xx"
>>> > }
>>> > ],
>>> > "swift_keys": [],
>>> > "caps": [
>>> > {
>>> > "type": "buckets",
>>> > "perm": "*"
>>> > }
>>> > ],
>>> > "op_mask": "read, write, delete",
>>> > "default_placement": "",
>>> > "default_storage_class": "",
>>> > "placement_tags": [],
>>> > "bucket_quota": {
>>> > "enabled": false,
>>> > "check_on_raw": false,
>>> > "max_size": -1,
>>> > "max_size_kb": 0,
>>> > "max_objects": -1
>>> > },
>>> > "user_quota": {
>>> > "enabled": false,
>>> > "check_on_raw": false,
>>> > "max_size": -1,
>>> > "max_size_kb": 0,
>>> > "max_objects": -1
>>> > },
>>> > "temp_url_keys": [],
>>> > "type": "rgw",
>>> > "mfa_ids": []
>>> > }
>>> >
>>> > On Thu, Mar 30, 2023 at 1:25 PM Eugen Block  wrote:
>>> >
>>> > > Hi,
>>> > >
>>> > > what ceph version is this? Could you have hit some quota?
>>> > >
>>> > > Zitat von Kamil Madac :
>>> > >
>>> > > > Hi,
>>> > > >
>>> > > > One of my customers had a correctly working RGW cluster with two
>>> zones
>>> > in
>>> > > > one zonegroup and since a few days a

[ceph-users] Re: RGW can't create bucket

2023-03-31 Thread Kamil Madac
We checked with s3cmd --debug and the endpoint is OK (working with existing
buckets works fine with the same s3cmd config). From what I read, "max_buckets":
0 means that there is no quota on the number of buckets. There are also users
who have "max_buckets": 1000, and those users hit the same access-denied issue
when creating a bucket.

We also tried other bucket names and it is the same issue.
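
For completeness, bumping the limit explicitly for a test user would be a
one-liner (a sketch; the uid is a placeholder):

# give the user an explicit bucket limit, then retry bucket creation
radosgw-admin user modify --uid=user123 --max-buckets=1000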

On Thu, Mar 30, 2023 at 6:28 PM Boris Behrens  wrote:

> Hi Kamil,
> is this with all new buckets or only the 'test' bucket? Maybe the name is
> already taken?
> Can you check s3cmd --debug if you are connecting to the correct endpoint?
>
> Also I see that the user seems to not be allowed to create bukets
> ...
> "max_buckets": 0,
> ...
>
> Cheers
>  Boris
>
> Am Do., 30. März 2023 um 17:43 Uhr schrieb Kamil Madac <
> kamil.ma...@gmail.com>:
>
> > Hi Eugen
> >
> > It is version 16.2.6, we checked quotas and we can't see any applied
> quotas
> > for users. As I wrote, every user is affected. Are there any non-user or
> > global quotas, which can cause that no user can create a bucket?
> >
> > Here is example output of newly created user which cannot create buckets
> > too:
> >
> > {
> > "user_id": "user123",
> > "display_name": "user123",
> > "email": "",
> > "suspended": 0,
> > "max_buckets": 0,
> > "subusers": [],
> > "keys": [
> > {
> > "user": "user123",
> > "access_key": "ZIYY6XNSC06EU8YPL1AM",
> > "secret_key": "xx"
> > }
> > ],
> > "swift_keys": [],
> > "caps": [
> > {
> > "type": "buckets",
> > "perm": "*"
> > }
> > ],
> > "op_mask": "read, write, delete",
> > "default_placement": "",
> > "default_storage_class": "",
> > "placement_tags": [],
> > "bucket_quota": {
> > "enabled": false,
> > "check_on_raw": false,
> > "max_size": -1,
> > "max_size_kb": 0,
> > "max_objects": -1
> > },
> > "user_quota": {
> > "enabled": false,
> > "check_on_raw": false,
> > "max_size": -1,
> > "max_size_kb": 0,
> > "max_objects": -1
> > },
> > "temp_url_keys": [],
> > "type": "rgw",
> > "mfa_ids": []
> > }
> >
> > On Thu, Mar 30, 2023 at 1:25 PM Eugen Block  wrote:
> >
> > > Hi,
> > >
> > > what ceph version is this? Could you have hit some quota?
> > >
> > > Zitat von Kamil Madac :
> > >
> > > > Hi,
> > > >
> > > > One of my customers had a correctly working RGW cluster with two
> zones
> > in
> > > > one zonegroup and since a few days ago users are not able to create
> > > buckets
> > > > and are always getting Access denied. Working with existing buckets
> > works
> > > > (like listing/putting objects into existing bucket). The only
> operation
> > > > which is not working is bucket creation. We also tried to create a
> new
> > > > user, but the behavior is the same, and he is not able to create the
> > > > bucket. We tried s3cmd, python script with boto library and also
> > > Dashboard
> > > > as admin user. We are always getting Access Denied. Zones are
> in-sync.
> > > >
> > > > Has anyone experienced such behavior?
> > > >
> > > > Thanks in advance, here are some outputs:
> > > >
> > > > $ s3cmd -c .s3cfg_python_client mb s3://test
> > > > ERROR: Access to bucket 'test' was denied
> > > > ERROR: S3 error: 403 (AccessDenied)
> > > >
> > > > Zones are in-sync:
> > > >
> > > > Primary cluster:
> > > >
> > > > # radosgw-admin sync status
> > > > realm 5429b434-6d43-4a18-8f19-a5720a89c621 (solargis-prod)
> > > > zonegroup 00e4b3ff-1da8-4a86-9f52-4300c6d0f149 (solargis-prod-ba)
> > > > zone 6067eec6-a930-45c7-af7d-a7ef2785a2d7 (solargis-prod-ba-dc)
> > > > metadata sync no sync (zone is master)
> > > > data sync source: e84fd242-dbae-466c-b4d9-545990590995
> > > (solargis-prod-ba-hq)
> > > > syncing

[ceph-users] Re: RGW can't create bucket

2023-03-30 Thread Kamil Madac
Hi Eugen

It is version 16.2.6. We checked quotas and we can't see any quotas applied
to users. As I wrote, every user is affected. Are there any non-user or
global quotas which could cause that no user can create a bucket?

Here is example output for a newly created user which cannot create buckets
either:

{
    "user_id": "user123",
    "display_name": "user123",
    "email": "",
    "suspended": 0,
    "max_buckets": 0,
    "subusers": [],
    "keys": [
        {
            "user": "user123",
            "access_key": "ZIYY6XNSC06EU8YPL1AM",
            "secret_key": "xx"
        }
    ],
    "swift_keys": [],
    "caps": [
        {
            "type": "buckets",
            "perm": "*"
        }
    ],
    "op_mask": "read, write, delete",
    "default_placement": "",
    "default_storage_class": "",
    "placement_tags": [],
    "bucket_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "user_quota": {
        "enabled": false,
        "check_on_raw": false,
        "max_size": -1,
        "max_size_kb": 0,
        "max_objects": -1
    },
    "temp_url_keys": [],
    "type": "rgw",
    "mfa_ids": []
}
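
Besides the user info above, the quota-related checks we ran look roughly
like this (a sketch; the uid is a placeholder):

radosgw-admin global quota get                        # zonegroup-wide quota settings
radosgw-admin user info --uid=user123                 # per-user bucket_quota / user_quota / max_buckets
radosgw-admin user stats --uid=user123 --sync-stats   # current usage for the user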

On Thu, Mar 30, 2023 at 1:25 PM Eugen Block  wrote:

> Hi,
>
> what ceph version is this? Could you have hit some quota?
>
> Zitat von Kamil Madac :
>
> > Hi,
> >
> > One of my customers had a correctly working RGW cluster with two zones in
> > one zonegroup and since a few days ago users are not able to create
> buckets
> > and are always getting Access denied. Working with existing buckets works
> > (like listing/putting objects into existing bucket). The only operation
> > which is not working is bucket creation. We also tried to create a new
> > user, but the behavior is the same, and he is not able to create the
> > bucket. We tried s3cmd, python script with boto library and also
> Dashboard
> > as admin user. We are always getting Access Denied. Zones are in-sync.
> >
> > Has anyone experienced such behavior?
> >
> > Thanks in advance, here are some outputs:
> >
> > $ s3cmd -c .s3cfg_python_client mb s3://test
> > ERROR: Access to bucket 'test' was denied
> > ERROR: S3 error: 403 (AccessDenied)
> >
> > Zones are in-sync:
> >
> > Primary cluster:
> >
> > # radosgw-admin sync status
> > realm 5429b434-6d43-4a18-8f19-a5720a89c621 (solargis-prod)
> > zonegroup 00e4b3ff-1da8-4a86-9f52-4300c6d0f149 (solargis-prod-ba)
> > zone 6067eec6-a930-45c7-af7d-a7ef2785a2d7 (solargis-prod-ba-dc)
> > metadata sync no sync (zone is master)
> > data sync source: e84fd242-dbae-466c-b4d9-545990590995
> (solargis-prod-ba-hq)
> > syncing
> > full sync: 0/128 shards
> > incremental sync: 128/128 shards
> > data is caught up with source
> >
> >
> > Secondary cluster:
> >
> > # radosgw-admin sync status
> > realm 5429b434-6d43-4a18-8f19-a5720a89c621 (solargis-prod)
> > zonegroup 00e4b3ff-1da8-4a86-9f52-4300c6d0f149 (solargis-prod-ba)
> > zone e84fd242-dbae-466c-b4d9-545990590995 (solargis-prod-ba-hq)
> > metadata sync syncing
> > full sync: 0/64 shards
> > incremental sync: 64/64 shards
> > metadata is caught up with master
> > data sync source: 6067eec6-a930-45c7-af7d-a7ef2785a2d7
> (solargis-prod-ba-dc)
> > syncing
> > full sync: 0/128 shards
> > incremental sync: 128/128 shards
> > data is caught up with source
> >
> > --
> > Kamil Madac
> > ___
> > ceph-users mailing list -- ceph-users@ceph.io
> > To unsubscribe send an email to ceph-users-le...@ceph.io
>
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Kamil Madac <https://kmadac.github.io/>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RGW can't create bucket

2023-03-29 Thread Kamil Madac
Hi,

One of my customers had a correctly working RGW cluster with two zones in
one zonegroup, and since a few days ago users are not able to create buckets
and always get Access Denied. Working with existing buckets works
(like listing/putting objects into an existing bucket). The only operation
that is not working is bucket creation. We also tried to create a new
user, but the behavior is the same and that user cannot create a bucket
either. We tried s3cmd, a Python script with the boto library, and also the
Dashboard as the admin user. We always get Access Denied. The zones are in sync.

Has anyone experienced such behavior?

Thanks in advance, here are some outputs:

$ s3cmd -c .s3cfg_python_client mb s3://test
ERROR: Access to bucket 'test' was denied
ERROR: S3 error: 403 (AccessDenied)

Zones are in-sync:

Primary cluster:

# radosgw-admin sync status
realm 5429b434-6d43-4a18-8f19-a5720a89c621 (solargis-prod)
zonegroup 00e4b3ff-1da8-4a86-9f52-4300c6d0f149 (solargis-prod-ba)
zone 6067eec6-a930-45c7-af7d-a7ef2785a2d7 (solargis-prod-ba-dc)
metadata sync no sync (zone is master)
data sync source: e84fd242-dbae-466c-b4d9-545990590995 (solargis-prod-ba-hq)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source


Secondary cluster:

# radosgw-admin sync status
realm 5429b434-6d43-4a18-8f19-a5720a89c621 (solargis-prod)
zonegroup 00e4b3ff-1da8-4a86-9f52-4300c6d0f149 (solargis-prod-ba)
zone e84fd242-dbae-466c-b4d9-545990590995 (solargis-prod-ba-hq)
metadata sync syncing
full sync: 0/64 shards
incremental sync: 64/64 shards
metadata is caught up with master
data sync source: 6067eec6-a930-45c7-af7d-a7ef2785a2d7 (solargis-prod-ba-dc)
syncing
full sync: 0/128 shards
incremental sync: 128/128 shards
data is caught up with source

-- 
Kamil Madac
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: RGW replication and multiple endpoints

2022-11-15 Thread Kamil Madac
Hi Christian,

Thanks for the response and for sharing your experience. Those bugs look
like quite an issue for me personally and for the customer, so we will
replicate the data over LBs in front of the RGWs. I will regularly check
the status of the bugs, and once they are resolved I will do another round
of tests in our test lab.
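
For reference, pointing the zone replication endpoints at the LB instead of
the individual RGW nodes should be roughly (a sketch; the zone name and LB
address are placeholders):

# on each cluster, replace the per-node endpoint list with the LB VIP
radosgw-admin zone modify --rgw-zone=sg-ba-pri --endpoints=http://rgw-lb.example.com:8080
radosgw-admin period update --commit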

Kamil

On Mon, Nov 14, 2022 at 3:05 PM Christian Rohmann <
christian.rohm...@inovex.de> wrote:

> Hey Kamil
>
> On 14/11/2022 13:54, Kamil Madac wrote:
> > Hello,
> >
> > I'm trying to create a RGW Zonegroup with two zones, and to have data
> > replicated between the zones. Each zone is separate Ceph cluster. There
> is
> > a possibility to use list of endpoints in zone definitions (not just
> single
> > endpoint) which will be then used for the replication between zones. so I
> > tried to use it instead of using LB in front of clusters for the
> > replication .
> >
> > [...]
> >
> > When node is back again, replication continue to work.
> >
> > What is the reason to have possibility to have multiple endpoints in the
> > zone configuration when outage of one of them makes replication not
> > working?
>
> We are running a similar setup and ran into similar issues before when
> doing rolling restarts of the RGWs.
>
> 1) Mostly it's a single metadata shard never syncing up and requireing a
> complete "metadata init". But this issue will likely be address via
> https://tracker.ceph.com/issues/39657
>
> 2) But we also observed issues with one RGW being unavailable or just
> slow and as a result influencing the whole sync process. I suppose the
> HTTP client used within rgw syncer does not do a good job of tracking
> which remote RGW is healthy or a slow reading RGW could just be locking
> all the shards ...
>
> 3) But as far as "cooperating" goes there are improvements being worked
> on, see https://tracker.ceph.com/issues/41230 or
> https://github.com/ceph/ceph/pull/45958 which then makes better use of
> having multiple distinct RGW in both zones.
>
>
> Regards
>
>
> Christian
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>


-- 
Kamil Madac <https://kmadac.github.io/>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] RGW replication and multiple endpoints

2022-11-14 Thread Kamil Madac
Hello,

I'm trying to create an RGW zonegroup with two zones and to have data
replicated between the zones. Each zone is a separate Ceph cluster. There is
the possibility to use a list of endpoints in the zone definition (not just a
single endpoint), which is then used for the replication between the zones,
so I tried to use that instead of putting an LB in front of the clusters for
the replication.

Here is how I create the zones:

radosgw-admin zone create --rgw-zone=sg-ba-pri --master --rgw-zonegroup=sg-ba \
  --endpoints=http://192.168.121.157:80,http://192.168.121.5:80,http://192.168.121.93:80 \
  --access-key=1234567 --secret=098765 --default

When I configure it on both sides, replication works, but when one of the
source RGW nodes is unavailable, replication stops working with an
Input/output error:

[ceph: root@ceph2-node0 /]# radosgw-admin sync status
  realm b131aff4-2e6f-4fb2-8b61-c895bf6be9f3 (sg)
  zonegroup 9a2956bc-2ea3-4943-81c9-6350c7abd6d1 (sg-ba)
   zone baa3b15c-36ce-4a74-9ca1-afb2e21fd809 (sg-ba-sec)
2022-11-14T08:32:50.069+ 7fa201d37500  0 ERROR: failed to fetch mdlog
info
  metadata sync syncing
full sync: 0/64 shards
failed to fetch master sync status: (5) Input/output error
2022-11-14T08:32:53.140+ 7fa201d37500  0 ERROR: failed to fetch datalog
info
  data sync source: 457539c6-995c-4116-8189-50490c126903 (sg-ba-pri)
failed to retrieve sync info: (5) Input/output error

When the node is back again, replication continues to work.

What is the reason for allowing multiple endpoints in the zone
configuration when an outage of one of them breaks replication?

Thank you.

Kamil Madac
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Fwd: Active-Active MDS RAM consumption

2022-09-01 Thread Kamil Madac
Hi Ceph Community

One of my customers has an issue with their MDS cluster. The Ceph cluster is
deployed with cephadm and is on version 16.2.7. As soon as MDS is switched
from Active-Standby to Active-Active-Standby, an MDS daemon starts to
consume a lot of RAM. After some time it consumes 48 GB of RAM and the
container engine kills it. The same thing then happens on the second node,
which is killed after some time as well, and the situation repeats.

When the MDS cluster is switched back to the Active-Standby configuration,
the situation stabilizes.

mds_cache_memory_limit is set to 4294967296, which is the default value. No
health warning about high cache consumption is generated.

Is that known behavior, and can it be solved by some reconfiguration?

Can someone give us a hint on what to check, debug or tune?
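
In case it helps to narrow it down, these are the kinds of checks we can run
(a sketch; the fs and MDS daemon names are placeholders):

ceph fs status cephfs                           # which ranks are active and their memory/cache columns
ceph config get mds mds_cache_memory_limit      # confirm the 4 GiB cache limit is what the daemons see
ceph daemon mds.<name> cache status             # on the MDS host: cache usage as the MDS accounts it
ceph tell mds.<name> heap stats                 # tcmalloc heap usage vs. the cache accounting
# switching between single- and multi-active MDS is done via max_mds:
ceph fs set cephfs max_mds 2   # Active-Active (+ standby)
ceph fs set cephfs max_mds 1   # back to Active-Standby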

Thank you.

Kamil Madac
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io