[ceph-users] Re: Grafana service fails to start due to bad directory name after Quincy upgrade

2023-06-17 Thread Adiga, Anantha
Hi Eugene,

Thank you for your response, here is the update.

The upgrade to Quincy was done following the cephadm orch upgrade procedure:
ceph orch upgrade start --image quay.io/ceph/ceph:v17.2.6

The upgrade completed without errors. However, after the upgrade, creating the 
Grafana service from the Ceph dashboard deployed Grafana 6.7.4. The version is 
hardcoded in the code; should it not be 8.3.5, as listed in the Quincy 
documentation quoted below?

[Grafana service started from the Ceph dashboard]

Quincy documentation states: https://docs.ceph.com/en/latest/releases/quincy/
……documentation snippet
Monitoring and alerting:
43 new alerts have been added (totalling 68) improving observability of events 
affecting: cluster health, monitors, storage devices, PGs and CephFS.
Alerts can now be sent externally as SNMP traps via the new SNMP gateway 
service (the MIB is provided).
Improved integrated full/nearfull event notifications.
Grafana Dashboards now use grafonnet format (though they’re still available in 
JSON format).
Stack update: images for monitoring containers have been updated. Grafana 
8.3.5, Prometheus 2.33.4, Alertmanager 0.23.0 and Node Exporter 1.3.1. This 
reduced exposure to several Grafana vulnerabilities (CVE-2021-43798, 
CVE-2021-39226, CVE-2021-43798, CVE-2020-29510, CVE-2020-29511).
……….

I also notice that the versions of the rest of the monitoring stack that the 
Ceph dashboard deploys are older than what is documented: Prometheus 2.7.2, 
Alertmanager 0.16.2 and Node Exporter 0.17.0.
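
As a possible workaround, I am planning to point cephadm at the newer images and 
redeploy the services; a rough sketch, assuming the cephadm custom-image settings 
(mgr/cephadm/container_image_*) behave as documented, with image tags matching 
the versions from the release notes above:

ceph config set mgr mgr/cephadm/container_image_grafana quay.io/ceph/ceph-grafana:8.3.5
ceph config set mgr mgr/cephadm/container_image_prometheus quay.io/prometheus/prometheus:v2.33.4
ceph config set mgr mgr/cephadm/container_image_alertmanager quay.io/prometheus/alertmanager:v0.23.0
ceph config set mgr mgr/cephadm/container_image_node_exporter quay.io/prometheus/node-exporter:v1.3.1
ceph orch redeploy grafana
ceph orch ps --daemon-type grafana

(and likewise redeploy prometheus, alertmanager and node-exporter, then verify 
the running images with ceph orch ps)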

In addition, the Grafana 6.7.4 service reports a few warnings, highlighted below:

root@fl31ca104ja0201:/home/general# systemctl status 
ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana.fl31ca104ja0201.service
● ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana.fl31ca104ja0201.service - 
Ceph grafana.fl31ca104ja0201 for d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e
 Loaded: loaded 
(/etc/systemd/system/ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@.service; 
enabled; vendor preset: enabled)
 Active: active (running) since Tue 2023-06-13 03:37:58 UTC; 11h ago
   Main PID: 391896 (bash)
  Tasks: 53 (limit: 618607)
 Memory: 17.9M
 CGroup: 
/system.slice/system-ceph\x2dd0a3b6e0\x2dd2c3\x2d11ed\x2dbe05\x2da7a3a1d7a87e.slice/ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e@grafana.fl31ca104j>
 ├─391896 /bin/bash 
/var/lib/ceph/d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e/grafana.fl31ca104ja0201/unit.run
 └─391969 /usr/bin/docker run --rm --ipc=host --stop-signal=SIGTERM 
--net=host --init --name ceph-d0a3b6e0-d2c3-11ed-be05-a7a3a1d7a87e-grafana-fl>
-- Logs begin at Sun 2023-06-11 20:41:51 UTC, end at Tue 2023-06-13 15:35:12 
UTC. --
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ 
lvl=info msg="Executing migration" logger=migrator id="alter user_auth.auth_id 
to length 190"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ 
lvl=info msg="Executing migration" logger=migrator id="Add OAuth access token 
to user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ 
lvl=info msg="Executing migration" logger=migrator id="Add OAuth refresh token 
to user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ 
lvl=info msg="Executing migration" logger=migrator id="Add OAuth token type to 
user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ 
lvl=info msg="Executing migration" logger=migrator id="Add OAuth expiry to 
user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ 
lvl=info msg="Executing migration" logger=migrator id="Add index to user_id 
column in user_auth"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ 
lvl=info msg="Executing migration" logger=migrator id="create server_lock table"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ 
lvl=info msg="Executing migration" logger=migrator id="add index 
server_lock.operation_uid"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ 
lvl=info msg="Executing migration" logger=migrator id="create user auth token 
table"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ 
lvl=info msg="Executing migration" logger=migrator id="add unique index 
user_auth_token.auth_token"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ 
lvl=info msg="Executing migration" logger=migrator id="add unique index 
user_auth_token.prev_auth_token"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ 
lvl=info msg="Executing migration" logger=migrator id="create cache_data table"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ 
lvl=info msg="Executing migration" logger=migrator id="add unique index 
cache_data.cache_key"
Jun 13 03:37:59 fl31ca104ja0201 bash[391969]: t=2023-06-13T03:37:59+ 
lvl=info msg="Created default organization" logger=sqlstore
Jun 13 03:37:59 fl31ca104ja0201 

[ceph-users] Re: EC 8+3 Pool PGs stuck in remapped+incomplete

2023-06-17 Thread 胡 玮文
Hi Jayanth,

Can you post the complete output of “ceph pg  query”, so that we can 
understand the situation better?

Can you get those 3 or 4 OSDs back into the cluster? If you are sure they 
cannot rejoin, you may try “ceph osd lost ” (the docs say this may result in 
permanent data loss; I haven’t had a chance to try this myself).
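
A minimal sketch, using one of the stuck PGs from your output (<osd-id> is a 
placeholder for an OSD that shows up in blocked_by and that you have confirmed 
is gone for good):

# ceph pg 15.985 query > pg-15.985.json
# grep -A 3 blocked_by pg-15.985.json
# ceph osd lost <osd-id> --yes-i-really-mean-it

The --yes-i-really-mean-it flag is required because marking an OSD lost tells 
the cluster to give up on any data that only existed on that OSD.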

Weiwen Hu

> On Jun 18, 2023, at 00:26, Jayanth Reddy  wrote:
> 
> Hello Nino / Users,
> 
> After some initial analysis, I increased max_pg_per_osd to 480, but we're
> still out of luck. I also tried force-backfill and force-repair.
> On querying the PGs using "# ceph pg  query", the output shows blocked_by
> listing 3 to 4 OSDs which are already out of the cluster. I'm guessing these
> have something to do with the recovery.
> 
> Thanks,
> Jayanth Reddy
> 
>> On Sat, Jun 17, 2023 at 12:31 PM Jayanth Reddy 
>> wrote:
>> 
>> Thanks, Nino.
>> 
>> Would give these initial suggestions a try and let you know at the
>> earliest.
>> 
>> Regards,
>> Jayanth Reddy
>> --
>> *From:* Nino Kotur 
>> *Sent:* Saturday, June 17, 2023 12:16:09 PM
>> *To:* Jayanth Reddy 
>> *Cc:* ceph-users@ceph.io 
>> *Subject:* Re: [ceph-users] EC 8+3 Pool PGs stuck in remapped+incomplete
>> 
>> The problem is just that some of your OSDs have too many PGs, so the pool
>> cannot recover because it cannot create more PGs:
>>
>> [osd.214,osd.223,osd.548,osd.584] have slow ops.
>> too many PGs per OSD (330 > max 250)
>>
>> I'd guess the safest thing would be to permanently or temporarily add more
>> storage so that PGs per OSD drop below 250. Another option is just reducing
>> the total number of PGs, but I don't know if I would perform that action
>> before my pool was healthy!
>>
>> If only one OSD has this many PGs and all the other OSDs have fewer than
>> 100-150, then you can just reweight the problematic OSD so it rebalances
>> those "too many PGs".
>>
>> But it looks to me like you have way too many PGs overall, which also hurts
>> performance badly.
>>
>> Another option is to increase the maximum allowed PGs per OSD to, say, 350;
>> this should also let the cluster rebuild. Honestly, even though this may be
>> the easiest option, I'd never do it: performance suffers greatly with more
>> than 150 PGs per OSD.
>> 
>> 
>> kind regards,
>> Nino
>> 
>> 
>> On Sat, Jun 17, 2023 at 8:23 AM Jayanth Reddy 
>> wrote:
>> 
>> Hello Users,
>> Greetings. We've a Ceph Cluster with the version
>> *ceph version 14.2.5-382-g8881d33957
>> (8881d33957b54b101eae9c7627b351af10e87ee8) nautilus (stable)*
>> 
>> 5 PGs belonging to our RGW 8+3 EC Pool are stuck in incomplete and
>> incomplete+remapped states. Below are the PGs,
>> 
>> # ceph pg dump_stuck inactive
>> ok
>> PG_STAT STATE   UP
>> UP_PRIMARY ACTING
>> ACTING_PRIMARY
>> 15.251e  incomplete[151,464,146,503,166,41,555,542,9,565,268]
>> 151
>> [151,464,146,503,166,41,555,542,9,565,268]151
>> 15.3f3   incomplete [584,281,672,699,199,224,239,430,355,504,196]
>> 584
>> [584,281,672,699,199,224,239,430,355,504,196]584
>> 15.985  remapped+incomplete  [396,690,493,214,319,209,546,91,599,237,352]
>> 396
>> 
>> [2147483647,2147483647,2147483647,214,319,2147483647,546,91,599,2147483647,352]
>>   214
>> 15.39d3 remapped+incomplete  [404,221,223,585,38,102,533,471,568,451,195]
>> 404
>> [2147483647,2147483647,223,585,38,102,533,2147483647,231,451,2147483647]
>> 223
>> 15.d46  remapped+incomplete [297,646,212,254,110,169,500,372,623,470,678]
>> 297
>> [2147483647,548,2147483647,2147483647,110,169,500,372,2147483647,470,678]
>> 548
>> 
>> Some of the OSDs had gone down on the cluster. Below is the # ceph status
>> 
>> # ceph -s
>>  cluster:
>>id: 30d6f7ee-fa02-4ab3-8a09-9321c8002794
>>health: HEALTH_WARN
>>noscrub,nodeep-scrub flag(s) set
>>1 pools have many more objects per pg than average
>>Reduced data availability: 5 pgs inactive, 5 pgs incomplete
>>Degraded data redundancy: 44798/8718528059 objects degraded
>> (0.001%), 1 pg degraded, 1 pg undersized
>>22726 pgs not deep-scrubbed in time
>>23552 pgs not scrubbed in time
>>77 slow ops, oldest one blocked for 56400 sec, daemons
>> [osd.214,osd.223,osd.548,osd.584] have slow ops.
>>too many PGs per OSD (330 > max 250)
>> 
>>  services:
>>mon: 3 daemons, quorum brc1mon2,brc1mon3,brc1mon1 (age 2y)
>>mgr: brc1mon2(active, since 8d), standbys: brc1mon1, brc1mon3
>>mds: cephfs:1 {0=brc1mds2=up:active} 1 up:standby
>>osd: 1012 osds: 698 up (since 14h), 698 in (since 2d); 3 remapped pgs
>> flags noscrub,nodeep-scrub
>>rgw: 2 daemons active (brc1rgw1, brc1rgw2)
>> 
>>  data:
>>pools:   17 pools, 23552 pgs
>>objects: 863.74M objects, 1.2 PiB
>>usage:   2.4 PiB used, 6.2 PiB / 8.6 PiB avail
>>pgs: 0.021% pgs not active
>> 44798/8718528059 

[ceph-users] Starting v17.2.5 RGW SSE with default key (likely others) no longer works

2023-06-17 Thread Jayanth Reddy
Hello Folks,

I've been experimenting with RGW encryption and found this out.
Focusing on Quincy and Reef dev: for SSE (any method) to work, the transport
has to be encrypted end to end; however, if there is a proxy terminating SSL,
[1] can be used to tell RGW about that. As per the docs, RGW can also continue
to accept SSE if rgw_crypt_require_ssl is set to false, which overrides the
requirement for encryption in transit. Below are my observations.

Until v17.2.3
(quay.io/ceph/ceph@sha256:43f6e905f3e34abe4adbc9042b9d6f6b625dee8fa8d93c2bae53fa9b61c3df1a),
with the same key set as in [2], the object came out unreadable (encrypted)
when copied directly from RADOS using
# rados -p default.rgw.buckets.data get
03c2ef32-b7c8-4e18-8e0c-ebac10a42f10.17254.1_file.plain file.enc
even though the original object is plain text. This was of course with
rgw_crypt_require_ssl set to false, or with [1].

However, from v17.2.4 onwards, and even in my recent testing with reef-dev
(18.0.0-4353-g1e3835ab, 1e3835abb2d19ce6ac4149c260ef804f1041d751), when I fetch
the same object to disk with the rados command, the object (which contains
plain text) is still readable.
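
A quick way to reproduce the check (the object name is the one from my test
above; looking at the first bytes is enough to tell ciphertext from plain text):

# rados -p default.rgw.buckets.data get 03c2ef32-b7c8-4e18-8e0c-ebac10a42f10.17254.1_file.plain file.enc
# hexdump -C file.enc | head

Random-looking bytes mean the object is stored encrypted; readable text means
it is not.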

Has something changed since v17.2.4? I'll also test with Pacific and let
you know. Not sure if it affects other SSE mechanisms as well.

[1]
https://docs.ceph.com/en/quincy/radosgw/config-ref/#confval-rgw_trust_forwarded_https
[2]
https://docs.ceph.com/en/quincy/radosgw/encryption/#automatic-encryption-for-testing-only

Thanks,
Jayanth Reddy
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: EC 8+3 Pool PGs stuck in remapped+incomplete

2023-06-17 Thread Jayanth Reddy
Hello Nino / Users,

After some initial analysis, I increased max_pg_per_osd to 480, but we're
still out of luck. I also tried force-backfill and force-repair.
On querying the PGs using "# ceph pg  query", the output shows blocked_by
listing 3 to 4 OSDs which are already out of the cluster. I'm guessing these
have something to do with the recovery.
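
For the record, the commands were roughly as follows (I'm assuming
mon_max_pg_per_osd is the relevant option in the central config, and using
15.985, one of the stuck PGs, as the example):

# ceph config set global mon_max_pg_per_osd 480
# ceph pg force-backfill 15.985
# ceph pg repair 15.985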

Thanks,
Jayanth Reddy

On Sat, Jun 17, 2023 at 12:31 PM Jayanth Reddy 
wrote:

> Thanks, Nino.
>
> Would give these initial suggestions a try and let you know at the
> earliest.
>
> Regards,
> Jayanth Reddy
> --
> *From:* Nino Kotur 
> *Sent:* Saturday, June 17, 2023 12:16:09 PM
> *To:* Jayanth Reddy 
> *Cc:* ceph-users@ceph.io 
> *Subject:* Re: [ceph-users] EC 8+3 Pool PGs stuck in remapped+incomplete
>
> The problem is just that some of your OSDs have too many PGs, so the pool
> cannot recover because it cannot create more PGs:
>
> [osd.214,osd.223,osd.548,osd.584] have slow ops.
> too many PGs per OSD (330 > max 250)
>
> I'd guess the safest thing would be to permanently or temporarily add more
> storage so that PGs per OSD drop below 250. Another option is just reducing
> the total number of PGs, but I don't know if I would perform that action
> before my pool was healthy!
>
> If only one OSD has this many PGs and all the other OSDs have fewer than
> 100-150, then you can just reweight the problematic OSD so it rebalances
> those "too many PGs".
>
> But it looks to me like you have way too many PGs overall, which also hurts
> performance badly.
>
> Another option is to increase the maximum allowed PGs per OSD to, say, 350;
> this should also let the cluster rebuild. Honestly, even though this may be
> the easiest option, I'd never do it: performance suffers greatly with more
> than 150 PGs per OSD.
>
>
> kind regards,
> Nino
>
>
> On Sat, Jun 17, 2023 at 8:23 AM Jayanth Reddy 
> wrote:
>
> Hello Users,
> Greetings. We've a Ceph Cluster with the version
> *ceph version 14.2.5-382-g8881d33957
> (8881d33957b54b101eae9c7627b351af10e87ee8) nautilus (stable)*
>
> 5 PGs belonging to our RGW 8+3 EC Pool are stuck in incomplete and
> incomplete+remapped states. Below are the PGs,
>
> # ceph pg dump_stuck inactive
> ok
> PG_STAT STATE   UP
>  UP_PRIMARY ACTING
>  ACTING_PRIMARY
> 15.251e  incomplete[151,464,146,503,166,41,555,542,9,565,268]
>  151
>  [151,464,146,503,166,41,555,542,9,565,268]151
> 15.3f3   incomplete [584,281,672,699,199,224,239,430,355,504,196]
>  584
> [584,281,672,699,199,224,239,430,355,504,196]584
> 15.985  remapped+incomplete  [396,690,493,214,319,209,546,91,599,237,352]
>  396
>
> [2147483647,2147483647,2147483647,214,319,2147483647,546,91,599,2147483647,352]
>214
> 15.39d3 remapped+incomplete  [404,221,223,585,38,102,533,471,568,451,195]
>  404
>  [2147483647,2147483647,223,585,38,102,533,2147483647,231,451,2147483647]
>  223
> 15.d46  remapped+incomplete [297,646,212,254,110,169,500,372,623,470,678]
>  297
> [2147483647,548,2147483647,2147483647,110,169,500,372,2147483647,470,678]
>  548
>
> Some of the OSDs had gone down on the cluster. Below is the # ceph status
>
> # ceph -s
>   cluster:
> id: 30d6f7ee-fa02-4ab3-8a09-9321c8002794
> health: HEALTH_WARN
> noscrub,nodeep-scrub flag(s) set
> 1 pools have many more objects per pg than average
> Reduced data availability: 5 pgs inactive, 5 pgs incomplete
> Degraded data redundancy: 44798/8718528059 objects degraded
> (0.001%), 1 pg degraded, 1 pg undersized
> 22726 pgs not deep-scrubbed in time
> 23552 pgs not scrubbed in time
> 77 slow ops, oldest one blocked for 56400 sec, daemons
> [osd.214,osd.223,osd.548,osd.584] have slow ops.
> too many PGs per OSD (330 > max 250)
>
>   services:
> mon: 3 daemons, quorum brc1mon2,brc1mon3,brc1mon1 (age 2y)
> mgr: brc1mon2(active, since 8d), standbys: brc1mon1, brc1mon3
> mds: cephfs:1 {0=brc1mds2=up:active} 1 up:standby
> osd: 1012 osds: 698 up (since 14h), 698 in (since 2d); 3 remapped pgs
>  flags noscrub,nodeep-scrub
> rgw: 2 daemons active (brc1rgw1, brc1rgw2)
>
>   data:
> pools:   17 pools, 23552 pgs
> objects: 863.74M objects, 1.2 PiB
> usage:   2.4 PiB used, 6.2 PiB / 8.6 PiB avail
> pgs: 0.021% pgs not active
>  44798/8718528059 objects degraded (0.001%)
>  23546 active+clean
>  3 remapped+incomplete
>  2 incomplete
>  1 active+undersized+degraded
>
>   io:
> client:   24 MiB/s rd, 3.2 KiB/s wr, 56 op/s rd, 4 op/s wr
>
> And the health detail shows as
>
> # ceph health detail
> HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 1 pools have many more
> objects per pg than average; Reduced data availability: 5 pgs inactive, 5
> pgs incomplete; Degraded data 

[ceph-users] Re: EC 8+3 Pool PGs stuck in remapped+incomplete

2023-06-17 Thread Jayanth Reddy
Hello Anthony / Users,

After some initial analysis, I increased max_pg_per_osd to 480, but we're
still out of luck. I also tried force-backfill and force-repair.
On querying the PGs using "# ceph pg  query", the output shows blocked_by
listing 3 to 4 OSDs which are already out of the cluster. I'm guessing these
have something to do with the recovery.

Thanks,
Jayanth Reddy

On Sat, Jun 17, 2023 at 4:17 PM Anthony D'Atri 
wrote:

> Your cluster’s configuration is preventing CRUSH from calculating full
> placements
>
> set max_pg_per_osd = 1000, either in central config (or ceph.conf if you
> have it set there now).
>
> If you have it set in ceph.conf, you may need to serially restart the mons.
>
> ceph osd down 214
> sleep 60
> ceph osd down 223
> sleep 60
> ceph osd down 548
> sleep 60
> ceph osd down 584
>
>
>
>
>
>
> > On Jun 17, 2023, at 2:22 AM, Jayanth Reddy 
> wrote:
> >
> > Hello Users,
> > Greetings. We've a Ceph Cluster with the version
> > *ceph version 14.2.5-382-g8881d33957
> > (8881d33957b54b101eae9c7627b351af10e87ee8) nautilus (stable)*
> >
> > 5 PGs belonging to our RGW 8+3 EC Pool are stuck in incomplete and
> > incomplete+remapped states. Below are the PGs,
> >
> > # ceph pg dump_stuck inactive
> > ok
> > PG_STAT STATE   UP
> > UP_PRIMARY ACTING
> > ACTING_PRIMARY
> > 15.251e  incomplete[151,464,146,503,166,41,555,542,9,565,268]
> > 151
> > [151,464,146,503,166,41,555,542,9,565,268]151
> > 15.3f3   incomplete [584,281,672,699,199,224,239,430,355,504,196]
> > 584
> > [584,281,672,699,199,224,239,430,355,504,196]584
> > 15.985  remapped+incomplete  [396,690,493,214,319,209,546,91,599,237,352]
> > 396
> >
> [2147483647,2147483647,2147483647,214,319,2147483647,546,91,599,2147483647,352]
> >   214
> > 15.39d3 remapped+incomplete  [404,221,223,585,38,102,533,471,568,451,195]
> > 404
> > [2147483647,2147483647,223,585,38,102,533,2147483647,231,451,2147483647]
> > 223
> > 15.d46  remapped+incomplete [297,646,212,254,110,169,500,372,623,470,678]
> > 297
> > [2147483647,548,2147483647,2147483647,110,169,500,372,2147483647,470,678]
> > 548
> >
> > Some of the OSDs had gone down on the cluster. Below is the # ceph status
> >
> > # ceph -s
> >  cluster:
> >id: 30d6f7ee-fa02-4ab3-8a09-9321c8002794
> >health: HEALTH_WARN
> >noscrub,nodeep-scrub flag(s) set
> >1 pools have many more objects per pg than average
> >Reduced data availability: 5 pgs inactive, 5 pgs incomplete
> >Degraded data redundancy: 44798/8718528059 objects degraded
> > (0.001%), 1 pg degraded, 1 pg undersized
> >22726 pgs not deep-scrubbed in time
> >23552 pgs not scrubbed in time
> >77 slow ops, oldest one blocked for 56400 sec, daemons
> > [osd.214,osd.223,osd.548,osd.584] have slow ops.
> >too many PGs per OSD (330 > max 250)
> >
> >  services:
> >mon: 3 daemons, quorum brc1mon2,brc1mon3,brc1mon1 (age 2y)
> >mgr: brc1mon2(active, since 8d), standbys: brc1mon1, brc1mon3
> >mds: cephfs:1 {0=brc1mds2=up:active} 1 up:standby
> >osd: 1012 osds: 698 up (since 14h), 698 in (since 2d); 3 remapped pgs
> > flags noscrub,nodeep-scrub
> >rgw: 2 daemons active (brc1rgw1, brc1rgw2)
> >
> >  data:
> >pools:   17 pools, 23552 pgs
> >objects: 863.74M objects, 1.2 PiB
> >usage:   2.4 PiB used, 6.2 PiB / 8.6 PiB avail
> >pgs: 0.021% pgs not active
> > 44798/8718528059 objects degraded (0.001%)
> > 23546 active+clean
> > 3 remapped+incomplete
> > 2 incomplete
> > 1 active+undersized+degraded
> >
> >  io:
> >client:   24 MiB/s rd, 3.2 KiB/s wr, 56 op/s rd, 4 op/s wr
> >
> > And the health detail shows as
> >
> > # ceph health detail
> > HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 1 pools have many more
> > objects per pg than average; Reduced data availability: 5 pgs inactive, 5
> > pgs incomplete; Degraded data redundancy: 44798/8718528081 objects
> degraded
> > (0.001%), 1 pg degraded, 1 pg undersized; 22726 pgs not deep-scrubbed in
> > time; 23552 pgs not scrubbed in time; 77 slow ops, oldest one blocked for
> > 56440 sec, daemons [osd.214,osd.223,osd.548,osd.584] have slow ops.; too
> > many PGs per OSD (330 > max 250)
> > OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
> > MANY_OBJECTS_PER_PG 1 pools have many more objects per pg than average
> >pool iscsi-images objects per pg (540004) is more than 14.7248 times
> > cluster average (36673)
> > PG_AVAILABILITY Reduced data availability: 5 pgs inactive, 5 pgs
> incomplete
> >pg 15.3f3 is incomplete, acting
> > [584,281,672,699,199,224,239,430,355,504,196] (reducing pool
> > default.rgw.buckets.data min_size from 9 may help; search ceph.com/docs
> for
> > 'incomplete')
> >pg 15.985 is remapped+incomplete, acting
> >
> 

[ceph-users] Re: [rgw multisite] Perpetual behind

2023-06-17 Thread Alexander E. Patrakov
On Sat, Jun 17, 2023 at 4:41 AM Yixin Jin  wrote:
>
> Hi ceph gurus,
>
> I am experimenting with the rgw multisite sync feature on the Quincy release 
> (17.2.5). I am using zone-level sync, not a bucket-level sync policy. During 
> my experiment, my setup somehow got into a situation that it doesn't seem to 
> be able to get out of: one zone is perpetually behind the other, although 
> there are no ongoing client requests.
>
> Here is the output of my "sync status":
>
> root@mon1-z1:~# radosgw-admin sync status
>   realm f90e4356-3aa7-46eb-a6b7-117dfa4607c4 (test-realm)
>   zonegroup a5f23c9c-0640-41f2-956f-a8523eccecb3 (zg)
>zone bbe3e2a1-bdba-4977-affb-80596a6fe2b9 (z1)
>   metadata sync no sync (zone is master)
>   data sync source: 9645a68b-012e-4889-bf24-096e7478f786 (z2)
> syncing
> full sync: 0/128 shards
> incremental sync: 128/128 shards
> data is behind on 14 shards
> behind shards: 
> [56,61,63,107,108,109,110,111,112,113,114,115,116,117]
>
>
> It stays behind forever while rgw is almost completely idle (1% of CPU).
>
> Any suggestion on how to drill deeper to see what happened?

Hello!

I have no idea what has happened, but it would help if you could confirm the
latency between the two clusters. In other words, please don't expect a sync
between, say, Germany and Singapore to catch up fast: it is limited by the
amount of data that can be synced in one request and by the hard-coded maximum
number of requests in flight.

In Reef, there are new tunables that help on high-latency links:
rgw_data_sync_spawn_window, rgw_bucket_sync_spawn_window.
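
If you move to Reef, raising them would look roughly like this (values are
illustrative; check the config reference for the defaults), and the last two
commands are a generally useful way to drill into what the lagging shards are
stuck on:

# ceph config set client.rgw rgw_data_sync_spawn_window 64
# ceph config set client.rgw rgw_bucket_sync_spawn_window 40
# radosgw-admin data sync status --source-zone=z2
# radosgw-admin sync error list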

-- 
Alexander E. Patrakov
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Removing the encryption: (essentially decrypt) encrypted RGW objects

2023-06-17 Thread Jayanth Reddy
Hello Users,
We have a big cluster (Quincy) with almost 1.7 billion RGW objects, and we've
enabled SSE as per
https://docs.ceph.com/en/quincy/radosgw/encryption/#automatic-encryption-for-testing-only
(yes, we've chosen this insecure method of storing the key).
We're now in the process of implementing RGW multisite, but are stuck due to
https://tracker.ceph.com/issues/46062 and the thread at
https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/PQW66JJ5DCRTH5XFGTRESF3XXTOSIWFF/#43RHLUVFYNSDLZPXXPZSSXEDX34KWGJX

I was wondering if there is a way to decrypt the objects in place with the
applied symmetric key. I tried removing rgw_crypt_default_encryption_key from
the mon configuration database (on a test cluster), but, as expected, the RGW
daemons then throw 500 server errors since they can no longer read the
encrypted objects.
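
For context, the key on the test cluster was set roughly like this (the value
is just a freshly generated example, not our real key; the option sits in the
central config on our clusters), and removing it is what triggers the 500s:

# ceph config set client.rgw rgw_crypt_default_encryption_key "$(openssl rand -base64 32)"
# ceph config rm client.rgw rgw_crypt_default_encryption_key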

There is a PR in progress that introduces a command option for this at
https://github.com/ceph/ceph/pull/51842, but it looks like it will take some
time to be merged.

Cheers,
Jayanth Reddy
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] header_limit in AsioFrontend class

2023-06-17 Thread Vahideh Alinouri
Dear Ceph Users,

I am writing to request a backport of the changes related to the AsioFrontend
class, specifically regarding the header_limit value.

In the Pacific release of Ceph, the header_limit value in the AsioFrontend
class was fixed at 4096. Since the Quincy release, a configurable option has
been introduced to set the header_limit value, with a default of 16384.

I would greatly appreciate it if someone from the Ceph development team could
backport this change to the older releases.

Best regards,
Vahideh Alinouri
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: EC 8+3 Pool PGs stuck in remapped+incomplete

2023-06-17 Thread Jayanth Reddy
Thanks, Nino.

I'll give these initial suggestions a try and let you know as soon as possible.

Regards,
Jayanth Reddy

From: Nino Kotur 
Sent: Saturday, June 17, 2023 12:16:09 PM
To: Jayanth Reddy 
Cc: ceph-users@ceph.io 
Subject: Re: [ceph-users] EC 8+3 Pool PGs stuck in remapped+incomplete

The problem is just that some of your OSDs have too many PGs, so the pool cannot 
recover because it cannot create more PGs:

[osd.214,osd.223,osd.548,osd.584] have slow ops.
too many PGs per OSD (330 > max 250)

I'd guess the safest thing would be to permanently or temporarily add more storage 
so that PGs per OSD drop below 250. Another option is just reducing the total number 
of PGs, but I don't know if I would perform that action before my pool was healthy!

If only one OSD has this many PGs and all the other OSDs have fewer than 100-150, 
then you can just reweight the problematic OSD so it rebalances those "too many PGs".

But it looks to me like you have way too many PGs overall, which also hurts 
performance badly.

Another option is to increase the maximum allowed PGs per OSD to, say, 350; this 
should also let the cluster rebuild. Honestly, even though this may be the easiest 
option, I'd never do it: performance suffers greatly with more than 150 PGs per OSD.


kind regards,
Nino


On Sat, Jun 17, 2023 at 8:23 AM Jayanth Reddy  wrote:
Hello Users,
Greetings. We've a Ceph Cluster with the version
*ceph version 14.2.5-382-g8881d33957
(8881d33957b54b101eae9c7627b351af10e87ee8) nautilus (stable)*

5 PGs belonging to our RGW 8+3 EC Pool are stuck in incomplete and
incomplete+remapped states. Below are the PGs,

# ceph pg dump_stuck inactive
ok
PG_STAT STATE   UP
 UP_PRIMARY ACTING
 ACTING_PRIMARY
15.251e  incomplete[151,464,146,503,166,41,555,542,9,565,268]
 151
 [151,464,146,503,166,41,555,542,9,565,268]151
15.3f3   incomplete [584,281,672,699,199,224,239,430,355,504,196]
 584
[584,281,672,699,199,224,239,430,355,504,196]584
15.985  remapped+incomplete  [396,690,493,214,319,209,546,91,599,237,352]
 396
[2147483647,2147483647,2147483647,214,319,2147483647,546,91,599,2147483647,352]
   214
15.39d3 remapped+incomplete  [404,221,223,585,38,102,533,471,568,451,195]
 404
 [2147483647,2147483647,223,585,38,102,533,2147483647,231,451,2147483647]
 223
15.d46  remapped+incomplete [297,646,212,254,110,169,500,372,623,470,678]
 297
[2147483647,548,2147483647,2147483647,110,169,500,372,2147483647,470,678]
 548

Some of the OSDs had gone down on the cluster. Below is the # ceph status

# ceph -s
  cluster:
id: 30d6f7ee-fa02-4ab3-8a09-9321c8002794
health: HEALTH_WARN
noscrub,nodeep-scrub flag(s) set
1 pools have many more objects per pg than average
Reduced data availability: 5 pgs inactive, 5 pgs incomplete
Degraded data redundancy: 44798/8718528059 objects degraded
(0.001%), 1 pg degraded, 1 pg undersized
22726 pgs not deep-scrubbed in time
23552 pgs not scrubbed in time
77 slow ops, oldest one blocked for 56400 sec, daemons
[osd.214,osd.223,osd.548,osd.584] have slow ops.
too many PGs per OSD (330 > max 250)

  services:
mon: 3 daemons, quorum brc1mon2,brc1mon3,brc1mon1 (age 2y)
mgr: brc1mon2(active, since 8d), standbys: brc1mon1, brc1mon3
mds: cephfs:1 {0=brc1mds2=up:active} 1 up:standby
osd: 1012 osds: 698 up (since 14h), 698 in (since 2d); 3 remapped pgs
 flags noscrub,nodeep-scrub
rgw: 2 daemons active (brc1rgw1, brc1rgw2)

  data:
pools:   17 pools, 23552 pgs
objects: 863.74M objects, 1.2 PiB
usage:   2.4 PiB used, 6.2 PiB / 8.6 PiB avail
pgs: 0.021% pgs not active
 44798/8718528059 objects degraded (0.001%)
 23546 active+clean
 3 remapped+incomplete
 2 incomplete
 1 active+undersized+degraded

  io:
client:   24 MiB/s rd, 3.2 KiB/s wr, 56 op/s rd, 4 op/s wr

And the health detail shows as

# ceph health detail
HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 1 pools have many more
objects per pg than average; Reduced data availability: 5 pgs inactive, 5
pgs incomplete; Degraded data redundancy: 44798/8718528081 objects degraded
(0.001%), 1 pg degraded, 1 pg undersized; 22726 pgs not deep-scrubbed in
time; 23552 pgs not scrubbed in time; 77 slow ops, oldest one blocked for
56440 sec, daemons [osd.214,osd.223,osd.548,osd.584] have slow ops.; too
many PGs per OSD (330 > max 250)
OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
MANY_OBJECTS_PER_PG 1 pools have many more objects per pg than average
pool iscsi-images objects per pg (540004) is more than 14.7248 times
cluster average (36673)
PG_AVAILABILITY Reduced data availability: 5 pgs inactive, 5 pgs incomplete
pg 15.3f3 is incomplete, 

[ceph-users] Re: EC 8+3 Pool PGs stuck in remapped+incomplete

2023-06-17 Thread Nino Kotur
The problem is just that some of your OSDs have too many PGs, so the pool
cannot recover because it cannot create more PGs:

[osd.214,osd.223,osd.548,osd.584] have slow ops.
too many PGs per OSD (330 > max 250)

I'd guess the safest thing would be to permanently or temporarily add more
storage so that PGs per OSD drop below 250. Another option is just reducing
the total number of PGs, but I don't know if I would perform that action
before my pool was healthy!

If only one OSD has this many PGs and all the other OSDs have fewer than
100-150, then you can just reweight the problematic OSD so it rebalances those
"too many PGs".

But it looks to me like you have way too many PGs overall, which also hurts
performance badly.

Another option is to increase the maximum allowed PGs per OSD to, say, 350;
this should also let the cluster rebuild. Honestly, even though this may be
the easiest option, I'd never do it: performance suffers greatly with more
than 150 PGs per OSD.
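
If you go the reweight or raise-the-limit route, the commands would look
roughly like this (option name and values are from memory, please double-check;
osd 214 is just an example taken from your health output):

# ceph config set global mon_max_pg_per_osd 350
# ceph osd reweight 214 0.9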


kind regards,
Nino


On Sat, Jun 17, 2023 at 8:23 AM Jayanth Reddy 
wrote:

> Hello Users,
> Greetings. We've a Ceph Cluster with the version
> *ceph version 14.2.5-382-g8881d33957
> (8881d33957b54b101eae9c7627b351af10e87ee8) nautilus (stable)*
>
> 5 PGs belonging to our RGW 8+3 EC Pool are stuck in incomplete and
> incomplete+remapped states. Below are the PGs,
>
> # ceph pg dump_stuck inactive
> ok
> PG_STAT STATE   UP
>  UP_PRIMARY ACTING
>  ACTING_PRIMARY
> 15.251e  incomplete[151,464,146,503,166,41,555,542,9,565,268]
>  151
>  [151,464,146,503,166,41,555,542,9,565,268]151
> 15.3f3   incomplete [584,281,672,699,199,224,239,430,355,504,196]
>  584
> [584,281,672,699,199,224,239,430,355,504,196]584
> 15.985  remapped+incomplete  [396,690,493,214,319,209,546,91,599,237,352]
>  396
>
> [2147483647,2147483647,2147483647,214,319,2147483647,546,91,599,2147483647,352]
>214
> 15.39d3 remapped+incomplete  [404,221,223,585,38,102,533,471,568,451,195]
>  404
>  [2147483647,2147483647,223,585,38,102,533,2147483647,231,451,2147483647]
>  223
> 15.d46  remapped+incomplete [297,646,212,254,110,169,500,372,623,470,678]
>  297
> [2147483647,548,2147483647,2147483647,110,169,500,372,2147483647,470,678]
>  548
>
> Some of the OSDs had gone down on the cluster. Below is the # ceph status
>
> # ceph -s
>   cluster:
> id: 30d6f7ee-fa02-4ab3-8a09-9321c8002794
> health: HEALTH_WARN
> noscrub,nodeep-scrub flag(s) set
> 1 pools have many more objects per pg than average
> Reduced data availability: 5 pgs inactive, 5 pgs incomplete
> Degraded data redundancy: 44798/8718528059 objects degraded
> (0.001%), 1 pg degraded, 1 pg undersized
> 22726 pgs not deep-scrubbed in time
> 23552 pgs not scrubbed in time
> 77 slow ops, oldest one blocked for 56400 sec, daemons
> [osd.214,osd.223,osd.548,osd.584] have slow ops.
> too many PGs per OSD (330 > max 250)
>
>   services:
> mon: 3 daemons, quorum brc1mon2,brc1mon3,brc1mon1 (age 2y)
> mgr: brc1mon2(active, since 8d), standbys: brc1mon1, brc1mon3
> mds: cephfs:1 {0=brc1mds2=up:active} 1 up:standby
> osd: 1012 osds: 698 up (since 14h), 698 in (since 2d); 3 remapped pgs
>  flags noscrub,nodeep-scrub
> rgw: 2 daemons active (brc1rgw1, brc1rgw2)
>
>   data:
> pools:   17 pools, 23552 pgs
> objects: 863.74M objects, 1.2 PiB
> usage:   2.4 PiB used, 6.2 PiB / 8.6 PiB avail
> pgs: 0.021% pgs not active
>  44798/8718528059 objects degraded (0.001%)
>  23546 active+clean
>  3 remapped+incomplete
>  2 incomplete
>  1 active+undersized+degraded
>
>   io:
> client:   24 MiB/s rd, 3.2 KiB/s wr, 56 op/s rd, 4 op/s wr
>
> And the health detail shows as
>
> # ceph health detail
> HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 1 pools have many more
> objects per pg than average; Reduced data availability: 5 pgs inactive, 5
> pgs incomplete; Degraded data redundancy: 44798/8718528081 objects degraded
> (0.001%), 1 pg degraded, 1 pg undersized; 22726 pgs not deep-scrubbed in
> time; 23552 pgs not scrubbed in time; 77 slow ops, oldest one blocked for
> 56440 sec, daemons [osd.214,osd.223,osd.548,osd.584] have slow ops.; too
> many PGs per OSD (330 > max 250)
> OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
> MANY_OBJECTS_PER_PG 1 pools have many more objects per pg than average
> pool iscsi-images objects per pg (540004) is more than 14.7248 times
> cluster average (36673)
> PG_AVAILABILITY Reduced data availability: 5 pgs inactive, 5 pgs incomplete
> pg 15.3f3 is incomplete, acting
> [584,281,672,699,199,224,239,430,355,504,196] (reducing pool
> default.rgw.buckets.data min_size from 9 may help; search ceph.com/docs
> for
> 'incomplete')
> pg 15.985 is 

[ceph-users] Re: RGW: Migrating a long-lived cluster to multi-site, fixing an EC pool mistake

2023-06-17 Thread Nino Kotur
True, good luck with that; it's kind of a tedious process that just takes too
long :(

Nino


On Sat, Jun 17, 2023 at 7:48 AM Christian Theune  wrote:

> What got lost is that I need to change the pool’s m/k parameters, which is
> only possible by creating a new pool and moving all data from the old pool.
> Changing the crush rule doesn’t allow you to do that.
>
> > On 16. Jun 2023, at 23:32, Nino Kotur  wrote:
> >
> > If you create a new crush rule for ssd/nvme/hdd and attach it to the
> existing pool, you should be able to do the migration seamlessly while
> everything is online... However, the impact on users will depend on storage
> device load and network utilization, as it will create chaos on the cluster
> network.
> >
> > Or did i get something wrong?
> >
> >
> >
> >
> > Kind regards,
> > Nino
> >
> >
> > On Wed, Jun 14, 2023 at 5:44 PM Christian Theune 
> wrote:
> > Hi,
> >
> > further note to self and for posterity … ;)
> >
> > This turned out to be a no-go as well, because you can’t silently switch
> the pools to a different storage class: the objects will be found, but the
> index still refers to the old storage class and lifecycle migrations won’t
> work.
> >
> > I’ve brainstormed for further options and it appears that the last
> resort is to use placement targets and copy the buckets explicitly - twice,
> because on Nautilus I don’t have renames available, yet. :(
> >
> > This will require temporary downtimes prohibiting users to access their
> bucket. Fortunately we only have a few very large buckets (200T+) that will
> take a while to copy. We can pre-sync them of course, so the downtime will
> only be during the second copy.
> >
> > Christian
> >
> > > On 13. Jun 2023, at 14:52, Christian Theune 
> wrote:
> > >
> > > Following up to myself and for posterity:
> > >
> > > I’m going to try to perform a switch here using (temporary) storage
> classes and renaming of the pools to ensure that I can quickly change the
> STANDARD class to a better EC pool and have new objects located there.
> After that we’ll add (temporary) lifecycle rules to all buckets to ensure
> their objects will be migrated to the STANDARD class.
> > >
> > > Once that is finished we should be able to delete the old pool and the
> temporary storage class.
> > >
> > > First tests appear successful, but I’m struggling a bit to get the
> bucket rules working (apparently 0 days isn’t a real rule …), and the debug
> interval setting causes highly frequent LC runs but doesn’t seem to move
> objects just yet. I’ll play around with that setting a bit more, though; I
> think I might have tripped something that only wants to process objects every
> so often, and with an interval of 10, a day is still 2.4 hours …
> > >
> > > Cheers,
> > > Christian
> > >
> > >> On 9. Jun 2023, at 11:16, Christian Theune 
> wrote:
> > >>
> > >> Hi,
> > >>
> > >> we are running a cluster that has been alive for a long time and we
> tread carefully regarding updates. We are still a bit lagging and our
> cluster (that started around Firefly) is currently at Nautilus. We’re
> updating and we know we’re still behind, but we do keep running into
> challenges along the way that typically are still unfixed on main and - as
> I started with - have to tread carefully.
> > >>
> > >> Nevertheless, mistakes happen, and we found ourselves in this
> situation: we converted our RGW data pool from replicated (n=3) to erasure
> coded (k=10, m=3, with 17 hosts) but when doing the EC profile selection we
> missed that our hosts are not evenly balanced (this is a growing cluster
> and some machines have around 20TiB capacity for the RGW data pool, whereas
> newer machines have around 160TiB), and we rather should have gone with k=4,
> m=3.  In any case, having 13 chunks causes too many hosts to participate in
> each object. Going for k+m=7 will allow distribution to be more effective
> as we have 7 hosts that have the 160TiB sizing.
> > >>
> > >> Our original migration used the “cache tiering” approach, but that
> only works once when moving from replicated to EC and can not be used for
> further migrations.
> > >>
> > >> The amount of data, at 215TiB, is somewhat significant, so we need an
> approach that scales when copying data[1] to avoid ending up with months of
> migration.
> > >>
> > >> I’ve run out of ideas doing this on a low-level (i.e. trying to fix
> it on a rados/pool level) and I guess we can only fix this on an
> application level using multi-zone replication.
> > >>
> > >> I have the setup nailed in general, but I’m running into issues with
> buckets in our staging and production environment that have
> `explicit_placement` pools attached. AFAICT this is an outdated mechanism,
> but there are no migration tools around. I’ve seen some people talk about
> patched versions of the `radosgw-admin metadata put` variant that (still)
> prohibits removing explicit placements.
> > >>
> > >> AFAICT those explicit placements will be synced to the secondary zone
> and the effect that I’m seeing 

[ceph-users] EC 8+3 Pool PGs stuck in remapped+incomplete

2023-06-17 Thread Jayanth Reddy
Hello Users,
Greetings. We've a Ceph Cluster with the version
*ceph version 14.2.5-382-g8881d33957
(8881d33957b54b101eae9c7627b351af10e87ee8) nautilus (stable)*

5 PGs belonging to our RGW 8+3 EC Pool are stuck in incomplete and
incomplete+remapped states. Below are the PGs,

# ceph pg dump_stuck inactive
ok
PG_STAT STATE   UP
 UP_PRIMARY ACTING
 ACTING_PRIMARY
15.251e  incomplete[151,464,146,503,166,41,555,542,9,565,268]
 151
 [151,464,146,503,166,41,555,542,9,565,268]151
15.3f3   incomplete [584,281,672,699,199,224,239,430,355,504,196]
 584
[584,281,672,699,199,224,239,430,355,504,196]584
15.985  remapped+incomplete  [396,690,493,214,319,209,546,91,599,237,352]
 396
[2147483647,2147483647,2147483647,214,319,2147483647,546,91,599,2147483647,352]
   214
15.39d3 remapped+incomplete  [404,221,223,585,38,102,533,471,568,451,195]
 404
 [2147483647,2147483647,223,585,38,102,533,2147483647,231,451,2147483647]
 223
15.d46  remapped+incomplete [297,646,212,254,110,169,500,372,623,470,678]
 297
[2147483647,548,2147483647,2147483647,110,169,500,372,2147483647,470,678]
 548

Some of the OSDs had gone down on the cluster. Below is the # ceph status

# ceph -s
  cluster:
id: 30d6f7ee-fa02-4ab3-8a09-9321c8002794
health: HEALTH_WARN
noscrub,nodeep-scrub flag(s) set
1 pools have many more objects per pg than average
Reduced data availability: 5 pgs inactive, 5 pgs incomplete
Degraded data redundancy: 44798/8718528059 objects degraded
(0.001%), 1 pg degraded, 1 pg undersized
22726 pgs not deep-scrubbed in time
23552 pgs not scrubbed in time
77 slow ops, oldest one blocked for 56400 sec, daemons
[osd.214,osd.223,osd.548,osd.584] have slow ops.
too many PGs per OSD (330 > max 250)

  services:
mon: 3 daemons, quorum brc1mon2,brc1mon3,brc1mon1 (age 2y)
mgr: brc1mon2(active, since 8d), standbys: brc1mon1, brc1mon3
mds: cephfs:1 {0=brc1mds2=up:active} 1 up:standby
osd: 1012 osds: 698 up (since 14h), 698 in (since 2d); 3 remapped pgs
 flags noscrub,nodeep-scrub
rgw: 2 daemons active (brc1rgw1, brc1rgw2)

  data:
pools:   17 pools, 23552 pgs
objects: 863.74M objects, 1.2 PiB
usage:   2.4 PiB used, 6.2 PiB / 8.6 PiB avail
pgs: 0.021% pgs not active
 44798/8718528059 objects degraded (0.001%)
 23546 active+clean
 3 remapped+incomplete
 2 incomplete
 1 active+undersized+degraded

  io:
client:   24 MiB/s rd, 3.2 KiB/s wr, 56 op/s rd, 4 op/s wr

And the health detail shows as

# ceph health detail
HEALTH_WARN noscrub,nodeep-scrub flag(s) set; 1 pools have many more
objects per pg than average; Reduced data availability: 5 pgs inactive, 5
pgs incomplete; Degraded data redundancy: 44798/8718528081 objects degraded
(0.001%), 1 pg degraded, 1 pg undersized; 22726 pgs not deep-scrubbed in
time; 23552 pgs not scrubbed in time; 77 slow ops, oldest one blocked for
56440 sec, daemons [osd.214,osd.223,osd.548,osd.584] have slow ops.; too
many PGs per OSD (330 > max 250)
OSDMAP_FLAGS noscrub,nodeep-scrub flag(s) set
MANY_OBJECTS_PER_PG 1 pools have many more objects per pg than average
pool iscsi-images objects per pg (540004) is more than 14.7248 times
cluster average (36673)
PG_AVAILABILITY Reduced data availability: 5 pgs inactive, 5 pgs incomplete
pg 15.3f3 is incomplete, acting
[584,281,672,699,199,224,239,430,355,504,196] (reducing pool
default.rgw.buckets.data min_size from 9 may help; search ceph.com/docs for
'incomplete')
pg 15.985 is remapped+incomplete, acting
[2147483647,2147483647,2147483647,214,319,2147483647,546,91,599,2147483647,352]
(reducing pool default.rgw.buckets.data min_size from 9 may help; search
ceph.com/docs for 'incomplete')
pg 15.d46 is remapped+incomplete, acting
[2147483647,548,2147483647,2147483647,110,169,500,372,2147483647,470,678]
(reducing pool default.rgw.buckets.data min_size from 9 may help; search
ceph.com/docs for 'incomplete')
pg 15.251e is incomplete, acting
[151,464,146,503,166,41,555,542,9,565,268] (reducing pool
default.rgw.buckets.data min_size from 9 may help; search ceph.com/docs for
'incomplete')
pg 15.39d3 is remapped+incomplete, acting
[2147483647,2147483647,223,585,38,102,533,2147483647,231,451,2147483647]
(reducing pool default.rgw.buckets.data min_size from 9 may help; search
ceph.com/docs for 'incomplete')
PG_DEGRADED Degraded data redundancy: 44798/8718528081 objects degraded
(0.001%), 1 pg degraded, 1 pg undersized
pg 15.28f0 is stuck undersized for 67359238.592403, current state
active+undersized+degraded, last acting
[2147483647,343,355,415,426,640,302,392,78,202,607]
PG_NOT_DEEP_SCRUBBED 22726 pgs not deep-scrubbed in time

We've the pools as below

# ceph osd lspools
1 iscsi-images
2 cephfs_data
3 cephfs_metadata