[ceph-users] Re: PGs stuck undersized and not scrubbed

2023-06-05 Thread Nicola Mori

Dear Wes,

thank you for your suggestion! I restarted OSDs 57 and 79, and recovery
resumed as well. In the logs of both I found that a kernel issue had been
raised, although the OSDs were not in an error state. That is probably
why they got stuck.

Thanks again for your help,

Nicola


___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io



2023-06-05 Thread Wesley Dillingham
When PGs are degraded they won't scrub. Further, if an OSD is involved in
the recovery of another PG, it won't accept scrubs either, so that is the
likely explanation for your not-scrubbed-in-time warnings. It's of low concern.
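The rule above can be expressed as a small model. This is a minimal Python sketch, not Ceph's actual scrub scheduler: it assumes a PG is skipped when its state contains `degraded`, or when any OSD in its acting set also serves a PG in the `recovering` state. The `PG` class and the helper names are hypothetical; the acting set for 3.57 is copied from Nicola's health detail, while 3.1 and 3.2 are made-up clean PGs.

```python
from dataclasses import dataclass


@dataclass
class PG:
    pgid: str
    state: str    # e.g. "active+recovering+undersized+degraded+remapped"
    acting: list  # OSD ids; None marks a missing shard ("NONE" in Ceph output)


def osds_in_recovery(pgs):
    """Return the OSDs serving at least one PG whose state includes 'recovering'."""
    busy = set()
    for pg in pgs:
        if "recovering" in pg.state.split("+"):
            busy.update(osd for osd in pg.acting if osd is not None)
    return busy


def scrub_blockers(pgs):
    """Map each PG that this simplified model would not scrub to the reason."""
    busy = osds_in_recovery(pgs)
    blocked = {}
    for pg in pgs:
        if "degraded" in pg.state.split("+"):
            blocked[pg.pgid] = "degraded"
        elif any(osd in busy for osd in pg.acting if osd is not None):
            blocked[pg.pgid] = "shares an OSD with a recovering PG"
    return blocked


# 3.57 is taken from the thread; 3.1 and 3.2 are hypothetical clean PGs.
example = [
    PG("3.57", "active+recovering+undersized+degraded+remapped",
       [57, 99, 37, None, 15, 104, 55, 40]),
    PG("3.1", "active+clean", [57, 1, 2, 3, 4, 5, 6, 7]),
    PG("3.2", "active+clean", [8, 9, 10, 11, 12, 13, 14, 16]),
]

if __name__ == "__main__":
    for pgid, reason in scrub_blockers(example).items():
        print(pgid, reason)
```

In this model a perfectly healthy PG (3.1) still misses its scrub window simply because it shares osd.57 with a recovering PG, which is why the not-scrubbed-in-time counts can be much larger than the number of degraded PGs.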

Are you sure that recovery is not progressing? I see "7349/147534197
objects degraded"; can you check that again (maybe wait an hour) and see
whether the 7,349 has gone down?
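That check amounts to comparing the degraded-object counter in two `ceph -s` snapshots taken some time apart. Below is a minimal Python sketch, assuming both outputs have been saved as text; the helper names and the second snapshot (6000 degraded objects) are hypothetical. It also shows where the "(0.005%)" figure comes from: 7349 / 147534197 ≈ 0.00498%.

```python
import re

# Matches e.g. "7349/147534197 objects degraded" in `ceph -s` or `ceph health detail` text.
DEGRADED_RE = re.compile(r"(\d+)/(\d+) objects degraded")


def degraded_counts(status_text):
    """Return (degraded, total) from the first match, or (0, None) if none is reported."""
    m = DEGRADED_RE.search(status_text)
    if m is None:
        return 0, None
    return int(m.group(1)), int(m.group(2))


def degraded_percent(status_text):
    """Percentage of objects degraded, as shown in parentheses by `ceph -s`."""
    degraded, total = degraded_counts(status_text)
    return 0.0 if not total else 100.0 * degraded / total


def objects_recovered(before_text, after_text):
    """How many degraded objects went away between two snapshots."""
    return degraded_counts(before_text)[0] - degraded_counts(after_text)[0]


before = "pgs: 7349/147534197 objects degraded (0.005%)"  # from Nicola's `ceph -s`
after = "pgs: 6000/147534197 objects degraded (0.004%)"   # hypothetical later snapshot

if __name__ == "__main__":
    print(f"{degraded_percent(before):.3f}% degraded")            # 0.005% degraded
    print(objects_recovered(before, after), "objects recovered")  # 1349 objects recovered
```

A positive result means recovery is moving, however slowly; a result of zero over an hour is the signal that something is actually stuck.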

Another thing I'm noticing is that OSDs 57 and 79 are the primary for many
of the degraded PGs. They could probably use a service restart.
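The observation about OSDs 57 and 79 can be checked by tallying the primary of each stuck PG in the `ceph health detail` output. Below is a minimal Python sketch, assuming that output has been saved as text and assuming the primary is the first acting-set entry that is not `NONE` (our understanding of how Ceph picks a primary when a shard is missing; the acting primary Ceph itself reports is authoritative). `primary_counts` and `STUCK_RE` are hypothetical names, and the sample lines are copied from Nicola's message.

```python
import re
from collections import Counter

# One match per "pg X is stuck undersized ... last acting [..]" entry; re.S lets the
# match span the line breaks that mail clients insert when wrapping long lines.
STUCK_RE = re.compile(r"pg (\S+) is stuck undersized.*?last acting\s*\[([^\]]*)\]", re.S)


def primary_counts(health_detail):
    """Count, per OSD id, how many stuck PGs list it as the first non-NONE acting OSD."""
    counts = Counter()
    for _pgid, acting in STUCK_RE.findall(health_detail):
        osds = [entry.strip() for entry in acting.split(",")]
        primary = next((osd for osd in osds if osd != "NONE"), None)
        if primary is not None:
            counts[int(primary)] += 1
    return counts


# A few entries from the thread, kept mail-wrapped as they were received.
SAMPLE = """\
pg 3.2c is stuck undersized for 37h, current state
active+recovery_wait+undersized+degraded+remapped, last acting
[79,83,34,37,65,NONE,18,95]
pg 3.57 is stuck undersized for 2d, current state
active+recovering+undersized+degraded+remapped, last acting
[57,99,37,NONE,15,104,55,40]
pg 3.106 is stuck undersized for 2d, current state
active+recovering+undersized+degraded+remapped, last acting
[79,15,89,NONE,36,32,23,64]
pg 3.170 is stuck undersized for 2d, current state
active+recovering+undersized+degraded+remapped, last acting
[NONE,57,14,46,55,99,15,40]
pg 3.19b is stuck undersized for 2d, current state
active+recovering+undersized+degraded+remapped, last acting
[NONE,79,59,8,32,17,7,90]
"""

if __name__ == "__main__":
    for osd, n in primary_counts(SAMPLE).most_common():
        print(f"osd.{osd}: primary for {n} stuck PG(s)")
```

On this sample, osd.79 comes out as primary for 3 PGs and osd.57 for 2; feeding in the full health detail should make the concentration on these two OSDs even clearer.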

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com


On Mon, Jun 5, 2023 at 12:01 PM Nicola Mori wrote:

> Dear Ceph users,
>
> after an outage and recovery of one machine I have several PGs stuck in
> active+recovering+undersized+degraded+remapped. Furthermore, many PGs
> have not been (deep-)scrubbed in time. See below for status and health
> details.
> It's been like this for two days, with no recovery I/O being reported,
> so I guess something is stuck in a bad state. I need some help
> understanding what's going on here and how to fix it.
> Thanks,
>
> Nicola
>
> -
>
> # ceph -s
>   cluster:
>     id:     b1029256-7bb3-11ec-a8ce-ac1f6b627b45
>     health: HEALTH_WARN
>             2 OSD(s) have spurious read errors
>             Degraded data redundancy: 7349/147534197 objects degraded (0.005%), 22 pgs degraded, 22 pgs undersized
>             332 pgs not deep-scrubbed in time
>             503 pgs not scrubbed in time
>             (muted: OSD_SLOW_PING_TIME_BACK OSD_SLOW_PING_TIME_FRONT)
>
>   services:
>     mon: 5 daemons, quorum bofur,balin,aka,romolo,dwalin (age 2d)
>     mgr: bofur.tklnrn(active, since 32h), standbys: balin.hvunfe, aka.wzystq
>     mds: 2/2 daemons up, 1 standby
>     osd: 104 osds: 104 up (since 37h), 104 in (since 37h); 22 remapped pgs
>
>   data:
>     volumes: 1/1 healthy
>     pools:   3 pools, 529 pgs
>     objects: 18.53M objects, 40 TiB
>     usage:   54 TiB used, 142 TiB / 196 TiB avail
>     pgs:     7349/147534197 objects degraded (0.005%)
>              2715/147534197 objects misplaced (0.002%)
>              507 active+clean
>              20  active+recovering+undersized+degraded+remapped
>              2   active+recovery_wait+undersized+degraded+remapped
>
> # ceph health detail
> [WRN] PG_DEGRADED: Degraded data redundancy: 7349/147534197 objects degraded (0.005%), 22 pgs degraded, 22 pgs undersized
>     pg 3.2c is stuck undersized for 37h, current state active+recovery_wait+undersized+degraded+remapped, last acting [79,83,34,37,65,NONE,18,95]
>     pg 3.57 is stuck undersized for 2d, current state active+recovering+undersized+degraded+remapped, last acting [57,99,37,NONE,15,104,55,40]
>     pg 3.76 is stuck undersized for 2d, current state active+recovering+undersized+degraded+remapped, last acting [57,5,37,15,100,33,85,NONE]
>     pg 3.9c is stuck undersized for 2d, current state active+recovering+undersized+degraded+remapped, last acting [57,86,88,NONE,11,69,20,10]
>     pg 3.106 is stuck undersized for 2d, current state active+recovering+undersized+degraded+remapped, last acting [79,15,89,NONE,36,32,23,64]
>     pg 3.107 is stuck undersized for 2d, current state active+recovering+undersized+degraded+remapped, last acting [79,NONE,64,20,61,92,104,43]
>     pg 3.10c is stuck undersized for 37h, current state active+recovery_wait+undersized+degraded+remapped, last acting [79,34,NONE,95,104,16,69,18]
>     pg 3.11e is stuck undersized for 2d, current state active+recovering+undersized+degraded+remapped, last acting [79,89,64,46,32,NONE,40,15]
>     pg 3.14e is stuck undersized for 37h, current state active+recovering+undersized+degraded+remapped, last acting [57,34,69,97,85,NONE,46,62]
>     pg 3.160 is stuck undersized for 2d, current state active+recovering+undersized+degraded+remapped, last acting [57,1,101,84,18,33,NONE,69]
>     pg 3.16a is stuck undersized for 37h, current state active+recovering+undersized+degraded+remapped, last acting [57,16,59,103,13,38,49,NONE]
>     pg 3.16e is stuck undersized for 2d, current state active+recovering+undersized+degraded+remapped, last acting [57,0,27,96,55,10,81,NONE]
>     pg 3.170 is stuck undersized for 2d, current state active+recovering+undersized+degraded+remapped, last acting [NONE,57,14,46,55,99,15,40]
>     pg 3.19b is stuck undersized for 2d, current state active+recovering+undersized+degraded+remapped, last acting [NONE,79,59,8,32,17,7,90]
>     pg 3.1a0 is stuck undersized for 2d, current state active+recovering+undersized+degraded+remapped, last acting [NONE,79,26,50,104,24,97,40]
>     pg 3.1a5 is stuck undersized for 37h, current state active+recovering+undersized+degraded+remapped, last acting [57,100,61,27,20,NONE,24,85]
>     pg 3.1a8 is stuck undersized for 2d, current state
>