The PGs will stay active+recovery_wait+degraded until you solve the unfound
objects issue.
You can follow this doc to look at which objects are unfound [1] and, if there
is no other recourse, mark them lost.
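
For example, roughly (a sketch only, using the PG ids from your
`ceph health detail` below; "revert" falls back to a previous version of the
object where one exists, "delete" forgets it entirely, and as far as I know
revert is not supported on EC pools):

  ceph pg 1.24c list_unfound     # show the unfound objects and which OSDs were probed
  ceph pg 1.24c query            # check "might_have_unfound" before giving anything up
  ceph pg 1.24c mark_unfound_lost revert|delete
  ceph pg 1.779 mark_unfound_lost revert|delete
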
[1] http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#unfound-objects
On Thu, May 23, 2019 at 5:47 AM Kevin Flöh wrote:
> Thank you for this idea; it has improved the situation. Nevertheless,
> there are still 2 PGs in recovery_wait. ceph -s gives me:
>
>   cluster:
>     id:     23e72372-0d44-4cad-b24f-3641b14b86f4
>     health: HEALTH_WARN
>             3/125481112 objects unfound (0.000%)
>             Degraded data redundancy: 3/497011315 objects degraded
>             (0.000%), 2 pgs degraded
>
>   services:
>     mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
>     mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
>     mds: cephfs-1/1/1 up {0=ceph-node03.etp.kit.edu=up:active}, 3
>     up:standby
>     osd: 96 osds: 96 up, 96 in
>
>   data:
>     pools:   2 pools, 4096 pgs
>     objects: 125.48M objects, 259TiB
>     usage:   370TiB used, 154TiB / 524TiB avail
>     pgs:     3/497011315 objects degraded (0.000%)
>              3/125481112 objects unfound (0.000%)
>              4083 active+clean
>              10   active+clean+scrubbing+deep
>              2    active+recovery_wait+degraded
>              1    active+clean+scrubbing
>
>   io:
>     client: 318KiB/s rd, 77.0KiB/s wr, 190op/s rd, 0op/s wr
>
>
> and ceph health detail:
>
> HEALTH_WARN 3/125481112 objects unfound (0.000%); Degraded data
> redundancy: 3/497011315 objects degraded (0.000%), 2 pgs degraded
> OBJECT_UNFOUND 3/125481112 objects unfound (0.000%)
> pg 1.24c has 1 unfound objects
> pg 1.779 has 2 unfound objects
> PG_DEGRADED Degraded data redundancy: 3/497011315 objects degraded
> (0.000%), 2 pgs degraded
> pg 1.24c is active+recovery_wait+degraded, acting [32,4,61,36], 1
> unfound
> pg 1.779 is active+recovery_wait+degraded, acting [50,4,77,62], 2
> unfound
>
>
> Also, the status changed from HEALTH_ERR to HEALTH_WARN. We also did
> `ceph osd down` for all OSDs of the degraded PGs. Do you have any further
> suggestions on how to proceed?
>
> On 23.05.19 11:08 AM, Dan van der Ster wrote:
> > I think those osds (1, 11, 21, 32, ...) need a little kick to re-peer
> > their degraded PGs.
> >
> > Open a window with `watch ceph -s`, then in another window slowly do
> >
> > ceph osd down 1
> > # then wait a minute or so for that osd.1 to re-peer fully.
> > ceph osd down 11
> > ...
> >
> > Continue that for each of the osds with stuck requests, or until there
> > are no more recovery_wait/degraded PGs.
> >
> > After each `ceph osd down...`, you should expect to see several PGs
> > re-peer, and then ideally the slow requests will disappear and the
> > degraded PGs will become active+clean.
> > If anything else happens, you should stop and let us know.
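> >
> > Something like this rough loop would do the same thing for the whole
> > list of implicated osds (just a sketch; adjust the sleep so each osd
> > has fully re-peered before the next one goes down):
> >
> > for osd in 1 11 21 32 43 50 65; do
> >     ceph osd down $osd
> >     sleep 60
> > done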
> >
> >
> > -- dan
> >
> > On Thu, May 23, 2019 at 10:59 AM Kevin Flöh wrote:
> >> This is the current status of ceph:
> >>
> >>
> >> cluster:
> >> id: 23e72372-0d44-4cad-b24f-3641b14b86f4
> >> health: HEALTH_ERR
> >> 9/125481144 objects unfound (0.000%)
> >> Degraded data redundancy: 9/497011417 objects degraded
> >> (0.000%), 7 pgs degraded
> >> 9 stuck requests are blocked > 4096 sec. Implicated osds
> >> 1,11,21,32,43,50,65
> >>
> >> services:
> >> mon: 3 daemons, quorum ceph-node03,ceph-node01,ceph-node02
> >> mgr: ceph-node01(active), standbys: ceph-node01.etp.kit.edu
> >> mds: cephfs-1/1/1 up {0=ceph-node03.etp.kit.edu=up:active}, 3
> >> up:standby
> >> osd: 96 osds: 96 up, 96 in
> >>
> >> data:
> >> pools: 2 pools, 4096 pgs
> >> objects: 125.48M objects, 259TiB
> >> usage: 370TiB used, 154TiB / 524TiB avail
> >> pgs: 9/497011417 objects degraded (0.000%)
> >>       9/125481144 objects unfound (0.000%)
> >>       4078 active+clean
> >>       11   active+clean+scrubbing+deep
> >>       7    active+recovery_wait+degraded
> >>
> >> io:
> >> client: 211KiB/s rd, 46.0KiB/s wr, 158op/s rd, 0op/s wr
> >>
> >> On 23.05.19 10:54 AM, Dan van der Ster wrote:
> >>> What's the full ceph status?
> >>> Normally recovery_wait just means that the relevant OSDs are busy
> >>> recovering/backfilling another PG.
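> >>>
> >>> If a PG sits in recovery_wait for a long time, you can also check (and,
> >>> if the cluster can take the extra load, raise) the recovery throttles
> >>> on the relevant osds. Roughly, for example (values only illustrative):
> >>>
> >>> ceph daemon osd.1 config get osd_max_backfills    # run on that osd's host
> >>> ceph tell osd.* injectargs '--osd-max-backfills 2 --osd-recovery-max-active 4'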
> >>>
> >>> On Thu, May 23, 2019 at 10:53 AM Kevin Flöh wrote:
> >>>> Hi,
> >>>>
> >>>> We have set the PGs to recover and now they are stuck in
> >>>> active+recovery_wait+degraded, and instructing them to deep-scrub does
> >>>> not change anything. Hence, the rados report is empty. Is there a way
> >>>> to stop the recovery_wait so that the deep-scrub can start and we can
> >>>> get the output? I guess the recovery_wait might be caused by missing
> >>>> objects. Do we need to delete them first to get the recovery going?
> >>>>
> >>>> Kevin
> >>>>
> >>>> On