Hi Stefan,

We run Octopus. The deep-scrub request is cancelled (immediately) if the PG/OSD
is already part of another (deep-)scrub or if some peering happens. As far as I
understand, the commands osd/pg deep-scrub and pg repair do not create
persistent reservations. If you issue such a command, when does the PG actually
start scrubbing? As soon as another one finishes, or only when its regular turn
comes around? Do you monitor the scrub order to confirm that it was the manual
command that initiated a scrub?
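
For context, the kind of check I have in mind is roughly the following (the PG
id 2.7f is just a placeholder):

    # note the current deep-scrub timestamp of the PG
    ceph pg 2.7f query | grep last_deep_scrub_stamp
    # request the deep-scrub
    ceph pg deep-scrub 2.7f
    # watch the PG state; it should show scrubbing+deep once the scrub really
    # starts, and the timestamp above should move forward afterwards
    watch -n 10 'ceph pg ls | grep ^2.7f'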

What I see is that pg repair and pg deep-scrub are almost immediately forgotten
on our cluster. This is most prominent with the repair command, which can be
really hard to get going and to complete. Only an osd deep-scrub seems to have
some effect. On the other hand, when I run the script that stops all operations
conflicting with manual reservations, the repair/deep-scrub actually starts on
request.
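
In rough outline (this is not the actual script, and it assumes that an
operator-requested repair is not itself blocked by the noscrub/nodeep-scrub
flags on this version), the idea is something like:

    # stop new scheduled scrubs so their reservations cannot conflict
    ceph osd set noscrub
    ceph osd set nodeep-scrub

    # wait until the scrubs that are already running have drained
    while ceph pg dump pgs 2>/dev/null | grep -q scrubbing; do
        sleep 60
    done

    # the manual request can now actually grab its reservation
    ceph pg repair 14.17a    # placeholder PG id

    # after the repair has gone through (check ceph health detail),
    # re-enable the scheduled scrubs
    ceph osd unset nodeep-scrub
    ceph osd unset noscrub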

Best regards,
=================
Frank Schilder
AIT Risø Campus
Bygning 109, rum S14

________________________________________
From: Stefan Kooman <ste...@bit.nl>
Sent: Wednesday, June 28, 2023 9:54 AM
To: Frank Schilder; Alexander E. Patrakov; Niklas Hambüchen
Cc: ceph-users@ceph.io
Subject: Re: [ceph-users] Re: 1 pg inconsistent and does not recover

On 6/28/23 09:41, Frank Schilder wrote:
> Hi Niklas,
>
> please don't do any of the recovery steps yet! Your problem is almost
> certainly a non-issue. I had a failed disk with 3 scrub errors, leading to
> the candidate read error messages you have:
>
> ceph status/df/pool stats/health detail at 00:00:06:
>    cluster:
>      health: HEALTH_ERR
>              3 scrub errors
>              Possible data damage: 3 pgs inconsistent
>
> After rebuilding the data, it still looked like:
>
>    cluster:
>      health: HEALTH_ERR
>              2 scrub errors
>              Possible data damage: 2 pgs inconsistent
>
> What's the issue here? The issue is that the PGs have not been deep-scrubbed
> after the rebuild. The reply "no scrub data available" from list-inconsistent
> is the clue. The response to that is not to try a manual repair but to issue
> a deep-scrub.
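>
> A minimal example of that sequence (the PG id 2.7f is just a placeholder):
>
>    ceph pg deep-scrub 2.7f
>    # after the deep-scrub has completed, there is actual scrub data to report
>    rados list-inconsistent-obj 2.7f --format=json-pretty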
>
> Unfortunately, the command "ceph pg deep-scrub ..." does not really work; the
> deep-scrub reservation almost always gets cancelled very quickly.

On what Ceph version do you have this issue? We use this command every day,
hundreds of times, and it always works.

Or is this an issue when you have a degraded cluster?

Gr. Stefan
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
