[ceph-users] Re: cannot repair a handful of damaged pg's

2023-10-06 Thread Simon Oosthoek

Hi Wesley,

On 06/10/2023 17:48, Wesley Dillingham wrote:
A repair is just a type of scrub and it is also limited by 
osd_max_scrubs, which in Pacific is 1.


We've increased that to 4 (and temporarily to 8) since we have so many 
OSDs and are running behind on scrubbing.
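
For reference, a rough sketch of how such a bump can be checked and applied 
(the value 4 is just what we chose, not a recommendation):

  # current value (the Pacific default is 1)
  ceph config get osd osd_max_scrubs

  # raise it for all OSDs
  ceph config set osd osd_max_scrubs 4

  # confirm the running OSDs picked it up
  ceph tell osd.* config get osd_max_scrubs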





If another scrub is occurring on any OSD in the PG, it won't start.


that explains a lot.



do "ceph osd set noscrub" and "ceph osd set nodeep-scrub" wait for all 
scrubs to stop (a few seconds probably)


Then issue the pg repair command again. It may start.


The script Kai linked seems like a good idea to fix this when needed.



You also have pgs in backfilling state. Note that by default OSDs in 
backfill or backfill_wait also won't perform scrubs.


You can modify this behavior with `ceph config set osd 
osd_scrub_during_recovery true`


We've set this already



I would suggest only setting that after the noscrub flags are set and the 
only scrub you want to get processed is your manual repair.


Then rm the scrub_during_recovery config item before unsetting the 
noscrub flags.
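
Putting the above together, a rough sketch of the whole sequence as I read 
it (using pg 26.323 as the example, and assuming the repair is the only 
scrub we want running):

  # stop new scrubs and let the running ones drain
  ceph osd set noscrub
  ceph osd set nodeep-scrub

  # allow the repair scrub even though some OSDs are backfilling
  ceph config set osd osd_scrub_during_recovery true

  # re-issue the repair for the damaged pg
  ceph pg repair 26.323

  # after the repair has completed, undo everything in reverse order
  ceph config rm osd osd_scrub_during_recovery
  ceph osd unset nodeep-scrub
  ceph osd unset noscrub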


Thanks for the suggestion!

Cheers

/Simon





Respectfully,

*Wes Dillingham*
w...@wesdillingham.com 
LinkedIn 


On Fri, Oct 6, 2023 at 11:02 AM Simon Oosthoek wrote:


On 06/10/2023 16:09, Simon Oosthoek wrote:
 > Hi
 >
 > we're still in HEALTH_ERR state with our cluster, this is the top of the
 > output of `ceph health detail`
 >
 > HEALTH_ERR 1/846829349 objects unfound (0.000%); 248 scrub errors;
 > Possible data damage: 1 pg recovery_unfound, 2 pgs inconsistent;
 > Degraded data redundancy: 6/7118781559 objects degraded (0.000%), 1 pg
 > degraded, 1 pg undersized; 63 pgs not deep-scrubbed in time; 657 pgs not
 > scrubbed in time
 > [WRN] OBJECT_UNFOUND: 1/846829349 objects unfound (0.000%)
 >      pg 26.323 has 1 unfound objects
 > [ERR] OSD_SCRUB_ERRORS: 248 scrub errors
 > [ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound, 2 pgs
 > inconsistent
 >      pg 26.323 is active+recovery_unfound+degraded+remapped, acting
 > [92,109,116,70,158,128,243,189,256], 1 unfound
 >      pg 26.337 is active+clean+inconsistent, acting
 > [139,137,48,126,165,89,237,199,189]
 >      pg 26.3e2 is active+clean+inconsistent, acting
 > [12,27,24,234,195,173,98,32,35]
 > [WRN] PG_DEGRADED: Degraded data redundancy: 6/7118781559 objects
 > degraded (0.000%), 1 pg degraded, 1 pg undersized
 >      pg 13.3a5 is stuck undersized for 4m, current state
 > active+undersized+remapped+backfilling, last acting
 > [2,45,32,62,2147483647,55,116,25,225,202,240]
 >      pg 26.323 is active+recovery_unfound+degraded+remapped, acting
 > [92,109,116,70,158,128,243,189,256], 1 unfound
 >
 >
 > For the PG_DAMAGED pgs I try the usual `ceph pg repair 26.323` etc.,
 > however it fails to get resolved.
 >
 > The osd.116 is already marked out and is beginning to get empty. I've
 > tried restarting the osd processes of the first osd listed for each PG,
 > but that doesn't get it resolved either.
 >
 > I guess we should have enough redundancy to get the correct data back,
 > but how can I tell ceph to fix it in order to get back to a healthy state?

I guess this could be related to the number of scrubs going on, I read
somewhere that this may interfere with the repair request. I would
expect the repair would have priority over scrubs...

BTW, we're running pacific for now, we want to update when the cluster
is healthy again.

Cheers

/Simon

___
ceph-users mailing list -- ceph-users@ceph.io

To unsubscribe send an email to ceph-users-le...@ceph.io



___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cannot repair a handful of damaged pg's

2023-10-06 Thread Kai Stian Olstad

On 06.10.2023 17:48, Wesley Dillingham wrote:
A repair is just a type of scrub and it is also limited by osd_max_scrubs,
which in Pacific is 1.

If another scrub is occurring on any OSD in the PG, it won't start.

Do "ceph osd set noscrub" and "ceph osd set nodeep-scrub", then wait for all
scrubs to stop (a few seconds, probably).

Then issue the pg repair command again. It may start.

You also have pgs in backfilling state. Note that by default OSDs in
backfill or backfill_wait also won't perform scrubs.

You can modify this behavior with `ceph config set osd
osd_scrub_during_recovery true`

I would suggest only setting that after the noscrub flags are set and the
only scrub you want to get processed is your manual repair.

Then rm the scrub_during_recovery config item before unsetting the noscrub
flags.


Hi Simon

Just to add to Wes's answer, CERN has made a nice script that does the 
steps Wes explained above:
https://github.com/cernceph/ceph-scripts/blob/master/tools/scrubbing/autorepair.sh
You might want to take a look at it.

--
Kai Stian Olstad
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cannot repair a handful of damaged pg's

2023-10-06 Thread Wesley Dillingham
A repair is just a type of scrub and it is also limited by osd_max_scrubs,
which in Pacific is 1.

If another scrub is occurring on any OSD in the PG, it won't start.

Do "ceph osd set noscrub" and "ceph osd set nodeep-scrub", then wait for all
scrubs to stop (a few seconds, probably).

Then issue the pg repair command again. It may start.

You also have pgs in backfilling state. Note that by default OSDs in
backfill or backfill_wait also won't perform scrubs.

You can modify this behavior with `ceph config set osd
osd_scrub_during_recovery true`

I would suggest only setting that after the noscrub flags are set and the
only scrub you want to get processed is your manual repair.

Then rm the scrub_during_recovery config item before unsetting the noscrub
flags.



Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Fri, Oct 6, 2023 at 11:02 AM Simon Oosthoek wrote:

> On 06/10/2023 16:09, Simon Oosthoek wrote:
> > Hi
> >
> > we're still in HEALTH_ERR state with our cluster, this is the top of the
> > output of `ceph health detail`
> >
> > HEALTH_ERR 1/846829349 objects unfound (0.000%); 248 scrub errors;
> > Possible data damage: 1 pg recovery_unfound, 2 pgs inconsistent;
> > Degraded data redundancy: 6/7118781559 objects degraded (0.000%), 1 pg
> > degraded, 1 pg undersized; 63 pgs not deep-scrubbed in time; 657 pgs not
> > scrubbed in time
> > [WRN] OBJECT_UNFOUND: 1/846829349 objects unfound (0.000%)
> >  pg 26.323 has 1 unfound objects
> > [ERR] OSD_SCRUB_ERRORS: 248 scrub errors
> > [ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound, 2 pgs
> > inconsistent
> >  pg 26.323 is active+recovery_unfound+degraded+remapped, acting
> > [92,109,116,70,158,128,243,189,256], 1 unfound
> >  pg 26.337 is active+clean+inconsistent, acting
> > [139,137,48,126,165,89,237,199,189]
> >  pg 26.3e2 is active+clean+inconsistent, acting
> > [12,27,24,234,195,173,98,32,35]
> > [WRN] PG_DEGRADED: Degraded data redundancy: 6/7118781559 objects
> > degraded (0.000%), 1 pg degraded, 1 pg undersized
> >  pg 13.3a5 is stuck undersized for 4m, current state
> > active+undersized+remapped+backfilling, last acting
> > [2,45,32,62,2147483647,55,116,25,225,202,240]
> >  pg 26.323 is active+recovery_unfound+degraded+remapped, acting
> > [92,109,116,70,158,128,243,189,256], 1 unfound
> >
> >
> > For the PG_DAMAGED pgs I try the usual `ceph pg repair 26.323` etc.,
> > however it fails to get resolved.
> >
> > The osd.116 is already marked out and is beginning to get empty. I've
> > tried restarting the osd processes of the first osd listed for each PG,
> > but that doesn't get it resolved either.
> >
> > I guess we should have enough redundancy to get the correct data back,
> > but how can I tell ceph to fix it in order to get back to a healthy
> state?
>
> I guess this could be related to the number of scrubs going on, I read
> somewhere that this may interfere with the repair request. I would
> expect the repair would have priority over scrubs...
>
> BTW, we're running pacific for now, we want to update when the cluster
> is healthy again.
>
> Cheers
>
> /Simon
>
> ___
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: cannot repair a handful of damaged pg's

2023-10-06 Thread Simon Oosthoek

On 06/10/2023 16:09, Simon Oosthoek wrote:

Hi

we're still in HEALTH_ERR state with our cluster, this is the top of the 
output of `ceph health detail`


HEALTH_ERR 1/846829349 objects unfound (0.000%); 248 scrub errors; 
Possible data damage: 1 pg recovery_unfound, 2 pgs inconsistent; 
Degraded data redundancy: 6/7118781559 objects degraded (0.000%), 1 pg 
degraded, 1 pg undersized; 63 pgs not deep-scrubbed in time; 657 pgs not 
scrubbed in time

[WRN] OBJECT_UNFOUND: 1/846829349 objects unfound (0.000%)
     pg 26.323 has 1 unfound objects
[ERR] OSD_SCRUB_ERRORS: 248 scrub errors
[ERR] PG_DAMAGED: Possible data damage: 1 pg recovery_unfound, 2 pgs 
inconsistent
     pg 26.323 is active+recovery_unfound+degraded+remapped, acting 
[92,109,116,70,158,128,243,189,256], 1 unfound
     pg 26.337 is active+clean+inconsistent, acting 
[139,137,48,126,165,89,237,199,189]
     pg 26.3e2 is active+clean+inconsistent, acting 
[12,27,24,234,195,173,98,32,35]
[WRN] PG_DEGRADED: Degraded data redundancy: 6/7118781559 objects 
degraded (0.000%), 1 pg degraded, 1 pg undersized
     pg 13.3a5 is stuck undersized for 4m, current state 
active+undersized+remapped+backfilling, last acting 
[2,45,32,62,2147483647,55,116,25,225,202,240]
     pg 26.323 is active+recovery_unfound+degraded+remapped, acting 
[92,109,116,70,158,128,243,189,256], 1 unfound



For the PG_DAMAGED pgs I try the usual `ceph pg repair 26.323` etc., but it 
fails to get resolved.
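
A rough way to check whether a repair ever actually ran (rather than just 
being queued) is to look at the scrub timestamps in the pg info and at what 
the last deep-scrub found, e.g.:

  # when was the pg last scrubbed / deep-scrubbed?
  ceph pg 26.323 query | grep -E 'last_(deep_)?scrub_stamp'

  # what the last deep-scrub flagged on an inconsistent pg
  # (only gives output if the pg has been deep-scrubbed recently)
  rados list-inconsistent-obj 26.337 --format=json-pretty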


The osd.116 is already marked out and is beginning to get empty. I've 
tried restarting the osd processes of the first osd listed for each PG, 
but that doesn't get it resolved either.


I guess we should have enough redundancy to get the correct data back, 
but how can I tell ceph to fix it in order to get back to a healthy state?


I guess this could be related to the number of scrubs going on, I read 
somewhere that this may interfere with the repair request. I would 
expect the repair would have priority over scrubs...


BTW, we're running pacific for now, we want to update when the cluster 
is healthy again.


Cheers

/Simon

___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io