[ceph-users] Re: erasure coded pool PG stuck inconsistent on ceph Pacific 15.2.13

2021-11-19 Thread Wesley Dillingham
You may also be able to use an upmap (or the upmap balancer) to help make
room on the OSD that is too full.
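For example, something along these lines (a rough sketch only; the OSD IDs
here are hypothetical, so substitute ones from your own "ceph osd df" output):

  # let the balancer generate upmaps for you
  ceph balancer mode upmap
  ceph balancer on

  # or manually remap the PG off the full OSD, e.g. from osd.12 to osd.34
  ceph osd pg-upmap-items 6.180 12 34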

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 




[ceph-users] Re: erasure coded pool PG stuck inconsistent on ceph Pacific 15.2.13

2021-11-19 Thread Wesley Dillingham
Okay, now I see your attachment; the PG is in this state:

"state":
"active+undersized+degraded+remapped+inconsistent+backfill_toofull",

The reason it can't scrub or repair is that it's degraded, and further it
seems the cluster doesn't have the space to make that recovery happen, hence
the "backfill_toofull" state. This may clear on its own as other PGs recover
and this PG is ultimately able to recover. Other options are to remove data
or add capacity. How full is your cluster? Is it currently backfilling
actively?
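A quick way to check (generic commands, nothing specific to your cluster
assumed):

  # overall and per-pool utilization
  ceph df
  # per-OSD utilization, to spot the full or nearfull OSDs
  ceph osd df
  # whether recovery/backfill is actually making progress
  ceph status
  ceph pg stat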

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Fri, Nov 19, 2021 at 10:57 AM J-P Methot 
wrote:

> We stopped deep-scrubbing a while ago. However, forcing a deep scrub with
> "ceph pg deep-scrub 6.180" doesn't do anything; the deep scrub doesn't run
> at all. Could the deep-scrubbing process be stuck elsewhere?


[ceph-users] Re: erasure coded pool PG stuck inconsistent on ceph Pacific 15.2.13

2021-11-18 Thread Wesley Dillingham
That response is typically indicative of a PG whose OSD set has changed
since it was last scrubbed (typically from a disk failing).

Are you sure it's actually getting scrubbed when you issue the scrub? For
example, you can run "ceph pg <pgid> query" and look for
"last_deep_scrub_stamp", which will tell you when it was last deep scrubbed.

Further, in sufficiently recent versions of Ceph (introduced in
14.2.something, iirc), setting the flag "nodeep-scrub" will cause all
in-flight deep-scrubs to stop immediately. You may have a scheduling issue
where your deep-scrubs or repairs aren't getting scheduled.

Set the nodeep-scrub flag ("ceph osd set nodeep-scrub") and wait for all
current deep-scrubs to complete, then try manually re-issuing the deep scrub
("ceph pg deep-scrub <pgid>"). At that point your scrub should start almost
immediately, and "rados list-inconsistent-obj 6.180 --format=json-pretty"
should return something of value.
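Roughly this sequence (a sketch only; keep an eye on "ceph status" until the
running deep-scrubs have drained before re-issuing):

  # stop scheduling new deep-scrubs (and halt in-flight ones on recent Ceph)
  ceph osd set nodeep-scrub
  # ...wait for running deep-scrubs to finish...
  # manually deep-scrub the problem PG
  ceph pg deep-scrub 6.180
  # then check the inconsistency report again
  rados list-inconsistent-obj 6.180 --format=json-pretty
  # re-enable deep-scrubs afterwards
  ceph osd unset nodeep-scrub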

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Thu, Nov 18, 2021 at 2:38 PM J-P Methot 
wrote:

> Hi,
>
> We currently have a PG stuck in an inconsistent state on an erasure
> coded pool. The pool's K and M values are 33 and 3. The command "rados
> list-inconsistent-obj 6.180 --format=json-pretty" results in the
> following error:
>
> No scrub information available for pg 6.180 error 2: (2) No such file or
> directory
>
> Forcing a deep scrub of the PG does not fix this, and "ceph pg repair
> 6.180" doesn't seem to do anything. Is there a known bug explaining this
> behavior? I am attaching information regarding the PG in question.
>
> --
> Jean-Philippe Méthot
> Senior Openstack system administrator
> Administrateur système Openstack sénior
> PlanetHoster inc.
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io