[ceph-users] Re: PGs stuck deep-scrubbing for weeks - 16.2.9

2022-07-18 Thread Wesley Dillingham
Yes, this seems consistent with what we are experiencing. We have
definitely toggled the noscrub flags in various scenarios in the recent
past. Thanks for tracking this down and fixing it.
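
For anyone else checking whether they may be affected: one quick way is to
look for PGs whose state still includes "scrubbing+deep" long after the
scrub started. A rough sketch follows; the column layout of `ceph pg dump
pgs` varies between releases, so treat the field positions as assumptions
and verify against your own output first.

```shell
# Sketch: filter 'ceph pg dump pgs' style output for PGs whose state
# includes deep scrubbing. Assumes column 1 is the PG id and column 2
# is the state string -- check your cluster's dump before relying on it.
filter_deep_scrubbing() {
  awk 'index($2, "scrubbing+deep") { print $1 }'
}

# Demo with canned (pgid, state) lines standing in for real output;
# on a live cluster: ceph pg dump pgs | filter_deep_scrubbing
printf '%s\n' \
  '2.1a active+clean+scrubbing+deep' \
  '2.1b active+clean' \
  '3.0f active+clean+scrubbing+deep' \
  | filter_deep_scrubbing
```

Cross-checking the PG ids this prints against their last deep-scrub
timestamps should show whether any have been wedged for days or weeks.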

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn 


On Fri, Jul 15, 2022 at 10:16 PM David Orman  wrote:

> Apologies, backport link should be:
> https://github.com/ceph/ceph/pull/46845
>
> On Fri, Jul 15, 2022 at 9:14 PM David Orman  wrote:
>
>> I think you may have hit the same bug we encountered. Cory submitted a
>> fix; see if it fits what you've encountered:
>>
>> https://github.com/ceph/ceph/pull/46727 (backport to Pacific here:
>> https://github.com/ceph/ceph/pull/46877 )
>> https://tracker.ceph.com/issues/54172
>>
>> On Fri, Jul 15, 2022 at 8:52 AM Wesley Dillingham 
>> wrote:
>>
>>> We have two clusters: one went 14.2.22 -> 16.2.7 -> 16.2.9, the other
>>> 16.2.7 -> 16.2.9.
>>>
>>> Both use multi-disk OSDs (spinner block / SSD block.db), both serve
>>> CephFS, and each has around 600 OSDs with a combination of rep-3 and
>>> 8+3 EC data pools. We see examples of stuck scrubbing PGs from all of
>>> the pools.
>>>
>>> They have generally been behind on scrubbing, which we attributed
>>> simply to large disks (10TB) with a heavy write load and the OSDs
>>> having trouble keeping up. On closer inspection, it appears we have
>>> many PGs that have been lodged in a deep-scrubbing state: for 2 weeks
>>> on one cluster and 7 weeks on the other. I'm wondering whether others
>>> have been experiencing anything similar. The only example of PGs stuck
>>> scrubbing I have seen in the past was related to the snaptrim PG
>>> state, but we aren't doing anything with snapshots in these new
>>> clusters.
>>>
>>> Granted, the clusters have been warning me with "pgs not deep-scrubbed
>>> in time", and it's on me for not looking more closely into why.
>>> Perhaps a separate warning such as "PG stuck scrubbing for greater
>>> than 24 hours" might be helpful to an operator.
>>>
>>> In any case, I was able to get scrubs proceeding again by restarting
>>> the primary OSD daemon of the PGs which were stuck. I will monitor
>>> closely for additional stuck scrubs.
>>>
>>>
>>> Respectfully,
>>>
>>> *Wes Dillingham*
>>> w...@wesdillingham.com
>>> LinkedIn 
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PGs stuck deep-scrubbing for weeks - 16.2.9

2022-07-15 Thread David Orman
Apologies, backport link should be: https://github.com/ceph/ceph/pull/46845

On Fri, Jul 15, 2022 at 9:14 PM David Orman  wrote:

> I think you may have hit the same bug we encountered. Cory submitted a
> fix; see if it fits what you've encountered:
>
> https://github.com/ceph/ceph/pull/46727 (backport to Pacific here:
> https://github.com/ceph/ceph/pull/46877 )
> https://tracker.ceph.com/issues/54172
>
> On Fri, Jul 15, 2022 at 8:52 AM Wesley Dillingham 
> wrote:
>
>> We have two clusters: one went 14.2.22 -> 16.2.7 -> 16.2.9, the other
>> 16.2.7 -> 16.2.9.
>>
>> Both use multi-disk OSDs (spinner block / SSD block.db), both serve
>> CephFS, and each has around 600 OSDs with a combination of rep-3 and
>> 8+3 EC data pools. We see examples of stuck scrubbing PGs from all of
>> the pools.
>>
>> They have generally been behind on scrubbing, which we attributed
>> simply to large disks (10TB) with a heavy write load and the OSDs
>> having trouble keeping up. On closer inspection, it appears we have
>> many PGs that have been lodged in a deep-scrubbing state: for 2 weeks
>> on one cluster and 7 weeks on the other. I'm wondering whether others
>> have been experiencing anything similar. The only example of PGs stuck
>> scrubbing I have seen in the past was related to the snaptrim PG
>> state, but we aren't doing anything with snapshots in these new
>> clusters.
>>
>> Granted, the clusters have been warning me with "pgs not deep-scrubbed
>> in time", and it's on me for not looking more closely into why.
>> Perhaps a separate warning such as "PG stuck scrubbing for greater
>> than 24 hours" might be helpful to an operator.
>>
>> In any case, I was able to get scrubs proceeding again by restarting
>> the primary OSD daemon of the PGs which were stuck. I will monitor
>> closely for additional stuck scrubs.
>>
>>
>> Respectfully,
>>
>> *Wes Dillingham*
>> w...@wesdillingham.com
>> LinkedIn 


[ceph-users] Re: PGs stuck deep-scrubbing for weeks - 16.2.9

2022-07-15 Thread David Orman
I think you may have hit the same bug we encountered. Cory submitted a
fix; see if it fits what you've encountered:

https://github.com/ceph/ceph/pull/46727 (backport to Pacific here:
https://github.com/ceph/ceph/pull/46877 )
https://tracker.ceph.com/issues/54172
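
Until the fix lands, the restart-the-primary-OSD workaround described in
the quoted message below can be scripted roughly as follows. This is only
a sketch: the `ceph pg map` output format and the service unit names are
assumptions, so adjust them for your release and deployment style.

```shell
# Sketch of the workaround: for each PG stuck in deep scrub, find its
# primary OSD and restart that daemon. Verify the 'ceph pg map' output
# format on your own cluster before using this.
primary_osd() {
  # Parse "... acting [3,1,2]" -> 3 (first OSD in the acting set).
  sed -n 's/.*acting \[\([0-9]*\).*/\1/p'
}

# Demo with canned 'ceph pg map' output:
echo 'osdmap e123 pg 2.1a (2.1a) -> up [3,1,2] acting [3,1,2]' | primary_osd

# On a live cluster, roughly:
#   for pg in <stuck pgids>; do
#     osd=$(ceph pg map "$pg" | primary_osd)
#     systemctl restart "ceph-osd@$osd"       # package-based install
#     # ceph orch daemon restart "osd.$osd"   # cephadm-managed
#   done
```

Restarting the primary aborts the wedged scrub reservation, after which
the PG should be rescheduled normally.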

On Fri, Jul 15, 2022 at 8:52 AM Wesley Dillingham 
wrote:

> We have two clusters: one went 14.2.22 -> 16.2.7 -> 16.2.9, the other
> 16.2.7 -> 16.2.9.
>
> Both use multi-disk OSDs (spinner block / SSD block.db), both serve
> CephFS, and each has around 600 OSDs with a combination of rep-3 and
> 8+3 EC data pools. We see examples of stuck scrubbing PGs from all of
> the pools.
>
> They have generally been behind on scrubbing, which we attributed
> simply to large disks (10TB) with a heavy write load and the OSDs
> having trouble keeping up. On closer inspection, it appears we have
> many PGs that have been lodged in a deep-scrubbing state: for 2 weeks
> on one cluster and 7 weeks on the other. I'm wondering whether others
> have been experiencing anything similar. The only example of PGs stuck
> scrubbing I have seen in the past was related to the snaptrim PG
> state, but we aren't doing anything with snapshots in these new
> clusters.
>
> Granted, the clusters have been warning me with "pgs not deep-scrubbed
> in time", and it's on me for not looking more closely into why.
> Perhaps a separate warning such as "PG stuck scrubbing for greater
> than 24 hours" might be helpful to an operator.
>
> In any case, I was able to get scrubs proceeding again by restarting
> the primary OSD daemon of the PGs which were stuck. I will monitor
> closely for additional stuck scrubs.
>
>
> Respectfully,
>
> *Wes Dillingham*
> w...@wesdillingham.com
> LinkedIn 