Yes, this seems consistent with what we are experiencing. We have
definitely toggled the noscrub flags in various scenarios in the recent
past. Thanks for tracking this down and fixing it.

Respectfully,

*Wes Dillingham*
w...@wesdillingham.com
LinkedIn <http://www.linkedin.com/in/wesleydillingham>


On Fri, Jul 15, 2022 at 10:16 PM David Orman <orma...@corenode.com> wrote:

> Apologies, backport link should be:
> https://github.com/ceph/ceph/pull/46845
>
> On Fri, Jul 15, 2022 at 9:14 PM David Orman <orma...@corenode.com> wrote:
>
>> I think you may have hit the same bug we encountered. Cory submitted a
>> fix; see if it matches what you've been seeing:
>>
>> https://github.com/ceph/ceph/pull/46727 (backport to Pacific here:
>> https://github.com/ceph/ceph/pull/46877 )
>> https://tracker.ceph.com/issues/54172
>>
>> On Fri, Jul 15, 2022 at 8:52 AM Wesley Dillingham <w...@wesdillingham.com>
>> wrote:
>>
>>> We have two clusters: one upgraded 14.2.22 -> 16.2.7 -> 16.2.9,
>>>
>>> the other 16.2.7 -> 16.2.9.
>>>
>>> Both use a multi-disk layout (spinner block / SSD block.db), both serve
>>> CephFS, and each has around 600 OSDs with a combination of rep-3 and 8+3 EC
>>> data pools. We see examples of stuck scrubbing PGs from all of the pools.
>>>
>>> They have generally been behind on scrubbing, which we attributed simply
>>> to large disks (10TB) under a heavy write load, with the OSDs having
>>> trouble keeping up. On closer inspection, it appears we have many PGs that
>>> have been lodged in a deep scrubbing state, on one cluster for 2 weeks and
>>> on the other for 7 weeks. Wondering if others have been experiencing
>>> anything similar. The only case of PGs stuck scrubbing I have seen in the
>>> past was related to the snaptrim PG state, but we aren't doing anything
>>> with snapshots in these new clusters.
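>>>
>>> For anyone wanting to check for the same pattern, here is a rough sketch
>>> (not a polished tool) of how one might list PGs that are currently in a
>>> deep-scrub state but whose last completed deep scrub is suspiciously old.
>>> It assumes the `ceph pg dump --format json` output exposes `pg_stats`
>>> entries with `pgid`, `state`, `last_deep_scrub_stamp` and `acting_primary`
>>> fields (possibly nested under a `pg_map` key depending on release), so
>>> adjust field names for your version:
>>>
>>> # Sketch: flag PGs reporting a deep-scrub state whose last completed deep
>>> # scrub is older than a threshold. JSON layout of `ceph pg dump` varies a
>>> # bit between releases, so treat this as an illustration only.
>>> import json
>>> import subprocess
>>> from datetime import datetime, timedelta, timezone
>>>
>>> STALE = timedelta(days=7)  # arbitrary threshold for "suspiciously long"
>>>
>>> out = subprocess.run(
>>>     ["ceph", "pg", "dump", "--format", "json"],
>>>     check=True, capture_output=True, text=True,
>>> ).stdout
>>> data = json.loads(out)
>>> # Depending on version the stats sit at the top level or under "pg_map".
>>> pg_stats = data.get("pg_map", data).get("pg_stats", [])
>>>
>>> now = datetime.now(timezone.utc)
>>> for pg in pg_stats:
>>>     if "scrubbing+deep" not in pg.get("state", ""):
>>>         continue
>>>     # Stamps look like "2022-07-01 03:12:45.123456" or ISO-8601 with a "T".
>>>     stamp = pg.get("last_deep_scrub_stamp", "")[:19].replace(" ", "T")
>>>     try:
>>>         last = datetime.strptime(stamp, "%Y-%m-%dT%H:%M:%S")
>>>     except ValueError:
>>>         continue
>>>     if now - last.replace(tzinfo=timezone.utc) > STALE:
>>>         print(f"{pg['pgid']} deep-scrubbing, last deep scrub {stamp}, "
>>>               f"acting primary osd.{pg.get('acting_primary')}")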
>>>
>>> Granted, my cluster has been warning me with "pgs not deep-scrubbed in
>>> time", and it's on me for not looking more closely into why. Perhaps a
>>> separate warning such as "PG stuck scrubbing for greater than 24 hours"
>>> might be helpful to an operator.
>>>
>>> In any case, I was able to get scrubs proceeding again by restarting the
>>> primary OSD daemon for the PGs that were stuck. Will monitor closely for
>>> additional stuck scrubs.
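>>>
>>> If it recurs, that manual kick can be scripted along these lines (just an
>>> illustrative sketch rather than exactly what we did by hand; it assumes a
>>> cephadm-managed cluster where "ceph orch daemon restart" is available, and
>>> the PG-to-primary mapping is a hypothetical placeholder that would come
>>> from a check like the one above):
>>>
>>> import subprocess
>>>
>>> # Hypothetical mapping of stuck PG -> acting primary OSD id, e.g. taken
>>> # from the pg dump check above.
>>> stuck = {"2.1a": 57, "14.3f": 212}
>>>
>>> for pgid, osd in stuck.items():
>>>     print(f"PG {pgid}: restarting acting primary osd.{osd}")
>>>     # On non-cephadm deployments, restart the ceph-osd systemd unit on the
>>>     # OSD's host instead.
>>>     subprocess.run(["ceph", "orch", "daemon", "restart", f"osd.{osd}"],
>>>                    check=True)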
>>>
>>>
>>> Respectfully,
>>>
>>> *Wes Dillingham*
>>> w...@wesdillingham.com
>>> LinkedIn <http://www.linkedin.com/in/wesleydillingham>
>>> _______________________________________________
>>> ceph-users mailing list -- ceph-users@ceph.io
>>> To unsubscribe send an email to ceph-users-le...@ceph.io
>>>
>>
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
