In our case it was with an EC pool as well. I believe the PG state was
degraded+recovering / recovery_wait, and IIRC the PGs simply sat in the
recovering state without making any progress (the degraded object count did
not decline). A repeer of the affected PGs was attempted without success.
Restarting all of the OSDs serving the given PGs while still on mclock was
attempted; that didn't work either. Switching all OSDs in the affected PGs
to wpq did resolve the issue. This was on a 17.2.7 cluster.
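
For anyone hitting the same thing, this is roughly the sequence we went
through (a sketch from memory; the PG id 2.1a and OSD id 12 below are
placeholders for your own, and an osd_op_queue change only takes effect
after the OSD is restarted):

    # Find the acting set for the stuck PG
    ceph pg map 2.1a

    # Repeer the stuck PG (no effect in our case)
    ceph pg repeer 2.1a

    # For each OSD in the acting set: switch mclock -> wpq, then restart.
    # On cephadm deployments, "ceph orch daemon restart osd.12" instead of
    # systemctl.
    ceph config set osd.12 osd_op_queue wpq
    systemctl restart ceph-osd@12

    # Verify the scheduler change took effect
    ceph config show osd.12 osd_op_queue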

Respectfully,

*Wes Dillingham*
LinkedIn <http://www.linkedin.com/in/wesleydillingham>
w...@wesdillingham.com




On Thu, May 2, 2024 at 9:54 AM Sridhar Seshasayee <ssesh...@redhat.com>
wrote:

> >
> > Multiple people -- including me -- have also observed backfill/recovery
> > stop completely for no apparent reason.
> >
> > In some cases poking the lead OSD for a PG with `ceph osd down` restores
> > progress; in other cases it doesn't.
> >
> > Anecdotally this *may* only happen for EC pools on HDDs but that sample
> > size is small.
> >
> >
> Thanks for the information. We will try to reproduce this locally with EC
> pools and investigate further.
> I will follow up with a tracker for this.
_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io
