On Mon, Oct 6, 2025, 5:34 PM Sa Pham <[email protected]> wrote:

> Hi Frédéric,
>
> I tried restarting every OSD related to that PG several times, but it didn't
> help.
>
> When the primary switched to OSD 101, we still saw slowness on OSD 101, so I
> don't think it's a hardware issue.
>
> Regards,
>
> On Mon, 6 Oct 2025 at 20:46 Frédéric Nass <[email protected]> wrote:
>
> > Could be an issue with the primary OSD which is now osd.130. Have you
> > checked osd.130 for any errors?
> > Maybe try restarting osd.130 and osd.302 one at a time and maybe 101 as
> > well, waiting for ~all PGs to become active+clean between all restarts.
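> >
> > For example, with a cephadm deployment (a plain systemd install would use
> > 'systemctl restart ceph-osd@<id>' on the host instead):
> >
> >     ceph orch daemon restart osd.130
> >     # wait for the PG to settle before the next restart
> >     watch ceph pg stat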
> >
> > Could you please share a 'ceph status' output? That would give us a better
> > view of the situation.
> >
> > Regards,
> > Frédéric.
> >
> > --
> > Frédéric Nass
> > Ceph Ambassador France | Senior Ceph Engineer @ CLYSO
> > Try our Ceph Analyzer -- https://analyzer.clyso.com/
> > https://clyso.com | [email protected]
> >
> > On Mon, 6 Oct 2025 at 14:19, Sa Pham <[email protected]> wrote:
> >
> >> Hi Frédéric,
> >>
> >> I tried the repeer and a deep scrub, but it didn't help.
> >>
> >> Have you already checked the logs for osd.302 and /var/log/messages for
> >> any I/O-related issues?
> >>
> >> => I checked; there are no I/O errors or issues.
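> >>
> >> For reference, a typical check for kernel-level I/O errors looks something
> >> like this (the journalctl unit name assumes a non-containerized deployment):
> >>
> >>     dmesg -T | grep -iE 'i/o error|blk_update_request|medium error'
> >>     journalctl -u ceph-osd@302 --since "1 hour ago" | grep -i error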
> >>
> >> Regards,
> >>
> >> On Mon, Oct 6, 2025 at 3:15 PM Frédéric Nass <[email protected]>
> >> wrote:
> >>
> >>> Hi Sa,
> >>>
> >>> Regarding the output you provided, it appears that osd.302 is listed as
> >>> UP but not ACTING for PG 18.773:
> >>>
> >>> PG_STAT  STATE                                             UP             UP_PRIMARY  ACTING     ACTING_PRIMARY
> >>> 18.773   active+undersized+degraded+remapped+backfilling   [302,150,138]  302         [130,101]  130
> >>>
> >>> Have you already checked the logs for osd.302 and /var/log/messages for
> >>> any I/O-related issues? Could you also try running 'ceph pg repeer 18.773'?
> >>>
> >>> If this is the only PG for which `osd.302` is not acting and the 'repeer'
> >>> command does not resolve the issue, I would suggest attempting a deep-scrub
> >>> on this PG. This might uncover errors that could potentially be fixed,
> >>> either online or offline.
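> >>>
> >>> Something along these lines (the repair step only if the scrub actually
> >>> reports inconsistencies):
> >>>
> >>>     ceph pg deep-scrub 18.773
> >>>     # once the scrub completes, list any inconsistent objects:
> >>>     rados list-inconsistent-obj 18.773 --format=json-pretty
> >>>     ceph pg repair 18.773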
> >>>
> >>> Regards,
> >>> Frédéric
> >>>
> >>> On Mon, 6 Oct 2025 at 06:31, Sa Pham <[email protected]> wrote:
> >>>
> >>>> Hello Eugen,
> >>>>
> >>>>
> >>>> This PG contains 254490 objects, with a total size of 68095493667 bytes
> >>>> (~63 GiB).
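> >>>>
> >>>> Per-PG object and byte counts can be read from 'ceph pg ls', e.g.:
> >>>>
> >>>>     ceph pg ls | grep '^18.773 '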
> >>>>
> >>>>
> >>>> Regards,
> >>>>
> >>>> On Fri, Oct 3, 2025 at 9:10 PM Eugen Block <[email protected]> wrote:
> >>>>
> >>>> > Is it possible that this is a huge PG? What size does it have? But it
> >>>> > could also be a faulty disk.
> >>>> >
> >>>> >
> >>>> > Quoting Sa Pham <[email protected]>:
> >>>> >
> >>>> > > *Hello everyone,*
> >>>> > >
> >>>> > > I’m running a Ceph cluster used as an RGW backend, and I’m facing an
> >>>> > > issue with one particular placement group (PG).
> >>>> > >
> >>>> > >
> >>>> > >    - Accessing objects from this PG is *extremely slow*.
> >>>> > >    - Even running ceph pg <pg_id> takes a very long time.
> >>>> > >    - The PG is currently *stuck in a degraded state*, so I’m unable to
> >>>> > >      move it to other OSDs.
> >>>> > >
> >>>> > > The current Ceph version is Reef 18.2.7.
> >>>> > >
> >>>> > > Has anyone encountered a similar issue before, or have any suggestions
> >>>> > > on how to troubleshoot and resolve it?
> >>>> > >
> >>>> > >
> >>>> > > Thanks in advance!
> >>>>
> >>>>
> >>>> --
> >>>> Sa Pham Dang
> >>>> Skype: great_bn
> >>>> Phone/Telegram: 0986.849.582
> >>>>
> >>>
> >>


Hi,

If it's a performance issue, particularly if it's manifesting as high CPU
load, you can usually pinpoint what's going on based on which symbol(s) are
hot according to `perf top -p <pid>`.
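
For example, against one of the OSDs involved (assuming a systemd deployment;
adjust the unit name to your setup):

    perf top -p "$(systemctl show -p MainPID --value ceph-osd@130.service)"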

If it's not CPU hot, `iostat` is worth a look to see if the kernel thinks
the block device is busy.
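
A minimal check, watching the OSD's backing device:

    # extended stats every second; sustained high %util or await implicates the disk
    iostat -x 1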

Barring either of those, it gets a bit trickier to tease out, but my advice
would be to first discern whether or not it's a resource issue and work
backwards from there.
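
If it doesn't look like a resource issue, the OSD admin socket (run on the
host where the OSD lives) can show where requests are stalling, e.g.:

    ceph daemon osd.130 dump_ops_in_flight
    ceph daemon osd.130 dump_historic_ops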

Cheers,
Tyler
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
