Hi Tyler,

We do have a monitoring system visualized with Grafana. I already checked
CPU iowait, load average, and CPU usage (user + system); none of them are
high.

So I don't think the problem is resource contention.
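
For reference, this is roughly how I double-checked it on the OSD hosts
themselves (sysstat tools; the counts/intervals are just examples):

    iostat -x 1 5    # per-device utilization and latency
    mpstat 1 5       # per-CPU usage, including %iowait
    uptime           # 1/5/15-minute load averages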


Regards,

On Tue, Oct 7, 2025 at 6:53 AM Tyler Stachecki <[email protected]>
wrote:

> On Mon, Oct 6, 2025, 5:34 PM Sa Pham <[email protected]> wrote:
>
>> Hi Frédéric,
>>
>> I tried restarting every OSD related to that PG many times, but it
>> didn't help.
>>
>> When the primary switched to OSD 101, we still saw slowness on OSD 101,
>> so I don't think it's a hardware issue.
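>>
>> For reference, the up/acting sets and the current primary can be
>> confirmed with:
>>
>>     ceph pg map 18.773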
>>
>> Regards,
>>
>>
>>
>>
>>
>>
>> On Mon, 6 Oct 2025 at 20:46 Frédéric Nass <[email protected]>
>> wrote:
>>
>> > Could be an issue with the primary OSD, which is now osd.130. Have you
>> > checked osd.130 for any errors?
>> > Maybe try restarting osd.130 and osd.302 one at a time, and maybe
>> > osd.101 as well, waiting for (nearly) all PGs to become active+clean
>> > between restarts.
>> >
>> > Could you please share a 'ceph status' output, so we get a better view
>> > of the situation?
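>> >
>> > For example, something along these lines (cephadm syntax shown; use
>> > 'systemctl restart ceph-osd@<id>' on package-based installs):
>> >
>> >     ceph orch daemon restart osd.130
>> >     ceph -s    # wait for PGs to settle back to active+clean
>> >     ceph orch daemon restart osd.302
>> >     ceph -s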
>> >
>> > Regards,
>> > Frédéric.
>> >
>> > --
>> > Frédéric Nass
>> > Ceph Ambassador France | Senior Ceph Engineer @ CLYSO
>> > Try our Ceph Analyzer -- https://analyzer.clyso.com/
>> > https://clyso.com | [email protected]
>> >
>> > On Mon, Oct 6, 2025 at 2:19 PM, Sa Pham <[email protected]> wrote:
>> >
>> >> Hi Frédéric,
>> >>
>> >> I tried the repeer and a deep scrub, but neither helped.
>> >>
>> >> Have you already checked the logs for osd.302 and /var/log/messages for
>> >> any I/O-related issues?
>> >>
>> >> => I checked; there are no I/O errors or issues.
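>> >>
>> >> Specifically, checks along these lines came back clean (the unit name
>> >> assumes a package-based install; /dev/sdX is a placeholder):
>> >>
>> >>     journalctl -u ceph-osd@302 --since "1 hour ago"
>> >>     dmesg -T | grep -iE 'error|fail|reset'
>> >>     smartctl -a /dev/sdX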
>> >>
>> >> Regards,
>> >>
>> >> On Mon, Oct 6, 2025 at 3:15 PM Frédéric Nass <[email protected]>
>> >> wrote:
>> >>
>> >>> Hi Sa,
>> >>>
>> >>> Regarding the output you provided, it appears that osd.302 is listed
>> >>> as UP but not ACTING for PG 18.773:
>> >>>
>> >>> PG_STAT  STATE                                             UP             UP_PRIMARY  ACTING     ACTING_PRIMARY
>> >>> 18.773   active+undersized+degraded+remapped+backfilling   [302,150,138]  302         [130,101]  130
>> >>>
>> >>> Have you already checked the logs for osd.302 and /var/log/messages
>> >>> for any I/O-related issues? Could you also try running 'ceph pg
>> >>> repeer 18.773'?
>> >>>
>> >>> If this is the only PG for which `osd.302` is not acting and the
>> >>> 'repeer' command does not resolve the issue, I would suggest
>> >>> attempting a deep-scrub on this PG. This might uncover errors that
>> >>> could potentially be fixed, either online or offline.
>> >>>
>> >>> Regards,
>> >>> Frédéric
>> >>>
>> >>> --
>> >>> Frédéric Nass
>> >>> Ceph Ambassador France | Senior Ceph Engineer @ CLYSO
>> >>> Try our Ceph Analyzer -- https://analyzer.clyso.com/
>> >>> https://clyso.com | [email protected]
>> >>>
>> >>>
>> >>> On Mon, Oct 6, 2025 at 6:31 AM, Sa Pham <[email protected]> wrote:
>> >>>
>> >>>> Hello Eugen,
>> >>>>
>> >>>>
>> >>>> This PG contains 254490 objects, total size: 68095493667 bytes.
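>> >>>>
>> >>>> (That works out to about 63.4 GiB, i.e. roughly 268 KB per object on
>> >>>> average.)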
>> >>>>
>> >>>>
>> >>>> Regards,
>> >>>>
>> >>>> On Fri, Oct 3, 2025 at 9:10 PM Eugen Block <[email protected]> wrote:
>> >>>>
>> >>>> > Is it possible that this is a huge PG? What size does it have?
>> >>>> > But it could also be a faulty disk.
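>> >>>> >
>> >>>> > For reference, you can compare it against your other PGs via the
>> >>>> > OBJECTS and BYTES columns of:
>> >>>> >
>> >>>> >     ceph pg ls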
>> >>>> >
>> >>>> >
>> >>>> > Quoting Sa Pham <[email protected]>:
>> >>>> >
>> >>>> > > *Hello everyone,*
>> >>>> > >
>> >>>> > > I'm running a Ceph cluster used as an RGW backend, and I'm facing
>> >>>> > > an issue with one particular placement group (PG):
>> >>>> > >
>> >>>> > >    - Accessing objects from this PG is *extremely slow*.
>> >>>> > >    - Even running ceph pg <pg_id> takes a very long time
>> >>>> > >      (example below).
>> >>>> > >    - The PG is currently *stuck in a degraded state*, so I'm
>> >>>> > >      unable to move it to other OSDs.
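>> >>>> > >
>> >>>> > > For example, even a plain query hangs for a long time (18.773 is
>> >>>> > > the affected PG; the query form of the command is assumed here):
>> >>>> > >
>> >>>> > >     time ceph pg 18.773 query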
>> >>>> > >
>> >>>> > >
>> >>>> > > Current ceph version is reef 18.2.7.
>> >>>> > >
>> >>>> > > Has anyone encountered a similar issue before, or does anyone
>> >>>> > > have suggestions on how to troubleshoot and resolve it?
>> >>>> > >
>> >>>> > >
>> >>>> > > Thanks in advance!
>> >>>> >
>> >>>>
>> >>>>
>> >>>> --
>> >>>> Sa Pham Dang
>> >>>> Skype: great_bn
>> >>>> Phone/Telegram: 0986.849.582
>> >>>>
>> >>>
>> >>
>> >> --
>> >> Sa Pham Dang
>> >> Skype: great_bn
>> >> Phone/Telegram: 0986.849.582
>> >>
>> >>
>> >>
>
>
> Hi,
>
> If it's a performance issue, particularly if it's manifesting as high CPU
> load, you can usually pinpoint what's going on based on which symbol(s) are
> hot according to `perf top -p <pid>`.
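>
> For example (assuming the OSD daemon runs locally):
>
>     pgrep -af ceph-osd    # find the PID of the suspect OSD
>     perf top -p <pid>     # then watch which symbols are hot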
>
> If it's not CPU hot, `iostat` is worth a look to see if the kernel thinks
> the block device is busy.
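>
> For example:
>
>     iostat -x 1    # watch %util and the await columns for the
>                    # OSD's backing device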
>
> Barring either of those, it gets a bit trickier to tease out, but my
> advice would be to first discern whether or not it's a resource issue
> and work backwards from there.
>
> Cheers,
> Tyler
>


-- 
Sa Pham Dang
Skype: great_bn
Phone/Telegram: 0986.849.582
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]
