Hello all,

Yesterday we exported this PG from one OSD and imported it into three other OSDs, then used pg-upmap to map it onto those acting OSDs:

  ceph osd pg-upmap 18.773 11 52 111

But the PG is still slow when accessed (ceph pg 18.773 query), and objects in this PG are still slow. The scrub process cannot finish. I tried to use ceph-objectstore-tool to repair the PG, but it did not work.

I don't know what I should do now. Please advise.
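For reference, the export/import part was along these lines (a minimal sketch: the data paths, export file name, and source OSD id below are illustrative, not our exact values, and each ceph-objectstore-tool invocation assumes the OSD in question is stopped first):

  # Export the PG from the source OSD (placeholder id):
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<src-id> \
      --pgid 18.773 --op export --file /tmp/pg.18.773.export

  # Import it on each of the three target OSDs, e.g. osd.11:
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 \
      --pgid 18.773 --op import --file /tmp/pg.18.773.export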
Regards,

On Tue, Oct 7, 2025 at 6:57 AM Sa Pham <[email protected]> wrote:

> Hi Tyler,
>
> We do have a monitoring system, visualized with Grafana. I already
> checked CPU iowait, load average, and CPU usage (user + system) as well;
> none of them are high.
>
> So I don't think the problem is resources.
>
> Regards,
>
> On Tue, Oct 7, 2025 at 6:53 AM Tyler Stachecki <[email protected]>
> wrote:
>
>> On Mon, Oct 6, 2025, 5:34 PM Sa Pham <[email protected]> wrote:
>>
>>> Hi Frédéric,
>>>
>>> I tried restarting every OSD related to that PG many times, but it
>>> didn't help.
>>>
>>> When the primary switched to OSD 101, we still saw slowness on OSD
>>> 101, so I don't think it's a hardware issue.
>>>
>>> Regards,
>>>
>>> On Mon, 6 Oct 2025 at 20:46, Frédéric Nass <[email protected]>
>>> wrote:
>>>
>>> > Could be an issue with the primary OSD, which is now osd.130. Have
>>> > you checked osd.130 for any errors?
>>> > Maybe try restarting osd.130 and osd.302 one at a time, and maybe
>>> > 101 as well, waiting for ~all PGs to become active+clean between
>>> > restarts.
>>> >
>>> > Could you please share a ceph status, so we get a better view of
>>> > the situation?
>>> >
>>> > Regards,
>>> > Frédéric.
>>> >
>>> > On Mon, Oct 6, 2025 at 14:19, Sa Pham <[email protected]> wrote:
>>> >
>>> >> Hi Frédéric,
>>> >>
>>> >> I tried to repeer and deep-scrub, but it didn't help.
>>> >>
>>> >> "Have you already checked the logs for osd.302 and
>>> >> /var/log/messages for any I/O-related issues?"
>>> >>
>>> >> => I checked; there is no I/O error/issue.
>>> >>
>>> >> Regards,
>>> >>
>>> >> On Mon, Oct 6, 2025 at 3:15 PM Frédéric Nass
>>> >> <[email protected]> wrote:
>>> >>
>>> >>> Hi Sa,
>>> >>>
>>> >>> Regarding the output you provided, it appears that osd.302 is
>>> >>> listed as UP but not ACTING for PG 18.773:
>>> >>>
>>> >>> PG_STAT  STATE                                            UP             UP_PRIMARY  ACTING     ACTING_PRIMARY
>>> >>> 18.773   active+undersized+degraded+remapped+backfilling  [302,150,138]  302         [130,101]  130
>>> >>>
>>> >>> Have you already checked the logs for osd.302 and
>>> >>> /var/log/messages for any I/O-related issues? Could you also try
>>> >>> running 'ceph pg repeer 18.773'?
>>> >>>
>>> >>> If this is the only PG for which osd.302 is not acting and the
>>> >>> 'repeer' command does not resolve the issue, I would suggest
>>> >>> attempting a deep-scrub on this PG.
>>> >>> This might uncover errors that could potentially be fixed, either
>>> >>> online or offline.
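>>> >>>
>>> >>> For this PG, that would be something like:
>>> >>>
>>> >>>   ceph pg repeer 18.773
>>> >>>   # manual deep scrub (standard CLI form, not copied from this thread):
>>> >>>   ceph pg deep-scrub 18.773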
>>> >>>
>>> >>> Regards,
>>> >>> Frédéric
>>> >>>
>>> >>> --
>>> >>> Frédéric Nass
>>> >>> Ceph Ambassador France | Senior Ceph Engineer @ CLYSO
>>> >>> Try our Ceph Analyzer -- https://analyzer.clyso.com/
>>> >>> https://clyso.com | [email protected]
>>> >>>
>>> >>> On Mon, Oct 6, 2025 at 06:31, Sa Pham <[email protected]> wrote:
>>> >>>
>>> >>>> Hello Eugen,
>>> >>>>
>>> >>>> This PG includes 254,490 objects, with a total size of
>>> >>>> 68,095,493,667 bytes (about 63 GiB).
>>> >>>>
>>> >>>> Regards,
>>> >>>>
>>> >>>> On Fri, Oct 3, 2025 at 9:10 PM Eugen Block <[email protected]> wrote:
>>> >>>>
>>> >>>> > Is it possible that this is a huge PG? What size does it have?
>>> >>>> > But it could also be a faulty disk.
>>> >>>> >
>>> >>>> > Quoting Sa Pham <[email protected]>:
>>> >>>> >
>>> >>>> > > Hello everyone,
>>> >>>> > >
>>> >>>> > > I'm running a Ceph cluster used as an RGW backend, and I'm
>>> >>>> > > facing an issue with one particular placement group (PG):
>>> >>>> > >
>>> >>>> > > - Accessing objects from this PG is extremely slow.
>>> >>>> > > - Even running ceph pg <pg_id> query takes a very long time.
>>> >>>> > > - The PG is currently stuck in a degraded state, so I'm
>>> >>>> > >   unable to move it to other OSDs.
>>> >>>> > >
>>> >>>> > > The current Ceph version is Reef 18.2.7.
>>> >>>> > >
>>> >>>> > > Has anyone encountered a similar issue before, or have any
>>> >>>> > > suggestions on how to troubleshoot and resolve it?
>>> >>>> > >
>>> >>>> > > Thanks in advance!
>>
>> Hi,
>>
>> If it's a performance issue, particularly if it's manifesting as high
>> CPU load, you can usually pinpoint what's going on based on which
>> symbol(s) are hot according to `perf top -p <pid>`.
>>
>> If it's not CPU-hot, `iostat` is worth a look to see whether the
>> kernel thinks the block device is busy.
>>
>> Barring either of those, it gets a bit trickier to tease out, but my
>> advice would be to first discern whether or not it's a resource issue
>> and work backwards from there.
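>>
>> For example (the OSD id below and the pgrep pattern are placeholders,
>> and assume a plain ceph-osd process started with --id):
>>
>>   # hot symbols in the (suspected) primary OSD process
>>   perf top -p "$(pgrep -f 'ceph-osd.*--id 130')"
>>
>>   # extended per-device statistics at 1-second intervals
>>   iostat -x 1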
>>
>> Cheers,
>> Tyler

--
Sa Pham Dang
Skype: great_bn
Phone/Telegram: 0986.849.582
_______________________________________________
ceph-users mailing list -- [email protected]
To unsubscribe send an email to [email protected]