Hi,

In Quincy (17.2.X) mclock is the default scheduler, and there have been issues with snaptrimming under mclock that have been resolved in the latest releases. You might be facing exactly that (stuck snaptrim). I'd recommend either upgrading to the latest Reef or Squid, which should improve the mclock scheduler, or reverting to the previous default "wpq" and restarting all OSDs. First, check what you're currently running:

ceph config get osd osd_op_queue

If you are already running wpq, we should take a closer look; if not, change it to wpq:

ceph config set osd osd_op_queue wpq
ceph orch restart osd

The latter will restart all OSDs (not simultaneously). After that you should see the snaptrim queues start to move. If you're patient and have time to dig in, you could check whether the stuck snaptrim operations are due to the average object size in the affected PGs compared to osd_snap_trim_cost; a rough sketch of that check follows below.
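
For example, something along these lines gives a rough comparison (only a sketch; the awk part assumes the usual 'ceph pg dump pgs' header with PG_STAT, OBJECTS and BYTES columns, so double-check the column names on your release):

ceph config get osd osd_snap_trim_cost

ceph pg dump pgs 2>/dev/null | awk '
    /^PG_STAT/ { for (i = 1; i <= NF; i++) col[$i] = i; next }
    col["PG_STAT"] && /snaptrim/ && $col["OBJECTS"] > 0 {
        printf "%s avg_obj_size=%.0f bytes\n", $col["PG_STAT"], $col["BYTES"] / $col["OBJECTS"]
    }'

If the average object size in the affected PGs is much larger than osd_snap_trim_cost, that mismatch could be part of the problem.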

The deep-scrubs should then also continue; I think they're blocked by the snaptrims.
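
Once things start moving again you can watch the backlog shrink by simply counting the warning lines (assuming the usual "pg ... not (deep-)scrubbed since ..." wording in the health detail output):

ceph health detail | grep -c 'not deep-scrubbed since'
ceph health detail | grep -c 'not scrubbed since'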

Regards,
Eugen

Quoting Gustavo Garcia Rondina <[email protected]>:

Hi Ceph community,

We have a Ceph cluster with 6 OSD nodes and 168 OSDs (28 per node, each with an 18 TB data disk), running Quincy 17.2.6. The cluster was not properly maintained for a while and was not in the greatest shape. After some work, we got here:

[ceph: root@ceph-admin1 /]# ceph -s
  cluster:
    id:     26315dca-383a-11ee-9d49-xxxxxxxxxxxx
    health: HEALTH_WARN
            257 pgs not deep-scrubbed in time
            259 pgs not scrubbed in time

  services:
    mon: 5 daemons, quorum ceph-admin1,ceph-admin2,ceph-osd1,ceph-osd2,ceph-osd3 (age 6M)
    mgr: ceph-admin2.sipadf(active, since 1y), standbys: ceph-admin1.nwaovh
    mds: 2/2 daemons up, 2 standby
    osd: 168 osds: 168 up (since 6w), 168 in (since 6w)

  data:
    volumes: 2/2 healthy
    pools:   9 pools, 2273 pgs
    objects: 699.66M objects, 1.5 PiB
    usage:   2.2 PiB used, 529 TiB / 2.7 PiB avail
    pgs:     1986 active+clean
             125  active+clean+snaptrim_wait
             110  active+clean+scrubbing+deep
             47   active+clean+snaptrim
             5    active+clean+scrubbing


The snaptrim_wait and snaptrim numbers have been stuck at these values for weeks, if not months. We also had almost all PGs not (deep-)scrubbed in time, but after tweaking a few parameters the numbers came down to what you see above. However, they now seem to have stopped dropping, or are dropping very slowly (a couple every 2-3 days). Dumping all PGs and filtering on the SCRUB_SCHEDULING column, I can see that some PGs have been scrubbing for a long time:

[root@ceph-admin1 ~]# ceph pg dump | grep -e 'scrubbing for'  -e SCHED  | awk '{print $1,$(NF-2)}' | sort -n -k2 | tail
5.624 4788773s
5.5f2 4790148s
5.4b 4791924s
5.686 4795526s
5.538 4796221s
5.5c8 4796573s
5.722 4796937s
5.483 4797551s
5.81 4799856s
5.233 10052851s

Recent ones look reasonable, though, so new scrubs are starting:

[root@ceph-admin1 ~]# ceph pg dump | grep -e 'scrubbing for'  -e SCHED  | awk '{print $1,$(NF-2)}' | sort -n -k2 | head
PG_STAT SCRUB_SCHEDULING
5.14b 53s
5.382 71s
5.ff 187s
6.9 265s
5.1c3 354s
5.70a 364s
5.596 367s
5.1fc 897s
5.3c8 1420s

Looking closer at PG 5.233, it has a long snaptrimq length, but it doesn't seem to be blocked:

[root@ceph-admin1 ~]# ceph pg 5.233 query | grep -E 'snaptrimq_len|blocked_by|waiting_on'
            "snaptrimq_len": 2682,
            "blocked_by": [],

However, it does seem to be waiting on someone:

[root@ceph-admin1 ~]# ceph pg 5.233 query | grep -A5 'waiting_on_whom'
        "waiting_on_whom": [
            "164(4)"
        ],
        "schedule": "scrubbing"

The primary OSD for PG 5.233 is osd.136 on node ceph-osd4, and its logs do not show anything remarkable. The HDD for osd.136 is in good shape, no SMART errors, no I/O errors anywhere on ceph-osd4.
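
Presumably the next place to check would be osd.164, i.e. the "164(4)" shard shown in waiting_on_whom above, for blocked or in-flight ops, something along these lines:

ceph osd find 164
ceph tell osd.164 dump_blocked_ops
ceph tell osd.164 dump_ops_in_flight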

Is it the scrubbing that's blocking the snaptrim, or the other way around?

Also, looking at historic ops:

[root@ceph-admin1 ~]# ceph tell osd.136 dump_historic_ops_by_duration
{
    "size": 20,
    "duration": 600,
     ...
    "ops": [
        {
            "description": "rep_scrubmap(6.9 e14586 from shard 151)",
            "initiated_at": "2025-09-25T09:57:46.150913-0500",
            "age": 356.3390488,
            "duration": 0.12764220300000001,
            "type_data": {
                "flag_point": "started",
                "events": [
                    ...             
                    {
                        "event": "header_read",
                        "time": "2025-09-25T09:57:46.150912-0500",
                        "duration": 4294967295.9999995
                    },

This last duration looks odd, as 4294967295.9999995 is essentially 2^32, which might indicate a tiny negative duration wrapping around an unsigned value. Not sure what it means, but it seemed strange.

Any clues on any of this?

Thank you,
Gustavo