Thanks a lot! Yes, it turned out to be the same issue you pointed to.
Switching to wpq resolved it. We are running 18.2.0.
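For anyone finding this thread later, the change is roughly the following sketch, assuming a cephadm-managed cluster (adjust the restart step to your deployment):

```shell
# Check the current scheduler on a sample OSD
# (mclock_scheduler is the default since Quincy)
ceph config get osd.0 osd_op_queue

# Switch all OSDs to the wpq scheduler
ceph config set osd osd_op_queue wpq

# osd_op_queue is only read at startup, so the OSDs must be
# restarted for the change to take effect
ceph orch restart osd
```

These commands require a live cluster, so treat them as a sketch rather than a tested recipe.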
Leon
On Wed, Feb 7, 2024 at 12:48 PM Kai Stian Olstad wrote:
> You don't say anything about the Ceph version you are running.
> I had a similar issue with 17.2.7, and it seems to be an issue with
> mclock; when I switched to wpq everything worked again.
>
> You can read more about it here
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IPHBE3DLW5ABCZHSNYOBUBSI3TLWVD22/#OE3QXLAJIY6NU7PNMGHP47UK2CBZJPUG
>
> -
> Kai Stian Olstad
>
>
> On Tue, Feb 06, 2024 at 06:35:26AM -, LeonGao wrote:
> >Hi community
> >
> >We have a new Ceph cluster deployment with 100 nodes. When we drain an
> >OSD host from the cluster, a small number of PGs never make any progress
> >to completion. From the logs and metrics, recovery appears to be stuck
> >(0 recovery ops for several days). We would like some ideas on this.
> >Re-peering and restarting the OSD do mitigate the issue, but we want to
> >find the root cause, since draining and recovery happen frequently.
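The mitigation mentioned above can be sketched as follows, using the pgid and primary OSD from the output below (cephadm assumed for the restart step):

```shell
# Ask the monitors to force the stuck PG through peering again
ceph pg repeer 6.38f1

# If re-peering alone is not enough, restart the PG's primary OSD
# (313 in the pg query output below)
ceph orch daemon restart osd.313
```

Both commands act on a live cluster, so they are a sketch of the workaround rather than something verified here.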
> >
> >I have put some debugging information below. Any help is appreciated,
> thanks!
> >
> >ceph -s
> >pgs: 4210926/7380034104 objects misplaced (0.057%)
> > 41198 active+clean
> > 71    active+remapped+backfilling
> > 12    active+recovering
> >
> >One of the stuck PGs (pgid, state, up, up_primary, acting, acting_primary):
> >6.38f1  active+remapped+backfilling  [313,643,727]  313  [313,643,717]  313
> >
> >PG query result:
> >
> >ceph pg 6.38f1 query
> >{
> >"snap_trimq": "[]",
> >"snap_trimq_len": 0,
> >"state": "active+remapped+backfilling",
> >"epoch": 246856,
> >"up": [
> >313,
> >643,
> >727
> >],
> >"acting": [
> >313,
> >643,
> >717
> >],
> >"backfill_targets": [
> >"727"
> >],
> >"acting_recovery_backfill": [
> >"313",
> >"643",
> >"717",
> >"727"
> >],
> >"info": {
> >"pgid": "6.38f1",
> >"last_update": "212333'38916",
> >"last_complete": "212333'38916",
> >"log_tail": "80608'37589",
> >"last_user_version": 38833,
> >"last_backfill": "MAX",
> >"purged_snaps": [],
> >"history": {
> >"epoch_created": 3726,
> >"epoch_pool_created": 3279,
> >"last_epoch_started": 243987,
> >"last_interval_started": 243986,
> >"last_epoch_clean": 220174,
> >"last_interval_clean": 220173,
> >"last_epoch_split": 3726,
> >"last_epoch_marked_full": 0,
> >"same_up_since": 238347,
> >"same_interval_since": 243986,
> >"same_primary_since": 3728,
> >"last_scrub": "212333'38916",
> >"last_scrub_stamp": "2024-01-29T13:43:10.654709+",
> >"last_deep_scrub": "212333'38916",
> >"last_deep_scrub_stamp": "2024-01-28T07:43:45.920198+",
> >"last_clean_scrub_stamp": "2024-01-29T13:43:10.654709+",
> >"prior_readable_until_ub": 0
> >},
> >"stats": {
> >"version": "212333'38916",
> >"reported_seq": 413425,
> >"reported_epoch": 246856,
> >"state": "active+remapped+backfilling",
> >"last_fresh": "2024-02-05T21:14:40.838785+",
> >"last_change": "2024-02-03T22:33:43.052272+",
> >"last_active": "2024-02-05T21:14:40.838785+",
> >"last_peered": "2024-02-05T21:14:40.838785+",
> >"last_clean": "2024-02-03T04:26:35.168232+",
> >"last_became_active": "2024-02-03T22:31:16.037823+",
> >"last_became_peered": "2024-02-03T22:31:16.037823+",
> >"last_unstale": "2024-02-05T21:14:40.838785+",
> >"last_undegraded": "2024-02-05T21:14:40.838785+",
> >"last_fullsized": "2024-02-05T21:14:40.838785+",
> >"mapping_epoch": 243986,
> >"log_start": "80608'37589",
> >"ondisk_log_start": "80608'37589",
> >"created": 3726,
> >"last_epoch_clean": 220174,
> >"parent": "0.0",
> >"parent_split_bits": 14,
> >"last_scrub": "212333'38916",
> >"last_scrub_stamp": "2024-01-29T13:43:10.654709+",
> >"last_deep_scrub": "212333'38916",
> >"last_deep_scrub_stamp": "2024-01-28T07:43:45.920198+",
> >"last_clean_scrub_stamp": "2024-01-29T13:43:10.654709+",
> >"objects_scrubbed": 17743,
> >"log_size": 1327,
> >"log_dups_size": 3000,
> >"ondisk_log_size": 1327,
> >"stats_invalid": false,
> >"dirty_stats_invalid": false,
> >"omap_stats_invalid": false,
> >