[ceph-users] Re: PG stuck at recovery

2024-02-23 Thread Leon Gao
We see this on both EC 3,2 and 3-way replicated pools, but all on
HDD. Our SSD usage is very small, though.

On Mon, Feb 19, 2024 at 10:18 PM Anthony D'Atri 
wrote:

>
>
> >> After wrangling with this myself, both with 17.2.7 and to an extent
> with 17.2.5, I'd like to follow up here and ask:
> >> Those who have experienced this, were the affected PGs
> >> * Part of an EC pool?
> >> * Part of an HDD pool?
> >> * Both?
> >
> > Both in my case, EC is 4+2 jerasure blaum_roth and the HDD is hybrid
> where DB is on SSD shared by 5 HDD.
> > And in your cases?
>
>
> EC 4,2, HDD-only.
>
>
___
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io


[ceph-users] Re: PG stuck at recovery

2024-02-12 Thread Leon Gao
Thanks a lot! Yes, it turned out to be the same issue that you pointed to.
Switching to wpq solved it. We are running 18.2.0.

Leon
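
For anyone finding this thread later, here is a minimal sketch of the
scheduler switch discussed above. It only builds (and by default prints) the
ceph CLI calls; the helper names wpq_switch_commands and apply are
hypothetical and not part of Ceph. It assumes the setting is applied to all
OSDs through the central config database. The OSDs have to be restarted
afterwards for osd_op_queue to take effect, and how you restart them depends
on your deployment, so that step is left out.

```python
import shlex
import subprocess


def wpq_switch_commands(scheduler="wpq"):
    """Return the ceph CLI calls that set osd_op_queue and read it back.

    Hypothetical helper for illustration; "osd" targets all OSDs.
    """
    return [
        ["ceph", "config", "set", "osd", "osd_op_queue", scheduler],
        ["ceph", "config", "get", "osd", "osd_op_queue"],
    ]


def apply(commands, dry_run=True):
    """Print each command, or run it when dry_run is False."""
    for cmd in commands:
        if dry_run:
            print(shlex.join(cmd))
        else:
            # check=True raises CalledProcessError if the ceph CLI fails
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    # Dry run: show what would be executed. Restart the OSDs afterwards.
    apply(wpq_switch_commands())
```

Running it as-is prints the two commands; pass dry_run=False to apply() on
an admin node to actually execute them.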

On Wed, Feb 7, 2024 at 12:48 PM Kai Stian Olstad 
wrote:

> You don't say anything about the Ceph version you are running.
> I had a similar issue with 17.2.7, and it seems to be an issue with
> mclock;
> when I switched to wpq everything worked again.
>
> You can read more about it here
>
> https://lists.ceph.io/hyperkitty/list/ceph-users@ceph.io/thread/IPHBE3DLW5ABCZHSNYOBUBSI3TLWVD22/#OE3QXLAJIY6NU7PNMGHP47UK2CBZJPUG
>
> -
> Kai Stian Olstad
>
>
> On Tue, Feb 06, 2024 at 06:35:26AM -, LeonGao  wrote:
> >Hi community
> >
> >We have a new Ceph cluster deployment with 100 nodes. When we drain an
> OSD host from the cluster, we see a small number of PGs that never make it
> to the end. From the logs and metrics, it seems that recovery is stuck
> (0 recovery ops for several days). We would like to get some ideas on this.
> Re-peering and OSD restarts do mitigate the issue, but we want to get to
> the root cause, as draining and recovery happen frequently.
> >
> >I have put some debugging information below. Any help is appreciated,
> thanks!
> >
> >ceph -s
> >pgs: 4210926/7380034104 objects misplaced (0.057%)
> > 41198 active+clean
> > 71 active+remapped+backfilling
> > 12 active+recovering
> >
> >One of the stuck PG:
> >6.38f1   active+remapped+backfilling [313,643,727] 313 [313,643,717] 313
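
To make the PG line above easier to read: assuming the two bracketed lists
are the up set and the acting set, in that order, the short sketch below
compares them. The function name backfill_diff is made up for illustration.
OSDs that are in up but not in acting are the ones data is being backfilled
to, and OSDs that are in acting but not in up are the ones being drained.

```python
def backfill_diff(up, acting):
    """Return (incoming, outgoing) OSD ids for a remapped PG.

    incoming: in the up set but not yet in the acting set (backfill targets)
    outgoing: still in the acting set but no longer in the up set
    """
    up_set = set(up)
    acting_set = set(acting)
    incoming = [osd for osd in up if osd not in acting_set]
    outgoing = [osd for osd in acting if osd not in up_set]
    return incoming, outgoing


if __name__ == "__main__":
    # Values copied from the stuck PG 6.38f1 shown above.
    incoming, outgoing = backfill_diff([313, 643, 727], [313, 643, 717])
    print("backfilling to:", incoming)    # [727]
    print("draining from:", outgoing)     # [717]
```

For this PG that gives 727 as the only backfill target, which matches the
backfill_targets field in the query output below.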
> >
> >PG query result:
> >
> >ceph pg 6.38f1 query
> >{
> >"snap_trimq": "[]",
> >"snap_trimq_len": 0,
> >"state": "active+remapped+backfilling",
> >"epoch": 246856,
> >"up": [
> >313,
> >643,
> >727
> >],
> >"acting": [
> >313,
> >643,
> >717
> >],
> >"backfill_targets": [
> >"727"
> >],
> >"acting_recovery_backfill": [
> >"313",
> >"643",
> >"717",
> >"727"
> >],
> >"info": {
> >"pgid": "6.38f1",
> >"last_update": "212333'38916",
> >"last_complete": "212333'38916",
> >"log_tail": "80608'37589",
> >"last_user_version": 38833,
> >"last_backfill": "MAX",
> >"purged_snaps": [],
> >"history": {
> >"epoch_created": 3726,
> >"epoch_pool_created": 3279,
> >"last_epoch_started": 243987,
> >"last_interval_started": 243986,
> >"last_epoch_clean": 220174,
> >"last_interval_clean": 220173,
> >"last_epoch_split": 3726,
> >"last_epoch_marked_full": 0,
> >"same_up_since": 238347,
> >"same_interval_since": 243986,
> >"same_primary_since": 3728,
> >"last_scrub": "212333'38916",
> >"last_scrub_stamp": "2024-01-29T13:43:10.654709+",
> >"last_deep_scrub": "212333'38916",
> >"last_deep_scrub_stamp": "2024-01-28T07:43:45.920198+",
> >"last_clean_scrub_stamp": "2024-01-29T13:43:10.654709+",
> >"prior_readable_until_ub": 0
> >},
> >"stats": {
> >"version": "212333'38916",
> >"reported_seq": 413425,
> >"reported_epoch": 246856,
> >"state": "active+remapped+backfilling",
> >"last_fresh": "2024-02-05T21:14:40.838785+",
> >"last_change": "2024-02-03T22:33:43.052272+",
> >"last_active": "2024-02-05T21:14:40.838785+",
> >"last_peered": "2024-02-05T21:14:40.838785+",
> >"last_clean": "2024-02-03T04:26:35.168232+",
> >"last_became_active": "2024-02-03T22:31:16.037823+",
> >"last_became_peered": "2024-02-03T22:31:16.037823+",
> >"last_unstale": "2024-02-05T21:14:40.838785+",
> >"last_undegraded": "2024-02-05T21:14:40.838785+",
> >"last_fullsized": "2024-02-05T21:14:40.838785+",
> >"mapping_epoch": 243986,
> >"log_start": "80608'37589",
> >"ondisk_log_start": "80608'37589",
> >"created": 3726,
> >"last_epoch_clean": 220174,
> >"parent": "0.0",
> >"parent_split_bits": 14,
> >"last_scrub": "212333'38916",
> >"last_scrub_stamp": "2024-01-29T13:43:10.654709+",
> >"last_deep_scrub": "212333'38916",
> >"last_deep_scrub_stamp": "2024-01-28T07:43:45.920198+",
> >"last_clean_scrub_stamp": "2024-01-29T13:43:10.654709+",
> >"objects_scrubbed": 17743,
> >"log_size": 1327,
> >"log_dups_size": 3000,
> >"ondisk_log_size": 1327,
> >"stats_invalid": false,
> >"dirty_stats_invalid": false,
> >"omap_stats_invalid": false,
> >