Hi,

I have a CephFS cluster:
```
> ceph -s

  cluster:
    id:     e78987f2-ef1c-11ed-897d-cf8c255417f0
    health: HEALTH_WARN
            85 pgs not deep-scrubbed in time
            85 pgs not scrubbed in time

  services:
    mon: 5 daemons, quorum datastone05,datastone06,datastone07,datastone10,datastone09 (age 2w)
    mgr: datastone05.iitngk(active, since 2w), standbys: datastone06.wjppdy
    mds: 2/2 daemons up, 1 hot standby
    osd: 22 osds: 22 up (since 3d), 22 in (since 4w); 8 remapped pgs

  data:
    volumes: 1/1 healthy
    pools:   4 pools, 115 pgs
    objects: 49.08M objects, 16 TiB
    usage:   35 TiB used, 2.0 PiB / 2.1 PiB avail
    pgs:     3807933/98160678 objects misplaced (3.879%)
             107 active+clean
             8   active+remapped+backfilling

  io:
    client:   224 MiB/s rd, 79 MiB/s wr, 844 op/s rd, 33 op/s wr
    recovery: 8.8 MiB/s, 24 objects/s
```
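Side note: the two HEALTH_WARN items are just the scrub backlog, which I assume is a consequence of the ongoing backfill. If it doesn't clear on its own, the overdue PGs can be listed and scrubbed by hand; a minimal sketch (the PG id below is a placeholder):

```
# list which PGs are overdue for scrubbing / deep scrubbing
ceph health detail

# kick one of the reported PGs manually (2.1a is a placeholder id)
ceph pg scrub 2.1a
ceph pg deep-scrub 2.1a
```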

The pool and PG status:

```
> ceph osd pool autoscale-status

POOL              SIZE    TARGET SIZE  RATE  RAW CAPACITY  RATIO   TARGET RATIO  EFFECTIVE RATIO  BIAS  PG_NUM  NEW PG_NUM  AUTOSCALE  BULK
cephfs.myfs.meta  28802M               2.0          2119T  0.0000                                  4.0      16              on         False
cephfs.myfs.data  16743G               2.0          2119T  0.0154                                  1.0      32              on         False
rbd                   19               2.0          2119T  0.0000                                  1.0      32              on         False
.mgr               3840k               2.0          2119T  0.0000                                  1.0       1              on         False
```
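Since the cluster is only about 1.5% full, the autoscaler keeps wanting to shrink cephfs.myfs.data. A sketch of two ways to stop the repeated shrinking, assuming that pool is expected to hold most of the data eventually (the 0.8 ratio is a placeholder, not a recommendation):

```
# Option 1: mark the pool as "bulk" so the autoscaler sizes pg_num for
# capacity up front instead of reacting to current usage
ceph osd pool set cephfs.myfs.data bulk true

# Option 2: give it an explicit target ratio of the raw capacity
ceph osd pool set cephfs.myfs.data target_size_ratio 0.8
```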

The pool detail:

```
> ceph osd pool ls detail

pool 1 'cephfs.myfs.meta' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 16 pgp_num 16 autoscale_mode on last_change 3639 lfor 0/3639/3637 flags hashpspool stripe_width 0 pg_autoscale_bias 4 pg_num_min 16 recovery_priority 5 application cephfs
pool 2 'cephfs.myfs.data' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 66 pgp_num 58 pg_num_target 32 pgp_num_target 32 autoscale_mode on last_change 5670 lfor 0/5661/5659 flags hashpspool,selfmanaged_snaps stripe_width 0 application cephfs
pool 3 'rbd' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 32 pgp_num 32 autoscale_mode on last_change 486 lfor 0/486/478 flags hashpspool,selfmanaged_snaps stripe_width 0 application rbd
pool 4 '.mgr' replicated size 2 min_size 1 crush_rule 0 object_hash rjenkins pg_num 1 pgp_num 1 autoscale_mode on last_change 39 flags hashpspool stripe_width 0 pg_num_max 32 pg_num_min 1 application mgr
```
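This shows the merge that is in flight: cephfs.myfs.data is still at pg_num 66 / pgp_num 58 while the target is 32, so PGs are being merged in the background. A small sketch for watching it converge:

```
# pgp_num should step down toward pg_num_target (32) as the merge proceeds
watch -n 30 "ceph osd pool ls detail | grep cephfs.myfs.data"

# the mgr progress module also reports the merge/backfill as events
ceph progress
```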

Whenever the PG count is being reduced (the autoscaler is currently merging cephfs.myfs.data from pg_num 66 down to its target of 32), the MDS sometimes hangs. Has anyone seen this behaviour, and is there a recommended way to avoid it?
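If the hangs do line up with the merges, the workaround I'm considering (just a sketch, on the assumption that the merge itself triggers the hang) is to take the data pool off automatic scaling and do any reduction deliberately during a quiet window:

```
# stop the autoscaler from changing pg_num on this pool by itself
ceph osd pool set cephfs.myfs.data pg_autoscale_mode off

# later, if the reduction is still wanted, do it manually at a quiet time
ceph osd pool set cephfs.myfs.data pg_num 32
```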