Re: [ceph-users] hanging/stopped recovery/rebalance in Nautilus

2019-10-03 Thread Konstantin Shalygin

Hi, I have often observed that recovery/rebalance in Nautilus starts quite
fast but then becomes extremely slow (2-3 objects/s), even when around 20 OSDs
are involved. Right now I am moving data off 16x8TB disks (reweighted to 0);
the operation has been running for 4 days and for the last 12 hours it has been stuck at:
  cluster:
    id:     2f525d60-aada-4da6-830f-7ba7b46c546b
    health: HEALTH_WARN
            Degraded data redundancy: 1070/899796274 objects degraded (0.000%), 1 pg degraded, 1 pg undersized
            1216 pgs not deep-scrubbed in time
            1216 pgs not scrubbed in time

  services:
    mon: 1 daemons, quorum km-fsn-1-dc4-m1-797678 (age 8w)
    mgr: km-fsn-1-dc4-m1-797678(active, since 6w)
    mds: xfd:1 {0=km-fsn-1-dc4-m1-797678=up:active}
    osd: 151 osds: 151 up (since 3d), 151 in (since 7d); 24 remapped pgs
    rgw: 1 daemon active (km-fsn-1-dc4-m1-797680)

  data:
    pools:   13 pools, 10433 pgs
    objects: 447.45M objects, 282 TiB
    usage:   602 TiB used, 675 TiB / 1.2 PiB avail
    pgs:     1070/899796274 objects degraded (0.000%)
             261226/899796274 objects misplaced (0.029%)
             10388 active+clean
             24    active+clean+remapped
             19    active+clean+scrubbing+deep
             1     active+undersized+degraded
             1     active+clean+scrubbing

  io:
    client:   10 MiB/s rd, 18 MiB/s wr, 141 op/s rd, 292 op/s wr


osd_max_backfills is set to 16 for all OSDs.
Does anyone have an idea why the rebalance has completely stopped?
Thanks


Try lowering the recovery sleep options (see the example below):

osd_recovery_sleep_hdd -> for HDD OSDs without RocksDB on NVMe;
osd_recovery_sleep_hybrid -> for hybrid setups, i.e. HDD OSDs with RocksDB on NVMe;
osd_recovery_sleep_ssd -> for non-rotational (SSD) devices;
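
For illustration, a minimal sketch of lowering one of these at runtime with the standard Ceph CLI; the value 0.0 (no throttling) and the choice of osd_recovery_sleep_hdd are only example assumptions, not a recommendation for this particular cluster:

  # Persist cluster-wide in the mon config database (Mimic and later):
  ceph config set osd osd_recovery_sleep_hdd 0.0

  # Or inject into the running OSDs only (lost when a daemon restarts):
  ceph tell 'osd.*' injectargs '--osd_recovery_sleep_hdd=0.0'

  # Confirm the runtime value on a given OSD's host (osd.0 is just an example):
  ceph daemon osd.0 config get osd_recovery_sleep_hdd

ceph config set is the persistent route; injectargs changes the running daemons immediately and can be undone by injecting the previous value or restarting the OSDs.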



k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] hanging/stopped recovery/rebalance in Nautilus

2019-10-01 Thread Philippe D'Anjou
Hi, I have often observed that recovery/rebalance in Nautilus starts quite
fast but then becomes extremely slow (2-3 objects/s), even when around 20 OSDs
are involved. Right now I am moving data off 16x8TB disks (reweighted to 0);
the operation has been running for 4 days and for the last 12 hours it has been stuck at:
  cluster:
    id:     2f525d60-aada-4da6-830f-7ba7b46c546b
    health: HEALTH_WARN
            Degraded data redundancy: 1070/899796274 objects degraded (0.000%), 1 pg degraded, 1 pg undersized
            1216 pgs not deep-scrubbed in time
            1216 pgs not scrubbed in time

  services:
    mon: 1 daemons, quorum km-fsn-1-dc4-m1-797678 (age 8w)
    mgr: km-fsn-1-dc4-m1-797678(active, since 6w)
    mds: xfd:1 {0=km-fsn-1-dc4-m1-797678=up:active}
    osd: 151 osds: 151 up (since 3d), 151 in (since 7d); 24 remapped pgs
    rgw: 1 daemon active (km-fsn-1-dc4-m1-797680)

  data:
    pools:   13 pools, 10433 pgs
    objects: 447.45M objects, 282 TiB
    usage:   602 TiB used, 675 TiB / 1.2 PiB avail
    pgs:     1070/899796274 objects degraded (0.000%)
             261226/899796274 objects misplaced (0.029%)
             10388 active+clean
             24    active+clean+remapped
             19    active+clean+scrubbing+deep
             1     active+undersized+degraded
             1     active+clean+scrubbing

  io:
    client:   10 MiB/s rd, 18 MiB/s wr, 141 op/s rd, 292 op/s wr


osd_max_backfills is set to 16 for all OSDs.
Does anyone have an idea why the rebalance has completely stopped?
Thanks
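
For reference, one way to confirm what the running OSDs actually report for the backfill and recovery throttles (a sketch; osd.0 is only an example daemon id, and the admin-socket form must be run on the host where that OSD lives):

  # Runtime config as reported to the mgr (Mimic and later):
  ceph config show-with-defaults osd.0 | grep -E 'osd_max_backfills|osd_recovery_sleep'

  # Or directly via the OSD's admin socket on its host:
  ceph daemon osd.0 config show | grep -E 'osd_max_backfills|osd_recovery_sleep'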