I forgot to add that the OSD daemons really seem to be idle: no disk activity, no CPU usage. It looks to me like some kind of deadlock, as if they were waiting for each other.
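One way to check whether an OSD is actually blocked rather than just throttled is to query its admin socket (osd.12 below is only an example id; run this on the node hosting the OSD, against each OSD that serves a stuck PG):

```shell
# Operations currently in flight on this OSD; an empty list while its PGs
# stay in "recovering" suggests the daemon is waiting, not working.
ceph daemon osd.12 dump_ops_in_flight

# Recently completed slow operations, with per-phase timestamps,
# which can show where requests spent their time.
ceph daemon osd.12 dump_historic_ops
```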
And so I've been trying to get the last 1.5% of misplaced/degraded PGs through for almost a week.

On Fri, Jul 28, 2017 at 10:56:02AM +0200, Nikola Ciprich wrote:
> Hi,
>
> I'm trying to find the reason for the strange recovery issues I'm seeing on
> our cluster.
>
> It's a mostly idle 4-node cluster with 26 OSDs evenly distributed
> across the nodes, running jewel 10.2.9.
>
> The problem is that after some disk replacements and data moves, recovery
> progresses extremely slowly; PGs seem to be stuck in the
> active+recovering+degraded state:
>
> [root@v1d ~]# ceph -s
>     cluster a5efbc87-3900-4c42-a977-8c93f7aa8c33
>      health HEALTH_WARN
>             159 pgs backfill_wait
>             4 pgs backfilling
>             259 pgs degraded
>             12 pgs recovering
>             113 pgs recovery_wait
>             215 pgs stuck degraded
>             266 pgs stuck unclean
>             140 pgs stuck undersized
>             151 pgs undersized
>             recovery 37788/2327775 objects degraded (1.623%)
>             recovery 23854/2327775 objects misplaced (1.025%)
>             noout,noin flag(s) set
>      monmap e21: 3 mons at {v1a=10.0.0.1:6789/0,v1b=10.0.0.2:6789/0,v1c=10.0.0.3:6789/0}
>             election epoch 6160, quorum 0,1,2 v1a,v1b,v1c
>       fsmap e817: 1/1/1 up {0=v1a=up:active}, 1 up:standby
>      osdmap e76002: 26 osds: 26 up, 26 in; 185 remapped pgs
>             flags noout,noin,sortbitwise,require_jewel_osds
>       pgmap v80995844: 3200 pgs, 4 pools, 2876 GB data, 757 kobjects
>             9215 GB used, 35572 GB / 45365 GB avail
>             37788/2327775 objects degraded (1.623%)
>             23854/2327775 objects misplaced (1.025%)
>                 2912 active+clean
>                  130 active+undersized+degraded+remapped+wait_backfill
>                   97 active+recovery_wait+degraded
>                   29 active+remapped+wait_backfill
>                   12 active+recovery_wait+undersized+degraded+remapped
>                    6 active+recovering+degraded
>                    5 active+recovering+undersized+degraded+remapped
>                    4 active+undersized+degraded+remapped+backfilling
>                    4 active+recovery_wait+degraded+remapped
>                    1 active+recovering+degraded+remapped
>   client io 2026 B/s rd, 146 kB/s wr, 9 op/s rd, 21 op/s wr
>
> When I restart the affected OSDs, recovery picks up for a while, but then
> other PGs get stuck.
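To narrow down which PGs are wedged and why, something like the following should help (the pgid 2.1af is made up; substitute a real one taken from the dump_stuck output):

```shell
# PGs stuck in unclean states, with the OSD sets they map to
ceph pg dump_stuck unclean

# Per-PG detail for everything behind the HEALTH_WARN
ceph health detail

# Full peering/recovery state-machine history for one stuck PG;
# the "recovery_state" section shows what it is waiting on.
ceph pg 2.1af query
```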
> All OSDs have been restarted multiple times, and none is even close to
> nearfull; I just can't find what I'm doing wrong.
>
> Possibly related OSD options:
>
> osd max backfills = 4
> osd recovery max active = 15
> debug osd = 0/0
> osd op threads = 4
> osd backfill scan min = 4
> osd backfill scan max = 16
>
> Any hints would be greatly appreciated.
>
> thanks
>
> nik
>
> --
> -------------------------------------
> Ing. Nikola CIPRICH
> LinuxBox.cz, s.r.o.
> 28.rijna 168, 709 00 Ostrava
>
> tel.:   +420 591 166 214
> fax:    +420 596 621 273
> mobil:  +420 777 093 799
> www.linuxbox.cz
>
> mobil servis: +420 737 238 656
> email servis: ser...@linuxbox.cz
> -------------------------------------
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
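P.S. for the archives: the backfill/recovery options above can also be changed on a live cluster without restarting anything, which makes it easy to experiment while watching `ceph -s` (the values below are only examples, not recommendations):

```shell
# Push new recovery/backfill limits into all running OSDs at once
ceph tell osd.* injectargs '--osd-max-backfills 4 --osd-recovery-max-active 15'

# Confirm the value a given OSD is actually using (osd.0 as an example)
ceph daemon osd.0 config get osd_max_backfills
```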