Not sure which version you are on, but adding these to your /etc/ceph/ceph.conf file and restarting the OSD processes can go a long way toward easing these really long blocked requests. It won't get rid of them completely, but it should really help.
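Once the two lines just below are in the [osd] section of ceph.conf on each OSD node, restarting and verifying could look roughly like this. It is only a sketch, assuming a systemd-based deployment; adjust the unit name to your setup, and osd.9 is just an example id taken from your health output:

    # restart every OSD on the node, or one at a time with ceph-osd@<id>
    systemctl restart ceph-osd.target

    # on the host running that OSD, confirm it picked up the new values
    # via its admin socket
    ceph daemon osd.9 config show | grep osd_op_queue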
osd op queue = wpq
osd op queue cut off = high

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904 C70E E654 3BB2 FA62 B9F1


On Mon, Sep 16, 2019 at 11:34 PM Thomas <74cmo...@gmail.com> wrote:
> Hi,
>
> I have defined pool hdd which is exclusively used by virtual disks of
> multiple KVMs / LXCs.
> Yesterday I ran these commands
> osdmaptool om --upmap out.txt --upmap-pool hdd
> source out.txt
> and Ceph started rebalancing this pool.
>
> However, since then no KVM / LXC is reacting anymore.
> If I try to start a new KVM, it hangs in the boot process.
>
> This is the output of ceph health detail:
> root@ld3955:/mnt/rbd# ceph health detail
> HEALTH_ERR 28 nearfull osd(s); 1 pool(s) nearfull; Reduced data availability: 1 pg inactive, 1 pg peering; Degraded data redundancy (low space): 8 pgs backfill_toofull; 1 subtrees have overcommitted pool target_size_bytes; 1 subtrees have overcommitted pool target_size_ratio; 2 pools have too many placement groups; 672 slow requests are blocked > 32 sec; 4752 stuck requests are blocked > 4096 sec
> OSD_NEARFULL 28 nearfull osd(s)
>     osd.42 is near full
>     osd.44 is near full
>     osd.45 is near full
>     osd.77 is near full
>     osd.84 is near full
>     osd.94 is near full
>     osd.101 is near full
>     osd.103 is near full
>     osd.106 is near full
>     osd.109 is near full
>     osd.113 is near full
>     osd.118 is near full
>     osd.120 is near full
>     osd.136 is near full
>     osd.138 is near full
>     osd.142 is near full
>     osd.147 is near full
>     osd.156 is near full
>     osd.159 is near full
>     osd.161 is near full
>     osd.168 is near full
>     osd.192 is near full
>     osd.202 is near full
>     osd.206 is near full
>     osd.208 is near full
>     osd.226 is near full
>     osd.234 is near full
>     osd.247 is near full
> POOL_NEARFULL 1 pool(s) nearfull
>     pool 'hdb_backup' is nearfull
> PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg peering
>     pg 30.1b9 is stuck peering for 4722.750977, current state peering, last acting [183,27,63]
> PG_DEGRADED_FULL Degraded data redundancy (low space): 8 pgs backfill_toofull
>     pg 11.465 is active+remapped+backfill_wait+backfill_toofull, acting [308,351,58]
>     pg 11.5c4 is active+remapped+backfill_wait+backfill_toofull, acting [318,336,54]
>     pg 11.afd is active+remapped+backfill_wait+backfill_toofull, acting [347,220,315]
>     pg 11.b82 is active+remapped+backfill_toofull, acting [314,320,254]
>     pg 11.1803 is active+remapped+backfill_wait+backfill_toofull, acting [88,363,302]
>     pg 11.1aac is active+remapped+backfill_wait+backfill_toofull, acting [328,275,95]
>     pg 11.1c09 is active+remapped+backfill_wait+backfill_toofull, acting [55,124,278]
>     pg 11.1e36 is active+remapped+backfill_wait+backfill_toofull, acting [351,92,315]
> POOL_TARGET_SIZE_BYTES_OVERCOMMITTED 1 subtrees have overcommitted pool target_size_bytes
>     Pools ['hdb_backup'] overcommit available storage by 1.708x due to target_size_bytes 0 on pools []
> POOL_TARGET_SIZE_RATIO_OVERCOMMITTED 1 subtrees have overcommitted pool target_size_ratio
>     Pools ['hdb_backup'] overcommit available storage by 1.708x due to target_size_ratio 0.000 on pools []
> POOL_TOO_MANY_PGS 2 pools have too many placement groups
>     Pool hdd has 512 placement groups, should have 128
>     Pool pve_cephfs_metadata has 32 placement groups, should have 4
> REQUEST_SLOW 672 slow requests are blocked > 32 sec
>     249 ops are blocked > 2097.15 sec
>     284 ops are blocked > 1048.58 sec
>     108 ops are blocked > 524.288 sec
>     9 ops are blocked > 262.144 sec
>     22 ops are blocked > 131.072 sec
>     osd.9 has blocked requests > 524.288 sec
>     osds 0,2,6,68 have blocked requests > 1048.58 sec
>     osd.3 has blocked requests > 2097.15 sec
> REQUEST_STUCK 4752 stuck requests are blocked > 4096 sec
>     1431 ops are blocked > 67108.9 sec
>     513 ops are blocked > 33554.4 sec
>     909 ops are blocked > 16777.2 sec
>     1809 ops are blocked > 8388.61 sec
>     90 ops are blocked > 4194.3 sec
>     osd.63 has stuck requests > 67108.9 sec
>
>
> My interpretation is that Ceph
> a) is busy with remapping PGs of pool hdb_backup
> b) has identified several OSDs with either blocked or stuck requests.
>
> Any of these OSDs belongs to pool hdd, though.
> osd.9 belongs to node A, osd.63 and osd.68 belong to node C (there are
> 4 nodes serving OSDs in the cluster).
>
> I have tried to fix this issue, but it failed with
> - ceph osd set noout
> - restart of the relevant OSD via systemctl restart ceph-osd@<id>
> and finally a server reboot.
>
> I also tried to migrate the virtual disks to another pool, but this
> fails, too.
>
> There are no changes on the server side, like network or disks or whatsoever.
>
> How can I resolve this issue?
>
> THX
> Thomas
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
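For completeness, the upmap run described above usually goes end to end like this. This is just a sketch that reuses the file names from the mail and only adds the initial osdmap export step:

    # export the current osdmap, let osdmaptool compute pg-upmap-items
    # entries for the hdd pool, then run the generated ceph commands
    ceph osd getmap -o om
    osdmaptool om --upmap out.txt --upmap-pool hdd
    source out.txt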