Not sure which version you are on, but adding these settings to your
/etc/ceph/ceph.conf file and restarting the OSD processes can go a long way
toward relieving these really long blocked requests. It won't get rid of them
completely, but it should help a lot.

osd op queue = wpq
osd op queue cut off = high
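
In case it's useful, here is a rough sketch of what I mean, assuming a
systemd-managed cluster (the [osd] section placement and the <id>
placeholder are just illustrative, adjust to your deployment):

# /etc/ceph/ceph.conf
[osd]
osd op queue = wpq
osd op queue cut off = high

# restart all OSDs on a node
systemctl restart ceph-osd.target
# or one OSD at a time to limit the impact
systemctl restart ceph-osd@<id>

# verify on the OSD host after the restart
ceph daemon osd.<id> config show | grep osd_op_queue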

----------------
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Sep 16, 2019 at 11:34 PM Thomas <74cmo...@gmail.com> wrote:

> Hi,
>
> I have defined a pool hdd which is used exclusively by the virtual disks
> of multiple KVMs / LXCs.
> Yesterday I ran these commands:
> osdmaptool om --upmap out.txt --upmap-pool hdd
> source out.txt
> and Ceph started rebalancing this pool.
>
> However, since then no KVM / LXC has been responding.
> If I try to start a new KVM, it hangs during the boot process.
>
> This is the output of ceph health detail:
> root@ld3955:/mnt/rbd# ceph health detail
> HEALTH_ERR 28 nearfull osd(s); 1 pool(s) nearfull; Reduced data
> availability: 1 pg inactive, 1 pg peering; Degraded data redundancy (low
> space): 8 pgs backfill_toofull; 1 subtrees have overcommitted pool
> target_size_bytes; 1 subtrees have overcommitted pool target_size_ratio;
> 2 pools have too many placement groups; 672 slow requests are blocked >
> 32 sec; 4752 stuck requests are blocked > 4096 sec
> OSD_NEARFULL 28 nearfull osd(s)
>     osd.42 is near full
>     osd.44 is near full
>     osd.45 is near full
>     osd.77 is near full
>     osd.84 is near full
>     osd.94 is near full
>     osd.101 is near full
>     osd.103 is near full
>     osd.106 is near full
>     osd.109 is near full
>     osd.113 is near full
>     osd.118 is near full
>     osd.120 is near full
>     osd.136 is near full
>     osd.138 is near full
>     osd.142 is near full
>     osd.147 is near full
>     osd.156 is near full
>     osd.159 is near full
>     osd.161 is near full
>     osd.168 is near full
>     osd.192 is near full
>     osd.202 is near full
>     osd.206 is near full
>     osd.208 is near full
>     osd.226 is near full
>     osd.234 is near full
>     osd.247 is near full
> POOL_NEARFULL 1 pool(s) nearfull
>     pool 'hdb_backup' is nearfull
> PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg peering
>     pg 30.1b9 is stuck peering for 4722.750977, current state peering,
> last acting [183,27,63]
> PG_DEGRADED_FULL Degraded data redundancy (low space): 8 pgs
> backfill_toofull
>     pg 11.465 is active+remapped+backfill_wait+backfill_toofull, acting
> [308,351,58]
>     pg 11.5c4 is active+remapped+backfill_wait+backfill_toofull, acting
> [318,336,54]
>     pg 11.afd is active+remapped+backfill_wait+backfill_toofull, acting
> [347,220,315]
>     pg 11.b82 is active+remapped+backfill_toofull, acting [314,320,254]
>     pg 11.1803 is active+remapped+backfill_wait+backfill_toofull, acting
> [88,363,302]
>     pg 11.1aac is active+remapped+backfill_wait+backfill_toofull, acting
> [328,275,95]
>     pg 11.1c09 is active+remapped+backfill_wait+backfill_toofull, acting
> [55,124,278]
>     pg 11.1e36 is active+remapped+backfill_wait+backfill_toofull, acting
> [351,92,315]
> POOL_TARGET_SIZE_BYTES_OVERCOMMITTED 1 subtrees have overcommitted pool
> target_size_bytes
>     Pools ['hdb_backup'] overcommit available storage by 1.708x due to
> target_size_bytes    0  on pools []
> POOL_TARGET_SIZE_RATIO_OVERCOMMITTED 1 subtrees have overcommitted pool
> target_size_ratio
>     Pools ['hdb_backup'] overcommit available storage by 1.708x due to
> target_size_ratio 0.000 on pools []
> POOL_TOO_MANY_PGS 2 pools have too many placement groups
>     Pool hdd has 512 placement groups, should have 128
>     Pool pve_cephfs_metadata has 32 placement groups, should have 4
> REQUEST_SLOW 672 slow requests are blocked > 32 sec
>     249 ops are blocked > 2097.15 sec
>     284 ops are blocked > 1048.58 sec
>     108 ops are blocked > 524.288 sec
>     9 ops are blocked > 262.144 sec
>     22 ops are blocked > 131.072 sec
>     osd.9 has blocked requests > 524.288 sec
>     osds 0,2,6,68 have blocked requests > 1048.58 sec
>     osd.3 has blocked requests > 2097.15 sec
> REQUEST_STUCK 4752 stuck requests are blocked > 4096 sec
>     1431 ops are blocked > 67108.9 sec
>     513 ops are blocked > 33554.4 sec
>     909 ops are blocked > 16777.2 sec
>     1809 ops are blocked > 8388.61 sec
>     90 ops are blocked > 4194.3 sec
>     osd.63 has stuck requests > 67108.9 sec
>
>
> My interpretation is that Ceph
> a) is busy with remapping PGs of pool hdb_backup
> b) has identified several OSDs with either blocked or stuck requests.
>
> All of these OSDs belong to pool hdd, though.
> osd.9 belongs to node A, and osd.63 and osd.68 belong to node C (there are
> 4 nodes serving OSDs in the cluster).
>
> I have tried to fix this issue with
> - ceph osd set noout
> - a restart of the relevant OSDs via systemctl restart ceph-osd@<id>
> - and finally a server reboot,
> but none of this helped.
>
> I also tried to migrate the virtual disks to another pool, but this
> fails, too.
>
> There have been no changes on the server side, e.g. to the network or the disks.
>
> How can I resolve this issue?
>
> THX
> Thomas
> _______________________________________________
> ceph-users mailing list -- ceph-users@ceph.io
> To unsubscribe send an email to ceph-users-le...@ceph.io
>
