We did have a peering storm; we're past that portion of the backfill and are
still seeing new instances of rbd volumes hanging, so it is definitely not just
the peering storm.

We still have 22.184% of objects misplaced, with a large number of pgs left to
backfill (around 75k). Our rbd pool is using about 1.7 PiB of storage, so as a
rough estimate we're looking at something like 370 TiB still to backfill. This
specific pool is replicated, with size=3.
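
As a back-of-the-envelope check on that figure (assuming the misplaced fraction
applies roughly evenly to this pool's USED footprint):

    0.22 x 1.7 PiB ≈ 0.37 PiB, i.e. on the order of 370-390 TiB still to move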

RAW STORAGE:
    CLASS     SIZE       AVAIL      USED       RAW USED     %RAW USED
    hdd       21 PiB     11 PiB     10 PiB       10 PiB         48.73
    TOTAL     21 PiB     11 PiB     10 PiB       10 PiB         48.73

POOLS:
    POOL      ID     PGS       STORED      OBJECTS     USED        %USED     MAX AVAIL
    pool1      4     32768     574 TiB     147.16M     1.7 PiB     68.87       260 TiB

We did see a lot of rbd volumes hang, often giving the buffer i/o errors sent
previously - whether that was due to the peering storm or the backfills is
uncertain. As suggested, we've already been detaching/reattaching the rbd
volumes, pushing the primary acting osd for the affected pgs to another osd,
and sometimes rebooting the vm to clear its kernel i/o queue. A combination of
those brings the rbd block device back for a while.
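
Roughly what those workarounds look like on our side (exact commands and names
are illustrative placeholders, not a literal transcript):

    # detach/reattach the rbd device on the client (or detach/reattach the
    # volume from the vm at the hypervisor level)
    rbd unmap -o force /dev/rbd0
    rbd map pool1/<volume>

    # move the primary for an affected pg off its current osd, e.g. by dropping
    # that osd's primary affinity so a different replica becomes primary
    ceph pg map <pgid>
    ceph osd primary-affinity osd.<id> 0

    # if the client-side i/o queue is still wedged, reboot the vm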

We're no longer in a peering storm, and we're seeing rbd volumes going into an
unresponsive state again - including volumes that were unresponsive before,
that we brought back with the steps above, and that have since gone
unresponsive again. All pgs are in an active state, some
active+remapped+backfilling, some active+undersized+remapped+backfilling, etc.
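
For reference, this is roughly how we're summarizing pg states (commands
illustrative):

    # count pgs by state - everything is active, mostly remapped/backfilling
    ceph pg dump pgs_brief 2>/dev/null | awk 'NR>1 {print $2}' | sort | uniq -c
    ceph pg ls backfilling | head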

We also run the object gateway off the same cluster, with the same backfill
going on, and the object gateway is not experiencing issues. The osds
participating in the backfill are not saturated with i/o, nor are they seeing
abnormal load compared to our usual backfill operations.
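
What we're looking at to judge osd saturation during the backfill (again
illustrative; ids are placeholders):

    # per-osd commit/apply latency as seen by the cluster
    ceph osd perf

    # on the osd hosts: device utilization and queue depth
    iostat -x 5

    # spot-check slow/long-running requests on an osd doing backfill
    ceph daemon osd.<id> dump_historic_ops | head -50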

But as the backfill continues, we're seeing rbd volumes on active pgs going
back into a blocked state. We can work around it much the same way, detaching
the volume and/or bouncing the pg to a new primary acting osd, but we'd rather
have these stop going unresponsive in the first place. Any suggestions in that
direction are greatly appreciated.