Some interesting updates on our end. 

This cluster (condor) is in a multisite RGW zonegroup with another cluster 
(albans). Albans is still on nautilus and was healthy back when we started this 
thread. As a last resort, we decided to destroy condor and recreate it, putting 
it back in the zonegroup with albans to restore all its data. This worked, but 
shortly after we completed the process, albans (still on nautilus) ran into the 
same issue we originally raised this thread about on condor.

So - we're now seeing this issue ("bluefs _allocate failed to expand slow 
device to fit...") on a nautilus cluster, running ceph 14.2.9 
(581f22da52345dba46ee232b73b990f06029a2a0) nautilus (stable). Once again, 2 of 
the 3 OSDs are flapping. This is with both allocators set to "bitmap" in the 
config. I've attached logs from one of the affected hosts (albans_sc1) in case 
there's any comparison to be made; to my eye the failure logs ultimately look 
the same.
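For reference, this is roughly how the bitmap allocators were pinned (a sketch, assuming the config is managed via the monitor config database; the option names are the standard nautilus/octopus BlueStore ones, but verify against your release):

```shell
# Sketch (assumed commands - verify option names for your Ceph release):
# pin both BlueStore allocators to "bitmap" via the monitor config database
ceph config set osd bluestore_allocator bitmap
ceph config set osd bluefs_allocator bitmap

# confirm the values the OSDs will pick up
ceph config get osd bluestore_allocator
ceph config get osd bluefs_allocator

# OSDs need a restart for the allocator change to take effect
systemctl restart ceph-osd.target
```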

Would that suggest this isn't specific to octopus at all? Or perhaps it's a 
result of having one cluster at octopus and one at nautilus within the same RGW 
zonegroup?

Something else I did wonder about - we've had alarms about "large omap objects" 
on these two clusters for several weeks now, certainly since before the OSDs 
started flapping. Albans is currently reporting 54 large omap objects; condor, 
which has been running healthily on octopus again since we redeployed it, has 
21. Could this be the underlying issue?
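In case it helps anyone comparing notes, this is roughly how we've been inspecting the large-omap warnings (a sketch of standard Ceph diagnostics; the grep patterns and the RGW guess are assumptions on our part, not confirmed findings):

```shell
# Sketch: tracking down the objects behind a LARGE_OMAP_OBJECTS warning.
# The health detail names the pools involved:
ceph health detail | grep -i "large omap"

# The cluster log records the offending PG and object when the warning fires:
ceph log last 1000 | grep -i "large omap object found"

# For RGW, large omaps are often under-resharded bucket indexes;
# this lists buckets at or over their per-shard object limits:
radosgw-admin bucket limit check
```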

One final thought: we're also using CephFS, which I think is less commonly 
used than other Ceph features. Could that be related, and explain why we're 
seeing this when other Ceph users aren't?

Any suggestions on next steps or other things to try would be greatly 
appreciated - we're out of ideas here!

Dave

_______________________________________________
ceph-users mailing list -- ceph-users@ceph.io
To unsubscribe send an email to ceph-users-le...@ceph.io