Hey All:

I've also quite frequently experienced this sort of issue with my Ceph
RBD-backed QEMU/KVM cluster (not OpenStack specifically). Would the
workaround of allowing the 'osd blacklist' command in the caps help in
that scenario as well, or is this OpenStack-specific functionality?

Thanks,
Joshua

On 2019-11-15 9:02 a.m., Florian Haas wrote:

On 15/11/2019 14:27, Simon Ironside wrote:
Hi Florian,

On 15/11/2019 12:32, Florian Haas wrote:

I received this off-list but then subsequently saw this message pop up
in the list archive, so I hope it's OK to reply on-list?

Of course, I just clicked the wrong reply button the first time.

So that cap was indeed missing, thanks for the hint! However, I am still
trying to understand how this is related to the issue we saw.

I had exactly the same happen to me as happened to you a week or so ago.
Compute node lost power and once restored the VMs would start booting
but fail early on when they tried to write.

My key was also missing that cap, adding it and resetting the affected
VMs was the only action I took to sort things out. I didn't need to go
around removing locks by hand as you did. As you say, waiting 30 seconds
didn't do any good so it doesn't appear to be a watcher thing.

Right, so suffice it to say that the article is at least somewhere between
incomplete and misleading. :)

This was mentioned in the release notes for Luminous[1], I'd missed it
too as I redeployed Nautilus instead and skipped these steps:

<snip>

Verify that all RBD client users have sufficient caps to blacklist other
client users. RBD client users with only "allow r" monitor caps should
be updated as follows:

# ceph auth caps client.<ID> mon 'allow r, allow command "osd blacklist"' osd '<existing OSD caps for user>'

<snip>
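
For anyone applying the release-note step above to several keys, here is a minimal dry-run sketch. It only prints the `ceph auth caps` command from the release notes so it can be reviewed before being run by hand. The helper name `build_caps_cmd`, the client ID `libvirt` and the OSD caps `allow rwx pool=vms` are hypothetical placeholders, not values from this thread; substitute the client's real, existing OSD caps.

```shell
#!/bin/sh
# Hypothetical helper: prints (does NOT execute) the caps update command
# quoted from the Luminous release notes.
#   $1 = client ID without the "client." prefix (assumed example: libvirt)
#   $2 = the client's existing OSD caps, passed through unchanged
build_caps_cmd() {
  client_id="$1"
  osd_caps="$2"
  printf "ceph auth caps client.%s mon 'allow r, allow command \"osd blacklist\"' osd '%s'\n" \
    "$client_id" "$osd_caps"
}

# Example with placeholder values; review the printed line before running it.
build_caps_cmd libvirt 'allow rwx pool=vms'
```

Since `ceph auth caps` replaces all of a key's caps at once, the existing OSD caps have to be repeated in full, which is why the sketch takes them as an explicit argument rather than guessing them.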

Yup, looks like we missed that bit of the release notes too (cluster has
been in production for several major releases now).

So it looks like we've got a fix for this. Thanks!

Also Wido, thanks for the reminder on profile rbd; we'll look into that too.

However, I'm still failing to wrap my head around the causality chain
here, and also around the interplay between watchers, locks, and
blacklists. If anyone could share some insight about this that I could
distill into a doc patch, I'd much appreciate that.

Cheers,
Florian
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
