Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks
On 02/12/2019 16:48, Florian Haas wrote:
> Doc patch PR is here, for anyone who feels inclined to review:
>
> https://github.com/ceph/ceph/pull/31893

Landed, here's the new documentation:
https://docs.ceph.com/docs/master/rbd/rbd-exclusive-locks/

Thanks everyone for chiming in, and special thanks to Jason for the
detailed review!

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks
On 19/11/2019 22:42, Florian Haas wrote:
> On 19/11/2019 22:34, Jason Dillaman wrote:
>>> Oh totally, I wasn't arguing it was a bad idea for it to do what it
>>> does! I just got confused by the fact that our mon logs showed what
>>> looked like a (failed) attempt to blacklist an entire client IP address.
>>
>> There should have been an associated client nonce after the IP address
>> to uniquely identify which client connection is blacklisted --
>> something like "1.2.3.4:0/5678". Let me know if that's not the case
>> since that would definitely be wrong.
>
> English lacks a universally understood way to answer a negated question
> in the affirmative, so this is tricky to get right, but I'll try: No,
> that *is* the case, thus nothing is wrong. :)

Doc patch PR is here, for anyone who feels inclined to review:

https://github.com/ceph/ceph/pull/31893

Cheers,
Florian
Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks
On Tue, Nov 19, 2019 at 4:42 PM Florian Haas wrote:
>
> On 19/11/2019 22:34, Jason Dillaman wrote:
> >> Oh totally, I wasn't arguing it was a bad idea for it to do what it
> >> does! I just got confused by the fact that our mon logs showed what
> >> looked like a (failed) attempt to blacklist an entire client IP address.
> >
> > There should have been an associated client nonce after the IP address
> > to uniquely identify which client connection is blacklisted --
> > something like "1.2.3.4:0/5678". Let me know if that's not the case
> > since that would definitely be wrong.
>
> English lacks a universally understood way to answer a negated question
> in the affirmative, so this is tricky to get right, but I'll try: No,
> that *is* the case, thus nothing is wrong. :)

Haha -- thanks!

> Cheers,
> Florian

--
Jason
Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks
On 19/11/2019 22:34, Jason Dillaman wrote:
>> Oh totally, I wasn't arguing it was a bad idea for it to do what it
>> does! I just got confused by the fact that our mon logs showed what
>> looked like a (failed) attempt to blacklist an entire client IP address.
>
> There should have been an associated client nonce after the IP address
> to uniquely identify which client connection is blacklisted --
> something like "1.2.3.4:0/5678". Let me know if that's not the case
> since that would definitely be wrong.

English lacks a universally understood way to answer a negated question
in the affirmative, so this is tricky to get right, but I'll try: No,
that *is* the case, thus nothing is wrong. :)

Cheers,
Florian
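For reference, the blacklist entries Jason describes (address, port, and client nonce) can be inspected and managed from any host with sufficient caps. A sketch using the pre-Octopus command names; the address and nonce shown are illustrative, not taken from the thread:

```shell
# List current blacklist entries; each is shown as IP:port/nonce
# together with its expiry time (port 0 is typical for client
# connections, and the nonce distinguishes client instances on a host)
ceph osd blacklist ls

# Blacklist a specific client instance by address and nonce for
# 3600 seconds (address/nonce are illustrative)
ceph osd blacklist add 1.2.3.4:0/5678 3600

# Remove an entry early, e.g. once the node has recovered in place
ceph osd blacklist rm 1.2.3.4:0/5678
```

These commands require monitor caps that permit blacklist operations, which is exactly what "profile rbd" (or the interim 'allow command "osd blacklist"' cap) grants.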
Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks
On Tue, Nov 19, 2019 at 4:31 PM Florian Haas wrote:
>
> On 19/11/2019 22:19, Jason Dillaman wrote:
> > On Tue, Nov 19, 2019 at 4:09 PM Florian Haas wrote:
> >>
> >> On 19/11/2019 21:32, Jason Dillaman wrote:
> >>>> What, exactly, is the "reasonably configured hypervisor" here, in other
> >>>> words, what is it that grabs and releases this lock? It's evidently not
> >>>> Nova that does this, but is it libvirt, or Qemu/KVM, and if so, what
> >>>> magic in there makes this happen, and what "reasonable configuration"
> >>>> influences this?
> >>>
> >>> librbd and krbd perform this logic when the exclusive-lock feature is
> >>> enabled.
> >>
> >> Right. So the "reasonable configuration" applies to the features they
> >> enable when they *create* an image, rather than what they do to the
> >> image at runtime. Is that fair to say?
> >
> > The exclusive-lock ownership is enforced at image use (i.e. when the
> > feature is a property of the image, not specifically just during the
> > action of enabling the property) -- so this implies "what they do to
> > the image at runtime"
>
> OK, gotcha.
>
> >>> In this case, librbd sees that the previous lock owner is
> >>> dead / missing, but before it can steal the lock (since librbd did not
> >>> cleanly close the image), it needs to ensure it cannot come back from
> >>> the dead to issue future writes against the RBD image by blacklisting
> >>> it from the cluster.
> >>
> >> Thanks. I'm probably sounding dense here, sorry for that, but yes, this
> >> makes perfect sense to me when I want to fence a whole node off —
> >> however, how exactly does this work with VM recovery in place?
> >
> > How would librbd / krbd know under what situation a VM was being
> > "recovered"? Should librbd be expected to integrate w/ IPMI devices
> > where the VM is being run or w/ Zabbix alert monitoring to know that
> > this was a power failure so don't expect that the lock owner will come
> > back up? The safe and generic thing for librbd / krbd to do in this
> > situation is to just blacklist the old lock owner to ensure it cannot
> > talk to the cluster. Obviously in the case of a physically failed
> > node, that won't ever happen -- but I think we can all agree this is
> > the sane recovery path that covers all bases.
>
> Oh totally, I wasn't arguing it was a bad idea for it to do what it
> does! I just got confused by the fact that our mon logs showed what
> looked like a (failed) attempt to blacklist an entire client IP address.

There should have been an associated client nonce after the IP address
to uniquely identify which client connection is blacklisted --
something like "1.2.3.4:0/5678". Let me know if that's not the case
since that would definitely be wrong.

> > Yup, with the correct permissions librbd / rbd will be able to
> > blacklist the lock owner, break the old lock, and acquire the lock
> > themselves for R/W operations -- and the operator would not need to
> > intervene.
>
> Ack. Thanks!
>
> Cheers,
> Florian

--
Jason
Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks
On 19/11/2019 22:19, Jason Dillaman wrote:
> On Tue, Nov 19, 2019 at 4:09 PM Florian Haas wrote:
>>
>> On 19/11/2019 21:32, Jason Dillaman wrote:
>>>> What, exactly, is the "reasonably configured hypervisor" here, in other
>>>> words, what is it that grabs and releases this lock? It's evidently not
>>>> Nova that does this, but is it libvirt, or Qemu/KVM, and if so, what
>>>> magic in there makes this happen, and what "reasonable configuration"
>>>> influences this?
>>>
>>> librbd and krbd perform this logic when the exclusive-lock feature is
>>> enabled.
>>
>> Right. So the "reasonable configuration" applies to the features they
>> enable when they *create* an image, rather than what they do to the
>> image at runtime. Is that fair to say?
>
> The exclusive-lock ownership is enforced at image use (i.e. when the
> feature is a property of the image, not specifically just during the
> action of enabling the property) -- so this implies "what they do to
> the image at runtime"

OK, gotcha.

>>> In this case, librbd sees that the previous lock owner is
>>> dead / missing, but before it can steal the lock (since librbd did not
>>> cleanly close the image), it needs to ensure it cannot come back from
>>> the dead to issue future writes against the RBD image by blacklisting
>>> it from the cluster.
>>
>> Thanks. I'm probably sounding dense here, sorry for that, but yes, this
>> makes perfect sense to me when I want to fence a whole node off —
>> however, how exactly does this work with VM recovery in place?
>
> How would librbd / krbd know under what situation a VM was being
> "recovered"? Should librbd be expected to integrate w/ IPMI devices
> where the VM is being run or w/ Zabbix alert monitoring to know that
> this was a power failure so don't expect that the lock owner will come
> back up? The safe and generic thing for librbd / krbd to do in this
> situation is to just blacklist the old lock owner to ensure it cannot
> talk to the cluster. Obviously in the case of a physically failed
> node, that won't ever happen -- but I think we can all agree this is
> the sane recovery path that covers all bases.

Oh totally, I wasn't arguing it was a bad idea for it to do what it
does! I just got confused by the fact that our mon logs showed what
looked like a (failed) attempt to blacklist an entire client IP address.

> Yup, with the correct permissions librbd / rbd will be able to
> blacklist the lock owner, break the old lock, and acquire the lock
> themselves for R/W operations -- and the operator would not need to
> intervene.

Ack. Thanks!

Cheers,
Florian
Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks
On Tue, Nov 19, 2019 at 4:09 PM Florian Haas wrote:
>
> On 19/11/2019 21:32, Jason Dillaman wrote:
> >> What, exactly, is the "reasonably configured hypervisor" here, in other
> >> words, what is it that grabs and releases this lock? It's evidently not
> >> Nova that does this, but is it libvirt, or Qemu/KVM, and if so, what
> >> magic in there makes this happen, and what "reasonable configuration"
> >> influences this?
> >
> > librbd and krbd perform this logic when the exclusive-lock feature is
> > enabled.
>
> Right. So the "reasonable configuration" applies to the features they
> enable when they *create* an image, rather than what they do to the
> image at runtime. Is that fair to say?

The exclusive-lock ownership is enforced at image use (i.e. when the
feature is a property of the image, not specifically just during the
action of enabling the property) -- so this implies "what they do to
the image at runtime"

> > In this case, librbd sees that the previous lock owner is
> > dead / missing, but before it can steal the lock (since librbd did not
> > cleanly close the image), it needs to ensure it cannot come back from
> > the dead to issue future writes against the RBD image by blacklisting
> > it from the cluster.
>
> Thanks. I'm probably sounding dense here, sorry for that, but yes, this
> makes perfect sense to me when I want to fence a whole node off —
> however, how exactly does this work with VM recovery in place?

How would librbd / krbd know under what situation a VM was being
"recovered"? Should librbd be expected to integrate w/ IPMI devices
where the VM is being run or w/ Zabbix alert monitoring to know that
this was a power failure so don't expect that the lock owner will come
back up? The safe and generic thing for librbd / krbd to do in this
situation is to just blacklist the old lock owner to ensure it cannot
talk to the cluster. Obviously in the case of a physically failed
node, that won't ever happen -- but I think we can all agree this is
the sane recovery path that covers all bases.

> From further upthread:
>
> > Semi-relatedly, as I understand it OSD blacklisting happens based either
> > on an IP address, or on a socket address (IP:port). While this comes in
> > handy in host evacuation, it doesn't in in-place recovery (see question
> > 4 in my original message).
> >
> > - If the blacklist happens based on IP address alone (and that's what
> > seems to be what the client attempts to be doing, based on our log
> > messages), then it would break recovery-in-place after a hard reboot
> > altogether.
> >
> > - Even if the client would blacklist based on an address:port pair, it
> > would be just very unlikely that an RBD client used the same source port
> > to connect after the node recovers in place, but not impossible.
>
> Clearly though, if people set their permissions correctly then this
> blacklisting seems to work fine even for recovery-in-place, so no reason
> for me to doubt that, I'd just really like to understand the mechanics. :)

Yup, with the correct permissions librbd / rbd will be able to
blacklist the lock owner, break the old lock, and acquire the lock
themselves for R/W operations -- and the operator would not need to
intervene.

> Thanks again!
>
> Cheers,
> Florian

--
Jason
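The lock-steal mechanism Jason describes can be observed from the rbd CLI. A sketch with an illustrative pool/image name; the `<lock-id>` and `<locker>` placeholders come from the `rbd lock ls` output on your own cluster:

```shell
# Show watchers on an image; a live client appears as IP:port/nonce
rbd status volumes/volume-a1b2

# Show the current exclusive lock holder, if any
rbd lock ls volumes/volume-a1b2

# Manual removal of a stale lock -- normally unnecessary, since librbd
# blacklists the dead owner and breaks the lock by itself when the
# client's caps permit blacklisting
rbd lock rm volumes/volume-a1b2 <lock-id> <locker>
```

If the automatic path works (i.e. caps are correct), simply resetting the affected VM is enough and the manual `rbd lock rm` step never comes into play.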
Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks
On 19/11/2019 21:32, Jason Dillaman wrote:
>> What, exactly, is the "reasonably configured hypervisor" here, in other
>> words, what is it that grabs and releases this lock? It's evidently not
>> Nova that does this, but is it libvirt, or Qemu/KVM, and if so, what
>> magic in there makes this happen, and what "reasonable configuration"
>> influences this?
>
> librbd and krbd perform this logic when the exclusive-lock feature is
> enabled.

Right. So the "reasonable configuration" applies to the features they
enable when they *create* an image, rather than what they do to the
image at runtime. Is that fair to say?

> In this case, librbd sees that the previous lock owner is
> dead / missing, but before it can steal the lock (since librbd did not
> cleanly close the image), it needs to ensure it cannot come back from
> the dead to issue future writes against the RBD image by blacklisting
> it from the cluster.

Thanks. I'm probably sounding dense here, sorry for that, but yes, this
makes perfect sense to me when I want to fence a whole node off —
however, how exactly does this work with VM recovery in place?

From further upthread:

> Semi-relatedly, as I understand it OSD blacklisting happens based either
> on an IP address, or on a socket address (IP:port). While this comes in
> handy in host evacuation, it doesn't in in-place recovery (see question
> 4 in my original message).
>
> - If the blacklist happens based on IP address alone (and that's what
> seems to be what the client attempts to be doing, based on our log
> messages), then it would break recovery-in-place after a hard reboot
> altogether.
>
> - Even if the client would blacklist based on an address:port pair, it
> would be just very unlikely that an RBD client used the same source port
> to connect after the node recovers in place, but not impossible.

Clearly though, if people set their permissions correctly then this
blacklisting seems to work fine even for recovery-in-place, so no reason
for me to doubt that, I'd just really like to understand the mechanics. :)

Thanks again!

Cheers,
Florian
Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks
On Tue, Nov 19, 2019 at 2:49 PM Florian Haas wrote:
>
> On 19/11/2019 20:03, Jason Dillaman wrote:
> > On Tue, Nov 19, 2019 at 1:51 PM shubjero wrote:
> >>
> >> Florian,
> >>
> >> Thanks for posting about this issue. This is something that we have
> >> been experiencing (stale exclusive locks) with our OpenStack and Ceph
> >> cloud more frequently as our datacentre has had some reliability
> >> issues recently with power and cooling causing several unexpected
> >> shutdowns.
> >>
> >> At this point we are on Ceph Mimic 13.2.6 and reading through this
> >> thread and related links I just wanted to confirm if I have the
> >> correct caps for cinder clients as listed below as we have upgraded
> >> through many major Ceph versions over the years and I'm sure a lot of
> >> our configs and settings still contain deprecated options.
> >>
> >> client.cinder
> >> key: sanitized==
> >> caps: [mgr] allow r
> >> caps: [mon] profile rbd
> >> caps: [osd] allow class-read object_prefix rbd_children, profile rbd
> >> pool=volumes, profile rbd pool=vms, profile rbd pool=images
> >
> > Only use "profile rbd" for 'mon' and 'osd' caps -- it's documented
> > here [1]. Once you use 'profile rbd', you don't need the extra "allow
> > class-read object_prefix rbd_children" since it is included within the
> > profile (along with other things like support for clone v2). Octopus
> > will also include "profile rbd" for the 'mgr' cap to support the new
> > functionality in the "rbd_support" manager module (like running "rbd
> > perf image top" w/o the admin caps).
> >
> >> From what I read, the blacklist permission was something that was
> >> supposed to be applied pre-Luminous upgrade but once you are on
> >> Luminous or later, it's no longer needed assuming you have switched to
> >> using the rbd profile.
> >
> > Correct. The "blacklist" permission was an intermediate state
> > pre-upgrade since your older OSDs wouldn't have support for "profile
> > rbd" yet but Luminous OSDs started to enforce caps on the 'blacklist
> > add' op so that rogue users w/ read-only permissions couldn't just
> > blacklist all clients. Once you are at Luminous or later, you can just
> > use the profile.
>
> OK, great. This gives me something to start with for a doc patch.
> Thanks! However, I'm still curious about this bit:
>
> >> On Fri, Nov 15, 2019 at 11:05 AM Paul Emmerich
> >> wrote:
> >>> * This is unrelated to openstack and will happen with *any* reasonably
> >>> configured hypervisor that uses exclusive locking
>
> What, exactly, is the "reasonably configured hypervisor" here, in other
> words, what is it that grabs and releases this lock? It's evidently not
> Nova that does this, but is it libvirt, or Qemu/KVM, and if so, what
> magic in there makes this happen, and what "reasonable configuration"
> influences this?

librbd and krbd perform this logic when the exclusive-lock feature is
enabled. In this case, librbd sees that the previous lock owner is
dead / missing, but before it can steal the lock (since librbd did not
cleanly close the image), it needs to ensure it cannot come back from
the dead to issue future writes against the RBD image by blacklisting
it from the cluster.

> Thanks again!
>
> Cheers,
> Florian

--
Jason
Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks
On 19/11/2019 20:03, Jason Dillaman wrote:
> On Tue, Nov 19, 2019 at 1:51 PM shubjero wrote:
>>
>> Florian,
>>
>> Thanks for posting about this issue. This is something that we have
>> been experiencing (stale exclusive locks) with our OpenStack and Ceph
>> cloud more frequently as our datacentre has had some reliability
>> issues recently with power and cooling causing several unexpected
>> shutdowns.
>>
>> At this point we are on Ceph Mimic 13.2.6 and reading through this
>> thread and related links I just wanted to confirm if I have the
>> correct caps for cinder clients as listed below as we have upgraded
>> through many major Ceph versions over the years and I'm sure a lot of
>> our configs and settings still contain deprecated options.
>>
>> client.cinder
>> key: sanitized==
>> caps: [mgr] allow r
>> caps: [mon] profile rbd
>> caps: [osd] allow class-read object_prefix rbd_children, profile rbd
>> pool=volumes, profile rbd pool=vms, profile rbd pool=images
>
> Only use "profile rbd" for 'mon' and 'osd' caps -- it's documented
> here [1]. Once you use 'profile rbd', you don't need the extra "allow
> class-read object_prefix rbd_children" since it is included within the
> profile (along with other things like support for clone v2). Octopus
> will also include "profile rbd" for the 'mgr' cap to support the new
> functionality in the "rbd_support" manager module (like running "rbd
> perf image top" w/o the admin caps).
>
>> From what I read, the blacklist permission was something that was
>> supposed to be applied pre-Luminous upgrade but once you are on
>> Luminous or later, it's no longer needed assuming you have switched to
>> using the rbd profile.
>
> Correct. The "blacklist" permission was an intermediate state
> pre-upgrade since your older OSDs wouldn't have support for "profile
> rbd" yet but Luminous OSDs started to enforce caps on the 'blacklist
> add' op so that rogue users w/ read-only permissions couldn't just
> blacklist all clients. Once you are at Luminous or later, you can just
> use the profile.

OK, great. This gives me something to start with for a doc patch.
Thanks! However, I'm still curious about this bit:

>> On Fri, Nov 15, 2019 at 11:05 AM Paul Emmerich
>> wrote:
>>> * This is unrelated to openstack and will happen with *any* reasonably
>>> configured hypervisor that uses exclusive locking

What, exactly, is the "reasonably configured hypervisor" here, in other
words, what is it that grabs and releases this lock? It's evidently not
Nova that does this, but is it libvirt, or Qemu/KVM, and if so, what
magic in there makes this happen, and what "reasonable configuration"
influences this?

Thanks again!

Cheers,
Florian
Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks
On Tue, Nov 19, 2019 at 1:51 PM shubjero wrote:
>
> Florian,
>
> Thanks for posting about this issue. This is something that we have
> been experiencing (stale exclusive locks) with our OpenStack and Ceph
> cloud more frequently as our datacentre has had some reliability
> issues recently with power and cooling causing several unexpected
> shutdowns.
>
> At this point we are on Ceph Mimic 13.2.6 and reading through this
> thread and related links I just wanted to confirm if I have the
> correct caps for cinder clients as listed below as we have upgraded
> through many major Ceph versions over the years and I'm sure a lot of
> our configs and settings still contain deprecated options.
>
> client.cinder
> key: sanitized==
> caps: [mgr] allow r
> caps: [mon] profile rbd
> caps: [osd] allow class-read object_prefix rbd_children, profile rbd
> pool=volumes, profile rbd pool=vms, profile rbd pool=images

Only use "profile rbd" for 'mon' and 'osd' caps -- it's documented
here [1]. Once you use 'profile rbd', you don't need the extra "allow
class-read object_prefix rbd_children" since it is included within the
profile (along with other things like support for clone v2). Octopus
will also include "profile rbd" for the 'mgr' cap to support the new
functionality in the "rbd_support" manager module (like running "rbd
perf image top" w/o the admin caps).

> From what I read, the blacklist permission was something that was
> supposed to be applied pre-Luminous upgrade but once you are on
> Luminous or later, it's no longer needed assuming you have switched to
> using the rbd profile.

Correct. The "blacklist" permission was an intermediate state
pre-upgrade since your older OSDs wouldn't have support for "profile
rbd" yet but Luminous OSDs started to enforce caps on the 'blacklist
add' op so that rogue users w/ read-only permissions couldn't just
blacklist all clients. Once you are at Luminous or later, you can just
use the profile.

> On Fri, Nov 15, 2019 at 11:05 AM Paul Emmerich wrote:
> >
> > To clear up a few misconceptions here:
> >
> > * RBD keyrings should use the "profile rbd" permissions, everything
> > else is *wrong* and should be fixed asap
> > * Manually adding the blacklist permission might work but isn't
> > future-proof, fix the keyring instead
> > * The suggestion to mount them elsewhere to fix this only works
> > because "elsewhere" probably has an admin keyring, this is a bad
> > work-around, fix the keyring instead
> > * This is unrelated to openstack and will happen with *any* reasonably
> > configured hypervisor that uses exclusive locking
> >
> > This problem usually happens after upgrading to Luminous without
> > reading the change log. The change log tells you to adjust the keyring
> > permissions accordingly
> >
> > Paul
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90
> >
> > On Fri, Nov 15, 2019 at 4:56 PM Joshua M. Boniface wrote:
> > >
> > > Thanks Simon! I've implemented it, I guess I'll test it out next time my
> > > homelab's power dies :-)
> > >
> > > On 2019-11-15 10:54 a.m., Simon Ironside wrote:
> > >
> > > On 15/11/2019 15:44, Joshua M. Boniface wrote:
> > >
> > > Hey All:
> > >
> > > I've also quite frequently experienced this sort of issue with my Ceph
> > > RBD-backed QEMU/KVM cluster (not OpenStack specifically). Should this
> > > workaround of allowing the 'osd blacklist' command in the caps help in
> > > that scenario as well, or is this an OpenStack-specific functionality?
> > >
> > > Yes, my use case is RBD backed QEMU/KVM too, not Openstack. It's
> > > required for all RBD clients.
> > >
> > > Simon

[1] https://docs.ceph.com/docs/master/rbd/rbd-openstack/#setup-ceph-client-authentication

--
Jason
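Applied to the client.cinder user from shubjero's listing, Jason's advice could look like the following sketch. The pool names come from that listing; the 'mgr' cap is left at "allow r" since, per the message above, "profile rbd" for mgr only arrives with Octopus:

```shell
# Replace the legacy osd cap string with pool-scoped rbd profiles;
# "profile rbd" already includes the old "allow class-read
# object_prefix rbd_children" grant
ceph auth caps client.cinder \
  mgr 'allow r' \
  mon 'profile rbd' \
  osd 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd pool=images'

# Verify the result
ceph auth get client.cinder
```

Note that "ceph auth caps" replaces all caps for the entity at once, so every daemon type must be restated in the command, not just the one being changed.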
Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks
Florian,

Thanks for posting about this issue. This is something that we have
been experiencing (stale exclusive locks) with our OpenStack and Ceph
cloud more frequently as our datacentre has had some reliability
issues recently with power and cooling causing several unexpected
shutdowns.

At this point we are on Ceph Mimic 13.2.6 and reading through this
thread and related links I just wanted to confirm if I have the
correct caps for cinder clients as listed below as we have upgraded
through many major Ceph versions over the years and I'm sure a lot of
our configs and settings still contain deprecated options.

client.cinder
key: sanitized==
caps: [mgr] allow r
caps: [mon] profile rbd
caps: [osd] allow class-read object_prefix rbd_children, profile rbd
pool=volumes, profile rbd pool=vms, profile rbd pool=images

From what I read, the blacklist permission was something that was
supposed to be applied pre-Luminous upgrade but once you are on
Luminous or later, it's no longer needed assuming you have switched to
using the rbd profile.

On Fri, Nov 15, 2019 at 11:05 AM Paul Emmerich wrote:
>
> To clear up a few misconceptions here:
>
> * RBD keyrings should use the "profile rbd" permissions, everything
> else is *wrong* and should be fixed asap
> * Manually adding the blacklist permission might work but isn't
> future-proof, fix the keyring instead
> * The suggestion to mount them elsewhere to fix this only works
> because "elsewhere" probably has an admin keyring, this is a bad
> work-around, fix the keyring instead
> * This is unrelated to openstack and will happen with *any* reasonably
> configured hypervisor that uses exclusive locking
>
> This problem usually happens after upgrading to Luminous without
> reading the change log. The change log tells you to adjust the keyring
> permissions accordingly
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Fri, Nov 15, 2019 at 4:56 PM Joshua M. Boniface wrote:
> >
> > Thanks Simon! I've implemented it, I guess I'll test it out next time my
> > homelab's power dies :-)
> >
> > On 2019-11-15 10:54 a.m., Simon Ironside wrote:
> >
> > On 15/11/2019 15:44, Joshua M. Boniface wrote:
> >
> > Hey All:
> >
> > I've also quite frequently experienced this sort of issue with my Ceph
> > RBD-backed QEMU/KVM cluster (not OpenStack specifically). Should this
> > workaround of allowing the 'osd blacklist' command in the caps help in
> > that scenario as well, or is this an OpenStack-specific functionality?
> >
> > Yes, my use case is RBD backed QEMU/KVM too, not Openstack. It's
> > required for all RBD clients.
> >
> > Simon
Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks
To clear up a few misconceptions here:

* RBD keyrings should use the "profile rbd" permissions, everything
else is *wrong* and should be fixed asap
* Manually adding the blacklist permission might work but isn't
future-proof, fix the keyring instead
* The suggestion to mount them elsewhere to fix this only works
because "elsewhere" probably has an admin keyring, this is a bad
work-around, fix the keyring instead
* This is unrelated to openstack and will happen with *any* reasonably
configured hypervisor that uses exclusive locking

This problem usually happens after upgrading to Luminous without
reading the change log. The change log tells you to adjust the keyring
permissions accordingly

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Nov 15, 2019 at 4:56 PM Joshua M. Boniface wrote:
>
> Thanks Simon! I've implemented it, I guess I'll test it out next time my
> homelab's power dies :-)
>
> On 2019-11-15 10:54 a.m., Simon Ironside wrote:
>
> On 15/11/2019 15:44, Joshua M. Boniface wrote:
>
> Hey All:
>
> I've also quite frequently experienced this sort of issue with my Ceph
> RBD-backed QEMU/KVM cluster (not OpenStack specifically). Should this
> workaround of allowing the 'osd blacklist' command in the caps help in
> that scenario as well, or is this an OpenStack-specific functionality?
>
> Yes, my use case is RBD backed QEMU/KVM too, not Openstack. It's
> required for all RBD clients.
>
> Simon
Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks
Thanks Simon! I've implemented it, I guess I'll test it out next time my
homelab's power dies :-)

On 2019-11-15 10:54 a.m., Simon Ironside wrote:
> On 15/11/2019 15:44, Joshua M. Boniface wrote:
>> Hey All:
>>
>> I've also quite frequently experienced this sort of issue with my Ceph
>> RBD-backed QEMU/KVM cluster (not OpenStack specifically). Should this
>> workaround of allowing the 'osd blacklist' command in the caps help in
>> that scenario as well, or is this an OpenStack-specific functionality?
>
> Yes, my use case is RBD backed QEMU/KVM too, not Openstack. It's
> required for all RBD clients.
>
> Simon
Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks
On 15/11/2019 15:44, Joshua M. Boniface wrote: Hey All: I've also quite frequently experienced this sort of issue with my Ceph RBD-backed QEMU/KVM cluster (not OpenStack specifically). Should this workaround of allowing the 'osd blacklist' command in the caps help in that scenario as well, or is this an OpenStack-specific functionality? Yes, my use case is RBD backed QEMU/KVM too, not Openstack. It's required for all RBD clients. Simon ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks
Hey All: I've also quite frequently experienced this sort of issue with my Ceph RBD-backed QEMU/KVM cluster (not OpenStack specifically). Should this workaround of allowing the 'osd blacklist' command in the caps help in that scenario as well, or is this an OpenStack-specific functionality? Thanks, Joshua On 2019-11-15 9:02 a.m., Florian Haas wrote: On 15/11/2019 14:27, Simon Ironside wrote: Hi Florian, On 15/11/2019 12:32, Florian Haas wrote: I received this off-list but then subsequently saw this message pop up in the list archive, so I hope it's OK to reply on-list? Of course, I just clicked the wrong reply button the first time. So that cap was indeed missing, thanks for the hint! However, I am still trying to understand how this is related to the issue we saw. I had exactly the same happen to me as happened to you a week or so ago. Compute node lost power and once restored the VMs would start booting but fail early on when they tried to write. My key was also missing that cap, adding it and resetting the affected VMs was the only action I took to sort things out. I didn't need to go around removing locks by hand as you did. As you say, waiting 30 seconds didn't do any good so it doesn't appear to be a watcher thing. Right, so suffice to say that that article is at least somewhere between incomplete and misleading. :) This was mentioned in the release notes for Luminous[1], I'd missed it too as I redeployed Nautilus instead and skipped these steps: Verify that all RBD client users have sufficient caps to blacklist other client users. RBD client users with only "allow r" monitor caps should be updated as follows: # ceph auth caps client. mon 'allow r, allow command "osd blacklist"' osd '' Yup, looks like we missed that bit of the release notes too (cluster has been in production for several major releases now). So it looks like we've got a fix for this. Thanks! Also Wido, thanks for the reminder on profile rbd; we'll look into that too. 
However, I'm still failing to wrap my head around the causality chain here, and also around the interplay between watchers, locks, and blacklists. If anyone could share some insight about this that I could distill into a doc patch, I'd much appreciate that. Cheers, Florian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks
Hi, To solve the issue, map the image with "rbd map pool/disk_id" and mount the / volume on a Linux machine (a Ceph node will be fine). This flushes the journal and closes and discards the pending changes left in the OpenStack nodes' cache; then unmount and "rbd unmap". Boot the instance from OpenStack again and voilà, it will work. For Windows instances you must use ntfsfix on a Linux computer, with the same map/unmap commands. Regards, Manuel -Original Message- From: ceph-users On behalf of Simon Ironside Sent: Friday, 15 November 2019 14:28 To: ceph-users Subject: Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks Hi Florian, On 15/11/2019 12:32, Florian Haas wrote: > I received this off-list but then subsequently saw this message pop up > in the list archive, so I hope it's OK to reply on-list? Of course, I just clicked the wrong reply button the first time. > So that cap was indeed missing, thanks for the hint! However, I am > still trying to understand how this is related to the issue we saw. I had exactly the same happen to me as happened to you a week or so ago. Compute node lost power and once restored the VMs would start booting but fail early on when they tried to write. My key was also missing that cap, adding it and resetting the affected VMs was the only action I took to sort things out. I didn't need to go around removing locks by hand as you did. As you say, waiting 30 seconds didn't do any good so it doesn't appear to be a watcher thing. This was mentioned in the release notes for Luminous[1], I'd missed it too as I redeployed Nautilus instead and skipped these steps: Verify that all RBD client users have sufficient caps to blacklist other client users. RBD client users with only "allow r" monitor caps should be updated as follows: # ceph auth caps client.
mon 'allow r, allow command "osd blacklist"' osd '' Simon [1] https://docs.ceph.com/docs/master/releases/luminous/#upgrade-from-jewel-or-kraken ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
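Manuel's workaround spelled out step by step. The pool/image names and rbd device node are placeholders; note also Paul Emmerich's caveat elsewhere in the thread that this only works because the node doing the mapping likely has an admin keyring, so treat it as a stopgap, not a fix:

```shell
# On a node with sufficient caps (e.g. a Ceph node), map the image;
# pool and image names here are examples.
rbd map vms/instance-0000002a_disk

# Mounting a Linux guest's root filesystem replays its dirty journal.
mount /dev/rbd0 /mnt
umount /mnt

# For a Windows guest, repair NTFS instead of mounting:
#   ntfsfix /dev/rbd0p2    # partition number is an example

# Unmap, then boot the instance from OpenStack again.
rbd unmap /dev/rbd0
```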
Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks
Hi Florian, On 15/11/2019 12:32, Florian Haas wrote: I received this off-list but then subsequently saw this message pop up in the list archive, so I hope it's OK to reply on-list? Of course, I just clicked the wrong reply button the first time. So that cap was indeed missing, thanks for the hint! However, I am still trying to understand how this is related to the issue we saw. I had exactly the same happen to me as happened to you a week or so ago. Compute node lost power and once restored the VMs would start booting but fail early on when they tried to write. My key was also missing that cap, adding it and resetting the affected VMs was the only action I took to sort things out. I didn't need to go around removing locks by hand as you did. As you say, waiting 30 seconds didn't do any good so it doesn't appear to be a watcher thing. This was mentioned in the release notes for Luminous[1], I'd missed it too as I redeployed Nautilus instead and skipped these steps: Verify that all RBD client users have sufficient caps to blacklist other client users. RBD client users with only "allow r" monitor caps should be updated as follows: # ceph auth caps client. mon 'allow r, allow command "osd blacklist"' osd '' Simon [1] https://docs.ceph.com/docs/master/releases/luminous/#upgrade-from-jewel-or-kraken ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
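The release-note command quoted above lost its client name and OSD caps somewhere along the way; here is a reconstruction under assumptions, with "client.cinder" and the OSD caps shown being placeholders only (keep whatever OSD caps the key already has):

```shell
# List your keys with "ceph auth ls" to find the right client name.
# Preserve the key's existing OSD caps; the ones below are examples.
ceph auth caps client.cinder \
    mon 'allow r, allow command "osd blacklist"' \
    osd 'allow class-read object_prefix rbd_children, allow rwx pool=volumes'
```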
Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks
On 15/11/2019 11:23, Simon Ironside wrote:
> Hi Florian,
>
> Any chance the key your compute nodes are using for the RBD pool is missing 'allow command "osd blacklist"' from its mon caps?
>
> Simon

Hi Simon,

I received this off-list but then subsequently saw this message pop up in the list archive, so I hope it's OK to reply on-list?

So that cap was indeed missing, thanks for the hint! However, I am still trying to understand how this is related to the issue we saw.

The only documentation-ish article that I found about osd blacklist caps is this: https://access.redhat.com/solutions/3391211

We can also confirm a bunch of "access denied" messages in the mon logs when trying to blacklist a client. So the content of that article definitely applies to our situation; I'm just not sure I follow how the absence of that capability caused this issue.

The article talks about RBD watchers, not locks. To the best of my knowledge, a watcher operates like a lease on the image, which is periodically renewed. If not renewed within 30 seconds of client inactivity, the cluster considers the client dead. (Please correct me if I'm wrong.) For us, that didn't help. We had to actively remove locks with "rbd lock rm". Is the article using the wrong terms? Is there a link between watchers and locks that I'm unaware of?

Semi-relatedly, as I understand it OSD blacklisting happens based either on an IP address, or on a socket address (IP:port). While this comes in handy in host evacuation, it doesn't help with in-place recovery (see question 4 in my original message).

- If the blacklist happens based on the IP address alone (and that seems to be what the client attempts to do, based on our log messages), then it would break recovery-in-place after a hard reboot altogether.
- Even if the client blacklisted based on an address:port pair, it would just be very unlikely, though not impossible, for an RBD client to use the same source port to connect after the node recovers in place.
So I am wondering: is this incorrect documentation, or incorrect behavior, or am I simply making dead-wrong assumptions? Cheers, Florian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
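For untangling the three concepts Florian is asking about, these read-only commands show each one (the pool/image names are examples):

```shell
# Watchers: the lease-like sessions that time out on their own.
rbd status vms/instance-0000002a_disk

# Locks: held until explicitly released or broken.
rbd lock ls vms/instance-0000002a_disk

# Blacklist entries: note the nonce after the IP (e.g. 1.2.3.4:0/5678),
# which identifies one client instance rather than the whole host.
ceph osd blacklist ls
```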
Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks
On 11/15/19 11:24 AM, Simon Ironside wrote:
> Hi Florian,
>
> Any chance the key your compute nodes are using for the RBD pool is missing 'allow command "osd blacklist"' from its mon caps?
>

Added to this I recommend using 'profile rbd' for the mon caps. As also stated in the OpenStack docs:

https://docs.ceph.com/docs/master/rbd/rbd-openstack/#setup-ceph-client-authentication

Wido

> Simon
>
> On 15/11/2019 08:19, Florian Haas wrote:
>> Hi everyone,
>>
>> I'm trying to wrap my head around an issue we recently saw, as it relates to RBD locks, Qemu/KVM, and libvirt.
>>
>> Our data center graced us with a sudden and complete dual-feed power failure that affected both a Ceph cluster (Luminous, 12.2.12), and OpenStack compute nodes that used RBDs in that Ceph cluster. (Yes, these things really happen, even in 2019.)
>>
>> Once nodes were powered back up, the Ceph cluster came up gracefully with no intervention required — all we saw was some Mon clock skew until NTP peers had fully synced. Yay! However, our Nova compute nodes, or rather the libvirt VMs that were running on them, were in not so great a shape. The VMs booted up fine initially, but then blew up as soon as they were trying to write to their RBD-backed virtio devices — which, of course, was very early in the boot sequence as they had dirty filesystem journals to apply.
>>
>> Being able to read from, but not write to, RBDs is usually an issue with exclusive locking, so we stopped one of the affected VMs, checked the RBD locks on its device, and found (with rbd lock ls) that the lock was still being held even after the VM was definitely down — both "openstack server show" and "virsh domstate" agreed on this. We manually cleared the lock (rbd lock rm), started the VM, and it booted up fine.
>>
>> Repeat for all VMs, and we were back in business.
>>
>> If I understand correctly, image locks — in contrast to image watchers — have no timeout, so locks must always be explicitly released, or they linger forever.
>>
>> So that raises a few questions:
>>
>> (1) Is it correct to assume that the lingering lock was actually from *before* the power failure?
>>
>> (2) What, exactly, triggers the lock acquisition and release in this context? Is it nova-compute that does this, or libvirt, or Qemu/KVM?
>>
>> (3) Would the same issue be expected essentially in any hard failure of even a single compute node, and if so, does that mean that what https://docs.ceph.com/docs/master/rbd/rbd-openstack/ says about "nova evacuate" (and presumably, by extension also about "nova host-evacuate") is inaccurate? If so, what would be required to make that work?
>>
>> (4) If (3), is it correct to assume that the same considerations apply to the Nova resume_guests_state_on_host_boot feature, i.e. that automatic guest recovery wouldn't be expected to succeed even if a node experienced just a hard reboot, as opposed to a catastrophic permanent failure? And again, what would be required to make that work? Is it really necessary to clean all RBD locks manually?
>>
>> Grateful for any insight that people could share here. I'd volunteer to add a brief writeup of locking functionality in this context to the docs.
>>
>> Thanks!
>>
>> Cheers,
>> Florian
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks
Hi Florian,

Any chance the key your compute nodes are using for the RBD pool is missing 'allow command "osd blacklist"' from its mon caps?

Simon

On 15/11/2019 08:19, Florian Haas wrote:

Hi everyone,

I'm trying to wrap my head around an issue we recently saw, as it relates to RBD locks, Qemu/KVM, and libvirt.

Our data center graced us with a sudden and complete dual-feed power failure that affected both a Ceph cluster (Luminous, 12.2.12), and OpenStack compute nodes that used RBDs in that Ceph cluster. (Yes, these things really happen, even in 2019.)

Once nodes were powered back up, the Ceph cluster came up gracefully with no intervention required — all we saw was some Mon clock skew until NTP peers had fully synced. Yay! However, our Nova compute nodes, or rather the libvirt VMs that were running on them, were in not so great a shape. The VMs booted up fine initially, but then blew up as soon as they were trying to write to their RBD-backed virtio devices — which, of course, was very early in the boot sequence as they had dirty filesystem journals to apply.

Being able to read from, but not write to, RBDs is usually an issue with exclusive locking, so we stopped one of the affected VMs, checked the RBD locks on its device, and found (with rbd lock ls) that the lock was still being held even after the VM was definitely down — both "openstack server show" and "virsh domstate" agreed on this. We manually cleared the lock (rbd lock rm), started the VM, and it booted up fine. Repeat for all VMs, and we were back in business.

If I understand correctly, image locks — in contrast to image watchers — have no timeout, so locks must always be explicitly released, or they linger forever.

So that raises a few questions:

(1) Is it correct to assume that the lingering lock was actually from *before* the power failure?

(2) What, exactly, triggers the lock acquisition and release in this context? Is it nova-compute that does this, or libvirt, or Qemu/KVM?
(3) Would the same issue be expected essentially in any hard failure of even a single compute node, and if so, does that mean that what https://docs.ceph.com/docs/master/rbd/rbd-openstack/ says about "nova evacuate" (and presumably, by extension also about "nova host-evacuate") is inaccurate? If so, what would be required to make that work?

(4) If (3), is it correct to assume that the same considerations apply to the Nova resume_guests_state_on_host_boot feature, i.e. that automatic guest recovery wouldn't be expected to succeed even if a node experienced just a hard reboot, as opposed to a catastrophic permanent failure? And again, what would be required to make that work? Is it really necessary to clean all RBD locks manually?

Grateful for any insight that people could share here. I'd volunteer to add a brief writeup of locking functionality in this context to the docs.

Thanks!

Cheers, Florian ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
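The manual recovery Florian describes, as a sketch. The pool/image name, lock ID, and locker below are placeholders; read the real values from the `rbd lock ls` output, and only break a lock once you are sure the VM is really down:

```shell
# Confirm the VM is down first ("virsh domstate", "openstack server show").

# List the lock; the output shows a locker, a lock ID, and an address,
# e.g.:  client.458796  auto 140224983861600  1.2.3.4:0/5678
rbd lock ls vms/instance-0000002a_disk

# Remove the stale lock, quoting the lock ID since it contains a space.
rbd lock rm vms/instance-0000002a_disk "auto 140224983861600" client.458796
```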