Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-12-05 Thread Florian Haas
On 02/12/2019 16:48, Florian Haas wrote:
> Doc patch PR is here, for anyone who feels inclined to review:
> 
> https://github.com/ceph/ceph/pull/31893

Landed, here's the new documentation:
https://docs.ceph.com/docs/master/rbd/rbd-exclusive-locks/

Thanks everyone for chiming in, and special thanks to Jason for the
detailed review!

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-12-02 Thread Florian Haas
On 19/11/2019 22:42, Florian Haas wrote:
> On 19/11/2019 22:34, Jason Dillaman wrote:
>>> Oh totally, I wasn't arguing it was a bad idea for it to do what it
>>> does! I just got confused by the fact that our mon logs showed what
>>> looked like a (failed) attempt to blacklist an entire client IP address.
>>
>> There should have been an associated client nonce after the IP address
>> to uniquely identify which client connection is blacklisted --
>> something like "1.2.3.4:0/5678". Let me know if that's not the case
>> since that would definitely be wrong.
> 
> English lacks a universally understood way to answer a negated question
> in the affirmative, so this is tricky to get right, but I'll try: No,
> that *is* the case, thus nothing is wrong. :)

Doc patch PR is here, for anyone who feels inclined to review:

https://github.com/ceph/ceph/pull/31893

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-19 Thread Jason Dillaman
On Tue, Nov 19, 2019 at 4:42 PM Florian Haas  wrote:
>
> On 19/11/2019 22:34, Jason Dillaman wrote:
> >> Oh totally, I wasn't arguing it was a bad idea for it to do what it
> >> does! I just got confused by the fact that our mon logs showed what
> >> looked like a (failed) attempt to blacklist an entire client IP address.
> >
> > There should have been an associated client nonce after the IP address
> > to uniquely identify which client connection is blacklisted --
> > something like "1.2.3.4:0/5678". Let me know if that's not the case
> > since that would definitely be wrong.
>
> English lacks a universally understood way to answer a negated question
> in the affirmative, so this is tricky to get right, but I'll try: No,
> that *is* the case, thus nothing is wrong. :)

Haha -- thanks!

> Cheers,
> Florian
>


-- 
Jason

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-19 Thread Florian Haas
On 19/11/2019 22:34, Jason Dillaman wrote:
>> Oh totally, I wasn't arguing it was a bad idea for it to do what it
>> does! I just got confused by the fact that our mon logs showed what
>> looked like a (failed) attempt to blacklist an entire client IP address.
> 
> There should have been an associated client nonce after the IP address
> to uniquely identify which client connection is blacklisted --
> something like "1.2.3.4:0/5678". Let me know if that's not the case
> since that would definitely be wrong.

English lacks a universally understood way to answer a negated question
in the affirmative, so this is tricky to get right, but I'll try: No,
that *is* the case, thus nothing is wrong. :)

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-19 Thread Jason Dillaman
On Tue, Nov 19, 2019 at 4:31 PM Florian Haas  wrote:
>
> On 19/11/2019 22:19, Jason Dillaman wrote:
> > On Tue, Nov 19, 2019 at 4:09 PM Florian Haas  wrote:
> >>
> >> On 19/11/2019 21:32, Jason Dillaman wrote:
> >>>> What, exactly, is the "reasonably configured hypervisor" here, in other
> >>>> words, what is it that grabs and releases this lock? It's evidently not
> >>>> Nova that does this, but is it libvirt, or Qemu/KVM, and if so, what
> >>>> magic in there makes this happen, and what "reasonable configuration"
> >>>> influences this?
> >>>
> >>> librbd and krbd perform this logic when the exclusive-lock feature is
> >>> enabled.
> >>
> >> Right. So the "reasonable configuration" applies to the features they
> >> enable when they *create* an image, rather than what they do to the
> >> image at runtime. Is that fair to say?
> >
> > The exclusive-lock ownership is enforced at image use (i.e. when the
> > feature is a property of the image, not specifically just during the
> > action of enabling the property) -- so this implies "what they do to
> > the image at runtime"
>
> OK, gotcha.
>
> >>> In this case, librbd sees that the previous lock owner is
> >>> dead / missing, but before it can steal the lock (since librbd did not
> >>> cleanly close the image), it needs to ensure it cannot come back from
> >>> the dead to issue future writes against the RBD image by blacklisting
> >>> it from the cluster.
> >>
> >> Thanks. I'm probably sounding dense here, sorry for that, but yes, this
> >> makes perfect sense to me when I want to fence a whole node off —
> >> however, how exactly does this work with VM recovery in place?
> >
> > How would librbd / krbd know under what situation a VM was being
> > "recovered"? Should librbd be expected to integrate w/ IPMI devices
> > where the VM is being run or w/ Zabbix alert monitoring to know that
> > this was a power failure so don't expect that the lock owner will come
> > back up? The safe and generic thing for librbd / krbd to do in this
> > situation is to just blacklist the old lock owner to ensure it cannot
> > talk to the cluster. Obviously in the case of a physically failed
> > node, that won't ever happen -- but I think we can all agree this is
> > the sane recovery path that covers all bases.
>
> Oh totally, I wasn't arguing it was a bad idea for it to do what it
> does! I just got confused by the fact that our mon logs showed what
> looked like a (failed) attempt to blacklist an entire client IP address.

There should have been an associated client nonce after the IP address
to uniquely identify which client connection is blacklisted --
something like "1.2.3.4:0/5678". Let me know if that's not the case
since that would definitely be wrong.
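
(For reference, existing blacklist entries can be listed on the cluster; each
entry is addr:port/nonce plus an expiry time. The address, nonce and timestamp
below are made up:

ceph osd blacklist ls
1.2.3.4:0/5678 2019-11-20 22:34:00.000000

So only that one client instance is fenced, not every client on the host.)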

> > Yup, with the correct permissions librbd / rbd will be able to
> > blacklist the lock owner, break the old lock, and acquire the lock
> > themselves for R/W operations -- and the operator would not need to
> > intervene.
>
> Ack. Thanks!
>
> Cheers,
> Florian
>


-- 
Jason

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-19 Thread Florian Haas
On 19/11/2019 22:19, Jason Dillaman wrote:
> On Tue, Nov 19, 2019 at 4:09 PM Florian Haas  wrote:
>>
>> On 19/11/2019 21:32, Jason Dillaman wrote:
>>>> What, exactly, is the "reasonably configured hypervisor" here, in other
>>>> words, what is it that grabs and releases this lock? It's evidently not
>>>> Nova that does this, but is it libvirt, or Qemu/KVM, and if so, what
>>>> magic in there makes this happen, and what "reasonable configuration"
>>>> influences this?
>>>
>>> librbd and krbd perform this logic when the exclusive-lock feature is
>>> enabled.
>>
>> Right. So the "reasonable configuration" applies to the features they
>> enable when they *create* an image, rather than what they do to the
>> image at runtime. Is that fair to say?
> 
> The exclusive-lock ownership is enforced at image use (i.e. when the
> feature is a property of the image, not specifically just during the
> action of enabling the property) -- so this implies "what they do to
> the image at runtime"

OK, gotcha.

>>> In this case, librbd sees that the previous lock owner is
>>> dead / missing, but before it can steal the lock (since librbd did not
>>> cleanly close the image), it needs to ensure it cannot come back from
>>> the dead to issue future writes against the RBD image by blacklisting
>>> it from the cluster.
>>
>> Thanks. I'm probably sounding dense here, sorry for that, but yes, this
>> makes perfect sense to me when I want to fence a whole node off —
>> however, how exactly does this work with VM recovery in place?
> 
> How would librbd / krbd know under what situation a VM was being
> "recovered"? Should librbd be expected to integrate w/ IPMI devices
> where the VM is being run or w/ Zabbix alert monitoring to know that
> this was a power failure so don't expect that the lock owner will come
> back up? The safe and generic thing for librbd / krbd to do in this
> situation is to just blacklist the old lock owner to ensure it cannot
> talk to the cluster. Obviously in the case of a physically failed
> node, that won't ever happen -- but I think we can all agree this is
> the sane recovery path that covers all bases.

Oh totally, I wasn't arguing it was a bad idea for it to do what it
does! I just got confused by the fact that our mon logs showed what
looked like a (failed) attempt to blacklist an entire client IP address.

> Yup, with the correct permissions librbd / rbd will be able to
> blacklist the lock owner, break the old lock, and acquire the lock
> themselves for R/W operations -- and the operator would not need to
> intervene.

Ack. Thanks!

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-19 Thread Jason Dillaman
On Tue, Nov 19, 2019 at 4:09 PM Florian Haas  wrote:
>
> On 19/11/2019 21:32, Jason Dillaman wrote:
> >> What, exactly, is the "reasonably configured hypervisor" here, in other
> >> words, what is it that grabs and releases this lock? It's evidently not
> >> Nova that does this, but is it libvirt, or Qemu/KVM, and if so, what
> >> magic in there makes this happen, and what "reasonable configuration"
> >> influences this?
> >
> > librbd and krbd perform this logic when the exclusive-lock feature is
> > enabled.
>
> Right. So the "reasonable configuration" applies to the features they
> enable when they *create* an image, rather than what they do to the
> image at runtime. Is that fair to say?

The exclusive-lock ownership is enforced at image use (i.e. when the
feature is a property of the image, not specifically just during the
action of enabling the property) -- so this implies "what they do to
the image at runtime"
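
(To check whether a given image carries the feature at all -- the image spec
below is just a placeholder:

rbd info volumes/volume-1234 | grep features
rbd feature enable volumes/volume-1234 exclusive-lock   # only if it is missing

The behaviour described here only applies to images that list exclusive-lock
among their features.)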

> > In this case, librbd sees that the previous lock owner is
> > dead / missing, but before it can steal the lock (since librbd did not
> > cleanly close the image), it needs to ensure it cannot come back from
> > the dead to issue future writes against the RBD image by blacklisting
> > it from the cluster.
>
> Thanks. I'm probably sounding dense here, sorry for that, but yes, this
> makes perfect sense to me when I want to fence a whole node off —
> however, how exactly does this work with VM recovery in place?

How would librbd / krbd know under what situation a VM was being
"recovered"? Should librbd be expected to integrate w/ IPMI devices
where the VM is being run or w/ Zabbix alert monitoring to know that
this was a power failure so don't expect that the lock owner will come
back up? The safe and generic thing for librbd / krbd to do in this
situation is to just blacklist the old lock owner to ensure it cannot
talk to the cluster. Obviously in the case of a physically failed
node, that won't ever happen -- but I think we can all agree this is
the sane recovery path that covers all bases.

> From further upthread:
>
> > Semi-relatedly, as I understand it OSD blacklisting happens based either
> > on an IP address, or on a socket address (IP:port). While this comes in
> > handy in host evacuation, it doesn't in in-place recovery (see question
> > 4 in my original message).
> >
> > - If the blacklist happens based on IP address alone (and that seems to be
> > what the client attempts to do, based on our log messages), then it would
> > break recovery-in-place after a hard reboot
> > altogether.
> >
> > - Even if the client would blacklist based on an address:port pair, it
> > would be just very unlikely that an RBD client used the same source port
> > to connect after the node recovers in place, but not impossible.
>
> Clearly though, if people set their permissions correctly then this
> blacklisting seems to work fine even for recovery-in-place, so no reason
> for me to doubt that, I'd just really like to understand the mechanics. :)

Yup, with the correct permissions librbd / rbd will be able to
blacklist the lock owner, break the old lock, and acquire the lock
themselves for R/W operations -- and the operator would not need to
intervene.

> Thanks again!
>
> Cheers,
> Florian
>

-- 
Jason

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-19 Thread Florian Haas
On 19/11/2019 21:32, Jason Dillaman wrote:
>> What, exactly, is the "reasonably configured hypervisor" here, in other
>> words, what is it that grabs and releases this lock? It's evidently not
>> Nova that does this, but is it libvirt, or Qemu/KVM, and if so, what
>> magic in there makes this happen, and what "reasonable configuration"
>> influences this?
> 
> librbd and krbd perform this logic when the exclusive-lock feature is
> enabled.

Right. So the "reasonable configuration" applies to the features they
enable when they *create* an image, rather than what they do to the
image at runtime. Is that fair to say?

> In this case, librbd sees that the previous lock owner is
> dead / missing, but before it can steal the lock (since librbd did not
> cleanly close the image), it needs to ensure it cannot come back from
> the dead to issue future writes against the RBD image by blacklisting
> it from the cluster.

Thanks. I'm probably sounding dense here, sorry for that, but yes, this
makes perfect sense to me when I want to fence a whole node off —
however, how exactly does this work with VM recovery in place?

From further upthread:

> Semi-relatedly, as I understand it OSD blacklisting happens based either
> on an IP address, or on a socket address (IP:port). While this comes in
> handy in host evacuation, it doesn't in in-place recovery (see question
> 4 in my original message).
> 
> - If the blacklist happens based on IP address alone (and that seems to be
> what the client attempts to do, based on our log messages), then it would
> break recovery-in-place after a hard reboot
> altogether.
> 
> - Even if the client would blacklist based on an address:port pair, it
> would be just very unlikely that an RBD client used the same source port
> to connect after the node recovers in place, but not impossible.

Clearly though, if people set their permissions correctly then this
blacklisting seems to work fine even for recovery-in-place, so no reason
for me to doubt that, I'd just really like to understand the mechanics. :)

Thanks again!

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-19 Thread Jason Dillaman
On Tue, Nov 19, 2019 at 2:49 PM Florian Haas  wrote:
>
> On 19/11/2019 20:03, Jason Dillaman wrote:
> > On Tue, Nov 19, 2019 at 1:51 PM shubjero  wrote:
> >>
> >> Florian,
> >>
> >> Thanks for posting about this issue. This is something that we have
> >> been experiencing (stale exclusive locks) with our OpenStack and Ceph
> >> cloud more frequently as our datacentre has had some reliability
> >> issues recently with power and cooling causing several unexpected
> >> shutdowns.
> >>
> >> At this point we are on Ceph Mimic 13.2.6 and reading through this
> >> thread and related links I just wanted to confirm if I have the
> >> correct caps for cinder clients as listed below as we have upgraded
> >> through many major Ceph versions over the years and I'm sure a lot of
> >> our configs and settings still contain deprecated options.
> >>
> >> client.cinder
> >> key: sanitized==
> >> caps: [mgr] allow r
> >> caps: [mon] profile rbd
> >> caps: [osd] allow class-read object_prefix rbd_children, profile rbd
> >> pool=volumes, profile rbd pool=vms, profile rbd pool=images
> >
> > Only use "profile rbd" for 'mon' and 'osd' caps -- it's documented
> > here [1]. Once you use 'profile rbd', you don't need the extra "allow
> > class-read object_prefix rbd_children" since it is included within the
> > profile (along with other things like support for clone v2). Octopus
> > will also include "profile rbd" for the 'mgr' cap to support the new
> > functionality in the "rbd_support" manager module (like running "rbd
> > perf image top" w/o the admin caps).
> >
> >> From what I read, the blacklist permission was something that was
> >> supposed to be applied pre-Luminous upgrade but once you are on
> >> Luminous or later, it's no longer needed assuming you have switched to
> >> using the rbd profile.
> >
> > Correct. The "blacklist" permission was an intermediate state
> > pre-upgrade since your older OSDs wouldn't have support for "profile
> > rbd" yet but Luminous OSDs started to enforce caps on the 'blacklist
> > add' op so that rogue users w/ read-only permissions couldn't just
> > blacklist all clients. Once you are at Luminous or later, you can just
> > use the profile.
>
> OK, great. This gives me something to start with for a doc patch.
> Thanks! However, I'm still curious about this bit:
>
> >> On Fri, Nov 15, 2019 at 11:05 AM Paul Emmerich  
> >> wrote:
> >>> * This is unrelated to openstack and will happen with *any* reasonably
> >>> configured hypervisor that uses exclusive locking
>
> What, exactly, is the "reasonably configured hypervisor" here, in other
> words, what is it that grabs and releases this lock? It's evidently not
> Nova that does this, but is it libvirt, or Qemu/KVM, and if so, what
> magic in there makes this happen, and what "reasonable configuration"
> influences this?

librbd and krbd perform this logic when the exclusive-lock feature is
enabled. In this case, librbd sees that the previous lock owner is
dead / missing, but before it can steal the lock (since librbd did not
cleanly close the image), it needs to ensure it cannot come back from
the dead to issue future writes against the RBD image by blacklisting
it from the cluster.
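
Roughly speaking, the recovering client asks the monitors for the equivalent
of (address and nonce made up):

ceph osd blacklist add 1.2.3.4:0/5678

which is the mon operation that either 'profile rbd' or an explicit
'allow command "osd blacklist"' cap has to permit; without it the lock cannot
be safely stolen and writes keep failing.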

> Thanks again!
>
> Cheers,
> Florian
>


-- 
Jason

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-19 Thread Florian Haas
On 19/11/2019 20:03, Jason Dillaman wrote:
> On Tue, Nov 19, 2019 at 1:51 PM shubjero  wrote:
>>
>> Florian,
>>
>> Thanks for posting about this issue. This is something that we have
>> been experiencing (stale exclusive locks) with our OpenStack and Ceph
>> cloud more frequently as our datacentre has had some reliability
>> issues recently with power and cooling causing several unexpected
>> shutdowns.
>>
>> At this point we are on Ceph Mimic 13.2.6 and reading through this
>> thread and related links I just wanted to confirm if I have the
>> correct caps for cinder clients as listed below as we have upgraded
>> through many major Ceph versions over the years and I'm sure a lot of
>> our configs and settings still contain deprecated options.
>>
>> client.cinder
>> key: sanitized==
>> caps: [mgr] allow r
>> caps: [mon] profile rbd
>> caps: [osd] allow class-read object_prefix rbd_children, profile rbd
>> pool=volumes, profile rbd pool=vms, profile rbd pool=images
> 
> Only use "profile rbd" for 'mon' and 'osd' caps -- it's documented
> here [1]. Once you use 'profile rbd', you don't need the extra "allow
> class-read object_prefix rbd_children" since it is included within the
> profile (along with other things like support for clone v2). Octopus
> will also include "profile rbd" for the 'mgr' cap to support the new
> functionality in the "rbd_support" manager module (like running "rbd
> perf image top" w/o the admin caps).
> 
>> From what I read, the blacklist permission was something that was
>> supposed to be applied pre-Luminous upgrade but once you are on
>> Luminous or later, it's no longer needed assuming you have switched to
>> using the rbd profile.
> 
> Correct. The "blacklist" permission was an intermediate state
> pre-upgrade since your older OSDs wouldn't have support for "profile
> rbd" yet but Luminous OSDs started to enforce caps on the 'blacklist
> add' op so that rogue users w/ read-only permissions couldn't just
> blacklist all clients. Once you are at Luminous or later, you can just
> use the profile.

OK, great. This gives me something to start with for a doc patch.
Thanks! However, I'm still curious about this bit:

>> On Fri, Nov 15, 2019 at 11:05 AM Paul Emmerich  
>> wrote:
>>> * This is unrelated to openstack and will happen with *any* reasonably
>>> configured hypervisor that uses exclusive locking

What, exactly, is the "reasonably configured hypervisor" here, in other
words, what is it that grabs and releases this lock? It's evidently not
Nova that does this, but is it libvirt, or Qemu/KVM, and if so, what
magic in there makes this happen, and what "reasonable configuration"
influences this?

Thanks again!

Cheers,
Florian

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-19 Thread Jason Dillaman
On Tue, Nov 19, 2019 at 1:51 PM shubjero  wrote:
>
> Florian,
>
> Thanks for posting about this issue. This is something that we have
> been experiencing (stale exclusive locks) with our OpenStack and Ceph
> cloud more frequently as our datacentre has had some reliability
> issues recently with power and cooling causing several unexpected
> shutdowns.
>
> At this point we are on Ceph Mimic 13.2.6 and reading through this
> thread and related links I just wanted to confirm if I have the
> correct caps for cinder clients as listed below as we have upgraded
> through many major Ceph versions over the years and I'm sure a lot of
> our configs and settings still contain deprecated options.
>
> client.cinder
> key: sanitized==
> caps: [mgr] allow r
> caps: [mon] profile rbd
> caps: [osd] allow class-read object_prefix rbd_children, profile rbd
> pool=volumes, profile rbd pool=vms, profile rbd pool=images

Only use "profile rbd" for 'mon' and 'osd' caps -- it's documented
here [1]. Once you use 'profile rbd', you don't need the extra "allow
class-read object_prefix rbd_children" since it is included within the
profile (along with other things like support for clone v2). Octopus
will also include "profile rbd" for the 'mgr' cap to support the new
functionality in the "rbd_support" manager module (like running "rbd
perf image top" w/o the admin caps).
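
As a sketch, using the pool names from the caps you pasted (note that "ceph
auth caps" replaces the full cap set for the user, so the mgr cap has to be
restated too):

ceph auth caps client.cinder \
  mon 'profile rbd' \
  osd 'profile rbd pool=volumes, profile rbd pool=vms, profile rbd pool=images' \
  mgr 'allow r'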

> From what I read, the blacklist permission was something that was
> supposed to be applied pre-Luminous upgrade but once you are on
> Luminous or later, it's no longer needed assuming you have switched to
> using the rbd profile.

Correct. The "blacklist" permission was an intermediate state
pre-upgrade since your older OSDs wouldn't have support for "profile
rbd" yet but Luminous OSDs started to enforce caps on the 'blacklist
add' op so that rogue users w/ read-only permissions couldn't just
blacklist all clients. Once you are at Luminous or later, you can just
use the profile.

> On Fri, Nov 15, 2019 at 11:05 AM Paul Emmerich  wrote:
> >
> > To clear up a few misconceptions here:
> >
> > * RBD keyrings should use the "profile rbd" permissions, everything
> > else is *wrong* and should be fixed asap
> > * Manually adding the blacklist permission might work but isn't
> > future-proof, fix the keyring instead
> > * The suggestion to mount them elsewhere to fix this only works
> > because "elsewhere" probably has an admin keyring, this is a bad
> > work-around, fix the keyring instead
> > * This is unrelated to openstack and will happen with *any* reasonably
> > configured hypervisor that uses exclusive locking
> >
> > This problem usually happens after upgrading to Luminous without
> > reading the change log. The change log tells you to adjust the keyring
> > permissions accordingly
> >
> > Paul
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90
> >
> > On Fri, Nov 15, 2019 at 4:56 PM Joshua M. Boniface  
> > wrote:
> > >
> > > Thanks Simon! I've implemented it, I guess I'll test it out next time my 
> > > homelab's power dies :-)
> > >
> > > On 2019-11-15 10:54 a.m., Simon Ironside wrote:
> > >
> > > On 15/11/2019 15:44, Joshua M. Boniface wrote:
> > >
> > > Hey All:
> > >
> > > I've also quite frequently experienced this sort of issue with my Ceph 
> > > RBD-backed QEMU/KVM
> > >
> > > cluster (not OpenStack specifically). Should this workaround of allowing 
> > > the 'osd blacklist'
> > >
> > > command in the caps help in that scenario as well, or is this an 
> > > OpenStack-specific
> > >
> > > functionality?
> > >
> > > Yes, my use case is RBD backed QEMU/KVM too, not Openstack. It's
> > > required for all RBD clients.
> > >
> > > Simon
> > >
> > > ___
> > >
> > > ceph-users mailing list
> > >
> > > ceph-users@lists.ceph.com
> > >
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[1] 
https://docs.ceph.com/docs/master/rbd/rbd-openstack/#setup-ceph-client-authentication

-- 
Jason

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-19 Thread shubjero
Florian,

Thanks for posting about this issue. This is something that we have
been experiencing (stale exclusive locks) with our OpenStack and Ceph
cloud more frequently as our datacentre has had some reliability
issues recently with power and cooling causing several unexpected
shutdowns.

At this point we are on Ceph Mimic 13.2.6 and reading through this
thread and related links I just wanted to confirm if I have the
correct caps for cinder clients as listed below as we have upgraded
through many major Ceph versions over the years and I'm sure a lot of
our configs and settings still contain deprecated options.

client.cinder
key: sanitized==
caps: [mgr] allow r
caps: [mon] profile rbd
caps: [osd] allow class-read object_prefix rbd_children, profile rbd
pool=volumes, profile rbd pool=vms, profile rbd pool=images

From what I read, the blacklist permission was something that was
supposed to be applied pre-Luminous upgrade but once you are on
Luminous or later, it's no longer needed assuming you have switched to
using the rbd profile.

On Fri, Nov 15, 2019 at 11:05 AM Paul Emmerich  wrote:
>
> To clear up a few misconceptions here:
>
> * RBD keyrings should use the "profile rbd" permissions, everything
> else is *wrong* and should be fixed asap
> * Manually adding the blacklist permission might work but isn't
> future-proof, fix the keyring instead
> * The suggestion to mount them elsewhere to fix this only works
> because "elsewhere" probably has an admin keyring, this is a bad
> work-around, fix the keyring instead
> * This is unrelated to openstack and will happen with *any* reasonably
> configured hypervisor that uses exclusive locking
>
> This problem usually happens after upgrading to Luminous without
> reading the change log. The change log tells you to adjust the keyring
> permissions accordingly
>
> Paul
>
> --
> Paul Emmerich
>
> Looking for help with your Ceph cluster? Contact us at https://croit.io
>
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
>
> On Fri, Nov 15, 2019 at 4:56 PM Joshua M. Boniface  wrote:
> >
> > Thanks Simon! I've implemented it, I guess I'll test it out next time my 
> > homelab's power dies :-)
> >
> > On 2019-11-15 10:54 a.m., Simon Ironside wrote:
> >
> > On 15/11/2019 15:44, Joshua M. Boniface wrote:
> >
> > Hey All:
> >
> > I've also quite frequently experienced this sort of issue with my Ceph 
> > RBD-backed QEMU/KVM
> >
> > cluster (not OpenStack specifically). Should this workaround of allowing 
> > the 'osd blacklist'
> >
> > command in the caps help in that scenario as well, or is this an 
> > OpenStack-specific
> >
> > functionality?
> >
> > Yes, my use case is RBD backed QEMU/KVM too, not Openstack. It's
> > required for all RBD clients.
> >
> > Simon
> >
> > ___
> >
> > ceph-users mailing list
> >
> > ceph-users@lists.ceph.com
> >
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-15 Thread Paul Emmerich
To clear up a few misconceptions here:

* RBD keyrings should use the "profile rbd" permissions, everything
else is *wrong* and should be fixed asap
* Manually adding the blacklist permission might work but isn't
future-proof, fix the keyring instead
* The suggestion to mount them elsewhere to fix this only works
because "elsewhere" probably has an admin keyring, this is a bad
work-around, fix the keyring instead
* This is unrelated to openstack and will happen with *any* reasonably
configured hypervisor that uses exclusive locking

This problem usually happens after upgrading to Luminous without
reading the change log. The change log tells you to adjust the keyring
permissions accordingly
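
A minimal sketch of such a keyring for a standalone QEMU/KVM client (the user
name and pool here are placeholders):

ceph auth get-or-create client.libvirt mon 'profile rbd' osd 'profile rbd pool=rbd'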

Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Fri, Nov 15, 2019 at 4:56 PM Joshua M. Boniface  wrote:
>
> Thanks Simon! I've implemented it, I guess I'll test it out next time my 
> homelab's power dies :-)
>
> On 2019-11-15 10:54 a.m., Simon Ironside wrote:
>
> On 15/11/2019 15:44, Joshua M. Boniface wrote:
>
> Hey All:
>
> I've also quite frequently experienced this sort of issue with my Ceph 
> RBD-backed QEMU/KVM
>
> cluster (not OpenStack specifically). Should this workaround of allowing the 
> 'osd blacklist'
>
> command in the caps help in that scenario as well, or is this an 
> OpenStack-specific
>
> functionality?
>
> Yes, my use case is RBD backed QEMU/KVM too, not Openstack. It's
> required for all RBD clients.
>
> Simon
>
> ___
>
> ceph-users mailing list
>
> ceph-users@lists.ceph.com
>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-15 Thread Joshua M. Boniface

Thanks Simon! I've implemented it, I guess I'll test it out next time my 
homelab's power dies :-)

On 2019-11-15 10:54 a.m., Simon Ironside wrote:


On 15/11/2019 15:44, Joshua M. Boniface wrote:

Hey All:
I've also quite frequently experienced this sort of issue with my Ceph 
RBD-backed QEMU/KVM
cluster (not OpenStack specifically). Should this workaround of allowing the 
'osd blacklist'
command in the caps help in that scenario as well, or is this an 
OpenStack-specific
functionality?

Yes, my use case is RBD backed QEMU/KVM too, not Openstack. It's
required for all RBD clients.
Simon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-15 Thread Simon Ironside

On 15/11/2019 15:44, Joshua M. Boniface wrote:

Hey All:

I've also quite frequently experienced this sort of issue with my Ceph 
RBD-backed QEMU/KVM
cluster (not OpenStack specifically). Should this workaround of allowing the 
'osd blacklist'
command in the caps help in that scenario as well, or is this an 
OpenStack-specific
functionality?


Yes, my use case is RBD backed QEMU/KVM too, not Openstack. It's 
required for all RBD clients.


Simon

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-15 Thread Joshua M. Boniface

Hey All:

I've also quite frequently experienced this sort of issue with my Ceph 
RBD-backed QEMU/KVM
cluster (not OpenStack specifically). Should this workaround of allowing the 
'osd blacklist'
command in the caps help in that scenario as well, or is this an 
OpenStack-specific
functionality?

Thanks,
Joshua

On 2019-11-15 9:02 a.m., Florian Haas wrote:


On 15/11/2019 14:27, Simon Ironside wrote:

Hi Florian,

On 15/11/2019 12:32, Florian Haas wrote:


I received this off-list but then subsequently saw this message pop up
in the list archive, so I hope it's OK to reply on-list?

Of course, I just clicked the wrong reply button the first time.


So that cap was indeed missing, thanks for the hint! However, I am still
trying to understand how this is related to the issue we saw.

I had exactly the same happen to me as happened to you a week or so ago.
Compute node lost power and once restored the VMs would start booting
but fail early on when they tried to write.

My key was also missing that cap, adding it and resetting the affected
VMs was the only action I took to sort things out. I didn't need to go
around removing locks by hand as you did. As you say, waiting 30 seconds
didn't do any good so it doesn't appear to be a watcher thing.

Right, so suffice to say that that article is at least somewhere between
incomplete and misleading. :)


This was mentioned in the release notes for Luminous[1], I'd missed it
too as I redeployed Nautilus instead and skipped these steps:



Verify that all RBD client users have sufficient caps to blacklist other
client users. RBD client users with only "allow r" monitor caps should
be updated as follows:

# ceph auth caps client.<id> mon 'allow r, allow command "osd blacklist"' osd '<existing OSD caps>'



Yup, looks like we missed that bit of the release notes too (cluster has
been in production for several major releases now).

So it looks like we've got a fix for this. Thanks!

Also Wido, thanks for the reminder on profile rbd; we'll look into that too.

However, I'm still failing to wrap my head around the causality chain
here, and also around the interplay between watchers, locks, and
blacklists. If anyone could share some insight about this that I could
distill into a doc patch, I'd much appreciate that.

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-15 Thread EDH - Manuel Rios Fernandez
Hi,

To solve the issue:

Run rbd map pool/disk_id and mount the / volume on a Linux machine (a Ceph
node will be fine); this will flush the journal and close and discard the
pending changes cached on the OpenStack nodes. Then unmount and rbd unmap,
boot the instance from OpenStack again, and voila, it will work.

For Windows instances you must run ntfsfix on a Linux machine, with the same
commands.
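
A rough sketch of those steps, assuming a node whose keyring has sufficient
caps (image spec and device are placeholders; a volume with a partition table
would need e.g. /dev/rbd0p1 instead):

rbd map volumes/volume-1234
mount /dev/rbd0 /mnt        # mounting replays the dirty filesystem journal
umount /mnt
rbd unmap /dev/rbd0

# for an NTFS (Windows) volume, instead of mounting:
ntfsfix /dev/rbd0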

Regards,
Manuel




-----Original Message-----
From: ceph-users  On behalf of Simon
Ironside
Sent: Friday, 15 November 2019 14:28
To: ceph-users 
Subject: Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM,
and Ceph RBD locks

Hi Florian,

On 15/11/2019 12:32, Florian Haas wrote:

> I received this off-list but then subsequently saw this message pop up 
> in the list archive, so I hope it's OK to reply on-list?

Of course, I just clicked the wrong reply button the first time.

> So that cap was indeed missing, thanks for the hint! However, I am 
> still trying to understand how this is related to the issue we saw.

I had exactly the same happen to me as happened to you a week or so ago. 
Compute node lost power and once restored the VMs would start booting but
fail early on when they tried to write.

My key was also missing that cap, adding it and resetting the affected VMs
was the only action I took to sort things out. I didn't need to go around
removing locks by hand as you did. As you say, waiting 30 seconds didn't do
any good so it doesn't appear to be a watcher thing.

This was mentioned in the release notes for Luminous[1], I'd missed it too
as I redeployed Nautilus instead and skipped these steps:



Verify that all RBD client users have sufficient caps to blacklist other
client users. RBD client users with only "allow r" monitor caps should be
updated as follows:

# ceph auth caps client.<id> mon 'allow r, allow command "osd blacklist"' osd '<existing OSD caps>'



Simon

[1]
https://docs.ceph.com/docs/master/releases/luminous/#upgrade-from-jewel-or-kraken
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-15 Thread Simon Ironside

Hi Florian,

On 15/11/2019 12:32, Florian Haas wrote:


I received this off-list but then subsequently saw this message pop up
in the list archive, so I hope it's OK to reply on-list?


Of course, I just clicked the wrong reply button the first time.


So that cap was indeed missing, thanks for the hint! However, I am still
trying to understand how this is related to the issue we saw.


I had exactly the same happen to me as happened to you a week or so ago. 
Compute node lost power and once restored the VMs would start booting 
but fail early on when they tried to write.


My key was also missing that cap, adding it and resetting the affected 
VMs was the only action I took to sort things out. I didn't need to go 
around removing locks by hand as you did. As you say, waiting 30 seconds 
didn't do any good so it doesn't appear to be a watcher thing.


This was mentioned in the release notes for Luminous[1], I'd missed it 
too as I redeployed Nautilus instead and skipped these steps:




Verify that all RBD client users have sufficient caps to blacklist other 
client users. RBD client users with only "allow r" monitor caps should 
be updated as follows:


# ceph auth caps client.<id> mon 'allow r, allow command "osd blacklist"' osd '<existing OSD caps>'




Simon

[1] 
https://docs.ceph.com/docs/master/releases/luminous/#upgrade-from-jewel-or-kraken

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-15 Thread Florian Haas
On 15/11/2019 11:23, Simon Ironside wrote:
> Hi Florian,
> 
> Any chance the key your compute nodes are using for the RBD pool is
> missing 'allow command "osd blacklist"' from its mon caps?
> 
> Simon

Hi Simon,

I received this off-list but then subsequently saw this message pop up
in the list archive, so I hope it's OK to reply on-list?

So that cap was indeed missing, thanks for the hint! However, I am still
trying to understand how this is related to the issue we saw.

The only documentation-ish article that I found about osd blacklist caps
is this:

https://access.redhat.com/solutions/3391211

We can also confirm a bunch of "access denied" messages when trying to
blacklist an OSD in the mon logs. So the content of that article
definitely applies to our situation, I'm just not sure I follow how the
absence of that capability caused this issue.

The article talks about RBD watchers, not locks. To the best of my
knowledge, a watcher operates like a lease on the image, which is
periodically renewed. If not renewed in 30 seconds of client inactivity,
the cluster considers the client dead. (Please correct me if I'm wrong.)
For us, that didn't help. We had to actively remove locks with "rbd lock
rm". Is the article using the wrong terms? Is there a link between
watchers and locks that I'm unaware of?
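
For what it's worth, the two can be inspected separately (the image spec below
is a placeholder):

rbd status volumes/volume-1234     # lists watchers
rbd lock ls volumes/volume-1234    # lists locks and their owners

The watcher disappears on its own once the timeout expires; the lock entry is
what we had to remove by hand.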

Semi-relatedly, as I understand it OSD blacklisting happens based either
on an IP address, or on a socket address (IP:port). While this comes in
handy in host evacuation, it doesn't in in-place recovery (see question
4 in my original message).

- If the blacklist happens based on IP address alone (and that seems to be
what the client attempts to do, based on our log messages), then it would
break recovery-in-place after a hard reboot
altogether.

- Even if the client would blacklist based on an address:port pair, it
would be just very unlikely that an RBD client used the same source port
to connect after the node recovers in place, but not impossible.

So I am wondering: is this incorrect documentation, or incorrect
behavior, or am I simply making dead-wrong assumptions?

Cheers,
Florian

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-15 Thread Wido den Hollander


On 11/15/19 11:24 AM, Simon Ironside wrote:
> Hi Florian,
> 
> Any chance the key your compute nodes are using for the RBD pool is
> missing 'allow command "osd blacklist"' from its mon caps?
> 

Added to this I recommend to use the 'profile rbd' for the mon caps.

As also stated in the OpenStack docs:
https://docs.ceph.com/docs/master/rbd/rbd-openstack/#setup-ceph-client-authentication

Wido

> Simon
> 
> On 15/11/2019 08:19, Florian Haas wrote:
>> Hi everyone,
>>
>> I'm trying to wrap my head around an issue we recently saw, as it
>> relates to RBD locks, Qemu/KVM, and libvirt.
>>
>> Our data center graced us with a sudden and complete dual-feed power
>> failure that affected both a Ceph cluster (Luminous, 12.2.12), and
>> OpenStack compute nodes that used RBDs in that Ceph cluster. (Yes, these
>> things really happen, even in 2019.)
>>
>> Once nodes were powered back up, the Ceph cluster came up gracefully
>> with no intervention required — all we saw was some Mon clock skew until
>> NTP peers had fully synced. Yay! However, our Nova compute nodes, or
>> rather the libvirt VMs that were running on them, were in not so great a
>> shape. The VMs booted up fine initially, but then blew up as soon as
>> they were trying to write to their RBD-backed virtio devices — which, of
>> course, was very early in the boot sequence as they had dirty filesystem
>> journals to apply.
>>
>> Being able to read from, but not write to, RBDs is usually an issue with
>> exclusive locking, so we stopped one of the affected VMs, checked the
>> RBD locks on its device, and found (with rbd lock ls) that the lock was
>> still being held even after the VM was definitely down — both "openstack
>> server show" and "virsh domstate" agreed on this. We manually cleared
>> the lock (rbd lock rm), started the VM, and it booted up fine.
>>
>> Repeat for all VMs, and we were back in business.
>>
>> If I understand correctly, image locks — in contrast to image watchers —
>> have no timeout, so locks must always be explicitly released, or they
>> linger forever.
>>
>> So that raises a few questions:
>>
>> (1) Is it correct to assume that the lingering lock was actually from
>> *before* the power failure?
>>
>> (2) What, exactly, triggers the lock acquisition and release in this
>> context? Is it nova-compute that does this, or libvirt, or Qemu/KVM?
>>
>> (3) Would the same issue be expected essentially in any hard failure of
>> even a single compute node, and if so, does that mean that what
>> https://docs.ceph.com/docs/master/rbd/rbd-openstack/ says about "nova
>> evacuate" (and presumably, by extension also about "nova host-evacuate")
>> is inaccurate? If so, what would be required to make that work?
>>
>> (4) If (3), is it correct to assume that the same considerations apply
>> to the Nova resume_guests_state_on_host_boot feature, i.e. that
>> automatic guest recovery wouldn't be expected to succeed even if a node
>> experienced just a hard reboot, as opposed to a catastrophic permanent
>> failure? And again, what would be required to make that work? Is it
>> really necessary to clean all RBD locks manually?
>>
>> Grateful for any insight that people could share here. I'd volunteer to
>> add a brief writeup of locking functionality in this context to the docs.
>>
>> Thanks!
>>
>> Cheers,
>> Florian
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Global power failure, OpenStack Nova/libvirt/KVM, and Ceph RBD locks

2019-11-15 Thread Simon Ironside

Hi Florian,

Any chance the key your compute nodes are using for the RBD pool is 
missing 'allow command "osd blacklist"' from its mon caps?
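
(A quick way to check, substituting whatever client name your compute nodes
actually use:

ceph auth get client.cinder

and look at the "caps mon" line in the output.)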


Simon

On 15/11/2019 08:19, Florian Haas wrote:

Hi everyone,

I'm trying to wrap my head around an issue we recently saw, as it
relates to RBD locks, Qemu/KVM, and libvirt.

Our data center graced us with a sudden and complete dual-feed power
failure that affected both a Ceph cluster (Luminous, 12.2.12), and
OpenStack compute nodes that used RBDs in that Ceph cluster. (Yes, these
things really happen, even in 2019.)

Once nodes were powered back up, the Ceph cluster came up gracefully
with no intervention required — all we saw was some Mon clock skew until
NTP peers had fully synced. Yay! However, our Nova compute nodes, or
rather the libvirt VMs that were running on them, were in not so great a
shape. The VMs booted up fine initially, but then blew up as soon as
they were trying to write to their RBD-backed virtio devices — which, of
course, was very early in the boot sequence as they had dirty filesystem
journals to apply.

Being able to read from, but not write to, RBDs is usually an issue with
exclusive locking, so we stopped one of the affected VMs, checked the
RBD locks on its device, and found (with rbd lock ls) that the lock was
still being held even after the VM was definitely down — both "openstack
server show" and "virsh domstate" agreed on this. We manually cleared
the lock (rbd lock rm), started the VM, and it booted up fine.

Repeat for all VMs, and we were back in business.
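
The manual cleanup boiled down to something like the following -- the image
spec, lock ID and locker here are made up and are taken in practice from the
"rbd lock ls" output:

rbd lock ls volumes/volume-1234
rbd lock rm volumes/volume-1234 "auto 94829382747" client.4567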

If I understand correctly, image locks — in contrast to image watchers —
have no timeout, so locks must always be explicitly released, or they
linger forever.

So that raises a few questions:

(1) Is it correct to assume that the lingering lock was actually from
*before* the power failure?

(2) What, exactly, triggers the lock acquisition and release in this
context? Is it nova-compute that does this, or libvirt, or Qemu/KVM?

(3) Would the same issue be expected essentially in any hard failure of
even a single compute node, and if so, does that mean that what
https://docs.ceph.com/docs/master/rbd/rbd-openstack/ says about "nova
evacuate" (and presumably, by extension also about "nova host-evacuate")
is inaccurate? If so, what would be required to make that work?

(4) If (3), is it correct to assume that the same considerations apply
to the Nova resume_guests_state_on_host_boot feature, i.e. that
automatic guest recovery wouldn't be expected to succeed even if a node
experienced just a hard reboot, as opposed to a catastrophic permanent
failure? And again, what would be required to make that work? Is it
really necessary to clean all RBD locks manually?

Grateful for any insight that people could share here. I'd volunteer to
add a brief writeup of locking functionality in this context to the docs.

Thanks!

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com