Re: [openstack-dev] [nova][cinder] Update (swap) of multiattach volume should not be allowed

2018-06-06 Thread Matt Riedemann

Here is the nova patch for those following along:

https://review.openstack.org/#/c/572790/

On 6/6/2018 9:07 AM, Jay Pipes wrote:

On 06/06/2018 10:02 AM, Matt Riedemann wrote:

On 6/6/2018 8:24 AM, Jay Pipes wrote:

On 06/06/2018 09:10 AM, Artom Lifshitz wrote:

I think regardless of how we ended up with this situation, we're still
in a position where we have a public-facing API that could lead to
data-corruption when used in a specific way. That should never be the
case. I would think re-using the already possible 400 response code to
update-volume when used with a multi-attach volume to indicate that it
can't be done, without a new microversion, would be the cleanest way of
getting out of this pickle.


That's fine, yes.

I just think it's worth noting that it's a pickle that we put 
ourselves in due to an ill-conceived feature and Compute API call. 
And that we should, you know, try to stop doing that. :)


-jay


If we're going to change something, I think it should probably happen 
on the cinder side when the retype or live migration of the volume is 
initiated, and do the attachment counting there.


So if you're swapping from multiattach volume A to multiattach volume 
B and either has >1 read/write attachment, then fail with a 400 in the 
cinder API.


We can check those things in the compute API when cinder calls the 
swap volume API in nova, but:


1. It's racy - cinder is the source of truth on the current state of 
the attachments.


2. The failure mode is going to be questionable - by the time cinder 
calls nova to swap the volumes on the compute host, the cinder REST 
API has long since 202'ed the response to the user and the best nova 
can do is return a 400 and then cinder has to handle that gracefully 
and roll back. It would be much cleaner if the volume API just failed fast.


+10

-jay




--

Thanks,

Matt



Re: [openstack-dev] [nova][cinder] Update (swap) of multiattach volume should not be allowed

2018-06-06 Thread Jay Pipes

On 06/06/2018 10:02 AM, Matt Riedemann wrote:

On 6/6/2018 8:24 AM, Jay Pipes wrote:

On 06/06/2018 09:10 AM, Artom Lifshitz wrote:

I think regardless of how we ended up with this situation, we're still
in a position where we have a public-facing API that could lead to
data-corruption when used in a specific way. That should never be the
case. I would think re-using the already possible 400 response code to
update-volume when used with a multi-attach volume to indicate that it
can't be done, without a new microversion, would be the cleanest way of
getting out of this pickle.


That's fine, yes.

I just think it's worth noting that it's a pickle that we put 
ourselves in due to an ill-conceived feature and Compute API call. And 
that we should, you know, try to stop doing that. :)


-jay


If we're going to change something, I think it should probably happen on 
the cinder side when the retype or live migration of the volume is 
initiated, and do the attachment counting there.


So if you're swapping from multiattach volume A to multiattach volume B 
and either has >1 read/write attachment, then fail with a 400 in the 
cinder API.


We can check those things in the compute API when cinder calls the swap 
volume API in nova, but:


1. It's racy - cinder is the source of truth on the current state of the 
attachments.


2. The failure mode is going to be questionable - by the time cinder 
calls nova to swap the volumes on the compute host, the cinder REST API 
has long since 202'ed the response to the user and the best nova can do 
is return a 400 and then cinder has to handle that gracefully and 
roll back. It would be much cleaner if the volume API just failed fast.


+10

-jay



Re: [openstack-dev] [nova][cinder] Update (swap) of multiattach volume should not be allowed

2018-06-06 Thread Matthew Booth
On 6 June 2018 at 13:55, Jay Pipes wrote:
> On 06/06/2018 07:46 AM, Matthew Booth wrote:
>>
>> TL;DR I think we need to entirely disable swap volume for multiattach
>> volumes, and this will be an api breaking change with no immediate
>> workaround.
>>
>> I was looking through tempest and came across
>>
>> api.compute.admin.test_volume_swap.TestMultiAttachVolumeSwap.test_volume_swap_with_multiattach.
>> This test does:
>>
>> Create 2 multiattach volumes
>> Create 2 servers
>> Attach volume 1 to both servers
>> ** Swap volume 1 for volume 2  on server 1 **
>> Check all is attached as expected
>>
>> The problem with this is that swap volume is a copy operation.
>
>
> Is it, though? The original blueprint and implementation seem to suggest
> that the swap_volume operation was nothing more than changing the mountpoint
> for a volume to point to a different location (in a safe
> manner that didn't lose any reads or writes).
>
> https://blueprints.launchpad.net/nova/+spec/volume-swap
>
> Nothing about the description of swap_volume() in the virt driver interface
> mentions swap_volume() being a "copy operation":
>
> https://github.com/openstack/nova/blob/76ec078d3781fb55c96d7aaca4fb73a74ce94d96/nova/virt/driver.py#L476
>
>> We don't just replace one volume with another, we copy the contents
>> from one to the other and then do the swap. We do this with a qemu
>> drive mirror operation, which is able to do this copy safely without
>> needing to make the source read-only because it can also track writes
>> to the source and ensure the target is updated again. Here's a link
>> to the libvirt logs showing a drive mirror operation during the swap
>> volume of an execution of the above test:
>
> After checking the source code, the libvirt virt driver is the only virt
> driver that implements swap_volume(), so it looks to me like a public HTTP
> API method was added that was specific to libvirt's implementation of drive
> mirroring. Yay, more implementation leaking out through the API.
>
>>
>> http://logs.openstack.org/58/567258/5/check/nova-multiattach/d23fad8/logs/libvirt/libvirtd.txt.gz#_2018-06-04_10_57_05_201
>>
>> The problem is that when the volume is attached to more than one VM,
>> the hypervisor doing the drive mirror *doesn't* know about writes on
>> the other attached VMs, so it can't do that copy safely, and the
>> result is data corruption.
>
>
> Would it be possible to swap the volume by doing what Vish originally
> described in the blueprint: pause the VM, swap the volume mountpoints
> (potentially after migrating the underlying volume), start the VM?
>
>>
>> Note that swap volume isn't visible to the
>> guest os, so this can't be addressed by the user. This is a data
>> corrupter, and we shouldn't allow it. However, it is in released code
>> and users might be doing it already, so disabling it would be a
>> user-visible api change with no immediate workaround.
>
>
> I'd love to know who is actually using the swap_volume() functionality,
> actually. I'd especially like to know who is using swap_volume() with
> multiattach.
>
>> However, I think we're attempting to do the wrong thing here anyway,
>> and the above tempest test is explicitly testing behaviour that we don't
>> want. The use case for swap volume is that a user needs to move volume
>> data for attached volumes, e.g. to new faster/supported/maintained
>> hardware.
>
>
> Is that the use case?
>
> As was typical, there's no mention of a use case on the original blueprint.
> It just says "This feature allows a user or administrator to transparently
> swap out a cinder volume that connected to an instance." Which is hardly a
> use case since it uses the feature name in a description of the feature
> itself. :(
>
> The commit message (there was only a single commit for this functionality
> [1]) mentions overwriting data on the new volume:
>
>   Adds support for transparently swapping an attached volume with
>   another volume. Note that this overwrites all data on the new volume
>   with data from the old volume.
>
> Yes, that is the commit message in its entirety. Of course, the commit had
> no documentation at all in it, so there's no ability to understand what the
> original use case really was here.
>
> https://review.openstack.org/#/c/28995/
>
> If the use case was really "that a user needs to move volume data for
> attached volumes", why not just pause the VM, detach the volume, do an
> openstack volume migrate to the new destination, reattach the volume and
> start the VM? That would mean no libvirt/QEMU-specific implementation
> behaviour leaking out of the public HTTP API, and would allow the volume service
> (Cinder) to do its job properly.

I can't comment on how it was originally documented, but I'm confident
in the use case. Certainly I know this is how our customers use it.
It's the Nova-side implementation of a cinder retype operation. There
are a bunch of potential reasons to want to do this, but a specific
one that I recall from a 

Re: [openstack-dev] [nova][cinder] Update (swap) of multiattach volume should not be allowed

2018-06-06 Thread Matt Riedemann

On 6/6/2018 8:24 AM, Jay Pipes wrote:

On 06/06/2018 09:10 AM, Artom Lifshitz wrote:

I think regardless of how we ended up with this situation, we're still
in a position where we have a public-facing API that could lead to
data-corruption when used in a specific way. That should never be the
case. I would think re-using the already possible 400 response code to
update-volume when used with a multi-attach volume to indicate that it
can't be done, without a new microversion, would be the cleanest way of
getting out of this pickle.


That's fine, yes.

I just think it's worth noting that it's a pickle that we put ourselves 
in due to an ill-conceived feature and Compute API call. And that we 
should, you know, try to stop doing that. :)


-jay


If we're going to change something, I think it should probably happen on 
the cinder side when the retype or live migration of the volume is 
initiated, and do the attachment counting there.


So if you're swapping from multiattach volume A to multiattach volume B 
and either has >1 read/write attachment, then fail with a 400 in the 
cinder API.


We can check those things in the compute API when cinder calls the swap 
volume API in nova, but:


1. It's racy - cinder is the source of truth on the current state of the 
attachments.


2. The failure mode is going to be questionable - by the time cinder 
calls nova to swap the volumes on the compute host, the cinder REST API 
has long since 202'ed the response to the user and the best nova can do 
is return a 400 and then cinder has to handle that gracefully and 
roll back. It would be much cleaner if the volume API just failed fast.
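
(Purely as an illustration of that fail-fast idea, a check on the cinder
side could look roughly like the sketch below. The helper, attribute names
and exception usage are hypothetical, not the actual cinder code.)

from webob import exc


def _count_rw_attachments(volume):
    # Count active read/write attachments on the volume (sketch only;
    # attribute names are assumptions, not verified against any release).
    return sum(1 for att in volume.volume_attachment
               if att.attach_status == 'attached'
               and att.attach_mode == 'rw')


def _validate_retype_for_multiattach(src_volume, dest_volume):
    # Fail fast with a 400 before anything is handed off to nova.
    for vol in (src_volume, dest_volume):
        if vol.multiattach and _count_rw_attachments(vol) > 1:
            raise exc.HTTPBadRequest(
                explanation='Volume %s is multiattach and has more than '
                            'one read/write attachment; retype or '
                            'migration is not allowed.' % vol.id)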


--

Thanks,

Matt



Re: [openstack-dev] [nova][cinder] Update (swap) of multiattach volume should not be allowed

2018-06-06 Thread Jay Pipes

On 06/06/2018 09:10 AM, Artom Lifshitz wrote:

I think regardless of how we ended up with this situation, we're still
in a position where we have a public-facing API that could lead to
data-corruption when used in a specific way. That should never be the
case. I would think re-using the already possible 400 response code to
update-volume when used with a multi-attach volume to indicate that it
can't be done, without a new microversion, would be the cleanest way of
getting out of this pickle.


That's fine, yes.

I just think it's worth noting that it's a pickle that we put ourselves 
in due to an ill-conceived feature and Compute API call. And that we 
should, you know, try to stop doing that. :)


-jay



Re: [openstack-dev] [nova][cinder] Update (swap) of multiattach volume should not be allowed

2018-06-06 Thread Matt Riedemann

On 6/6/2018 7:55 AM, Jay Pipes wrote:
I'd love to know who is actually using the swap_volume() functionality, 
actually. I'd especially like to know who is using swap_volume() with 
multiattach.


The swap volume API in nova only exists as a callback routine during 
volume live migration or retype operations. It's admin-only by default 
on the nova side, and shouldn't be called directly (similar to 
guest-assisted volume snapshots for NFS and GlusterFS volumes - totally 
just a callback from Cinder). So during volume retype, cinder will call 
swap volume in nova and then nova will call another admin-only API in 
Cinder to tell Cinder, yup we did it or we failed, rollback.
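
(For anyone following along who hasn't looked at the API itself: the call
in question is the volume-attachment update, i.e. swap volume, in the
compute API. A minimal sketch of it with python-requests, with endpoint,
token and IDs as placeholders:)

import requests

COMPUTE = 'http://controller:8774/v2.1'   # placeholder endpoint
HEADERS = {'X-Auth-Token': 'TOKEN', 'Content-Type': 'application/json'}
server_id = 'SERVER_UUID'
old_volume_id = 'OLD_VOLUME_UUID'
new_volume_id = 'NEW_VOLUME_UUID'

# Update (swap) the attachment of old_volume_id to point at new_volume_id.
resp = requests.put(
    '%s/servers/%s/os-volume_attachments/%s'
    % (COMPUTE, server_id, old_volume_id),
    headers=HEADERS,
    json={'volumeAttachment': {'volumeId': new_volume_id}})
print(resp.status_code)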


The cinder API reference on retype mentions the restrictions about 
multiattach volumes:


https://developer.openstack.org/api-ref/block-storage/v3/#retype-a-volume

"Retyping an in-use volume from a multiattach-capable type to a 
non-multiattach-capable type, or vice-versa, is not supported. It is 
generally not recommended to retype an in-use multiattach volume if that 
volume has more than one active read/write attachment."


There is no API reference for volume live migration, but it should 
generally be the same idea.
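
(For reference, the retype that kicks all of this off is the "os-retype"
volume action in the cinder API; roughly, with placeholder endpoint, token
and IDs:)

import requests

VOLUME_API = 'http://controller:8776/v3/PROJECT_ID'   # placeholder
HEADERS = {'X-Auth-Token': 'TOKEN', 'Content-Type': 'application/json'}
volume_id = 'VOLUME_UUID'

# Retype the volume; with migration_policy=on-demand cinder may migrate
# the data, which is what ends up calling swap volume in nova for in-use
# volumes.
resp = requests.post(
    '%s/volumes/%s/action' % (VOLUME_API, volume_id),
    headers=HEADERS,
    json={'os-retype': {'new_type': 'NEW_VOLUME_TYPE',
                        'migration_policy': 'on-demand'}})
print(resp.status_code)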


The Tempest test for swap volume with multiattach volumes was written 
before we realized we needed to put restrictions in place *on the cinder 
side* to limit the behavior. The Tempest test just hits the compute API 
to verify that the plumbing in nova works properly; it doesn't initiate the 
flow via an actual retype (or volume live migration).


--

Thanks,

Matt



Re: [openstack-dev] [nova][cinder] Update (swap) of multiattach volume should not be allowed

2018-06-06 Thread Artom Lifshitz
I think regardless of how we ended up with this situation, we're still
in a position where we have a public-facing API that could lead to
data-corruption when used in a specific way. That should never be the
case. I would think re-using the already possible 400 response code to
update-volume when used with a multi-attach volume to indicate that it
can't be done, without a new microversion, would be the cleanest way of
getting out of this pickle.
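
(To make that concrete, the guard could be as small as the sketch below;
this is hypothetical, illustrative only, and not the code in any actual
patch.)

from webob import exc


def _reject_swap_for_multiattach(old_volume, new_volume):
    # Return a 400 from the update-volume (swap) API when either volume is
    # multiattach, without bumping the microversion.
    for vol in (old_volume, new_volume):
        if vol.get('multiattach'):
            raise exc.HTTPBadRequest(
                explanation='Swapping multiattach volume %s is not '
                            'supported.' % vol['id'])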

On Wed, Jun 6, 2018 at 2:55 PM, Jay Pipes wrote:
> On 06/06/2018 07:46 AM, Matthew Booth wrote:
>>
>> TL;DR I think we need to entirely disable swap volume for multiattach
>> volumes, and this will be an api breaking change with no immediate
>> workaround.
>>
>> I was looking through tempest and came across
>>
>> api.compute.admin.test_volume_swap.TestMultiAttachVolumeSwap.test_volume_swap_with_multiattach.
>> This test does:
>>
>> Create 2 multiattach volumes
>> Create 2 servers
>> Attach volume 1 to both servers
>> ** Swap volume 1 for volume 2  on server 1 **
>> Check all is attached as expected
>>
>> The problem with this is that swap volume is a copy operation.
>
>
> Is it, though? The original blueprint and implementation seem to suggest
> that the swap_volume operation was nothing more than changing the mountpoint
> for a volume to point to a different location (in a safe
> manner that didn't lose any reads or writes).
>
> https://blueprints.launchpad.net/nova/+spec/volume-swap
>
> Nothing about the description of swap_volume() in the virt driver interface
> mentions swap_volume() being a "copy operation":
>
> https://github.com/openstack/nova/blob/76ec078d3781fb55c96d7aaca4fb73a74ce94d96/nova/virt/driver.py#L476
>
>> We don't just replace one volume with another, we copy the contents
>> from one to the other and then do the swap. We do this with a qemu
>> drive mirror operation, which is able to do this copy safely without
>> needing to make the source read-only because it can also track writes
>> to the source and ensure the target is updated again. Here's a link
>> to the libvirt logs showing a drive mirror operation during the swap
>> volume of an execution of the above test:
>
> After checking the source code, the libvirt virt driver is the only virt
> driver that implements swap_volume(), so it looks to me like a public HTTP
> API method was added that was specific to libvirt's implementation of drive
> mirroring. Yay, more implementation leaking out through the API.
>
>>
>> http://logs.openstack.org/58/567258/5/check/nova-multiattach/d23fad8/logs/libvirt/libvirtd.txt.gz#_2018-06-04_10_57_05_201
>>
>> The problem is that when the volume is attached to more than one VM,
>> the hypervisor doing the drive mirror *doesn't* know about writes on
>> the other attached VMs, so it can't do that copy safely, and the
>> result is data corruption.
>
>
> Would it be possible to swap the volume by doing what Vish originally
> described in the blueprint: pause the VM, swap the volume mountpoints
> (potentially after migrating the underlying volume), start the VM?
>
>>
>> Note that swap volume isn't visible to the
>> guest os, so this can't be addressed by the user. This is a data
>> corrupter, and we shouldn't allow it. However, it is in released code
>> and users might be doing it already, so disabling it would be a
>> user-visible api change with no immediate workaround.
>
>
> I'd love to know who is actually using the swap_volume() functionality,
> actually. I'd especially like to know who is using swap_volume() with
> multiattach.
>
>> However, I think we're attempting to do the wrong thing here anyway,
>> and the above tempest test is explicitly testing behaviour that we don't
>> want. The use case for swap volume is that a user needs to move volume
>> data for attached volumes, e.g. to new faster/supported/maintained
>> hardware.
>
>
> Is that the use case?
>
> As was typical, there's no mention of a use case on the original blueprint.
> It just says "This feature allows a user or administrator to transparently
> swap out a cinder volume that connected to an instance." Which is hardly a
> use case since it uses the feature name in a description of the feature
> itself. :(
>
> The commit message (there was only a single commit for this functionality
> [1]) mentions overwriting data on the new volume:
>
>   Adds support for transparently swapping an attached volume with
>   another volume. Note that this overwrites all data on the new volume
>   with data from the old volume.
>
> Yes, that is the commit message in its entirety. Of course, the commit had
> no documentation at all in it, so there's no ability to understand what the
> original use case really was here.
>
> https://review.openstack.org/#/c/28995/
>
> If the use case was really "that a user needs to move volume data for
> attached volumes", why not just pause the VM, detach the volume, do an
> openstack volume migrate to the new destination, reattach the volume and
> start the VM? That would mean no 

Re: [openstack-dev] [nova][cinder] Update (swap) of multiattach volume should not be allowed

2018-06-06 Thread Jay Pipes

On 06/06/2018 07:46 AM, Matthew Booth wrote:

TL;DR I think we need to entirely disable swap volume for multiattach
volumes, and this will be an api breaking change with no immediate
workaround.

I was looking through tempest and came across
api.compute.admin.test_volume_swap.TestMultiAttachVolumeSwap.test_volume_swap_with_multiattach.
This test does:

Create 2 multiattach volumes
Create 2 servers
Attach volume 1 to both servers
** Swap volume 1 for volume 2  on server 1 **
Check all is attached as expected

The problem with this is that swap volume is a copy operation.


Is it, though? The original blueprint and implementation seem to suggest 
that the swap_volume operation was nothing more than changing the 
mountpoint for a volume to point to a different location (in a safe

manner that didn't lose any reads or writes).

https://blueprints.launchpad.net/nova/+spec/volume-swap

Nothing about the description of swap_volume() in the virt driver 
interface mentions swap_volume() being a "copy operation":


https://github.com/openstack/nova/blob/76ec078d3781fb55c96d7aaca4fb73a74ce94d96/nova/virt/driver.py#L476


We don't just replace one volume with another, we copy the contents
from one to the other and then do the swap. We do this with a qemu
drive mirror operation, which is able to do this copy safely without
needing to make the source read-only because it can also track writes
to the source and ensure the target is updated again. Here's a link
to the libvirt logs showing a drive mirror operation during the swap
volume of an execution of the above test:

After checking the source code, the libvirt virt driver is the only virt 
driver that implements swap_volume(), so it looks to me like a public 
HTTP API method was added that was specific to libvirt's implementation 
of drive mirroring. Yay, more implementation leaking out through the API.



http://logs.openstack.org/58/567258/5/check/nova-multiattach/d23fad8/logs/libvirt/libvirtd.txt.gz#_2018-06-04_10_57_05_201

The problem is that when the volume is attached to more than one VM,
the hypervisor doing the drive mirror *doesn't* know about writes on
the other attached VMs, so it can't do that copy safely, and the
result is data corruption.
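
(For illustration, the shape of that drive mirror, as driven through the
libvirt python bindings, is roughly the following; this is a simplified
sketch, not nova's actual implementation, and the domain, device and paths
are placeholders.)

import time
import libvirt

conn = libvirt.open('qemu:///system')
dom = conn.lookupByName('instance-00000001')   # placeholder domain

device = 'vdb'                        # disk being swapped
new_path = '/dev/disk/by-path/NEW'    # pre-created destination volume

# Start copying existing data to the new location while also mirroring new
# guest writes to it.
flags = (libvirt.VIR_DOMAIN_BLOCK_REBASE_COPY |
         libvirt.VIR_DOMAIN_BLOCK_REBASE_COPY_DEV |
         libvirt.VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT)
dom.blockRebase(device, new_path, 0, flags)

# Wait for the copy to catch up with the source...
while True:
    info = dom.blockJobInfo(device, 0)
    if not info or info['cur'] == info['end']:
        break
    time.sleep(1)

# ...then pivot the guest onto the new volume. Note that nothing here can
# see writes coming from *other* guests attached to the same multiattach
# source volume, which is exactly the corruption problem described above.
dom.blockJobAbort(device, libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)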


Would it be possible to swap the volume by doing what Vish originally 
described in the blueprint: pause the VM, swap the volume mountpoints 
(potentially after migrating the underlying volume), start the VM?


Note that swap volume isn't visible to the
guest os, so this can't be addressed by the user. This is a data
corrupter, and we shouldn't allow it. However, it is in released code
and users might be doing it already, so disabling it would be a
user-visible api change with no immediate workaround.


I'd love to know who is actually using the swap_volume() functionality, 
actually. I'd especially like to know who is using swap_volume() with 
multiattach.



However, I think we're attempting to do the wrong thing here anyway,
and the above tempest test is explicitly testing behaviour that we don't
want. The use case for swap volume is that a user needs to move volume
data for attached volumes, e.g. to new faster/supported/maintained
hardware.


Is that the use case?

As was typical, there's no mention of a use case on the original 
blueprint. It just says "This feature allows a user or administrator to 
transparently swap out a cinder volume that connected to an instance." 
Which is hardly a use case since it uses the feature name in a 
description of the feature itself. :(


The commit message (there was only a single commit for this 
functionality [1]) mentions overwriting data on the new volume:


  Adds support for transparently swapping an attached volume with
  another volume. Note that this overwrites all data on the new volume
  with data from the old volume.

Yes, that is the commit message in its entirety. Of course, the commit 
had no documentation at all in it, so there's no ability to understand 
what the original use case really was here.


https://review.openstack.org/#/c/28995/

If the use case was really "that a user needs to move volume data for 
attached volumes", why not just pause the VM, detach the volume, do an 
openstack volume migrate to the new destination, reattach the volume and 
start the VM? That would mean no libvirt/QEMU-specific implementation 
behaviour leaking out of the public HTTP API, and would allow the volume 
service (Cinder) to do its job properly.
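
(A rough sketch of that alternative flow against the public APIs, using
python-requests with placeholder endpoints, token and IDs; waiting for each
step to complete and all error handling are omitted.)

import requests

COMPUTE = 'http://controller:8774/v2.1'            # placeholder endpoints
VOLUME = 'http://controller:8776/v3/PROJECT_ID'
HEADERS = {'X-Auth-Token': 'TOKEN', 'Content-Type': 'application/json'}
server_id = 'SERVER_UUID'
volume_id = 'VOLUME_UUID'

# 1. Pause the server.
requests.post('%s/servers/%s/action' % (COMPUTE, server_id),
              headers=HEADERS, json={'pause': None})

# 2. Detach the volume.
requests.delete('%s/servers/%s/os-volume_attachments/%s'
                % (COMPUTE, server_id, volume_id), headers=HEADERS)

# 3. Migrate the volume to the new backend (admin-only cinder action).
requests.post('%s/volumes/%s/action' % (VOLUME, volume_id),
              headers=HEADERS,
              json={'os-migrate_volume': {'host': 'DEST_BACKEND'}})

# 4. Reattach the volume once the migration has finished.
requests.post('%s/servers/%s/os-volume_attachments' % (COMPUTE, server_id),
              headers=HEADERS,
              json={'volumeAttachment': {'volumeId': volume_id}})

# 5. Unpause the server.
requests.post('%s/servers/%s/action' % (COMPUTE, server_id),
              headers=HEADERS, json={'unpause': None})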



With single attach that's exactly what they get: the end
user should never notice. With multi-attach they don't get that. We're
basically forking the shared volume at a point in time, with the
instance which did the swap writing to the new location while all
others continue writing to the old location. Except that even the fork
is broken, because they'll get a corrupt, inconsistent copy rather
than a point-in-time one. I can't think of a use case for this behaviour,
and it certainly doesn't meet the original design intent.

What they