On 06/06/2018 07:46 AM, Matthew Booth wrote:
TL;DR I think we need to entirely disable swap volume for multiattach
volumes, and this will be an API-breaking change with no immediate
workaround.

I was looking through tempest and came across
api.compute.admin.test_volume_swap.TestMultiAttachVolumeSwap.test_volume_swap_with_multiattach.
This test does:

Create 2 multiattach volumes
Create 2 servers
Attach volume 1 to both servers
** Swap volume 1 for volume 2 on server 1 **
Check all is attached as expected
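
For reference, the swap step there is the compute API's "update a volume
attachment" call. A minimal sketch of that call using the requests
library, assuming you already have a token and the compute endpoint (all
values below are placeholders):

    # Sketch only: swap volume 1 for volume 2 on server 1 via
    # PUT /servers/{server_id}/os-volume_attachments/{old_volume_id}.
    import requests

    COMPUTE = "http://controller:8774/v2.1"        # placeholder endpoint
    HEADERS = {"X-Auth-Token": "...",              # placeholder token
               "Content-Type": "application/json"}
    SERVER1, VOL1, VOL2 = "server-1-uuid", "vol-1-uuid", "vol-2-uuid"

    resp = requests.put(
        "%s/servers/%s/os-volume_attachments/%s" % (COMPUTE, SERVER1, VOL1),
        headers=HEADERS,
        json={"volumeAttachment": {"volumeId": VOL2}})
    resp.raise_for_status()   # nova kicks off the swap (and the copy)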

The problem with this is that swap volume is a copy operation.

Is it, though? The original blueprint and implementation seem to suggest that the swap_volume operation was nothing more than changing the mountpoint for a volume to point to a different location (in a safe
manner that didn't lose any reads or writes).

https://blueprints.launchpad.net/nova/+spec/volume-swap

Nothing about the description of swap_volume() in the virt driver interface mentions swap_volume() being a "copy operation":

https://github.com/openstack/nova/blob/76ec078d3781fb55c96d7aaca4fb73a74ce94d96/nova/virt/driver.py#L476

We don't just replace one volume with another, we copy the contents
from one to the other and then do the swap. We do this with a qemu
drive mirror operation, which is able to do this copy safely without
needing to make the source read-only because it can also track writes
to the source and ensure the target is updated again. Here's a link
to the libvirt logs showing a drive mirror operation during the swap
volume step of a run of the above test:
http://logs.openstack.org/58/567258/5/check/nova-multiattach/d23fad8/logs/libvirt/libvirtd.txt.gz#_2018-06-04_10_57_05_201

After checking the source code, I see that the libvirt virt driver is the only virt driver that implements swap_volume(), so it looks to me like a public HTTP API method was added that is specific to libvirt's implementation of drive mirroring. Yay, more implementation details leaking out through the API.
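
For the curious, the drive mirror dance Matthew describes looks roughly
like this at the python-libvirt level. This is a sketch of the mechanism
only, not Nova's actual code, and the domain name, disk device and path
are made up:

    # Sketch of a QEMU drive mirror via libvirt blockRebase/blockJobAbort.
    import time

    import libvirt

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("instance-00000001")    # made-up domain name

    flags = (libvirt.VIR_DOMAIN_BLOCK_REBASE_COPY |
             libvirt.VIR_DOMAIN_BLOCK_REBASE_REUSE_EXT)
    dom.blockRebase("vda", "/dev/disk/by-path/new-volume", 0, flags)

    # Wait for the mirror to catch up with the source disk...
    while True:
        info = dom.blockJobInfo("vda", 0)
        if info and info["end"] and info["cur"] == info["end"]:
            break
        time.sleep(1)

    # ...then pivot the guest onto the copy. Writes arriving via *other*
    # attachments of the same volume are invisible to this job, which is
    # exactly the multiattach corruption problem.
    dom.blockJobAbort("vda", libvirt.VIR_DOMAIN_BLOCK_JOB_ABORT_PIVOT)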

The problem is that when the volume is attached to more than one VM,
the hypervisor doing the drive mirror *doesn't* know about writes on
the other attached VMs, so it can't do that copy safely, and the
result is data corruption.

Would it be possible to swap the volume by doing what Vish originally described in the blueprint: pause the VM, swap the volume mountpoints (potentially after migrating the underlying volume), start the VM?
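
At the libvirt level that would look something like the sketch below:
suspend, detach the old disk, attach the new one at the same target,
resume. The device XML and paths are made up, and the actual data move
(e.g. a Cinder migration) would happen while the guest is paused:

    # Sketch of "pause the VM, swap the volume mountpoints, start the VM".
    import libvirt

    conn = libvirt.open("qemu:///system")
    dom = conn.lookupByName("instance-00000001")    # made-up domain name

    old_disk_xml = """<disk type='block' device='disk'>
      <source dev='/dev/disk/by-path/old-volume'/>
      <target dev='vdb' bus='virtio'/>
    </disk>"""
    new_disk_xml = old_disk_xml.replace("old-volume", "new-volume")

    dom.suspend()                     # pause the guest
    # ... migrate the volume data here while nothing is writing ...
    dom.detachDeviceFlags(old_disk_xml, libvirt.VIR_DOMAIN_AFFECT_LIVE)
    dom.attachDeviceFlags(new_disk_xml, libvirt.VIR_DOMAIN_AFFECT_LIVE)
    dom.resume()                      # guest sees the same vdb target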

Note that swap volume isn't visible to the
guest OS, so this can't be addressed by the user. This is a data
corrupter, and we shouldn't allow it. However, it is in released code
and users might be doing it already, so disabling it would be a
user-visible API change with no immediate workaround.

I'd love to know who is actually using the swap_volume() functionality. I'd especially like to know who is using swap_volume() with multiattach.

However, I think we're attempting to do the wrong thing here anyway,
and the above tempest test is explicitly testing behaviour that we don't
want. The use case for swap volume is that a user needs to move volume
data for attached volumes, e.g. to new faster/supported/maintained
hardware.

Is that the use case?

As was typical, there's no mention of a use case on the original blueprint. It just says "This feature allows a user or administrator to transparently swap out a cinder volume that connected to an instance." Which is hardly a use case since it uses the feature name in a description of the feature itself. :(

The commit message (there was only a single commit for this functionality [1]) mentions overwriting data on the new volume:

  Adds support for transparently swapping an attached volume with
  another volume. Note that this overwrites all data on the new volume
  with data from the old volume.

Yes, that is the commit message in its entirety. Of course, the commit had no documentation at all in it, so there's no way to understand what the original use case really was.

[1] https://review.openstack.org/#/c/28995/

If the use case was really "that a user needs to move volume data for attached volumes", why not just pause the VM, detach the volume, do an openstack volume migrate to the new destination, reattach the volume and start the VM? That would mean no libvirt/QEMU-specific implementation behaviour leaking out of the public HTTP API, and it would allow the volume service (Cinder) to do its job properly.
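
Sketched against the raw REST APIs, that flow is short. Endpoints, IDs
and the target host below are placeholders, the migrate action is
admin-only, and real code would wait for each state transition instead
of firing the calls back to back:

    # Sketch of pause / detach / migrate / reattach / unpause.
    import requests

    COMPUTE = "http://controller:8774/v2.1"          # placeholders
    CINDER = "http://controller:8776/v3/PROJECT_ID"
    HEADERS = {"X-Auth-Token": "...",
               "Content-Type": "application/json"}
    SERVER, VOLUME = "server-uuid", "volume-uuid"

    requests.post("%s/servers/%s/action" % (COMPUTE, SERVER),
                  headers=HEADERS, json={"pause": None})

    requests.delete("%s/servers/%s/os-volume_attachments/%s"
                    % (COMPUTE, SERVER, VOLUME), headers=HEADERS)

    requests.post("%s/volumes/%s/action" % (CINDER, VOLUME),
                  headers=HEADERS,
                  json={"os-migrate_volume": {"host": "newback@lvm#pool"}})

    requests.post("%s/servers/%s/os-volume_attachments" % (COMPUTE, SERVER),
                  headers=HEADERS,
                  json={"volumeAttachment": {"volumeId": VOLUME}})

    requests.post("%s/servers/%s/action" % (COMPUTE, SERVER),
                  headers=HEADERS, json={"unpause": None})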

With single attach that's exactly what they get: the end
user should never notice. With multi-attach they don't get that. We're
basically forking the shared volume at a point in time, with the
instance which did the swap writing to the new location while all
others continue writing to the old location. Except that even the fork
is broken, because they'll get a corrupt, inconsistent copy rather
than a point-in-time one. I can't think of a use case for this behaviour,
and it certainly doesn't meet the original design intent.

What they really want is for the multi-attached volume to be copied
from location a to location b and for all attachments to be updated.
Unfortunately I don't think we're going to be in a position to do that
any time soon, but I also think users will be unhappy if they're no
longer able to move data at all because it's multi-attach. We can
compromise, though, if we allow a multiattach volume to be moved as
long as it only has a single attachment. This means the operator can't
move this data without disruption to users, but at least it's not
fundamentally immovable.

This would require some cooperation with cinder to achieve, as we need
to be able to temporarily prevent cinder from allowing new
attachments. A natural way to achieve this would be to allow a
multiattach volume with only a single attachment to be redesignated
as not multiattach, but there might be other approaches. The flow would
then be (see the sketch after the list):

Detach volume from server 2
Set multiattach=False on volume
Migrate volume on server 1
Set multiattach=True on volume
Attach volume to server 2
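
As a sketch, and stressing that the multiattach toggle below is
hypothetical (Cinder has no such API today) while the other calls exist;
placeholders and waits as before:

    # Sketch of the proposed flow. The two "multiattach" PUTs are
    # HYPOTHETICAL: cinder has no API to toggle multiattach on a volume.
    import requests

    COMPUTE = "http://controller:8774/v2.1"          # placeholders
    CINDER = "http://controller:8776/v3/PROJECT_ID"
    HEADERS = {"X-Auth-Token": "...",
               "Content-Type": "application/json"}
    VOLUME, SERVER2 = "volume-uuid", "server-2-uuid"

    # Detach volume from server 2
    requests.delete("%s/servers/%s/os-volume_attachments/%s"
                    % (COMPUTE, SERVER2, VOLUME), headers=HEADERS)

    # Set multiattach=False on volume (HYPOTHETICAL API)
    requests.put("%s/volumes/%s" % (CINDER, VOLUME), headers=HEADERS,
                 json={"volume": {"multiattach": False}})

    # Migrate the volume; it stays attached to server 1 throughout
    requests.post("%s/volumes/%s/action" % (CINDER, VOLUME),
                  headers=HEADERS,
                  json={"os-migrate_volume": {"host": "newback@lvm#pool"}})

    # Set multiattach=True on volume (HYPOTHETICAL API)
    requests.put("%s/volumes/%s" % (CINDER, VOLUME), headers=HEADERS,
                 json={"volume": {"multiattach": True}})

    # Attach volume to server 2 again
    requests.post("%s/servers/%s/os-volume_attachments" % (COMPUTE, SERVER2),
                  headers=HEADERS,
                  json={"volumeAttachment": {"volumeId": VOLUME}})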

Combined with a patch to nova to disallow swap_volume on any
multiattach volume, this would then be possible, if inconvenient.

Regardless of any other changes, though, I think it's urgent that we
disable the ability to swap_volume a multiattach volume because we
don't want users to start using this relatively new, but broken,
feature.
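
The guard itself would be small. A sketch of the kind of check I mean,
with made-up names for the helper and its surroundings in Nova's API
layer:

    # Sketch only: reject swap_volume for multiattach volumes in the
    # handler for PUT /servers/{id}/os-volume_attachments/{volume_id}.
    # The helper name is made up; 'volume' is the cinder volume dict.
    from webob import exc

    def _check_swap_allowed(volume):
        if volume.get('multiattach', False):
            raise exc.HTTPBadRequest(
                explanation="Swapping a multiattach volume is not "
                            "supported: the hypervisor cannot mirror "
                            "writes made via other attachments.")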

Or we could deprecate the swap_volume Compute API operation and use Cinder for all of this.

But sure, we could also add more cruft to the Compute API and add more conditional "it works but only when X" docs to the API reference.

Just my two cents,
-jay
