Re: [Qemu-devel] [PATCH 0/4] add failover feature for assigned network devices

Laine Stump Tue, 11 Jun 2019 08:56:35 -0700

On 5/17/19 8:58 AM, Jens Freimann wrote:

This is another attempt at implementing the host side of the
net_failover concept
(https://www.kernel.org/doc/html/latest/networking/net_failover.html)


Changes since last RFC:
- work around circular dependency of commandline options. Just add
   failover=on to the virtio-net standby options and reference it from
   primary (vfio-pci) device with standby=<id>
- add patch 3/4 to allow migration of vfio-pci device when it is part of a
   failover pair, still disallow for all other devices
- add patch 4/4 to allow unplug of device during migration, make an
   exception for failover primary devices. I'd like feedback on how to
   solve this more elegantly. I added a boolean to DeviceState, have it
   default to false for all devices except for primary devices.
- not tested yet with surprise removal
- I don't expect this to go in as it is, still needs more testing but
   I'd like to get feedback on above mentioned changes.

The general idea is that we have a pair of devices, a vfio-pci and an
emulated device. Before migration the vfio device is unplugged and data
flows to the emulated device, on the target side another vfio-pci device
is plugged in to take over the data-path. In the guest the net_failover
module will pair net devices with the same MAC address.

* In the first patch the infrastructure for hiding the device is added
   for the qbus and qdev APIs.

* In the second patch the virtio-net uses the API to defer adding the vfio
   device until the VIRTIO_NET_F_STANDBY feature is acked.

Previous discussion:
   RFC v1 https://patchwork.ozlabs.org/cover/989098/
   RFC v2 https://www.mail-archive.com/qemu-devel@nongnu.org/msg606906.html

To summarize concerns/feedback from previous discussion:
1. guest OS can reject or worse _delay_ unplug by any amount of time.
   Migration might get stuck for unpredictable time with unclear reason.
   This approach combines two tricky things, hot/unplug and migration.
   -> We can surprise-remove the PCI device and in QEMU we can do all
      necessary rollbacks transparent to management software. Will it be
      easy? Probably not.
2. PCI devices are a precious resource. The primary device should never
   be added to QEMU if it won't be used by guest instead of hiding it in
   QEMU.
   -> We only hotplug the device when the standby feature bit was
      negotiated. We save the device cmdline options until we need it for
      qdev_device_add()
      Hiding a device can be a useful concept to model. For example a
      pci device in a powered-off slot could be marked as hidden until the
      slot is powered on (mst).
3. Management layer software should handle this. OpenStack already has
   components/code to handle unplug/replug VFIO devices and metadata to
   provide to the guest for detecting which devices should be paired.
   -> An approach that includes all software from firmware to
      higher-level management software hasn't been tried in recent years.
      This is an attempt to keep it simple and contained in QEMU as much
      as possible.
4. Hotplugging a device and then making it part of a failover setup is
    not possible
   -> addressed by extending qdev hotplug functions to check for hidden
      attribute, so e.g. device_add can be used to plug a device.


I have tested this with an mlx5 NIC and was able to migrate the VM with
the above-mentioned workarounds for open problems.

Command line example:

qemu-system-x86_64 -enable-kvm -m 3072 -smp 3 \
         -machine q35,kernel-irqchip=split -cpu host   \
         -k fr   \
         -serial stdio   \
         -net none \
         -qmp unix:/tmp/qmp.socket,server,nowait \
         -monitor telnet:127.0.0.1:5555,server,nowait \
         -device pcie-root-port,id=root0,multifunction=on,chassis=0,addr=0xa \
         -device pcie-root-port,id=root1,bus=pcie.0,chassis=1 \
         -device pcie-root-port,id=root2,bus=pcie.0,chassis=2 \
         -netdev tap,script=/root/bin/bridge.sh,downscript=no,id=hostnet1,vhost=on \
         -device virtio-net-pci,netdev=hostnet1,id=net1,mac=52:54:00:6f:55:cc,bus=root2,failover=on \
         /root/rhel-guest-image-8.0-1781.x86_64.qcow2

Then the primary device can be hotplugged via
  (qemu) device_add vfio-pci,host=5e:00.2,id=hostdev0,bus=root1,standby=net1
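
For management software that uses the QMP socket from the command line above rather than the human monitor, the same hotplug might be expressed roughly as in the sketch below. It assumes the `standby` property proposed in this series is accepted as an ordinary `device_add` argument over QMP, just as it is in the monitor command; the helper name `build_failover_primary_add` is made up for illustration.

```python
import json


def build_failover_primary_add(host, dev_id, bus, standby_id):
    """Build a QMP device_add command for a vfio-pci failover primary.

    'standby' is the property proposed in this patch series; treating it
    as a plain device_add argument is an assumption of this sketch.
    """
    return {
        "execute": "device_add",
        "arguments": {
            "driver": "vfio-pci",
            "host": host,
            "id": dev_id,
            "bus": bus,
            "standby": standby_id,
        },
    }


if __name__ == "__main__":
    cmd = build_failover_primary_add("5e:00.2", "hostdev0", "root1", "net1")
    # A QMP client would write this JSON object to the socket after the
    # usual qmp_capabilities handshake.
    print(json.dumps(cmd))
```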

I guess this is the commandline on the migration destination, and as far as I understand from this example, on the destination we (meaning libvirt or higher level management application) must *not* include the assigned device on the qemu commandline, but must instead hotplug the device later after the guest CPUs have been restarted on the destination.

So if I'm understanding correctly, the idea is that on the migration source, the device may have been hotplugged, or may have been included when qemu was originally started. Then qemu automatically handles the unplug of the device on the source, but it seems qemu does nothing on the destination, leaving that up to libvirt or a higher layer to implement.

Then in order for this to work, libvirt (or OpenStack or oVirt or whoever) needs to understand that the device in the libvirt config (it will still be in the libvirt config, since from libvirt's POV it hasn't been unplugged):


1) shouldn't be included in the qemu commandline on the destination,

2) will almost surely need to be replaced with a different device on the destination (since it's almost certain that the destination won't have an available device at the same PCI address)

3) will probably need to be unbound from the VF net driver (does this need to happen before migration is finished? If we want to lower the probability of a failure after we're already committed to the migration, then I think we must, but libvirt isn't set up for that in any way).

4) will need to be hotplugged after the migration has finished *and* after the guest CPUs have been restarted on the destination.

While it will be possible to assure that there is a destination device, and to replace the old device with new in the config (and maybe, either with some major reworking of device assignment code, or offloading the responsibility to the management application(s), possible to re-bind the device to the vfio-pci driver), prior to marking the migration as "successful" (thus committing to running it on the destination), we can't say as much for actually assigning the device. So if the assignment fails, then what happens?

So a few issues I see that will need to be solved by [someone] (apparently either libvirt or management):

a) there isn't anything in libvirt's XML grammar that allows us to signify a device that is "present in the config but shouldn't be included in the commandline"

b) someone will need to replace the device from the source with an equivalent device on the destination in the libvirt XML. There are other cases of management modifying the XML during migration (I think), but this does point out that putting the "auto-unplug code into qemu isn't turning this into a trivial

c) there is nothing in libvirt's migration logic that can cause a device to be re-bound to vfio-pci prior to completion of a migration. Unless this is added to libvirt (or the re-bind operation is passed off to the management application), we will need to live with the possibility that hotplugging the device will fail due to failed re-bind *after* we've committed to the migration.

d) once the guest CPUs are restarted on the destination, [someone] (libvirt or management) needs to hotplug the new device on the destination. (I'm guessing that a hotplug can only be done while the guest CPUs are running; correct me if this is wrong!)
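
To make the re-bind in (c) concrete: on the host it comes down to a few sysfs writes, roughly as in the sketch below. This is only an illustration using the kernel's `driver_override` and `drivers_probe` interfaces; the function name and the `sysfs_root` parameter (present so the code can be pointed at a scratch directory) are hypothetical and not something libvirt provides. The open question in the list above is who performs these writes and when, not how.

```python
import os


def rebind_to_vfio_pci(bdf, sysfs_root="/sys"):
    """Detach a PCI device from its current driver and bind it to vfio-pci.

    bdf is a full PCI address such as "0000:5e:00.2". On a real host this
    needs root and the vfio-pci module must already be loaded. Any failed
    write raises OSError, which the caller must treat as a failed re-bind.
    """
    dev_dir = os.path.join(sysfs_root, "bus", "pci", "devices", bdf)

    # 1. Unbind from whatever driver (e.g. the VF net driver) owns it now.
    unbind = os.path.join(dev_dir, "driver", "unbind")
    if os.path.exists(unbind):
        with open(unbind, "w") as f:
            f.write(bdf)

    # 2. Tell the PCI core that only vfio-pci may claim this device.
    with open(os.path.join(dev_dir, "driver_override"), "w") as f:
        f.write("vfio-pci")

    # 3. Ask the PCI core to re-probe the device, which binds it to vfio-pci.
    with open(os.path.join(sysfs_root, "bus", "pci", "drivers_probe"), "w") as f:
        f.write(bdf)
```

If this were done before the migration is marked successful, a failure could still abort the migration instead of surfacing as a failed hotplug on the destination afterwards.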

This sounds like a lot of complexity for something that was supposed to be handled completely/transparently by qemu :-P.
