On Wed, Nov 04, 2020 at 11:32:34AM +0800, Jason Wang wrote:
>
> On 2020/11/3 8:15 PM, Stefan Hajnoczi wrote:
> > On Tue, Nov 03, 2020 at 04:46:53PM +0800, Jason Wang wrote:
> > > On 2020/11/2 7:11 PM, Stefan Hajnoczi wrote:
> > > > There is discussion about VFIO migration in the "Re: Out-of-Process
> > > > Device Emulation session at KVM Forum 2020" thread. The current status
> > > > is that Kirti proposed a VFIO device region type for saving and loading
> > > > device state. There is currently no guidance on migrating between
> > > > different device versions or device implementations from different
> > > > vendors. This is known to be non-trivial and raised discussion about
> > > > whether it should really be handled by VFIO or centralized in QEMU.
> > > >
> > > > Below is a document that describes how to ensure migration compatibility
> > > > in VFIO. It does not require changes to the VFIO migration interface. It
> > > > can be used for both VFIO/mdev kernel devices and vfio-user devices.
> > > >
> > > > The idea is that the device state blob is opaque to the VMM but the same
> > > > level of migration compatibility that exists today is still available.
> > >
> > > So if we can't mandate this or there's no way to validate this, vendors
> > > are still free to implement their own protocols, which could lead to a
> > > large maintenance burden.
> > Yes, the device state representation is their responsibility. We can't
> > do that for them since they define the hardware interface and internal
> > state.
> >
> > As Michael and Paolo have mentioned in the other thread, we can provide
> > guidelines and standardize common aspects.
> >
> > > > Migration can fail if loading the device state is not possible. It
> > > > should fail early with a clear error message. It must not appear to
> > > > complete but leave the device inoperable due to a migration problem.
> > >
> > > For VFIO-user, how does management know that a VM can be migrated from
> > > src to dst?
> > > For kernel, we have sysfs.
> > vfio-user devices will normally be instantiated in one of two ways:
> >
> > 1. Launching a device backend and passing command-line parameters:
> >
> >      $ my-nic --socket-path /tmp/my-nic-vfio-user.sock \
> >               --model https://vendor-a.com/my-nic \
> >               --rss on
> >
> >    Here "model" is the device model URL. The program could support
> >    multiple device models.
> >
> >    The "rss" device configuration parameter enables Receive Side Scaling
> >    (RSS) as an example of a configuration parameter.
> >
> > 2. Creating a device using an RPC interface:
> >
> >      (qemu) device-add my-nic,rss=on
> >
> > If the device instantiation succeeds then it is safe to live migrate.
> > The device is exposing the desired hardware interface and expecting the
> > right device state representation.
>
> Does this mean there will still be a "my-nic" stub in qemu? (I thought it
> should be a generic one like device-add "vfio-user-pci")

No, sorry for the confusing example. I was thinking of qemu-storage-daemon
or multi-process QEMU where devices could be configured over a QMP/HMP
monitor. The device happens to be implemented in the QEMU codebase but the
VMM doesn't need a stub device.

A D-Bus or gRPC example would have been clearer because it's not associated
with a VMM.

> >
> > > > The rest of this document describes how these requirements can be met.
> > > >
> > > > Device Models
> > > > -------------
> > > > Devices have a *hardware interface* consisting of hardware registers,
> > > > interrupts, and so on.
> > > >
> > > > The hardware interface together with the device state representation
> > > > is called a *device model*. Device models can be assigned URIs such as
> > > > https://qemu.org/devices/e1000e to uniquely identify them.
> > >
> > > It looks worse than "pci://vendor_id.device_id.subvendor_id.subdevice_id".
> > > "e1000e" means a lot of different 8275X implementations that have
> > > subtle but easily overlooked differences.
> > If you wish to reflect those differences in the device model URI then
> > you can use:
> >
> >   https://qemu.org/devices/pci/<vendor-id>/<device-id>/<subvendor-id>/<subdevice-id>
> >
> > Another option is to use device configuration parameters to express
> > differences.
> >
> > The important thing is that this device model URI has one owner. No one
> > else will use qemu.org. There can be many different e1000e device model
> > URIs, if necessary (with slightly different hardware interfaces and/or
> > device state representations). This avoids collisions.
> >
> > > And is it possible to have a list of URIs here?
> > A device implementation (mdev driver, vfio-user device backend, etc) may
> > support multiple device model URIs.
> >
> > A device instance has an immutable device model URI and list of
> > configuration parameters. In other words, once the device is created its
> > ABI is fixed for the lifetime of the device.
> > A new device instance can be configured by powering off the machine,
> > hotplug, etc.
> >
> > > > Multiple implementations of a device model may exist. They are
> > > > interchangeable if they follow the same hardware interface and device
> > > > state representation.
> > > >
> > > > Multiple implementations of the same hardware interface may exist with
> > > > different device state representations, in which case the device
> > > > models are not interchangeable and must be assigned different URIs.
> > > >
> > > > Migration is only possible when the same device model is supported by
> > > > the *source* and the *destination* devices.
> > > >
> > > > Device Configuration
> > > > --------------------
> > > > Device models may have parameters that affect the hardware interface
> > > > or device state representation. For example, a network card may have a
> > > > configurable address filtering table size parameter called
> > > > ``rx-filter-size``. A device state saved with ``rx-filter-size=32``
> > > > cannot be safely loaded into a device with ``rx-filter-size=0``,
> > > > because changing the size from 32 to 0 may disrupt device operation.
> > >
> > > Do we allow the migration from "rx-filter-size=16" to
> > > "rx-filter-size=32" (I guess not?) And should we extend the concept to
> > > "device capability" instead of just state representation? E.g. src has
> > > CAP_X=on,CAP_Y=off but dst has CAP_X=on,CAP_Y=on, so we disallow the
> > > migration from src to dst.
> > A device instance's configuration parameters are immutable.
> > rx-filter-size=16 cannot be migrated to rx-filter-size=32.
>
> But then it looks to me we can't migrate back, or do you mean it is required
> to have the ability to change the max rx-filter-size.

We can migrate a device with rx-filter-size=16 from old -> new if the new
device implementation supports rx-filter-size=16.
We can migrate back to the old device implementation because it supports
rx-filter-size=16.

If you want to change the configuration parameters then a new device must
be instantiated during poweroff or hotplug. This is how rx-filter-size=16
can be changed to rx-filter-size=32, but it must be done explicitly
(configuration parameters don't change across migration).

> > Yes, configuration parameters can describe capabilities. I think of
> > capabilities as something that affects the guest-visible hardware
> > interface (e.g. the RSS feature bit is enabled) that is mentioned in the
> > text, but it would be clearer to mention them explicitly.
> >
> > > > A list of configuration parameters is called the *device
> > > > configuration*. Migration is expected to succeed when the same device
> > > > model and configuration that was used for saving the device state is
> > > > used again to load it.
> > > >
> > > > Note that not all parameters used to instantiate a device need to be
> > > > part of the device configuration. For example, assigning a network
> > > > card to a specific physical port is not part of the device
> > > > configuration since it is not part of the device's hardware interface
> > > > or the device state representation.
> > >
> > > Yes, but the task needs to be done by management somehow. So do you
> > > expect a vendor specific provisioning API here?
> > There seems to be no consensus on this yet. It's the question of how to
> > manage the lifecycle of VFIO, mdev, vhost-user, and vfio-user devices.
> > There are attempts to standardize in some of these areas.
> >
> > For mdev drivers we can standardize the sysfs interface so management
> > tools can query source devices and instantiate destination devices
> > without device-specific code.
>
> Even for mdev, it should have some class defined for sysfs which could be
> a standard way to configure NVME or virtio device.

Discussion on the mdev sysfs interface has started in the sub-thread with
Alex Williamson.

> > The problem with subsection semantics is that they break rollback. Once
> > the old device state has been loaded by the new device implementation,
> > saving the device state produces the new device state representation.
> > The old device implementation can no longer load it :(.
>
> Only when subsection is needed.

Good point. Most rollback migrations still work; only the ones that
introduce new subsections fail.

> > Manual
> > intervention is necessary to tell the new device implementation to save
> > in the old representation.
>
> If we don't support subsection, could we end up with a deadlock like we do
> migration since want upgrade the kernel, but if we don't upgrade the kernel,
> we can't do live migration.

Can you explain in more detail? I think the approach described in this
document works, except it requires manual intervention to change device
configuration parameters whereas subsections are automatically applied by
the new QEMU.
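
To make the compatibility rule concrete, here is a minimal sketch in Python
of the check a management tool could run before starting a migration. It is
only an illustration under stated assumptions, not part of the proposal: the
names DeviceInstance, make_instance, check_compatible and MigrationError are
hypothetical, and it assumes the tool can obtain each device instance's
device model URI and device configuration by some means (how it does that is
exactly the open lifecycle question discussed above).

```python
from dataclasses import dataclass


class MigrationError(Exception):
    """Raised before migration starts if the device state cannot be loaded."""


@dataclass(frozen=True)
class DeviceInstance:
    """Hypothetical record of a device instance.

    frozen=True mirrors the rule that the device model URI and the
    configuration parameters are fixed for the lifetime of the instance.
    """
    model_uri: str
    config: tuple  # sorted (name, value) pairs


def make_instance(model_uri, params):
    # Store parameters as a sorted tuple so ordering does not affect equality.
    return DeviceInstance(model_uri, tuple(sorted(params.items())))


def check_compatible(source, destination):
    """Fail early with a clear message instead of leaving a broken device."""
    if source.model_uri != destination.model_uri:
        raise MigrationError(
            f"device model mismatch: {source.model_uri} "
            f"!= {destination.model_uri}")
    if source.config != destination.config:
        raise MigrationError(
            f"device configuration mismatch: {dict(source.config)} "
            f"!= {dict(destination.config)}")


# rx-filter-size=16 on both sides: old -> new and new -> old both pass.
old = make_instance("https://qemu.org/devices/e1000e", {"rx-filter-size": 16})
new = make_instance("https://qemu.org/devices/e1000e", {"rx-filter-size": 16})
check_compatible(old, new)
check_compatible(new, old)
```

Calling check_compatible() with a destination created with
rx-filter-size=32 raises MigrationError, which matches the point that
parameters only change by explicitly instantiating a new device.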