On Sat, Mar 30, 2019 at 10:14:07PM +0800, Alex Williamson wrote: > On Fri, 29 Mar 2019 19:10:50 -0400 > Zhao Yan <yan.y.z...@intel.com> wrote: > > > On Fri, Mar 29, 2019 at 10:26:39PM +0800, Alex Williamson wrote: > > > On Thu, 28 Mar 2019 22:47:04 -0400 > > > Zhao Yan <yan.y.z...@intel.com> wrote: > > > > > > > On Fri, Mar 29, 2019 at 12:04:31AM +0800, Alex Williamson wrote: > > > > > On Thu, 28 Mar 2019 10:21:38 +0100 > > > > > Erik Skultety <eskul...@redhat.com> wrote: > > > > > > > > > > > On Thu, Mar 28, 2019 at 04:36:03AM -0400, Zhao Yan wrote: > > > > > > > hi Alex and Dave, > > > > > > > Thanks for your replies. > > > > > > > Please see my comments inline. > > > > > > > > > > > > > > On Thu, Mar 28, 2019 at 06:10:20AM +0800, Alex Williamson wrote: > > > > > > > > > > > > > > > On Wed, 27 Mar 2019 20:18:54 +0000 > > > > > > > > "Dr. David Alan Gilbert" <dgilb...@redhat.com> wrote: > > > > > > > > > > > > > > > > > * Zhao Yan (yan.y.z...@intel.com) wrote: > > > > > > > > > > On Wed, Feb 20, 2019 at 07:42:42PM +0800, Cornelia Huck > > > > > > > > > > wrote: > > > > > > > > > > > > > > > b) How do we detect if we're migrating from/to > > > > > > > > > > > > > > > the wrong device or > > > > > > > > > > > > > > > version of device? Or say to a device with older > > > > > > > > > > > > > > > firmware or perhaps > > > > > > > > > > > > > > > a device that has less device memory ? > > > > > > > > > > > > > > Actually it's still an open for VFIO migration. > > > > > > > > > > > > > > Need to think about > > > > > > > > > > > > > > whether it's better to check that in libvirt or > > > > > > > > > > > > > > qemu (like a device magic > > > > > > > > > > > > > > along with verion ?). > > > > > > > > > > > > > > > > > > > > > > > > We must keep the hardware generation is the same with > > > > > > > > > > > > one POD of public cloud > > > > > > > > > > > > providers. But we still think about the live migration > > > > > > > > > > > > between from the the lower > > > > > > > > > > > > generation of hardware migrated to the higher > > > > > > > > > > > > generation. > > > > > > > > > > > > > > > > > > > > > > Agreed, lower->higher is the one direction that might > > > > > > > > > > > make sense to > > > > > > > > > > > support. > > > > > > > > > > > > > > > > > > > > > > But regardless of that, I think we need to make sure that > > > > > > > > > > > incompatible > > > > > > > > > > > devices/versions fail directly instead of failing in a > > > > > > > > > > > subtle, hard to > > > > > > > > > > > debug way. Might be useful to do some initial sanity > > > > > > > > > > > checks in libvirt > > > > > > > > > > > as well. > > > > > > > > > > > > > > > > > > > > > > How easy is it to obtain that information in a form that > > > > > > > > > > > can be > > > > > > > > > > > consumed by higher layers? Can we find out the device > > > > > > > > > > > type at least? > > > > > > > > > > > What about some kind of revision? > > > > > > > > > > hi Alex and Cornelia > > > > > > > > > > for device compatibility, do you think it's a good idea to > > > > > > > > > > use "version" > > > > > > > > > > and "device version" fields? > > > > > > > > > > > > > > > > > > > > version field: identify live migration interface's version. > > > > > > > > > > it can have a > > > > > > > > > > sort of backward compatibility, like target machine's > > > > > > > > > > version >= source > > > > > > > > > > machine's version. something like that. > > > > > > > > > > > > > > > > Don't we essentially already have this via the device specific > > > > > > > > region? > > > > > > > > The struct vfio_info_cap_header includes id and version fields, > > > > > > > > so we > > > > > > > > can declare a migration id and increment the version for any > > > > > > > > incompatible changes to the protocol. > > > > > > > yes, good idea! > > > > > > > so, what about declaring below new cap? > > > > > > > #define VFIO_REGION_INFO_CAP_MIGRATION 4 > > > > > > > struct vfio_region_info_cap_migration { > > > > > > > struct vfio_info_cap_header header; > > > > > > > __u32 device_version_len; > > > > > > > __u8 device_version[]; > > > > > > > }; > > > > > > > > > > I'm not sure why we need a new region for everything, it seems this > > > > > could fit within the protocol of a single region. This could simply > > > > > be > > > > > a new action to retrieve the version where the protocol would report > > > > > the number of bytes available, just like the migration stream itself. > > > > > > > > > so, to get version of VFIO live migration device state interface (simply > > > > call it migration interface?), > > > > a new cap looks like this: > > > > #define VFIO_REGION_INFO_CAP_MIGRATION 4 > > > > it contains struct vfio_info_cap_header only. > > > > when get region info of the migration region, we query this cap and get > > > > migration interface's version. right? > > > > > > > > or just directly use VFIO_REGION_INFO_CAP_TYPE is fine? > > > > > > Again, why a new region. I'm imagining we have one region and this is > > > just asking for a slightly different thing from it. But TBH, I'm not > > > sure we need it at all vs the sysfs interface. > > > > > > > > > > > > > device_version field consists two parts: > > > > > > > > > > 1. vendor id : it takes 32 bits. e.g. 0x8086. > > > > > > > > > > > > > > > > Who allocates IDs? If we're going to use PCI vendor IDs, then > > > > > > > > I'd > > > > > > > > suggest we use a bit to flag it as such so we can reserve that > > > > > > > > portion > > > > > > > > of the 32bit address space. See for example: > > > > > > > > > > > > > > > > #define VFIO_REGION_TYPE_PCI_VENDOR_TYPE (1 << 31) > > > > > > > > #define VFIO_REGION_TYPE_PCI_VENDOR_MASK (0xffff) > > > > > > > > > > > > > > > > For vendor specific regions. > > > > > > > Yes, use PCI vendor ID. > > > > > > > you are right, we need to use highest bit > > > > > > > (VFIO_REGION_TYPE_PCI_VENDOR_TYPE) > > > > > > > to identify it's a PCI ID. > > > > > > > Thanks for pointing it out. > > > > > > > But, I have a question. what is VFIO_REGION_TYPE_PCI_VENDOR_MASK > > > > > > > used for? > > > > > > > why it's 0xffff? I searched QEMU and kernel code and did not find > > > > > > > anywhere > > > > > > > uses it. > > > > > > > > > > PCI vendor IDs are 16bits, it's just indicating that when the > > > > > PCI_VENDOR_TYPE bit is set the valid data is the lower 16bits. > > > > > > > > thanks:) > > > > > > > > > > > > > > 2. vendor proprietary string: it can be any string that a > > > > > > > > > > vendor driver > > > > > > > > > > thinks can identify a source device. e.g. pciid + mdev type. > > > > > > > > > > "vendor id" is to avoid overlap of "vendor proprietary > > > > > > > > > > string". > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > struct vfio_device_state_ctl { > > > > > > > > > > __u32 version; /* ro */ > > > > > > > > > > __u8 device_version[MAX_DEVICE_VERSION_LEN]; > > > > > > > > > > /* ro */ > > > > > > > > > > struct { > > > > > > > > > > __u32 action; /* GET_BUFFER, SET_BUFFER, > > > > > > > > > > IS_COMPATIBLE*/ > > > > > > > > > > ... > > > > > > > > > > }data; > > > > > > > > > > ... > > > > > > > > > > }; > > > > > > > > > > > > > > > > We have a buffer area where we can read and write data from the > > > > > > > > vendor > > > > > > > > driver, why would we have this IS_COMPATIBLE protocol to write > > > > > > > > the > > > > > > > > device version string but use a static fixed length version > > > > > > > > string in > > > > > > > > the control header to read it? IOW, let's use GET_VERSION, > > > > > > > > CHECK_VERSION actions that make use of the buffer area and > > > > > > > > allow vendor > > > > > > > > specific version information length. > > > > > > > you are right, such static fixed length version string is bad :) > > > > > > > To get device version, do you think which approach below is > > > > > > > better? > > > > > > > 1. use GET_VERSION action, and read from region buffer > > > > > > > 2. get it when querying cap VFIO_REGION_INFO_CAP_MIGRATION > > > > > > > > > > > > > > seems approach 1 is better, and cap > > > > > > > VFIO_REGION_INFO_CAP_MIGRATION is only > > > > > > > for checking migration interface's version? > > > > > > > > > > I think 1 provides the most flexibility to the vendor driver. > > > > > > > > Got it. > > > > For VFIO live migration, compared to reuse device state region (which > > > > takes > > > > GET_BUFFER/SET_BUFFER actions), > > > > could we create a new region for GET_VERSION & CHECK_VERSION ? > > > > > > Why? > > > > > > > > > > > > > Then, an action IS_COMPATIBLE is added to check device > > > > > > > > > > compatibility. > > > > > > > > > > > > > > > > > > > > The flow to figure out whether a source device is > > > > > > > > > > migratable to target device > > > > > > > > > > is like that: > > > > > > > > > > 1. in source side's .save_setup, save source device's > > > > > > > > > > device_version string > > > > > > > > > > 2. in target side's .load_state, load source device's > > > > > > > > > > device version string > > > > > > > > > > and write it to data region, and call IS_COMPATIBLE action > > > > > > > > > > to ask vendor driver > > > > > > > > > > to check whether the source device is compatible to it. > > > > > > > > > > > > > > > > > > > > The advantage of adding an IS_COMPATIBLE action is that, > > > > > > > > > > vendor driver can > > > > > > > > > > maintain a compatibility table and decide whether source > > > > > > > > > > device is compatible > > > > > > > > > > to target device according to its proprietary table. > > > > > > > > > > In device_version string, vendor driver only has to > > > > > > > > > > describe the source > > > > > > > > > > device as elaborately as possible and resorts to vendor > > > > > > > > > > driver in target side > > > > > > > > > > to figure out whether they are compatible. > > > > > > > > > > > > > > > > I agree, it's too complicated and restrictive to try to create > > > > > > > > an > > > > > > > > interface for the user to determine compatibility, let the > > > > > > > > driver > > > > > > > > declare it compatible or not. > > > > > > > :) > > > > > > > > > > > > > > > > It would also be good if the 'IS_COMPATIBLE' was somehow > > > > > > > > > callable > > > > > > > > > externally - so we could be able to answer a question like > > > > > > > > > 'can we > > > > > > > > > migrate this VM to this host' - from the management layer > > > > > > > > > before it > > > > > > > > > actually starts the migration. > > > > > > > > > > > > > > so qemu needs to expose two qmp/sysfs interfaces: GET_VERSION and > > > > > > > CHECK_VERSION. > > > > > > > GET_VERSION returns a vm's device's version string. > > > > > > > CHECK_VERSION's input is device version string and return > > > > > > > compatible/non-compatible. > > > > > > > Do you think it's good? > > > > > > > > > > That's the idea, but note that QEMU can only provide the QMP > > > > > interface, > > > > > the sysfs interface would of course be provided as more of a direct > > > > > path from the vendor driver or mdev kernel layer. > > > > > > > > > > > > > I think we'd need to mirror this capability in sysfs to support > > > > > > > > that, > > > > > > > > or create a qmp interface through QEMU that the device owner > > > > > > > > could make > > > > > > > > the request on behalf of the management layer. Getting access > > > > > > > > to the > > > > > > > > vfio device requires an iommu context that's already in use by > > > > > > > > the > > > > > > > > device owner, we have no intention of supporting a model that > > > > > > > > allows > > > > > > > > independent tasks access to a device. Thanks, > > > > > > > > Alex > > > > > > > > > > > > > > > do you think two sysfs nodes under a device node is ok? > > > > > > > e.g. > > > > > > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/get_version > > > > > > > /sys/devices/pci0000\:00/0000\:00\:02.0/882cc4da-dede-11e7-9180-078a62063ab1/check_version > > > > > > > > > > > > > > > > > I'd think it might live more in the mdev_support_types area, wouldn't > > > > > we ideally like to know if a device is compatible even before it's > > > > > created? For example maybe: > > > > > > > > > > /sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/i915-GVTg_V5_4/version > > > > > > > > > > Where reading the sysfs attribute returns the version string and > > > > > writing a string into the attribute return an errno for > > > > > incompatibility. > > > > yes, knowing if a device is compatible before it's created is good. > > > > but do you think check whether a device is compatible after it's > > > > created is > > > > also required? For live migration, user usually only queries this > > > > information > > > > when it's really required, i.e. when a device has been created. > > > > maybe we can add this version get/check at both places? > > > > > > Why does an instantiated device suddenly not follow the version and > > > compatibility rules of an uninstantiated device? IOW, if the version > > > and compatibility check are on the mdev type, why can't we trace back > > > from the device to the mdev type and make use of that same interface? > > > Seems the only question is whether we require an interface through the > > > vfio API directly or if sysfs is sufficient. > > ok. got it. > > > > > > > > Why do you need both sysfs and QMP at the same time? I can see it > > > > > > gives us some > > > > > > flexibility, but is there something more to that? > > > > > > > > > > > > Normally, I'd prefer a QMP interface from libvirt's perspective > > > > > > (with an > > > > > > appropriate capability that libvirt can check for QEMU support) > > > > > > because I imagine large nodes having a > > > > > > bunch of GPUs with different revisions which might not be backwards > > > > > > compatible. > > > > > > Libvirt might query the version string on source and check it on > > > > > > dest via the > > > > > > QMP in a way that QEMU, talking to the driver, would return either > > > > > > a list or a > > > > > > single physical device to which we can migrate, because neither > > > > > > QEMU nor > > > > > > libvirt know that, only the driver does, so that's an important > > > > > > information > > > > > > rather than looping through all the devices and trying to find one > > > > > > that is > > > > > > compatible. However, you might have a hard time making all the > > > > > > necessary > > > > > > changes in QMP introspectable, a new command would be fine, but if > > > > > > you also > > > > > > wanted to extend say vfio-pci options, IIRC that would not appear > > > > > > in the QAPI > > > > > > schema and libvirt would not be able to detect support for it. > > > > > > > > > > > > On the other hand, the presence of a QMP interface IMO doesn't help > > > > > > mgmt apps > > > > > > much, as it still carries the burden of being able to check this > > > > > > only at the > > > > > > time of migration, which e.g. OpenStack would like to know long > > > > > > before that. > > > > > > > > > > > > So, having sysfs attributes would work for both libvirt (even > > > > > > though libvirt > > > > > > would benefit from a QMP much more) and OpenStack. OpenStack would > > > > > > IMO then > > > > > > have to figure out how to create the mappings between compatible > > > > > > devices across > > > > > > several nodes which are non-uniform. > > > > > > > > > > Yep, vfio encompasses more than just QEMU, so a sysfs interface has > > > > > more > > > > > utility than a QMP interface. For instance we couldn't predetermine > > > > > if > > > > > an mdev type on a host is compatible if we need to first create the > > > > > device and launch a QEMU instance on it to get access to QMP. So > > > > > maybe > > > > > the question is whether we should bother with any sort of VFIO API to > > > > > do this comparison, perhaps only a sysfs interface is sufficient for a > > > > > complete solution. The downside of not having a version API in the > > > > > user interface might be that QEMU on its own can only try a migration > > > > > and see if it fails, it wouldn't have the ability to test expected > > > > > compatibility without access to sysfs. And maybe that's fine. > > > > > Thanks, > > > > > > > > > So QEMU vfio uses sysfs to check device compatiblity in migration's > > > > save_setup > > > > phase? > > > > > > The migration stream between source and target device are the ultimate > > > test of compatibility, the vendor driver should never rely on userspace > > > validating compatibility of the migration. At the point it could do so, > > > the > > > migration has already begun, so we're only testing how quickly we can > > > fail the migration. The management layer setting up the migration can > > > test via sysfs for compatibility and the migration stream itself needs > > > to be self validating, so what value is added for QEMU to perform a > > > version compatibility test? Thanks, > > oh, do you mean vendor driver should embed source device's version in > > migration > > stream, which is opaque to qemu? > > otherwise, I can't think of a quick way for vendor driver to determine > > whether > > source device is an incompatible device. > > Yes, the vendor driver cannot rely on the user to make sure the > incoming migration stream is compatible, the vendor driver must take > responsibility for this. Therefore, regardless of what other > interfaces we have for the user to test the compatibility between > devices, the vendor driver must make no assumptions about the validity > or integrity of the data stream. Plan for and protect against a > malicious or incompetent user. Thanks, > ok. got it. Thank you :)
Yan