Re: device compatibility interface for live migration with assigned devices

2020-07-15 Thread Alex Xu
Yan Zhao wrote on Wed, Jul 15, 2020 at 4:32 PM:

> On Tue, Jul 14, 2020 at 02:59:48PM -0600, Alex Williamson wrote:
> > On Tue, 14 Jul 2020 18:19:46 +0100
> > "Dr. David Alan Gilbert"  wrote:
> >
> > > * Alex Williamson (alex.william...@redhat.com) wrote:
> > > > On Tue, 14 Jul 2020 11:21:29 +0100
> > > > Daniel P. Berrangé  wrote:
> > > >
> > > > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:
> > > > > > hi folks,
> > > > > > we are defining a device migration compatibility interface that
> helps upper
> > > > > > layer stack like openstack/ovirt/libvirt to check if two devices
> are
> > > > > > live migration compatible.
> > > > > > The "devices" here could be MDEVs, physical devices, or hybrid
> of the two.
> > > > > > e.g. we could use it to check whether
> > > > > > - a src MDEV can migrate to a target MDEV,
> > > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > > > > > - a src MDEV can migration to a target VF in SRIOV.
> > > > > >   (e.g. SIOV/SRIOV backward compatibility case)
> > > > > >
> > > > > > The upper layer stack could use this interface as the last step
> to check
> > > > > > if one device is able to migrate to another device before
> triggering a real
> > > > > > live migration procedure.
> > > > > > we are not sure if this interface is of value or help to you.
> please don't
> > > > > > hesitate to drop your valuable comments.
> > > > > >
> > > > > >
> > > > > > (1) interface definition
> > > > > > The interface is defined in below way:
> > > > > >
> > > > > > >                __    userspace
> > > > > > >                 /\              \
> > > > > > >                /                 \write
> > > > > > >               / read              \
> > > > > > >      ________/__________       ___\|/_____________
> > > > > > >     | migration_version |     | migration_version |-->check migration
> > > > > > >     ---------------------     ---------------------   compatibility
> > > > > > >        device A                    device B
> > > > > >
> > > > > >
> > > > > > a device attribute named migration_version is defined under each
> device's
> > > > > > sysfs node. e.g.
> (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version).
> > > > > > userspace tools read the migration_version as a string from the
> source device,
> > > > > > and write it to the migration_version sysfs attribute in the
> target device.
> > > > > >
> > > > > > The userspace should treat ANY of below conditions as two
> devices not compatible:
> > > > > > - any one of the two devices does not have a migration_version
> attribute
> > > > > > - error when reading from migration_version attribute of one
> device
> > > > > > - error when writing migration_version string of one device to
> > > > > >   migration_version attribute of the other device
> > > > > >
> > > > > > The string read from migration_version attribute is defined by
> device vendor
> > > > > > driver and is completely opaque to the userspace.
> > > > > > for a Intel vGPU, string format can be defined like
> > > > > > "parent device PCI ID" + "version of gvt driver" + "mdev type" +
> "aggregator count".
> > > > > >
> > > > > > for an NVMe VF connecting to a remote storage. it could be
> > > > > > "PCI ID" + "driver version" + "configured remote storage URL"
> > > > > >
> > > > > > for a QAT VF, it may be
> > > > > > "PCI ID" + "driver version" + "supported encryption set".
> > > > > >
> > > > > > (to avoid namespace confliction from each vendor, we may prefix
> a driver name to
> > > > > > each migration_version string. e.g.
> i915-v1-8086-591d-i915-GVTg_V5_8-1)
> > > >
> > > > It's very strange to define it as opaque and then proceed to describe
> > > > the contents of that opaque string.  The point is that its contents
> > > > are defined by the vendor driver to describe the device, driver
> version,
> > > > and possibly metadata about the configuration of the device.  One
> > > > instance of a device might generate a different string from another.
> > > > The string that a device produces is not necessarily the only string
> > > > the vendor driver will accept, for example the driver might support
> > > > backwards compatible migrations.
> > >
> > > (As I've said in the previous discussion, off one of the patch series)
> > >
> > > My view is it makes sense to have a half-way house on the opaqueness of
> > > this string; I'd expect to have an ID and version that are human
> > > readable, maybe a device ID/name that's human interpretable and then a
> > > bunch of other cruft that maybe device/vendor/version specific.
> > >
> > > I'm thinking that we want to be able to report problems and include the
> > > string and the user to be able to easily identify the device that was
> > > complaining and notice a difference in versions, and perhaps also use
> > > it in compatibility patterns to find compatible hosts; but that does
> > > get tricky when it's a 'ask the device if it's compatible'.
> >
> > In the reply I just sent to Dan, I gave this example of what a
> > 

Re: device compatibility interface for live migration with assigned devices

2020-07-15 Thread Alex Xu
Alex Williamson wrote on Wed, Jul 15, 2020 at 5:00 AM:

> On Tue, 14 Jul 2020 18:19:46 +0100
> "Dr. David Alan Gilbert"  wrote:
>
> > * Alex Williamson (alex.william...@redhat.com) wrote:
> > > On Tue, 14 Jul 2020 11:21:29 +0100
> > > Daniel P. Berrangé  wrote:
> > >
> > > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:
> > > > > hi folks,
> > > > > we are defining a device migration compatibility interface that
> helps upper
> > > > > layer stack like openstack/ovirt/libvirt to check if two devices
> are
> > > > > live migration compatible.
> > > > > The "devices" here could be MDEVs, physical devices, or hybrid of
> the two.
> > > > > e.g. we could use it to check whether
> > > > > - a src MDEV can migrate to a target MDEV,
> > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > > > > - a src MDEV can migration to a target VF in SRIOV.
> > > > >   (e.g. SIOV/SRIOV backward compatibility case)
> > > > >
> > > > > The upper layer stack could use this interface as the last step to
> check
> > > > > if one device is able to migrate to another device before
> triggering a real
> > > > > live migration procedure.
> > > > > we are not sure if this interface is of value or help to you.
> please don't
> > > > > hesitate to drop your valuable comments.
> > > > >
> > > > >
> > > > > (1) interface definition
> > > > > The interface is defined in below way:
> > > > >
> > > > > >                __    userspace
> > > > > >                 /\              \
> > > > > >                /                 \write
> > > > > >               / read              \
> > > > > >      ________/__________       ___\|/_____________
> > > > > >     | migration_version |     | migration_version |-->check migration
> > > > > >     ---------------------     ---------------------   compatibility
> > > > > >        device A                    device B
> > > > >
> > > > >
> > > > > a device attribute named migration_version is defined under each
> device's
> > > > > sysfs node. e.g.
> (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version).
> > > > > userspace tools read the migration_version as a string from the
> source device,
> > > > > and write it to the migration_version sysfs attribute in the
> target device.
> > > > >
> > > > > The userspace should treat ANY of below conditions as two devices
> not compatible:
> > > > > - any one of the two devices does not have a migration_version
> attribute
> > > > > - error when reading from migration_version attribute of one device
> > > > > - error when writing migration_version string of one device to
> > > > >   migration_version attribute of the other device
> > > > >
> > > > > The string read from migration_version attribute is defined by
> device vendor
> > > > > driver and is completely opaque to the userspace.
> > > > > for a Intel vGPU, string format can be defined like
> > > > > "parent device PCI ID" + "version of gvt driver" + "mdev type" +
> "aggregator count".
> > > > >
> > > > > for an NVMe VF connecting to a remote storage. it could be
> > > > > "PCI ID" + "driver version" + "configured remote storage URL"
> > > > >
> > > > > for a QAT VF, it may be
> > > > > "PCI ID" + "driver version" + "supported encryption set".
> > > > >
> > > > > (to avoid namespace confliction from each vendor, we may prefix a
> driver name to
> > > > > each migration_version string. e.g.
> i915-v1-8086-591d-i915-GVTg_V5_8-1)
> > >
> > > It's very strange to define it as opaque and then proceed to describe
> > > the contents of that opaque string.  The point is that its contents
> > > are defined by the vendor driver to describe the device, driver
> version,
> > > and possibly metadata about the configuration of the device.  One
> > > instance of a device might generate a different string from another.
> > > The string that a device produces is not necessarily the only string
> > > the vendor driver will accept, for example the driver might support
> > > backwards compatible migrations.
> >
> > (As I've said in the previous discussion, off one of the patch series)
> >
> > My view is it makes sense to have a half-way house on the opaqueness of
> > this string; I'd expect to have an ID and version that are human
> > readable, maybe a device ID/name that's human interpretable and then a
> > bunch of other cruft that maybe device/vendor/version specific.
> >
> > I'm thinking that we want to be able to report problems and include the
> > string and the user to be able to easily identify the device that was
> > complaining and notice a difference in versions, and perhaps also use
> > it in compatibility patterns to find compatible hosts; but that does
> > get tricky when it's a 'ask the device if it's compatible'.
>
> In the reply I just sent to Dan, I gave this example of what a
> "compatibility string" might look like represented as json:
>
> {
>   "device_api": "vfio-pci",
>   "vendor": "vendor-driver-name",
>   "version": {
> "major": 0,
> "minor": 1
>   },
>

The OpenStack Placement service doesn't support 

Re: device compatibility interface for live migration with assigned devices

2020-07-15 Thread Alex Xu
Alex Williamson wrote on Wed, Jul 15, 2020 at 12:16 AM:

> On Tue, 14 Jul 2020 11:21:29 +0100
> Daniel P. Berrangé  wrote:
>
> > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:
> > > hi folks,
> > > we are defining a device migration compatibility interface that helps
> upper
> > > layer stack like openstack/ovirt/libvirt to check if two devices are
> > > live migration compatible.
> > > The "devices" here could be MDEVs, physical devices, or hybrid of the
> two.
> > > e.g. we could use it to check whether
> > > - a src MDEV can migrate to a target MDEV,
> > > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > > - a src MDEV can migration to a target VF in SRIOV.
> > >   (e.g. SIOV/SRIOV backward compatibility case)
> > >
> > > The upper layer stack could use this interface as the last step to
> check
> > > if one device is able to migrate to another device before triggering a
> real
> > > live migration procedure.
> > > we are not sure if this interface is of value or help to you. please
> don't
> > > hesitate to drop your valuable comments.
> > >
> > >
> > > (1) interface definition
> > > The interface is defined in below way:
> > >
> > > >                __    userspace
> > > >                 /\              \
> > > >                /                 \write
> > > >               / read              \
> > > >      ________/__________       ___\|/_____________
> > > >     | migration_version |     | migration_version |-->check migration
> > > >     ---------------------     ---------------------   compatibility
> > > >        device A                    device B
> > >
> > >
> > > a device attribute named migration_version is defined under each
> device's
> > > sysfs node. e.g.
> (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version).
> > > userspace tools read the migration_version as a string from the source
> device,
> > > and write it to the migration_version sysfs attribute in the target
> device.
> > >
> > > The userspace should treat ANY of below conditions as two devices not
> compatible:
> > > - any one of the two devices does not have a migration_version
> attribute
> > > - error when reading from migration_version attribute of one device
> > > - error when writing migration_version string of one device to
> > >   migration_version attribute of the other device
> > >
> > > The string read from migration_version attribute is defined by device
> vendor
> > > driver and is completely opaque to the userspace.
> > > for a Intel vGPU, string format can be defined like
> > > "parent device PCI ID" + "version of gvt driver" + "mdev type" +
> "aggregator count".

> >
> > > for an NVMe VF connecting to a remote storage. it could be
> > > "PCI ID" + "driver version" + "configured remote storage URL"
>

If the "configured remote storage URL" is a configuration setting that is
applied before the device is used, then it isn't something we need for the
migration compatibility check. OpenStack only needs to know that the target
device's driver and hardware are compatible for migration. The scheduler
will then choose a host that has such a device, and OpenStack will
pre-configure the target host and target device before the migration, which
includes configuring the correct remote storage URL on the device. If we
want, we can do a sanity check after the live migration with the os.
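
For reference, the read-then-write check from the quoted proposal could be
sketched roughly as below. This is only an illustration under assumptions:
migration_version is a proposed attribute, not an existing kernel
interface, and the function name and directory arguments are hypothetical.
Any read or write error is treated as "not compatible", as the proposal
asks.

```python
import os


def is_migration_compatible(src_dev_dir, dst_dev_dir):
    """Hypothetical helper: True only if the handshake succeeds.

    src_dev_dir and dst_dev_dir are the devices' sysfs directories
    (assumed layout, taken from the quoted proposal).
    """
    src_attr = os.path.join(src_dev_dir, "migration_version")
    dst_attr = os.path.join(dst_dev_dir, "migration_version")
    try:
        # Read the opaque string from the source device. A missing
        # attribute raises FileNotFoundError, which is an OSError.
        with open(src_attr, "r") as f:
            version = f.read()
        # Write it unchanged to the target device. No O_CREAT, so a
        # missing target attribute also fails. Under the proposal the
        # vendor driver would reject an incompatible string with a
        # write error.
        fd = os.open(dst_attr, os.O_WRONLY)
        try:
            os.write(fd, version.encode())
        finally:
            os.close(fd)
    except OSError:
        return False
    return True
```

With this shape the caller never parses the string. Whether something like
the remote storage URL takes part in the check is then decided by what the
vendor driver puts into the string and accepts on write.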


> > >
> > > for a QAT VF, it may be
> > > "PCI ID" + "driver version" + "supported encryption set".
> > >
> > > (to avoid namespace confliction from each vendor, we may prefix a
> driver name to
> > > each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
>
> It's very strange to define it as opaque and then proceed to describe
> the contents of that opaque string.  The point is that its contents
> are defined by the vendor driver to describe the device, driver version,
> and possibly metadata about the configuration of the device.  One
> instance of a device might generate a different string from another.
> The string that a device produces is not necessarily the only string
> the vendor driver will accept, for example the driver might support
> backwards compatible migrations.
>
> > > (2) backgrounds
> > >
> > > The reason we hope the migration_version string is opaque to the
> userspace
> > > is that it is hard to generalize standard comparing fields and
> comparing
> > > methods for different devices from different vendors.
> > > Though userspace now could still do a simple string compare to check if
> > > two devices are compatible, and result should also be right, it's still
> > > too limited as it excludes the possible candidate whose
> migration_version
> > > string fails to be equal.
> > > e.g. an MDEV with mdev_type_1, aggregator count 3 is probably
> compatible
> > > with another MDEV with mdev_type_3, aggregator count 1, even their
> > > migration_version strings are not equal.
> > > (assumed mdev_type_3 is of 3 times equal resources of mdev_type_1).
> > >
> > > besides that, driver version + configured resources are all elements
> demanding
> > >