On 2020/7/16 4:32 PM, Yan Zhao wrote:
On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote:
On 2020/7/14 7:29 AM, Yan Zhao wrote:
hi folks,
we are defining a device migration compatibility interface that helps upper
layer stack like openstack/ovirt/libvirt to check if two devices are
live migration compatible.
The "devices" here could be MDEVs, physical devices, or hybrid of the two.
e.g. we could use it to check whether
- a src MDEV can migrate to a target MDEV,
- a src VF in SRIOV can migrate to a target VF in SRIOV,
- a src MDEV can migrate to a target VF in SRIOV.
    (e.g. SIOV/SRIOV backward compatibility case)

The upper layer stack could use this interface as the last step to check
if one device is able to migrate to another device before triggering a real
live migration procedure.
we are not sure if this interface is of value or help to you. please don't
hesitate to drop your valuable comments.

(1) interface definition
The interface is defined in below way:

               __    userspace
                /\              \
               /                 \write
              / read              \
     ________/__________       ___\|/_____________
    | migration_version |     | migration_version |-->check migration
    ---------------------     ---------------------   compatibility
       device A                    device B

a device attribute named migration_version is defined under each device's
sysfs node. e.g. 

Are you aware of the devlink based device management interface that is
proposed upstream? I think it has many advantages over sysfs, have you
considered switching to that?
not familiar with devlink. will do some research on it.

userspace tools read the migration_version as a string from the source device,
and write it to the migration_version sysfs attribute in the target device.

The userspace should treat ANY of the conditions below as the two devices not being compatible:
- any one of the two devices does not have a migration_version attribute
- error when reading from migration_version attribute of one device
- error when writing migration_version string of one device to
    migration_version attribute of the other device
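
A minimal userspace sketch of the read-then-write check described above. The device directory arguments and the function name are hypothetical; only the attribute name migration_version comes from the proposal. Any missing attribute, read error, or write error is reported as "not compatible".

```python
import os

def is_migration_compatible(src_dev: str, dst_dev: str) -> bool:
    """Return True only if the target accepts the source's migration_version.

    src_dev and dst_dev are paths to the devices' sysfs directories
    (hypothetical arguments, not defined by the proposal).
    """
    src_attr = os.path.join(src_dev, "migration_version")
    dst_attr = os.path.join(dst_dev, "migration_version")
    try:
        # Read the opaque string; userspace never interprets its content.
        with open(src_attr, "rb") as f:
            version = f.read()
        # Open without O_CREAT so a missing attribute counts as a failure.
        fd = os.open(dst_attr, os.O_WRONLY)
        try:
            # The vendor driver is expected to fail this write if incompatible.
            os.write(fd, version)
        finally:
            os.close(fd)
    except OSError:
        return False
    return True
```

The write uses os.write directly so that an error returned by the driver surfaces at the call itself rather than at a later buffered flush.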

The string read from migration_version attribute is defined by device vendor
driver and is completely opaque to the userspace.

My understanding is that something opaque to userspace is not the philosophy
but the VFIO live migration in itself is essentially a big opaque stream to 

I think it's better not to limit the kernel interface to a specific use case. This is basically device introspection.

of Linux. Instead of having a generic API with an opaque value, why not do it in a
vendor specific way like:

1) exposing the device capability in a vendor specific way via sysfs/devlink
or other API
2) management reads the capability on both src and dst and determines whether we
can do the migration

This is the way we plan to do with vDPA.

yes, in another reply, Alex proposed to use an interface in json format.
I guess we can define something like

{ "self" :
     { "pciid" : "8086591d",
       "driver" : "i915",
       "gvt-version" : "v1",
       "mdev_type"   : "i915-GVTg_V5_2",
       "aggregator"  : "1",
       "pv-mode"     : "none"
     },
   "compatible" : [
     { "pciid" : "8086591d",
       "driver" : "i915",
       "gvt-version" : "v1",
       "mdev_type"   : "i915-GVTg_V5_2",
       "aggregator"  : "1",
       "pv-mode"     : "none"
     },
     { "pciid" : "8086591d",
       "driver" : "i915",
       "gvt-version" : "v1",
       "mdev_type"   : "i915-GVTg_V5_4",
       "aggregator"  : "2",
       "pv-mode"     : "none"
     },
     { "pciid" : "8086591d",
       "driver" : "i915",
       "gvt-version" : "v2",
       "mdev_type"   : "i915-GVTg_V5_4",
       "aggregator"  : "2",
       "pv-mode"     : "none, ppgtt, context"
     }
   ]
}

This is probably another call for a devlink based interface.

But as those fields are mostly vendor specific, the userspace can
only do simple string comparison. I guess the list would be very long, as
it needs to enumerate all possible targets.
also, in some fields like "gvt-version", is there a simple way to express
things like v2+?

That's totally vendor specific, I think. If "v2+" means it only supports version 2 and later, we can introduce fields like min_version and max_version. But again, the point is to keep such interfaces vendor specific instead of trying to have a generic format.
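
A minimal sketch of how a min_version/max_version pair could express "v2+". The "vN" tag format and both function names are assumptions made for illustration; the thread only names the two fields.

```python
from typing import Optional

def parse_version(tag: str) -> int:
    """Turn a tag such as "v2" into the integer 2 (assumed tag format)."""
    return int(tag.lstrip("v"))

def version_supported(tag: str, min_version: str,
                      max_version: Optional[str] = None) -> bool:
    """Check a version tag against a range; no max_version means "and later"."""
    version = parse_version(tag)
    if version < parse_version(min_version):
        return False
    return max_version is None or version <= parse_version(max_version)
```

With this shape, "v2+" is simply min_version "v2" with max_version left unset.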

If the userspace can read this interface both in src and target and
check whether both src and target are in corresponding compatible list, I
think it will work for us.
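
A minimal sketch of that userspace check, assuming each device exposes a descriptor with a "self" object and a "compatible" list as in the JSON draft above. The helper names are hypothetical, and the comparison is plain string equality on every field an entry lists.

```python
def entry_matches(device: dict, entry: dict) -> bool:
    """Plain string comparison of every field listed in a compatible entry."""
    return all(device.get(key) == value for key, value in entry.items())

def lists_as_compatible(desc: dict, other: dict) -> bool:
    """True if other's "self" matches one entry in desc's "compatible" list."""
    return any(entry_matches(other["self"], e) for e in desc["compatible"])

def mutually_compatible(src: dict, dst: dict) -> bool:
    """Both sides must find the peer in their own compatible list."""
    return lists_as_compatible(src, dst) and lists_as_compatible(dst, src)

# Illustrative descriptors using a subset of the fields from the draft.
v5_2 = {"pciid": "8086591d", "driver": "i915", "gvt-version": "v1",
        "mdev_type": "i915-GVTg_V5_2", "aggregator": "1"}
src = {"self": v5_2, "compatible": [v5_2]}
dst = {"self": v5_2, "compatible": [v5_2]}
compatible = mutually_compatible(src, dst)
```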

But still, kernel should not rely on userspace's choice, the opaque
compatibility string is still required in kernel. No matter whether
it would be exposed to userspace as a compatibility checking interface,
vendor driver would keep this part of code and embed the string into the
migration stream.

Why? Can we simply do:

1) Src supports features A, B, C  (version 1.0)
2) Dst supports features A, B, C, D (version 2.0)
3) only enable features A, B, C in the destination in a version specific way (set version to 1.0)
4) migrate metadata A, B, C
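
The four steps above can be sketched as a set comparison in management code. The function name and the use of plain feature-name sets are assumptions for illustration, not an existing API.

```python
from typing import Optional, Set

def negotiate_features(src: Set[str], dst: Set[str]) -> Optional[Set[str]]:
    """Return the features to enable on the destination, or None.

    Migration is only possible if the destination supports everything
    the source uses; extra destination features are left disabled.
    """
    if not src <= dst:
        return None  # destination lacks a feature the source relies on
    return set(src)

# Src is version 1.0 (A, B, C); dst is version 2.0 (A, B, C, D).
enabled = negotiate_features({"A", "B", "C"}, {"A", "B", "C", "D"})
```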

  so exposing it as an interface to be used by libvirt to
do a safety check before a real live migration is only about enabling
the kernel part of check to happen ahead.

If we've already exposed the capability, there's no need for an extra check like compatibility string.



for an Intel vGPU, string format can be defined like
"parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator 

for an NVMe VF connecting to a remote storage. it could be
"PCI ID" + "driver version" + "configured remote storage URL"

for a QAT VF, it may be
"PCI ID" + "driver version" + "supported encryption set".

(to avoid namespace conflicts between vendors, we may prefix a driver name to
each migration_version string. e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1)
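
A minimal sketch of how a vendor driver might assemble such a prefixed string. The field order is inferred from the single example above and the function name is hypothetical; userspace would still treat the result as opaque.

```python
def build_migration_version(driver: str, driver_version: str,
                            pci_vendor: str, pci_device: str,
                            mdev_type: str, aggregator: int) -> str:
    """Join the fields with '-', driver name first to avoid vendor clashes."""
    return "-".join([driver, driver_version, pci_vendor, pci_device,
                     mdev_type, str(aggregator)])

version = build_migration_version("i915", "v1", "8086", "591d",
                                  "i915-GVTg_V5_8", 1)
```

Because the mdev type itself contains '-', the string cannot be split back into fields unambiguously, which is one more reason for userspace not to parse it.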

(2) backgrounds

The reason we hope the migration_version string is opaque to the userspace
is that it is hard to generalize standard comparing fields and comparing
methods for different devices from different vendors.
Though userspace could still do a simple string comparison to check whether
two devices are compatible, and the result would be correct, that is
too limited, as it excludes possible candidates whose migration_version
strings are not equal.
e.g. an MDEV with mdev_type_1 and aggregator count 3 is probably compatible
with another MDEV with mdev_type_3 and aggregator count 1, even though their
migration_version strings are not equal
(assuming mdev_type_3 has 3 times the resources of mdev_type_1).
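
A minimal sketch of why string equality is too strict in this example. The resource table is hypothetical and only encodes the assumption stated above, that mdev_type_3 has 3 times the resources of mdev_type_1.

```python
# Hypothetical resource units per mdev type (assumption from the example).
RESOURCE_UNITS = {"mdev_type_1": 1, "mdev_type_3": 3}

def total_resources(mdev_type: str, aggregator: int) -> int:
    """Resources of one instance: units per type times aggregator count."""
    return RESOURCE_UNITS[mdev_type] * aggregator

def resources_match(src: tuple, dst: tuple) -> bool:
    """Compare (mdev_type, aggregator) pairs by total resources."""
    return total_resources(*src) == total_resources(*dst)

same_string = ("mdev_type_1", 3) == ("mdev_type_3", 1)
same_resources = resources_match(("mdev_type_1", 3), ("mdev_type_3", 1))
```

Here the tuples differ while the totals agree, which is the kind of decision only the vendor driver can make.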

besides that, driver version + configured resources are all elements that need
to be taken into account.

So, we hope leaving the freedom to vendor driver and let it make the final 
in a simple reading from source side and writing for test in the target side 

we then think the device compatibility issues for live migration with assigned
devices can be divided into two steps:
a. management tools filter out possible migration target devices.
     Tags could be created according to info from product specification.
     we think openstack/ovirt may have vendor proprietary components to create
     those customized tags for each product from each vendor.
     for Intel vGPU, with a vGPU (an MDEV device) in source side, the tags to
     search target vGPU are like:
     a tag for compatible parent PCI IDs,
     a tag for a range of gvt driver versions,
     a tag for a range of mdev type + aggregator count

     for NVMe VF, the tags to search target VF may be like:
     a tag for compatible PCI IDs,
     a tag for a range of driver versions,
     a tag for URL of configured remote storage.

b. with the output from step a, openstack/ovirt/libvirt could use our proposed
     device migration compatibility interface to make sure the two devices are
     indeed live migration compatible before launching the real live migration
     process to start stream copying, src device stopping and target device
     It is supposed that this step would not bring any performance penalty as
     -in kernel it's just a simple string decoding and comparing
     -in openstack/ovirt, it could be done by extending current function
      check_can_live_migrate_destination, along side claiming target 



