Re: device compatibility interface for live migration with assigned devices

2020-09-10 Thread Yan Zhao
On Thu, Sep 10, 2020 at 12:02:44PM -0600, Alex Williamson wrote:
> On Thu, 10 Sep 2020 13:50:11 +0100
> Sean Mooney  wrote:
> 
> > On Thu, 2020-09-10 at 14:38 +0200, Cornelia Huck wrote:
> > > On Wed, 9 Sep 2020 10:13:09 +0800
> > > Yan Zhao  wrote:
> > >   
> > > > > > still, I'd like to put it more explicitly to make sure it's not
> > > > > > missed:
> > > > > > the reason we want to specify compatible_type as a trait and check
> > > > > > whether the target compatible_type is a superset of the source
> > > > > > compatible_type is backward compatibility.
> > > > > > e.g.
> > > > > > an old generation device may have an mdev type xxx-v4-yyy, while a
> > > > > > newer generation device may be of mdev type xxx-v5-yyy.
> > > > > > with the compatible_type traits, the old generation device can still
> > > > > > be regarded as compatible with the newer generation device even if
> > > > > > their mdev types are not equal.
> > > > > 
> > > > > If you want to support migration from v4 to v5, can't the (presumably
> > > > > newer) driver that supports v5 simply register the v4 type as well, so
> > > > > that the mdev can be created as v4? (Just like QEMU versioned machine
> > > > > types work.)
> > > > 
> > > > yes, it should work in some conditions.
> > > > but it may not be that good in some cases when v5 and v4 in the name
> > > > string of mdev type identify hardware generation (e.g. v4 for gen8,
> > > > and v5 for gen9)
> > > > 
> > > > e.g.
> > > > (1). when src mdev type is v4 and target mdev type is v5 as
> > > > software does not support it initially, and v4 and v5 identify hardware
> > > > differences.
> > > 
> > > My first hunch here is: Don't introduce types that may be compatible
> > > later. Either make them compatible, or make them distinct by design,
> > > and possibly add a different, compatible type later.
> > >   
> > > > then after software upgrade, v5 is now compatible with v4, should the
> > > > software now downgrade mdev type from v5 to v4?
> > > > not sure if moving hardware generation info into a separate attribute
> > > > from mdev type name is better. e.g. remove v4, v5 in mdev type, while
> > > > using compatible_pci_ids to identify compatibility.
> > > 
> > > If the generations are compatible, don't mention it in the mdev type.
> > > If they aren't, use distinct types, so that management software doesn't
> > > have to guess. At least that would be my naive approach here.  
> > yep that is what i would prefer to see too.
> > >   
> > > > 
> > > > (2) name string of mdev type is composed of "driver_name + type_name".
> > > > in some devices, e.g. qat, different generations of devices bind to
> > > > drivers of different names, e.g. "qat-v4", "qat-v5".
> > > > then though type_name is equal, mdev type is not equal. e.g.
> > > > "qat-v4-type1", "qat-v5-type1".
> > > 
> > > I guess that shows a shortcoming of that "driver_name + type_name"
> > > approach? Or maybe I'm just confused.  
> > yes, i really don't like having the version in the mdev-type name.
> > i would strongly prefer just qat-type-1, where qat is just there as a way
> > of namespacing.
> > although symmetric-crypto, asymmetric-crypto and compression would be
> > better names than type-1, type-2, type-3 if that is what they would end up
> > mapping to. e.g. qat-compression or qat-aes is a much better name than
> > type-1.
> > higher layers of software are unlikely to parse the mdev names, but as a
> > human looking at them it's much easier to understand if the names are
> > meaningful. the qat prefix i think is important, however, to make sure that
> > your mdev types don't collide with other vendors' mdev types. so i would
> > encourage all vendors to prefix their mdev types with either the
> > device name or the vendor.
> 
> +1 to all this, the mdev type is meant to indicate a software
> compatible interface, if different hardware versions can be software
> compatible, then don't make the job of finding a compatible device
> harder.  The full type is a combination of the vendor driver name plus
> the vendor provided type name specifically in order to provide a type
> namespace per vendor driver.  That's done at the mdev core level.
> Thanks,

hi Alex,
got it. so do you suggest that vendors use a consistent driver name across
generations of devices?
for qat, they create different modules for each generation. This
practice is not good if they want to support migration between devices
of different generations, right?

and do I understand correctly that we don't want to support migration
between different mdev types, even in the future?

Thanks
Yan



Re: device compatibility interface for live migration with assigned devices

2020-09-08 Thread Yan Zhao
hi All,
Per our previous discussion, there are two main concerns about the previous
proposal:
(1) it's currently hard for openstack to match mdev types.
(2) it is complicated.

so, we further propose below changes:
(1) requiring two compatible mdevs to have the same mdev type for now.
(though kernel still exposes compatible_type attributes for future use)  
(2) requiring 1:1 match for other attributes under sysfs type node for now
(those attributes are specified via compatible_ but
with only 1 value in it.)
(3) do not match attributes under device instance node.
rather, they are regarded as part of resource claiming process.
so src and dest values are ensured to be 1:1.
A dynamic_resources attribute under sysfs  node is added to
list the attributes under device instance that mgt tools need to
ensure 1:1 from src and dest.
the "aggregator" attribute under device instance node is such one that
needs to be listed.
Those listed attributes can actually be treated as device state set by the
vendor driver during live migration, but we still want mgt tools to set
them before live migration starts, in order to reduce the
chance of live migration failure.
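
As a rough illustration of point (3), a mgt tool could read the proposed dynamic_resources attribute and copy each listed instance attribute from src to dest. This is only a sketch: the comma-separated format is assumed from the "aggregator, fps,..." example in this proposal, and the dict representation and function names are hypothetical.

```python
def parse_dynamic_resources(text):
    """Split the attribute list; comma-separated format is an assumption."""
    return [name.strip() for name in text.split(",") if name.strip()]


def sync_dynamic_resources(listed, src_attrs, dest_attrs):
    """Copy every listed instance attribute from src to dest.

    src_attrs and dest_attrs are plain dicts standing in for the sysfs
    attributes under the device instance nodes (hypothetical layout).
    A KeyError for a missing src attribute would be handled by the caller
    like a failure to claim dest resources.
    """
    for name in listed:
        dest_attrs[name] = src_attrs[name]
    return dest_attrs
```

Attributes that are not listed, such as any other vendor-specific instance attribute, are left alone by this sketch.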

do you like those changes?

after the changes, the sysfs interface would look like below:

  |- [parent physical device]
  |--- Vendor-specific-attributes [optional]
  |--- [mdev_supported_types]
  | |--- []
  | |   |--- create
  | |   |--- name
  | |   |--- available_instances
  | |   |--- device_api
  | |   |--- software_version
  | |   |--- compatible_type
  | |   |--- compatible_
  | |   |--- compatible_
  | |   |--- dynamic_resources
  | |   |--- description
  | |   |--- [devices]

- device_api : exact match between src and dest is required.
   its value can be one of 
   "vfio-pci", "vfio-platform", "vfio-amba", "vfio-ccw", "vfio-ap"
- software_version: version of vendor driver,
   in major.minor.bugfix scheme.
   dest major should be equal to src major,
   dest minor should be no less than src minor.
   once migration stream related code changes, vendor
   drivers need to bump the version.
- compatible_type: not used by mgt tools currently.
   vendor drivers can provide this attribute, but need to
   know that mgt apps would ignore it.
   when in future mgt tools support this attribute, it
   would allow migration across different mdev types,
   so that devices of older generation may be able to
   migrate to newer generations.

- compatible_: for device api specific attributes,
  e.g. compatible_subchannel_type,
  dest values should be a superset of src values.
  vendor drivers can specify only one value in this attribute,
  in order to do exact match between src and dest.
  It's ok for mgt tools to only read one value in the
  attribute so that src:dest values are 1:1.

- compatible_: for mdev type specific attributes,
  e.g. compatible_pci_ids, compatible_chpid_type
  dest values should be a superset of src values.
  vendor drivers can specify only one value in the attribute
  in order to do exact match between src and dest.
  It's ok for mgt tools to only read one value in the
  attribute so that src:dest values are 1:1.

- dynamic_resources: though defined statically under ,
  this attribute lists attributes under device instance that
  need to be set as part of claiming dest resources.
  e.g. $cat dynamic_resources: aggregator, fps,...
  then after dest device is created, values of its device
  attributes need to be set to that of src device attributes.
  Failure in syncing src device values to dest device
  values is treated the same as failing to claim
  dest resources.
  attributes under device instance that are not listed
  in this attribute would not be part of resource checking in
  mgt tools.
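
The matching rules listed above could be sketched roughly as follows on the mgt tool side. This is a sketch under assumptions, not a definitive implementation: attribute values are passed in as dicts of strings (a hypothetical layout), and multi-value attributes are assumed to be comma-separated, which this proposal does not specify.

```python
def parse_version(text):
    """Parse a 'major.minor.bugfix' string into a tuple of ints."""
    major, minor, bugfix = (int(part) for part in text.strip().split("."))
    return major, minor, bugfix


def version_compatible(src, dest):
    """dest major must equal src major; dest minor must be >= src minor."""
    src_major, src_minor, _ = parse_version(src)
    dest_major, dest_minor, _ = parse_version(dest)
    return dest_major == src_major and dest_minor >= src_minor


def values_superset(src_values, dest_values):
    """dest values must be a superset of src values (comma-separated, assumed)."""
    src = {v.strip() for v in src_values.split(",") if v.strip()}
    dest = {v.strip() for v in dest_values.split(",") if v.strip()}
    return src <= dest


def types_compatible(src, dest):
    """src and dest are dicts of type-node attributes (hypothetical layout)."""
    if src["device_api"] != dest["device_api"]:
        return False
    if not version_compatible(src["software_version"], dest["software_version"]):
        return False
    # compatible_type is skipped because mgt tools ignore it for now.
    for name, value in src.items():
        if name.startswith("compatible_") and name != "compatible_type":
            if name not in dest or not values_superset(value, dest[name]):
                return False
    return True
```

With only one value in each compatible_ attribute, the superset test degenerates to the exact 1:1 match described above.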



Thanks
Yan



Re: device compatibility interface for live migration with assigned devices

2020-09-08 Thread Yan Zhao
> > still, I'd like to put it more explicitly to make sure it's not missed:
> > the reason we want to specify compatible_type as a trait and check
> > whether target compatible_type is the superset of source
> > compatible_type is for the consideration of backward compatibility.
> > e.g.
> > an old generation device may have a mdev type xxx-v4-yyy, while a newer
> > generation  device may be of mdev type xxx-v5-yyy.
> > with the compatible_type traits, the old generation device is still
> > able to be regarded as compatible to newer generation device even their
> > mdev types are not equal.
> 
> If you want to support migration from v4 to v5, can't the (presumably
> newer) driver that supports v5 simply register the v4 type as well, so
> that the mdev can be created as v4? (Just like QEMU versioned machine
> types work.)
yes, it should work in some conditions.
but it may not be that good in some cases when v5 and v4 in the name string
of mdev type identify hardware generation (e.g. v4 for gen8, and v5 for
gen9)

e.g.
(1). when src mdev type is v4 and target mdev type is v5 as
software does not support it initially, and v4 and v5 identify hardware
differences.
then after software upgrade, v5 is now compatible to v4, should the
software now downgrade mdev type from v5 to v4?
not sure if moving hardware generation info into a separate attribute
from mdev type name is better. e.g. remove v4, v5 in mdev type, while use
compatible_pci_ids to identify compatibility.

(2) name string of mdev type is composed by "driver_name + type_name".
in some devices, e.g. qat, different generations of devices are binding to
drivers of different names, e.g. "qat-v4", "qat-v5".
then though type_name is equal, mdev type is not equal. e.g.
"qat-v4-type1", "qat-v5-type1".

Thanks
Yan



Re: device compatibility interface for live migration with assigned devices

2020-08-30 Thread Yan Zhao
On Fri, Aug 28, 2020 at 03:04:12PM +0100, Sean Mooney wrote:
> On Fri, 2020-08-28 at 15:47 +0200, Cornelia Huck wrote:
> > On Wed, 26 Aug 2020 14:41:17 +0800
> > Yan Zhao  wrote:
> > 
> > > > previously, we wanted to regard the two mdevs created with dsa-1dwq x 30
> > > > and dsa-2dwq x 15 as compatible, because the two mdevs consist of equal
> > > > resources.
> > > 
> > > But, as it's a burden to upper layer, we agree that if this condition
> > > happens, we still treat the two as incompatible.
> > > 
> > > To fix it, either the driver should expose dsa-1dwq only, or the target
> > > dsa-2dwq needs to be destroyed and reallocated via dsa-1dwq x 30.
> > 
> > AFAIU, these are mdev types, aren't they? So, basically, any management
> > software needs to take care to use the matching mdev type on the target
> > system for device creation?
> 
> or just do the simple thing of using the same mdev type on the source and dest.
> matching mdev types is not necessarily trivial. we could do that, but we would
> have to do it in python rather than sql, so it would be slower, at least today.
> 
> we don't currently have the ability to say the resource provider must have 1 of
> a set of traits, just that it must have a specific trait. this is a feature we
> have discussed a couple of times and delayed until we really need it, but it's
> not out of the question that we could add it for this use case. i suspect,
> however, we would do exact match first and explore this later, after the
> initial mdev migration works.

Yes, I think it's good.

still, I'd like to put it more explicitly to make sure it's not missed:
the reason we want to specify compatible_type as a trait and check
whether target compatible_type is the superset of source
compatible_type is for the consideration of backward compatibility.
e.g.
an old generation device may have a mdev type xxx-v4-yyy, while a newer
generation  device may be of mdev type xxx-v5-yyy.
with the compatible_type traits, the old generation device is still
able to be regarded as compatible to newer generation device even their
mdev types are not equal.

Thanks
Yan
> by the way, i was looking at some vdpa related material today and noticed vdpa
> devices are no longer using mdevs and now use a vhost chardev, so i guess we
> will need a completely separate mechanism for vdpa vs mdev migration as a
> result. that is rather unfortunate, but i guess that is life.
> > 
> 



Re: device compatibility interface for live migration with assigned devices

2020-08-30 Thread Yan Zhao
On Fri, Aug 28, 2020 at 03:47:41PM +0200, Cornelia Huck wrote:
> On Wed, 26 Aug 2020 14:41:17 +0800
> Yan Zhao  wrote:
> 
> > previously, we want to regard the two mdevs created with dsa-1dwq x 30 and
> > dsa-2dwq x 15 as compatible, because the two mdevs consist equal resources.
> > 
> > But, as it's a burden to upper layer, we agree that if this condition
> > happens, we still treat the two as incompatible.
> > 
> > To fix it, either the driver should expose dsa-1dwq only, or the target
> > dsa-2dwq needs to be destroyed and reallocated via dsa-1dwq x 30.
> 
> AFAIU, these are mdev types, aren't they? So, basically, any management
> software needs to take care to use the matching mdev type on the target
> system for device creation?
dsa-1dwq is the mdev type.
there's no dsa-2dwq yet. and I think no dsa-2dwq should be provided in
future according to our discussion.

GVT currently does not support aggregator either.
how to add the aggregator attribute is currently under discussion,
and up to now it is recommended to be a vendor-specific attribute.

https://lists.freedesktop.org/archives/intel-gvt-dev/2020-July/006854.html.

Thanks
Yan



Re: device compatibility interface for live migration with assigned devices

2020-08-26 Thread Yan Zhao
On Thu, Aug 20, 2020 at 02:24:26PM +0100, Sean Mooney wrote:
> On Thu, 2020-08-20 at 14:27 +0800, Yan Zhao wrote:
> > On Thu, Aug 20, 2020 at 06:16:28AM +0100, Sean Mooney wrote:
> > > On Thu, 2020-08-20 at 12:01 +0800, Yan Zhao wrote:
> > > > On Thu, Aug 20, 2020 at 02:29:07AM +0100, Sean Mooney wrote:
> > > > > On Thu, 2020-08-20 at 08:39 +0800, Yan Zhao wrote:
> > > > > > On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote:
> > > > > > > On Tue, 18 Aug 2020 10:16:28 +0100
> > > > > > > Daniel P. Berrangé  wrote:
> > > > > > > 
> > > > > > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
> > > > > > > > >On 2020/8/18 下午4:55, Daniel P. Berrangé wrote:
> > > > > > > > > 
> > > > > > > > >  On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
> > > > > > > > > 
> > > > > > > > >  On 2020/8/14 下午1:16, Yan Zhao wrote:
> > > > > > > > > 
> > > > > > > > >  On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > > > > > > > > 
> > > > > > > > >  On 2020/8/10 下午3:46, Yan Zhao wrote:  
> > > > > > > > >  we actually can also retrieve the same information through 
> > > > > > > > > sysfs, .e.g
> > > > > > > > > 
> > > > > > > > > >  |- [path to device]
> > > > > > > > > >     |--- migration
> > > > > > > > > >     |     |--- self
> > > > > > > > > >     |     |     |---device_api
> > > > > > > > > >     |     |     |---mdev_type
> > > > > > > > > >     |     |     |---software_version
> > > > > > > > > >     |     |     |---device_id
> > > > > > > > > >     |     |     |---aggregator
> > > > > > > > > >     |     |--- compatible
> > > > > > > > > >     |     |     |---device_api
> > > > > > > > > >     |     |     |---mdev_type
> > > > > > > > > >     |     |     |---software_version
> > > > > > > > > >     |     |     |---device_id
> > > > > > > > > >     |     |     |---aggregator
> > > > > > > > > 
> > > > > > > > > 
> > > > > > > > >  Yes but:
> > > > > > > > > 
> > > > > > > > >  - You need one file per attribute (one syscall for one 
> > > > > > > > > attribute)
> > > > > > > > >  - Attribute is coupled with kobject
> > > > > > > 
> > > > > > > Is that really that bad? You have the device with an embedded 
> > > > > > > kobject
> > > > > > > anyway, and you can just put things into an attribute group?
> > > > > > > 
> > > > > > > [Also, I think that self/compatible split in the example makes 
> > > > > > > things
> > > > > > > needlessly complex. Shouldn't semantic versioning and matching 
> > > > > > > already
> > > > > > > cover nearly everything? I would expect very few cases that are 
> > > > > > > more
> > > > > > > complex than that. Maybe the aggregation stuff, but I don't think 
> > > > > > > we
> > > > > > > need that self/compatible split for that, either.]
> > > > > > 
> > > > > > Hi Cornelia,
> > > > > > 
> > > > > > The reason I want to declare compatible list of attributes is that
> > > > > > sometimes it's not a simple 1:1 matching of source attributes and 
> > > > > > target attributes
> > > > > > as I demonstrated below,
> > > > > > source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is 
> > > > > > compatible to
> > > > > > target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2),
> > > > > >(mdev_type i915-GVTg_V5_8 + aggregator 4)
> > > > > 
> > > > > > the way you are doing the naming is still really confusing, by the way.
> > > > > > if this has not already been merged in the kernel, can you change the
> > > > > > mdev

Re: device compatibility interface for live migration with assigned devices

2020-08-26 Thread Yan Zhao
On Tue, Aug 25, 2020 at 04:39:25PM +0200, Cornelia Huck wrote:
<...>
> > do you think the bin_attribute I proposed yesterday good?
> > Then we can have a single compatible with a variable in the mdev_type and
> > aggregator.
> > 
> >mdev_type=i915-GVTg_V5_{val1:int:2,4,8}
> >aggregator={val1}/2
> 
> I'm not really a fan of binary attributes other than in cases where we
> have some kind of binary format to begin with.
> 
> IIUC, we basically have:
> - different partitioning (expressed in the mdev_type)
> - different number of partitions (expressed via the aggregator)
> - devices being compatible if the partitioning:aggregator ratio is the
>   same
> 
> (The multiple mdev_type variants seem to come from avoiding extra
> creation parameters, IIRC?)
> 
> Would it be enough to export
> base_type=i915-GVTg_V5
> aggregation_ratio=
> 
> to express the various combinations that are compatible without the
> need for multiple sets of attributes?

yes. I agree we need to decouple the mdev type name and aggregator for
compatibility detection purpose.

please allow me to put some words to describe the history and
motivation of introducing aggregator.

initially, we have fixed mdev_type
i915-GVTg_V5_1,
i915-GVTg_V5_2,
i915-GVTg_V5_4,
i915-GVTg_V5_8,
the digit after i915-GVTg_V5 represents the max number of instances
allowed to be created for this type. It also identifies how many
resources are to be allocated for each type.

They have worked well so far for current intel vgpus, i.e., cutting the
physical GPU into several virtual pieces and sharing them among several
VMs in a purely mediated way.
fixed types are provided in advance as we thought they could meet the needs
of most users, and users can know the hardware capability they acquired
from the type name. the bigger the number, the smaller the piece of physical
hardware.

Then, when it comes to scalable IOV in the near future, one physical device
can be cut into a large number of units at the hardware layer.
A single unit to be assigned to a guest can be very small, while one to
several units are grouped into an mdev.

The fixed type scheme is then cumbersome. 
Therefore, a new attribute aggregator is introduced to specify the number
of resources to be assigned based on the base resource specified in type
name. e.g.
if type name is dsa-1dwq, and aggregator is 30, then the assignable
resources to guest are 30 wqs in a single created mdev.
if type name is dsa-2dwq, and aggregator is 15, then the assignable
resources to guest are also 30 wqs in a single created mdev.
(in this example, the rule for defining the type name differs from the case
in GVT. here 1 wq means wq number is 1. yes, that is the current reality.
:) )


previously, we wanted to regard the two mdevs created with dsa-1dwq x 30 and
dsa-2dwq x 15 as compatible, because the two mdevs consist of equal resources.

But, as it's a burden to upper layer, we agree that if this condition
happens, we still treat the two as incompatible.

To fix it, either the driver should expose dsa-1dwq only, or the target
dsa-2dwq needs to be destroyed and reallocated via dsa-1dwq x 30.
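
To make the arithmetic concrete, here is a small sketch of the dsa example. It assumes type names of the form dsa-<N>dwq as used in this mail; the function names are hypothetical and only illustrate the agreed rule that equal totals do not imply compatibility.

```python
import re


def total_wqs(mdev_type, aggregator):
    """Total work queues = per-unit count in the type name * aggregator.

    Assumes names of the form 'dsa-<N>dwq', as in the examples in this mail.
    """
    match = re.fullmatch(r"dsa-(\d+)dwq", mdev_type)
    if match is None:
        raise ValueError(f"unexpected type name: {mdev_type}")
    return int(match.group(1)) * aggregator


def compatible(src, dest):
    """Agreed rule: require the same (mdev_type, aggregator) pair,
    even when the total resources happen to be equal."""
    return src == dest
```

Both dsa-1dwq x 30 and dsa-2dwq x 15 give a total of 30 wqs here, yet `compatible()` still reports them as incompatible, which matches the conclusion above.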

Does it make sense?

Thanks
Yan







Re: device compatibility interface for live migration with assigned devices

2020-08-20 Thread Yan Zhao
On Thu, Aug 20, 2020 at 06:16:28AM +0100, Sean Mooney wrote:
> On Thu, 2020-08-20 at 12:01 +0800, Yan Zhao wrote:
> > On Thu, Aug 20, 2020 at 02:29:07AM +0100, Sean Mooney wrote:
> > > On Thu, 2020-08-20 at 08:39 +0800, Yan Zhao wrote:
> > > > On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote:
> > > > > On Tue, 18 Aug 2020 10:16:28 +0100
> > > > > Daniel P. Berrangé  wrote:
> > > > > 
> > > > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
> > > > > > >On 2020/8/18 下午4:55, Daniel P. Berrangé wrote:
> > > > > > > 
> > > > > > >  On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
> > > > > > > 
> > > > > > >  On 2020/8/14 下午1:16, Yan Zhao wrote:
> > > > > > > 
> > > > > > >  On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > > > > > > 
> > > > > > >  On 2020/8/10 下午3:46, Yan Zhao wrote:  
> > > > > > >  we actually can also retrieve the same information through 
> > > > > > > sysfs, .e.g
> > > > > > > 
> > > > > > > >  |- [path to device]
> > > > > > > >     |--- migration
> > > > > > > >     |     |--- self
> > > > > > > >     |     |     |---device_api
> > > > > > > >     |     |     |---mdev_type
> > > > > > > >     |     |     |---software_version
> > > > > > > >     |     |     |---device_id
> > > > > > > >     |     |     |---aggregator
> > > > > > > >     |     |--- compatible
> > > > > > > >     |     |     |---device_api
> > > > > > > >     |     |     |---mdev_type
> > > > > > > >     |     |     |---software_version
> > > > > > > >     |     |     |---device_id
> > > > > > > >     |     |     |---aggregator
> > > > > > > 
> > > > > > > 
> > > > > > >  Yes but:
> > > > > > > 
> > > > > > >  - You need one file per attribute (one syscall for one attribute)
> > > > > > >  - Attribute is coupled with kobject
> > > > > 
> > > > > Is that really that bad? You have the device with an embedded kobject
> > > > > anyway, and you can just put things into an attribute group?
> > > > > 
> > > > > [Also, I think that self/compatible split in the example makes things
> > > > > needlessly complex. Shouldn't semantic versioning and matching already
> > > > > cover nearly everything? I would expect very few cases that are more
> > > > > complex than that. Maybe the aggregation stuff, but I don't think we
> > > > > need that self/compatible split for that, either.]
> > > > 
> > > > Hi Cornelia,
> > > > 
> > > > The reason I want to declare compatible list of attributes is that
> > > > sometimes it's not a simple 1:1 matching of source attributes and 
> > > > target attributes
> > > > as I demonstrated below,
> > > > source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible 
> > > > to
> > > > target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2),
> > > >(mdev_type i915-GVTg_V5_8 + aggregator 4)
> > > 
> > > > the way you are doing the naming is still really confusing, by the way.
> > > > if this has not already been merged in the kernel, can you change the mdev
> > > > so that mdev_type i915-GVTg_V5_2 is 2 of mdev_type i915-GVTg_V5_1 instead
> > > > of half the device?
> > > > 
> > > > currently you need to divide the aggregator by the number at the end of
> > > > the mdev type to figure out how much of the physical device is being used,
> > > > which is a very unfriendly api convention.
> > > > 
> > > > the way aggregators are being proposed in general is not really something
> > > > i like, but i think this at least is something that should be possible to
> > > > correct.
> > > > 
> > > > with the complexity in the mdev type name + aggregator, i suspect that
> > > > this will never be supported in openstack nova directly, requiring
> > > > integration via cyborg unless we can pre-partition the device into mdevs
> > > > statically and just ignore this.
> > > > 
> > > > this is way too vendor specific to integrate in

Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Yan Zhao
On Thu, Aug 20, 2020 at 02:29:07AM +0100, Sean Mooney wrote:
> On Thu, 2020-08-20 at 08:39 +0800, Yan Zhao wrote:
> > On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote:
> > > On Tue, 18 Aug 2020 10:16:28 +0100
> > > Daniel P. Berrangé  wrote:
> > > 
> > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
> > > > >On 2020/8/18 下午4:55, Daniel P. Berrangé wrote:
> > > > > 
> > > > >  On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
> > > > > 
> > > > >  On 2020/8/14 下午1:16, Yan Zhao wrote:
> > > > > 
> > > > >  On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > > > > 
> > > > >  On 2020/8/10 下午3:46, Yan Zhao wrote:  
> > > > >  we actually can also retrieve the same information through sysfs, 
> > > > > .e.g
> > > > > 
> > > > > >  |- [path to device]
> > > > > >     |--- migration
> > > > > >     |     |--- self
> > > > > >     |     |     |---device_api
> > > > > >     |     |     |---mdev_type
> > > > > >     |     |     |---software_version
> > > > > >     |     |     |---device_id
> > > > > >     |     |     |---aggregator
> > > > > >     |     |--- compatible
> > > > > >     |     |     |---device_api
> > > > > >     |     |     |---mdev_type
> > > > > >     |     |     |---software_version
> > > > > >     |     |     |---device_id
> > > > > >     |     |     |---aggregator
> > > > > 
> > > > > 
> > > > >  Yes but:
> > > > > 
> > > > >  - You need one file per attribute (one syscall for one attribute)
> > > > >  - Attribute is coupled with kobject
> > > 
> > > Is that really that bad? You have the device with an embedded kobject
> > > anyway, and you can just put things into an attribute group?
> > > 
> > > [Also, I think that self/compatible split in the example makes things
> > > needlessly complex. Shouldn't semantic versioning and matching already
> > > cover nearly everything? I would expect very few cases that are more
> > > complex than that. Maybe the aggregation stuff, but I don't think we
> > > need that self/compatible split for that, either.]
> > 
> > Hi Cornelia,
> > 
> > The reason I want to declare compatible list of attributes is that
> > sometimes it's not a simple 1:1 matching of source attributes and target 
> > attributes
> > as I demonstrated below,
> > source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible to
> > target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2),
> >(mdev_type i915-GVTg_V5_8 + aggregator 4)
> the way you are doing the naming is still really confusing, by the way.
> if this has not already been merged in the kernel, can you change the mdev
> so that mdev_type i915-GVTg_V5_2 is 2 of mdev_type i915-GVTg_V5_1 instead of
> half the device?
> 
> currently you need to divide the aggregator by the number at the end of the
> mdev type to figure out how much of the physical device is being used, which
> is a very unfriendly api convention.
> 
> the way aggregators are being proposed in general is not really something i
> like, but i think this at least is something that should be possible to
> correct.
> 
> with the complexity in the mdev type name + aggregator, i suspect that this
> will never be supported in openstack nova directly, requiring integration via
> cyborg unless we can pre-partition the device into mdevs statically and just
> ignore this.
> 
> this is way too vendor specific to integrate into something like openstack in
> nova unless we can guarantee that how aggregators work will be portable across
> vendors generically.
> 
> > 
> > and aggregator may be just one of such examples that 1:1 matching does not
> > fit.
> for openstack nova i don't see us supporting anything beyond the 1:1 case
> where the mdev type does not change.
>
hi Sean,
I understand it's hard for openstack. but 1:N is always meaningful.
e.g.
if source device 1 has cap A, it is compatible to
device 2: cap A,
device 3: cap A+B,
device 4: cap A+B+C

to allow openstack to detect it correctly, in compatible list of
device 2, we would say compatible cap is A;
device 3, compatible cap is A or A+B;
device 4, compatible cap is A or A+B, or A+B+C;

then if openstack finds device 1's self cap A is contained in the compatible
cap of device 2/3/4, it can migrate device 1 to device 2,3,4.
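
A minimal sketch of this 1:N check, assuming each device's compatible list is modelled as a set of capability sets (the representation and names are hypothetical, chosen only to mirror the example above):

```python
def can_migrate(src_self_cap, dest_compatible_caps):
    """True if the source's own capability set appears in the
    destination's list of compatible capability sets."""
    return frozenset(src_self_cap) in dest_compatible_caps


# Compatible lists from the example: frozenset("AB") is the set {"A", "B"}.
compatible_caps = {
    "device 2": {frozenset("A")},
    "device 3": {frozenset("A"), frozenset("AB")},
    "device 4": {frozenset("A"), frozenset("AB"), frozenset("ABC")},
}

# device 1 has self cap A, so it can migrate to device 2, 3 and 4.
targets = [name for name, caps in compatible_caps.items()
           if can_migrate({"A"}, caps)]
```

In the other direction, a source with self cap A+B is not accepted by device 2, whose compatible list only contains A.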

conversely,  device 1's compatible cap is only 

Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Yan Zhao
On Wed, Aug 19, 2020 at 09:22:34PM -0600, Alex Williamson wrote:
> On Thu, 20 Aug 2020 08:39:22 +0800
> Yan Zhao  wrote:
> 
> > On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote:
> > > On Tue, 18 Aug 2020 10:16:28 +0100
> > > Daniel P. Berrangé  wrote:
> > >   
> > > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:  
> > > > >On 2020/8/18 下午4:55, Daniel P. Berrangé wrote:
> > > > > 
> > > > >  On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
> > > > > 
> > > > >  On 2020/8/14 下午1:16, Yan Zhao wrote:
> > > > > 
> > > > >  On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > > > > 
> > > > >  On 2020/8/10 下午3:46, Yan Zhao wrote:
> > > >   
> > > > >  we actually can also retrieve the same information through sysfs, 
> > > > > .e.g
> > > > > 
> > > > > >  |- [path to device]
> > > > > >     |--- migration
> > > > > >     |     |--- self
> > > > > >     |     |     |---device_api
> > > > > >     |     |     |---mdev_type
> > > > > >     |     |     |---software_version
> > > > > >     |     |     |---device_id
> > > > > >     |     |     |---aggregator
> > > > > >     |     |--- compatible
> > > > > >     |     |     |---device_api
> > > > > >     |     |     |---mdev_type
> > > > > >     |     |     |---software_version
> > > > > >     |     |     |---device_id
> > > > > >     |     |     |---aggregator
> > > > > 
> > > > > 
> > > > >  Yes but:
> > > > > 
> > > > >  - You need one file per attribute (one syscall for one attribute)
> > > > >  - Attribute is coupled with kobject  
> > > 
> > > Is that really that bad? You have the device with an embedded kobject
> > > anyway, and you can just put things into an attribute group?
> > > 
> > > [Also, I think that self/compatible split in the example makes things
> > > needlessly complex. Shouldn't semantic versioning and matching already
> > > cover nearly everything? I would expect very few cases that are more
> > > complex than that. Maybe the aggregation stuff, but I don't think we
> > > need that self/compatible split for that, either.]  
> > Hi Cornelia,
> > 
> > The reason I want to declare compatible list of attributes is that
> > sometimes it's not a simple 1:1 matching of source attributes and target 
> > attributes
> > as I demonstrated below,
> > source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible to
> > target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2),
> >(mdev_type i915-GVTg_V5_8 + aggregator 4)
> > 
> > and aggregator may be just one of such examples that 1:1 matching does not
> > fit.
> 
> If you're suggesting that we need a new 'compatible' set for every
> aggregation, haven't we lost the purpose of aggregation?  For example,
> rather than having N mdev types to represent all the possible
> aggregation values, we have a single mdev type with N compatible
> migration entries, one for each possible aggregation value.  BTW, how do
> we have multiple compatible directories?  compatible0001,
> compatible0002? Thanks,
> 
do you think the bin_attribute I proposed yesterday is good?
Then we can have a single compatible entry with a variable shared by the
mdev_type and aggregator.

   mdev_type=i915-GVTg_V5_{val1:int:2,4,8}
   aggregator={val1}/2
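
As a rough illustration of how a management tool might expand the two lines
above into concrete (mdev_type, aggregator) pairs, here is a minimal Python
sketch. The grammar it accepts ({name:int:v1,v2,...} to declare a variable,
{name}/N to reference it) is only inferred from this example, and the
function name expand() is hypothetical; none of this is an agreed format.

```python
import itertools
import re

# Hypothetical grammar, inferred from the two example lines only:
#   {name:int:v1,v2,...}  declares an integer variable and its allowed values
#   {name}/N              refers to a declared variable, divided by N
DECL = re.compile(r"\{(\w+):int:([\d,]+)\}")
REF = re.compile(r"\{(\w+)\}(?:/(\d+))?")


def expand(template):
    """Yield one concrete attribute dict per combination of variable values."""
    domains = {}
    for value in template.values():
        for name, choices in DECL.findall(value):
            domains[name] = [int(c) for c in choices.split(",")]
    names = sorted(domains)
    for combo in itertools.product(*(domains[n] for n in names)):
        env = dict(zip(names, combo))

        def ref(match):
            # Integer division is an assumption; the thread does not say
            # how a remainder should be handled.
            return str(env[match.group(1)] // int(match.group(2) or 1))

        concrete = {}
        for key, value in template.items():
            value = DECL.sub(lambda m: str(env[m.group(1)]), value)
            concrete[key] = REF.sub(ref, value)
        yield concrete


template = {
    "mdev_type": "i915-GVTg_V5_{val1:int:2,4,8}",
    "aggregator": "{val1}/2",
}
for entry in expand(template):
    print(entry)
```

With the template above this yields V5_2 with aggregator 1, V5_4 with
aggregator 2 and V5_8 with aggregator 4, which are the pairings used in the
aggregation example earlier in the thread.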

Thanks
Yan



Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Yan Zhao
On Wed, Aug 19, 2020 at 09:13:45PM -0600, Alex Williamson wrote:
> On Thu, 20 Aug 2020 08:18:10 +0800
> Yan Zhao  wrote:
> 
> > On Wed, Aug 19, 2020 at 11:50:21AM -0600, Alex Williamson wrote:
> > <...>
> > > > > > > What I care about is that we have a *standard* userspace API for
> > > > > > > performing device compatibility checking / state migration, for
> > > > > > > use by QEMU/libvirt/OpenStack, such that we can write code
> > > > > > > without countless vendor specific code paths.
> > > > > > >
> > > > > > > If there is vendor specific stuff on the side, that's fine as we
> > > > > > > can ignore that, but the core functionality for device compat /
> > > > > > > migration needs to be standardized.
> > > > > > 
> > > > > > To summarize:
> > > > > > - choose one of sysfs or devlink
> > > > > > - have a common interface, with a standardized way to add
> > > > > >   vendor-specific attributes
> > > > > > ?
> > > > > 
> > > > > Please refer to my previous email which has more example and details. 
> > > > >
> > > > hi Parav,
> > > > the example is based on a new vdpa tool running over netlink, not based
> > > > on devlink, right?
> > > > For vfio migration compatibility, we have to deal with both mdev and
> > > > physical pci devices. I don't think it's a good idea to write a new
> > > > tool for it, given we are able to retrieve the same info from sysfs
> > > > and there's already an mdevctl from Alex
> > > > (https://github.com/mdevctl/mdevctl).
> > > > 
> > > > hi All,
> > > > could we decide that sysfs is the interface that every VFIO vendor
> > > > driver needs to provide in order to support vfio live migration,
> > > > otherwise the userspace management tool would not list the device in
> > > > the compatible list?
> > > > 
> > > > if that's true, let's move on to standardizing the sysfs interface.
> > > > (1) content
> > > > common part: (must)
> > > >    - software_version: (in major.minor.bugfix scheme)
> > > >    - device_api: vfio-pci or vfio-ccw ...
> > > >    - type: mdev type for mdev device or
> > > >            a signature for physical device which is a counterpart for
> > > >            mdev type.
> > > > 
> > > > device api specific part: (must)
> > > >   - pci id: pci id of mdev parent device or pci id of physical pci
> > > >             device (device_api is vfio-pci)
> > > 
> > > As noted previously, the parent PCI ID should not matter for an mdev
> > > device, if a vendor has a dependency on matching the parent device PCI
> > > ID, that's a vendor specific restriction.  An mdev device can also
> > > expose a vfio-pci device API without the parent device being PCI.  For
> > > a physical PCI device, shouldn't the PCI ID be encompassed in the
> > > signature?  Thanks,
> > >   
> > you are right. I need to put the PCI ID as a vendor specific field.
> > I didn't do that because I wanted all vendor specific fields to be
> > configurable by management tools, so they can configure the target device
> > according to the value of a vendor specific field even if they don't know
> > the meaning of the field.
> > But maybe they can just ignore the field when they can't find a matching
> > writable field to configure the target.
> 
> 
> If fields can be ignored, what's the point of reporting them?  Seems
> it's no longer a requirement.  Thanks,
> 
Sorry about the confusion. I mean this case:
when about to migrate, openstack searches for existing matching MDEVs.
If there is one, i.e. all common/vendor specific fields match, it just
creates a VM with the matching target MDEV (in this case, the PCI ID field
is not ignored).
If not, openstack tries to create an MDEV according to mdev_type, and
configures the MDEV according to the vendor specific attributes.
As PCI ID is not a configurable field, openstack just ignores it.
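
The two cases above could be sketched roughly as follows. This is only an
illustration in Python: plan_target(), the dict layout and the notion of a
"writable_attrs" set are all hypothetical, and the attribute values are taken
from the examples elsewhere in this thread.

```python
def plan_target(source, existing_mdevs, writable_attrs):
    """Decide whether to reuse an existing target mdev or create a new one.

    source         -- dict with "mdev_type" and a "vendor" dict of attributes
    existing_mdevs -- list of dicts shaped like `source`, one per target mdev
    writable_attrs -- names of vendor attributes the target lets us configure
    """
    # Case 1: an existing mdev matches on every field, PCI ID included.
    for mdev in existing_mdevs:
        if mdev == source:
            return {"action": "reuse", "mdev": mdev}

    # Case 2: create by mdev_type, configure what is writable, ignore the
    # rest (for example a PCI ID, which cannot be configured).
    vendor = source["vendor"]
    return {
        "action": "create",
        "mdev_type": source["mdev_type"],
        "configure": {k: v for k, v in vendor.items() if k in writable_attrs},
        "ignored": sorted(k for k in vendor if k not in writable_attrs),
    }


source = {
    "mdev_type": "i915-GVTg_V5_4",
    "vendor": {"aggregator": "2", "pci_id": "80865963"},
}
plan = plan_target(source, existing_mdevs=[], writable_attrs={"aggregator"})
print(plan)
```

Here no existing mdev matches, so the plan is to create one of the same
mdev_type, write the aggregator attribute, and skip pci_id.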

Thanks
Yan

 
 



Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Yan Zhao
On Tue, Aug 18, 2020 at 11:36:52AM +0200, Cornelia Huck wrote:
> On Tue, 18 Aug 2020 10:16:28 +0100
> Daniel P. Berrangé  wrote:
> 
> > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
> > > >On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote:
> > > > 
> > > >  On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
> > > > 
> > > >  On 2020/8/14 1:16 PM, Yan Zhao wrote:
> > > > 
> > > >  On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > > > 
> > > >  On 2020/8/10 3:46 PM, Yan Zhao wrote:
> > 
> > > >  we actually can also retrieve the same information through sysfs, e.g.
> > > > 
> > > >  |- [path to device]
> > > >    |--- migration
> > > >    | |--- self
> > > >    | |   |---device_api
> > > >    | |   |---mdev_type
> > > >    | |   |---software_version
> > > >    | |   |---device_id
> > > >    | |   |---aggregator
> > > >    | |--- compatible
> > > >    | |   |---device_api
> > > >    | |   |---mdev_type
> > > >    | |   |---software_version
> > > >    | |   |---device_id
> > > >    | |   |---aggregator
> > > 
> > > 
> > >  Yes but:
> > > 
> > >  - You need one file per attribute (one syscall for one attribute)
> > >  - Attribute is coupled with kobject
> 
> Is that really that bad? You have the device with an embedded kobject
> anyway, and you can just put things into an attribute group?
> 
> [Also, I think that self/compatible split in the example makes things
> needlessly complex. Shouldn't semantic versioning and matching already
> cover nearly everything? I would expect very few cases that are more
> complex than that. Maybe the aggregation stuff, but I don't think we
> need that self/compatible split for that, either.]
Hi Cornelia,

The reason I want to declare a compatible list of attributes is that
sometimes it's not a simple 1:1 matching of source attributes and target
attributes. As I demonstrated below,
source mdev of (mdev_type i915-GVTg_V5_2 + aggregator 1) is compatible with
target mdev of (mdev_type i915-GVTg_V5_4 + aggregator 2),
               (mdev_type i915-GVTg_V5_8 + aggregator 4)

and aggregator may be just one such example where 1:1 matching does not
fit.

So, we explicitly list out self/compatible attributes, and management
tools only need to check whether the self attributes are contained in the
compatible attributes.
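
A minimal Python sketch of that containment check, assuming each compatible
entry maps an attribute name to the set of values it accepts. That data
layout, the exact string matching and the names entry_matches() and
is_compatible() are assumptions made for illustration only; nothing here is
an agreed interface.

```python
def entry_matches(source_self, entry):
    """True if the source has an accepted value for every attribute in `entry`."""
    return all(source_self.get(name) in accepted for name, accepted in entry.items())


def is_compatible(source_self, target_compatible):
    """True if at least one compatible entry of the target accepts the source."""
    return any(entry_matches(source_self, entry) for entry in target_compatible)


# Source mdev: mdev_type i915-GVTg_V5_2 + aggregator 1.
source_self = {
    "device_api": "vfio-pci",
    "mdev_type": "i915-GVTg_V5_2",
    "aggregator": "1",
}

# What a target mdev of (i915-GVTg_V5_4 + aggregator 2) might list.
target_compatible = [
    {"device_api": {"vfio-pci"}, "mdev_type": {"i915-GVTg_V5_4"}, "aggregator": {"2"}},
    {"device_api": {"vfio-pci"}, "mdev_type": {"i915-GVTg_V5_2"}, "aggregator": {"1"}},
]

print(is_compatible(source_self, target_compatible))
```

The management tool never has to understand what "aggregator" means; it only
compares values.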

Or do you mean that the compatible list alone is enough, and the management
tools need to work out the self list by themselves?
But I think providing a self list is easier for management tools.

Thanks
Yan



Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Yan Zhao
On Wed, Aug 19, 2020 at 11:50:21AM -0600, Alex Williamson wrote:
<...>
> > > > > What I care about is that we have a *standard* userspace API for
> > > > > performing device compatibility checking / state migration, for use by
> > > > > QEMU/libvirt/ OpenStack, such that we can write code without countless
> > > > > vendor specific code paths.
> > > > >
> > > > > If there is vendor specific stuff on the side, that's fine as we can
> > > > > ignore that, but the core functionality for device compat / migration
> > > > > needs to be standardized.  
> > > > 
> > > > To summarize:
> > > > - choose one of sysfs or devlink
> > > > - have a common interface, with a standardized way to add
> > > >   vendor-specific attributes
> > > > ?  
> > > 
> > > Please refer to my previous email which has more example and details.  
> > hi Parav,
> > the example is based on a new vdpa tool running over netlink, not based
> > on devlink, right?
> > For vfio migration compatibility, we have to deal with both mdev and
> > physical pci devices. I don't think it's a good idea to write a new tool
> > for it, given we are able to retrieve the same info from sysfs and there's
> > already an mdevctl from Alex (https://github.com/mdevctl/mdevctl).
> > 
> > hi All,
> > could we decide that sysfs is the interface that every VFIO vendor driver
> > needs to provide in order to support vfio live migration, otherwise the
> > userspace management tool would not list the device in the compatible
> > list?
> > 
> > if that's true, let's move on to standardizing the sysfs interface.
> > (1) content
> > common part: (must)
> >    - software_version: (in major.minor.bugfix scheme)
> >    - device_api: vfio-pci or vfio-ccw ...
> >    - type: mdev type for mdev device or
> >            a signature for physical device which is a counterpart for
> >            mdev type.
> > 
> > device api specific part: (must)
> >   - pci id: pci id of mdev parent device or pci id of physical pci
> >             device (device_api is vfio-pci)
> 
> As noted previously, the parent PCI ID should not matter for an mdev
> device, if a vendor has a dependency on matching the parent device PCI
> ID, that's a vendor specific restriction.  An mdev device can also
> expose a vfio-pci device API without the parent device being PCI.  For
> a physical PCI device, shouldn't the PCI ID be encompassed in the
> signature?  Thanks,
> 
You are right. I need to put the PCI ID as a vendor specific field.
I didn't do that because I wanted all vendor specific fields to be
configurable by management tools, so they can configure the target device
according to the value of a vendor specific field even if they don't know
the meaning of the field.
But maybe they can just ignore the field when they can't find a matching
writable field to configure the target.

Thanks
Yan


> >   - subchannel_type (device_api is vfio-ccw) 
> >  
> > vendor driver specific part: (optional)
> >   - aggregator
> >   - chpid_type
> >   - remote_url
> > 
> > NOTE: vendors are free to add attributes in this part with a
> > restriction that this attribute is able to be configured with the same
> > name in sysfs too. e.g.
> > for aggregator, there must be a sysfs attribute in device node
> > /sys/devices/pci:00/:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator,
> > so that the userspace tool is able to configure the target device
> > according to source device's aggregator attribute.
> > 
> > 
> > (2) where and structure
> > proposal 1:
> > |- [path to device]
> >   |--- migration
> >   | |--- self
> >   | |   |-software_version
> >   | |   |-device_api
> >   | |   |-type
> >   | |   |-[pci_id or subchannel_type]
> >   | |   |-
> >   | |--- compatible
> >   | |   |-software_version
> >   | |   |-device_api
> >   | |   |-type
> >   | |   |-[pci_id or subchannel_type]
> >   | |   |-
> > multiple compatible is allowed.
> > attributes should be ASCII text files, preferably with only one value
> > per file.
> > 
> > 
> > proposal 2: use bin_attribute.
> > |- [path to device]
> >   |--- migration
> >   | |--- self
> >   | |--- compatible
> > 
> > so we can continue to use the multiline format, e.g.
> > cat compatible
> >   software_version=0.1.0
> >   device_api=vfio_pci
> >   type=i915-GVTg_V5_{val1:int:1,2,4,8}
> >   pci_id=80865963
> >   aggregator={val1}/2
> > 
> > Thanks
> > Yan
> > 
> 
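
For reference, the multiline key=value output shown under proposal 2 could be
read by a management tool with something like the Python sketch below. The
function name parse_compatible() is made up, and the sample text is just the
example from the proposal; the format itself is still under discussion.

```python
def parse_compatible(text):
    """Parse 'key=value' lines into a dict, ignoring blank lines."""
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        key, sep, value = line.partition("=")
        if not sep:
            raise ValueError(f"not a key=value line: {line!r}")
        attrs[key.strip()] = value.strip()
    return attrs


sample = """
  software_version=0.1.0
  device_api=vfio_pci
  type=i915-GVTg_V5_{val1:int:1,2,4,8}
  pci_id=80865963
  aggregator={val1}/2
"""
attrs = parse_compatible(sample)
print(attrs["type"], attrs["aggregator"])
```

Values are kept as raw strings, so any variable syntax such as {val1} would
still need a separate expansion step.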



Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Yan Zhao
On Wed, Aug 19, 2020 at 03:39:50PM +0800, Jason Wang wrote:
> 
> On 2020/8/19 2:59 PM, Yan Zhao wrote:
> > On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote:
> > > On 2020/8/19 11:30 AM, Yan Zhao wrote:
> > > > hi All,
> > > > could we decide that sysfs is the interface that every VFIO vendor 
> > > > driver
> > > > needs to provide in order to support vfio live migration, otherwise the
> > > > userspace management tool would not list the device into the compatible
> > > > list?
> > > > 
> > > > if that's true, let's move to the standardizing of the sysfs interface.
> > > > (1) content
> > > > common part: (must)
> > > >  - software_version: (in major.minor.bugfix scheme)
> > > 
> > > This can not work for devices whose features can be negotiated/advertised
> > > independently. (E.g virtio devices)
> > > 
> > sorry, I don't understand here, why do virtio devices need to use the
> > vfio interface?
> 
> 
> I don't see any reason that virtio devices can't be used by VFIO. Do you?
> 
> Actually, virtio devices have been used by VFIO for many years:
> 
> - passthrough a hardware virtio devices to userspace(VM) drivers
> - using virtio PMD inside guest
>
So, how is that different from passing through physical hardware via VFIO?
Even though the features are negotiated dynamically, could you explain
why that would make software_version not work?


> 
> > I think this thread is discussing about vfio related devices.
> > 
> > > >  - device_api: vfio-pci or vfio-ccw ...
> > > >  - type: mdev type for mdev device or
> > > >  a signature for physical device which is a counterpart for
> > > >mdev type.
> > > > 
> > > > device api specific part: (must)
> > > > - pci id: pci id of mdev parent device or pci id of physical pci
> > > >   device (device_api is vfio-pci)API here.
> > > 
> > > So this assumes a PCI device which is probably not true.
> > > 
> > for device_api of vfio-pci, why it's not true?
> > 
> > for vfio-ccw, it's subchannel_type.
> 
> 
> Ok but having two different attributes for the same file is not good idea.
> How mgmt know there will be a 3rd type?
That's why some attributes need to be common, e.g.
device_api: it's common because mgmt needs to know whether it's a pci device
  or a ccw device, and the api type is already defined in vfio.h.
  (The field was agreed to, and actually suggested, by Alex in a previous
  mail.)
type: mdev_type for mdev. If mgmt does not understand it, it would not
  be able to create a compatible mdev device.
software_version: mgmt can compare the major and minor if it understands
  these fields.
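
As an illustration of such a major/minor comparison, here is a small Python
sketch. The rule it applies (same major, target minor not older than the
source minor, bugfix ignored) is only an assumption in the spirit of semantic
versioning; the thread has not settled on an actual rule, and both function
names are hypothetical.

```python
def parse_version(text):
    """Split a 'major.minor.bugfix' string into three integers."""
    major, minor, bugfix = (int(part) for part in text.split("."))
    return major, minor, bugfix


def version_compatible(source, target):
    """Assumed rule: same major, and target minor >= source minor."""
    src_major, src_minor, _ = parse_version(source)
    dst_major, dst_minor, _ = parse_version(target)
    return src_major == dst_major and dst_minor >= src_minor


print(version_compatible("0.1.0", "0.2.3"))
print(version_compatible("0.2.0", "0.1.9"))
```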
> 
> 
> > 
> > > > - subchannel_type (device_api is vfio-ccw)
> > > > vendor driver specific part: (optional)
> > > > - aggregator
> > > > - chpid_type
> > > > - remote_url
> > > 
> > > For "remote_url", just wonder if it's better to integrate or reuse the
> > > existing NVME management interface instead of duplicating it here. 
> > > Otherwise
> > > it could be a burden for mgmt to learn. E.g vendor A may use "remote_url"
> > > but vendor B may use a different attribute.
> > > 
> > it's vendor driver specific.
> > vendor specific attributes are inevitable, and that's why we are
> > discussing here of a way to standardizing of it.
> 
> 
> Well, then you will end up with a very long list to discuss. E.g for
> networking devices, you will have "mac", "v(x)lan" and a lot of other.
> 
> Note that "remote_url" is not vendor specific but NVME (class/subsystem)
> specific.
> 
Yes, it's just NVMe specific. I added it as an example to show what is
vendor specific.
If an attribute is used across all vendors, then it's not vendor specific;
it's already a common attribute, right?

> The point is that if vendor/class specific part is unavoidable, why not
> making all of the attributes vendor specific?
>
some parts need to be common, as I listed above.

> 
> > our goal is that mgmt can use it without understanding the meaning of vendor
> > specific attributes.
> 
> 
> I'm not sure this is the correct design of uAPI. Is there something similar
> in the existing uAPIs?
> 
> And it might be hard to work for virtio devices.
> 
> 
> > 
> > > > NOTE: vendors are free to add attributes in this part with a
> > > > restriction that th

Re: [ovirt-devel] Re: device compatibility interface for live migration with assigned devices

2020-08-19 Thread Yan Zhao
On Wed, Aug 19, 2020 at 02:57:34PM +0800, Jason Wang wrote:
> 
> On 2020/8/19 11:30 AM, Yan Zhao wrote:
> > hi All,
> > could we decide that sysfs is the interface that every VFIO vendor driver
> > needs to provide in order to support vfio live migration, otherwise the
> > userspace management tool would not list the device into the compatible
> > list?
> > 
> > if that's true, let's move to the standardizing of the sysfs interface.
> > (1) content
> > common part: (must)
> > - software_version: (in major.minor.bugfix scheme)
> 
> 
> This can not work for devices whose features can be negotiated/advertised
> independently. (E.g virtio devices)
>
Sorry, I don't understand. Why do virtio devices need to use the vfio
interface?
I think this thread is discussing vfio related devices.

> 
> > - device_api: vfio-pci or vfio-ccw ...
> > - type: mdev type for mdev device or
> > a signature for physical device which is a counterpart for
> >mdev type.
> > 
> > device api specific part: (must)
> >- pci id: pci id of mdev parent device or pci id of physical pci
> >  device (device_api is vfio-pci)API here.
> 
> 
> So this assumes a PCI device which is probably not true.
> 
For a device_api of vfio-pci, why is it not true?

For vfio-ccw, it's subchannel_type.

> 
> >- subchannel_type (device_api is vfio-ccw)
> > vendor driver specific part: (optional)
> >- aggregator
> >- chpid_type
> >- remote_url
> 
> 
> For "remote_url", just wonder if it's better to integrate or reuse the
> existing NVME management interface instead of duplicating it here. Otherwise
> it could be a burden for mgmt to learn. E.g vendor A may use "remote_url"
> but vendor B may use a different attribute.
> 
It's vendor driver specific.
Vendor specific attributes are inevitable, and that's why we are
discussing a way to standardize them here.
Our goal is that mgmt can use them without understanding the meaning of
vendor specific attributes.

> 
> > 
> > NOTE: vendors are free to add attributes in this part with a
> > restriction that this attribute is able to be configured with the same
> > name in sysfs too. e.g.
> 
> 
> Sysfs works well for common attributes belongs to a class, but I'm not sure
> it can work well for device/vendor specific attributes. Does this mean mgmt
> need to iterate all the attributes in both src and dst?
>
No, just the attributes under the migration directory.
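
To make that concrete, a management tool could read everything under the
migration directory with a few lines of Python. This is only a sketch against
the layout of proposal 1 (one value per file under migration/self and
migration/compatible), which is not an existing kernel interface; the
function names are hypothetical, and the demo runs on a temporary directory
rather than a real sysfs path. How multiple compatible directories would be
named is left open, as in the thread.

```python
import tempfile
from pathlib import Path


def read_attribute_group(directory):
    """Read each regular file in `directory` as one single-value attribute."""
    return {
        entry.name: entry.read_text().strip()
        for entry in sorted(Path(directory).iterdir())
        if entry.is_file()
    }


def read_migration_info(device_path):
    """Return the attributes under <device_path>/migration/{self,compatible}."""
    migration = Path(device_path) / "migration"
    return {
        group: read_attribute_group(migration / group)
        for group in ("self", "compatible")
        if (migration / group).is_dir()
    }


# Demo on a throwaway directory standing in for a device node.
with tempfile.TemporaryDirectory() as fake_device:
    for group in ("self", "compatible"):
        group_dir = Path(fake_device) / "migration" / group
        group_dir.mkdir(parents=True)
        (group_dir / "device_api").write_text("vfio-pci\n")
        (group_dir / "software_version").write_text("0.1.0\n")
    info = read_migration_info(fake_device)

print(info)
```

Only the migration directory is walked; no other attribute of the device
node is touched.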

> 
> > for aggregator, there must be a sysfs attribute in device node
> > /sys/devices/pci:00/:00:02.0/882cc4da-dede-11e7-9180-078a62063ab1/intel_vgpu/aggregator,
> > so that the userspace tool is able to configure the target device
> > according to source device's aggregator attribute.
> > 
> > 
> > (2) where and structure
> > proposal 1:
> > |- [path to device]
> >   |--- migration
> >   | |--- self
> >   | |   |-software_version
> >   | |   |-device_api
> >   | |   |-type
> >   | |   |-[pci_id or subchannel_type]
> >   | |   |-
> >   | |--- compatible
> >   | |   |-software_version
> >   | |   |-device_api
> >   | |   |-type
> >   | |   |-[pci_id or subchannel_type]
> >   | |   |-
> > multiple compatible is allowed.
> > attributes should be ASCII text files, preferably with only one value
> > per file.
> > 
> > 
> > proposal 2: use bin_attribute.
> > |- [path to device]
> >|--- migration
> >| |--- self
> >| |--- compatible
> > 
> > so we can continue use multiline format. e.g.
> > cat compatible
> >software_version=0.1.0
> >device_api=vfio_pci
> >type=i915-GVTg_V5_{val1:int:1,2,4,8}
> >pci_id=80865963
> >aggregator={val1}/2
> 
> 
> So basically two questions:
> 
> - how hard to standardize sysfs API for dealing with compatibility check (to
> make it work for most types of devices)
Sorry, I only know that we are in the process of standardizing it :)

> - how hard for the mgmt to learn with a vendor specific attributes (vs
> existing management API)
What is the existing management API?

Thanks



Re: device compatibility interface for live migration with assigned devices

2020-08-18 Thread Yan Zhao
On Tue, Aug 18, 2020 at 09:39:24AM +, Parav Pandit wrote:
> Hi Cornelia,
> 
> > From: Cornelia Huck 
> > Sent: Tuesday, August 18, 2020 3:07 PM
> > Subject: Re: device compatibility interface for live migration with assigned
> > devices
> > 
> > On Tue, 18 Aug 2020 10:16:28 +0100
> > Daniel P. Berrangé  wrote:
> > 
> > > On Tue, Aug 18, 2020 at 05:01:51PM +0800, Jason Wang wrote:
> > > > >On 2020/8/18 4:55 PM, Daniel P. Berrangé wrote:
> > > > >
> > > > >  On Tue, Aug 18, 2020 at 11:24:30AM +0800, Jason Wang wrote:
> > > > >
> > > > >  On 2020/8/14 1:16 PM, Yan Zhao wrote:
> > > > >
> > > > >  On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > > > >
> > > > >  On 2020/8/10 3:46 PM, Yan Zhao wrote:
> > >
> > > > >  we actually can also retrieve the same information through sysfs,
> > > > >  e.g.
> > > > >
> > > > >  |- [path to device]
> > > > >    |--- migration
> > > > >    | |--- self
> > > > >    | |   |---device_api
> > > > >    | |   |---mdev_type
> > > > >    | |   |---software_version
> > > > >    | |   |---device_id
> > > > >    | |   |---aggregator
> > > > >    | |--- compatible
> > > > >    | |   |---device_api
> > > > >    | |   |---mdev_type
> > > > >    | |   |---software_version
> > > > >    | |   |---device_id
> > > > >    | |   |---aggregator
> > > >
> > > >
> > > >  Yes but:
> > > >
> > > >  - You need one file per attribute (one syscall for one attribute)
> > > >  - Attribute is coupled with kobject
> > 
> > Is that really that bad? You have the device with an embedded kobject
> > anyway, and you can just put things into an attribute group?
> > 
> > [Also, I think that self/compatible split in the example makes things
> > needlessly complex. Shouldn't semantic versioning and matching already
> > cover nearly everything? I would expect very few cases that are more
> > complex than that. Maybe the aggregation stuff, but I don't think we need
> > that self/compatible split for that, either.]
> > 
> > > >
> > > >  All of above seems unnecessary.
> > > >
> > > > >  Another point, as we discussed in another thread, it's really hard
> > > > >  to make sure the above API work for all types of devices and
> > > > >  frameworks. So having a vendor specific API looks much better.
> > > > >
> > > > >  From the POV of userspace mgmt apps doing device compat checking /
> > > > >  migration, we certainly do NOT want to use different vendor
> > > > >  specific APIs. We want to have an API that can be used / controlled
> > > > >  in a standard manner across vendors.
> > > > >
> > > > >    Yes, but it could be hard. E.g. vDPA will choose to use devlink
> > > > >    (there's a long debate on sysfs vs devlink). So if we go with
> > > > >    sysfs, at least two APIs need to be supported ...
> > >
> > > > NB, I was not questioning devlink vs sysfs directly. If devlink is
> > > > related to netlink, I can't say I'm enthusiastic as IMKE sysfs is
> > > > easier to deal with. I don't know enough about devlink to have much of
> > > > an opinion though.
> > > > The key point was that I don't want the userspace APIs we need to deal
> > > > with to be vendor specific.
> > 
> > > From what I've seen of devlink, it seems quite nice; but I understand why
> > > sysfs might be easier to deal with (especially as there's likely already
> > > a lot of code using it.)
> > 
> > I understand that some users would like devlink because it is already widely
> > used for network drivers (and some others), but I don't think the majority 
> >

Re: device compatibility interface for live migration with assigned devices

2020-08-16 Thread Yan Zhao
On Fri, Aug 14, 2020 at 01:30:00PM +0100, Sean Mooney wrote:
> On Fri, 2020-08-14 at 13:16 +0800, Yan Zhao wrote:
> > On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> > > 
> > > > On 2020/8/10 3:46 PM, Yan Zhao wrote:
> > > > > driver is it handled by?
> > > > 
> > > > It looks that the devlink is for network device specific, and in
> > > > devlink.h, it says
> > > > include/uapi/linux/devlink.h - Network physical device Netlink
> > > > interface,
> > > 
> > > 
> > > Actually not, I think there used to have some discussion last year and the
> > > conclusion is to remove this comment.
> > > 
> > > It supports IB and probably vDPA in the future.
> > > 
> > 
> > hmm... sorry, I couldn't find the discussion you referred to, only the
> > discussions below about why devlink was added.
> > 
> > https://www.mail-archive.com/netdev@vger.kernel.org/msg95801.html
> > >This doesn't seem to be too much related to networking? Why can't
> > >something like this be in sysfs?
> > 
> > It is related to networking quite a bit. There have been a couple of
> > iterations of this, including sysfs and configfs implementations. There
> > has been a consensus reached that this should be done by netlink. I
> > believe netlink is really the best for this purpose. Sysfs is not a good
> > idea
> > 
> > https://www.mail-archive.com/netdev@vger.kernel.org/msg96102.html
> > >there is already a way to change eth/ib via
> > >echo 'eth' > /sys/bus/pci/drivers/mlx4_core/:02:00.0/mlx4_port1
> > >
> > >sounds like this is another way to achieve the same?
> > 
> > It is. However the current way is driver-specific, not correct.
> > For mlx5, we need the same, it cannot be done in this way. So devlink is
> > the correct way to go.
> I'm not sure I agree with that.
> Standardising a filesystem based API that is used across all vendors is
> also a valid option. That said, if devlink is the right choice from a
> kernel perspective, by all means use it, but I have not heard a convincing
> argument for why it is actually better.
> With that said, we have been using tools like ethtool to manage aspects of
> NICs for decades, so it's not that strange an idea to use a tool and a
> binary protocol rather than a text based interface for this, but there are
> advantages to both approaches.
> >
Yes, I agree with you.

> > https://lwn.net/Articles/674867/
> > There is a need for some userspace API that would allow to expose things
> > that are not directly related to any device class like net_device or
> > ib_device, but rather chip-wide/switch-ASIC-wide stuff.
> > 
> > Use cases:
> > 1) get/set of port type (Ethernet/InfiniBand)
> > 2) monitoring of hardware messages to and from chip
> > 3) setting up port splitters - split port into multiple ones and squash
> >    again, enables usage of splitter cable
> > 4) setting up shared buffers - shared among multiple ports within one
> >    chip
> > 
> > 
> > 
> > we actually can also retrieve the same information through sysfs, e.g.
> > 
> > |- [path to device]
> >   |--- migration
> >   | |--- self
> >   | |   |---device_api
> >   | |   |---mdev_type
> >   | |   |---software_version
> >   | |   |---device_id
> >   | |   |---aggregator
> >   | |--- compatible
> >   | |   |---device_api
> >   | |   |---mdev_type
> >   | |   |---software_version
> >   | |   |---device_id
> >   | |   |---aggregator
> > 
> > 
> > 
> > > 
> > > >   I feel like it's not very appropriate for a GPU driver to use
> > > > this interface. Is that right?
> > > 
> > > 
> > > I think not though most of the users are switch or ethernet devices. It
> > > doesn't prevent you from inventing new abstractions.
> > 
> > so need to patch devlink core and the userspace devlink tool?
> > e.g. devlink migration
> and devlink python libs if openstack was to use it directly.
> We do have cases where we just fork a process and execute a command in a
> shell, with or without elevated privilege, but we really don't like doing
> that due to the performance impact and security implications, so where we
> can use python bindings over C APIs we do. pyroute2 is the only python lib
> I know of off the top of my head that supports devlink, so we would 

Re: device compatibility interface for live migration with assigned devices

2020-08-13 Thread Yan Zhao
On Thu, Aug 13, 2020 at 12:24:50PM +0800, Jason Wang wrote:
> 
> On 2020/8/10 3:46 PM, Yan Zhao wrote:
> > > driver is it handled by?
> > It looks that the devlink is for network device specific, and in
> > devlink.h, it says
> > include/uapi/linux/devlink.h - Network physical device Netlink
> > interface,
> 
> 
> Actually not, I think there used to have some discussion last year and the
> conclusion is to remove this comment.
> 
> It supports IB and probably vDPA in the future.
>
hmm... sorry, I couldn't find the discussion you referred to, only the
discussions below about why devlink was added.

https://www.mail-archive.com/netdev@vger.kernel.org/msg95801.html
>This doesn't seem to be too much related to networking? Why can't
>something like this be in sysfs?

It is related to networking quite a bit. There have been a couple of
iterations of this, including sysfs and configfs implementations. There
has been a consensus reached that this should be done by netlink. I
believe netlink is really the best for this purpose. Sysfs is not a good
idea

https://www.mail-archive.com/netdev@vger.kernel.org/msg96102.html
>there is already a way to change eth/ib via
>echo 'eth' > /sys/bus/pci/drivers/mlx4_core/:02:00.0/mlx4_port1
>
>sounds like this is another way to achieve the same?

It is. However the current way is driver-specific, not correct.
For mlx5, we need the same, it cannot be done in this way. So devlink is
the correct way to go.

https://lwn.net/Articles/674867/
There is a need for some userspace API that would allow to expose things
that are not directly related to any device class like net_device or
ib_device, but rather chip-wide/switch-ASIC-wide stuff.

Use cases:
1) get/set of port type (Ethernet/InfiniBand)
2) monitoring of hardware messages to and from chip
3) setting up port splitters - split port into multiple ones and squash
   again, enables usage of splitter cable
4) setting up shared buffers - shared among multiple ports within one
   chip



we actually can also retrieve the same information through sysfs, e.g.

|- [path to device]
  |--- migration
  | |--- self
  | |   |---device_api
  | |   |---mdev_type
  | |   |---software_version
  | |   |---device_id
  | |   |---aggregator
  | |--- compatible
  | |   |---device_api
  | |   |---mdev_type
  | |   |---software_version
  | |   |---device_id
  | |   |---aggregator



> 
> >   I feel like it's not very appropriate for a GPU driver to use
> > this interface. Is that right?
> 
> 
> I think not though most of the users are switch or ethernet devices. It
> doesn't prevent you from inventing new abstractions.
So we would need to patch the devlink core and the userspace devlink tool?
e.g. devlink migration

> Note that devlink is based on netlink, netlink has been widely used by
> various subsystems other than networking.

The advantage of netlink I see is that it can monitor device status and
notify the upper layer that the migration database needs to be updated.
But I'm not sure whether openstack would like to use this capability.
As Sean said, it's heavy for openstack. It's heavy for the vendor driver
as well :)

And devlink monitor currently listens for notifications and dumps the state
changes. If we want to use it, we need to let it forward the notifications
and dumped info to openstack, right?

Thanks
Yan



Re: device compatibility interface for live migration with assigned devices

2020-08-10 Thread Yan Zhao
On Wed, Aug 05, 2020 at 12:53:19PM +0200, Jiri Pirko wrote:
> Wed, Aug 05, 2020 at 11:33:38AM CEST, yan.y.z...@intel.com wrote:
> >On Wed, Aug 05, 2020 at 04:02:48PM +0800, Jason Wang wrote:
> >> 
> >> On 2020/8/5 3:56 PM, Jiri Pirko wrote:
> >> > Wed, Aug 05, 2020 at 04:41:54AM CEST, jasow...@redhat.com wrote:
> >> > > On 2020/8/5 10:16 AM, Yan Zhao wrote:
> >> > > > On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote:
> >> > > > > On 2020/8/5 12:35 AM, Cornelia Huck wrote:
> >> > > > > > [sorry about not chiming in earlier]
> >> > > > > > 
> >> > > > > > On Wed, 29 Jul 2020 16:05:03 +0800
> >> > > > > > Yan Zhao  wrote:
> >> > > > > > 
> >> > > > > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
> >> > > > > > (...)
> >> > > > > > 
> >> > > > > > > > Based on the feedback we've received, the previously
> >> > > > > > > > proposed interface is not viable.  I think there's
> >> > > > > > > > agreement that the user needs to be able to parse and
> >> > > > > > > > interpret the version information.  Using json seems
> >> > > > > > > > viable, but I don't know if it's the best option.  Is
> >> > > > > > > > there any precedent of markup strings returned via sysfs
> >> > > > > > > > we could follow?
> >> > > > > > I don't think encoding complex information in a sysfs file is a 
> >> > > > > > viable
> >> > > > > > approach. Quoting Documentation/filesystems/sysfs.rst:
> >> > > > > > 
> >> > > > > > "Attributes should be ASCII text files, preferably with only one 
> >> > > > > > value
> >> > > > > > per file. It is noted that it may not be efficient to contain 
> >> > > > > > only one
> >> > > > > > value per file, so it is socially acceptable to express an array 
> >> > > > > > of
> >> > > > > > values of the same type.
> >> > > > > > Mixing types, expressing multiple lines of data, and doing fancy
> >> > > > > > formatting of data is heavily frowned upon."
> >> > > > > > 
> >> > > > > > Even though this is an older file, I think these restrictions 
> >> > > > > > still
> >> > > > > > apply.
> >> > > > > +1, that's another reason why devlink(netlink) is better.
> >> > > > > 
> >> > > > hi Jason,
> >> > > > do you have any materials or sample code about devlink, so we can 
> >> > > > have a good
> >> > > > study of it?
> >> > > > I found some kernel docs about it but my preliminary study didn't 
> >> > > > show me the
> >> > > > advantage of devlink.
> >> > > 
> >> > > CC Jiri and Parav for a better answer for this.
> >> > > 
> >> > > My understanding is that the following advantages are obvious (as I 
> >> > > replied
> >> > > in another thread):
> >> > > 
> >> > > - existing users (NIC, crypto, SCSI, ib), mature and stable
> >> > > - much better error reporting (ext_ack other than string or errno)
> >> > > - namespace aware
> >> > > - do not couple with kobject
> >> > Jason, what is your use case?
> >> 
> >> 
> >> I think the use case is to report device compatibility for live migration.
> >> Yan proposed a simple sysfs based migration version first, but it looks not
> >> sufficient and something based on JSON is discussed.
> >> 
> >> Yan, can you help to summarize the discussion so far for Jiri as a
> >> reference?
> >> 
> >yes.
> >we are currently defining an device live migration compatibility
> >interface in order to let user space like openstack and libvirt knows
> >which two devices are live migration compatible.
> >currently the

Re: device compatibility interface for live migration with assigned devices

2020-08-05 Thread Yan Zhao
On Wed, Aug 05, 2020 at 04:02:48PM +0800, Jason Wang wrote:
> 
> On 2020/8/5 3:56 PM, Jiri Pirko wrote:
> > Wed, Aug 05, 2020 at 04:41:54AM CEST, jasow...@redhat.com wrote:
> > > On 2020/8/5 10:16 AM, Yan Zhao wrote:
> > > > On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote:
> > > > > On 2020/8/5 12:35 AM, Cornelia Huck wrote:
> > > > > > [sorry about not chiming in earlier]
> > > > > > 
> > > > > > On Wed, 29 Jul 2020 16:05:03 +0800
> > > > > > Yan Zhao  wrote:
> > > > > > 
> > > > > > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
> > > > > > (...)
> > > > > > 
> > > > > > > > Based on the feedback we've received, the previously proposed 
> > > > > > > > interface
> > > > > > > > is not viable.  I think there's agreement that the user needs 
> > > > > > > > to be
> > > > > > > > able to parse and interpret the version information.  Using 
> > > > > > > > json seems
> > > > > > > > viable, but I don't know if it's the best option.  Is there any
> > > > > > > > precedent of markup strings returned via sysfs we could follow?
> > > > > > I don't think encoding complex information in a sysfs file is a 
> > > > > > viable
> > > > > > approach. Quoting Documentation/filesystems/sysfs.rst:
> > > > > > 
> > > > > > "Attributes should be ASCII text files, preferably with only one 
> > > > > > value
> > > > > > per file. It is noted that it may not be efficient to contain only 
> > > > > > one
> > > > > > value per file, so it is socially acceptable to express an array of
> > > > > > values of the same type.
> > > > > > Mixing types, expressing multiple lines of data, and doing fancy
> > > > > > formatting of data is heavily frowned upon."
> > > > > > 
> > > > > > Even though this is an older file, I think these restrictions still
> > > > > > apply.
> > > > > +1, that's another reason why devlink(netlink) is better.
> > > > > 
> > > > hi Jason,
> > > > do you have any materials or sample code about devlink, so we can have 
> > > > a good
> > > > study of it?
> > > > I found some kernel docs about it but my preliminary study didn't show 
> > > > me the
> > > > advantage of devlink.
> > > 
> > > CC Jiri and Parav for a better answer for this.
> > > 
> > > My understanding is that the following advantages are obvious (as I 
> > > replied
> > > in another thread):
> > > 
> > > - existing users (NIC, crypto, SCSI, ib), mature and stable
> > > - much better error reporting (ext_ack other than string or errno)
> > > - namespace aware
> > > - do not couple with kobject
> > Jason, what is your use case?
> 
> 
> I think the use case is to report device compatibility for live migration.
> Yan proposed a simple sysfs based migration version first, but it looks not
> sufficient and something based on JSON is discussed.
> 
> Yan, can you help to summarize the discussion so far for Jiri as a
> reference?
> 
yes.
we are currently defining a device live migration compatibility
interface so that user space such as openstack and libvirt knows
which two devices are live migration compatible.
currently the devices include mdev (a kernel emulated virtual device)
and physical devices (e.g. a VF of a PCI SRIOV device).

the attributes we want user space to compare include
common attributes:
device_api: vfio-pci, vfio-ccw...
mdev_type: mdev type of an mdev, or a similar signature for a physical device.
   It specifies a device's hardware capability. e.g.
   i915-GVTg_V5_4 means it is 1/4 of a gen9 Intel graphics
   device.
software_version: device driver's version,
   in major.minor[.bugfix] scheme, where there is no
   compatibility across major versions, minor versions have
   forward compatibility (ex. 1 -> 2 is ok, 2 -> 1 is not) and
   the bugfix version number indicates some degree of internal
   improvement that is not visible to the user in terms of
   features or compatibility.

vendor specific attributes: each vendor may define different attributes
   device id : device id of a physical devices or

Re: device compatibility interface for live migration with assigned devices

2020-08-04 Thread Yan Zhao
On Wed, Aug 05, 2020 at 10:22:15AM +0800, Jason Wang wrote:
> 
> On 2020/8/5 12:35 AM, Cornelia Huck wrote:
> > [sorry about not chiming in earlier]
> > 
> > On Wed, 29 Jul 2020 16:05:03 +0800
> > Yan Zhao  wrote:
> > 
> > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
> > (...)
> > 
> > > > Based on the feedback we've received, the previously proposed interface
> > > > is not viable.  I think there's agreement that the user needs to be
> > > > able to parse and interpret the version information.  Using json seems
> > > > viable, but I don't know if it's the best option.  Is there any
> > > > precedent of markup strings returned via sysfs we could follow?
> > I don't think encoding complex information in a sysfs file is a viable
> > approach. Quoting Documentation/filesystems/sysfs.rst:
> > 
> > "Attributes should be ASCII text files, preferably with only one value
> > per file. It is noted that it may not be efficient to contain only one
> > value per file, so it is socially acceptable to express an array of
> > values of the same type.
> > Mixing types, expressing multiple lines of data, and doing fancy
> > formatting of data is heavily frowned upon."
> > 
> > Even though this is an older file, I think these restrictions still
> > apply.
> 
> 
> +1, that's another reason why devlink(netlink) is better.
>
hi Jason,
do you have any materials or sample code for devlink, so that we can study
it properly?
I found some kernel docs about it, but my preliminary study didn't show me
the advantage of devlink.

Thanks
Yan



Re: device compatibility interface for live migration with assigned devices

2020-08-04 Thread Yan Zhao
> > yes, include a device_api field is better.
> > for mdev, "device_type=vfio-mdev", is it right?
> 
> No, vfio-mdev is not a device API, it's the driver that attaches to the
> mdev bus device to expose it through vfio.  The device_api exposes the
> actual interface of the vfio device, it's also vfio-pci for typical
> mdev devices found on x86, but may be vfio-ccw, vfio-ap, etc...  See
> VFIO_DEVICE_API_PCI_STRING and friends.
> 
ok. got it.

> > > > >   device_id=8086591d  
> > > 
> > > Is device_id interpreted relative to device_type?  How does this
> > > relate to mdev_type?  If we have an mdev_type, doesn't that fully
> > > defined the software API?
> > >   
> > it's parent pci id for mdev actually.
>
> If we need to specify the parent PCI ID then something is fundamentally
> wrong with the mdev_type.  The mdev_type should define a unique,
> software compatible interface, regardless of the parent device IDs.  If
> a i915-GVTg_V5_2 means different things based on the parent device IDs,
> then different mdev_types should be reported for those parent
> devices.
>
hmm, then do we allow vendor specific fields?
or is it a must that each vendor specific field has a corresponding
vendor attribute?

another thing is that the definition of mdev_type in GVT currently only
corresponds to vGPU computing ability,
e.g. i915-GVTg_V5_2 is 1/2 of a gen9 IGD, and i915-GVTg_V4_2 is 1/2 of a
gen8 IGD.
That is too coarse-grained for live migration compatibility.

Do you think we need to update GVT's definition of mdev_type?
And is there any guideline for defining an mdev_type?

> > > > >   mdev_type=i915-GVTg_V5_2  
> > > 
> > > And how are non-mdev devices represented?
> > >   
> > non-mdev can opt to not include this field, or as you said below, a
> > vendor signature. 
> > 
> > > > >   aggregator=1
> > > > >   pv_mode="none+ppgtt+context"  
> > > 
> > > These are meaningless vendor specific matches afaict.
> > >   
> > yes, pv_mode and aggregator are vendor specific fields.
> > but they are important to decide whether two devices are compatible.
> > pv_mode means whether a vGPU supports guest paravirtualized api.
> > "none+ppgtt+context" means guest can not use pv, or use ppgtt mode pv or
> > use context mode pv.
> > 
> > > > >   interface_version=3  
> > > 
> > > Not much granularity here, I prefer Sean's previous
> > > major.minor[.bugfix] scheme.
> > >   
> > yes, major.minor[.bugfix] scheme may be better, but I'm not sure if
> > it works for a complicated scenario.
> > e.g for pv_mode,
> > (1) initially,  pv_mode is not supported, so it's pv_mode=none, it's 0.0.0,
> > (2) then, pv_mode=ppgtt is supported, pv_mode="none+ppgtt", it's 0.1.0,
> > indicating pv_mode=none can migrate to pv_mode="none+ppgtt", but not vice 
> > versa.
> > (3) later, pv_mode=context is also supported,
> > pv_mode="none+ppgtt+context", so it's 0.2.0.
> > 
> > But if later, pv_mode=ppgtt is removed. pv_mode="none+context", how to
> > name its version? "none+ppgtt" (0.1.0) is not compatible to
> > "none+context", but "none+ppgtt+context" (0.2.0) is compatible to
> > "none+context".
> 
> If pv_mode=ppgtt is removed, then the compatible versions would be
> 0.0.0 or 1.0.0, ie. the major version would be incremented due to
> feature removal.
>  
> > Maintain such scheme is painful to vendor driver.
> 
> Migration compatibility is painful, there's no way around that.  I
> think the version scheme is an attempt to push some of that low level
> burden on the vendor driver, otherwise the management tools need to
> work on an ever growing matrix of vendor specific features which is
> going to become unwieldy and is largely meaningless outside of the
> vendor driver.  Instead, the vendor driver can make strategic decisions
> about where to continue to maintain a support burden and make explicit
> decisions to maintain or break compatibility.  The version scheme is a
> simplification and abstraction of vendor driver features in order to
> create a small, logical compatibility matrix.  Compromises necessarily
> need to be made for that to occur.
>
ok. got it.
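
To make sure I read the rule correctly, here is a minimal sketch of the
major.minor[.bugfix] check in Python. It assumes only what was stated in this
thread: no compatibility across major versions, minor versions are forward
compatible only, and the bugfix number does not affect compatibility.
parse_version and can_migrate are hypothetical names for illustration, not an
existing kernel or libvirt API.

```python
from typing import NamedTuple


class Version(NamedTuple):
    major: int
    minor: int
    bugfix: int = 0


def parse_version(text: str) -> Version:
    """Parse "major.minor[.bugfix]"; the bugfix part is optional."""
    parts = [int(p) for p in text.split(".")]
    if len(parts) not in (2, 3):
        raise ValueError(f"expected major.minor[.bugfix], got {text!r}")
    return Version(*parts)


def can_migrate(src: str, dst: str) -> bool:
    """True if a device at version src may migrate to one at version dst."""
    s, d = parse_version(src), parse_version(dst)
    # No compatibility across major versions; minor versions are forward
    # compatible only (1 -> 2 is ok, 2 -> 1 is not); the bugfix number
    # is ignored for the compatibility decision.
    return s.major == d.major and s.minor <= d.minor


# pv_mode example from above: 0.0.0 (none), 0.1.0 (none+ppgtt),
# 0.2.0 (none+ppgtt+context), 1.0.0 (after ppgtt is removed).
print(can_migrate("0.0.0", "0.1.0"))
print(can_migrate("0.1.0", "0.0.0"))
print(can_migrate("0.2.0", "1.0.0"))
```

With this rule the management tool never has to understand pv_mode itself; the
vendor driver encodes the feature history in the version number.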

> > > > > COMPATIBLE:
> > > > >   device_type=pci
> > > > >   device_id=8086591d
> > > > >   mdev_type=i915-GVTg_V5_{val1:int:1,2,4,8}
> > > > this mixed notation will be hard to parse so i would avoid that.  
> > > 
> > > Some background, Intel has been proposing aggregation as a solution to
> > > how we scale mdev devices when hardware exposes large numbers of
> > > assignable objects that can be composed in essentially arbitrary ways.
> > > So for instance, if we have a workqueue (wq), we might have an mdev
> > > type for 1wq, 2wq, 3wq,... Nwq.  It's not really practical to expose a
> > > discrete mdev type for each of those, so they want to define a base
> > > type which is composable to other types via this aggregation.  This is
> > > what this substitution and tagging is attempting to accomplish.  So
> > > imagine this set of values for cases where it's not practical to unroll
> > 

Re: device compatibility interface for live migration with assigned devices

2020-07-29 Thread Yan Zhao
On Wed, Jul 29, 2020 at 01:12:55PM -0600, Alex Williamson wrote:
> On Wed, 29 Jul 2020 12:28:46 +0100
> Sean Mooney  wrote:
> 
> > On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote:
> > > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:  
> > > > On Mon, 27 Jul 2020 15:24:40 +0800
> > > > Yan Zhao  wrote:
> > > >   
> > > > > > > As you indicate, the vendor driver is responsible for checking 
> > > > > > > version
> > > > > > > information embedded within the migration stream.  Therefore a
> > > > > > > migration should fail early if the devices are incompatible.  Is 
> > > > > > > it
> > > > > > 
> > > > > > but as I know, currently in VFIO migration protocol, we have no way 
> > > > > > to
> > > > > > get vendor specific compatibility checking string in migration 
> > > > > > setup stage
> > > > > > (i.e. .save_setup stage) before the device is set to _SAVING state.
> > > > > > In this way, for devices who does not save device data in precopy 
> > > > > > stage,
> > > > > > the migration compatibility checking is as late as in stop-and-copy
> > > > > > stage, which is too late.
> > > > > > do you think we need to add the getting/checking of vendor specific
> > > > > > compatibility string early in save_setup stage?
> > > > > >
> > > > > 
> > > > > hi Alex,
> > > > > after an offline discussion with Kevin, I realized that it may not be 
> > > > > a
> > > > > problem if migration compatibility check in vendor driver occurs late 
> > > > > in
> > > > > stop-and-copy phase for some devices, because if we report device
> > > > > compatibility attributes clearly in an interface, the chances for
> > > > > libvirt/openstack to make a wrong decision is little.  
> > > > 
> > > > I think it would be wise for a vendor driver to implement a pre-copy
> > > > phase, even if only to send version information and verify it at the
> > > > target.  Deciding you have no device state to send during pre-copy does
> > > > not mean your vendor driver needs to opt-out of the pre-copy phase
> > > > entirely.  Please also note that pre-copy is at the user's discretion,
> > > > we've defined that we can enter stop-and-copy at any point, including
> > > > without a pre-copy phase, so I would recommend that vendor drivers
> > > > validate compatibility at the start of both the pre-copy and the
> > > > stop-and-copy phases.
> > > >   
> > > 
> > > ok. got it!
> > >   
> > > > > so, do you think we are now arriving at an agreement that we'll give 
> > > > > up
> > > > > the read-and-test scheme and start to defining one interface (perhaps 
> > > > > in
> > > > > json format), from which libvirt/openstack is able to parse and find 
> > > > > out
> > > > > compatibility list of a source mdev/physical device?  
> > > > 
> > > > Based on the feedback we've received, the previously proposed interface
> > > > is not viable.  I think there's agreement that the user needs to be
> > > > able to parse and interpret the version information.  Using json seems
> > > > viable, but I don't know if it's the best option.  Is there any
> > > > precedent of markup strings returned via sysfs we could follow?  
> > > 
> > > I found some examples of using formatted string under /sys, mostly under
> > > tracing. maybe we can do a similar implementation.
> > > 
> > > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format
> > > 
> > > name: kvm_mmio
> > > ID: 32
> > > format:
> > > field:unsigned short common_type;   offset:0;   size:2; 
> > > signed:0;
> > > field:unsigned char common_flags;   offset:2;   size:1; 
> > > signed:0;
> > > field:unsigned char common_preempt_count;   offset:3;   
> > > size:1; signed:0;
> > > field:int common_pid;   offset:4;   size:4; signed:1;
> > > 
> > > field:u32 type; offset:8;   size:4; signed:0;
> > > field:u32 len;  offset:12;  size:4; signed:0;
> > > field:u64 gpa;  off

Re: device compatibility interface for live migration with assigned devices

2020-07-29 Thread Yan Zhao
On Wed, Jul 29, 2020 at 12:28:46PM +0100, Sean Mooney wrote:
> On Wed, 2020-07-29 at 16:05 +0800, Yan Zhao wrote:
> > On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
> > > On Mon, 27 Jul 2020 15:24:40 +0800
> > > Yan Zhao  wrote:
> > > 
> > > > > > As you indicate, the vendor driver is responsible for checking 
> > > > > > version
> > > > > > information embedded within the migration stream.  Therefore a
> > > > > > migration should fail early if the devices are incompatible.  Is it 
> > > > > >  
> > > > > 
> > > > > but as I know, currently in VFIO migration protocol, we have no way to
> > > > > get vendor specific compatibility checking string in migration setup 
> > > > > stage
> > > > > (i.e. .save_setup stage) before the device is set to _SAVING state.
> > > > > In this way, for devices who does not save device data in precopy 
> > > > > stage,
> > > > > the migration compatibility checking is as late as in stop-and-copy
> > > > > stage, which is too late.
> > > > > do you think we need to add the getting/checking of vendor specific
> > > > > compatibility string early in save_setup stage?
> > > > >  
> > > > 
> > > > hi Alex,
> > > > after an offline discussion with Kevin, I realized that it may not be a
> > > > problem if migration compatibility check in vendor driver occurs late in
> > > > stop-and-copy phase for some devices, because if we report device
> > > > compatibility attributes clearly in an interface, the chances for
> > > > libvirt/openstack to make a wrong decision is little.
> > > 
> > > I think it would be wise for a vendor driver to implement a pre-copy
> > > phase, even if only to send version information and verify it at the
> > > target.  Deciding you have no device state to send during pre-copy does
> > > not mean your vendor driver needs to opt-out of the pre-copy phase
> > > entirely.  Please also note that pre-copy is at the user's discretion,
> > > we've defined that we can enter stop-and-copy at any point, including
> > > without a pre-copy phase, so I would recommend that vendor drivers
> > > validate compatibility at the start of both the pre-copy and the
> > > stop-and-copy phases.
> > > 
> > 
> > ok. got it!
> > 
> > > > so, do you think we are now arriving at an agreement that we'll give up
> > > > the read-and-test scheme and start to defining one interface (perhaps in
> > > > json format), from which libvirt/openstack is able to parse and find out
> > > > compatibility list of a source mdev/physical device?
> > > 
> > > Based on the feedback we've received, the previously proposed interface
> > > is not viable.  I think there's agreement that the user needs to be
> > > able to parse and interpret the version information.  Using json seems
> > > viable, but I don't know if it's the best option.  Is there any
> > > precedent of markup strings returned via sysfs we could follow?
> > 
> > I found some examples of using formatted string under /sys, mostly under
> > tracing. maybe we can do a similar implementation.
> > 
> > #cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format
> > 
> > name: kvm_mmio
> > ID: 32
> > format:
> > field:unsigned short common_type;   offset:0;   size:2; 
> > signed:0;
> > field:unsigned char common_flags;   offset:2;   size:1; 
> > signed:0;
> > field:unsigned char common_preempt_count;   offset:3;   
> > size:1; signed:0;
> > field:int common_pid;   offset:4;   size:4; signed:1;
> > 
> > field:u32 type; offset:8;   size:4; signed:0;
> > field:u32 len;  offset:12;  size:4; signed:0;
> > field:u64 gpa;  offset:16;  size:8; signed:0;
> > field:u64 val;  offset:24;  size:8; signed:0;
> > 
> > print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", 
> > __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read"
> > }, { 2, "write" }), REC->len, REC->gpa, REC->val
> > 
> this is not json format and it's not super friendly to parse.
yes, it's just an example. It's exported to be used by userspace perf &
trace_cmd.

> > 
> > #cat /sys/devices/pci:00/:00:02.0/uevent
> > DRIVER=v

Re: device compatibility interface for live migration with assigned devices

2020-07-29 Thread Yan Zhao
On Mon, Jul 27, 2020 at 04:23:21PM -0600, Alex Williamson wrote:
> On Mon, 27 Jul 2020 15:24:40 +0800
> Yan Zhao  wrote:
> 
> > > > As you indicate, the vendor driver is responsible for checking version
> > > > information embedded within the migration stream.  Therefore a
> > > > migration should fail early if the devices are incompatible.  Is it  
> > > but as I know, currently in VFIO migration protocol, we have no way to
> > > get vendor specific compatibility checking string in migration setup stage
> > > (i.e. .save_setup stage) before the device is set to _SAVING state.
> > > In this way, for devices who does not save device data in precopy stage,
> > > the migration compatibility checking is as late as in stop-and-copy
> > > stage, which is too late.
> > > do you think we need to add the getting/checking of vendor specific
> > > compatibility string early in save_setup stage?
> > >  
> > hi Alex,
> > after an offline discussion with Kevin, I realized that it may not be a
> > problem if migration compatibility check in vendor driver occurs late in
> > stop-and-copy phase for some devices, because if we report device
> > compatibility attributes clearly in an interface, the chances for
> > libvirt/openstack to make a wrong decision is little.
> 
> I think it would be wise for a vendor driver to implement a pre-copy
> phase, even if only to send version information and verify it at the
> target.  Deciding you have no device state to send during pre-copy does
> not mean your vendor driver needs to opt-out of the pre-copy phase
> entirely.  Please also note that pre-copy is at the user's discretion,
> we've defined that we can enter stop-and-copy at any point, including
> without a pre-copy phase, so I would recommend that vendor drivers
> validate compatibility at the start of both the pre-copy and the
> stop-and-copy phases.
>
ok. got it!
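
Just to confirm my understanding, a toy model of that recommendation in Python:
the compatibility check runs on entry to both phases, so it still runs when
userspace skips pre-copy. ToyTargetDevice and its methods are hypothetical and
only model the control flow; this is not the VFIO migration API.

```python
class IncompatibleDeviceError(Exception):
    """Raised when the source version is not accepted by the target."""


class ToyTargetDevice:
    """Toy stand-in for the target side of a vendor driver."""

    def __init__(self, accepted_versions):
        self.accepted_versions = set(accepted_versions)
        self.phase = "idle"

    def _validate(self, source_version):
        if source_version not in self.accepted_versions:
            raise IncompatibleDeviceError(source_version)

    def enter_precopy(self, source_version):
        # Check first, even if no device state is sent during pre-copy.
        self._validate(source_version)
        self.phase = "pre-copy"

    def enter_stop_and_copy(self, source_version):
        # Check again: userspace may enter this phase without pre-copy.
        self._validate(source_version)
        self.phase = "stop-and-copy"
```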

> > so, do you think we are now arriving at an agreement that we'll give up
> > the read-and-test scheme and start to defining one interface (perhaps in
> > json format), from which libvirt/openstack is able to parse and find out
> > compatibility list of a source mdev/physical device?
> 
> Based on the feedback we've received, the previously proposed interface
> is not viable.  I think there's agreement that the user needs to be
> able to parse and interpret the version information.  Using json seems
> viable, but I don't know if it's the best option.  Is there any
> precedent of markup strings returned via sysfs we could follow?
I found some examples of formatted strings under /sys, mostly under
tracing. Maybe we can do a similar implementation.

#cat /sys/kernel/debug/tracing/events/kvm/kvm_mmio/format

name: kvm_mmio
ID: 32
format:
field:unsigned short common_type;   offset:0;   size:2; signed:0;
field:unsigned char common_flags;   offset:2;   size:1; signed:0;
field:unsigned char common_preempt_count;   offset:3;   size:1; signed:0;
field:int common_pid;   offset:4;   size:4; signed:1;

field:u32 type; offset:8;   size:4; signed:0;
field:u32 len;  offset:12;  size:4; signed:0;
field:u64 gpa;  offset:16;  size:8; signed:0;
field:u64 val;  offset:24;  size:8; signed:0;

print fmt: "mmio %s len %u gpa 0x%llx val 0x%llx", __print_symbolic(REC->type, { 0, "unsatisfied-read" }, { 1, "read" }, { 2, "write" }), REC->len, REC->gpa, REC->val


#cat /sys/devices/pci:00/:00:02.0/uevent
DRIVER=vfio-pci
PCI_CLASS=3
PCI_ID=8086:591D
PCI_SUBSYS_ID=8086:2212
PCI_SLOT_NAME=:00:02.0
MODALIAS=pci:v8086d591Dsv8086sd2212bc03sc00i00
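
For comparison, the uevent style (one KEY=VALUE pair per line) is trivial for
userspace to parse. A minimal Python sketch, where parse_uevent is a
hypothetical helper name and the sample values are copied from the output
above:

```python
from typing import Dict


def parse_uevent(text: str) -> Dict[str, str]:
    """Turn uevent-style KEY=VALUE lines into a dict."""
    attrs = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or "=" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split("=", 1)
        attrs[key] = value
    return attrs


sample = """\
DRIVER=vfio-pci
PCI_ID=8086:591D
PCI_SUBSYS_ID=8086:2212
"""

attrs = parse_uevent(sample)
print(attrs["DRIVER"], attrs["PCI_ID"])
```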

> 
> Your idea of having both a "self" object and an array of "compatible"
> objects is perhaps something we can build on, but we must not assume
> PCI devices at the root level of the object.  Providing both the
> mdev-type and the driver is a bit redundant, since the former includes
> the latter.  We can't have vendor specific versioning schemes though,
> ie. gvt-version. We need to agree on a common scheme and decide which
> fields the version is relative to, ex. just the mdev type?
what about making all compared fields vendor specific?
userspace like openstack would then only need to parse them and check whether
the target device is within the source's compatible list, without
understanding the meaning of each field.
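
A minimal sketch of that opaque comparison in Python, assuming the "self" /
"compatible" JSON layout that was floated earlier in this thread (the layout is
not agreed, and is_in_compatible_list is a hypothetical name). Every field is
compared as an opaque string; nothing is interpreted.

```python
import json


def is_in_compatible_list(source_json: str, target_json: str) -> bool:
    """True if every "self" entry of the target appears, field for field,
    in the "compatible" list published by the source."""
    source = json.loads(source_json)
    target = json.loads(target_json)
    compatible = source["compatible"]
    # Dict equality compares all keys and values as plain strings, so
    # userspace needs no knowledge of what a field means.
    return all(entry in compatible for entry in target["self"])


src = json.dumps({
    "self": [{"mdev_type": "i915-GVTg_V5_2", "aggregator": "1"}],
    "compatible": [
        {"mdev_type": "i915-GVTg_V5_2", "aggregator": "1"},
        {"mdev_type": "i915-GVTg_V5_4", "aggregator": "2"},
    ],
})
dst_ok = json.dumps({"self": [{"mdev_type": "i915-GVTg_V5_4", "aggregator": "2"}],
                     "compatible": []})
dst_bad = json.dumps({"self": [{"mdev_type": "i915-GVTg_V5_8", "aggregator": "1"}],
                      "compatible": []})

print(is_in_compatible_list(src, dst_ok))
print(is_in_compatible_list(src, dst_bad))
```

The cost of this approach is that the list has to enumerate every acceptable
target combination explicitly.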

> I had also proposed fields that provide information to create a
> compatible type, for example to create a type_x2 device from a type_x1
> mdev type, they need to know to apply an aggregation attribute.  If we
> need to explicitly list every aggregation value and the res

Re: device compatibility interface for live migration with assigned devices

2020-07-27 Thread Yan Zhao
> > As you indicate, the vendor driver is responsible for checking version
> > information embedded within the migration stream.  Therefore a
> > migration should fail early if the devices are incompatible.  Is it
> but as I know, currently in VFIO migration protocol, we have no way to
> get vendor specific compatibility checking string in migration setup stage
> (i.e. .save_setup stage) before the device is set to _SAVING state.
> In this way, for devices who does not save device data in precopy stage,
> the migration compatibility checking is as late as in stop-and-copy
> stage, which is too late.
> do you think we need to add the getting/checking of vendor specific
> compatibility string early in save_setup stage?
>
hi Alex,
after an offline discussion with Kevin, I realized that it may not be a
problem if the migration compatibility check in the vendor driver occurs late,
in the stop-and-copy phase, for some devices: if we report device
compatibility attributes clearly in an interface, the chance of
libvirt/openstack making a wrong decision is small.
so, do you think we are now arriving at an agreement that we'll give up
the read-and-test scheme and start defining one interface (perhaps in
json format), from which libvirt/openstack is able to parse and find out
the compatibility list of a source mdev/physical device?

Thanks
Yan



Re: device compatibility interface for live migration with assigned devices

2020-07-20 Thread Yan Zhao
On Fri, Jul 17, 2020 at 10:12:58AM -0600, Alex Williamson wrote:
<...>
> > yes, in another reply, Alex proposed to use an interface in json format.
> > I guess we can define something like
> > 
> > { "self" :
> >   [
> >     { "pciid" : "8086591d",
> >       "driver" : "i915",
> >       "gvt-version" : "v1",
> >       "mdev_type"   : "i915-GVTg_V5_2",
> >       "aggregator"  : "1",
> >       "pv-mode" : "none"
> >     }
> >   ],
> >   "compatible" :
> >   [
> >     { "pciid" : "8086591d",
> >       "driver" : "i915",
> >       "gvt-version" : "v1",
> >       "mdev_type"   : "i915-GVTg_V5_2",
> >       "aggregator"  : "1",
> >       "pv-mode" : "none"
> >     },
> >     { "pciid" : "8086591d",
> >       "driver" : "i915",
> >       "gvt-version" : "v1",
> >       "mdev_type"   : "i915-GVTg_V5_4",
> >       "aggregator"  : "2",
> >       "pv-mode" : "none"
> >     },
> >     { "pciid" : "8086591d",
> >       "driver" : "i915",
> >       "gvt-version" : "v2",
> >       "mdev_type"   : "i915-GVTg_V5_4",
> >       "aggregator"  : "2",
> >       "pv-mode" : "none, ppgtt, context"
> >     }
> >     ...
> >   ]
> > }
> > 
> > But as those fields are mostly vendor specific, the userspace can
> > only do simple string comparing, I guess the list would be very long as
> > it needs to enumerate all possible targets.
> 
> 
> This ignores so much of what I tried to achieve in my example :(
> 
sorry, I was just eager to show and confirm the way to list all compatible
combinations of mdev_type and mdev attributes.

> 
> > also, in some fields like "gvt-version", is there a simple way to express
> > things like v2+?
> 
> 
> That's not a reasonable thing to express anyway, how can you be certain
> that v3 won't break compatibility with v2?  Sean proposed a versioning
> scheme that accounts for this, using an x.y.z version expressing the
> major, minor, and bugfix versions, where there is no compatibility
> across major versions, minor versions have forward compatibility (ex. 1
> -> 2 is ok, 2 -> 1 is not) and bugfix version number indicates some
> degree of internal improvement that is not visible to the user in terms
> of features or compatibility, but provides a basis for preferring
> equally compatible candidates.
>
right. if the self version is v1, it can't know that v2 is compatible with
it. it can only be done in reverse, i.e.
when the self version is v2, it can list v1 and v2 as its compatible
versions.
and maybe later, when the self version is v3, there's no v1 in its compatible
list.

In this way, do you think we still need the complex x.y.z versioning scheme?
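
What I mean is roughly the following Python sketch, where each (newer) device
lists the source versions it accepts. The table contents are hypothetical (in
particular, v3 accepting v2 is only an assumption for illustration), and
ACCEPTED_SOURCES / can_migrate are made-up names.

```python
# Hypothetical table: for each target version, the source versions it accepts.
ACCEPTED_SOURCES = {
    "v1": {"v1"},
    "v2": {"v1", "v2"},
    "v3": {"v2", "v3"},  # v1 dropped from v3's list (assumed example)
}


def can_migrate(src_version: str, dst_version: str) -> bool:
    """True if the target's own list names the source version."""
    return src_version in ACCEPTED_SOURCES.get(dst_version, set())


print(can_migrate("v1", "v2"))
print(can_migrate("v2", "v1"))
print(can_migrate("v1", "v3"))
```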

>  
> > If the userspace can read this interface both in src and target and
> > check whether both src and target are in corresponding compatible list, I
> > think it will work for us.
> > 
> > But still, kernel should not rely on userspace's choice, the opaque
> > compatibility string is still required in kernel. No matter whether
> > it would be exposed to userspace as an compatibility checking interface,
> > vendor driver would keep this part of code and embed the string into the
> > migration stream. so exposing it as an interface to be used by libvirt to
> > do a safety check before a real live migration is only about enabling
> > the kernel part of check to happen ahead.
> 
> As you indicate, the vendor driver is responsible for checking version
> information embedded within the migration stream.  Therefore a
> migration should fail early if the devices are incompatible.  Is it
but as far as I know, the current VFIO migration protocol gives us no way to
get the vendor specific compatibility checking string in the migration setup
stage (i.e. the .save_setup stage) before the device is set to _SAVING state.
As a result, for devices that do not save device data in the precopy stage,
the migration compatibility check happens as late as the stop-and-copy
stage, which is too late.
do you think we need to add getting/checking of the vendor specific
compatibility string early, in the save_setup stage?

> really libvirt's place to second guess what it has been directed to do?
if libvirt uses the scheme of reading the compatibility string at the source
and writing it at the target for checking, it cannot be called "a second
guess". It's not a guess, but a confirmation.

> Why would we even proceed to design a user parse-able version interface
> if we still have a dependency on an opaque interface?  Thanks,
one reason is that libvirt can't trust the parsing result from
openstack.
Another reason is that this opaque interface is easier for libvirt to use
than doing another round of parsing by itself, and it does not put more
burden on the kernel, which would have to write this part of the code anyway,
whether libvirt uses it or not.
 
Thanks
Yan



Re: device compatibility interface for live migration with assigned devices

2020-07-16 Thread Yan Zhao
On Thu, Jul 16, 2020 at 12:16:26PM +0800, Jason Wang wrote:
> 
> On 2020/7/14 上午7:29, Yan Zhao wrote:
> > hi folks,
> > we are defining a device migration compatibility interface that helps upper
> > layer stack like openstack/ovirt/libvirt to check if two devices are
> > live migration compatible.
> > The "devices" here could be MDEVs, physical devices, or hybrid of the two.
> > e.g. we could use it to check whether
> > - a src MDEV can migrate to a target MDEV,
> > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > - a src MDEV can migration to a target VF in SRIOV.
> >   (e.g. SIOV/SRIOV backward compatibility case)
> > 
> > The upper layer stack could use this interface as the last step to check
> > if one device is able to migrate to another device before triggering a real
> > live migration procedure.
> > we are not sure if this interface is of value or help to you. please don't
> > hesitate to drop your valuable comments.
> > 
> > 
> > (1) interface definition
> > The interface is defined in below way:
> > 
> >              __    userspace
> >               /\              \
> >              /                 \write
> >             / read              \
> >    ________/__________       ___\|/_____________
> >   | migration_version |     | migration_version |-->check migration
> >   ---------------------     ---------------------   compatibility
> >      device A                    device B
> > 
> > 
> > a device attribute named migration_version is defined under each device's
> > sysfs node. e.g. 
> > (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version).
> 
> 
> Are you aware of the devlink based device management interface that is
> proposed upstream? I think it has many advantages over sysfs, do you
> consider to switch to that?
I'm not familiar with devlink. I will do some research on it.
> 
> 
> > userspace tools read the migration_version as a string from the source device,
> > and write it to the migration_version sysfs attribute in the target device.
> > 
> > The userspace should treat ANY of below conditions as two devices not compatible:
> > - any one of the two devices does not have a migration_version attribute
> > - error when reading from migration_version attribute of one device
> > - error when writing migration_version string of one device to
> >   migration_version attribute of the other device
> > 
> > The string read from migration_version attribute is defined by device vendor
> > driver and is completely opaque to the userspace.
> 
> 
> My understanding is that something opaque to userspace is not the philosophy

But VFIO live migration is itself essentially a big opaque stream to
userspace.

> of Linux. Instead of having a generic API but opaque value, why not do in a
> vendor specific way like:
> 
> 1) exposing the device capability in a vendor specific way via sysfs/devlink
> or other API
> 2) management read capability in both src and dst and determine whether we
> can do the migration
> 
> This is the way we plan to do with vDPA.
>
Yes, in another reply, Alex proposed using an interface in JSON format.
I guess we can define something like

{ "self" :
  [
    { "pciid"       : "8086591d",
      "driver"      : "i915",
      "gvt-version" : "v1",
      "mdev_type"   : "i915-GVTg_V5_2",
      "aggregator"  : "1",
      "pv-mode"     : "none"
    }
  ],
  "compatible" :
  [
    { "pciid"       : "8086591d",
      "driver"      : "i915",
      "gvt-version" : "v1",
      "mdev_type"   : "i915-GVTg_V5_2",
      "aggregator"  : "1",
      "pv-mode"     : "none"
    },
    { "pciid"       : "8086591d",
      "driver"      : "i915",
      "gvt-version" : "v1",
      "mdev_type"   : "i915-GVTg_V5_4",
      "aggregator"  : "2",
      "pv-mode"     : "none"
    },
    { "pciid"       : "8086591d",
      "driver"      : "i915",
      "gvt-version" : "v2",
      "mdev_type"   : "i915-GVTg_V5_4",
      "aggregator"  : "2",
      "pv-mode"     : "none, ppgtt, context"
    }
    ...
  ]
}

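To make the comparison concrete, here is a minimal Python sketch of what a
management tool could do with documents of this shape. It is only an
illustration: the SOURCE/TARGET values and the helper name source_fits_target
are made up for this example, and the rule "every source self entry must
appear in the target compatible list" is an assumption, not an agreed
interface.

```python
import json

# Hypothetical capability documents in the shape sketched above.
SOURCE = json.loads("""
{ "self": [ { "pciid": "8086591d", "driver": "i915", "gvt-version": "v1",
              "mdev_type": "i915-GVTg_V5_2", "aggregator": "1",
              "pv-mode": "none" } ],
  "compatible": [] }
""")

TARGET = json.loads("""
{ "self": [ { "pciid": "8086591d", "driver": "i915", "gvt-version": "v2",
              "mdev_type": "i915-GVTg_V5_4", "aggregator": "2",
              "pv-mode": "none, ppgtt, context" } ],
  "compatible": [
    { "pciid": "8086591d", "driver": "i915", "gvt-version": "v1",
      "mdev_type": "i915-GVTg_V5_2", "aggregator": "1", "pv-mode": "none" },
    { "pciid": "8086591d", "driver": "i915", "gvt-version": "v1",
      "mdev_type": "i915-GVTg_V5_4", "aggregator": "2", "pv-mode": "none" }
  ] }
""")


def source_fits_target(source: dict, target: dict) -> bool:
    """Return True if every "self" entry of the source is listed, field for
    field, in the target's "compatible" list (plain string equality only)."""
    return all(
        any(entry == candidate for candidate in target["compatible"])
        for entry in source["self"]
    )
```

Because the match is field-by-field string equality, a source entry with
pv-mode "none" does not match a target entry listing "none, ppgtt, context";
every acceptable combination has to be spelled out as its own entry.
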
But as those fields are mostly vendor specific, the userspace can
only do simple string comparing, I guess the list would be very long as
it needs to 

Re: device compatibility interface for live migration with assigned devices

2020-07-15 Thread Yan Zhao
On Tue, Jul 14, 2020 at 02:59:48PM -0600, Alex Williamson wrote:
> On Tue, 14 Jul 2020 18:19:46 +0100
> "Dr. David Alan Gilbert"  wrote:
> 
> > * Alex Williamson (alex.william...@redhat.com) wrote:
> > > On Tue, 14 Jul 2020 11:21:29 +0100
> > > Daniel P. Berrangé  wrote:
> > >   
> > > > On Tue, Jul 14, 2020 at 07:29:57AM +0800, Yan Zhao wrote:  
> > > > > hi folks,
> > > > > we are defining a device migration compatibility interface that helps 
> > > > > upper
> > > > > layer stack like openstack/ovirt/libvirt to check if two devices are
> > > > > live migration compatible.
> > > > > The "devices" here could be MDEVs, physical devices, or hybrid of the 
> > > > > two.
> > > > > e.g. we could use it to check whether
> > > > > - a src MDEV can migrate to a target MDEV,
> > > > > - a src VF in SRIOV can migrate to a target VF in SRIOV,
> > > > > - a src MDEV can migration to a target VF in SRIOV.
> > > > >   (e.g. SIOV/SRIOV backward compatibility case)
> > > > > 
> > > > > The upper layer stack could use this interface as the last step to 
> > > > > check
> > > > > if one device is able to migrate to another device before triggering 
> > > > > a real
> > > > > live migration procedure.
> > > > > we are not sure if this interface is of value or help to you. please 
> > > > > don't
> > > > > hesitate to drop your valuable comments.
> > > > > 
> > > > > 
> > > > > (1) interface definition
> > > > > The interface is defined in below way:
> > > > > 
> > > > > >              __    userspace
> > > > > >               /\              \
> > > > > >              /                 \write
> > > > > >             / read              \
> > > > > >    ________/__________       ___\|/_____________
> > > > > >   | migration_version |     | migration_version |-->check migration
> > > > > >   ---------------------     ---------------------   compatibility
> > > > > >      device A                    device B
> > > > > 
> > > > > 
> > > > > a device attribute named migration_version is defined under each 
> > > > > device's
> > > > > sysfs node. e.g. 
> > > > > (/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version).
> > > > > userspace tools read the migration_version as a string from the 
> > > > > source device,
> > > > > and write it to the migration_version sysfs attribute in the target 
> > > > > device.
> > > > > 
> > > > > The userspace should treat ANY of below conditions as two devices not 
> > > > > compatible:
> > > > > - any one of the two devices does not have a migration_version 
> > > > > attribute
> > > > > - error when reading from migration_version attribute of one device
> > > > > - error when writing migration_version string of one device to
> > > > >   migration_version attribute of the other device
> > > > > 
> > > > > The string read from migration_version attribute is defined by device 
> > > > > vendor
> > > > > driver and is completely opaque to the userspace.
> > > > > for a Intel vGPU, string format can be defined like
> > > > > "parent device PCI ID" + "version of gvt driver" + "mdev type" + 
> > > > > "aggregator count".
> > > > > 
> > > > > for an NVMe VF connecting to a remote storage. it could be
> > > > > "PCI ID" + "driver version" + "configured remote storage URL"
> > > > > 
> > > > > for a QAT VF, it may be
> > > > > "PCI ID" + "driver version" + "supported encryption set".
> > > > > 
> > > > > (to avoid namespace confliction from each vendor, we may prefix a 
> > > > > driver name to
> > > > > each migration_version string. e.g. 
> > > > > i915-v1-8086-591d-i915-GVTg_V5_8-1)  
> > > 
> > > It's very strange to define it as opaque and then proceed to describe
> > > the contents of that opaque string.  The point is that its contents
> > > are defined by

device compatibility interface for live migration with assigned devices

2020-07-13 Thread Yan Zhao
Hi folks,
we are defining a device migration compatibility interface that helps upper
layer stacks like openstack/ovirt/libvirt check whether two devices are
live migration compatible.
The "devices" here could be MDEVs, physical devices, or a hybrid of the two.
E.g. we could use it to check whether
- a src MDEV can migrate to a target MDEV,
- a src VF in SRIOV can migrate to a target VF in SRIOV,
- a src MDEV can migrate to a target VF in SRIOV
  (e.g. the SIOV/SRIOV backward compatibility case).

The upper layer stack could use this interface as the last step to check
whether one device is able to migrate to another device before triggering a
real live migration procedure.
We are not sure whether this interface is of value or help to you. Please
don't hesitate to share your comments.


(1) interface definition
The interface is defined as follows:

             __    userspace
              /\              \
             /                 \write
            / read              \
   ________/__________       ___\|/_____________
  | migration_version |     | migration_version |-->check migration
  ---------------------     ---------------------   compatibility
     device A                    device B


A device attribute named migration_version is defined under each device's
sysfs node, e.g.
(/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version).
Userspace tools read the migration_version as a string from the source device,
and write it to the migration_version sysfs attribute of the target device.

Userspace should treat ANY of the conditions below as meaning that the two
devices are not compatible:
- either of the two devices does not have a migration_version attribute
- an error occurs when reading the migration_version attribute of one device
- an error occurs when writing the migration_version string of one device to
  the migration_version attribute of the other device

The string read from the migration_version attribute is defined by the device
vendor driver and is completely opaque to userspace.
For an Intel vGPU, the string format could be defined like
"parent device PCI ID" + "version of gvt driver" + "mdev type" + "aggregator count".

For an NVMe VF connected to remote storage, it could be
"PCI ID" + "driver version" + "configured remote storage URL".

For a QAT VF, it may be
"PCI ID" + "driver version" + "supported encryption set".

(To avoid namespace conflicts between vendors, we may prefix a driver name to
each migration_version string, e.g. i915-v1-8086-591d-i915-GVTg_V5_8-1.)

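The read-then-write check described above can be sketched from the userspace
side roughly as follows. This is only a minimal sketch: the function name
devices_compatible and the way device directories are passed in are
assumptions for illustration; only the attribute name migration_version
comes from the proposal.

```python
import os
from pathlib import Path


def devices_compatible(src_dev: str, dst_dev: str) -> bool:
    """Read migration_version from the source device directory and write it
    to the target's attribute. Any missing attribute or I/O error is treated
    as "not compatible", as the proposal requires."""
    src_attr = Path(src_dev) / "migration_version"
    dst_attr = Path(dst_dev) / "migration_version"

    # Condition 1: both devices must expose the attribute.
    if not (src_attr.is_file() and dst_attr.is_file()):
        return False

    # Condition 2: reading the source attribute must succeed.
    try:
        version = src_attr.read_bytes()
    except OSError:
        return False

    # Condition 3: writing the string to the target must succeed.
    # The write is unbuffered, so a rejection by the driver shows up here.
    try:
        fd = os.open(dst_attr, os.O_WRONLY)
        try:
            os.write(fd, version)
        finally:
            os.close(fd)
    except OSError:
        return False
    return True
```

The string is passed through as raw bytes and never inspected, which keeps
it opaque to the caller.
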

(2) background

The reason we hope the migration_version string is opaque to userspace
is that it is hard to generalize standard comparing fields and comparing
methods for different devices from different vendors.
Userspace could still do a simple string compare to check whether
two devices are compatible, and a positive result would be right, but that is
too limited because it excludes possible candidates whose migration_version
strings are not equal.
E.g. an MDEV with mdev_type_1 and aggregator count 3 is probably compatible
with another MDEV with mdev_type_3 and aggregator count 1, even though their
migration_version strings are not equal
(assuming mdev_type_3 has 3 times the resources of mdev_type_1).

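A small sketch of that aggregator example, from the point of view of a
vendor driver's decision logic. Everything here is hypothetical: the
RESOURCE_UNITS table, the made-up string format and both helper names exist
only to show why string equality is stricter than needed.

```python
# Hypothetical resource units per mdev type; mdev_type_3 is assumed to
# provide 3 times the resources of mdev_type_1, as in the example above.
RESOURCE_UNITS = {"mdev_type_1": 1, "mdev_type_3": 3}


def version_string(mdev_type: str, aggregator: int) -> str:
    """Build an illustrative (made-up) version string for one mdev."""
    return f"{mdev_type}-{aggregator}"


def vendor_compatible(src: tuple, dst: tuple) -> bool:
    """Vendor-side check: compare total resources, not the raw strings."""
    src_type, src_agg = src
    dst_type, dst_agg = dst
    return RESOURCE_UNITS[src_type] * src_agg == RESOURCE_UNITS[dst_type] * dst_agg


src = ("mdev_type_1", 3)
dst = ("mdev_type_3", 1)
strings_match = version_string(*src) == version_string(*dst)
resources_match = vendor_compatible(src, dst)
```

Here the two strings differ while the total resources are equal, so only the
vendor-side check would accept this pair.
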
Besides that, driver version and configured resources are also elements
that need to be taken into account.

So we hope to leave that freedom to the vendor driver and let it make the
final decision, through a simple scheme of reading from the source side and
writing to the target side for testing.


We then think the device compatibility issue for live migration with assigned
devices can be divided into two steps:
a. Management tools narrow down the possible migration target devices.
   Tags could be created according to info from the product specification.
   We think openstack/ovirt may have vendor proprietary components to create
   those customized tags for each product from each vendor.
   E.g.
   for Intel vGPU, with a vGPU (an MDEV device) on the source side, the tags to
   search for a target vGPU are like:
   a tag for compatible parent PCI IDs,
   a tag for a range of gvt driver versions,
   a tag for a range of mdev type + aggregator count

   for NVMe VF, the tags to search for a target VF may be like:
   a tag for compatible PCI IDs,
   a tag for a range of driver versions,
   a tag for the URL of the configured remote storage.

b. With the output from step a, openstack/ovirt/libvirt could use our proposed
   device migration compatibility interface to make sure the two devices are
   indeed live migration compatible before launching the real live migration
   process, i.e. before stream copying, src device stopping and target device
   resuming begin.
   This step is not expected to bring any performance penalty, as
   - in the kernel it's just a simple string decoding and comparing
   - in openstack/ovirt, it could be done by extending the current function
     check_can_live_migrate_destination, alongside claiming target resources.[1]

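The two steps could be combined in a management tool roughly like this. It
is a sketch under assumptions: tags are modelled as plain strings, a
candidate qualifies in step a when it carries every tag required by the
source, and confirm stands in for the read/write test of step b. None of
these names exist in openstack, ovirt or libvirt.

```python
from typing import Callable, Dict, Optional, Set


def pick_target(
    source_tags: Set[str],
    candidates: Dict[str, Set[str]],
    confirm: Callable[[str], bool],
) -> Optional[str]:
    """Return the first candidate that passes both steps, or None.

    Step a: keep only candidates carrying every tag the source requires.
    Step b: confirm the survivor with the device-level compatibility check.
    """
    for name, tags in candidates.items():
        if source_tags <= tags and confirm(name):
            return name
    return None


# Hypothetical inventory: tag strings and device names are made up.
candidates = {
    "hostA/vgpu0": {"pci:8086591d"},
    "hostB/vgpu1": {"pci:8086591d", "gvt:v1"},
}
chosen = pick_target({"pci:8086591d", "gvt:v1"}, candidates, lambda name: True)
```

The tag filter keeps the expensive part cheap: only candidates that survive
step a ever reach the device-level check.
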

[1] 

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-06-21 Thread Yan Zhao
On Fri, Jun 19, 2020 at 04:40:46PM -0600, Alex Williamson wrote:
> On Tue, 9 Jun 2020 20:37:31 -0400
> Yan Zhao  wrote:
> 
> > On Fri, Jun 05, 2020 at 03:39:50PM +0100, Dr. David Alan Gilbert wrote:
> > > > > > > I tried to simplify the problem a bit, but we keep going backwards.  If
> > > > > > > the requirement is that potentially any source device can migrate to any
> > > > > > > target device and we cannot provide any means other than writing an
> > > > > > > opaque source string into a version attribute on the target and
> > > > > > > evaluating the result to determine compatibility, then we're requiring
> > > > > > > userspace to do an exhaustive search to find a potential match.  That
> > > > > > > sucks.
> > > > > >  
> > hi Alex and Dave,
> > do you think it's good for us to put aside physical devices and mdev aggregation
> > for the moment, and use Alex's original idea that
> > 
> > +  Userspace should regard two mdev devices compatible when ALL of below
> > +  conditions are met:
> > +  (0) The mdev devices are of the same type
> > +  (1) success when reading from migration_version attribute of one mdev device.
> > +  (2) success when writing migration_version string of one mdev device to
> > +  migration_version attribute of the other mdev device.
> 
> I think Pandora's box is already opened, if we can't articulate how
> this solution would evolve to support features that we know are coming,
> why should we proceed with this approach?  We've already seen interest
> in breaking rule (0) in this thread, so we can't focus the solution on
> mdev devices.
> 
> Maybe the best we can do is to compare one instance of a device to
> another instance of a device, without any capability to predict
> compatibility prior to creating devices, in the case on mdev.  The
> string would need to include not only the device and vendor driver
> compatibility, but also anything that has modified the state of the
> device, such as creation time or post-creation time configuration.  The
> user is left on their own for creating a compatible device, or
> filtering devices to determine which might be, or which might generate,
> compatible devices.  It's not much of a solution, I wonder if anyone
> would even use it.
> 
> > and what about adding another sysfs attribute for vendors to put
> > recommended migration compatible device type. e.g.
> > #cat /sys/bus/pci/devices/:00:02.0/mdev_supported_types/i915-GVTg_V5_8/migration_compatible_devices
> > parent id: 8086 591d
> > mdev_type: i915-GVTg_V5_8
> > 
> > vendors are free to define the format and conent of this migration_compatible_devices
> > and it's even not to be a full list.
> > 
> > before libvirt or user to do live migration, they have to read and test
> > migration_version attributes of src/target devices to check migration compatibility.
> 
> AFAICT, free-form, vendor defined attributes are useless to libvirt.
> Vendors could already put this information in the description attribute
> and have it ignored by userspace tools due to the lack of defined
> format.  It's also not clear what value this provides when it's
> necessarily incomplete, a driver written today cannot know what future
> drivers might be compatible with its migration data.  Thanks,
>
Hi Alex,
maybe the problem can be divided into two pieces:
(1) How to create/locate two migration compatible devices. For normal
users, the most common and safest way is to find an exact duplicate
of the source device. So for mdev, that probably means creating a target mdev
with the same parent PCI ID, mdev type and creation parameters as the
source mdev; and for physical devices, it means locating a target device with
the same PCI ID as the source device, plus some extra constraints (e.g. the
target NVMe device is configured to the same remote device as the source
NVMe device; or the target QAT device supports the same encryption
algorithm set as the source QAT device...).
I think a possible solution for this piece is to let vendor drivers provide a
creating/locating script to find such an exact duplicate of the source device.
Then, before libvirt is about to do a live migration, it can use this script to
create a target VM with exactly the same configuration as the source VM.

(2) How to identify that two devices are migration compatible after they are
created, even when they are not exactly identical (e.g. their parent
devices differ slightly in hardware SKU). This identification is
necessary even 

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-06-09 Thread Yan Zhao
On Fri, Jun 05, 2020 at 03:39:50PM +0100, Dr. David Alan Gilbert wrote:
> > > > I tried to simplify the problem a bit, but we keep going backwards.  If
> > > > the requirement is that potentially any source device can migrate to any
> > > > target device and we cannot provide any means other than writing an
> > > > opaque source string into a version attribute on the target and
> > > > evaluating the result to determine compatibility, then we're requiring
> > > > userspace to do an exhaustive search to find a potential match.  That
> > > > sucks.   
> > >
Hi Alex and Dave,
do you think it would be good for us to put aside physical devices and mdev
aggregation for the moment, and use Alex's original idea that

+  Userspace should regard two mdev devices compatible when ALL of below
+  conditions are met:
+  (0) The mdev devices are of the same type
+  (1) success when reading from migration_version attribute of one mdev device.
+  (2) success when writing migration_version string of one mdev device to
+  migration_version attribute of the other mdev device.

And what about adding another sysfs attribute where vendors can put the
recommended migration compatible device types? E.g.
# cat /sys/bus/pci/devices/:00:02.0/mdev_supported_types/i915-GVTg_V5_8/migration_compatible_devices
parent id: 8086 591d
mdev_type: i915-GVTg_V5_8

Vendors are free to define the format and content of this
migration_compatible_devices attribute, and it does not even have to be a
full list.

Before libvirt or a user does a live migration, they still have to read and
test the migration_version attributes of the src/target devices to check
migration compatibility.

Thanks
Yan


> > > Why is the mechanism a 'write and test' why isn't it a 'write and ask'?
> > > i.e. the destination tells the driver what type it's received from the
> > > source, and the driver replies with a set of compatible configurations
> > > (in some preferred order).
> > 
> > A 'write and ask' interface would imply some sort of session in order
> > to not be racy with concurrent users.  More likely this would imply an
> > ioctl interface, which I don't think we have in sysfs.  Where do we
> > host this ioctl?
> 
> Or one fd?
>   f=open()
>   write(f, "The ID I want")
>   do {
>  read(f, ...)  -> The IDs we're offering that are compatible
>   } while (!eof)
> 
> > > It's also not clear to me why the name has to be that opaque;
> > > I agree it's only got to be understood by the driver but that doesn't
> > > seem to be a reason for the driver to make it purposely obfuscated.
> > > I wouldn't expect a user to be able to parse it necessarily; but would
> > > expect something that would be useful for an error message.
> > 
> > If the name is not opaque, then we're going to rat hole on the format
> > and the fields and evolving that format for every feature a vendor
> > decides they want the user to be able to parse out of the version
> > string.  Then we require a full specification of the string in order
> > that it be parsed according to a standard such that we don't break
> > users inferring features in subtly different ways.
> > 
> > This is a lot like the problems with mdev description attributes,
> > libvirt complains they can't use description because there's no
> > standard formatting, but even with two vendors describing the same class
> > of device we don't have an agreed set of things to expose in the
> > description attribute.  Thanks,
> 
> I'm not suggesting anything in anyway machine parsable; just something
> human readable that you can present in a menu/choice/configuration/error
> message.  The text would be down to the vendor, and I'd suggest it start
> with the vendor name just as a disambiguator and to make it obvious when
> we get it grossly wrong.
> 
> Dave
> 
> > Alex
> --
> Dr. David Alan Gilbert / dgilb...@redhat.com / Manchester, UK
> 
> ___
> intel-gvt-dev mailing list
> intel-gvt-...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev



Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-06-02 Thread Yan Zhao
On Tue, Jun 02, 2020 at 09:55:28PM -0600, Alex Williamson wrote:
> On Tue, 2 Jun 2020 23:19:48 -0400
> Yan Zhao  wrote:
> 
> > On Tue, Jun 02, 2020 at 04:55:27PM -0600, Alex Williamson wrote:
> > > On Wed, 29 Apr 2020 20:39:50 -0400
> > > Yan Zhao  wrote:
> > >   
> > > > On Wed, Apr 29, 2020 at 05:48:44PM +0800, Dr. David Alan Gilbert wrote:
> > > >   
> > > > > > > > > > > > > > > > > An mdev type is meant to define a software 
> > > > > > > > > > > > > > > > > compatible interface, so in
> > > > > > > > > > > > > > > > > the case of mdev->mdev migration, doesn't 
> > > > > > > > > > > > > > > > > migrating to a different type
> > > > > > > > > > > > > > > > > fail the most basic of compatibility tests 
> > > > > > > > > > > > > > > > > that we expect userspace to
> > > > > > > > > > > > > > > > > perform?  IOW, if two mdev types are 
> > > > > > > > > > > > > > > > > migration compatible, it seems a
> > > > > > > > > > > > > > > > > prerequisite to that is that they provide the 
> > > > > > > > > > > > > > > > > same software interface,
> > > > > > > > > > > > > > > > > which means they should be the same mdev type.
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > > In the hybrid cases of mdev->phys or 
> > > > > > > > > > > > > > > > > phys->mdev, how does a
> > > > > > > > > > > > > > > > management
> > > > > > > > > > > > > > > > > tool begin to even guess what might be 
> > > > > > > > > > > > > > > > > compatible?  Are we expecting
> > > > > > > > > > > > > > > > > libvirt to probe ever device with this 
> > > > > > > > > > > > > > > > > attribute in the system?  Is
> > > > > > > > > > > > > > > > > there going to be a new class hierarchy 
> > > > > > > > > > > > > > > > > created to enumerate all
> > > > > > > > > > > > > > > > > possible migrate-able devices?
> > > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > yes, management tool needs to guess and test 
> > > > > > > > > > > > > > > > migration compatible
> > > > > > > > > > > > > > > > between two devices. But I think it's not the 
> > > > > > > > > > > > > > > > problem only for
> > > > > > > > > > > > > > > > mdev->phys or phys->mdev. even for mdev->mdev, 
> > > > > > > > > > > > > > > > management tool needs
> > > > > > > > > > > > > > > > to
> > > > > > > > > > > > > > > > first assume that the two mdevs have the same 
> > > > > > > > > > > > > > > > type of parent devices
> > > > > > > > > > > > > > > > (e.g.their pciids are equal). otherwise, it's 
> > > > > > > > > > > > > > > > still enumerating
> > > > > > > > > > > > > > > > possibilities.
> > > > > > > > > > > > > > > > 
> > > > > > > > > > > > > > > > on the other hand, for two mdevs,
> > > > > > > > > > > > > > > > mdev1 from pdev1, its mdev_type is 1/2 of pdev1;
> > > > > > > > > > > > > > > > mdev2 from pdev2, its mdev_type is 1/4 of pdev2;
> > > > > > > > > > > > > > > > if pdev2 is exactly 2

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-06-02 Thread Yan Zhao
On Tue, Jun 02, 2020 at 04:55:27PM -0600, Alex Williamson wrote:
> On Wed, 29 Apr 2020 20:39:50 -0400
> Yan Zhao  wrote:
> 
> > On Wed, Apr 29, 2020 at 05:48:44PM +0800, Dr. David Alan Gilbert wrote:
> > 
> > > > > > > > > > > > > > > An mdev type is meant to define a software 
> > > > > > > > > > > > > > > compatible interface, so in
> > > > > > > > > > > > > > > the case of mdev->mdev migration, doesn't 
> > > > > > > > > > > > > > > migrating to a different type
> > > > > > > > > > > > > > > fail the most basic of compatibility tests that 
> > > > > > > > > > > > > > > we expect userspace to
> > > > > > > > > > > > > > > perform?  IOW, if two mdev types are migration 
> > > > > > > > > > > > > > > compatible, it seems a
> > > > > > > > > > > > > > > prerequisite to that is that they provide the 
> > > > > > > > > > > > > > > same software interface,
> > > > > > > > > > > > > > > which means they should be the same mdev type.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > In the hybrid cases of mdev->phys or phys->mdev, 
> > > > > > > > > > > > > > > how does a  
> > > > > > > > > > > > > > management  
> > > > > > > > > > > > > > > tool begin to even guess what might be 
> > > > > > > > > > > > > > > compatible?  Are we expecting
> > > > > > > > > > > > > > > libvirt to probe ever device with this attribute 
> > > > > > > > > > > > > > > in the system?  Is
> > > > > > > > > > > > > > > there going to be a new class hierarchy created 
> > > > > > > > > > > > > > > to enumerate all
> > > > > > > > > > > > > > > possible migrate-able devices?
> > > > > > > > > > > > > > >  
> > > > > > > > > > > > > > yes, management tool needs to guess and test 
> > > > > > > > > > > > > > migration compatible
> > > > > > > > > > > > > > between two devices. But I think it's not the 
> > > > > > > > > > > > > > problem only for
> > > > > > > > > > > > > > mdev->phys or phys->mdev. even for mdev->mdev, 
> > > > > > > > > > > > > > management tool needs
> > > > > > > > > > > > > > to
> > > > > > > > > > > > > > first assume that the two mdevs have the same type 
> > > > > > > > > > > > > > of parent devices
> > > > > > > > > > > > > > (e.g.their pciids are equal). otherwise, it's still 
> > > > > > > > > > > > > > enumerating
> > > > > > > > > > > > > > possibilities.
> > > > > > > > > > > > > > 
> > > > > > > > > > > > > > on the other hand, for two mdevs,
> > > > > > > > > > > > > > mdev1 from pdev1, its mdev_type is 1/2 of pdev1;
> > > > > > > > > > > > > > mdev2 from pdev2, its mdev_type is 1/4 of pdev2;
> > > > > > > > > > > > > > if pdev2 is exactly 2 times of pdev1, why not allow 
> > > > > > > > > > > > > > migration between
> > > > > > > > > > > > > > mdev1 <-> mdev2.  
> > > > > > > > > > > > > 
> > > > > > > > > > > > > How could the manage tool figure out that 1/2 of 
> > > > > > > > > > > > > pdev1 is equivalent 
> > > > > > > > > > > > > to 1/4 of pdev2? If we really want to allow su

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-29 Thread Yan Zhao
On Wed, Apr 29, 2020 at 10:13:01PM +0800, Eric Blake wrote:
> [meta-comment]
> 
> On 4/29/20 4:35 AM, Yan Zhao wrote:
> > On Wed, Apr 29, 2020 at 04:22:01PM +0800, Dr. David Alan Gilbert wrote:
> [...]
> >>>>>>>>>>>>>>>>> This patchset introduces a migration_version attribute 
> >>>>>>>>>>>>>>>>> under sysfs
> >>>>>>>>>>> of VFIO
> >>>>>>>>>>>>>>>>> Mediated devices.
> 
> Hmm, several pages with up to 16 levels of quoting, with editors making 
> the lines ragged, all before I get to the real meat of the email. 
> Remember, it's okay to trim content,...
> 
> >> So why don't we split the difference; lets say that it should start with
> >> the hex PCI Vendor ID.
> >>
> > The problem is for mdev devices, if the parent devices are not PCI devices,
> > they don't have PCI vendor IDs.
> 
> ...to just what you are replying to.
>
Sorry for that. Next time I'll try to strike a better balance between
keeping the conversation background and getting to the real meat of the email.

Thanks for the reminder.
Yan




Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-29 Thread Yan Zhao
On Wed, Apr 29, 2020 at 05:48:44PM +0800, Dr. David Alan Gilbert wrote:

> > > > > > > > > > > > > An mdev type is meant to define a software compatible 
> > > > > > > > > > > > > interface, so in
> > > > > > > > > > > > > the case of mdev->mdev migration, doesn't migrating 
> > > > > > > > > > > > > to a different type
> > > > > > > > > > > > > fail the most basic of compatibility tests that we 
> > > > > > > > > > > > > expect userspace to
> > > > > > > > > > > > > perform?  IOW, if two mdev types are migration 
> > > > > > > > > > > > > compatible, it seems a
> > > > > > > > > > > > > prerequisite to that is that they provide the same 
> > > > > > > > > > > > > software interface,
> > > > > > > > > > > > > which means they should be the same mdev type.
> > > > > > > > > > > > >
> > > > > > > > > > > > > In the hybrid cases of mdev->phys or phys->mdev, how 
> > > > > > > > > > > > > does a
> > > > > > > > > > > > management
> > > > > > > > > > > > > tool begin to even guess what might be compatible?  
> > > > > > > > > > > > > Are we expecting
> > > > > > > > > > > > > libvirt to probe ever device with this attribute in 
> > > > > > > > > > > > > the system?  Is
> > > > > > > > > > > > > there going to be a new class hierarchy created to 
> > > > > > > > > > > > > enumerate all
> > > > > > > > > > > > > possible migrate-able devices?
> > > > > > > > > > > > >
> > > > > > > > > > > > yes, management tool needs to guess and test migration 
> > > > > > > > > > > > compatible
> > > > > > > > > > > > between two devices. But I think it's not the problem 
> > > > > > > > > > > > only for
> > > > > > > > > > > > mdev->phys or phys->mdev. even for mdev->mdev, 
> > > > > > > > > > > > management tool needs
> > > > > > > > > > > > to
> > > > > > > > > > > > first assume that the two mdevs have the same type of 
> > > > > > > > > > > > parent devices
> > > > > > > > > > > > (e.g.their pciids are equal). otherwise, it's still 
> > > > > > > > > > > > enumerating
> > > > > > > > > > > > possibilities.
> > > > > > > > > > > > 
> > > > > > > > > > > > on the other hand, for two mdevs,
> > > > > > > > > > > > mdev1 from pdev1, its mdev_type is 1/2 of pdev1;
> > > > > > > > > > > > mdev2 from pdev2, its mdev_type is 1/4 of pdev2;
> > > > > > > > > > > > if pdev2 is exactly 2 times of pdev1, why not allow 
> > > > > > > > > > > > migration between
> > > > > > > > > > > > mdev1 <-> mdev2.
> > > > > > > > > > > 
> > > > > > > > > > > How could the manage tool figure out that 1/2 of pdev1 is 
> > > > > > > > > > > equivalent 
> > > > > > > > > > > to 1/4 of pdev2? If we really want to allow such thing 
> > > > > > > > > > > happen, the best
> > > > > > > > > > > choice is to report the same mdev type on both pdev1 and 
> > > > > > > > > > > pdev2.
> > > > > > > > > > > I think that's exactly the value of this migration_version interface.
> > > > > > > > > > > the management tool can take advantage of this interface to know if two
> > > > > > > > > > > devices are migration compatible, no matter whether they are mdevs,
> > > > > > > > > > > non-mdevs, or a mix.
> > > > > > > > > > >
> > > > > > > > > > > as far as I know (please correct me if I'm wrong), current libvirt still
> > > > > > > > > > > requires manually generating mdev devices, and it just duplicates the src vm
> > > > > > > > > > > configuration to the target vm.
> > > > > > > > > > > for libvirt, currently it's always phys->phys and mdev->mdev (and of the
> > > > > > > > > > > same mdev type).
> > > > > > > > > > > But that does not justify disallowing hybrid cases. otherwise,
> > > > > > > > > > > why do we need to introduce this migration_version interface and leave
> > > > > > > > > > > the judgement of migration compatibility to the vendor driver? why not simply
> > > > > > > > > > > set the criteria to something like "pciids of parent devices are equal,
> > > > > > > > > > > and mdev types are equal" ?
> > > > > > > > > > 
> > > > > > > > > > 
> > > > > > > > > > > > btw mdev<->phys just brings trouble to upper stack as Alex pointed out.
> > > > > > > > > > > could you help me understand why it will bring trouble to upper stack?
> > > > > > > > > > >
> > > > > > > > > > > I think it just needs to read src migration_version under src dev node,
> > > > > > > > > > > and test it against the target migration_version under target dev node.
> > > > > > > > > > >
> > > > > > > > > > > after all, through this interface we just help the upper layer
> > > > > > > > > > > know the available options through reading and testing, and they decide
> > > > > > > > > > > whether to use it or not.
> > > > > > > > > > 
> > > > > > > > > > > Can we simplify the requirement by allowing only 
> > > > > > > > > > > 

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-29 Thread Yan Zhao
On Wed, Apr 29, 2020 at 04:22:01PM +0800, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.z...@intel.com) wrote:
> > On Tue, Apr 28, 2020 at 10:14:37PM +0800, Dr. David Alan Gilbert wrote:
> > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > On Mon, Apr 27, 2020 at 11:37:43PM +0800, Dr. David Alan Gilbert wrote:
> > > > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > > > On Sat, Apr 25, 2020 at 03:10:49AM +0800, Dr. David Alan Gilbert 
> > > > > > wrote:
> > > > > > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > > > > > On Tue, Apr 21, 2020 at 08:08:49PM +0800, Tian, Kevin wrote:
> > > > > > > > > > From: Yan Zhao
> > > > > > > > > > Sent: Tuesday, April 21, 2020 10:37 AM
> > > > > > > > > > 
> > > > > > > > > > On Tue, Apr 21, 2020 at 06:56:00AM +0800, Alex Williamson 
> > > > > > > > > > wrote:
> > > > > > > > > > > On Sun, 19 Apr 2020 21:24:57 -0400
> > > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > > >
> > > > > > > > > > > > On Fri, Apr 17, 2020 at 07:24:57PM +0800, Cornelia Huck 
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > On Fri, 17 Apr 2020 05:52:02 -0400
> > > > > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > On Fri, Apr 17, 2020 at 04:44:50PM +0800, Cornelia 
> > > > > > > > > > > > > > Huck wrote:
> > > > > > > > > > > > > > > On Mon, 13 Apr 2020 01:52:01 -0400
> > > > > > > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This patchset introduces a migration_version 
> > > > > > > > > > > > > > > > attribute under sysfs
> > > > > > > > > > of VFIO
> > > > > > > > > > > > > > > > Mediated devices.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > This migration_version attribute is used to 
> > > > > > > > > > > > > > > > check migration
> > > > > > > > > > compatibility
> > > > > > > > > > > > > > > > between two mdev devices.
> > > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > > Currently, it has two locations:
> > > > > > > > > > > > > > > > (1) under mdev_type node,
> > > > > > > > > > > > > > > > which can be used even before device 
> > > > > > > > > > > > > > > > creation, but only for
> > > > > > > > > > mdev
> > > > > > > > > > > > > > > > devices of the same mdev type.
> > > > > > > > > > > > > > > > (2) under mdev device node,
> > > > > > > > > > > > > > > > which can only be used after the mdev 
> > > > > > > > > > > > > > > > devices are created, but
> > > > > > > > > > the src
> > > > > > > > > > > > > > > > and target mdev devices are not necessarily 
> > > > > > > > > > > > > > > > be of the same
> > > > > > > > > > mdev type
> > > > > > > > > > > > > > > > (The second location is newly added in v5, in 
> > > > > > > > > > > > > > > > order to keep
> > > > > > > > > > consistent
> > > > > > > > > > > > > > > > with the migration_version node for migratable 
> > > > > > > > > > > > > > > > pass-though

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-29 Thread Yan Zhao
On Tue, Apr 28, 2020 at 10:14:37PM +0800, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.z...@intel.com) wrote:
> > On Mon, Apr 27, 2020 at 11:37:43PM +0800, Dr. David Alan Gilbert wrote:
> > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > On Sat, Apr 25, 2020 at 03:10:49AM +0800, Dr. David Alan Gilbert wrote:
> > > > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > > > On Tue, Apr 21, 2020 at 08:08:49PM +0800, Tian, Kevin wrote:
> > > > > > > > From: Yan Zhao
> > > > > > > > Sent: Tuesday, April 21, 2020 10:37 AM
> > > > > > > > 
> > > > > > > > On Tue, Apr 21, 2020 at 06:56:00AM +0800, Alex Williamson wrote:
> > > > > > > > > On Sun, 19 Apr 2020 21:24:57 -0400
> > > > > > > > > Yan Zhao  wrote:
> > > > > > > > >
> > > > > > > > > > On Fri, Apr 17, 2020 at 07:24:57PM +0800, Cornelia Huck 
> > > > > > > > > > wrote:
> > > > > > > > > > > On Fri, 17 Apr 2020 05:52:02 -0400
> > > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > > >
> > > > > > > > > > > > On Fri, Apr 17, 2020 at 04:44:50PM +0800, Cornelia Huck 
> > > > > > > > > > > > wrote:
> > > > > > > > > > > > > On Mon, 13 Apr 2020 01:52:01 -0400
> > > > > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > This patchset introduces a migration_version 
> > > > > > > > > > > > > > attribute under sysfs
> > > > > > > > of VFIO
> > > > > > > > > > > > > > Mediated devices.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This migration_version attribute is used to check 
> > > > > > > > > > > > > > migration
> > > > > > > > compatibility
> > > > > > > > > > > > > > between two mdev devices.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Currently, it has two locations:
> > > > > > > > > > > > > > (1) under mdev_type node,
> > > > > > > > > > > > > > which can be used even before device creation, 
> > > > > > > > > > > > > > but only for
> > > > > > > > mdev
> > > > > > > > > > > > > > devices of the same mdev type.
> > > > > > > > > > > > > > (2) under mdev device node,
> > > > > > > > > > > > > > which can only be used after the mdev devices 
> > > > > > > > > > > > > > are created, but
> > > > > > > > the src
> > > > > > > > > > > > > > and target mdev devices are not necessarily be 
> > > > > > > > > > > > > > of the same
> > > > > > > > mdev type
> > > > > > > > > > > > > > (The second location is newly added in v5, in order 
> > > > > > > > > > > > > > to keep
> > > > > > > > consistent
> > > > > > > > > > > > > > with the migration_version node for migratable 
> > > > > > > > > > > > > > pass-though
> > > > > > > > devices)
> > > > > > > > > > > > >
> > > > > > > > > > > > > What is the relationship between those two attributes?
> > > > > > > > > > > > >
> > > > > > > > > > > > (1) is for mdev devices specifically, and (2) is 
> > > > > > > > > > > > provided to keep the
> > > > > > > > same
> > > > > > > > > > > > sysfs interface as with non-mdev cases. so (2) is for 
> > > > > > > > > > > > both mdev
> > > > > > > > dev

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-27 Thread Yan Zhao
On Mon, Apr 27, 2020 at 11:37:43PM +0800, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.z...@intel.com) wrote:
> > On Sat, Apr 25, 2020 at 03:10:49AM +0800, Dr. David Alan Gilbert wrote:
> > > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > > On Tue, Apr 21, 2020 at 08:08:49PM +0800, Tian, Kevin wrote:
> > > > > > From: Yan Zhao
> > > > > > Sent: Tuesday, April 21, 2020 10:37 AM
> > > > > > 
> > > > > > On Tue, Apr 21, 2020 at 06:56:00AM +0800, Alex Williamson wrote:
> > > > > > > On Sun, 19 Apr 2020 21:24:57 -0400
> > > > > > > Yan Zhao  wrote:
> > > > > > >
> > > > > > > > On Fri, Apr 17, 2020 at 07:24:57PM +0800, Cornelia Huck wrote:
> > > > > > > > > On Fri, 17 Apr 2020 05:52:02 -0400
> > > > > > > > > Yan Zhao  wrote:
> > > > > > > > >
> > > > > > > > > > On Fri, Apr 17, 2020 at 04:44:50PM +0800, Cornelia Huck 
> > > > > > > > > > wrote:
> > > > > > > > > > > On Mon, 13 Apr 2020 01:52:01 -0400
> > > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > > >
> > > > > > > > > > > > This patchset introduces a migration_version attribute 
> > > > > > > > > > > > under sysfs
> > > > > > of VFIO
> > > > > > > > > > > > Mediated devices.
> > > > > > > > > > > >
> > > > > > > > > > > > This migration_version attribute is used to check 
> > > > > > > > > > > > migration
> > > > > > compatibility
> > > > > > > > > > > > between two mdev devices.
> > > > > > > > > > > >
> > > > > > > > > > > > Currently, it has two locations:
> > > > > > > > > > > > (1) under mdev_type node,
> > > > > > > > > > > > which can be used even before device creation, but 
> > > > > > > > > > > > only for
> > > > > > mdev
> > > > > > > > > > > > devices of the same mdev type.
> > > > > > > > > > > > (2) under mdev device node,
> > > > > > > > > > > > which can only be used after the mdev devices are 
> > > > > > > > > > > > created, but
> > > > > > the src
> > > > > > > > > > > > and target mdev devices are not necessarily be of 
> > > > > > > > > > > > the same
> > > > > > mdev type
> > > > > > > > > > > > (The second location is newly added in v5, in order to 
> > > > > > > > > > > > keep
> > > > > > consistent
> > > > > > > > > > > > with the migration_version node for migratable 
> > > > > > > > > > > > pass-though
> > > > > > devices)
> > > > > > > > > > >
> > > > > > > > > > > What is the relationship between those two attributes?
> > > > > > > > > > >
> > > > > > > > > > (1) is for mdev devices specifically, and (2) is provided 
> > > > > > > > > > to keep the
> > > > > > same
> > > > > > > > > > sysfs interface as with non-mdev cases. so (2) is for both 
> > > > > > > > > > mdev
> > > > > > devices and
> > > > > > > > > > non-mdev devices.
> > > > > > > > > >
> > > > > > > > > > in future, if we enable vfio-pci vendor ops, (i.e. a 
> > > > > > > > > > non-mdev device
> > > > > > > > > > is binding to vfio-pci, but is able to register migration 
> > > > > > > > > > region and do
> > > > > > > > > > migration transactions from a vendor provided affiliate 
> > > > > > > > > > driver),
> > > > > > > > > > the vendor driver would export (2) directly, under device 

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-25 Thread Yan Zhao
On Sat, Apr 25, 2020 at 03:10:49AM +0800, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.z...@intel.com) wrote:
> > On Tue, Apr 21, 2020 at 08:08:49PM +0800, Tian, Kevin wrote:
> > > > From: Yan Zhao
> > > > Sent: Tuesday, April 21, 2020 10:37 AM
> > > > 
> > > > On Tue, Apr 21, 2020 at 06:56:00AM +0800, Alex Williamson wrote:
> > > > > On Sun, 19 Apr 2020 21:24:57 -0400
> > > > > Yan Zhao  wrote:
> > > > >
> > > > > > On Fri, Apr 17, 2020 at 07:24:57PM +0800, Cornelia Huck wrote:
> > > > > > > On Fri, 17 Apr 2020 05:52:02 -0400
> > > > > > > Yan Zhao  wrote:
> > > > > > >
> > > > > > > > On Fri, Apr 17, 2020 at 04:44:50PM +0800, Cornelia Huck wrote:
> > > > > > > > > On Mon, 13 Apr 2020 01:52:01 -0400
> > > > > > > > > Yan Zhao  wrote:
> > > > > > > > >
> > > > > > > > > > This patchset introduces a migration_version attribute 
> > > > > > > > > > under sysfs
> > > > of VFIO
> > > > > > > > > > Mediated devices.
> > > > > > > > > >
> > > > > > > > > > This migration_version attribute is used to check migration
> > > > compatibility
> > > > > > > > > > between two mdev devices.
> > > > > > > > > >
> > > > > > > > > > Currently, it has two locations:
> > > > > > > > > > (1) under mdev_type node,
> > > > > > > > > > which can be used even before device creation, but only 
> > > > > > > > > > for
> > > > mdev
> > > > > > > > > > devices of the same mdev type.
> > > > > > > > > > (2) under mdev device node,
> > > > > > > > > > which can only be used after the mdev devices are 
> > > > > > > > > > created, but
> > > > the src
> > > > > > > > > > and target mdev devices are not necessarily be of the 
> > > > > > > > > > same
> > > > mdev type
> > > > > > > > > > (The second location is newly added in v5, in order to keep
> > > > consistent
> > > > > > > > > > with the migration_version node for migratable pass-though
> > > > devices)
> > > > > > > > >
> > > > > > > > > What is the relationship between those two attributes?
> > > > > > > > >
> > > > > > > > (1) is for mdev devices specifically, and (2) is provided to 
> > > > > > > > keep the
> > > > same
> > > > > > > > sysfs interface as with non-mdev cases. so (2) is for both mdev
> > > > devices and
> > > > > > > > non-mdev devices.
> > > > > > > >
> > > > > > > > in future, if we enable vfio-pci vendor ops, (i.e. a non-mdev 
> > > > > > > > device
> > > > > > > > is binding to vfio-pci, but is able to register migration 
> > > > > > > > region and do
> > > > > > > > migration transactions from a vendor provided affiliate driver),
> > > > > > > > the vendor driver would export (2) directly, under device node.
> > > > > > > > It is not able to provide (1) as there're no mdev devices 
> > > > > > > > involved.
> > > > > > >
> > > > > > > Ok, creating an alternate attribute for non-mdev devices makes 
> > > > > > > sense.
> > > > > > > However, wouldn't that rather be a case (3)? The change here only
> > > > > > > refers to mdev devices.
> > > > > > >
> > > > > > as you pointed below, (3) and (2) serve the same purpose.
> > > > > > and I think a possible usage is to migrate between a non-mdev 
> > > > > > device and
> > > > > > an mdev device. so I think it's better for them both to use (2) 
> > > > > > rather
> > > > > > than creating (3).
> > > > >
> > > > > An mdev type is meant to define a software compatible interface, so in
> > > >

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-22 Thread Yan Zhao
On Tue, Apr 21, 2020 at 08:08:49PM +0800, Tian, Kevin wrote:
> > From: Yan Zhao
> > Sent: Tuesday, April 21, 2020 10:37 AM
> > 
> > On Tue, Apr 21, 2020 at 06:56:00AM +0800, Alex Williamson wrote:
> > > On Sun, 19 Apr 2020 21:24:57 -0400
> > > Yan Zhao  wrote:
> > >
> > > > On Fri, Apr 17, 2020 at 07:24:57PM +0800, Cornelia Huck wrote:
> > > > > On Fri, 17 Apr 2020 05:52:02 -0400
> > > > > Yan Zhao  wrote:
> > > > >
> > > > > > On Fri, Apr 17, 2020 at 04:44:50PM +0800, Cornelia Huck wrote:
> > > > > > > On Mon, 13 Apr 2020 01:52:01 -0400
> > > > > > > Yan Zhao  wrote:
> > > > > > >
> > > > > > > > > This patchset introduces a migration_version attribute under sysfs of VFIO
> > > > > > > > > Mediated devices.
> > > > > > > > >
> > > > > > > > > This migration_version attribute is used to check migration compatibility
> > > > > > > > > between two mdev devices.
> > > > > > > > >
> > > > > > > > > Currently, it has two locations:
> > > > > > > > > (1) under mdev_type node,
> > > > > > > > > which can be used even before device creation, but only for mdev
> > > > > > > > > devices of the same mdev type.
> > > > > > > > > (2) under mdev device node,
> > > > > > > > > which can only be used after the mdev devices are created, but the src
> > > > > > > > > and target mdev devices are not necessarily of the same mdev type
> > > > > > > > > (The second location is newly added in v5, in order to keep consistent
> > > > > > > > > with the migration_version node for migratable pass-through devices)
> > > > > > >
> > > > > > > What is the relationship between those two attributes?
> > > > > > >
> > > > > > > (1) is for mdev devices specifically, and (2) is provided to keep the same
> > > > > > > sysfs interface as with non-mdev cases. so (2) is for both mdev devices and
> > > > > > > non-mdev devices.
> > > > > > >
> > > > > > > in future, if we enable vfio-pci vendor ops, (i.e. a non-mdev device
> > > > > > > is binding to vfio-pci, but is able to register migration region and do
> > > > > > > migration transactions from a vendor provided affiliate driver),
> > > > > > > the vendor driver would export (2) directly, under device node.
> > > > > > > It is not able to provide (1) as there're no mdev devices involved.
> > > > >
> > > > > Ok, creating an alternate attribute for non-mdev devices makes sense.
> > > > > However, wouldn't that rather be a case (3)? The change here only
> > > > > refers to mdev devices.
> > > > >
> > > > as you pointed below, (3) and (2) serve the same purpose.
> > > > and I think a possible usage is to migrate between a non-mdev device and
> > > > an mdev device. so I think it's better for them both to use (2) rather
> > > > than creating (3).
> > >
> > > An mdev type is meant to define a software compatible interface, so in
> > > the case of mdev->mdev migration, doesn't migrating to a different type
> > > fail the most basic of compatibility tests that we expect userspace to
> > > perform?  IOW, if two mdev types are migration compatible, it seems a
> > > prerequisite to that is that they provide the same software interface,
> > > which means they should be the same mdev type.
> > >
> > > > In the hybrid cases of mdev->phys or phys->mdev, how does a management
> > > > tool begin to even guess what might be compatible?  Are we expecting
> > > > libvirt to probe every device with this attribute in the system?  Is
> > > > there going to be a new class hierarchy created to enumerate all
> > > > possible migrate-able devices?
> > >
> > yes, management tool needs to guess and test migration compatible
> > between two devices. But I think it's not the problem only for
> > mdev->phys or phys->mdev. even for mdev->mdev, management tool needs
> > 

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-20 Thread Yan Zhao
On Tue, Apr 21, 2020 at 06:56:00AM +0800, Alex Williamson wrote:
> On Sun, 19 Apr 2020 21:24:57 -0400
> Yan Zhao  wrote:
> 
> > On Fri, Apr 17, 2020 at 07:24:57PM +0800, Cornelia Huck wrote:
> > > On Fri, 17 Apr 2020 05:52:02 -0400
> > > Yan Zhao  wrote:
> > >   
> > > > On Fri, Apr 17, 2020 at 04:44:50PM +0800, Cornelia Huck wrote:  
> > > > > On Mon, 13 Apr 2020 01:52:01 -0400
> > > > > Yan Zhao  wrote:
> > > > > 
> > > > > > > This patchset introduces a migration_version attribute under sysfs of VFIO
> > > > > > > Mediated devices.
> > > > > > >
> > > > > > > This migration_version attribute is used to check migration compatibility
> > > > > > > between two mdev devices.
> > > > > > >
> > > > > > > Currently, it has two locations:
> > > > > > > (1) under mdev_type node,
> > > > > > > which can be used even before device creation, but only for mdev
> > > > > > > devices of the same mdev type.
> > > > > > > (2) under mdev device node,
> > > > > > > which can only be used after the mdev devices are created, but the src
> > > > > > > and target mdev devices are not necessarily of the same mdev type
> > > > > > > (The second location is newly added in v5, in order to keep consistent
> > > > > > > with the migration_version node for migratable pass-through devices)
> > > > > >
> > > > > 
> > > > > What is the relationship between those two attributes?
> > > > > 
> > > > > (1) is for mdev devices specifically, and (2) is provided to keep the same
> > > > > sysfs interface as with non-mdev cases. so (2) is for both mdev devices and
> > > > > non-mdev devices.
> > > > 
> > > > in future, if we enable vfio-pci vendor ops, (i.e. a non-mdev device
> > > > is binding to vfio-pci, but is able to register migration region and do
> > > > migration transactions from a vendor provided affiliate driver),
> > > > the vendor driver would export (2) directly, under device node.
> > > > It is not able to provide (1) as there're no mdev devices involved.  
> > > 
> > > Ok, creating an alternate attribute for non-mdev devices makes sense.
> > > However, wouldn't that rather be a case (3)? The change here only
> > > refers to mdev devices.
> > >  
> > as you pointed below, (3) and (2) serve the same purpose. 
> > and I think a possible usage is to migrate between a non-mdev device and
> > an mdev device. so I think it's better for them both to use (2) rather
> > than creating (3).
> 
> An mdev type is meant to define a software compatible interface, so in
> the case of mdev->mdev migration, doesn't migrating to a different type
> fail the most basic of compatibility tests that we expect userspace to
> perform?  IOW, if two mdev types are migration compatible, it seems a
> prerequisite to that is that they provide the same software interface,
> which means they should be the same mdev type.
> 
> In the hybrid cases of mdev->phys or phys->mdev, how does a management
> tool begin to even guess what might be compatible?  Are we expecting
> libvirt to probe every device with this attribute in the system?  Is
> there going to be a new class hierarchy created to enumerate all
> possible migrate-able devices?
>
yes, the management tool needs to guess and test migration compatibility
between two devices. But I think that's not a problem only for
mdev->phys or phys->mdev. even for mdev->mdev, the management tool needs to
first assume that the two mdevs have the same type of parent devices
(e.g. their pciids are equal). otherwise, it's still enumerating
possibilities.

on the other hand, for two mdevs,
mdev1 from pdev1, its mdev_type is 1/2 of pdev1;
mdev2 from pdev2, its mdev_type is 1/4 of pdev2;
if pdev2 is exactly 2 times pdev1, why not allow migration between
mdev1 <-> mdev2?
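
To make the arithmetic in that example concrete, here is a small sketch.
It is purely illustrative: the abstract "capacity" units and the function
name are hypothetical, and nothing in this series defines how a vendor
driver would actually encode or compare resources.

```python
from fractions import Fraction


def same_resources(pdev1_capacity: int, share1: Fraction,
                   pdev2_capacity: int, share2: Fraction) -> bool:
    """Return True if the two mdevs carve out equal absolute resources.

    Capacities are in arbitrary, hypothetical units; shares are the
    fraction of the parent device that each mdev type represents.
    """
    return pdev1_capacity * share1 == pdev2_capacity * share2


# pdev2 is exactly 2 times pdev1; mdev1 is 1/2 of pdev1, mdev2 is 1/4 of pdev2.
print(same_resources(8, Fraction(1, 2), 16, Fraction(1, 4)))  # prints True
```

With the same shares but equally sized parents, the check fails, which is
the case where the two mdevs should not be treated as equivalent.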


> I agree that there was a gap in the previous proposal for non-mdev
> devices, but I think this bring a lot of questions that we need to
> puzzle through and libvirt will need to re-evaluate how they might
> decide to pick a migration target device.  For example, I'm sure
> libvirt would reject any policy decisions regarding picking a physical
> device versus an mdev device.  Had we previous

Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-19 Thread Yan Zhao
On Fri, Apr 17, 2020 at 07:24:57PM +0800, Cornelia Huck wrote:
> On Fri, 17 Apr 2020 05:52:02 -0400
> Yan Zhao  wrote:
> 
> > On Fri, Apr 17, 2020 at 04:44:50PM +0800, Cornelia Huck wrote:
> > > On Mon, 13 Apr 2020 01:52:01 -0400
> > > Yan Zhao  wrote:
> > >   
> > > > > This patchset introduces a migration_version attribute under sysfs of VFIO
> > > > > Mediated devices.
> > > > >
> > > > > This migration_version attribute is used to check migration compatibility
> > > > > between two mdev devices.
> > > > >
> > > > > Currently, it has two locations:
> > > > > (1) under mdev_type node,
> > > > > which can be used even before device creation, but only for mdev
> > > > > devices of the same mdev type.
> > > > > (2) under mdev device node,
> > > > > which can only be used after the mdev devices are created, but the src
> > > > > and target mdev devices are not necessarily of the same mdev type
> > > > > (The second location is newly added in v5, in order to keep consistent
> > > > > with the migration_version node for migratable pass-through devices)
> > > 
> > > What is the relationship between those two attributes?
> > >   
> > (1) is for mdev devices specifically, and (2) is provided to keep the same
> > sysfs interface as with non-mdev cases. so (2) is for both mdev devices and
> > non-mdev devices.
> > 
> > in future, if we enable vfio-pci vendor ops, (i.e. a non-mdev device
> > is binding to vfio-pci, but is able to register migration region and do
> > migration transactions from a vendor provided affiliate driver),
> > the vendor driver would export (2) directly, under device node.
> > It is not able to provide (1) as there're no mdev devices involved.
> 
> Ok, creating an alternate attribute for non-mdev devices makes sense.
> However, wouldn't that rather be a case (3)? The change here only
> refers to mdev devices.
>
as you pointed out below, (3) and (2) serve the same purpose.
and I think a possible usage is to migrate between a non-mdev device and
an mdev device. so I think it's better for them both to use (2) rather
than creating (3).
> > 
> > > Is existence (and compatibility) of (1) a pre-req for possible
> > > existence (and compatibility) of (2)?
> > >  
> > no. (2) does not rely on (1).
> 
> Hm. Non-existence of (1) seems to imply "this type does not support
> migration". If an mdev created for such a type suddenly does support
> migration, it feels a bit odd.
> 
yes. but I think if that happens, it should be reported as a bug in the
vendor driver.
should I add a line in the doc like "vendor driver should ensure that the
migration compatibility from migration_version under mdev_type should be
consistent with that from migration_version under device node" ?

> (It obviously cannot be a prereq for what I called (3) above.)
> 
> > 
> > > Does userspace need to check (1) or can it completely rely on (2), if
> > > it so chooses?
> > >  
> > I think it can completely rely on (2) if a compatibility check before
> > mdev creation is not required.
> > 
> > > If devices with a different mdev type are indeed compatible, it seems
> > > userspace can only find out after the devices have actually been
> > > created, as (1) does not apply?  
> > yes, I think so. 
> 
> How useful would it be for userspace to even look at (1) in that case?
> It only knows if things have a chance of working if it actually goes
> ahead and creates devices.
>
hmm, is it useful for userspace to test the migration_version under mdev
type before it knows what mdev device to generate ?
like when the userspace wants to migrate an mdev device in src vm,
but it has not created target vm and the target mdev device.
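
For that pre-creation case, a check against the mdev_type nodes could look
roughly like the sketch below. It assumes the attribute behaves as proposed
in this series (read on the source, write on the target, failure meaning
"not compatible") and that the mdev type name is the base name of the type
directory; the function name and the directory arguments are hypothetical.

```python
import os


def types_compatible_before_creation(src_type_dir: str, dst_type_dir: str) -> bool:
    """Check compatibility using only mdev_type nodes (no device created yet)."""
    # Prerequisite in the proposal: both sides must be the same mdev type.
    # The type name is assumed to be the directory's base name.
    src_type = os.path.basename(os.path.normpath(src_type_dir))
    dst_type = os.path.basename(os.path.normpath(dst_type_dir))
    if src_type != dst_type:
        return False
    try:
        # Read the opaque version string from the source type node.
        with open(os.path.join(src_type_dir, "migration_version"), "rb") as f:
            version = f.read()
        # Write it unmodified to the target type node; a rejected write
        # (or a missing attribute) is treated as "not compatible".
        fd = os.open(os.path.join(dst_type_dir, "migration_version"), os.O_WRONLY)
        try:
            os.write(fd, version)
        finally:
            os.close(fd)
    except OSError:
        return False
    return True
```

This only tells userspace whether creating a target mdev of that type has a
chance of working; it says nothing about devices of differing mdev types.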

> > 
> > > One of my worries is that the existence of an attribute with the same
> > > name in two similar locations might lead to confusion. But maybe it
> > > isn't a problem.
> > >  
> > Yes, I have the same feeling. but as (2) is for sysfs interface
> > consistency, to make it transparent to userspace tools like libvirt,
> > I guess the same name is necessary?
> 
> What do we actually need here, I wonder? (1) and (2) seem to serve
> slightly different purposes, while (2) and what I called (3) have the
> same purpose. Is it important to userspace that (1) and (2) have the
> same name?
so change (1) to migration_type_version and (2) to
migration_instance_version?
But as they are under different locations, could that location imply
enough information?


Thanks
Yan





Re: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-17 Thread Yan Zhao
On Fri, Apr 17, 2020 at 04:44:50PM +0800, Cornelia Huck wrote:
> On Mon, 13 Apr 2020 01:52:01 -0400
> Yan Zhao  wrote:
> 
> > This patchset introduces a migration_version attribute under sysfs of VFIO
> > Mediated devices.
> > 
> > This migration_version attribute is used to check migration compatibility
> > between two mdev devices.
> > 
> > Currently, it has two locations:
> > (1) under mdev_type node,
> > which can be used even before device creation, but only for mdev
> > devices of the same mdev type.
> > (2) under mdev device node,
> > which can only be used after the mdev devices are created, but the src
> > and target mdev devices are not necessarily of the same mdev type
> > (The second location is newly added in v5, in order to keep consistent
> > with the migration_version node for migratable pass-through devices)
> 
> What is the relationship between those two attributes?
> 
(1) is for mdev devices specifically, and (2) is provided to keep the same
sysfs interface as with non-mdev cases. so (2) is for both mdev devices and
non-mdev devices.

in future, if we enable vfio-pci vendor ops (i.e. a non-mdev device
is bound to vfio-pci, but is able to register a migration region and do
migration transactions through a vendor-provided affiliate driver),
the vendor driver would export (2) directly, under the device node.
It is not able to provide (1) as there are no mdev devices involved.

> Is existence (and compatibility) of (1) a pre-req for possible
> existence (and compatibility) of (2)?
>
no. (2) does not rely on (1).

> Does userspace need to check (1) or can it completely rely on (2), if
> it so chooses?
>
I think it can completely rely on (2) if a compatibility check before
mdev creation is not required.
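
As a rough illustration of relying on (2) alone, the userspace side could
look like the sketch below. It assumes the migration_version attribute
behaves as proposed in this series (readable on the source, writable on the
target, with any failed read or write meaning "not compatible"); the
function name and the attribute paths passed to it are hypothetical.

```python
import os


def is_migration_compatible(src_attr: str, dst_attr: str) -> bool:
    """Return True if the target accepts the source's migration_version.

    src_attr and dst_attr are paths to the migration_version attribute
    under the source and target device nodes (supplied by the caller).
    """
    try:
        # Read the opaque version string from the source device.
        with open(src_attr, "rb") as f:
            version = f.read()
        # Write it unmodified to the target; in the proposal the vendor
        # driver fails the write if the two devices are incompatible.
        fd = os.open(dst_attr, os.O_WRONLY)
        try:
            os.write(fd, version)
        finally:
            os.close(fd)
    except OSError:
        # Any read or write failure is treated as "not compatible".
        return False
    return True
```

The string is never parsed here, which matches the intent that its format
stays private to the vendor driver.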

> If devices with a different mdev type are indeed compatible, it seems
> userspace can only find out after the devices have actually been
> created, as (1) does not apply?
yes, I think so. 

> One of my worries is that the existence of an attribute with the same
> name in two similar locations might lead to confusion. But maybe it
> isn't a problem.
>
Yes, I have the same feeling. but as (2) is for sysfs interface
consistency, to make it transparent to userspace tools like libvirt,
I guess the same name is necessary?

Thanks
Yan
> > 
> > Patch 1 defines migration_version attribute for the first location in
> > Documentation/driver-api/vfio-mediated-device.rst
> > 
> > Patch 2 uses GVT as an example for patch 1 to show how to expose
> > migration_version attribute and check migration compatibility in vendor
> > driver.
> > 
> > Patch 3 defines migration_version attribute for the second location in
> > Documentation/driver-api/vfio-mediated-device.rst
> > 
> > Patch 4 uses GVT as an example for patch 3 to show how to expose
> > migration_version attribute and check migration compatibility in vendor
> > driver.
> > 
> > (The previous "Reviewed-by" and "Acked-by" for patch 1 and patch 2 are
> > kept in v5, as there are only small changes to commit messages of the two
> > patches.)
> > 
> > v5:
> > added patch 3 and 4 for mdev device part of migration_version attribute.
> > 
> > v4:
> > 1. fixed indentation/spell errors, reworded several error messages
> > 2. added a missing memory free for error handling in patch 2
> > 
> > v3:
> > 1. renamed version to migration_version
> > 2. let errno to be freely defined by vendor driver
> > 3. let checking mdev_type be prerequisite of migration compatibility check
> > 4. reworded most part of patch 1
> > 5. print detailed error log in patch 2 and generate migration_version
> > string at init time
> > 
> > v2:
> > 1. renamed patch 1
> > 2. made definition of device version string completely private to vendor
> > driver
> > 3. reverted changes to sample mdev drivers
> > 4. described intent and usage of version attribute more clearly.
> > 
> > 
> > Yan Zhao (4):
> >   vfio/mdev: add migration_version attribute for mdev (under mdev_type
> > node)
> >   drm/i915/gvt: export migration_version to mdev sysfs (under mdev_type
> > node)
> >   vfio/mdev: add migration_version attribute for mdev (under mdev device
> > node)
> >   drm/i915/gvt: export migration_version to mdev sysfs (under mdev
> > device node)
> > 
> >  .../driver-api/vfio-mediated-device.rst       | 183 ++
> >  drivers/gpu/drm/i915/gvt/Makefile             |   2 +-
> >  drivers/gpu/drm/i915/gvt/gvt.c                |  39
> >  drivers/gpu/drm/i915/gvt/gvt.h                |   7 +
> >  drivers/gpu/drm/i915/gvt/kvmgt.c              |  55 ++
> >  drivers/gpu/drm/i915/gvt/migration_version.c  | 170
> >  drivers/gpu/drm/i915/gvt/vgpu.c               |  13 +-
> >  7 files changed, 466 insertions(+), 3 deletions(-)
> >  create mode 100644 drivers/gpu/drm/i915/gvt/migration_version.c
> > 
> 




Re: [PATCH v5 3/4] vfio/mdev: add migration_version attribute for mdev (under mdev device node)

2020-04-15 Thread Yan Zhao
On Wed, Apr 15, 2020 at 03:42:58PM +0800, Erik Skultety wrote:
> On Mon, Apr 13, 2020 at 01:55:04AM -0400, Yan Zhao wrote:
> > migration_version attribute is used to check migration compatibility
> > between two mdev devices of the same mdev type.
> > The key is that it's rw and its data is opaque to userspace.
> >
> > Userspace reads migration_version of mdev device at source side and
> > writes the value to migration_version attribute of mdev device at target
> > side. It judges migration compatibility according to whether the read
> > and write operations succeed or fail.
> >
> > Currently, it is able to read/write migration_version attribute under two
> > places:
> >
> > (1) under mdev_type node
> > userspace is able to know whether two mdev devices are compatible before
> > a mdev device is created.
> >
> > userspace also needs to check whether the two mdev devices are of the same
> > mdev type before checking the migration_version attribute. It also needs
> > to check device creation parameters if aggregation is supported in future.
> >
> > (2) under mdev device node
> > userspace is able to know whether two mdev devices are compatible after
> > they are all created. But it does not need to check mdev type and device
> > creation parameter for aggregation as device vendor driver would have
> > incorporated those information into the migration_version attribute.
> >
> >              __    userspace
> >               /\              \
> >              /                 \write
> >             / read              \
> >    ________/__________       ___\|/_____________
> >   | migration_version |     | migration_version |-->check migration
> >   ---------------------     ---------------------   compatibility
> >      mdev device A               mdev device B
> >
> > This patch is for mdev documentation about the second place (under
> > mdev device node)
> >
> > Cc: Alex Williamson 
> > Cc: Erik Skultety 
> > Cc: "Dr. David Alan Gilbert" 
> > Cc: Cornelia Huck 
> > Cc: "Tian, Kevin" 
> > Cc: Zhenyu Wang 
> > Cc: "Wang, Zhi A" 
> > Cc: Neo Jia 
> > Cc: Kirti Wankhede 
> > Cc: Daniel P. Berrangé 
> > Cc: Christophe de Dinechin 
> >
> > Signed-off-by: Yan Zhao 
> > ---
> >  .../driver-api/vfio-mediated-device.rst   | 70 +++
> >  1 file changed, 70 insertions(+)
> >
> > diff --git a/Documentation/driver-api/vfio-mediated-device.rst 
> > b/Documentation/driver-api/vfio-mediated-device.rst
> > index 2d1f3c0f3c8f..efbadfd51b7e 100644
> > --- a/Documentation/driver-api/vfio-mediated-device.rst
> > +++ b/Documentation/driver-api/vfio-mediated-device.rst
> > @@ -383,6 +383,7 @@ Directories and Files Under the sysfs for Each mdev 
> > Device
> >   |--- remove
> >   |--- mdev_type {link to its type}
> >   |--- vendor-specific-attributes [optional]
> > + |--- migration_version [optional]
> >
> >  * remove (write only)
> >
> > @@ -394,6 +395,75 @@ Example::
> >
> > # echo 1 > /sys/bus/mdev/devices/$mdev_UUID/remove
> >
> > +* migration_version (rw, optional)
> 
> Hmm, ^this is not consistent with how patch 1/4 reports this information, but
> looking at the existing docs we're not doing very well in terms of consistency
> there either.
> 
> I suggest we go with "(read-write)" in both patch 1/4 and here and then start
> the paragraph with "This is an optional attribute."
>
ok. got it.

> > +  It is used to check migration compatibility between two mdev devices.
> > +  Absence of this attribute means the mdev device does not support 
> > migration.
> > +
> > +  This attribute provides a way to check migration compatibility between 
> > two
> > +  mdev devices from userspace after device created. The intended usage is
> 
> after the target device has been created.
> 
> side note: maybe add something like "(see the migration_version attribute of
> the device node if the target device already exists)" in the same section in
> patch 1/4.

ok. good idea.
> 
> > +  for userspace to read the migration_version attribute from one mdev 
> > device and
> > +  then writing that value to the migration_version attribute of the other 
> > mdev
> > +  device. The second mdev device indicates compatibility via the return 
> > code of
> > +  the write operation. This makes compatibility between mdev devices 
> > completely
> > +  vendor-defined a

Re: [PATCH v5 1/4] vfio/mdev: add migration_version attribute for mdev (under mdev_type node)

2020-04-15 Thread Yan Zhao
On Wed, Apr 15, 2020 at 03:28:51PM +0800, Erik Skultety wrote:
> On Mon, Apr 13, 2020 at 01:54:03AM -0400, Yan Zhao wrote:
> > migration_version attribute is used to check migration compatibility
> > between two mdev devices of the same mdev type.
> > The key is that it's rw and its data is opaque to userspace.
> >
> > Userspace reads migration_version of mdev device at source side and
> > writes the value to migration_version attribute of mdev device at target
> > side. It judges migration compatibility according to whether the read
> > and write operations succeed or fail.
> >
> > Currently, it is able to read/write migration_version attribute under two
> > places:
> >
> > (1) under mdev_type node
> > userspace is able to know whether two mdev devices are compatible before
> > a mdev device is created.
> >
> > userspace also needs to check whether the two mdev devices are of the same
> > mdev type before checking the migration_version attribute. It also needs
> > to check device creation parameters if aggregation is supported in future.
> >
> > (2) under mdev device node
> > userspace is able to know whether two mdev devices are compatible after
> > they are all created. But it does not need to check mdev type and device
> > creation parameter for aggregation as device vendor driver would have
> > incorporated those information into the migration_version attribute.
> >
> >              __    userspace
> >               /\              \
> >              /                 \write
> >             / read              \
> >    ________/__________       ___\|/_____________
> >   | migration_version |     | migration_version |-->check migration
> >   ---------------------     ---------------------   compatibility
> >      mdev device A               mdev device B
> >
> > This patch is for mdev documentation about the first place (under
> > mdev_type node)
> >
> > Cc: Alex Williamson 
> > Cc: Erik Skultety 
> > Cc: "Dr. David Alan Gilbert" 
> > Cc: Cornelia Huck 
> > Cc: "Tian, Kevin" 
> > Cc: Zhenyu Wang 
> > Cc: "Wang, Zhi A" 
> > Cc: Neo Jia 
> > Cc: Kirti Wankhede 
> > Cc: Daniel P. Berrangé 
> > Cc: Christophe de Dinechin 
> >
> > Reviewed-by: Cornelia Huck 
> > Signed-off-by: Yan Zhao 
> >
> > ---
> > v5:
> > updated commit message a little to indicate this patch is for
> > migration_version attribute under mdev_type node
> >
> > v4:
> > fixed a typo. (Cornelia Huck)
> >
> > v3:
> > 1. renamed version to migration_version
> > (Christophe de Dinechin, Cornelia Huck, Alex Williamson)
> > 2. let errno be freely defined by the vendor driver
> > (Alex Williamson, Erik Skultety, Cornelia Huck, Dr. David Alan Gilbert)
> > 3. let checking mdev_type be prerequisite of migration compatibility
> > check. (Alex Williamson)
> > 4. reworded example usage section.
> > (most of this section came from Alex Williamson)
> > 5. reworded attribute intention section (Cornelia Huck)
> >
> > v2:
> > 1. added detailed intent and usage
> > 2. made definition of version string completely private to vendor driver
> >(Alex Williamson)
> > 3. abandoned changes to sample mdev drivers (Alex Williamson)
> > 4. mandatory --> optional (Cornelia Huck)
> > 5. added description for errno (Cornelia Huck)
> > ---
> >  .../driver-api/vfio-mediated-device.rst   | 113 ++
> >  1 file changed, 113 insertions(+)
> >
> > diff --git a/Documentation/driver-api/vfio-mediated-device.rst 
> > b/Documentation/driver-api/vfio-mediated-device.rst
> > index 25eb7d5b834b..2d1f3c0f3c8f 100644
> > --- a/Documentation/driver-api/vfio-mediated-device.rst
> > +++ b/Documentation/driver-api/vfio-mediated-device.rst
> > @@ -202,6 +202,7 @@ Directories and files under the sysfs for Each Physical 
> > Device
> >| |   |--- available_instances
> >| |   |--- device_api
> >| |   |--- description
> > +  | |   |--- migration_version
> >| |   |--- [devices]
> >| |--- []
> >| |   |--- create
> > @@ -209,6 +210,7 @@ Directories and files under the sysfs for Each Physical 
> > Device
> >| |   |--- available_instances
> >| |   |--- device_api
> >| |   |--- description
> > +  | |   |--- migration_version
> >| |   |--- [devices]
> >| |--- []
> >|  |--- create
> > @@ -216,6 +218,7 @@ Dire

[PATCH v5 3/4] vfio/mdev: add migration_version attribute for mdev (under mdev device node)

2020-04-13 Thread Yan Zhao
migration_version attribute is used to check migration compatibility
between two mdev devices of the same mdev type.
The key is that it's rw and its data is opaque to userspace.

Userspace reads migration_version of mdev device at source side and
writes the value to migration_version attribute of mdev device at target
side. It judges migration compatibility according to whether the read
and write operations succeed or fail.

Currently, it is able to read/write migration_version attribute under two
places:

(1) under mdev_type node
userspace is able to know whether two mdev devices are compatible before
a mdev device is created.

userspace also needs to check whether the two mdev devices are of the same
mdev type before checking the migration_version attribute. It also needs
to check device creation parameters if aggregation is supported in future.

(2) under mdev device node
userspace is able to know whether two mdev devices are compatible after
they are all created. But it does not need to check mdev type and device
creation parameter for aggregation as device vendor driver would have
incorporated those information into the migration_version attribute.

             __    userspace
              /\              \
             /                 \write
            / read              \
   ________/__________       ___\|/_____________
  | migration_version |     | migration_version |-->check migration
  ---------------------     ---------------------   compatibility
     mdev device A               mdev device B

This patch is for mdev documentation about the second place (under
mdev device node)

Cc: Alex Williamson 
Cc: Erik Skultety 
Cc: "Dr. David Alan Gilbert" 
Cc: Cornelia Huck 
Cc: "Tian, Kevin" 
Cc: Zhenyu Wang 
Cc: "Wang, Zhi A" 
Cc: Neo Jia 
Cc: Kirti Wankhede 
Cc: Daniel P. Berrangé 
Cc: Christophe de Dinechin 

Signed-off-by: Yan Zhao 
---
 .../driver-api/vfio-mediated-device.rst   | 70 +++
 1 file changed, 70 insertions(+)

diff --git a/Documentation/driver-api/vfio-mediated-device.rst 
b/Documentation/driver-api/vfio-mediated-device.rst
index 2d1f3c0f3c8f..efbadfd51b7e 100644
--- a/Documentation/driver-api/vfio-mediated-device.rst
+++ b/Documentation/driver-api/vfio-mediated-device.rst
@@ -383,6 +383,7 @@ Directories and Files Under the sysfs for Each mdev Device
  |--- remove
  |--- mdev_type {link to its type}
  |--- vendor-specific-attributes [optional]
+ |--- migration_version [optional]
 
 * remove (write only)
 
@@ -394,6 +395,75 @@ Example::
 
# echo 1 > /sys/bus/mdev/devices/$mdev_UUID/remove
 
+* migration_version (rw, optional)
+  It is used to check migration compatibility between two mdev devices.
+  Absence of this attribute means the mdev device does not support migration.
+
+  This attribute provides a way to check migration compatibility between two
+  mdev devices from userspace after device created. The intended usage is
+  for userspace to read the migration_version attribute from one mdev device and
+  then write that value to the migration_version attribute of the other mdev
+  device. The second mdev device indicates compatibility via the return code of
+  the write operation. This makes compatibility between mdev devices completely
+  vendor-defined and opaque to userspace. Userspace should do nothing more
+  than use the migration_version attribute to confirm source to target
+  compatibility.
+
+  Reading/Writing Attribute Data:
+  read(2) will fail if a mdev device does not support migration and otherwise
+succeed and return migration_version string of the mdev device.
+
+This migration_version string is vendor defined and opaque to the
+userspace. Vendor is free to include whatever they feel is relevant.
+e.g. -.
+
+Restrictions on this migration_version string:
+1. It should only contain ascii characters
+2. MAX Length is PATH_MAX (4096)
+
+  write(2) expects migration_version string of source mdev device, and will
+ succeed if it is determined to be compatible and otherwise fail with
+ vendor specific errno.
+
+  Errno:
+  -An errno on read(2) indicates the mdev device does not support migration;
+  -An errno on write(2) indicates the mdev devices are incompatible or the
+   target doesn't support migration.
+  The vendor driver is free to define specific errno values and is encouraged
+  to print a detailed error in syslog for diagnostic purposes.
+
+  Userspace should treat ANY of below conditions as two mdev devices not
+  compatible:
+  (1) any one of the two mdev devices does not have a migration_version
+  attribute
+  (2) error when reading from migration_version attribute of one mdev device
+  (3) error when writing migration_version string of one mdev device to
+  migration_version attribute of the other mdev device
+
+  Userspace should regard two mdev devices compatible when ALL of below
+  con

[PATCH v5 4/4] drm/i915/gvt: export migration_version to mdev sysfs (under mdev device node)

2020-04-13 Thread Yan Zhao
The mdev device part of the migration_version attribute for Intel vGPU is rw.
It is located at
/sys/bus/pci/devices/\:00\:02.0/$mdev_UUID/migration_version,
or /sys/bus/mdev/devices/$mdev_UUID/migration_version

It's used to check migration compatibility for two vGPUs.
migration_version string is defined by vendor driver and opaque to
userspace.

For Intel vGPU of gen8 and gen9, the format of migration_version string
is:
  ---.

For future software versions, e.g. when vGPUs have aggregations, it may
also include aggregation count into migration_version string of a vGPU.

For future platforms, the format of migration_version string is to be
expanded to include more meta data to identify Intel vGPUs for live
migration compatibility check.

For old platforms, and for GVT not supporting vGPU live migration
feature, -ENODEV is returned on read(2)/write(2) of migration_version
attribute.
For vGPUs running an old GVT that does not expose the migration_version
attribute, live migration is regarded as not supported.
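
The failure cases described above (attribute absent, error on read(2),
error on write(2)) could be told apart in userspace roughly as follows.
This is only a sketch under stated assumptions: the enum, the function
name mig_check_devices() and the 4096-byte buffer are made up for
illustration and are not part of this series. All three failure results
mean "not compatible"; the distinction only helps with diagnostics.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical result codes; anything other than MIG_COMPATIBLE
 * must be treated as "not compatible". */
enum mig_check {
	MIG_COMPATIBLE = 0,
	MIG_NO_ATTRIBUTE,	/* attribute missing on either side */
	MIG_READ_FAILED,	/* source does not support migration */
	MIG_WRITE_FAILED,	/* target rejected the source's string */
};

enum mig_check mig_check_devices(const char *src_attr, const char *dst_attr)
{
	char buf[4096];		/* assumed upper bound on the string length */
	ssize_t len;
	int fd;

	assert(src_attr && dst_attr);

	/* An old driver does not expose the attribute at all. */
	if (access(src_attr, F_OK) != 0 || access(dst_attr, F_OK) != 0)
		return MIG_NO_ATTRIBUTE;

	fd = open(src_attr, O_RDONLY);
	if (fd < 0)
		return MIG_READ_FAILED;
	len = read(fd, buf, sizeof(buf));
	close(fd);
	if (len <= 0)
		return MIG_READ_FAILED;

	fd = open(dst_attr, O_WRONLY);
	if (fd < 0)
		return MIG_WRITE_FAILED;
	if (write(fd, buf, (size_t)len) != len) {
		/* errno is vendor defined; report it, do not interpret it */
		fprintf(stderr, "migration_version write failed: %s\n",
			strerror(errno));
		close(fd);
		return MIG_WRITE_FAILED;
	}
	close(fd);
	return MIG_COMPATIBLE;
}
```

The string is passed through untouched; userspace never parses it.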

Cc: Alex Williamson 
Cc: Erik Skultety 
Cc: "Dr. David Alan Gilbert" 
Cc: Cornelia Huck 
Cc: "Tian, Kevin" 
Cc: Zhenyu Wang 
Cc: "Wang, Zhi A" 
Cc: Neo Jia 
Cc: Kirti Wankhede 

Signed-off-by: Yan Zhao 
---
 drivers/gpu/drm/i915/gvt/gvt.h   |  2 ++
 drivers/gpu/drm/i915/gvt/kvmgt.c | 55 
 2 files changed, 57 insertions(+)

diff --git a/drivers/gpu/drm/i915/gvt/gvt.h b/drivers/gpu/drm/i915/gvt/gvt.h
index b26e42596565..664efc83f82e 100644
--- a/drivers/gpu/drm/i915/gvt/gvt.h
+++ b/drivers/gpu/drm/i915/gvt/gvt.h
@@ -205,6 +205,8 @@ struct intel_vgpu {
struct idr object_idr;
 
u32 scan_nonprivbb;
+
+   char *migration_version;
 };
 
 static inline void *intel_vgpu_vdev(struct intel_vgpu *vgpu)
diff --git a/drivers/gpu/drm/i915/gvt/kvmgt.c b/drivers/gpu/drm/i915/gvt/kvmgt.c
index 2f2d4c40f966..4903599cb0ef 100644
--- a/drivers/gpu/drm/i915/gvt/kvmgt.c
+++ b/drivers/gpu/drm/i915/gvt/kvmgt.c
@@ -728,8 +728,13 @@ static int intel_vgpu_create(struct kobject *kobj, struct mdev_device *mdev)
kvmgt_vdev(vgpu)->mdev = mdev;
mdev_set_drvdata(mdev, vgpu);
 
+   vgpu->migration_version =
+   intel_gvt_get_vfio_migration_version(gvt, type->name);
+
gvt_dbg_core("intel_vgpu_create succeeded for mdev: %s\n",
 dev_name(mdev_dev(mdev)));
+
+
ret = 0;
 
 out:
@@ -744,6 +749,7 @@ static int intel_vgpu_remove(struct mdev_device *mdev)
return -EBUSY;
 
intel_gvt_ops->vgpu_destroy(vgpu);
+   kfree(vgpu->migration_version);
return 0;
 }
 
@@ -1964,8 +1970,57 @@ static const struct attribute_group intel_vgpu_group = {
.attrs = intel_vgpu_attrs,
 };
 
+static ssize_t migration_version_show(struct device *dev,
+ struct device_attribute *attr, char *buf)
+{
+   struct mdev_device *mdev = mdev_from_dev(dev);
+   struct intel_vgpu *vgpu = mdev_get_drvdata(mdev);
+
+   if (!vgpu->migration_version) {
+   gvt_vgpu_err("Migration not supported on this vgpu. Please search previous detailed log\n");
+   return -ENODEV;
+   }
+
+   return snprintf(buf, strlen(vgpu->migration_version) + 2,
+   "%s\n", vgpu->migration_version);
+
+}
+
+static ssize_t migration_version_store(struct device *dev,
+  struct device_attribute *attr,
+  const char *buf, size_t count)
+{
+   struct mdev_device *mdev = mdev_from_dev(dev);
+   struct intel_vgpu *vgpu = mdev_get_drvdata(mdev);
+   struct intel_gvt *gvt = vgpu->gvt;
+   int ret = 0;
+
+   if (!vgpu->migration_version) {
+   gvt_vgpu_err("Migration not supported on this vgpu. Please search previous detailed log\n");
+   return -ENODEV;
+   }
+
+   ret = intel_gvt_check_vfio_migration_version(gvt,
+   vgpu->migration_version, buf);
+   return (ret < 0 ? ret : count);
+}
+
+static DEVICE_ATTR_RW(migration_version);
+
+static struct attribute *intel_vgpu_migration_attrs[] = {
+   &dev_attr_migration_version.attr,
+   NULL,
+};
+/* this group has no name, so will be displayed
+ * immediately under sysfs node of the mdev device
+ */
+static const struct attribute_group intel_vgpu_group_empty_name = {
+   .attrs = intel_vgpu_migration_attrs,
+};
+
 static const struct attribute_group *intel_vgpu_groups[] = {
&intel_vgpu_group,
+   &intel_vgpu_group_empty_name,
NULL,
 };
 
-- 
2.17.1




[PATCH v5 2/4] drm/i915/gvt: export migration_version to mdev sysfs (under mdev_type node)

2020-04-13 Thread Yan Zhao
This patch implements the mdev_type part of migration_version attribute
for Intel's vGPU mdev devices.

migration_version attribute under mdev_type node is rw.
It is located at
/sys/class/mdev_bus/:00:02.0/mdev_supported_types/$MDEV_TYPE/
or
/sys/devices/pci:00/:00:02.0/mdev_supported_types/$MDEV_TYPE/

It's used to check migration compatibility for two mdev devices of the
same mdev type.
migration_version string is defined by vendor driver and opaque to
userspace.

For Intel vGPU of gen8 and gen9, the format of migration_version string
is:
  ---.

For future platforms, the format of migration_version string is to be
expanded to include more meta data to identify Intel vGPUs for live
migration compatibility check.

For old platforms, and for GVT not supporting vGPU live migration
feature, -ENODEV is returned on read(2)/write(2) of migration_version
attribute.
For vGPUs running an old GVT that does not expose the migration_version
attribute, live migration is regarded as not supported.

Cc: Alex Williamson 
Cc: Erik Skultety 
Cc: "Dr. David Alan Gilbert" 
Cc: Cornelia Huck 
Cc: "Tian, Kevin" 
Cc: Zhenyu Wang 
Cc: "Wang, Zhi A" 
Cc: Neo Jia 
Cc: Kirti Wankhede 

Acked-by: Cornelia Huck 
Acked-by: Zhenyu Wang 
Signed-off-by: Yan Zhao 

---
v5:
updated commit message to indicate this patch introduces migration_version
attributes under mdev_type sysfs directory

v4:
1. fixed Indentation/spell issues and reworded several error messages
(Cornelia Huck)
2. added kfree(version) in snprintf failure case (Zhenyu Wang)

v3:
1. renamed version to migration_version
(Christophe de Dinechin, Cornelia Huck, Alex Williamson)
2. instead of generating migration version strings each time, storing
them in vgpu types generated during initialization.
(Zhenyu Wang, Cornelia Huck)
3. replaced multiple snprintf to one big snprintf in
intel_gvt_get_vfio_migration_version()
(Dr. David Alan Gilbert)
4. printed detailed error log
(Alex Williamson, Erik Skultety, Cornelia Huck, Dr. David Alan Gilbert)
5. incorporated  into migration_version string
(Alex Williamson)
6. do not use ifndef macro to switch off migration_version attribute
(Zhenyu Wang)

v2:
1. removed 32 common part of version string
(Alex Williamson)
2. do not register version attribute for GVT not supporting live
migration.(Cornelia Huck)
3. for platforms out of gen8, gen9, return -EINVAL --> -ENODEV for
incompatible. (Cornelia Huck)
---
 drivers/gpu/drm/i915/gvt/Makefile|   2 +-
 drivers/gpu/drm/i915/gvt/gvt.c   |  39 +
 drivers/gpu/drm/i915/gvt/gvt.h   |   5 +
 drivers/gpu/drm/i915/gvt/migration_version.c | 170 +++
 drivers/gpu/drm/i915/gvt/vgpu.c  |  13 +-
 5 files changed, 226 insertions(+), 3 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gvt/migration_version.c

diff --git a/drivers/gpu/drm/i915/gvt/Makefile 
b/drivers/gpu/drm/i915/gvt/Makefile
index 9c5bc39a2095..11c6aba0bf0a 100644
--- a/drivers/gpu/drm/i915/gvt/Makefile
+++ b/drivers/gpu/drm/i915/gvt/Makefile
@@ -3,7 +3,7 @@ GVT_DIR := gvt
 GVT_SOURCE := gvt.o aperture_gm.o handlers.o vgpu.o trace_points.o firmware.o \
interrupt.o gtt.o cfg_space.o opregion.o mmio.o display.o edid.o \
execlist.o scheduler.o sched_policy.o mmio_context.o cmd_parser.o debugfs.o \
-   fb_decoder.o dmabuf.o page_track.o migrate.o
+   fb_decoder.o dmabuf.o page_track.o migrate.o migration_version.o
 
 ccflags-y  += -I $(srctree)/$(src) -I $(srctree)/$(src)/$(GVT_DIR)/
 i915-y += $(addprefix $(GVT_DIR)/, $(GVT_SOURCE))
diff --git a/drivers/gpu/drm/i915/gvt/gvt.c b/drivers/gpu/drm/i915/gvt/gvt.c
index d89dbc29bb96..fb464e3b2a57 100644
--- a/drivers/gpu/drm/i915/gvt/gvt.c
+++ b/drivers/gpu/drm/i915/gvt/gvt.c
@@ -106,14 +106,53 @@ static ssize_t description_show(struct kobject *kobj, 
struct device *dev,
   type->weight);
 }
 
+static ssize_t migration_version_show(struct kobject *kobj, struct device *dev,
+   char *buf)
+{
+   struct intel_vgpu_type *type;
+   void *gvt = kdev_to_i915(dev)->gvt;
+
+   type = intel_gvt_find_vgpu_type(gvt, kobject_name(kobj));
+   if (!type || !type->migration_version) {
+   gvt_err("Migration not supported on type %s. Please search previous detailed log\n",
+   kobject_name(kobj));
+   return -ENODEV;
+   }
+
+   return snprintf(buf, strlen(type->migration_version) + 2,
+   "%s\n", type->migration_version);
+}
+
+static ssize_t migration_version_store(struct kobject *kobj, struct device 
*dev,
+   const char *buf, size_t count)
+{
+   int ret = 0;
+   struct intel_vgpu_type *type;
+   void *gvt = kdev_to_i915(dev)->gvt;
+
+   type = intel_gvt_find_vgpu_type(

[PATCH v5 1/4] vfio/mdev: add migration_version attribute for mdev (under mdev_type node)

2020-04-13 Thread Yan Zhao
migration_version attribute is used to check migration compatibility
between two mdev devices of the same mdev type.
The key is that it's rw and its data is opaque to userspace.

Userspace reads migration_version of mdev device at source side and
writes the value to migration_version attribute of mdev device at target
side. It judges migration compatibility according to whether the read
and write operations succeed or fail.
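
For illustration only, the read-then-write check described above might
look like this from userspace. It is a minimal sketch, not code from this
series: the function name mdev_migration_compatible() and the
MIGRATION_VERSION_MAX macro are hypothetical, and the caller is assumed
to pass the full paths of the two migration_version attribute files.

```c
#include <assert.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Assumed buffer size, chosen to match the PATH_MAX (4096) limit
 * the documentation patch places on the string. */
#define MIGRATION_VERSION_MAX 4096

/*
 * Returns 1 if the target accepts the source's migration_version string,
 * 0 otherwise. The string is treated as opaque and is never parsed.
 */
int mdev_migration_compatible(const char *src_attr, const char *dst_attr)
{
	char buf[MIGRATION_VERSION_MAX];
	ssize_t len, written;
	int fd;

	assert(src_attr && dst_attr);

	/* Step 1: read the opaque string from the source side. */
	fd = open(src_attr, O_RDONLY);
	if (fd < 0) {
		perror("open source migration_version");
		return 0;
	}
	len = read(fd, buf, sizeof(buf));
	close(fd);
	if (len <= 0)
		return 0;

	/* Step 2: write it unchanged to the target side, in one write(2). */
	fd = open(dst_attr, O_WRONLY);
	if (fd < 0) {
		perror("open target migration_version");
		return 0;
	}
	written = write(fd, buf, (size_t)len);
	close(fd);

	/* Compatibility is judged only by whether the write succeeded. */
	return written == len;
}
```

Any failure, including a missing attribute, is reported as "not
compatible", which is what the commit message asks userspace to assume.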

Currently, it is able to read/write migration_version attribute under two
places:

(1) under mdev_type node
userspace is able to know whether two mdev devices are compatible before
a mdev device is created.

userspace also needs to check whether the two mdev devices are of the same
mdev type before checking the migration_version attribute. It also needs
to check device creation parameters if aggregation is supported in future.

(2) under mdev device node
userspace is able to know whether two mdev devices are compatible after
they are all created. But it does not need to check mdev type and device
creation parameter for aggregation as device vendor driver would have
incorporated those information into the migration_version attribute.

             __    userspace
              /\              \
             /                 \write
            / read              \
   ________/__________       ___\|/_____________
  | migration_version |     | migration_version |-->check migration
  ---------------------     ---------------------   compatibility
     mdev device A               mdev device B

This patch is for mdev documentation about the first place (under
mdev_type node)

Cc: Alex Williamson 
Cc: Erik Skultety 
Cc: "Dr. David Alan Gilbert" 
Cc: Cornelia Huck 
Cc: "Tian, Kevin" 
Cc: Zhenyu Wang 
Cc: "Wang, Zhi A" 
Cc: Neo Jia 
Cc: Kirti Wankhede 
Cc: Daniel P. Berrangé 
Cc: Christophe de Dinechin 

Reviewed-by: Cornelia Huck 
Signed-off-by: Yan Zhao 

---
v5:
updated commit message a little to indicate this patch is for
migration_version attribute under mdev_type node

v4:
fixed a typo. (Cornelia Huck)

v3:
1. renamed version to migration_version
(Christophe de Dinechin, Cornelia Huck, Alex Williamson)
2. let errno be freely defined by the vendor driver
(Alex Williamson, Erik Skultety, Cornelia Huck, Dr. David Alan Gilbert)
3. let checking mdev_type be prerequisite of migration compatibility
check. (Alex Williamson)
4. reworded example usage section.
(most of this section came from Alex Williamson)
5. reworded attribute intention section (Cornelia Huck)

v2:
1. added detailed intent and usage
2. made definition of version string completely private to vendor driver
   (Alex Williamson)
3. abandoned changes to sample mdev drivers (Alex Williamson)
4. mandatory --> optional (Cornelia Huck)
5. added description for errno (Cornelia Huck)
---
 .../driver-api/vfio-mediated-device.rst   | 113 ++
 1 file changed, 113 insertions(+)

diff --git a/Documentation/driver-api/vfio-mediated-device.rst 
b/Documentation/driver-api/vfio-mediated-device.rst
index 25eb7d5b834b..2d1f3c0f3c8f 100644
--- a/Documentation/driver-api/vfio-mediated-device.rst
+++ b/Documentation/driver-api/vfio-mediated-device.rst
@@ -202,6 +202,7 @@ Directories and files under the sysfs for Each Physical 
Device
   | |   |--- available_instances
   | |   |--- device_api
   | |   |--- description
+  | |   |--- migration_version
   | |   |--- [devices]
   | |--- []
   | |   |--- create
@@ -209,6 +210,7 @@ Directories and files under the sysfs for Each Physical 
Device
   | |   |--- available_instances
   | |   |--- device_api
   | |   |--- description
+  | |   |--- migration_version
   | |   |--- [devices]
   | |--- []
   |  |--- create
@@ -216,6 +218,7 @@ Directories and files under the sysfs for Each Physical 
Device
   |  |--- available_instances
   |  |--- device_api
   |  |--- description
+  |  |--- migration_version
   |  |--- [devices]
 
 * [mdev_supported_types]
@@ -246,6 +249,116 @@ Directories and files under the sysfs for Each Physical 
Device
   This attribute should show the number of devices of type  that can 
be
   created.
 
+* migration_version
+
+  This attribute is rw, and is optional.
+  It is used to check migration compatibility between two mdev devices of the
+  same mdev type. Absence of this attribute means the device of type 
+  does not support migration.
+  This attribute provides a way to check migration compatibility between two
+  mdev devices from userspace even before device creation. The intended usage 
is
+  for userspace to read the migration_version attribute from one mdev device 
and
+  then writing that value to the migration_version attribute of the other mdev
+  device. The second mdev device indicates compatibility via the return code of
+  the write operation. This makes compatibility between mdev devices completely
+  vendor-define

[PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-13 Thread Yan Zhao
This patchset introduces a migration_version attribute under sysfs of VFIO
Mediated devices.

This migration_version attribute is used to check migration compatibility
between two mdev devices.

Currently, it has two locations:
(1) under mdev_type node,
which can be used even before device creation, but only for mdev
devices of the same mdev type.
(2) under mdev device node,
which can only be used after the mdev devices are created, but the src
and target mdev devices need not be of the same mdev type.
(The second location is newly added in v5, in order to stay consistent
with the migration_version node for migratable pass-through devices.)
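
As a rough illustration of the two locations, a userspace tool might
build the attribute paths as below. The helper names are hypothetical,
and the caller is assumed to supply the parent device's sysfs directory,
the mdev type name and the mdev UUID; only the mdev_supported_types/ and
/sys/bus/mdev/devices/ path components come from the patches themselves.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Location (1): under the mdev_type node of a parent device.
 * Returns 0 on success, -1 on empty input or if the path does not fit. */
int mdev_type_attr_path(char *out, size_t outlen,
			const char *parent_sysfs_dir, const char *mdev_type)
{
	int n;

	assert(out && parent_sysfs_dir && mdev_type);
	if (strlen(parent_sysfs_dir) == 0 || strlen(mdev_type) == 0)
		return -1;

	n = snprintf(out, outlen, "%s/mdev_supported_types/%s/migration_version",
		     parent_sysfs_dir, mdev_type);
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/* Location (2): under the node of an already created mdev device.
 * Returns 0 on success, -1 on empty input or if the path does not fit. */
int mdev_device_attr_path(char *out, size_t outlen, const char *mdev_uuid)
{
	int n;

	assert(out && mdev_uuid);
	if (strlen(mdev_uuid) == 0)
		return -1;

	n = snprintf(out, outlen, "/sys/bus/mdev/devices/%s/migration_version",
		     mdev_uuid);
	return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}
```

With location (1) the tool still has to compare the mdev types itself
before using the attribute; with location (2) that check is left to the
vendor driver.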

Patch 1 defines migration_version attribute for the first location in
Documentation/driver-api/vfio-mediated-device.rst

Patch 2 uses GVT as an example for patch 1 to show how to expose
migration_version attribute and check migration compatibility in vendor
driver.

Patch 3 defines migration_version attribute for the second location in
Documentation/driver-api/vfio-mediated-device.rst

Patch 4 uses GVT as an example for patch 3 to show how to expose
migration_version attribute and check migration compatibility in vendor
driver.

(The previous "Reviewed-by" and "Acked-by" for patch 1 and patch 2 are
kept in v5, as there are only small changes to commit messages of the two
patches.)

v5:
added patches 3 and 4 for the mdev device part of the migration_version attribute.

v4:
1. fixed indentation/spell errors, reworded several error messages
2. added a missing memory free for error handling in patch 2

v3:
1. renamed version to migration_version
2. let errno be freely defined by the vendor driver
3. let checking mdev_type be prerequisite of migration compatibility check
4. reworded most part of patch 1
5. print detailed error log in patch 2 and generate migration_version
string at init time

v2:
1. renamed patch 1
2. made definition of device version string completely private to vendor
driver
3. reverted changes to sample mdev drivers
4. described intent and usage of version attribute more clearly.


Yan Zhao (4):
  vfio/mdev: add migration_version attribute for mdev (under mdev_type
node)
  drm/i915/gvt: export migration_version to mdev sysfs (under mdev_type
node)
  vfio/mdev: add migration_version attribute for mdev (under mdev device
node)
  drm/i915/gvt: export migration_version to mdev sysfs (under mdev
device node)

 .../driver-api/vfio-mediated-device.rst   | 183 ++
 drivers/gpu/drm/i915/gvt/Makefile |   2 +-
 drivers/gpu/drm/i915/gvt/gvt.c|  39 
 drivers/gpu/drm/i915/gvt/gvt.h|   7 +
 drivers/gpu/drm/i915/gvt/kvmgt.c  |  55 ++
 drivers/gpu/drm/i915/gvt/migration_version.c  | 170 
 drivers/gpu/drm/i915/gvt/vgpu.c   |  13 +-
 7 files changed, 466 insertions(+), 3 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gvt/migration_version.c

-- 
2.17.1




Re: [PATCH v4 0/2] introduction of migration_version attribute for VFIO live migration

2020-03-24 Thread Yan Zhao
On Tue, Mar 24, 2020 at 10:49:54PM +0800, Alex Williamson wrote:
> On Tue, 24 Mar 2020 09:23:31 +
> "Dr. David Alan Gilbert"  wrote:
> 
> > * Yan Zhao (yan.y.z...@intel.com) wrote:
> > > On Tue, Mar 24, 2020 at 05:29:59AM +0800, Alex Williamson wrote:  
> > > > On Mon, 3 Jun 2019 20:34:22 -0400
> > > > Yan Zhao  wrote:
> > > >   
> > > > > On Tue, Jun 04, 2019 at 03:29:32AM +0800, Alex Williamson wrote:  
> > > > > > On Thu, 30 May 2019 20:44:38 -0400
> > > > > > Yan Zhao  wrote:
> > > > > > 
> > > > > > > This patchset introduces a migration_version attribute under 
> > > > > > > sysfs of VFIO
> > > > > > > Mediated devices.
> > > > > > > 
> > > > > > > This migration_version attribute is used to check migration 
> > > > > > > compatibility
> > > > > > > between two mdev devices of the same mdev type.
> > > > > > > 
> > > > > > > Patch 1 defines migration_version attribute in
> > > > > > > Documentation/vfio-mediated-device.txt
> > > > > > > 
> > > > > > > Patch 2 uses GVT as an example to show how to expose 
> > > > > > > migration_version
> > > > > > > attribute and check migration compatibility in vendor driver.
> > > > > > 
> > > > > > Thanks for iterating through this, it looks like we've settled on
> > > > > > something reasonable, but now what?  This is one piece of the 
> > > > > > puzzle to
> > > > > > supporting mdev migration, but I don't think it makes sense to 
> > > > > > commit
> > > > > > this upstream on its own without also defining the remainder of how 
> > > > > > we
> > > > > > actually do migration, preferably with more than one working
> > > > > > implementation and at least prototyped, if not final, QEMU support. 
> > > > > >  I
> > > > > > hope that was the intent, and maybe it's now time to look at the 
> > > > > > next
> > > > > > piece of the puzzle.  Thanks,
> > > > > > 
> > > > > > Alex
> > > > > 
> > > > > Got it. 
> > > > > Also thank you and all for discussing and guiding all along:)
> > > > > We'll move to the next episode now.  
> > > > 
> > > > Hi Yan,
> > > > 
> > > > As we're hopefully moving towards a migration API, would it make sense
> > > > to refresh this series at the same time?  I think we're still expecting
> > > > a vendor driver implementing Kirti's migration API to also implement
> > > > this sysfs interface for compatibility verification.  Thanks,
> > > >  
> > > Hi Alex
> > > Got it!
> > > > Thanks for the reminder. Since we now have vfio-pci implementing
> > > > vendor ops to allow live migration of pass-through devices, is it
> > > > necessary to implement a similar sysfs node for those devices?
> > > > Or do you think the PCI IDs of those devices are enough for libvirt to
> > > > determine device compatibility?
> > 
> > Wasn't the problem that we'd have to know how to check for things like:
> >   a) Whether different firmware versions in the device were actually
> > compatible
> >   b) Whether minor hardware differences were compatible - e.g. some
> > hardware might let you migrate to the next version of hardware up.
> 
> Yes, minor changes in hardware or firmware that may not be represented
> in the device ID or hardware revision.  Also the version is as much for
> indicating the compatibility of the vendor defined migration protocol
> as it is for the hardware itself.  I certainly wouldn't be so bold as
> to create a protocol that is guaranteed compatible forever.  We'll need
> to expose the same sysfs attribute in some standard location for
> non-mdev devices.  I assume vfio-pci would provide the vendor ops some
> mechanism to expose these in a standard namespace of sysfs attributes
> under the device itself.  Perhaps that indicates we need to link the
> mdev type version under the mdev device as well to make this
> transparent to userspace tools like libvirt.  Thanks,
>
Got it. will do it.
Thanks!

Yan
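
For illustration, here is a minimal userspace sketch of how a management tool
might use such an attribute. It assumes (the messages above do not spell this
out) that reading migration_version on the source device yields a version
string, and that writing that string to the target device's migration_version
fails when the two are incompatible. The helper name and the paths passed to
it are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical helper: read the version string from the source attribute
 * and write it to the target attribute.  Returns 0 if the write is
 * accepted, -1 on any error (treated here as "not compatible").
 */
static int check_migration_compat(const char *src_attr, const char *dst_attr)
{
    char version[256];
    size_t len;
    FILE *f;
    int ok;

    f = fopen(src_attr, "r");
    if (!f)
        return -1;
    if (!fgets(version, sizeof(version), f)) {
        fclose(f);
        return -1;
    }
    fclose(f);

    /* strip the trailing newline, if any */
    len = strcspn(version, "\n");
    version[len] = '\0';
    if (len == 0)
        return -1;

    f = fopen(dst_attr, "w");
    if (!f)
        return -1;
    ok = fputs(version, f) >= 0;
    /* a rejected write may only be reported when the buffer is flushed */
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}
```

A caller would pass the two attribute paths of the source and target devices
and refuse to start the migration on a non-zero return.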




Re: [PATCH v4 0/2] introduction of migration_version attribute for VFIO live migration

2020-03-23 Thread Yan Zhao
On Tue, Mar 24, 2020 at 05:29:59AM +0800, Alex Williamson wrote:
> On Mon, 3 Jun 2019 20:34:22 -0400
> Yan Zhao  wrote:
> 
> > On Tue, Jun 04, 2019 at 03:29:32AM +0800, Alex Williamson wrote:
> > > On Thu, 30 May 2019 20:44:38 -0400
> > > Yan Zhao  wrote:
> > >   
> > > > > This patchset introduces a migration_version attribute under sysfs of
> > > > > VFIO Mediated devices.
> > > > > 
> > > > > This migration_version attribute is used to check migration
> > > > > compatibility between two mdev devices of the same mdev type.
> > > > > 
> > > > > Patch 1 defines migration_version attribute in
> > > > > Documentation/vfio-mediated-device.txt
> > > > > 
> > > > > Patch 2 uses GVT as an example to show how to expose migration_version
> > > > > attribute and check migration compatibility in vendor driver.
> > > 
> > > Thanks for iterating through this, it looks like we've settled on
> > > something reasonable, but now what?  This is one piece of the puzzle to
> > > supporting mdev migration, but I don't think it makes sense to commit
> > > this upstream on its own without also defining the remainder of how we
> > > actually do migration, preferably with more than one working
> > > implementation and at least prototyped, if not final, QEMU support.  I
> > > hope that was the intent, and maybe it's now time to look at the next
> > > piece of the puzzle.  Thanks,
> > > 
> > > Alex  
> > 
> > Got it. 
> > Also thank you and all for discussing and guiding all along:)
> > We'll move to the next episode now.
> 
> Hi Yan,
> 
> As we're hopefully moving towards a migration API, would it make sense
> to refresh this series at the same time?  I think we're still expecting
> a vendor driver implementing Kirti's migration API to also implement
> this sysfs interface for compatibility verification.  Thanks,
>
Hi Alex
Got it!
Thanks for the reminder. Since we now have vfio-pci implementing
vendor ops to allow live migration of pass-through devices, is it
necessary to implement a similar sysfs node for those devices?
Or do you think the PCI IDs of those devices are enough for libvirt to
determine device compatibility?

Thanks
Yan





Re: [libvirt] [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

2019-12-11 Thread Yan Zhao
On Thu, Dec 12, 2019 at 11:48:25AM +0800, Jason Wang wrote:
> 
> On 2019/12/6 8:49 PM, Yan Zhao wrote:
> > On Fri, Dec 06, 2019 at 05:40:02PM +0800, Jason Wang wrote:
> >> On 2019/12/6 4:22 PM, Yan Zhao wrote:
> >>> On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:
> >>>> On 2019/12/5 4:51 PM, Yan Zhao wrote:
> >>>>> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
> >>>>>> Hi:
> >>>>>>
> >>>>>> On 2019/12/5 11:24 AM, Yan Zhao wrote:
> >>>>>>> For SRIOV devices, VFs are passthroughed into guest directly without 
> >>>>>>> host
> >>>>>>> driver mediation. However, when VMs migrating with passthroughed VFs,
> >>>>>>> dynamic host mediation is required to  (1) get device states, (2) get
> >>>>>>> dirty pages. Since device states as well as other critical information
> >>>>>>> required for dirty page tracking for VFs are usually retrieved from 
> >>>>>>> PFs,
> >>>>>>> it is handy to provide an extension in PF driver to centralizingly 
> >>>>>>> control
> >>>>>>> VFs' migration.
> >>>>>>>
> >>>>>>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
> >>>>>>> dynamically trap VFs' bars for dirty page tracking and
> >>>>>> A silly question, what's the reason for doing this, is this a must for 
> >>>>>> dirty
> >>>>>> page tracking?
> >>>>>>
> >>>>> For performance consideration. VFs' bars should be passthroughed at
> >>>>> normal time and only enter into trap state on need.
> >>>> Right, but how does this matter for the case of dirty page tracking?
> >>>>
> >>> Take NIC as an example, to trap its VF dirty pages, software way is
> >>> required to trap every write of ring tail that resides in BAR0.
> >>
> >> Interesting, but it looks like we need:
> >> - decode the instruction
> >> - mediate all access to BAR0
> >> All of which seems a great burden for the VF driver. I wonder whether or
> >> not doing interrupt relay and tracking head is better in this case.
> >>
> > hi Jason
> >
> > not familiar with the way you mentioned. could you elaborate more?
> 
> 
> It looks to me that you want to intercept the bar that contains the 
> head. Then you can figure out the buffers submitted from driver and you 
> still need to decide a proper time to mark them as dirty.
> 
It does not need to be accurate, right? A superset of the real dirty bitmap is
enough.

> What I meant is, intercept the interrupt, then you can still
> figure out the buffers which have been modified by the device and mark
> them as dirty.
> 
> Then there's no need to trap BAR and do decoding/emulation etc.
> 
> But it will still be tricky to be correct...
>
Intercepting the interrupt is a little hard if posted interrupts are enabled.
I think what you are worried about here is the timing of marking pages dirty,
right? Upon receiving the interrupt, you regard the DMAs as finished and it is
safe to mark the pages dirty.
But with the BAR trap approach, we can at least keep those dirtied pages dirty
until the device stops. Of course we have other methods to optimize it.
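
To make the "superset is enough" point concrete, here is a minimal sketch,
assuming 4 KiB pages and a one-bit-per-page bitmap indexed by page frame
number. When a trapped ring-tail write reveals a buffer, every page the
buffer touches is marked dirty, whether or not the device has written to it
yet. The function and macro names are hypothetical and are not taken from
the patch series.

```c
#include <stddef.h>
#include <stdint.h>

#define DIRTY_PAGE_SHIFT 12   /* assumed 4 KiB pages */

/*
 * Mark every page overlapped by [addr, addr + len) as dirty.
 * Pages beyond bitmap_pages are ignored.
 */
static void mark_dirty_range(uint8_t *bitmap, size_t bitmap_pages,
                             uint64_t addr, uint64_t len)
{
    uint64_t first, last, pfn;

    if (len == 0)
        return;

    first = addr >> DIRTY_PAGE_SHIFT;
    last = (addr + len - 1) >> DIRTY_PAGE_SHIFT;

    for (pfn = first; pfn <= last && pfn < bitmap_pages; pfn++)
        bitmap[pfn / 8] |= (uint8_t)(1u << (pfn % 8));
}
```

Over-marking in this way costs extra pages to transfer but never loses a
page that the device really modified.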

> 
> >>> There's
> >>> still no IOMMU Dirty bit available.
> >>>>>>>  (3) centralizing
> >>>>>>> VF critical states retrieving and VF controls into one driver, we 
> >>>>>>> propose
> >>>>>>> to introduce mediate ops on top of current vfio-pci device driver.
> >>>>>>>
> >>>>>>>
> >>>>>>> [ASCII diagram, garbled and truncated in the archive. Recoverable
> >>>>>>> labels: vfio-pci, "register mediate ops", VF mediate driver,
> >>>>>>> PF driver, open(pdev).]

Re: [libvirt] [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

2019-12-11 Thread Yan Zhao
On Thu, Dec 12, 2019 at 11:07:42AM +0800, Alex Williamson wrote:
> On Wed, 11 Dec 2019 21:02:40 -0500
> Yan Zhao  wrote:
> 
> > On Thu, Dec 12, 2019 at 02:56:55AM +0800, Alex Williamson wrote:
> > > On Wed, 11 Dec 2019 01:25:55 -0500
> > > Yan Zhao  wrote:
> > >   
> > > > On Wed, Dec 11, 2019 at 12:38:05AM +0800, Alex Williamson wrote:  
> > > > > On Tue, 10 Dec 2019 02:44:44 -0500
> > > > > Yan Zhao  wrote:
> > > > > 
> > > > > > On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:
> > > > > > > On Mon, 9 Dec 2019 01:22:12 -0500
> > > > > > > Yan Zhao  wrote:
> > > > > > >       
> > > > > > > > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson 
> > > > > > > > wrote:  
> > > > > > > > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > 
> > > > > > > > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson 
> > > > > > > > > > wrote:
> > > > > > > > > > > On Wed,  4 Dec 2019 22:26:50 -0500
> > > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > > >   
> > > > > > > > > > > > Dynamic trap bar info region is a channel for QEMU and 
> > > > > > > > > > > > vendor driver to
> > > > > > > > > > > > communicate dynamic trap info. It is of type
> > > > > > > > > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > > > > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > > > > > > > > 
> > > > > > > > > > > > This region has two fields: dt_fd and trap.
> > > > > > > > > > > > > When QEMU detects a device region of this type, it
> > > > > > > > > > > > > will create an
> > > > > > > > > > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > > > > > > > > > When vendor driver signals this eventfd, QEMU reads
> > > > > > > > > > > > trap field of this
> > > > > > > > > > > > info region.
> > > > > > > > > > > > - If trap is true, QEMU would search the device's PCI 
> > > > > > > > > > > > BAR
> > > > > > > > > > > > regions and disable all the sparse mmaped subregions 
> > > > > > > > > > > > (if the sparse
> > > > > > > > > > > > mmaped subregion is disablable).
> > > > > > > > > > > > - If trap is false, QEMU would re-enable those 
> > > > > > > > > > > > subregions.
> > > > > > > > > > > > 
> > > > > > > > > > > > A typical usage is
> > > > > > > > > > > > 1. vendor driver first cuts its bar 0 into several 
> > > > > > > > > > > > sections, all in a
> > > > > > > > > > > > > sparse mmap array. So initially, all its bar 0 are
> > > > > > > > > > > > > passthroughed.
> > > > > > > > > > > > > 2. vendor driver specifies part of bar 0 sections to be
> > > > > > > > > > > > disablable.
> > > > > > > > > > > > 3. on migration starts, vendor driver signals dt_fd and 
> > > > > > > > > > > > set trap to true
> > > > > > > > > > > > to notify QEMU disabling the bar 0 sections of 
> > > > > > > > > > > > disablable flags on.
> > > > > > > > > > > > 4. QEMU disables those bar 0 section and hence let 
> > > > > > > > > > > > vendor driver be able
> > > > > > > > > > > > to trap access of bar 0 registers and make dirty page 
> > > > > > > > > > > > tracking possible.
> > > > > > > > > > > > 5. on migration failure, vendor driver sig

Re: [libvirt] [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

2019-12-11 Thread Yan Zhao
On Thu, Dec 12, 2019 at 02:56:55AM +0800, Alex Williamson wrote:
> On Wed, 11 Dec 2019 01:25:55 -0500
> Yan Zhao  wrote:
> 
> > On Wed, Dec 11, 2019 at 12:38:05AM +0800, Alex Williamson wrote:
> > > On Tue, 10 Dec 2019 02:44:44 -0500
> > > Yan Zhao  wrote:
> > >   
> > > > On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:  
> > > > > On Mon, 9 Dec 2019 01:22:12 -0500
> > > > > Yan Zhao  wrote:
> > > > > 
> > > > > > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:
> > > > > > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > > > > > Yan Zhao  wrote:
> > > > > > >       
> > > > > > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson 
> > > > > > > > wrote:  
> > > > > > > > > On Wed,  4 Dec 2019 22:26:50 -0500
> > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > 
> > > > > > > > > > Dynamic trap bar info region is a channel for QEMU and 
> > > > > > > > > > vendor driver to
> > > > > > > > > > communicate dynamic trap info. It is of type
> > > > > > > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > > > > > > 
> > > > > > > > > > This region has two fields: dt_fd and trap.
> > > > > > > > > > > When QEMU detects a device region of this type, it will
> > > > > > > > > > > create an
> > > > > > > > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > > > > > > > When vendor driver signals this eventfd, QEMU reads trap
> > > > > > > > > > field of this
> > > > > > > > > > info region.
> > > > > > > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > > > > > > regions and disable all the sparse mmaped subregions (if 
> > > > > > > > > > the sparse
> > > > > > > > > > mmaped subregion is disablable).
> > > > > > > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > > > > > > 
> > > > > > > > > > A typical usage is
> > > > > > > > > > 1. vendor driver first cuts its bar 0 into several 
> > > > > > > > > > sections, all in a
> > > > > > > > > > > sparse mmap array. So initially, all its bar 0 are
> > > > > > > > > > > passthroughed.
> > > > > > > > > > > 2. vendor driver specifies part of bar 0 sections to be
> > > > > > > > > > disablable.
> > > > > > > > > > 3. on migration starts, vendor driver signals dt_fd and set 
> > > > > > > > > > trap to true
> > > > > > > > > > to notify QEMU disabling the bar 0 sections of disablable 
> > > > > > > > > > flags on.
> > > > > > > > > > 4. QEMU disables those bar 0 section and hence let vendor 
> > > > > > > > > > driver be able
> > > > > > > > > > to trap access of bar 0 registers and make dirty page 
> > > > > > > > > > tracking possible.
> > > > > > > > > > 5. on migration failure, vendor driver signals dt_fd to 
> > > > > > > > > > QEMU again.
> > > > > > > > > > QEMU reads trap field of this info region which is false 
> > > > > > > > > > and QEMU
> > > > > > > > > > re-passthrough the whole bar 0 region.
> > > > > > > > > > 
> > > > > > > > > > Vendor driver specifies whether it supports 
> > > > > > > > > > dynamic-trap-bar-info region
> > > > > > > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > > > > > > vfio_pci_mediate_ops->open().
> > > > > > > > > > 
> > > > > > > > > > If vfio-pci detect

Re: [libvirt] [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

2019-12-10 Thread Yan Zhao
On Wed, Dec 11, 2019 at 12:38:05AM +0800, Alex Williamson wrote:
> On Tue, 10 Dec 2019 02:44:44 -0500
> Yan Zhao  wrote:
> 
> > On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:
> > > On Mon, 9 Dec 2019 01:22:12 -0500
> > > Yan Zhao  wrote:
> > >   
> > > > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:  
> > > > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > > > Yan Zhao  wrote:
> > > > > 
> > > > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:
> > > > > > > On Wed,  4 Dec 2019 22:26:50 -0500
> > > > > > > Yan Zhao  wrote:
> > > > > > >   
> > > > > > > > Dynamic trap bar info region is a channel for QEMU and vendor 
> > > > > > > > driver to
> > > > > > > > communicate dynamic trap info. It is of type
> > > > > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > > > > 
> > > > > > > > This region has two fields: dt_fd and trap.
> > > > > > > > > When QEMU detects a device region of this type, it will create
> > > > > > > > > an
> > > > > > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > > > > > When vendor driver signals this eventfd, QEMU reads trap field
> > > > > > > > of this
> > > > > > > > info region.
> > > > > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > > > > regions and disable all the sparse mmaped subregions (if the 
> > > > > > > > sparse
> > > > > > > > mmaped subregion is disablable).
> > > > > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > > > > 
> > > > > > > > A typical usage is
> > > > > > > > 1. vendor driver first cuts its bar 0 into several sections, 
> > > > > > > > all in a
> > > > > > > > > sparse mmap array. So initially, all its bar 0 are passthroughed.
> > > > > > > > > 2. vendor driver specifies part of bar 0 sections to be
> > > > > > > > disablable.
> > > > > > > > 3. on migration starts, vendor driver signals dt_fd and set 
> > > > > > > > trap to true
> > > > > > > > to notify QEMU disabling the bar 0 sections of disablable flags 
> > > > > > > > on.
> > > > > > > > 4. QEMU disables those bar 0 section and hence let vendor 
> > > > > > > > driver be able
> > > > > > > > to trap access of bar 0 registers and make dirty page tracking 
> > > > > > > > possible.
> > > > > > > > 5. on migration failure, vendor driver signals dt_fd to QEMU 
> > > > > > > > again.
> > > > > > > > QEMU reads trap field of this info region which is false and 
> > > > > > > > QEMU
> > > > > > > > re-passthrough the whole bar 0 region.
> > > > > > > > 
> > > > > > > > Vendor driver specifies whether it supports 
> > > > > > > > dynamic-trap-bar-info region
> > > > > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > > > > vfio_pci_mediate_ops->open().
> > > > > > > > 
> > > > > > > > If vfio-pci detects this cap, it will create a default
> > > > > > > > dynamic_trap_bar_info region on behalf of vendor driver with 
> > > > > > > > region len=0
> > > > > > > > and region->ops=null.
> > > > > > > > > Vendor driver should override this region's len, flags, rw,
> > > > > > > > mmap in its
> > > > > > > > vfio_pci_mediate_ops.  
> > > > > > > 
> > > > > > > TBH, I don't like this interface at all.  Userspace doesn't pass 
> > > > > > > data
> > > > > > > to the kernel via INFO ioctls.  We have a SET_IRQS ioctl for
> > > > > > > configuring user signaling with eventfds.  I think we only need to
>

Re: [libvirt] [RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

2019-12-10 Thread Yan Zhao
On Wed, Dec 11, 2019 at 12:58:24AM +0800, Alex Williamson wrote:
> On Mon, 9 Dec 2019 21:44:23 -0500
> Yan Zhao  wrote:
> 
> > > > > > Currently, yes, i40e has build dependency on vfio-pci.
> > > > > > It's like this, if i40e decides to support SRIOV and compiles in vf
> > > > > > related code who depends on vfio-pci, it will also have build 
> > > > > > dependency
> > > > > > on vfio-pci. isn't it natural?
> > > > > 
> > > > > No, this is not natural.  There are certainly i40e VF use cases that
> > > > > have no interest in vfio and having dependencies between the two
> > > > > modules is unacceptable.  I think you probably want to modularize the
> > > > > i40e vfio support code and then perhaps register a table in vfio-pci
> > > > > that the vfio-pci code can perform a module request when using a
> > > > > > compatible device.  Just an idea, there might be better options.  I
> > > > > will not accept a solution that requires unloading the i40e driver in
> > > > > order to unload the vfio-pci driver.  It's inconvenient with just one
> > > > > NIC driver, imagine how poorly that scales.
> > > > > 
> > > > what about this way:
> > > > mediate driver registers a module notifier and every time when
> > > > vfio_pci is loaded, register to vfio_pci its mediate ops?
> > > > (Just like in below sample code)
> > > > This way vfio-pci is free to unload and this registering only gives
> > > > vfio-pci a name of what module to request.
> > > > After that,
> > > > in vfio_pci_open(), vfio-pci requests the mediate driver. (or puts
> > > > the mediate driver when mediate driver does not support mediating the
> > > > device)
> > > > in vfio_pci_release(), vfio-pci puts the mediate driver.
> > > > 
> > > > > static void register_mediate_ops(void)
> > > > > {
> > > > >         int (*func)(struct vfio_pci_mediate_ops *ops) = NULL;
> > > > > 
> > > > >         /* resolves only if vfio_pci is currently loaded */
> > > > >         func = symbol_get(vfio_pci_register_mediate_ops);
> > > > > 
> > > > >         if (func) {
> > > > >                 func(&igd_dt_ops);
> > > > >                 symbol_put(vfio_pci_register_mediate_ops);
> > > > >         }
> > > > > }
> > > > > 
> > > > > static int igd_module_notify(struct notifier_block *self,
> > > > >                              unsigned long val, void *data)
> > > > > {
> > > > >         struct module *mod = data;
> > > > >         int ret = 0;
> > > > > 
> > > > >         switch (val) {
> > > > >         case MODULE_STATE_LIVE:
> > > > >                 /* register once vfio_pci has finished loading */
> > > > >                 if (!strcmp(mod->name, "vfio_pci"))
> > > > >                         register_mediate_ops();
> > > > >                 break;
> > > > >         case MODULE_STATE_GOING:
> > > > >                 break;
> > > > >         default:
> > > > >                 break;
> > > > >         }
> > > > >         return ret;
> > > > > }
> > > > > 
> > > > > static struct notifier_block igd_module_nb = {
> > > > >         .notifier_call = igd_module_notify,
> > > > >         .priority = 0,
> > > > > };
> > > > > 
> > > > > static int __init igd_dt_init(void)
> > > > > {
> > > > >         ...
> > > > >         register_mediate_ops();
> > > > >         register_module_notifier(&igd_module_nb);
> > > > >         ...
> > > > >         return 0;
> > > > > }
> > > 
> > > 
> > > No, this is bad.  Please look at MODULE_ALIAS() and request_module() as
> > > used in the vfio-platform for loading reset driver modules.  I think
> > > the correct approach is that vfio-pci should perform a request_module()
> > > based on the device being probed.  Having the mediation provider
> > > listening for vfio-pci and registering itself regardless of whether we
> > > intend to use it assumes that we will want to use it and assumes that
> > > the mediation provider module is already loaded.  We should be able to
> > > support demand loading of modules that may serve no other purpose than
> > > providing this mediation.  Thanks,  
> > hi Alex
> > Thanks for this message.
> > So is it good to create a separate module as mediation provider driver,
> > and alias its module name to "vfio-pci-mediate-vid-did".
> > Then when vfio-pci probes the device, it requests module of that name ?
> 
> I think this would give us an option to have the mediator as a separate
> module, but not require it.  Maybe rather than a request_module(),
> where if we follow the platform reset example we'd then expect the init
> code for the module to register into a list, we could do a
> symbol_request().  AIUI, this would give us a reference to the symbol
> if the module providing it is already loaded, and request a module
> (perhaps via an alias) if it's not already loaded.  Thanks,
> 
ok. got it!
Thank you :)

Yan
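
As a small illustration of the alias idea discussed above, the sketch below
builds a name of the form "vfio-pci-mediate-vid-did" from a PCI vendor and
device ID. It is plain userspace C so that it can be compiled and run on its
own; in the kernel the resulting string would be what a mediation provider
declares with MODULE_ALIAS() and what vfio-pci passes to request_module().
The helper name and the four-hex-digit encoding of the IDs are assumptions,
not something the thread settled on.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/*
 * Hypothetical helper: format the module alias for a given device.
 * Returns 0 on success, -1 if the buffer is too small.
 */
static int mediate_module_alias(char *buf, size_t len,
                                uint16_t vendor, uint16_t device)
{
    int n = snprintf(buf, len, "vfio-pci-mediate-%04x-%04x",
                     (unsigned int)vendor, (unsigned int)device);

    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

With a per-device alias like this, the provider module is only loaded on
demand when a matching device is probed, which is the property Alex asks for.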


--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list



Re: [libvirt] [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

2019-12-09 Thread Yan Zhao
On Tue, Dec 10, 2019 at 05:16:08AM +0800, Alex Williamson wrote:
> On Mon, 9 Dec 2019 01:22:12 -0500
> Yan Zhao  wrote:
> 
> > On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:
> > > On Fri, 6 Dec 2019 01:04:07 -0500
> > > Yan Zhao  wrote:
> > >   
> > > > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:  
> > > > > On Wed,  4 Dec 2019 22:26:50 -0500
> > > > > Yan Zhao  wrote:
> > > > > 
> > > > > > Dynamic trap bar info region is a channel for QEMU and vendor 
> > > > > > driver to
> > > > > > communicate dynamic trap info. It is of type
> > > > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > > > 
> > > > > > This region has two fields: dt_fd and trap.
> > > > > > > When QEMU detects a device region of this type, it will create an
> > > > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > > > When vendor driver signals this eventfd, QEMU reads trap field of
> > > > > > this
> > > > > > info region.
> > > > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > > > regions and disable all the sparse mmaped subregions (if the sparse
> > > > > > mmaped subregion is disablable).
> > > > > > - If trap is false, QEMU would re-enable those subregions.
> > > > > > 
> > > > > > A typical usage is
> > > > > > 1. vendor driver first cuts its bar 0 into several sections, all in 
> > > > > > a
> > > > > > > sparse mmap array. So initially, all its bar 0 are passthroughed.
> > > > > > > 2. vendor driver specifies part of bar 0 sections to be disablable.
> > > > > > 3. on migration starts, vendor driver signals dt_fd and set trap to 
> > > > > > true
> > > > > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > > > > 4. QEMU disables those bar 0 section and hence let vendor driver be 
> > > > > > able
> > > > > > to trap access of bar 0 registers and make dirty page tracking 
> > > > > > possible.
> > > > > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > > > > QEMU reads trap field of this info region which is false and QEMU
> > > > > > re-passthrough the whole bar 0 region.
> > > > > > 
> > > > > > Vendor driver specifies whether it supports dynamic-trap-bar-info 
> > > > > > region
> > > > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > > > vfio_pci_mediate_ops->open().
> > > > > > 
> > > > > > If vfio-pci detects this cap, it will create a default
> > > > > > dynamic_trap_bar_info region on behalf of vendor driver with region 
> > > > > > len=0
> > > > > > and region->ops=null.
> > > > > > > Vendor driver should override this region's len, flags, rw, mmap
> > > > > > in its
> > > > > > vfio_pci_mediate_ops.
> > > > > 
> > > > > TBH, I don't like this interface at all.  Userspace doesn't pass data
> > > > > to the kernel via INFO ioctls.  We have a SET_IRQS ioctl for
> > > > > configuring user signaling with eventfds.  I think we only need to
> > > > > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > > > > information for a region.  The user would enumerate the device IRQs 
> > > > > via
> > > > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > > > indicate which region(s) should be re-evaluated on signaling.  The 
> > > > > user
> > > > > would enable that signaling via SET_IRQS and simply re-evaluate the   
> > > > >  
> > > > ok. I'll try to switch to this way. Thanks for this suggestion.
> > > >   
> > > > > sparse mmap capability for the associated regions when signaled.
> > > > 
> > > > Do you like the "disablable" flag of sparse mmap ?
> > > > I think it's a lightweight way for user to switch mmap state of a whole 
> > > > region,
> > > > otherwise going through a complete flow of GET_

Re: [libvirt] [RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

2019-12-09 Thread Yan Zhao
> > > > Currently, yes, i40e has build dependency on vfio-pci.
> > > > It's like this, if i40e decides to support SRIOV and compiles in vf
> > > > related code who depends on vfio-pci, it will also have build dependency
> > > > on vfio-pci. isn't it natural?  
> > > 
> > > No, this is not natural.  There are certainly i40e VF use cases that
> > > have no interest in vfio and having dependencies between the two
> > > modules is unacceptable.  I think you probably want to modularize the
> > > i40e vfio support code and then perhaps register a table in vfio-pci
> > > that the vfio-pci code can perform a module request when using a
> > > > compatible device.  Just an idea, there might be better options.  I
> > > will not accept a solution that requires unloading the i40e driver in
> > > order to unload the vfio-pci driver.  It's inconvenient with just one
> > > NIC driver, imagine how poorly that scales.
> > >   
> > what about this way:
> > mediate driver registers a module notifier and every time when
> > vfio_pci is loaded, register to vfio_pci its mediate ops?
> > (Just like in below sample code)
> > This way vfio-pci is free to unload and this registering only gives
> > vfio-pci a name of what module to request.
> > After that,
> > in vfio_pci_open(), vfio-pci requests the mediate driver. (or puts
> > the mediate driver when mediate driver does not support mediating the
> > device)
> > in vfio_pci_release(), vfio-pci puts the mediate driver.
> > 
> > > static void register_mediate_ops(void)
> > > {
> > >         int (*func)(struct vfio_pci_mediate_ops *ops) = NULL;
> > > 
> > >         /* resolves only if vfio_pci is currently loaded */
> > >         func = symbol_get(vfio_pci_register_mediate_ops);
> > > 
> > >         if (func) {
> > >                 func(&igd_dt_ops);
> > >                 symbol_put(vfio_pci_register_mediate_ops);
> > >         }
> > > }
> > > 
> > > static int igd_module_notify(struct notifier_block *self,
> > >                              unsigned long val, void *data)
> > > {
> > >         struct module *mod = data;
> > >         int ret = 0;
> > > 
> > >         switch (val) {
> > >         case MODULE_STATE_LIVE:
> > >                 /* register once vfio_pci has finished loading */
> > >                 if (!strcmp(mod->name, "vfio_pci"))
> > >                         register_mediate_ops();
> > >                 break;
> > >         case MODULE_STATE_GOING:
> > >                 break;
> > >         default:
> > >                 break;
> > >         }
> > >         return ret;
> > > }
> > > 
> > > static struct notifier_block igd_module_nb = {
> > >         .notifier_call = igd_module_notify,
> > >         .priority = 0,
> > > };
> > > 
> > > static int __init igd_dt_init(void)
> > > {
> > >         ...
> > >         register_mediate_ops();
> > >         register_module_notifier(&igd_module_nb);
> > >         ...
> > >         return 0;
> > > }
> 
> 
> No, this is bad.  Please look at MODULE_ALIAS() and request_module() as
> used in the vfio-platform for loading reset driver modules.  I think
> the correct approach is that vfio-pci should perform a request_module()
> based on the device being probed.  Having the mediation provider
> listening for vfio-pci and registering itself regardless of whether we
> intend to use it assumes that we will want to use it and assumes that
> the mediation provider module is already loaded.  We should be able to
> support demand loading of modules that may serve no other purpose than
> providing this mediation.  Thanks,
hi Alex
Thanks for this message.
So is it good to create a separate module as mediation provider driver,
and alias its module name to "vfio-pci-mediate-vid-did".
Then when vfio-pci probes the device, it requests module of that name ?

Thanks
Yan





Re: [libvirt] [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

2019-12-08 Thread Yan Zhao
On Fri, Dec 06, 2019 at 11:20:38PM +0800, Alex Williamson wrote:
> On Fri, 6 Dec 2019 01:04:07 -0500
> Yan Zhao  wrote:
> 
> > On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:
> > > On Wed,  4 Dec 2019 22:26:50 -0500
> > > Yan Zhao  wrote:
> > >   
> > > > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > > > communicate dynamic trap info. It is of type
> > > > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > > > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > > > 
> > > > This region has two fields: dt_fd and trap.
> > > > > When QEMU detects a device region of this type, it will create an
> > > > > eventfd and write its eventfd id to dt_fd field.
> > > > > When vendor driver signals this eventfd, QEMU reads trap field of this
> > > > info region.
> > > > - If trap is true, QEMU would search the device's PCI BAR
> > > > regions and disable all the sparse mmaped subregions (if the sparse
> > > > mmaped subregion is disablable).
> > > > - If trap is false, QEMU would re-enable those subregions.
> > > > 
> > > > A typical usage is
> > > > 1. vendor driver first cuts its bar 0 into several sections, all in a
> > > > > sparse mmap array. So initially, all its bar 0 are passthroughed.
> > > > > 2. vendor driver specifies part of bar 0 sections to be disablable.
> > > > 3. on migration starts, vendor driver signals dt_fd and set trap to true
> > > > to notify QEMU disabling the bar 0 sections of disablable flags on.
> > > > 4. QEMU disables those bar 0 section and hence let vendor driver be able
> > > > to trap access of bar 0 registers and make dirty page tracking possible.
> > > > 5. on migration failure, vendor driver signals dt_fd to QEMU again.
> > > > QEMU reads trap field of this info region which is false and QEMU
> > > > re-passthrough the whole bar 0 region.
> > > > 
> > > > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > > > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > > > vfio_pci_mediate_ops->open().
> > > > 
> > > > If vfio-pci detects this cap, it will create a default
> > > > dynamic_trap_bar_info region on behalf of vendor driver with region 
> > > > len=0
> > > > and region->ops=null.
> > > > > Vendor driver should override this region's len, flags, rw, mmap in its
> > > > vfio_pci_mediate_ops.  
> > > 
> > > TBH, I don't like this interface at all.  Userspace doesn't pass data
> > > to the kernel via INFO ioctls.  We have a SET_IRQS ioctl for
> > > configuring user signaling with eventfds.  I think we only need to
> > > define an IRQ type that tells the user to re-evaluate the sparse mmap
> > > information for a region.  The user would enumerate the device IRQs via
> > > GET_IRQ_INFO, find one of this type where the IRQ info would also
> > > indicate which region(s) should be re-evaluated on signaling.  The user
> > > would enable that signaling via SET_IRQS and simply re-evaluate the  
> > ok. I'll try to switch to this way. Thanks for this suggestion.
> > 
> > > sparse mmap capability for the associated regions when signaled.  
> > 
> > Do you like the "disablable" flag of sparse mmap?
> > I think it's a lightweight way for the user to switch the mmap state of a
> > whole region; otherwise, going through a complete flow of GET_REGION_INFO
> > and re-setting up the region might be too heavy.
> 
> No, I don't like the disable-able flag.  At what frequency do we expect
> regions to change?  It seems like we'd only change when switching into
> and out of the _SAVING state, which is rare.  It seems easy for
> userspace, at least QEMU, to drop the entire mmap configuration and
ok. I'll try this way.

> re-read it.  Another concern here is how do we synchronize the event?
> Are we assuming that this event would occur when a user switches to
> _SAVING mode on the device?  That operation is synchronous, the device
> must be in saving mode after the write to device state completes, but
> it seems like this might be trying to add an asynchronous dependency.
> Will the write to device_state only complete once the user handles the
> eventfd?  How would the kernel know when the mmap re-evaluation is
> complete?  It seems like there are gaps here that the vendor driver
> could miss traps required for migration because the user hasn't
> completed the mmap 

Re: [libvirt] [RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

2019-12-08 Thread Yan Zhao
On Sat, Dec 07, 2019 at 05:22:26AM +0800, Alex Williamson wrote:
> On Fri, 6 Dec 2019 02:56:55 -0500
> Yan Zhao  wrote:
> 
> > On Fri, Dec 06, 2019 at 07:55:19AM +0800, Alex Williamson wrote:
> > > On Wed,  4 Dec 2019 22:25:36 -0500
> > > Yan Zhao  wrote:
> > >   
> > > > > When vfio-pci is bound to a physical device, almost all the hardware
> > > > > resources are passed through.
> > > > > Sometimes, the vendor driver of this physical device may want to
> > > > > mediate some hardware resource access for a short period of time,
> > > > > e.g. for dirty page tracking during live migration.
> > > > 
> > > > Here we introduce mediate ops in vfio-pci for this purpose.
> > > > 
> > > > > A vendor driver can register a mediate ops to vfio-pci.
> > > > > But rather than binding directly to the passed-through device, the
> > > > > vendor driver is now either a module that does not bind to any device
> > > > > or a module that binds to another device.
> > > > > E.g. when passing through a VF device that is bound to the vfio-pci
> > > > > module, the PF driver that binds to the PF device can register to
> > > > > vfio-pci to mediate the VF's regions, hence supporting VF live
> > > > > migration.
> > > > 
> > > > The sequence goes like this:
> > > > > 1. The vendor driver registers its vfio_pci_mediate_ops with the
> > > > > vfio-pci driver.
> > > > > 
> > > > > 2. vfio-pci maintains a list of the registered vfio_pci_mediate_ops.
> > > > > 
> > > > > 3. Whenever vfio-pci opens a device, it searches the list and calls
> > > > > vfio_pci_mediate_ops->open() to check whether a vendor driver supports
> > > > > mediating this device.
> > > > > Upon a success return value from vfio_pci_mediate_ops->open(),
> > > > > vfio-pci stops searching the list and stores a mediate handle to
> > > > > represent this open in the vendor driver.
> > > > > (So if multiple vendor drivers support mediating a device through
> > > > > vfio_pci_mediate_ops, only one will win, depending on their
> > > > > registration order.)
> > > > > 
> > > > > 4. Whenever a VFIO_DEVICE_GET_REGION_INFO ioctl is received in vfio-pci
> > > > > ops, it chains into vfio_pci_mediate_ops->get_region_info(), so that
> > > > > the vendor driver can override a region's default flags and caps,
> > > > > e.g. adding a sparse mmap cap to pass through only sub-regions of a
> > > > > whole region.
> > > > > 
> > > > > 5. vfio_pci_rw()/vfio_pci_mmap() first call into
> > > > > vfio_pci_mediate_ops->rw()/vfio_pci_mediate_ops->mmap().
> > > > > If pt=true is returned, vfio_pci_rw()/vfio_pci_mmap() will further
> > > > > pass this read/write/mmap through to the physical device; otherwise
> > > > > it just returns without touching the physical device.
> > > > > 
> > > > > 6. When vfio-pci closes a device, vfio_pci_release() chains into
> > > > > vfio_pci_mediate_ops->release() to close the reference in the vendor
> > > > > driver.
> > > > > 
> > > > > 7. The vendor driver unregisters its vfio_pci_mediate_ops when the
> > > > > driver exits.
> > > > 
> > > > Cc: Kevin Tian 
> > > > 
> > > > Signed-off-by: Yan Zhao 
> > > > ---
> > > >  drivers/vfio/pci/vfio_pci.c | 146 
> > > >  drivers/vfio/pci/vfio_pci_private.h |   2 +
> > > >  include/linux/vfio.h|  16 +++
> > > >  3 files changed, 164 insertions(+)
> > > > 
> > > > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > > > index 02206162eaa9..55080ff29495 100644
> > > > --- a/drivers/vfio/pci/vfio_pci.c
> > > > +++ b/drivers/vfio/pci/vfio_pci.c
> > > > @@ -54,6 +54,14 @@ module_param(disable_idle_d3, bool, S_IRUGO | 
> > > > S_IWUSR);
> > > >  MODULE_PARM_DESC(disable_idle_d3,
> > > >  "Disable using the PCI D3 low power state for idle, 
> > > > unused devices");
> > > >  
> > > > +static LIST_HEAD(mediate_ops_list);
> > > > +static DEFINE_MUTEX(mediate_ops_list_lock);
> > > > +struct vfio_pci_mediate_ops_list_entry {
> > > > +   struct vfio_pci_mediate_ops *ops;
> > > > +   int refcnt;
> > > > > +   struct list_head next;
> > > > +};
> > > > +
> > > >  static inline bool vfio_vga_

Re: [libvirt] [RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

2019-12-08 Thread Yan Zhao
Sorry about that. I'll pay attention to them next time and thank you for
pointing them out :)

On Sat, Dec 07, 2019 at 07:13:30AM +0800, Eric Blake wrote:
> On 12/4/19 9:25 PM, Yan Zhao wrote:
> > when vfio-pci is bound to a physical device, almost all the hardware
> > resources are passthroughed.
> 
> The intent is obvious, but it sounds awkward to a native speaker.
> s/passthroughed/passed through/
> 
> > Sometimes, vendor driver of this physcial device may want to mediate some
> 
> physical
> 
> > hardware resource access for a short period of time, e.g. dirty page
> > tracking during live migration.
> > 
> > Here we introduce mediate ops in vfio-pci for this purpose.
> > 
> > Vendor driver can register a mediate ops to vfio-pci.
> > But rather than directly bind to the passthroughed device, the
> 
> passed-through
> 
> -- 
> Eric Blake, Principal Software Engineer
> Red Hat, Inc.   +1-919-301-3226
> Virtualization:  qemu.org | libvirt.org
> 


--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list



Re: [libvirt] [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

2019-12-06 Thread Yan Zhao
On Fri, Dec 06, 2019 at 05:40:02PM +0800, Jason Wang wrote:
> 
> On 2019/12/6 4:22 PM, Yan Zhao wrote:
> > On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:
> >> On 2019/12/5 4:51 PM, Yan Zhao wrote:
> >>> On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
> >>>> Hi:
> >>>>
> >>>> On 2019/12/5 11:24 AM, Yan Zhao wrote:
> >>>>> For SR-IOV devices, VFs are passed through to the guest directly
> >>>>> without host driver mediation. However, when migrating VMs with
> >>>>> passed-through VFs, dynamic host mediation is required to (1) get
> >>>>> device states and (2) get dirty pages. Since device states as well as
> >>>>> other critical information required for dirty page tracking for VFs
> >>>>> are usually retrieved from PFs, it is handy to provide an extension
> >>>>> in the PF driver to centrally control VFs' migration.
> >>>>>
> >>>>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
> >>>>> dynamically trap VFs' bars for dirty page tracking and
> >>>> A silly question, what's the reason for doing this, is this a must for
> >>>> dirty page tracking?
> >>>>
> >>> For performance reasons. VFs' BARs should be passed through in
> >>> normal operation and only enter the trap state when needed.
> >>
> >> Right, but how does this matter for the case of dirty page tracking?
> >>
> > Take a NIC as an example: to track its VF's dirty pages in software, it
> > is necessary to trap every write of the ring tail, which resides in BAR0.
> 
> 
> Interesting, but it looks like we need to:
> - decode the instruction
> - mediate all access to BAR0
> All of which seems a great burden for the VF driver. I wonder whether or
> not doing interrupt relay and tracking head is better in this case.
>
Hi Jason,

I'm not familiar with the approach you mentioned. Could you elaborate?
> 
> >   There's
> > still no IOMMU Dirty bit available.
> >>>>> (3) centralizing
> >>>>> VF critical state retrieval and VF controls in one driver, we
> >>>>> propose to introduce mediate ops on top of the current vfio-pci
> >>>>> device driver.
> >>>>>
> >>>>>
> >>>>>                                    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >>>>>  __________   register mediate ops|  ___________     ___________    |
> >>>>> |          |<-----------------------|     VF    |   |           |
> >>>>> | vfio-pci |                      | |  mediate  |   | PF driver |   |
> >>>>> |__________|----------------------->|   driver  |   |___________|
> >>>>>      |            open(pdev)      |  -----------          |         |
> >>>>>      |                                                    |
> >>>>>      |                            |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
> >>>>>     \|/                                                  \|/
> >>>>> -----------                                          -----------
> >>>>> |    VF   |                                          |    PF   |
> >>>>> -----------                                          -----------
> >>>>>
> >>>>> VF mediate driver could be a standalone driver that does not bind to
> >>>>> any devices (as in demo code in patches 5-6) or it could be a built-in
> >>>>> extension of PF driver (as in patches 7-9) .
> >>>>>
> >>>>> Rather than binding directly to the VF, the VF mediate driver registers
> >>>>> a mediate ops with vfio-pci at driver init. vfio-pci maintains a list
> >>>>> of such mediate ops.
> >>>>> (Note that the VF mediate driver can register mediate ops with vfio-pci
> >>>>> before vfio-pci binds to any devices, and a VF mediate driver can
> >>>>> support mediating multiple devices.)
> >>>>>
> >>>>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
> >>>>> list and calls each vfio_pci_mediate_ops->open() with the pdev of the
> >>>>> device being opened as a parameter.
> >>>>> VF mediate driver should return success or failure 

Re: [libvirt] [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

2019-12-06 Thread Yan Zhao
On Thu, Dec 05, 2019 at 09:05:54PM +0800, Jason Wang wrote:
> 
> On 2019/12/5 4:51 PM, Yan Zhao wrote:
> > On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
> >> Hi:
> >>
> >> On 2019/12/5 11:24 AM, Yan Zhao wrote:
> >>> For SR-IOV devices, VFs are passed through to the guest directly without
> >>> host driver mediation. However, when migrating VMs with passed-through
> >>> VFs, dynamic host mediation is required to (1) get device states and
> >>> (2) get dirty pages. Since device states as well as other critical
> >>> information required for dirty page tracking for VFs are usually
> >>> retrieved from PFs, it is handy to provide an extension in the PF driver
> >>> to centrally control VFs' migration.
> >>>
> >>> Therefore, in order to realize (1) passthrough VFs at normal time, (2)
> >>> dynamically trap VFs' bars for dirty page tracking and
> >>
> >> A silly question, what's the reason for doing this, is this a must for
> >> dirty page tracking?
> >>
> > For performance reasons. VFs' BARs should be passed through in
> > normal operation and only enter the trap state when needed.
> 
> 
> Right, but how does this matter for the case of dirty page tracking?
>
Take a NIC as an example: to track its VF's dirty pages in software, it is
necessary to trap every write of the ring tail, which resides in BAR0. There's
still no IOMMU dirty bit available.
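
To make the ring-tail example concrete, here is a minimal userspace sketch in
C. It is not code from this series: the ring layout, struct vf_ring,
trap_tail_write() and the sizes are hypothetical, and a real NIC has its own
descriptor format. It only shows the idea that each trapped tail write tells
the host which buffers were just handed to the device and may therefore be
dirtied by DMA.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RING_SIZE  8u   /* hypothetical number of descriptors */
#define PAGE_SHIFT 12u  /* 4 KiB pages */
#define NR_PAGES   64u  /* hypothetical guest memory size, in pages */

struct rx_desc {
	uint64_t buf_addr;  /* guest-physical address of the buffer */
	uint32_t buf_len;   /* buffer length in bytes */
};

struct vf_ring {
	struct rx_desc desc[RING_SIZE];
	uint32_t tail;          /* last tail value seen by the host */
	bool dirty[NR_PAGES];   /* one flag per guest page */
};

static void ring_init(struct vf_ring *r)
{
	memset(r, 0, sizeof(*r));
}

/* Mark every page touched by [addr, addr + len) as dirty. */
static void mark_dirty(struct vf_ring *r, uint64_t addr, uint32_t len)
{
	uint64_t pfn, last;

	if (len == 0)
		return;
	last = (addr + len - 1) >> PAGE_SHIFT;
	for (pfn = addr >> PAGE_SHIFT; pfn <= last && pfn < NR_PAGES; pfn++)
		r->dirty[pfn] = true;
}

/*
 * Called when a guest write to the trapped tail register is intercepted.
 * Descriptors in [old tail, new tail) were just handed to the device, so
 * their buffers are conservatively treated as dirty.
 */
static void trap_tail_write(struct vf_ring *r, uint32_t new_tail)
{
	uint32_t i = r->tail;

	new_tail %= RING_SIZE;
	while (i != new_tail) {
		mark_dirty(r, r->desc[i].buf_addr, r->desc[i].buf_len);
		i = (i + 1) % RING_SIZE;
	}
	r->tail = new_tail;
}

/* Count the pages currently flagged as dirty. */
static unsigned int count_dirty(const struct vf_ring *r)
{
	unsigned int i, n = 0;

	for (i = 0; i < NR_PAGES; i++)
		n += r->dirty[i] ? 1u : 0u;
	return n;
}
```

The sketch over-reports on purpose: a posted buffer is marked dirty whether or
not the device ends up writing to it, which is safe for migration but not
minimal.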
> 
> >
> >>>(3) centralizing
> >>> VF critical state retrieval and VF controls in one driver, we propose
> >>> to introduce mediate ops on top of the current vfio-pci device driver.
> >>>
> >>>
> >>>                                    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >>>  __________   register mediate ops|  ___________     ___________    |
> >>> |          |<-----------------------|     VF    |   |           |
> >>> | vfio-pci |                      | |  mediate  |   | PF driver |   |
> >>> |__________|----------------------->|   driver  |   |___________|
> >>>      |            open(pdev)      |  -----------          |         |
> >>>      |                                                    |
> >>>      |                            |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
> >>>     \|/                                                  \|/
> >>> -----------                                          -----------
> >>> |    VF   |                                          |    PF   |
> >>> -----------                                          -----------
> >>>
> >>> VF mediate driver could be a standalone driver that does not bind to
> >>> any devices (as in demo code in patches 5-6) or it could be a built-in
> >>> extension of PF driver (as in patches 7-9) .
> >>>
> >>> Rather than binding directly to the VF, the VF mediate driver registers
> >>> a mediate ops with vfio-pci at driver init. vfio-pci maintains a list of
> >>> such mediate ops.
> >>> (Note that the VF mediate driver can register mediate ops with vfio-pci
> >>> before vfio-pci binds to any devices, and a VF mediate driver can
> >>> support mediating multiple devices.)
> >>>
> >>> When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
> >>> list and calls each vfio_pci_mediate_ops->open() with the pdev of the
> >>> device being opened as a parameter.
> >>> The VF mediate driver should return success or failure depending on
> >>> whether it supports the pdev or not.
> >>> E.g. the VF mediate driver would compare its supported VF devfn with the
> >>> devfn of the passed-in pdev.
> >>> Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
> >>> stop querying other mediate ops and bind the device being opened to this
> >>> mediate ops using the returned mediate handle.
> >>>
> >>> Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
> >>> VF will be intercepted by the VF mediate driver as
> >>> vfio_pci_mediate_ops->get_region_info(),
> >>> vfio_pci_mediate_ops->rw,
> >>> vfio_pci_mediate_ops->mmap, and get customized.
> >>> vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap additionally
> >>> return 'pt' to indicate whether vfio-pci should pass the access
> >>> through to hw.
> >>>
> >>> when vfio-pci clos

Re: [libvirt] [RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

2019-12-06 Thread Yan Zhao
On Fri, Dec 06, 2019 at 07:55:19AM +0800, Alex Williamson wrote:
> On Wed,  4 Dec 2019 22:25:36 -0500
> Yan Zhao  wrote:
> 
> > When vfio-pci is bound to a physical device, almost all the hardware
> > resources are passed through.
> > Sometimes, the vendor driver of this physical device may want to mediate
> > some hardware resource access for a short period of time, e.g. for dirty
> > page tracking during live migration.
> > 
> > Here we introduce mediate ops in vfio-pci for this purpose.
> > 
> > A vendor driver can register a mediate ops to vfio-pci.
> > But rather than binding directly to the passed-through device, the
> > vendor driver is now either a module that does not bind to any device or
> > a module that binds to another device.
> > E.g. when passing through a VF device that is bound to the vfio-pci module,
> > the PF driver that binds to the PF device can register to vfio-pci to
> > mediate the VF's regions, hence supporting VF live migration.
> > 
> > The sequence goes like this:
> > 1. The vendor driver registers its vfio_pci_mediate_ops with the vfio-pci
> > driver.
> > 
> > 2. vfio-pci maintains a list of the registered vfio_pci_mediate_ops.
> > 
> > 3. Whenever vfio-pci opens a device, it searches the list and calls
> > vfio_pci_mediate_ops->open() to check whether a vendor driver supports
> > mediating this device.
> > Upon a success return value from vfio_pci_mediate_ops->open(),
> > vfio-pci stops searching the list and stores a mediate handle to
> > represent this open in the vendor driver.
> > (So if multiple vendor drivers support mediating a device through
> > vfio_pci_mediate_ops, only one will win, depending on their registration
> > order.)
> > 
> > 4. Whenever a VFIO_DEVICE_GET_REGION_INFO ioctl is received in vfio-pci
> > ops, it chains into vfio_pci_mediate_ops->get_region_info(), so that
> > the vendor driver can override a region's default flags and caps,
> > e.g. adding a sparse mmap cap to pass through only sub-regions of a whole
> > region.
> > 
> > 5. vfio_pci_rw()/vfio_pci_mmap() first call into
> > vfio_pci_mediate_ops->rw()/vfio_pci_mediate_ops->mmap().
> > If pt=true is returned, vfio_pci_rw()/vfio_pci_mmap() will further
> > pass this read/write/mmap through to the physical device; otherwise it
> > just returns without touching the physical device.
> > 
> > 6. When vfio-pci closes a device, vfio_pci_release() chains into
> > vfio_pci_mediate_ops->release() to close the reference in the vendor
> > driver.
> > 
> > 7. The vendor driver unregisters its vfio_pci_mediate_ops when the driver
> > exits.
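
The sequence above can be modelled in a few lines of userspace C. This is only
a sketch under stated assumptions: struct mediate_ops here is a simplified
stand-in whose signatures are made up for illustration, not the real
vfio_pci_mediate_ops, and the demo_a/demo_b drivers are hypothetical. It shows
steps 1-3 (register, list search, first successful open() wins) and step 5
(the 'pt' result decides whether the access reaches hardware).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_OPS 4  /* hypothetical cap on registered ops */

/* Simplified stand-in for vfio_pci_mediate_ops; signatures are assumptions. */
struct mediate_ops {
	const char *name;
	/* Return 0 and fill *handle if this driver mediates device devfn. */
	int (*open)(unsigned int devfn, uint32_t *handle);
	/* Handle an access; set *pt to true to let it reach the hardware. */
	void (*rw)(uint32_t handle, uint64_t off, bool *pt);
};

static const struct mediate_ops *ops_list[MAX_OPS];
static size_t nr_ops;

/* Step 1: a vendor driver registers its ops. */
static int register_mediate_ops(const struct mediate_ops *ops)
{
	if (nr_ops == MAX_OPS)
		return -1;
	ops_list[nr_ops++] = ops;
	return 0;
}

struct device {
	unsigned int devfn;
	const struct mediate_ops *ops;  /* NULL when nobody mediates */
	uint32_t handle;
	unsigned int hw_accesses;       /* accesses that reached "hardware" */
};

/* Step 3: walk the list; the first successful open() wins. */
static void device_open(struct device *dev)
{
	size_t i;

	dev->ops = NULL;
	for (i = 0; i < nr_ops; i++) {
		if (ops_list[i]->open(dev->devfn, &dev->handle) == 0) {
			dev->ops = ops_list[i];
			break;
		}
	}
}

/* Step 5: ask the mediate ops first, pass through only if pt is true. */
static void device_rw(struct device *dev, uint64_t off)
{
	bool pt = true;

	if (dev->ops && dev->ops->rw)
		dev->ops->rw(dev->handle, off, &pt);
	if (pt)
		dev->hw_accesses++;
}

/* Hypothetical driver A: mediates devfn 0x10, traps offsets below 0x1000. */
static int a_open(unsigned int devfn, uint32_t *handle)
{
	if (devfn != 0x10)
		return -1;
	*handle = 1;
	return 0;
}

static void a_rw(uint32_t handle, uint64_t off, bool *pt)
{
	(void)handle;
	*pt = off >= 0x1000;
}

/* Hypothetical driver B: mediates devfn 0x10 and 0x18, never traps. */
static int b_open(unsigned int devfn, uint32_t *handle)
{
	if (devfn != 0x10 && devfn != 0x18)
		return -1;
	*handle = 2;
	return 0;
}

static const struct mediate_ops demo_a = { "demo_a", a_open, a_rw };
static const struct mediate_ops demo_b = { "demo_b", b_open, NULL };
```

With demo_a registered before demo_b, a device with devfn 0x10 ends up bound to
demo_a even though both drivers accept it, which is the registration-order
dependency mentioned in step 3.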
> > 
> > Cc: Kevin Tian 
> > 
> > Signed-off-by: Yan Zhao 
> > ---
> >  drivers/vfio/pci/vfio_pci.c | 146 
> >  drivers/vfio/pci/vfio_pci_private.h |   2 +
> >  include/linux/vfio.h|  16 +++
> >  3 files changed, 164 insertions(+)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 02206162eaa9..55080ff29495 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -54,6 +54,14 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
> >  MODULE_PARM_DESC(disable_idle_d3,
> >  "Disable using the PCI D3 low power state for idle, unused devices");
> >  
> > +static LIST_HEAD(mediate_ops_list);
> > +static DEFINE_MUTEX(mediate_ops_list_lock);
> > +struct vfio_pci_mediate_ops_list_entry {
> > +   struct vfio_pci_mediate_ops *ops;
> > +   int refcnt;
> > +   struct list_head next;
> > +};
> > +
> >  static inline bool vfio_vga_disabled(void)
> >  {
> >  #ifdef CONFIG_VFIO_PCI_VGA
> > @@ -472,6 +480,10 @@ static void vfio_pci_release(void *device_data)
> > if (!(--vdev->refcnt)) {
> > vfio_spapr_pci_eeh_release(vdev->pdev);
> > vfio_pci_disable(vdev);
> > +   if (vdev->mediate_ops && vdev->mediate_ops->release) {
> > +   vdev->mediate_ops->release(vdev->mediate_handle);
> > +   vdev->mediate_ops = NULL;
> > +   }
> > }
> >  
> > mutex_unlock(&vdev->reflck->lock);
> > @@ -483,6 +495,7 @@ static int vfio_pci_open(void *device_data)
> >  {
> > struct vfio_pci_device *vdev = device_data;
> > int ret = 0;
> > +   struct vfio_pci_mediate_ops_list_entry *mentry;
> >  
> > if (!try_module_get(THIS_MODULE))
> > return -ENODEV;
> >

Re: [libvirt] [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

2019-12-05 Thread Yan Zhao
On Fri, Dec 06, 2019 at 07:55:30AM +0800, Alex Williamson wrote:
> On Wed,  4 Dec 2019 22:26:50 -0500
> Yan Zhao  wrote:
> 
> > Dynamic trap bar info region is a channel for QEMU and vendor driver to
> > communicate dynamic trap info. It is of type
> > VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
> > VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.
> > 
> > This region has two fields: dt_fd and trap.
> > When QEMU detects a device region of this type, it will create an
> > eventfd and write its eventfd id to the dt_fd field.
> > When the vendor driver signals this eventfd, QEMU reads the trap field of
> > this info region.
> > - If trap is true, QEMU searches the device's PCI BAR
> > regions and disables all the sparse mmapped subregions (if the sparse
> > mmapped subregion is disablable).
> > - If trap is false, QEMU re-enables those subregions.
> > 
> > A typical usage is:
> > 1. The vendor driver first cuts its BAR 0 into several sections, all in a
> > sparse mmap array. So initially, all of BAR 0 is passed through.
> > 2. The vendor driver marks some of the BAR 0 sections as disablable.
> > 3. When migration starts, the vendor driver signals dt_fd and sets trap to
> > true to tell QEMU to disable the BAR 0 sections that have the disablable
> > flag set.
> > 4. QEMU disables those BAR 0 sections, which lets the vendor driver trap
> > accesses to BAR 0 registers and makes dirty page tracking possible.
> > 5. On migration failure, the vendor driver signals dt_fd to QEMU again.
> > QEMU reads the trap field of this info region, which is now false, and
> > passes through the whole BAR 0 region again.
> > 
> > Vendor driver specifies whether it supports dynamic-trap-bar-info region
> > through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
> > vfio_pci_mediate_ops->open().
> > 
> > If vfio-pci detects this cap, it will create a default
> > dynamic_trap_bar_info region on behalf of the vendor driver with region
> > len=0 and region->ops=null.
> > The vendor driver should override this region's len, flags, rw, mmap in its
> > vfio_pci_mediate_ops.
> 
> TBH, I don't like this interface at all.  Userspace doesn't pass data
> to the kernel via INFO ioctls.  We have a SET_IRQS ioctl for
> configuring user signaling with eventfds.  I think we only need to
> define an IRQ type that tells the user to re-evaluate the sparse mmap
> information for a region.  The user would enumerate the device IRQs via
> GET_IRQ_INFO, find one of this type where the IRQ info would also
> indicate which region(s) should be re-evaluated on signaling.  The user
> would enable that signaling via SET_IRQS and simply re-evaluate the
ok. I'll try to switch to this way. Thanks for this suggestion.

> sparse mmap capability for the associated regions when signaled.

Do you like the "disablable" flag of sparse mmap?
I think it's a lightweight way for the user to switch the mmap state of a whole
region; otherwise, going through a complete flow of GET_REGION_INFO and
re-setting up the region might be too heavy.
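
For reference, the eventfd behaviour that this kind of signaling relies on can
be shown in plain userspace C. This is only a sketch: signal_event() and
drain_events() are hypothetical helper names, and it does not use
VFIO_DEVICE_SET_IRQS or any VFIO ioctl. It just shows that several signals
coalesce into one counter value, so the receiver should re-read the current
state once per wakeup rather than assume one event per change.

```c
#include <stdint.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Producer side: add 1 to the eventfd counter. Returns 0 on success. */
static int signal_event(int efd)
{
	uint64_t one = 1;

	return write(efd, &one, sizeof(one)) == (ssize_t)sizeof(one) ? 0 : -1;
}

/*
 * Consumer side: read and reset the counter.
 * Returns the number of signals since the last read, or -1 if the read
 * failed (for a non-blocking eventfd this includes "nothing pending").
 */
static int64_t drain_events(int efd)
{
	uint64_t n;

	if (read(efd, &n, sizeof(n)) != (ssize_t)sizeof(n))
		return -1;
	return (int64_t)n;
}
```

A caller would create the descriptor with eventfd(0, EFD_NONBLOCK), hand it to
the signaling side, and on each wakeup call drain_events() followed by a fresh
query of whatever state the event refers to.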

Thanks
Yan

> Thanks,
> 
> Alex
>




> > 
> > Cc: Kevin Tian 
> > 
> > Signed-off-by: Yan Zhao 
> > ---
> >  drivers/vfio/pci/vfio_pci.c | 16 
> >  include/linux/vfio.h|  3 ++-
> >  include/uapi/linux/vfio.h   | 11 +++
> >  3 files changed, 29 insertions(+), 1 deletion(-)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index 059660328be2..62b811ca43e4 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -127,6 +127,19 @@ void init_migration_region(struct vfio_pci_device *vdev)
> > NULL);
> >  }
> >  
> > +/**
> > + * register a region to hold info for dynamically trap bar regions
> > + */
> > +void init_dynamic_trap_bar_info_region(struct vfio_pci_device *vdev)
> > +{
> > +   vfio_pci_register_dev_region(vdev,
> > +   VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO,
> > +   VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO,
> > +   NULL, 0,
> > +   VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
> > +   NULL);
> > +}
> > +
> >  static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> >  {
> > struct resource *res;
> > @@ -538,6 +551,9 @@ static int vfio_pci_open(void *device_data)
> > if (caps & VFIO_PCI_DEVICE_CAP_MIGRATION)
> > init_migration_region(vdev);
> >  
> > +   if (caps & VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR)
> > 

Re: [libvirt] [RFC PATCH 3/9] vfio/pci: register a default migration region

2019-12-05 Thread Yan Zhao
On Fri, Dec 06, 2019 at 07:55:15AM +0800, Alex Williamson wrote:
> On Wed,  4 Dec 2019 22:26:38 -0500
> Yan Zhao  wrote:
> 
> > The vendor driver specifies whether it supports a migration region through
> > cap VFIO_PCI_DEVICE_CAP_MIGRATION in vfio_pci_mediate_ops->open().
> > 
> > If vfio-pci detects this cap, it creates a default migration region on
> > behalf of vendor driver with region len=0 and region->ops=null.
> > Vendor driver should override this region's len, flags, rw, mmap in
> > its vfio_pci_mediate_ops.
> > 
> > This migration region definition is aligned to QEMU vfio migration code v8:
> > (https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html)
> > 
> > Cc: Kevin Tian 
> > 
> > Signed-off-by: Yan Zhao 
> > ---
> >  drivers/vfio/pci/vfio_pci.c |  15 
> >  include/linux/vfio.h|   1 +
> >  include/uapi/linux/vfio.h   | 149 
> >  3 files changed, 165 insertions(+)
> > 
> > diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
> > index f3730252ee82..059660328be2 100644
> > --- a/drivers/vfio/pci/vfio_pci.c
> > +++ b/drivers/vfio/pci/vfio_pci.c
> > @@ -115,6 +115,18 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
> > return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
> >  }
> >  
> > +/**
> > + * init a region to hold migration ctl & data
> > + */
> > +void init_migration_region(struct vfio_pci_device *vdev)
> > +{
> > +   vfio_pci_register_dev_region(vdev, VFIO_REGION_TYPE_MIGRATION,
> > +   VFIO_REGION_SUBTYPE_MIGRATION,
> > +   NULL, 0,
> > +   VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
> > +   NULL);
> > +}
> > +
> >  static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
> >  {
> > struct resource *res;
> > @@ -523,6 +535,9 @@ static int vfio_pci_open(void *device_data)
> > vdev->mediate_ops = mentry->ops;
> > vdev->mediate_handle = handle;
> >  
> > +   if (caps & VFIO_PCI_DEVICE_CAP_MIGRATION)
> > +   init_migration_region(vdev);
> 
> No.  We're not going to add a cap flag for every region the mediation
> driver wants to add.  The mediation driver should have the ability to
> add regions and irqs to the device itself.  Thanks,
> 
> Alex
>
ok. got it. will do it.

Thanks
Yan

> > +
> > pr_info("vfio pci found mediate_ops %s, caps=%llx, handle=%x for %x:%x\n",
> > vdev->mediate_ops->name, caps,
> > handle, vdev->pdev->vendor,
> 





[libvirt] [RFC PATCH 7/9] i40e/vf_migration: register mediate_ops to vfio-pci

2019-12-05 Thread Yan Zhao
Register vfio_pci_mediate_ops to vfio-pci when i40e binds to the PF, to
support mediating the VF's vfio-pci ops.
Unregister vfio_pci_mediate_ops when i40e unbinds from the PF.

vfio_pci_mediate_ops->open will return success if the devfn of the device
passed in equals the devfn of one of its VFs.
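
As a quick illustration of that devfn match, the loop in
i40e_vf_migration_open() below boils down to the following userspace sketch.
find_vf_id() is a hypothetical helper written for this note; the formula
(PF devfn + SR-IOV offset + stride * i, masked to 8 bits) is the one used in
the patch, but the example values for offset and stride are made up.

```c
/*
 * Return the index of the VF whose devfn equals vf_devfn, or -1 if none
 * of the PF's num_vfs VFs matches. Like the patch, this compares only the
 * low 8 bits (device/function), not the bus number.
 */
static int find_vf_id(unsigned int pf_devfn, unsigned int offset,
		      unsigned int stride, int num_vfs,
		      unsigned int vf_devfn)
{
	int i;

	for (i = 0; i < num_vfs; i++) {
		unsigned int devfn = (pf_devfn + offset + stride * i) & 0xff;

		if (devfn == vf_devfn)
			return i;
	}
	return -1;
}
```

For example, with a PF at devfn 0x00, offset 16 and stride 1, four VFs sit at
devfn 0x10 through 0x13, so a device with devfn 0x12 maps to VF index 2 and a
device with devfn 0x20 is rejected.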

Cc: Shaopeng He 

Signed-off-by: Yan Zhao 
---
 drivers/net/ethernet/intel/Kconfig|   2 +-
 drivers/net/ethernet/intel/i40e/Makefile  |   3 +-
 drivers/net/ethernet/intel/i40e/i40e.h|   2 +
 drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
 .../ethernet/intel/i40e/i40e_vf_migration.c   | 169 ++
 .../ethernet/intel/i40e/i40e_vf_migration.h   |  52 ++
 6 files changed, 229 insertions(+), 2 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h

diff --git a/drivers/net/ethernet/intel/Kconfig b/drivers/net/ethernet/intel/Kconfig
index 154e2e818ec6..b5c7fdf55380 100644
--- a/drivers/net/ethernet/intel/Kconfig
+++ b/drivers/net/ethernet/intel/Kconfig
@@ -240,7 +240,7 @@ config IXGBEVF_IPSEC
 config I40E
tristate "Intel(R) Ethernet Controller XL710 Family support"
imply PTP_1588_CLOCK
-   depends on PCI
+   depends on PCI && VFIO_PCI
---help---
  This driver supports Intel(R) Ethernet Controller XL710 Family of
  devices.  For more information on how to identify your adapter, go
diff --git a/drivers/net/ethernet/intel/i40e/Makefile b/drivers/net/ethernet/intel/i40e/Makefile
index 2f21b3e89fd0..ae7a6a23dba9 100644
--- a/drivers/net/ethernet/intel/i40e/Makefile
+++ b/drivers/net/ethernet/intel/i40e/Makefile
@@ -24,6 +24,7 @@ i40e-objs := i40e_main.o \
i40e_ddp.o \
i40e_client.o   \
i40e_virtchnl_pf.o \
-   i40e_xsk.o
+   i40e_xsk.o  \
+   i40e_vf_migration.o
 
 i40e-$(CONFIG_I40E_DCB) += i40e_dcb.o i40e_dcb_nl.o
diff --git a/drivers/net/ethernet/intel/i40e/i40e.h b/drivers/net/ethernet/intel/i40e/i40e.h
index 2af9f6308f84..0141c94b835f 100644
--- a/drivers/net/ethernet/intel/i40e/i40e.h
+++ b/drivers/net/ethernet/intel/i40e/i40e.h
@@ -1162,4 +1162,6 @@ int i40e_add_del_cloud_filter(struct i40e_vsi *vsi,
 int i40e_add_del_cloud_filter_big_buf(struct i40e_vsi *vsi,
  struct i40e_cloud_filter *filter,
  bool add);
+int i40e_vf_migration_register(void);
+void i40e_vf_migration_unregister(void);
 #endif /* _I40E_H_ */
diff --git a/drivers/net/ethernet/intel/i40e/i40e_main.c b/drivers/net/ethernet/intel/i40e/i40e_main.c
index 6031223eafab..92d1c3fdc808 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_main.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_main.c
@@ -15274,6 +15274,7 @@ static int i40e_probe(struct pci_dev *pdev, const struct pci_device_id *ent)
/* print a string summarizing features */
i40e_print_features(pf);
 
+   i40e_vf_migration_register();
return 0;
 
/* Unwind what we've done if something failed in the setup */
@@ -15320,6 +15321,8 @@ static void i40e_remove(struct pci_dev *pdev)
i40e_status ret_code;
int i;
 
+   i40e_vf_migration_unregister();
+
i40e_dbg_pf_exit(pf);
 
i40e_ptp_stop(pf);
diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
new file mode 100644
index 000000000000..b2d913459600
--- /dev/null
+++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
@@ -0,0 +1,169 @@
+// SPDX-License-Identifier: GPL-2.0
+/* Copyright(c) 2013 - 2019 Intel Corporation. */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#include "i40e.h"
+#include "i40e_vf_migration.h"
+
+static long open_device_bits[MAX_OPEN_DEVICE / BITS_PER_LONG + 1];
+static DEFINE_MUTEX(device_bit_lock);
+static struct i40e_vf_migration *i40e_vf_dev_array[MAX_OPEN_DEVICE];
+
+int i40e_vf_migration_open(struct pci_dev *pdev, u64 *caps, u32 *dm_handle)
+{
+   int i, ret = 0;
+   struct i40e_vf_migration *i40e_vf_dev = NULL;
+   int handle;
+   struct pci_dev *pf_dev, *vf_dev;
+   struct i40e_pf *pf;
+   struct i40e_vf *vf;
+   unsigned int vf_devfn, devfn;
+   int vf_id = -1;
+
+   if (!try_module_get(THIS_MODULE))
+   return -ENODEV;
+
+   pf_dev = pdev->physfn;
+   pf = pci_get_drvdata(pf_dev);
+   vf_dev = pdev;
+   vf_devfn = vf_dev->devfn;
+
+   for (i = 0; i < pci_num_vf(pf_dev); i++) {
+   devfn = (pf_dev->devfn + pf_dev->sriov->offset +
+pf_dev->sriov->stride * i) & 0xff;
+   if (devfn == vf_devfn) {
+   vf_id = i;
+   break;
+   }
+   }
+
+   if (vf_id == -1) {
+   ret = -EINVAL;
+   goto out;
+   }
+
+   mutex_lock(&device_bit_lock);
+  

[libvirt] [RFC PATCH 6/9] sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0

2019-12-05 Thread Yan Zhao
This sample code first returns device
cap |= VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR, so that the vfio-pci driver
creates a dynamic-trap-bar-info region for it
(of type VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and
subtype VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO).

Then in igd_dt_get_region_info(), this sample driver customizes the
size of the dynamic-trap-bar-info region.
Also, this sample driver customizes the BAR 0 region to be sparse mmapped
(only passing through the subregion from BAR0_DYNAMIC_TRAP_OFFSET of size
BAR0_DYNAMIC_TRAP_SIZE) and sets this sparse mmapped subregion as disablable.

Then when QEMU detects the dynamic trap bar info region, it creates
an eventfd and writes its fd into the 'dt_fd' field of this region.

When BAR0's registers below BAR0_DYNAMIC_TRAP_OFFSET are trapped, the driver
signals the eventfd to notify QEMU to read the 'trap' field of the dynamic trap
bar info region and switch the previously passed-through subregion to trapped.
After registers within the subregion defined by BAR0_DYNAMIC_TRAP_OFFSET and
BAR0_DYNAMIC_TRAP_SIZE are trapped, this sample driver notifies QEMU via
eventfd to pass through this subregion again.
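
The resulting BAR 0 layout can be summarised with a small userspace sketch. It
is not part of the sample driver: struct sparse_area and access_is_trapped()
are hypothetical names, and the model assumes the user simply unmaps a
disablable area while 'trap' is set. The two macros use the same values as the
patch below.

```c
#include <stdbool.h>
#include <stdint.h>

#define BAR0_DYNAMIC_TRAP_OFFSET (32 * 1024)
#define BAR0_DYNAMIC_TRAP_SIZE   (32 * 1024)

/* Hypothetical userspace view of one sparse mmap area. */
struct sparse_area {
	uint64_t offset;
	uint64_t size;
	bool disablable;
};

static const struct sparse_area bar0_area = {
	BAR0_DYNAMIC_TRAP_OFFSET, BAR0_DYNAMIC_TRAP_SIZE, true
};

/* An area stays mmapped unless it is disablable and trap is requested. */
static bool area_is_mapped(const struct sparse_area *a, bool trap)
{
	return !(trap && a->disablable);
}

/*
 * An access is trapped (goes through read/write emulation) when it falls
 * outside the sparse area, or inside an area that is currently unmapped.
 */
static bool access_is_trapped(const struct sparse_area *a, bool trap,
			      uint64_t off)
{
	bool inside = off >= a->offset && off - a->offset < a->size;

	return !inside || !area_is_mapped(a, trap);
}
```

So with trap cleared, only the low 32K of BAR 0 is emulated and the upper 32K
is accessed directly; with trap set, every access in the 64K range is
emulated.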

Cc: Kevin Tian 

Signed-off-by: Yan Zhao 
---
 samples/vfio-pci/igd_dt.c | 176 ++
 1 file changed, 176 insertions(+)

diff --git a/samples/vfio-pci/igd_dt.c b/samples/vfio-pci/igd_dt.c
index 857e8d01b0d1..58ef110917f1 100644
--- a/samples/vfio-pci/igd_dt.c
+++ b/samples/vfio-pci/igd_dt.c
@@ -29,6 +29,9 @@
 /* This driver supports to open max 256 device devices */
 #define MAX_OPEN_DEVICE 256
 
+#define BAR0_DYNAMIC_TRAP_OFFSET (32*1024)
+#define BAR0_DYNAMIC_TRAP_SIZE (32*1024)
+
 /*
  * below are pciids of two IGD devices supported in this driver
  * It is only for demo purpose.
@@ -47,10 +50,30 @@ struct igd_dt_device {
__u32 vendor;
__u32 device;
__u32 handle;
+
+   __u64 dt_region_index;
+   struct eventfd_ctx *dt_trigger;
+   bool is_highend_trapped;
+   bool is_trap_triggered;
 };
 
 static struct igd_dt_device *igd_device_array[MAX_OPEN_DEVICE];
 
+static bool is_handle_valid(int handle)
+{
+   mutex_lock(&device_bit_lock);
+
+   if (handle >= MAX_OPEN_DEVICE || !igd_device_array[handle] ||
+   !test_bit(handle, igd_device_bits)) {
+   pr_err("%s: handle mismatch, please check interaction with vfio-pci module\n",
+   __func__);
+   mutex_unlock(&device_bit_lock);
+   return false;
+   }
+   mutex_unlock(&device_bit_lock);
+   return true;
+}
+
 int igd_dt_open(struct pci_dev *pdev, u64 *caps, u32 *mediate_handle)
 {
int supported_dev_cnt = sizeof(pciidlist)/sizeof(struct pci_device_id);
@@ -88,6 +111,7 @@ int igd_dt_open(struct pci_dev *pdev, u64 *caps, u32 
*mediate_handle)
igd_device->vendor = pdev->vendor;
igd_device->device = pdev->device;
igd_device->handle = handle;
+   igd_device->dt_region_index = -1;
igd_device_array[handle] = igd_device;
set_bit(handle, igd_device_bits);
 
@@ -95,6 +119,7 @@ int igd_dt_open(struct pci_dev *pdev, u64 *caps, u32 
*mediate_handle)
pdev->vendor, pdev->device, handle);
 
*mediate_handle = handle;
+   *caps |= VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR;
 
 error:
mutex_unlock(&device_bit_lock);
@@ -135,14 +160,165 @@ static void igd_dt_get_region_info(int handle,
struct vfio_info_cap *caps,
struct vfio_region_info_cap_type *cap_type)
 {
+   struct vfio_region_info_cap_sparse_mmap *sparse;
+   size_t size;
+   int nr_areas, ret;
+
+   if (!is_handle_valid(handle))
+   return;
+
+   switch (info->index) {
+   case VFIO_PCI_BAR0_REGION_INDEX:
+   info->flags |= VFIO_REGION_INFO_FLAG_MMAP;
+   nr_areas = 1;
+
+   size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
+
+   sparse = kzalloc(size, GFP_KERNEL);
+   if (!sparse)
+   return;
+
+   sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+   sparse->header.version = 1;
+   sparse->nr_areas = nr_areas;
+
+   sparse->areas[0].offset = BAR0_DYNAMIC_TRAP_OFFSET;
+   sparse->areas[0].size = BAR0_DYNAMIC_TRAP_SIZE;
+   sparse->areas[0].disablable = 1;//able to get disabled
+
+   ret = vfio_info_add_capability(caps, &sparse->header,
+                                  size);
+   kfree(sparse);
+   break;
+   case VFIO_PCI_BAR1_REGION_INDEX ... VFIO_PCI_BAR5_REGION_INDEX:
+   case VFIO_PCI_CONFIG_REGION_INDEX:
+   case VFIO_PCI_ROM_REGION_INDEX:
+   case VFIO_PCI_VGA_REGION_INDEX:
+   break;
+   default:
+   if ((cap_type->type ==
+   VFIO_REGION_TYP

[libvirt] [RFC PATCH 1/9] vfio/pci: introduce mediate ops to intercept vfio-pci ops

2019-12-05 Thread Yan Zhao
When vfio-pci is bound to a physical device, almost all of the hardware
resources are passed through.
Sometimes, the vendor driver of this physical device may want to mediate
access to some hardware resources for a short period of time, e.g. for dirty
page tracking during live migration.

Here we introduce mediate ops in vfio-pci for this purpose.

A vendor driver can register its mediate ops with vfio-pci.
Rather than binding directly to the passed-through device, the vendor
driver is now either a module that does not bind to any device or a module
that binds to another device.
E.g. when passing through a VF device that is bound to the vfio-pci module,
the PF driver that binds to the PF device can register with vfio-pci to
mediate the VF's regions, hence supporting VF live migration.

The sequence goes like this:
1. The vendor driver registers its vfio_pci_mediate_ops with the vfio-pci
driver.

2. vfio-pci maintains a list of the registered vfio_pci_mediate_ops.

3. Whenever vfio-pci opens a device, it searches the list and calls
vfio_pci_mediate_ops->open() to check whether a vendor driver supports
mediating this device.
When vfio_pci_mediate_ops->open() returns success, vfio-pci stops
searching the list and stores a mediate handle that represents this open
in the vendor driver.
(So if multiple vendor drivers support mediating a device through
vfio_pci_mediate_ops, only one will win, depending on their registration
order.)

4. Whenever a VFIO_DEVICE_GET_REGION_INFO ioctl is received in vfio-pci
ops, it chains into vfio_pci_mediate_ops->get_region_info(), so that the
vendor driver is able to override a region's default flags and caps,
e.g. adding a sparse mmap cap to pass through only sub-regions of a whole
region.

5. vfio_pci_rw()/vfio_pci_mmap() first call into
vfio_pci_mediate_ops->rw()/vfio_pci_mediate_ops->mmap().
If pt=true is returned, vfio_pci_rw()/vfio_pci_mmap() further pass this
read/write/mmap through to the physical device; otherwise they just
return without touching the physical device.

6. When vfio-pci closes a device, vfio_pci_release() chains into
vfio_pci_mediate_ops->release() to close the reference in the vendor driver.

7. The vendor driver unregisters its vfio_pci_mediate_ops when the driver
exits.
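
Steps 1-3 above (register, keep a list, first successful open() wins) can be
sketched in user-space C as follows. This is a simplified model, not the
kernel code: struct mediate_ops_model, register_mediate_ops() and
find_mediator() are hypothetical names, locking and refcounting are omitted,
and open() takes bare vendor/device ids instead of a struct pci_dev.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-in for vfio_pci_mediate_ops. */
struct mediate_ops_model {
	const char *name;
	/* returns 0 if this driver is willing to mediate the device */
	int (*open)(uint16_t vendor, uint16_t device,
		    uint64_t *caps, uint32_t *handle);
	struct mediate_ops_model *next;
};

static struct mediate_ops_model *ops_head;
static struct mediate_ops_model **ops_tail = &ops_head;

/* Step 1/2: append to the list, preserving registration order. */
static void register_mediate_ops(struct mediate_ops_model *ops)
{
	assert(ops && ops->open);
	ops->next = NULL;
	*ops_tail = ops;
	ops_tail = &ops->next;
}

/* Step 3: walk the list and stop at the first open() that succeeds. */
static struct mediate_ops_model *
find_mediator(uint16_t vendor, uint16_t device,
	      uint64_t *caps, uint32_t *handle)
{
	struct mediate_ops_model *ops;

	for (ops = ops_head; ops; ops = ops->next) {
		*caps = 0;
		if (ops->open(vendor, device, caps, handle) == 0)
			return ops;
	}
	return NULL;
}

/* Two toy mediators: one for a single device id, one for a whole vendor. */
static int igd_open(uint16_t vendor, uint16_t device,
		    uint64_t *caps, uint32_t *handle)
{
	(void)caps;
	if (vendor != 0x8086 || device != 0x5927)
		return -1;
	*handle = 1;
	return 0;
}

static int any_intel_open(uint16_t vendor, uint16_t device,
			  uint64_t *caps, uint32_t *handle)
{
	(void)caps;
	(void)device;
	if (vendor != 0x8086)
		return -1;
	*handle = 2;
	return 0;
}
```

If both toy mediators are registered in the order igd, any-intel, device
8086:5927 is claimed by the first one even though the second would accept it
too, which is the "only one will win" behaviour noted in step 3.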

Cc: Kevin Tian 

Signed-off-by: Yan Zhao 
---
 drivers/vfio/pci/vfio_pci.c | 146 
 drivers/vfio/pci/vfio_pci_private.h |   2 +
 include/linux/vfio.h|  16 +++
 3 files changed, 164 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 02206162eaa9..55080ff29495 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -54,6 +54,14 @@ module_param(disable_idle_d3, bool, S_IRUGO | S_IWUSR);
 MODULE_PARM_DESC(disable_idle_d3,
 "Disable using the PCI D3 low power state for idle, unused 
devices");
 
+static LIST_HEAD(mediate_ops_list);
+static DEFINE_MUTEX(mediate_ops_list_lock);
+struct vfio_pci_mediate_ops_list_entry {
+   struct vfio_pci_mediate_ops *ops;
+   int refcnt;
+   struct list_head next;
+};
+
 static inline bool vfio_vga_disabled(void)
 {
 #ifdef CONFIG_VFIO_PCI_VGA
@@ -472,6 +480,10 @@ static void vfio_pci_release(void *device_data)
if (!(--vdev->refcnt)) {
vfio_spapr_pci_eeh_release(vdev->pdev);
vfio_pci_disable(vdev);
+   if (vdev->mediate_ops && vdev->mediate_ops->release) {
+   vdev->mediate_ops->release(vdev->mediate_handle);
+   vdev->mediate_ops = NULL;
+   }
}
 
mutex_unlock(&vdev->reflck->lock);
@@ -483,6 +495,7 @@ static int vfio_pci_open(void *device_data)
 {
struct vfio_pci_device *vdev = device_data;
int ret = 0;
+   struct vfio_pci_mediate_ops_list_entry *mentry;
 
if (!try_module_get(THIS_MODULE))
return -ENODEV;
@@ -495,6 +508,30 @@ static int vfio_pci_open(void *device_data)
goto error;
 
vfio_spapr_pci_eeh_open(vdev->pdev);
+   mutex_lock(&mediate_ops_list_lock);
+   list_for_each_entry(mentry, &mediate_ops_list, next) {
+   u64 caps;
+   u32 handle;
+
+   memset(&caps, 0, sizeof(caps));
+   ret = mentry->ops->open(vdev->pdev, &caps, &handle);
+   if (!ret)  {
+   vdev->mediate_ops = mentry->ops;
+   vdev->mediate_handle = handle;
+
+   pr_info("vfio pci found mediate_ops %s, 
caps=%llx, handle=%x for %x:%x\n",
+   vdev->mediate_ops->name, caps,
+   handle, vdev->pdev->vendor,
+   vdev->pdev->d

[libvirt] [RFC PATCH 5/9] samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD

2019-12-05 Thread Yan Zhao
This is a sample driver that uses mediate ops for passthrough IGDs.

This sample driver does not bind to the IGD device directly; instead it
defines which IGD devices it supports via a pciidlist.

It registers its vfio_pci_mediate_ops with vfio-pci on driver loading.

When vfio_pci->open() calls vfio_pci_mediate_ops->open(), the driver checks
the vendor id and device id of the pdev passed in. If they match an entry
in pciidlist, success is returned; otherwise, failure is returned.

After a successful vfio_pci_mediate_ops->open(), vfio-pci will further call
the .get_region_info/.rw/.mmap interfaces with a mediate handle for each
region, so that accesses to the regions get mediated/customized.

When vfio-pci->release() is called on the IGD, it first calls
vfio_pci_mediate_ops->release() with the mediate handle to close the
opened IGD device instance in this sample driver.

This sample driver unregisters its vfio_pci_mediate_ops on driver exit.
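
The mediate handle mentioned above is an index into a fixed-size table of
open devices, tracked with a bitmap (igd_device_bits in the patch). A minimal
user-space sketch of that bookkeeping is shown below. alloc_handle(),
handle_valid() and free_handle() are illustrative names, not functions from
the patch, and the mutex and the per-device array that the driver also uses
are left out.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

#define MAX_OPEN_DEVICE 256
#define BITS_PER_WORD   ((int)(sizeof(unsigned long) * CHAR_BIT))

/* One bit per possible handle; a set bit means the handle is in use. */
static unsigned long handle_bits[MAX_OPEN_DEVICE / BITS_PER_WORD + 1];

static bool handle_valid(int h)
{
	return h >= 0 && h < MAX_OPEN_DEVICE &&
	       (handle_bits[h / BITS_PER_WORD] & (1UL << (h % BITS_PER_WORD)));
}

/* Returns the lowest free handle and marks it used, or -1 if none is left. */
static int alloc_handle(void)
{
	int h;

	for (h = 0; h < MAX_OPEN_DEVICE; h++) {
		if (!handle_valid(h)) {
			handle_bits[h / BITS_PER_WORD] |=
				1UL << (h % BITS_PER_WORD);
			return h;
		}
	}
	return -1;
}

static void free_handle(int h)
{
	assert(handle_valid(h));
	handle_bits[h / BITS_PER_WORD] &= ~(1UL << (h % BITS_PER_WORD));
}
```

Because the handle, not the vendor/device id pair, identifies an open
instance, two opened devices with identical ids can still be told apart.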

Cc: Kevin Tian 

Signed-off-by: Yan Zhao 
---
 samples/Kconfig   |   6 ++
 samples/Makefile  |   1 +
 samples/vfio-pci/Makefile |   2 +
 samples/vfio-pci/igd_dt.c | 191 ++
 4 files changed, 200 insertions(+)
 create mode 100644 samples/vfio-pci/Makefile
 create mode 100644 samples/vfio-pci/igd_dt.c

diff --git a/samples/Kconfig b/samples/Kconfig
index c8dacb4dda80..2da42a725c03 100644
--- a/samples/Kconfig
+++ b/samples/Kconfig
@@ -169,4 +169,10 @@ config SAMPLE_VFS
  as mount API and statx().  Note that this is restricted to the x86
  arch whilst it accesses system calls that aren't yet in all arches.
 
+config SAMPLE_VFIO_PCI_IGD_DT
+   tristate "Build example driver to dynamically trap a passthroughed device bound to VFIO-PCI -- loadable modules only"
+   depends on VFIO_PCI && m
+   help
+     Build a sample driver to show how to dynamically trap a passthroughed device bound to VFIO-PCI
+
 endif # SAMPLES
diff --git a/samples/Makefile b/samples/Makefile
index 7d6e4ca28d69..f0f422e7dd11 100644
--- a/samples/Makefile
+++ b/samples/Makefile
@@ -18,5 +18,6 @@ subdir-$(CONFIG_SAMPLE_SECCOMP)   += seccomp
 obj-$(CONFIG_SAMPLE_TRACE_EVENTS)  += trace_events/
 obj-$(CONFIG_SAMPLE_TRACE_PRINTK)  += trace_printk/
 obj-$(CONFIG_VIDEO_PCI_SKELETON)   += v4l/
+obj-$(CONFIG_SAMPLE_VFIO_PCI_IGD_DT)   += vfio-pci/
 obj-y  += vfio-mdev/
 subdir-$(CONFIG_SAMPLE_VFS)+= vfs
diff --git a/samples/vfio-pci/Makefile b/samples/vfio-pci/Makefile
new file mode 100644
index ..4b8acc145d65
--- /dev/null
+++ b/samples/vfio-pci/Makefile
@@ -0,0 +1,2 @@
+# SPDX-License-Identifier: GPL-2.0-only
+obj-$(CONFIG_SAMPLE_VFIO_PCI_IGD_DT) += igd_dt.o
diff --git a/samples/vfio-pci/igd_dt.c b/samples/vfio-pci/igd_dt.c
new file mode 100644
index ..857e8d01b0d1
--- /dev/null
+++ b/samples/vfio-pci/igd_dt.c
@@ -0,0 +1,191 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Dynamic trap IGD device that bound to vfio-pci device driver
+ * Copyright(c) 2019 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of the GNU General Public License version 2 as
+ * published by the Free Software Foundation.
+ */
+
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+#include 
+
+#define VERSION_STRING  "0.1"
+#define DRIVER_AUTHOR   "Intel Corporation"
+
+/* helper macros copied from vfio-pci */
+#define VFIO_PCI_OFFSET_SHIFT   40
+#define VFIO_PCI_OFFSET_TO_INDEX(off)   ((off) >> VFIO_PCI_OFFSET_SHIFT)
+#define VFIO_PCI_OFFSET_MASK    (((u64)(1) << VFIO_PCI_OFFSET_SHIFT) - 1)
+
+/* This driver supports at most 256 open devices */
+#define MAX_OPEN_DEVICE 256
+
+/*
+ * below are pciids of two IGD devices supported in this driver
+ * It is only for demo purpose.
+ * You can add more device ids in this list to support any pci devices
+ * that you want to dynamically trap its pci bars
+ */
+static const struct pci_device_id pciidlist[] = {
+   {0x8086, 0x5927, ~0, ~0, 0x3, 0xff, 0},
+   {0x8086, 0x193b, ~0, ~0, 0x3, 0xff, 0},
+};
+
+static long igd_device_bits[MAX_OPEN_DEVICE/BITS_PER_LONG + 1];
+static DEFINE_MUTEX(device_bit_lock);
+
+struct igd_dt_device {
+   __u32 vendor;
+   __u32 device;
+   __u32 handle;
+};
+
+static struct igd_dt_device *igd_device_array[MAX_OPEN_DEVICE];
+
+int igd_dt_open(struct pci_dev *pdev, u64 *caps, u32 *mediate_handle)
+{
+   int supported_dev_cnt = sizeof(pciidlist)/sizeof(struct pci_device_id);
+   int i, ret = 0;
+   struct igd_dt_device *igd_device;
+   int handle;
+
+   if (!try_module_get(THIS_MODULE))
+   return -ENODEV;
+
+   for (i = 0; i < supported_dev_cnt; i++) {
+   if (pciidlist[i].vendor == pdev->vendor &&
+   pciidlist[i].device == pde

Re: [libvirt] [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

2019-12-05 Thread Yan Zhao
On Thu, Dec 05, 2019 at 02:33:19PM +0800, Jason Wang wrote:
> Hi:
> 
> On 2019/12/5 上午11:24, Yan Zhao wrote:
> > For SRIOV devices, VFs are passthroughed into guest directly without host
> > driver mediation. However, when VMs migrating with passthroughed VFs,
> > dynamic host mediation is required to  (1) get device states, (2) get
> > dirty pages. Since device states as well as other critical information
> > required for dirty page tracking for VFs are usually retrieved from PFs,
> > it is handy to provide an extension in PF driver to centralizingly control
> > VFs' migration.
> > 
> > Therefore, in order to realize (1) passthrough VFs at normal time, (2)
> > dynamically trap VFs' bars for dirty page tracking and
> 
> 
> A silly question, what's the reason for doing this, is this a must for dirty
> page tracking?
>
For performance reasons. VFs' bars should be passed through in normal
operation and only enter the trapped state when needed.

> 
> >   (3) centralizing
> > VF critical states retrieving and VF controls into one driver, we propose
> > to introduce mediate ops on top of current vfio-pci device driver.
> > 
> > 
> >                                    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
> >  __________   register mediate ops|  ___________     ___________    |
> > |          |<-----------------------|     VF    |   |           |
> > | vfio-pci |                      | |  mediate  |   | PF driver |   |
> > |__________|----------------------->|   driver  |   |___________|
> >      |            open(pdev)      |  -----------          |         |
> >      |                                                    |
> >      |                            |_ _ _ _ _ _ _ _ _ _ _ _|_ _ _ _ _|
> >     \|/                                                  \|/
> > -----------                                         ------------
> > |    VF   |                                         |    PF    |
> > -----------                                         ------------
> > 
> > 
> > VF mediate driver could be a standalone driver that does not bind to
> > any devices (as in demo code in patches 5-6) or it could be a built-in
> > extension of PF driver (as in patches 7-9) .
> > 
> > Rather than directly bind to VF, VF mediate driver register a mediate
> > ops into vfio-pci in driver init. vfio-pci maintains a list of such
> > mediate ops.
> > (Note that: VF mediate driver can register mediate ops into vfio-pci
> > before vfio-pci binding to any devices. And VF mediate driver can
> > support mediating multiple devices.)
> > 
> > When opening a device (e.g. a VF), vfio-pci goes through the mediate ops
> > list and calls each vfio_pci_mediate_ops->open() with pdev of the opening
> > device as a parameter.
> > VF mediate driver should return success or failure depending on it
> > supports the pdev or not.
> > E.g. VF mediate driver would compare its supported VF devfn with the
> > devfn of the passed-in pdev.
> > Once vfio-pci finds a successful vfio_pci_mediate_ops->open(), it will
> > stop querying other mediate ops and bind the opening device with this
> > mediate ops using the returned mediate handle.
> > 
> > Further vfio-pci ops (VFIO_DEVICE_GET_REGION_INFO ioctl, rw, mmap) on the
> > VF will be intercepted into VF mediate driver as
> > vfio_pci_mediate_ops->get_region_info(),
> > vfio_pci_mediate_ops->rw,
> > vfio_pci_mediate_ops->mmap, and get customized.
> > For vfio_pci_mediate_ops->rw and vfio_pci_mediate_ops->mmap, they will
> > further return 'pt' to indicate whether vfio-pci should further
> > passthrough data to hw.
> > 
> > when vfio-pci closes the VF, it calls its vfio_pci_mediate_ops->release()
> > with a mediate handle as parameter.
> > 
> > The mediate handle returned from vfio_pci_mediate_ops->open() lets VF
> > mediate driver be able to differentiate two opening VFs of the same device
> > id and vendor id.
> > 
> > When VF mediate driver exits, it unregisters its mediate ops from
> > vfio-pci.
> > 
> > 
> > In this patchset, we enable vfio-pci to provide 3 things:
> > (1) calling mediate ops to allow vendor driver customizing default
> > region info/rw/mmap of a region.
> > (2) provide a migration region to support migration
> 
> 
> What's the benefit of introducing a region? It looks to me we don't expect
> the region to be accessed directly from guest. Could we simply extend device
> fd ioctl for doing such things?
>
You may take a look at the mdev live migration discussions in
https

[libvirt] [RFC PATCH 0/9] Introduce mediate ops in vfio-pci

2019-12-05 Thread Yan Zhao
ops registered by vendor
  driver to mediate/customize region info/rw/mmap.

- patches 5-6 provide a standalone sample driver to register a mediate ops
  for Intel Graphics Devices. It does not bind to IGDs directly but decides
  what devices it supports via its pciidlist. It also demonstrates how to
  dynamically trap a device's PCI bars. (By adding more pciids to its
  pciidlist, this sample driver is not necessarily limited to supporting
  IGDs.)

- patches 7-9 provide a sample on the i40e driver, which supports the
  Intel(R) Ethernet Controller XL710 Family of devices. It supports VF
  precopy live migration on Intel's 710 SRIOV. (We commented out the real
  implementation of the dirty page tracking and device state retrieval
  parts to focus on demonstrating the framework part. We will send them
  out in future versions.)
 
  patch 7 registers/unregisters VF mediate ops when PF driver
  probes/removes. It specifies its supporting VFs via
  vfio_pci_mediate_ops->open(pdev)

  patch 8 reports device cap of VFIO_PCI_DEVICE_CAP_MIGRATION and
  provides a sample implementation of migration region.
  The QEMU part of vfio migration is based on v8
  https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html.
  We did not base it on the recent v9 because we think there are still open
  issues in the dirty page tracking part of that series.

  patch 9 reports device cap of VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR and
  provides an example on how to trap part of bar0 when migration starts
  and passthrough this part of bar0 again when migration fails.

Yan Zhao (9):
  vfio/pci: introduce mediate ops to intercept vfio-pci ops
  vfio/pci: test existence before calling region->ops
  vfio/pci: register a default migration region
  vfio-pci: register default dynamic-trap-bar-info region
  samples/vfio-pci/igd_dt: sample driver to mediate a passthrough IGD
  sample/vfio-pci/igd_dt: dynamically trap/untrap subregion of IGD bar0
  i40e/vf_migration: register mediate_ops to vfio-pci
  i40e/vf_migration: mediate migration region
  i40e/vf_migration: support dynamic trap of bar0

 drivers/net/ethernet/intel/Kconfig|   2 +-
 drivers/net/ethernet/intel/i40e/Makefile  |   3 +-
 drivers/net/ethernet/intel/i40e/i40e.h|   2 +
 drivers/net/ethernet/intel/i40e/i40e_main.c   |   3 +
 .../ethernet/intel/i40e/i40e_vf_migration.c   | 626 ++
 .../ethernet/intel/i40e/i40e_vf_migration.h   |  78 +++
 drivers/vfio/pci/vfio_pci.c   | 189 +-
 drivers/vfio/pci/vfio_pci_private.h   |   2 +
 include/linux/vfio.h  |  18 +
 include/uapi/linux/vfio.h | 160 +
 samples/Kconfig   |   6 +
 samples/Makefile  |   1 +
 samples/vfio-pci/Makefile |   2 +
 samples/vfio-pci/igd_dt.c | 367 ++
 14 files changed, 1455 insertions(+), 4 deletions(-)
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
 create mode 100644 drivers/net/ethernet/intel/i40e/i40e_vf_migration.h
 create mode 100644 samples/vfio-pci/Makefile
 create mode 100644 samples/vfio-pci/igd_dt.c

-- 
2.17.1



[libvirt] [RFC PATCH 8/9] i40e/vf_migration: mediate migration region

2019-12-05 Thread Yan Zhao
In vfio_pci_mediate_ops->get_region_info(), the migration region's len and
flags are overridden and its region index is saved.

vfio_pci_mediate_ops->rw() and vfio_pci_mediate_ops->mmap() override the
default rw/mmap for the migration region.

This is only a sample implementation in i40e vf migration to demonstrate
what the vf migration code will look like. The actual dirty page tracking
and device state retrieval code will be sent in the future. Currently only
comments are used as placeholders.

It's based on QEMU vfio migration code v8:
(https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html).

Cc: Shaopeng He 

Signed-off-by: Yan Zhao 
---
 .../ethernet/intel/i40e/i40e_vf_migration.c   | 335 +-
 .../ethernet/intel/i40e/i40e_vf_migration.h   |  14 +
 2 files changed, 345 insertions(+), 4 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c 
b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
index b2d913459600..5bb509fed66e 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
@@ -14,6 +14,55 @@ static long open_device_bits[MAX_OPEN_DEVICE / BITS_PER_LONG 
+ 1];
 static DEFINE_MUTEX(device_bit_lock);
 static struct i40e_vf_migration *i40e_vf_dev_array[MAX_OPEN_DEVICE];
 
+static bool is_handle_valid(int handle)
+{
+   mutex_lock(&device_bit_lock);
+
+   if (handle >= MAX_OPEN_DEVICE || !i40e_vf_dev_array[handle] ||
+       !test_bit(handle, open_device_bits)) {
+       pr_err("%s: handle mismatch, please check interaction with vfio-pci module\n",
+              __func__);
+       mutex_unlock(&device_bit_lock);
+       return false;
+   }
+   mutex_unlock(&device_bit_lock);
+   return true;
+}
+
+static size_t set_device_state(struct i40e_vf_migration *i40e_vf_dev, u32 
state)
+{
+   int ret = 0;
+   struct vfio_device_migration_info *mig_ctl = i40e_vf_dev->mig_ctl;
+
+   if (state == mig_ctl->device_state)
+   return ret;
+
+   switch (state) {
+   case VFIO_DEVICE_STATE_RUNNING:
+   break;
+   case VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING:
+   // alloc dirty page tracking resources and
+   // do the first round dirty page scanning
+   break;
+   case VFIO_DEVICE_STATE_SAVING:
+   // do the last round of dirty page scanning
+   break;
+   case ~VFIO_DEVICE_STATE_MASK & VFIO_DEVICE_STATE_MASK:
+   // release dirty page tracking resources
+   //if (mig_ctl->device_state == VFIO_DEVICE_STATE_SAVING)
+   //  i40e_release_scan_resources(i40e_vf_dev);
+   break;
+   case VFIO_DEVICE_STATE_RESUMING:
+   break;
+   default:
+   ret = -EFAULT;
+   }
+
+   mig_ctl->device_state = state;
+
+   return ret;
+}
+
 int i40e_vf_migration_open(struct pci_dev *pdev, u64 *caps, u32 *dm_handle)
 {
int i, ret = 0;
@@ -24,6 +73,8 @@ int i40e_vf_migration_open(struct pci_dev *pdev, u64 *caps, 
u32 *dm_handle)
struct i40e_vf *vf;
unsigned int vf_devfn, devfn;
int vf_id = -1;
+   struct vfio_device_migration_info *mig_ctl = NULL;
+   void *dirty_bitmap_base = NULL;
 
if (!try_module_get(THIS_MODULE))
return -ENODEV;
@@ -68,18 +119,41 @@ int i40e_vf_migration_open(struct pci_dev *pdev, u64 
*caps, u32 *dm_handle)
i40e_vf_dev->vf_dev = vf_dev;
i40e_vf_dev->handle = handle;
 
-   pr_info("%s: device %x %x, vf id %d, handle=%x\n",
-   __func__, pdev->vendor, pdev->device, vf_id, handle);
+   mig_ctl = kzalloc(sizeof(*mig_ctl), GFP_KERNEL);
+   if (!mig_ctl) {
+   ret = -ENOMEM;
+   goto error;
+   }
+
+   dirty_bitmap_base = vmalloc_user(MIGRATION_DIRTY_BITMAP_SIZE);
+   if (!dirty_bitmap_base) {
+   ret = -ENOMEM;
+   goto error;
+   }
+
+   i40e_vf_dev->dirty_bitmap = dirty_bitmap_base;
+   i40e_vf_dev->mig_ctl = mig_ctl;
+   i40e_vf_dev->migration_region_size = DIRTY_BITMAP_OFFSET +
+   MIGRATION_DIRTY_BITMAP_SIZE;
+   i40e_vf_dev->migration_region_index = -1;
+
+   vf = >vf[vf_id];
 
i40e_vf_dev_array[handle] = i40e_vf_dev;
set_bit(handle, open_device_bits);
-   vf = >vf[vf_id];
*dm_handle = handle;
+
+   *caps |= VFIO_PCI_DEVICE_CAP_MIGRATION;
+
+   pr_info("%s: device %x %x, vf id %d, handle=%x\n",
+   __func__, pdev->vendor, pdev->device, vf_id, handle);
 error:
mutex_unlock(&device_bit_lock);
 
if (ret < 0) {
module_put(THIS_MODULE);
+   kfree(mig_ctl);
+   vfree(dirty_bitmap_base);
kfree(i40e_vf_dev);
}
 
@@ -112,32 +186,285 @@ void i40

[libvirt] [RFC PATCH 3/9] vfio/pci: register a default migration region

2019-12-05 Thread Yan Zhao
The vendor driver specifies whether it supports a migration region through
the cap VFIO_PCI_DEVICE_CAP_MIGRATION in vfio_pci_mediate_ops->open().

If vfio-pci detects this cap, it creates a default migration region on
behalf of the vendor driver with region len=0 and region->ops=NULL.
The vendor driver should override this region's len, flags, rw and mmap in
its vfio_pci_mediate_ops.

This migration region definition is aligned to QEMU vfio migration code v8:
(https://lists.gnu.org/archive/html/qemu-devel/2019-08/msg05542.html)
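
The uapi comment added by this patch describes device_state as a bit field:
bit 0 is _RUNNING, bit 1 is _SAVING, bit 2 is _RESUMING, bits 3-31 are
reserved, and _SAVING together with _RESUMING is invalid. The sketch below
shows one way a user could validate and update such a value. The MODEL_*
macros and both helper functions are illustrative names chosen here, not
definitions from the header.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Bit positions as described in the header comment; names are local. */
#define MODEL_STATE_RUNNING  (1u << 0)
#define MODEL_STATE_SAVING   (1u << 1)
#define MODEL_STATE_RESUMING (1u << 2)
#define MODEL_STATE_MASK     (MODEL_STATE_RUNNING | MODEL_STATE_SAVING | \
			      MODEL_STATE_RESUMING)

/* _SAVING and _RESUMING set at the same time is the invalid combination. */
static bool model_state_valid(uint32_t state)
{
	const uint32_t both = MODEL_STATE_SAVING | MODEL_STATE_RESUMING;

	return (state & both) != both;
}

/*
 * Read-modify-write: replace the three defined bits and keep the reserved
 * bits 3-31 exactly as they were read.
 */
static uint32_t model_state_update(uint32_t cur, uint32_t new_bits)
{
	assert((new_bits & ~MODEL_STATE_MASK) == 0);
	return (cur & ~MODEL_STATE_MASK) | new_bits;
}
```

For example, moving from _RUNNING to _SAVING | _RUNNING would be
model_state_update(cur, MODEL_STATE_SAVING | MODEL_STATE_RUNNING), checked
with model_state_valid() before the value is written back.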

Cc: Kevin Tian 

Signed-off-by: Yan Zhao 
---
 drivers/vfio/pci/vfio_pci.c |  15 
 include/linux/vfio.h|   1 +
 include/uapi/linux/vfio.h   | 149 
 3 files changed, 165 insertions(+)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index f3730252ee82..059660328be2 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -115,6 +115,18 @@ static inline bool vfio_pci_is_vga(struct pci_dev *pdev)
return (pdev->class >> 8) == PCI_CLASS_DISPLAY_VGA;
 }
 
+/**
+ * init a region to hold migration ctl & data
+ */
+void init_migration_region(struct vfio_pci_device *vdev)
+{
+   vfio_pci_register_dev_region(vdev, VFIO_REGION_TYPE_MIGRATION,
+   VFIO_REGION_SUBTYPE_MIGRATION,
+   NULL, 0,
+   VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
+   NULL);
+}
+
 static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 {
struct resource *res;
@@ -523,6 +535,9 @@ static int vfio_pci_open(void *device_data)
vdev->mediate_ops = mentry->ops;
vdev->mediate_handle = handle;
 
+   if (caps & VFIO_PCI_DEVICE_CAP_MIGRATION)
+   init_migration_region(vdev);
+
pr_info("vfio pci found mediate_ops %s, 
caps=%llx, handle=%x for %x:%x\n",
vdev->mediate_ops->name, caps,
handle, vdev->pdev->vendor,
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index 0265e779acd1..cddea8e9dcb2 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -197,6 +197,7 @@ extern void vfio_virqfd_disable(struct virqfd **pvirqfd);
 
 struct vfio_pci_mediate_ops {
char *name;
+#define VFIO_PCI_DEVICE_CAP_MIGRATION (0x01)
int (*open)(struct pci_dev *pdev, u64 *caps, u32 *handle);
void (*release)(int handle);
void (*get_region_info)(int handle,
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index 9e843a147ead..caf8845a67a6 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -306,6 +306,155 @@ struct vfio_region_info_cap_type {
 #define VFIO_REGION_TYPE_GFX(1)
 #define VFIO_REGION_TYPE_CCW   (2)
 
+/* Migration region type and sub-type */
+#define VFIO_REGION_TYPE_MIGRATION  (3)
+#define VFIO_REGION_SUBTYPE_MIGRATION   (1)
+
+/**
+ * Structure vfio_device_migration_info is placed at 0th offset of
+ * VFIO_REGION_SUBTYPE_MIGRATION region to get/set VFIO device related
+ * migration information. Field accesses from this structure are only
+ * supported at their native width and alignment, otherwise the result is
+ * undefined and vendor drivers should return an error.
+ *
+ * device_state: (read/write)
+ *  To indicate vendor driver the state VFIO device should be transitioned
+ *  to. If device state transition fails, write on this field return error.
+ *  It consists of 3 bits:
+ *  - If bit 0 set, indicates _RUNNING state. When it's reset, that
+ *    indicates _STOPPED state. When device is changed to _STOPPED, driver
+ *    should stop device before write() returns.
+ *  - If bit 1 set, indicates _SAVING state.
+ *  - If bit 2 set, indicates _RESUMING state.
+ *  Bits 3 - 31 are reserved for future use. User should perform
+ *  read-modify-write operation on this field.
+ *  _SAVING and _RESUMING bits set at the same time is invalid state.
+ *
+ * pending bytes: (read only)
+ *  Number of pending bytes yet to be migrated from vendor driver
+ *
+ * data_offset: (read only)
+ *  User application should read data_offset in migration region from where
+ *  user application should read device data during _SAVING state or write
+ *  device data during _RESUMING state or read dirty pages bitmap. See
+ *  below for detail of sequence to be followed.
+ *
+ * data_size: (read/write)
+ *  User application should read data_size to get size of data copied in
+ *  migration region during _SAVING state and write size of data copied in
+ *  migration region during _RESUMING state.
+ *
+ * start_pfn: (write only)
+ *  Start address pfn to get bitmap of dirty

[libvirt] [RFC PATCH 9/9] i40e/vf_migration: support dynamic trap of bar0

2019-12-05 Thread Yan Zhao
Mediate the dynamic_trap_info region to dynamically trap bar0.

bar0 is sparsely mmaped into 5 sub-regions, of which only two need to be
dynamically trapped.
By mediating the dynamic_trap_info region and passing this information to
QEMU, the two sub-regions of bar0 can be trapped when migration starts and
passed through again when migration fails.
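
The five-way split can be pictured as two page-sized, toggleable windows
with always-mapped areas before, between and after them. The sketch below
computes such a layout in user-space C. struct area_model and split_bar()
are hypothetical names, the offsets passed in are placeholders rather than
the real IAVF register offsets, and the sketch assumes the five areas tile
the BAR contiguously.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative stand-in for one sparse mmap area with a 'disablable' flag. */
struct area_model {
	uint64_t offset;
	uint64_t size;
	bool disablable;
};

/*
 * Split [0, bar_size) into five areas around two page-sized windows that
 * start at t1 and t2. Only the two windows are marked disablable.
 * Returns 0 on success, -1 if the windows do not fit in that order.
 */
static int split_bar(uint64_t bar_size, uint64_t page,
		     uint64_t t1, uint64_t t2, struct area_model out[5])
{
	assert(out);

	if (page == 0 || t1 == 0 || t1 + page >= t2 || t2 + page >= bar_size)
		return -1;

	out[0] = (struct area_model){ 0, t1, false };
	out[1] = (struct area_model){ t1, page, true };
	out[2] = (struct area_model){ t1 + page, t2 - (t1 + page), false };
	out[3] = (struct area_model){ t2, page, true };
	out[4] = (struct area_model){ t2 + page, bar_size - (t2 + page), false };
	return 0;
}
```

Keeping the toggleable windows at page granularity means only the registers
that must be observed during migration lose their direct mapping; the rest
of bar0 stays mmaped throughout.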

Cc: Shaopeng He 

Signed-off-by: Yan Zhao 
---
 .../ethernet/intel/i40e/i40e_vf_migration.c   | 140 +-
 .../ethernet/intel/i40e/i40e_vf_migration.h   |  12 ++
 2 files changed, 147 insertions(+), 5 deletions(-)

diff --git a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c 
b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
index 5bb509fed66e..0b9d5be85049 100644
--- a/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
+++ b/drivers/net/ethernet/intel/i40e/i40e_vf_migration.c
@@ -29,6 +29,21 @@ static bool is_handle_valid(int handle)
return true;
 }
 
+static
+void i40e_vf_migration_dynamic_trap_bar(struct i40e_vf_migration *i40e_vf_dev)
+{
+   if (i40e_vf_dev->dt_trigger)
+   eventfd_signal(i40e_vf_dev->dt_trigger, 1);
+}
+
+static void i40e_vf_trap_bar0(struct i40e_vf_migration *i40e_vf_dev, bool trap)
+{
+   if (i40e_vf_dev->trap_bar0 != trap) {
+   i40e_vf_dev->trap_bar0 = trap;
+   i40e_vf_migration_dynamic_trap_bar(i40e_vf_dev);
+   }
+}
+
 static size_t set_device_state(struct i40e_vf_migration *i40e_vf_dev, u32 
state)
 {
int ret = 0;
@@ -39,8 +54,10 @@ static size_t set_device_state(struct i40e_vf_migration 
*i40e_vf_dev, u32 state)
 
switch (state) {
case VFIO_DEVICE_STATE_RUNNING:
+   i40e_vf_trap_bar0(i40e_vf_dev, false);
break;
case VFIO_DEVICE_STATE_SAVING | VFIO_DEVICE_STATE_RUNNING:
+   i40e_vf_trap_bar0(i40e_vf_dev, true);
// alloc dirty page tracking resources and
// do the first round dirty page scanning
break;
@@ -137,16 +154,22 @@ int i40e_vf_migration_open(struct pci_dev *pdev, u64 
*caps, u32 *dm_handle)
MIGRATION_DIRTY_BITMAP_SIZE;
i40e_vf_dev->migration_region_index = -1;
 
+   i40e_vf_dev->dt_region_index = -1;
+   i40e_vf_dev->trap_bar0 = false;
+
vf = >vf[vf_id];
 
i40e_vf_dev_array[handle] = i40e_vf_dev;
set_bit(handle, open_device_bits);
+
*dm_handle = handle;
 
*caps |= VFIO_PCI_DEVICE_CAP_MIGRATION;
+   *caps |= VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR;
 
pr_info("%s: device %x %x, vf id %d, handle=%x\n",
__func__, pdev->vendor, pdev->device, vf_id, handle);
+
 error:
mutex_unlock(&device_bit_lock);
 
@@ -188,6 +211,10 @@ void i40e_vf_migration_release(int handle)
 
kfree(i40e_vf_dev->mig_ctl);
vfree(i40e_vf_dev->dirty_bitmap);
+
+   if (i40e_vf_dev->dt_trigger)
+   eventfd_ctx_put(i40e_vf_dev->dt_trigger);
+
kfree(i40e_vf_dev);
 
module_put(THIS_MODULE);
@@ -216,6 +243,47 @@ static void migration_region_sparse_mmap_cap(struct 
vfio_info_cap *caps)
kfree(sparse);
 }
 
+static void bar0_sparse_mmap_cap(struct vfio_region_info *info,
+struct vfio_info_cap *caps)
+{
+   struct vfio_region_info_cap_sparse_mmap *sparse;
+   size_t size;
+   int nr_areas = 5;
+
+   size = sizeof(*sparse) + (nr_areas * sizeof(*sparse->areas));
+
+   sparse = kzalloc(size, GFP_KERNEL);
+   if (!sparse)
+   return;
+
+   sparse->header.id = VFIO_REGION_INFO_CAP_SPARSE_MMAP;
+   sparse->header.version = 1;
+   sparse->nr_areas = nr_areas;
+
+   sparse->areas[0].offset = 0;
+   sparse->areas[0].size = IAVF_VF_TAIL_START;
+   sparse->areas[0].disablable = 0;//always mmaped, cannot be disabled
+
+   sparse->areas[1].offset = IAVF_VF_TAIL_START;
+   sparse->areas[1].size = PAGE_SIZE;
+   sparse->areas[1].disablable = 1;//able to get toggled
+
+   sparse->areas[2].offset = IAVF_VF_TAIL_START + PAGE_SIZE;
+   sparse->areas[2].size = IAVF_VF_ARQH1 - sparse->areas[2].offset;
+   sparse->areas[2].disablable = 0;//always mmaped, cannot be disabled
+
+   sparse->areas[3].offset = IAVF_VF_ARQT1;
+   sparse->areas[3].size = PAGE_SIZE;
+   sparse->areas[3].disablable = 1;//able to get toggled
+
+   sparse->areas[4].offset = IAVF_VF_ARQT1 + PAGE_SIZE;
+   sparse->areas[4].size = info->size - sparse->areas[4].offset;
+   sparse->areas[4].disablable = 0;//always mmaped, cannot be disabled
+
+   vfio_info_add_capability(caps, &sparse->header, size);
+   kfree(sparse);
+}
+
 static void
 i40e_vf_migration_get_region_info(int handle,
  struct vfio_region_info *info,
@@ -227,9 +295,8 @@ i40e_vf_migration_get_region_info(int handle,
 
   

[libvirt] [RFC PATCH 4/9] vfio-pci: register default dynamic-trap-bar-info region

2019-12-05 Thread Yan Zhao
The dynamic trap bar info region is a channel for QEMU and the vendor driver
to communicate dynamic trap info. It is of type
VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO and subtype
VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO.

This region has two fields: dt_fd and trap.
When QEMU detects a device region of this type, it creates an
eventfd and writes the eventfd's fd to the dt_fd field.
When the vendor driver signals this eventfd, QEMU reads the trap field of
this info region.
- If trap is true, QEMU searches the device's PCI BAR
regions and disables all the sparse mmaped subregions (if the sparse
mmaped subregion is disablable).
- If trap is false, QEMU re-enables those subregions.

A typical usage is
1. vendor driver first cuts its bar 0 into several sections, all in a
sparse mmap array. So initially, all of bar 0 is passed through.
2. vendor driver specifies part of the bar 0 sections to be disablable.
3. when migration starts, vendor driver signals dt_fd and sets trap to true
to notify QEMU to disable the bar 0 sections that have the disablable flag on.
4. QEMU disables those bar 0 sections and hence lets the vendor driver
trap accesses to bar 0 registers, which makes dirty page tracking possible.
5. on migration failure, vendor driver signals dt_fd to QEMU again.
QEMU reads the trap field of this info region, which is false, and
passes through the whole bar 0 region again.
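
The steps above can be sketched from the userspace side roughly as
follows. This is only a small self-contained model, not QEMU code:
struct bar_area, its mapped field and apply_trap() are hypothetical names
made up for illustration. Only the disablable flag and the meaning of
trap come from the description above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical userspace view of one sparse-mmap area of BAR 0. */
struct bar_area {
	uint64_t offset;
	uint64_t size;
	bool disablable;	/* mirrors the proposed 'disablable' flag */
	bool mapped;		/* true: passed through; false: accesses trap */
};

/*
 * Apply the 'trap' value read from the info region after dt_fd fires.
 * trap == true  -> unmap every disablable area so accesses are trapped.
 * trap == false -> map every area again (full passthrough).
 * Returns the number of areas whose state changed.
 */
size_t apply_trap(struct bar_area *areas, size_t nr, bool trap)
{
	size_t changed = 0;

	for (size_t i = 0; i < nr; i++) {
		bool want_mapped = !(trap && areas[i].disablable);

		if (areas[i].mapped != want_mapped) {
			areas[i].mapped = want_mapped;
			changed++;
		}
	}
	return changed;
}
```

Areas with disablable == 0 are never touched, so a second signal with the
same trap value is a no-op.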

Vendor driver specifies whether it supports dynamic-trap-bar-info region
through cap VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR in
vfio_pci_mediate_ops->open().

If vfio-pci detects this cap, it will create a default
dynamic_trap_bar_info region on behalf of vendor driver with region len=0
and region->ops=null.
Vendor driver should override this region's len, flags, rw, mmap in its
vfio_pci_mediate_ops.

Cc: Kevin Tian 

Signed-off-by: Yan Zhao 
---
 drivers/vfio/pci/vfio_pci.c | 16 
 include/linux/vfio.h|  3 ++-
 include/uapi/linux/vfio.h   | 11 +++
 3 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 059660328be2..62b811ca43e4 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -127,6 +127,19 @@ void init_migration_region(struct vfio_pci_device *vdev)
NULL);
 }
 
+/**
+ * register a region to hold info for dynamically trapping bar regions
+ */
+void init_dynamic_trap_bar_info_region(struct vfio_pci_device *vdev)
+{
+   vfio_pci_register_dev_region(vdev,
+   VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO,
+   VFIO_REGION_SUBTYPE_DYNAMIC_TRAP_BAR_INFO,
+   NULL, 0,
+   VFIO_REGION_INFO_FLAG_READ | VFIO_REGION_INFO_FLAG_WRITE,
+   NULL);
+}
+
 static void vfio_pci_probe_mmaps(struct vfio_pci_device *vdev)
 {
struct resource *res;
@@ -538,6 +551,9 @@ static int vfio_pci_open(void *device_data)
if (caps & VFIO_PCI_DEVICE_CAP_MIGRATION)
init_migration_region(vdev);
 
+   if (caps & VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR)
+   init_dynamic_trap_bar_info_region(vdev);
+
pr_info("vfio pci found mediate_ops %s, caps=%llx, handle=%x for %x:%x\n",
vdev->mediate_ops->name, caps,
handle, vdev->pdev->vendor,
diff --git a/include/linux/vfio.h b/include/linux/vfio.h
index cddea8e9dcb2..cf8ecf687bee 100644
--- a/include/linux/vfio.h
+++ b/include/linux/vfio.h
@@ -197,7 +197,8 @@ extern void vfio_virqfd_disable(struct virqfd **pvirqfd);
 
 struct vfio_pci_mediate_ops {
char*name;
-#define VFIO_PCI_DEVICE_CAP_MIGRATION (0x01)
+#define VFIO_PCI_DEVICE_CAP_MIGRATION  (0x01)
+#define VFIO_PCI_DEVICE_CAP_DYNAMIC_TRAP_BAR   (0x02)
int (*open)(struct pci_dev *pdev, u64 *caps, u32 *handle);
void(*release)(int handle);
void(*get_region_info)(int handle,
diff --git a/include/uapi/linux/vfio.h b/include/uapi/linux/vfio.h
index caf8845a67a6..74a2d0b57741 100644
--- a/include/uapi/linux/vfio.h
+++ b/include/uapi/linux/vfio.h
@@ -258,6 +258,9 @@ struct vfio_region_info {
 struct vfio_region_sparse_mmap_area {
__u64   offset; /* Offset of mmap'able area within region */
__u64   size;   /* Size of mmap'able area */
+   __u32   disablable; /* whether this mmap'able area can
+*  be dynamically disabled
+*/
 };
 
 struct vfio_region_info_cap_sparse_mmap {
@@ -454,6 +457,14 @@ struct vfio_device_migration_info {
 #define VFIO_DEVICE_DIRTY_PFNS_ALL (~0ULL)
 } __attribute__((packed));
 
+/* Region type and sub-type to hold info to dynamically trap bars */
+#define VFIO_REGION_TYPE_DYNAMIC_TRAP_BAR_INFO (4)
+#define VFIO_REGION_SUBTYPE_DYNA

[libvirt] [RFC PATCH 2/9] vfio/pci: test existence before calling region->ops

2019-12-05 Thread Yan Zhao
For regions registered through vfio_pci_register_dev_region(),
before calling region->ops, first check whether region->ops is not null.

As the next two patches register dev regions with null region->ops
by default on behalf of the vendor driver, we need to check here
to prevent null pointer access if the vendor driver forgets to handle
those dev regions.
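
The check described here is the usual guard around an optional ops
table. A minimal sketch, assuming simplified stand-in types (struct
region, struct region_ops, region_rw() and echo_count_rw() are
hypothetical and are not the vfio-pci definitions):

```c
#include <errno.h>
#include <stddef.h>

struct region;

/* Hypothetical, reduced ops table with a single optional hook. */
struct region_ops {
	long (*rw)(struct region *r, char *buf, size_t count);
};

struct region {
	const struct region_ops *ops;	/* may be NULL for default regions */
};

/* Call ops->rw only if both the table and the hook exist. */
long region_rw(struct region *r, char *buf, size_t count)
{
	if (!r->ops || !r->ops->rw)
		return -EINVAL;

	return r->ops->rw(r, buf, count);
}

/* Example hook for illustration: reports the requested byte count. */
long echo_count_rw(struct region *r, char *buf, size_t count)
{
	(void)r;
	(void)buf;
	return (long)count;
}
```

Both the table pointer and the individual hook are tested, since a
vendor driver may install an ops table that leaves some hooks unset.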

Cc: Kevin Tian 

Signed-off-by: Yan Zhao 
---
 drivers/vfio/pci/vfio_pci.c | 12 ++--
 1 file changed, 10 insertions(+), 2 deletions(-)

diff --git a/drivers/vfio/pci/vfio_pci.c b/drivers/vfio/pci/vfio_pci.c
index 55080ff29495..f3730252ee82 100644
--- a/drivers/vfio/pci/vfio_pci.c
+++ b/drivers/vfio/pci/vfio_pci.c
@@ -398,8 +398,12 @@ static void vfio_pci_disable(struct vfio_pci_device *vdev)
 
vdev->virq_disabled = false;
 
-   for (i = 0; i < vdev->num_regions; i++)
+   for (i = 0; i < vdev->num_regions; i++) {
+   if (!vdev->region[i].ops || !vdev->region[i].ops->release)
+   continue;
+
vdev->region[i].ops->release(vdev, &vdev->region[i]);
+   }
 
vdev->num_regions = 0;
kfree(vdev->region);
@@ -900,7 +904,8 @@ static long vfio_pci_ioctl(void *device_data,
if (ret)
return ret;
 
-   if (vdev->region[i].ops->add_capability) {
+   if (vdev->region[i].ops &&
+   vdev->region[i].ops->add_capability) {
ret = vdev->region[i].ops->add_capability(vdev,
&vdev->region[i], &caps);
if (ret)
@@ -1251,6 +1256,9 @@ static ssize_t vfio_pci_rw(void *device_data, char __user *buf,
return vfio_pci_vga_rw(vdev, buf, count, ppos, iswrite);
default:
index -= VFIO_PCI_NUM_REGIONS;
+   if (!vdev->region[index].ops || !vdev->region[index].ops->rw)
+   return -EINVAL;
+
return vdev->region[index].ops->rw(vdev, buf,
   count, ppos, iswrite);
}
-- 
2.17.1


--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list



Re: [libvirt] [PATCH v4 0/2] introduction of migration_version attribute for VFIO live migration

2019-06-03 Thread Yan Zhao
On Tue, Jun 04, 2019 at 03:29:32AM +0800, Alex Williamson wrote:
> On Thu, 30 May 2019 20:44:38 -0400
> Yan Zhao  wrote:
> 
> > This patchset introduces a migration_version attribute under sysfs of VFIO
> > Mediated devices.
> > 
> > This migration_version attribute is used to check migration compatibility
> > between two mdev devices of the same mdev type.
> > 
> > Patch 1 defines migration_version attribute in
> > Documentation/vfio-mediated-device.txt
> > 
> > Patch 2 uses GVT as an example to show how to expose migration_version
> > attribute and check migration compatibility in vendor driver.
> 
> Thanks for iterating through this, it looks like we've settled on
> something reasonable, but now what?  This is one piece of the puzzle to
> supporting mdev migration, but I don't think it makes sense to commit
> this upstream on its own without also defining the remainder of how we
> actually do migration, preferably with more than one working
> implementation and at least prototyped, if not final, QEMU support.  I
> hope that was the intent, and maybe it's now time to look at the next
> piece of the puzzle.  Thanks,
> 
> Alex

Got it. 
Also thank you and all for discussing and guiding all along:)
We'll move to the next episode now.

Thanks
Yan



[libvirt] [PATCH v4 2/2] drm/i915/gvt: export migration_version to mdev sysfs for Intel vGPU

2019-05-30 Thread Yan Zhao
This feature implements the migration_version attribute for Intel's vGPU
mdev devices.

migration_version attribute is rw.
It's used to check migration compatibility for two mdev devices of the
same mdev type.
migration_version string is defined by vendor driver and opaque to
userspace.

For Intel vGPU of gen8 and gen9, the format of migration_version string
is:
  ---.

For future platforms, the format of migration_version string is to be
expanded to include more meta data to identify Intel vGPUs for live
migration compatibility check.

For old platforms, and for GVT not supporting vGPU live migration
feature, -ENODEV is returned on read(2)/write(2) of migration_version
attribute.
For vGPUs running an old GVT that does not expose the migration_version
attribute, live migration is regarded as not supported.
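
As a rough illustration of the behaviour described above (one snprintf
to build the string, -ENODEV when migration is not supported), a
userspace-testable sketch could look like the following. The field set
in struct version_fields, the "%04x-%04x-%s-%u" layout, the plain string
comparison and the -EINVAL result for a mismatch are all assumptions made
for this example; the real string layout and errno values are private to
the vendor driver.

```c
#include <errno.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical fields; not the layout used by GVT. */
struct version_fields {
	unsigned int vendor_id;
	unsigned int device_id;
	const char *type_name;
	unsigned int sw_version;
};

/* Build the whole string with a single snprintf(); fail on truncation. */
int build_version(const struct version_fields *f, char *out, size_t len)
{
	int n = snprintf(out, len, "%04x-%04x-%s-%u",
			 f->vendor_id, f->device_id, f->type_name,
			 f->sw_version);

	return (n < 0 || (size_t)n >= len) ? -EINVAL : 0;
}

/*
 * Compare a string written by userspace with the local one.
 * local == NULL models a type without migration support.
 */
int check_version(const char *local, const char *incoming)
{
	size_t n;

	if (!local)
		return -ENODEV;

	/* Tolerate one trailing newline, as a write from a shell adds one. */
	n = strlen(incoming);
	if (n && incoming[n - 1] == '\n')
		n--;

	return (strlen(local) == n && !strncmp(local, incoming, n)) ?
		0 : -EINVAL;
}
```

A real driver is free to accept more than exact matches, for example a
newer software version on the target side.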

Cc: Alex Williamson 
Cc: Erik Skultety 
Cc: "Dr. David Alan Gilbert" 
Cc: Cornelia Huck 
Cc: "Tian, Kevin" 
Cc: Zhenyu Wang 
Cc: "Wang, Zhi A" 
Cc: Neo Jia 
Cc: Kirti Wankhede 

Signed-off-by: Yan Zhao 
Acked-by: Cornelia Huck 
Acked-by: Zhenyu Wang 

---
v4:
1. fixed indentation/spelling issues and reworded several error messages
(Cornelia Huck)
2. added kfree(version) in snprintf failure case (Zhenyu Wang)

v3:
1. renamed version to migration_version
(Christophe de Dinechin, Cornelia Huck, Alex Williamson)
2. instead of generating migration version strings each time, storing
them in vgpu types generated during initialization.
(Zhenyu Wang, Cornelia Huck)
3. replaced multiple snprintf with one big snprintf in
intel_gvt_get_vfio_migration_version()
(Dr. David Alan Gilbert)
4. printed detailed error log
(Alex Williamson, Erik Skultety, Cornelia Huck, Dr. David Alan Gilbert)
5. incorporated  into migration_version string
(Alex Williamson)
6. do not use ifndef macro to switch off migration_version attribute
(Zhenyu Wang)

v2:
1. removed 32 common part of version string
(Alex Williamson)
2. do not register version attribute for GVT not supporting live
migration.(Cornelia Huck)
3. for platforms out of gen8, gen9, return -EINVAL --> -ENODEV for
incompatible. (Cornelia Huck)
---
 drivers/gpu/drm/i915/gvt/Makefile|   2 +-
 drivers/gpu/drm/i915/gvt/gvt.c   |  39 +
 drivers/gpu/drm/i915/gvt/gvt.h   |   5 +
 drivers/gpu/drm/i915/gvt/migration_version.c | 170 +++
 drivers/gpu/drm/i915/gvt/vgpu.c  |  13 +-
 5 files changed, 226 insertions(+), 3 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gvt/migration_version.c

diff --git a/drivers/gpu/drm/i915/gvt/Makefile b/drivers/gpu/drm/i915/gvt/Makefile
index ea8324abc784..a4143510ea22 100644
--- a/drivers/gpu/drm/i915/gvt/Makefile
+++ b/drivers/gpu/drm/i915/gvt/Makefile
@@ -3,7 +3,7 @@ GVT_DIR := gvt
 GVT_SOURCE := gvt.o aperture_gm.o handlers.o vgpu.o trace_points.o firmware.o \
interrupt.o gtt.o cfg_space.o opregion.o mmio.o display.o edid.o \
execlist.o scheduler.o sched_policy.o mmio_context.o cmd_parser.o debugfs.o \
-   fb_decoder.o dmabuf.o page_track.o
+   fb_decoder.o dmabuf.o page_track.o migration_version.o
 
 ccflags-y  += -I $(srctree)/$(src) -I $(srctree)/$(src)/$(GVT_DIR)/
 i915-y += $(addprefix $(GVT_DIR)/, $(GVT_SOURCE))
diff --git a/drivers/gpu/drm/i915/gvt/gvt.c b/drivers/gpu/drm/i915/gvt/gvt.c
index 43f4242062dd..35fb3c20eb0e 100644
--- a/drivers/gpu/drm/i915/gvt/gvt.c
+++ b/drivers/gpu/drm/i915/gvt/gvt.c
@@ -105,14 +105,53 @@ static ssize_t description_show(struct kobject *kobj, struct device *dev,
   type->weight);
 }
 
+static ssize_t migration_version_show(struct kobject *kobj, struct device *dev,
+   char *buf)
+{
+   struct intel_vgpu_type *type;
+   void *gvt = kdev_to_i915(dev)->gvt;
+
+   type = intel_gvt_find_vgpu_type(gvt, kobject_name(kobj));
+   if (!type || !type->migration_version) {
+   gvt_err("Migration not supported on type %s. Please search previous detailed log\n",
+   kobject_name(kobj));
+   return -ENODEV;
+   }
+
+   return snprintf(buf, strlen(type->migration_version) + 2,
+   "%s\n", type->migration_version);
+}
+
+static ssize_t migration_version_store(struct kobject *kobj, struct device *dev,
+   const char *buf, size_t count)
+{
+   int ret = 0;
+   struct intel_vgpu_type *type;
+   void *gvt = kdev_to_i915(dev)->gvt;
+
+   type = intel_gvt_find_vgpu_type(gvt, kobject_name(kobj));
+   if (!type || !type->migration_version) {
+   gvt_err("Migration not supported on type %s. Please search previous detailed log\n",
+   kobject_name(kobj));
+   return -ENODEV;
+   }
+
+   ret = intel_gvt_check_vfio_migration_version(g

[PATCH v4 1/2] vfio/mdev: add migration_version attribute for mdev device

2019-05-30 Thread Yan Zhao
migration_version attribute is used to check migration compatibility
between two mdev devices of the same mdev type.
The key is that it's rw and its data is opaque to userspace.

Userspace reads migration_version of mdev device at source side and
writes the value to migration_version attribute of mdev device at target
side. It judges migration compatibility according to whether the read
and write operations succeed or fail.

As this attribute is under mdev_type node, userspace is able to know
whether two mdev devices are compatible before a mdev device is created.

userspace needs to check whether the two mdev devices are of the same
mdev type before checking the migration_version attribute. It also needs
to check device creation parameters if aggregation is supported in
future.

             __    userspace
              /\              \
             /                 \write
            / read              \
   ________/__________       ___\|/_____________
  | migration_version |     | migration_version |-->check migration
  ---------------------     ---------------------   compatibility
     mdev device A               mdev device B
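
From userspace, the read-then-write check pictured above might look
roughly like this. It is a minimal sketch: check_migration_compat() is a
hypothetical helper, the two arguments are whatever migration_version
files the caller has located for the matching mdev type on the source
and target side, and the assumption that the value fits in a 4096-byte
buffer is mine.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Read the opaque version string from the source attribute and write it
 * unchanged to the target attribute. Returns 0 if both the read and the
 * write succeed (compatible), -1 otherwise.
 */
int check_migration_compat(const char *src_attr, const char *dst_attr)
{
	char buf[4096];	/* assumption: the value fits in one buffer */
	ssize_t len, written;
	int fd;

	fd = open(src_attr, O_RDONLY);
	if (fd < 0)
		return -1;
	len = read(fd, buf, sizeof(buf));
	close(fd);
	if (len <= 0)
		return -1;

	fd = open(dst_attr, O_WRONLY);
	if (fd < 0)
		return -1;
	written = write(fd, buf, (size_t)len);
	close(fd);

	return written == len ? 0 : -1;
}
```

The string is never parsed, which keeps it opaque to userspace. The
caller is still expected to have compared the mdev types before calling
this.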

Cc: Alex Williamson 
Cc: Erik Skultety 
Cc: "Dr. David Alan Gilbert" 
Cc: Cornelia Huck 
Cc: "Tian, Kevin" 
Cc: Zhenyu Wang 
Cc: "Wang, Zhi A" 
Cc: Neo Jia 
Cc: Kirti Wankhede 
Cc: Daniel P. Berrangé 
Cc: Christophe de Dinechin 

Signed-off-by: Yan Zhao 
Reviewed-by: Cornelia Huck 

---
v4:
fixed a typo. (Cornelia Huck)

v3:
1. renamed version to migration_version
(Christophe de Dinechin, Cornelia Huck, Alex Williamson)
2. let errno be freely defined by vendor driver
(Alex Williamson, Erik Skultety, Cornelia Huck, Dr. David Alan Gilbert)
3. let checking mdev_type be prerequisite of migration compatibility
check. (Alex Williamson)
4. reworded example usage section.
(most of this section came from Alex Williamson)
5. reworded attribute intention section (Cornelia Huck)

v2:
1. added detailed intent and usage
2. made definition of version string completely private to vendor driver
   (Alex Williamson)
3. abandoned changes to sample mdev drivers (Alex Williamson)
4. mandatory --> optional (Cornelia Huck)
5. added description for errno (Cornelia Huck)
---
 Documentation/vfio-mediated-device.txt | 113 +
 1 file changed, 113 insertions(+)

diff --git a/Documentation/vfio-mediated-device.txt 
b/Documentation/vfio-mediated-device.txt
index c3f69bcaf96e..1241e1cee64e 100644
--- a/Documentation/vfio-mediated-device.txt
+++ b/Documentation/vfio-mediated-device.txt
@@ -202,6 +202,7 @@ Directories and files under the sysfs for Each Physical Device
   | |   |--- available_instances
   | |   |--- device_api
   | |   |--- description
+  | |   |--- migration_version
   | |   |--- [devices]
   | |--- [<type-id>]
   | |   |--- create
@@ -209,6 +210,7 @@ Directories and files under the sysfs for Each Physical Device
   | |   |--- available_instances
   | |   |--- device_api
   | |   |--- description
+  | |   |--- migration_version
   | |   |--- [devices]
   | |--- [<type-id>]
   |  |--- create
@@ -216,6 +218,7 @@ Directories and files under the sysfs for Each Physical Device
   |  |--- available_instances
   |  |--- device_api
   |  |--- description
+  |  |--- migration_version
   |  |--- [devices]
 
 * [mdev_supported_types]
@@ -246,6 +249,116 @@ Directories and files under the sysfs for Each Physical Device
   This attribute should show the number of devices of type <type-id> that can be
   created.
 
+* migration_version
+
+  This attribute is rw, and is optional.
+  It is used to check migration compatibility between two mdev devices of the
+  same mdev type. Absence of this attribute means the device of type <type-id>
+  does not support migration.
+  This attribute provides a way to check migration compatibility between two
+  mdev devices from userspace even before device creation. The intended usage is
+  for userspace to read the migration_version attribute from one mdev device and
+  then write that value to the migration_version attribute of the other mdev
+  device. The second mdev device indicates compatibility via the return code of
+  the write operation. This makes compatibility between mdev devices completely
+  vendor-defined and opaque to userspace. Userspace should do nothing more
+  than verify the mdev types match and then use the migration_version attribute
+  to confirm source to target compatibility.
+
+  Reading/Writing Attribute Data:
+  read(2) will fail if device of type <type-id> does not support migration and
+  otherwise succeed and return migration_version string of the device of
+  type <type-id>.
+
+  This migration_version string is vendor defined and opaque to the
+  userspace. Vendor is free to include whatever they feel is relevant.
+  e.g. -.
+
+  Restrictions on this migration_version st

[PATCH v4 0/2] introduction of migration_version attribute for VFIO live migration

2019-05-30 Thread Yan Zhao
This patchset introduces a migration_version attribute under sysfs of VFIO
Mediated devices.

This migration_version attribute is used to check migration compatibility
between two mdev devices of the same mdev type.

Patch 1 defines migration_version attribute in
Documentation/vfio-mediated-device.txt

Patch 2 uses GVT as an example to show how to expose migration_version
attribute and check migration compatibility in vendor driver.

v4:
1. fixed indentation/spelling errors, reworded several error messages
2. added a missing memory free for error handling in patch 2

v3:
1. renamed version to migration_version
2. let errno be freely defined by vendor driver
3. let checking mdev_type be prerequisite of migration compatibility check
4. reworded most of patch 1
5. print detailed error log in patch 2 and generate migration_version
string at init time

v2:
1. renamed patch 1
2. made definition of device version string completely private to vendor
driver
3. reverted changes to sample mdev drivers
4. described intent and usage of version attribute more clearly.


Yan Zhao (2):
  vfio/mdev: add migration_version attribute for mdev device
  drm/i915/gvt: export migration_version to mdev sysfs for Intel vGPU

 Documentation/vfio-mediated-device.txt   | 113 +
 drivers/gpu/drm/i915/gvt/Makefile|   2 +-
 drivers/gpu/drm/i915/gvt/gvt.c   |  39 +
 drivers/gpu/drm/i915/gvt/gvt.h   |   5 +
 drivers/gpu/drm/i915/gvt/migration_version.c | 168 +++
 drivers/gpu/drm/i915/gvt/vgpu.c  |  13 +-
 6 files changed, 337 insertions(+), 3 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gvt/migration_version.c

-- 
2.17.1



Re: [libvirt] [PATCH v3 2/2] drm/i915/gvt: export migration_version to mdev sysfs for Intel vGPU

2019-05-29 Thread Yan Zhao
On Wed, May 29, 2019 at 11:07:50AM +0800, Zhenyu Wang wrote:
> On 2019.05.26 23:44:37 -0400, Yan Zhao wrote:
> > This feature implements the migration_version attribute for Intel's vGPU
> > mdev devices.
> > 
> > migration_version attribute is rw.
> > It's used to check migration compatibility for two mdev devices of the
> > same mdev type.
> > migration_version string is defined by vendor driver and opaque to
> > userspace.
> > 
> > For Intel vGPU of gen8 and gen9, the format of migration_version string
> > is:
> >   ---.
> > 
> > For future platforms, the format of migration_version string is to be
> > expanded to include more meta data to identify Intel vGPUs for live
> > migration compatibility check
> > 
> > For old platforms, and for GVT not supporting vGPU live migration
> > feature, -ENODEV is returned on read(2)/write(2) of migration_version
> > attribute.
> > For vGPUs running old GVT who do not expose migration_version
> > attribute, live migration is regarded as not supported for those vGPUs.
> > 
> > Cc: Alex Williamson 
> > Cc: Erik Skultety 
> > Cc: "Dr. David Alan Gilbert" 
> > Cc: Cornelia Huck 
> > Cc: "Tian, Kevin" 
> > Cc: Zhenyu Wang 
> > Cc: "Wang, Zhi A" 
> > Cc: Neo Jia 
> > Cc: Kirti Wankhede 
> > 
> > Signed-off-by: Yan Zhao 
> > 
> > ---
> > v3:
> > 1. renamed version to migration_version
> > (Christophe de Dinechin, Cornelia Huck, Alex Williamson)
> > 2. instead of generating migration version strings each time, storing
> > them in vgpu types generated during initialization.
> > (Zhenyu Wang, Cornelia Huck)
> > 3. replaced multiple snprintf to one big snprintf in
> > intel_gvt_get_vfio_migration_version()
> > (Dr. David Alan Gilbert)
> > 4. printed detailed error log
> > (Alex Williamson, Erik Skultety, Cornelia Huck, Dr. David Alan Gilbert)
> > 5. incorporated  into migration_version string
> > (Alex Williamson)
> > 6. do not use ifndef macro to switch off migration_version attribute
> > (Zhenyu Wang)
> > 
> > v2:
> > 1. removed 32 common part of version string
> > (Alex Williamson)
> > 2. do not register version attribute for GVT not supporting live
> > migration.(Cornelia Huck)
> > 3. for platforms out of gen8, gen9, return -EINVAL --> -ENODEV for
> > incompatible. (Cornelia Huck)
> > ---
> >  drivers/gpu/drm/i915/gvt/Makefile|   2 +-
> >  drivers/gpu/drm/i915/gvt/gvt.c   |  39 +
> >  drivers/gpu/drm/i915/gvt/gvt.h   |   5 +
> >  drivers/gpu/drm/i915/gvt/migration_version.c | 167 +++
> >  drivers/gpu/drm/i915/gvt/vgpu.c  |  13 +-
> >  5 files changed, 223 insertions(+), 3 deletions(-)
> >  create mode 100644 drivers/gpu/drm/i915/gvt/migration_version.c
> > 
> > diff --git a/drivers/gpu/drm/i915/gvt/Makefile 
> > b/drivers/gpu/drm/i915/gvt/Makefile
> > index 271fb46d4dd0..a9d561c93ab8 100644
> > --- a/drivers/gpu/drm/i915/gvt/Makefile
> > +++ b/drivers/gpu/drm/i915/gvt/Makefile
> > @@ -3,7 +3,7 @@ GVT_DIR := gvt
> >  GVT_SOURCE := gvt.o aperture_gm.o handlers.o vgpu.o trace_points.o 
> > firmware.o \
> > interrupt.o gtt.o cfg_space.o opregion.o mmio.o display.o edid.o \
> > execlist.o scheduler.o sched_policy.o mmio_context.o cmd_parser.o 
> > debugfs.o \
> > -   fb_decoder.o dmabuf.o page_track.o
> > +   fb_decoder.o dmabuf.o page_track.o migration_version.o
> >  
> >  ccflags-y  += -I$(src) -I$(src)/$(GVT_DIR)
> >  i915-y += $(addprefix $(GVT_DIR)/, 
> > $(GVT_SOURCE))
> > diff --git a/drivers/gpu/drm/i915/gvt/gvt.c b/drivers/gpu/drm/i915/gvt/gvt.c
> > index 43f4242062dd..be2980e8ac75 100644
> > --- a/drivers/gpu/drm/i915/gvt/gvt.c
> > +++ b/drivers/gpu/drm/i915/gvt/gvt.c
> > @@ -105,14 +105,53 @@ static ssize_t description_show(struct kobject *kobj, 
> > struct device *dev,
> >type->weight);
> >  }
> >  
> > +static ssize_t migration_version_show(struct kobject *kobj, struct device 
> > *dev,
> > +   char *buf)
> > +{
> > +   struct intel_vgpu_type *type;
> > +   void *gvt = kdev_to_i915(dev)->gvt;
> > +
> > +   type = intel_gvt_find_vgpu_type(gvt, kobject_name(kobj));
> > +   if (!type || !type->migration_version) {
> > +   gvt_err("Does not support migraion on type %s. Please search 
> > previous detailed log\n",
> > + 

Re: [libvirt] [PATCH v3 2/2] drm/i915/gvt: export migration_version to mdev sysfs for Intel vGPU

2019-05-28 Thread Yan Zhao
On Tue, May 28, 2019 at 05:01:35PM +0800, Cornelia Huck wrote:
> On Sun, 26 May 2019 23:44:37 -0400
> Yan Zhao  wrote:
> 
> > This feature implements the migration_version attribute for Intel's vGPU
> > mdev devices.
> > 
> > migration_version attribute is rw.
> > It's used to check migration compatibility for two mdev devices of the
> > same mdev type.
> > migration_version string is defined by vendor driver and opaque to
> > userspace.
> > 
> > For Intel vGPU of gen8 and gen9, the format of migration_version string
> > is:
> >   ---.
> > 
> > For future platforms, the format of migration_version string is to be
> > expanded to include more meta data to identify Intel vGPUs for live
> > migration compatibility check
> > 
> > For old platforms, and for GVT not supporting vGPU live migration
> > feature, -ENODEV is returned on read(2)/write(2) of migration_version
> > attribute.
> > For vGPUs running old GVT who do not expose migration_version
> > attribute, live migration is regarded as not supported for those vGPUs.
> > 
> > Cc: Alex Williamson 
> > Cc: Erik Skultety 
> > Cc: "Dr. David Alan Gilbert" 
> > Cc: Cornelia Huck 
> > Cc: "Tian, Kevin" 
> > Cc: Zhenyu Wang 
> > Cc: "Wang, Zhi A" 
> > Cc: Neo Jia 
> > Cc: Kirti Wankhede 
> > 
> > Signed-off-by: Yan Zhao 
> > 
> > ---
> > v3:
> > 1. renamed version to migration_version
> > (Christophe de Dinechin, Cornelia Huck, Alex Williamson)
> > 2. instead of generating migration version strings each time, storing
> > them in vgpu types generated during initialization.
> > (Zhenyu Wang, Cornelia Huck)
> > 3. replaced multiple snprintf to one big snprintf in
> > intel_gvt_get_vfio_migration_version()
> > (Dr. David Alan Gilbert)
> > 4. printed detailed error log
> > (Alex Williamson, Erik Skultety, Cornelia Huck, Dr. David Alan Gilbert)
> > 5. incorporated  into migration_version string
> > (Alex Williamson)
> > 6. do not use ifndef macro to switch off migration_version attribute
> > (Zhenyu Wang)
> > 
> > v2:
> > 1. removed 32 common part of version string
> > (Alex Williamson)
> > 2. do not register version attribute for GVT not supporting live
> > migration.(Cornelia Huck)
> > 3. for platforms out of gen8, gen9, return -EINVAL --> -ENODEV for
> > incompatible. (Cornelia Huck)
> > ---
> >  drivers/gpu/drm/i915/gvt/Makefile|   2 +-
> >  drivers/gpu/drm/i915/gvt/gvt.c   |  39 +
> >  drivers/gpu/drm/i915/gvt/gvt.h   |   5 +
> >  drivers/gpu/drm/i915/gvt/migration_version.c | 167 +++
> >  drivers/gpu/drm/i915/gvt/vgpu.c  |  13 +-
> >  5 files changed, 223 insertions(+), 3 deletions(-)
> >  create mode 100644 drivers/gpu/drm/i915/gvt/migration_version.c
> > 
> 
> (...)
> 
> > diff --git a/drivers/gpu/drm/i915/gvt/gvt.c b/drivers/gpu/drm/i915/gvt/gvt.c
> > index 43f4242062dd..be2980e8ac75 100644
> > --- a/drivers/gpu/drm/i915/gvt/gvt.c
> > +++ b/drivers/gpu/drm/i915/gvt/gvt.c
> > @@ -105,14 +105,53 @@ static ssize_t description_show(struct kobject *kobj, 
> > struct device *dev,
> >type->weight);
> >  }
> >  
> > +static ssize_t migration_version_show(struct kobject *kobj, struct device 
> > *dev,
> > +   char *buf)
> 
> Indentation looks a bit odd? (Also below.)
>
yes. Let me correct it in next revision.

> > +{
> > +   struct intel_vgpu_type *type;
> > +   void *gvt = kdev_to_i915(dev)->gvt;
> > +
> > +   type = intel_gvt_find_vgpu_type(gvt, kobject_name(kobj));
> > +   if (!type || !type->migration_version) {
> > +   gvt_err("Does not support migraion on type %s. Please search 
> > previous detailed log\n",
> 
> s/migraion/migration/ (also below)
>
Sorry for typos again. I'll be more careful next time. thank you:)

> Or reword to "Migration not supported on type %s."?
>
Yes, better :)

> > +   kobject_name(kobj));
> > +   return -ENODEV;
> > +   }
> > +
> > +   return snprintf(buf, strlen(type->migration_version) + 2,
> > +   "%s\n", type->migration_version);
> > +}
> > +
> > +static ssize_t migration_version_store(struct kobject *kobj, struct device 
> > *dev,
> > +   const char *buf, size_t count)
> > +{
> > +   int ret = 0;
> > +   struct intel_vgpu_type *type;
> > +   void *

Re: [libvirt] [PATCH v3 1/2] vfio/mdev: add migration_version attribute for mdev device

2019-05-28 Thread Yan Zhao
On Tue, May 28, 2019 at 04:53:32PM +0800, Cornelia Huck wrote:
> On Sun, 26 May 2019 23:43:42 -0400
> Yan Zhao  wrote:
> 
> > migration_version attribute is used to check migration compatibility
> > between two mdev device of the same mdev type.
> 
> s/device/devices/
>
yes... sorry and thanks :)

> > The key is that it's rw and its data is opaque to userspace.
> > 
> > Userspace reads migration_version of mdev device at source side and
> > writes the value to migration_version attribute of mdev device at target
> > side. It judges migration compatibility according to whether the read
> > and write operations succeed or fail.
> > 
> > As this attribute is under mdev_type node, userspace is able to know
> > whether two mdev devices are compatible before a mdev device is created.
> > 
> > userspace needs to check whether the two mdev devices are of the same
> > mdev type before checking the migration_version attribute. It also needs
> > to check device creation parameters if aggregation is supported in
> > future.
> > 
> >              __    userspace
> >               /\              \
> >              /                 \write
> >             / read              \
> >    ________/__________       ___\|/_____________
> >   | migration_version |     | migration_version |-->check migration
> >   ---------------------     ---------------------   compatibility
> >      mdev device A               mdev device B
> > 
> > Cc: Alex Williamson 
> > Cc: Erik Skultety 
> > Cc: "Dr. David Alan Gilbert" 
> > Cc: Cornelia Huck 
> > Cc: "Tian, Kevin" 
> > Cc: Zhenyu Wang 
> > Cc: "Wang, Zhi A" 
> > Cc: Neo Jia 
> > Cc: Kirti Wankhede 
> > Cc: Daniel P. Berrangé 
> > Cc: Christophe de Dinechin 
> > 
> > Signed-off-by: Yan Zhao 
> > 
> > ---
> > v3:
> > 1. renamed version to migration_version
> > (Christophe de Dinechin, Cornelia Huck, Alex Williamson)
> > 2. let errno to be freely defined by vendor driver
> > (Alex Williamson, Erik Skultety, Cornelia Huck, Dr. David Alan Gilbert)
> > 3. let checking mdev_type be prerequisite of migration compatibility
> > check. (Alex Williamson)
> > 4. reworded example usage section.
> > (most of this section came from Alex Williamson)
> > 5. reworded attribute intention section (Cornelia Huck)
> > 
> > v2:
> > 1. added detailed intent and usage
> > 2. made definition of version string completely private to vendor driver
> >(Alex Williamson)
> > 3. abandoned changes to sample mdev drivers (Alex Williamson)
> > 4. mandatory --> optional (Cornelia Huck)
> > 5. added description for errno (Cornelia Huck)
> > ---
> >  Documentation/vfio-mediated-device.txt | 113 +
> >  1 file changed, 113 insertions(+)
> > 
> 
> While I probably would have written a more compact description, your
> version is fine with me as well.
> 
> Reviewed-by: Cornelia Huck 
Thank you Cornelia!

> ___
> intel-gvt-dev mailing list
> intel-gvt-...@lists.freedesktop.org
> https://lists.freedesktop.org/mailman/listinfo/intel-gvt-dev



[PATCH v3 2/2] drm/i915/gvt: export migration_version to mdev sysfs for Intel vGPU

2019-05-26 Thread Yan Zhao
This feature implements the migration_version attribute for Intel's vGPU
mdev devices.

migration_version attribute is rw.
It's used to check migration compatibility for two mdev devices of the
same mdev type.
migration_version string is defined by vendor driver and opaque to
userspace.

For Intel vGPU of gen8 and gen9, the format of migration_version string
is:
  ---.

For future platforms, the format of migration_version string is to be
expanded to include more meta data to identify Intel vGPUs for live
migration compatibility check.

For old platforms, and for GVT not supporting vGPU live migration
feature, -ENODEV is returned on read(2)/write(2) of migration_version
attribute.
For vGPUs running an old GVT that does not expose the migration_version
attribute, live migration is regarded as not supported.

Cc: Alex Williamson 
Cc: Erik Skultety 
Cc: "Dr. David Alan Gilbert" 
Cc: Cornelia Huck 
Cc: "Tian, Kevin" 
Cc: Zhenyu Wang 
Cc: "Wang, Zhi A" 
Cc: Neo Jia 
Cc: Kirti Wankhede 

Signed-off-by: Yan Zhao 

---
v3:
1. renamed version to migration_version
(Christophe de Dinechin, Cornelia Huck, Alex Williamson)
2. instead of generating migration version strings each time, storing
them in vgpu types generated during initialization.
(Zhenyu Wang, Cornelia Huck)
3. replaced multiple snprintf to one big snprintf in
intel_gvt_get_vfio_migration_version()
(Dr. David Alan Gilbert)
4. printed detailed error log
(Alex Williamson, Erik Skultety, Cornelia Huck, Dr. David Alan Gilbert)
5. incorporated  into migration_version string
(Alex Williamson)
6. do not use ifndef macro to switch off migration_version attribute
(Zhenyu Wang)

v2:
1. removed 32 common part of version string
(Alex Williamson)
2. do not register version attribute for GVT not supporting live
migration.(Cornelia Huck)
3. for platforms out of gen8, gen9, return -EINVAL --> -ENODEV for
incompatible. (Cornelia Huck)
---
 drivers/gpu/drm/i915/gvt/Makefile|   2 +-
 drivers/gpu/drm/i915/gvt/gvt.c   |  39 +
 drivers/gpu/drm/i915/gvt/gvt.h   |   5 +
 drivers/gpu/drm/i915/gvt/migration_version.c | 167 +++
 drivers/gpu/drm/i915/gvt/vgpu.c  |  13 +-
 5 files changed, 223 insertions(+), 3 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gvt/migration_version.c

diff --git a/drivers/gpu/drm/i915/gvt/Makefile b/drivers/gpu/drm/i915/gvt/Makefile
index 271fb46d4dd0..a9d561c93ab8 100644
--- a/drivers/gpu/drm/i915/gvt/Makefile
+++ b/drivers/gpu/drm/i915/gvt/Makefile
@@ -3,7 +3,7 @@ GVT_DIR := gvt
 GVT_SOURCE := gvt.o aperture_gm.o handlers.o vgpu.o trace_points.o firmware.o \
interrupt.o gtt.o cfg_space.o opregion.o mmio.o display.o edid.o \
execlist.o scheduler.o sched_policy.o mmio_context.o cmd_parser.o debugfs.o \
-   fb_decoder.o dmabuf.o page_track.o
+   fb_decoder.o dmabuf.o page_track.o migration_version.o
 
 ccflags-y  += -I$(src) -I$(src)/$(GVT_DIR)
 i915-y += $(addprefix $(GVT_DIR)/, $(GVT_SOURCE))
diff --git a/drivers/gpu/drm/i915/gvt/gvt.c b/drivers/gpu/drm/i915/gvt/gvt.c
index 43f4242062dd..be2980e8ac75 100644
--- a/drivers/gpu/drm/i915/gvt/gvt.c
+++ b/drivers/gpu/drm/i915/gvt/gvt.c
@@ -105,14 +105,53 @@ static ssize_t description_show(struct kobject *kobj, struct device *dev,
   type->weight);
 }
 
+static ssize_t migration_version_show(struct kobject *kobj, struct device *dev,
+   char *buf)
+{
+   struct intel_vgpu_type *type;
+   void *gvt = kdev_to_i915(dev)->gvt;
+
+   type = intel_gvt_find_vgpu_type(gvt, kobject_name(kobj));
+   if (!type || !type->migration_version) {
+   gvt_err("Does not support migration on type %s. Please search previous detailed log\n",
+   kobject_name(kobj));
+   return -ENODEV;
+   }
+
+   return snprintf(buf, strlen(type->migration_version) + 2,
+   "%s\n", type->migration_version);
+}
+
+static ssize_t migration_version_store(struct kobject *kobj, struct device *dev,
+   const char *buf, size_t count)
+{
+   int ret = 0;
+   struct intel_vgpu_type *type;
+   void *gvt = kdev_to_i915(dev)->gvt;
+
+   type = intel_gvt_find_vgpu_type(gvt, kobject_name(kobj));
+   if (!type || !type->migration_version) {
+   gvt_err("Does not support migration on type %s. Please search previous detailed log\n",
+   kobject_name(kobj));
+   return -ENODEV;
+   }
+
+   ret = intel_gvt_check_vfio_migration_version(gvt,
+   type->migration_version, buf);
+
+   return (ret < 0 ? ret : count);
+}
+
 static MDEV_TYPE_ATTR_RO(available_instances);
 static MDEV_TYPE_ATTR_RO(device_api);
 static MDEV_TYPE_ATTR_RO(description)

[PATCH v3 1/2] vfio/mdev: add migration_version attribute for mdev device

2019-05-26 Thread Yan Zhao
migration_version attribute is used to check migration compatibility
between two mdev devices of the same mdev type.
The key is that it's rw and its data is opaque to userspace.

Userspace reads migration_version of mdev device at source side and
writes the value to migration_version attribute of mdev device at target
side. It judges migration compatibility according to whether the read
and write operations succeed or fail.

As this attribute is under mdev_type node, userspace is able to know
whether two mdev devices are compatible before a mdev device is created.

Userspace needs to check whether the two mdev devices are of the same
mdev type before checking the migration_version attribute. It also needs
to check device creation parameters if aggregation is supported in the
future.

             __    userspace
              /\              \
             /                 \write
            / read              \
   ________/__________       ___\|/_____________
  | migration_version |     | migration_version |-->check migration
  ---------------------     ---------------------   compatibility
     mdev device A               mdev device B
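
The read-then-write flow described above could be sketched in userspace as
follows. This is only a rough illustration under stated assumptions, not code
from this series: all function names are hypothetical, the caller supplies the
paths of the two migration_version attributes, and the 4096-byte buffer is an
assumed upper bound on the string length.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Read an attribute into buf (NUL-terminated); returns length or -errno. */
static ssize_t read_attr(const char *path, char *buf, size_t size)
{
	ssize_t n;
	int err;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -errno;
	n = read(fd, buf, size - 1);
	err = errno;
	close(fd);
	if (n < 0)
		return -err;
	buf[n] = '\0';
	return n;
}

/* Write len bytes to an attribute; returns 0 or -errno. */
static int write_attr(const char *path, const char *buf, size_t len)
{
	ssize_t n;
	int err;
	int fd = open(path, O_WRONLY);

	if (fd < 0)
		return -errno;
	n = write(fd, buf, len);
	err = errno;
	close(fd);
	if (n < 0)
		return -err;
	return (size_t)n == len ? 0 : -EIO;
}

/*
 * Returns 0 if the target accepts the source's migration_version string,
 * or a negative errno. The string itself is never interpreted here; it
 * stays opaque to userspace.
 */
int mdev_migration_compatible(const char *src_attr, const char *dst_attr)
{
	char version[4096];
	ssize_t n = read_attr(src_attr, version, sizeof(version));

	if (n < 0)
		return (int)n;	/* source does not support migration */

	return write_attr(dst_attr, version, (size_t)n);
}
```

The caller is still expected to verify that both devices are of the same mdev
type before calling such a helper; the sketch does not do that check.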

Cc: Alex Williamson 
Cc: Erik Skultety 
Cc: "Dr. David Alan Gilbert" 
Cc: Cornelia Huck 
Cc: "Tian, Kevin" 
Cc: Zhenyu Wang 
Cc: "Wang, Zhi A" 
Cc: Neo Jia 
Cc: Kirti Wankhede 
Cc: Daniel P. Berrangé 
Cc: Christophe de Dinechin 

Signed-off-by: Yan Zhao 

---
v3:
1. renamed version to migration_version
(Christophe de Dinechin, Cornelia Huck, Alex Williamson)
2. let errno be freely defined by vendor driver
(Alex Williamson, Erik Skultety, Cornelia Huck, Dr. David Alan Gilbert)
3. let checking mdev_type be prerequisite of migration compatibility
check. (Alex Williamson)
4. reworded example usage section.
(most of this section came from Alex Williamson)
5. reworded attribute intention section (Cornelia Huck)

v2:
1. added detailed intent and usage
2. made definition of version string completely private to vendor driver
   (Alex Williamson)
3. abandoned changes to sample mdev drivers (Alex Williamson)
4. mandatory --> optional (Cornelia Huck)
5. added description for errno (Cornelia Huck)
---
 Documentation/vfio-mediated-device.txt | 113 +
 1 file changed, 113 insertions(+)

diff --git a/Documentation/vfio-mediated-device.txt b/Documentation/vfio-mediated-device.txt
index c3f69bcaf96e..1241e1cee64e 100644
--- a/Documentation/vfio-mediated-device.txt
+++ b/Documentation/vfio-mediated-device.txt
@@ -202,6 +202,7 @@ Directories and files under the sysfs for Each Physical Device
   | |   |--- available_instances
   | |   |--- device_api
   | |   |--- description
+  | |   |--- migration_version
   | |   |--- [devices]
   | |--- []
   | |   |--- create
@@ -209,6 +210,7 @@ Directories and files under the sysfs for Each Physical Device
   | |   |--- available_instances
   | |   |--- device_api
   | |   |--- description
+  | |   |--- migration_version
   | |   |--- [devices]
   | |--- []
   |  |--- create
@@ -216,6 +218,7 @@ Directories and files under the sysfs for Each Physical Device
   |  |--- available_instances
   |  |--- device_api
   |  |--- description
+  |  |--- migration_version
   |  |--- [devices]
 
 * [mdev_supported_types]
@@ -246,6 +249,116 @@ Directories and files under the sysfs for Each Physical Device
   This attribute should show the number of devices of type  that can be
   created.
 
+* migration_version
+
+  This attribute is rw, and is optional.
+  It is used to check migration compatibility between two mdev devices of the
+  same mdev type. Absence of this attribute means the device of type 
+  does not support migration.
+  This attribute provides a way to check migration compatibility between two
+  mdev devices from userspace even before device creation. The intended usage is
+  for userspace to read the migration_version attribute from one mdev device and
+  then write that value to the migration_version attribute of the other mdev
+  device. The second mdev device indicates compatibility via the return code of
+  the write operation. This makes compatibility between mdev devices completely
+  vendor-defined and opaque to userspace. Userspace should do nothing more
+  than verify the mdev types match and then use the migration_version attribute
+  to confirm source to target compatibility.
+
+  Reading/Writing Attribute Data:
+  read(2) will fail if device of type  does not support migration and
+  otherwise succeed and return migration_version string of the device of
+  type .
+
+  This migration_version string is vendor defined and opaque to the
+  userspace. Vendor is free to include whatever they feel is relevant.
+  e.g. -.
+
+  Restrictions on this migration_version string:
+1. It should only contain ascii characters
+ 

[libvirt] [PATCH v3 0/2] introduction of migration_version attribute for VFIO live migration

2019-05-26 Thread Yan Zhao
This patchset introduces a migration_version attribute under sysfs of VFIO
Mediated devices.

This migration_version attribute is used to check migration compatibility
between two mdev devices of the same mdev type.

Patch 1 defines migration_version attribute in
Documentation/vfio-mediated-device.txt

Patch 2 uses GVT as an example to show how to expose migration_version
attribute and check migration compatibility in vendor driver.


v3:
1. renamed version to migration_version
2. let errno be freely defined by vendor driver
3. let checking mdev_type be prerequisite of migration compatibility check
4. reworded most part of patch 1
5. print detailed error log in patch 2 and generate migration_version
string at init time

v2:
1. renamed patch 1
2. made definition of device version string completely private to vendor
driver
3. reverted changes to sample mdev drivers
4. described intent and usage of version attribute more clearly.


Yan Zhao (2):
  vfio/mdev: add migration_version attribute for mdev device
  drm/i915/gvt: export migration_version to mdev sysfs for Intel vGPU

 Documentation/vfio-mediated-device.txt   | 113 +
 drivers/gpu/drm/i915/gvt/Makefile|   2 +-
 drivers/gpu/drm/i915/gvt/gvt.c   |  39 +
 drivers/gpu/drm/i915/gvt/gvt.h   |   5 +
 drivers/gpu/drm/i915/gvt/migration_version.c | 167 +++
 drivers/gpu/drm/i915/gvt/vgpu.c  |  13 +-
 6 files changed, 336 insertions(+), 3 deletions(-)
 create mode 100644 drivers/gpu/drm/i915/gvt/migration_version.c

-- 
2.17.1

--
libvir-list mailing list
libvir-list@redhat.com
https://www.redhat.com/mailman/listinfo/libvir-list


Re: [libvirt] [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

2019-05-15 Thread Yan Zhao
On Tue, May 14, 2019 at 11:01:42PM +0800, Alex Williamson wrote:
> On Tue, 14 May 2019 09:43:44 +0200
> Erik Skultety  wrote:
> 
> > On Tue, May 14, 2019 at 03:32:19AM -0400, Yan Zhao wrote:
> > > On Tue, May 14, 2019 at 03:20:40PM +0800, Erik Skultety wrote:  
> > > > On Tue, May 14, 2019 at 02:12:35AM -0400, Yan Zhao wrote:  
> > > > > On Mon, May 13, 2019 at 09:28:04PM +0800, Erik Skultety wrote:  
> > > > > > On Fri, May 10, 2019 at 11:48:38AM +0200, Cornelia Huck wrote:  
> > > > > > > On Fri, 10 May 2019 10:36:09 +0100
> > > > > > > "Dr. David Alan Gilbert"  wrote:
> > > > > > >  
> > > > > > > > * Cornelia Huck (coh...@redhat.com) wrote:  
> > > > > > > > > On Thu, 9 May 2019 17:48:26 +0100
> > > > > > > > > "Dr. David Alan Gilbert"  wrote:
> > > > > > > > >  
> > > > > > > > > > * Cornelia Huck (coh...@redhat.com) wrote:  
> > > > > > > > > > > On Thu, 9 May 2019 16:48:57 +0100
> > > > > > > > > > > "Dr. David Alan Gilbert"  wrote:
> > > > > > > > > > >  
> > > > > > > > > > > > * Cornelia Huck (coh...@redhat.com) wrote:  
> > > > > > > > > > > > > On Tue, 7 May 2019 15:18:26 -0600
> > > > > > > > > > > > > Alex Williamson  wrote:
> > > > > > > > > > > > >  
> > > > > > > > > > > > > > On Sun,  5 May 2019 21:49:04 -0400
> > > > > > > > > > > > > > Yan Zhao  wrote:  
> > > > > > > > > > > > >  
> > > > > > > > > > > > > > > +  Errno:
> > > > > > > > > > > > > > > +  If vendor driver wants to claim a mdev device 
> > > > > > > > > > > > > > > incompatible to all other mdev
> > > > > > > > > > > > > > > +  devices, it should not register version 
> > > > > > > > > > > > > > > attribute for this mdev device. But if
> > > > > > > > > > > > > > > +  a vendor driver has already registered version 
> > > > > > > > > > > > > > > attribute and it wants to claim
> > > > > > > > > > > > > > > +  a mdev device incompatible to all other mdev 
> > > > > > > > > > > > > > > devices, it needs to return
> > > > > > > > > > > > > > > +  -ENODEV on access to this mdev device's 
> > > > > > > > > > > > > > > version attribute.
> > > > > > > > > > > > > > > +  If a mdev device is only incompatible to 
> > > > > > > > > > > > > > > certain mdev devices, write of
> > > > > > > > > > > > > > > +  incompatible mdev devices's version strings to 
> > > > > > > > > > > > > > > its version attribute should
> > > > > > > > > > > > > > > +  return -EINVAL;  
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I think it's best not to define the specific errno 
> > > > > > > > > > > > > > returned for a
> > > > > > > > > > > > > > specific situation, let the vendor driver decide, 
> > > > > > > > > > > > > > userspace simply
> > > > > > > > > > > > > > needs to know that an errno on read indicates the 
> > > > > > > > > > > > > > device does not
> > > > > > > > > > > > > > support migration version comparison and that an 
> > > > > > > > > > > > > > errno on write
> > > > > > > > > > > > > > indicates the devices are incompatible or the 
> > > > > > > > > > > > > > target doesn't support
> > > > > > > > > > > > > > migration versions.  
> > > > > &

Re: [libvirt] [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

2019-05-14 Thread Yan Zhao
On Tue, May 14, 2019 at 03:43:44PM +0800, Erik Skultety wrote:
> On Tue, May 14, 2019 at 03:32:19AM -0400, Yan Zhao wrote:
> > On Tue, May 14, 2019 at 03:20:40PM +0800, Erik Skultety wrote:
> > > On Tue, May 14, 2019 at 02:12:35AM -0400, Yan Zhao wrote:
> > > > On Mon, May 13, 2019 at 09:28:04PM +0800, Erik Skultety wrote:
> > > > > On Fri, May 10, 2019 at 11:48:38AM +0200, Cornelia Huck wrote:
> > > > > > On Fri, 10 May 2019 10:36:09 +0100
> > > > > > "Dr. David Alan Gilbert"  wrote:
> > > > > >
> > > > > > > * Cornelia Huck (coh...@redhat.com) wrote:
> > > > > > > > On Thu, 9 May 2019 17:48:26 +0100
> > > > > > > > "Dr. David Alan Gilbert"  wrote:
> > > > > > > >
> > > > > > > > > * Cornelia Huck (coh...@redhat.com) wrote:
> > > > > > > > > > On Thu, 9 May 2019 16:48:57 +0100
> > > > > > > > > > "Dr. David Alan Gilbert"  wrote:
> > > > > > > > > >
> > > > > > > > > > > * Cornelia Huck (coh...@redhat.com) wrote:
> > > > > > > > > > > > On Tue, 7 May 2019 15:18:26 -0600
> > > > > > > > > > > > Alex Williamson  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > On Sun,  5 May 2019 21:49:04 -0400
> > > > > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > > > >
> > > > > > > > > > > > > > +  Errno:
> > > > > > > > > > > > > > +  If vendor driver wants to claim a mdev device 
> > > > > > > > > > > > > > incompatible to all other mdev
> > > > > > > > > > > > > > +  devices, it should not register version 
> > > > > > > > > > > > > > attribute for this mdev device. But if
> > > > > > > > > > > > > > +  a vendor driver has already registered version 
> > > > > > > > > > > > > > attribute and it wants to claim
> > > > > > > > > > > > > > +  a mdev device incompatible to all other mdev 
> > > > > > > > > > > > > > devices, it needs to return
> > > > > > > > > > > > > > +  -ENODEV on access to this mdev device's version 
> > > > > > > > > > > > > > attribute.
> > > > > > > > > > > > > > +  If a mdev device is only incompatible to certain 
> > > > > > > > > > > > > > mdev devices, write of
> > > > > > > > > > > > > > +  incompatible mdev devices's version strings to 
> > > > > > > > > > > > > > its version attribute should
> > > > > > > > > > > > > > +  return -EINVAL;
> > > > > > > > > > > > >
> > > > > > > > > > > > > I think it's best not to define the specific errno 
> > > > > > > > > > > > > returned for a
> > > > > > > > > > > > > specific situation, let the vendor driver decide, 
> > > > > > > > > > > > > userspace simply
> > > > > > > > > > > > > needs to know that an errno on read indicates the 
> > > > > > > > > > > > > device does not
> > > > > > > > > > > > > support migration version comparison and that an 
> > > > > > > > > > > > > errno on write
> > > > > > > > > > > > > indicates the devices are incompatible or the target 
> > > > > > > > > > > > > doesn't support
> > > > > > > > > > > > > migration versions.
> > > > > > > > > > > >
> > > > > > > > > > > > I think I have to disagree here: It's probably valuable 
> > > > > > > > > > > > to have an
> > > > > > > > > > > > agreed error for 'cannot migrate at all' vs 'cannot 
> > >

Re: [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

2019-05-14 Thread Yan Zhao
On Tue, May 14, 2019 at 03:20:40PM +0800, Erik Skultety wrote:
> On Tue, May 14, 2019 at 02:12:35AM -0400, Yan Zhao wrote:
> > On Mon, May 13, 2019 at 09:28:04PM +0800, Erik Skultety wrote:
> > > On Fri, May 10, 2019 at 11:48:38AM +0200, Cornelia Huck wrote:
> > > > On Fri, 10 May 2019 10:36:09 +0100
> > > > "Dr. David Alan Gilbert"  wrote:
> > > >
> > > > > * Cornelia Huck (coh...@redhat.com) wrote:
> > > > > > On Thu, 9 May 2019 17:48:26 +0100
> > > > > > "Dr. David Alan Gilbert"  wrote:
> > > > > >
> > > > > > > * Cornelia Huck (coh...@redhat.com) wrote:
> > > > > > > > On Thu, 9 May 2019 16:48:57 +0100
> > > > > > > > "Dr. David Alan Gilbert"  wrote:
> > > > > > > >
> > > > > > > > > * Cornelia Huck (coh...@redhat.com) wrote:
> > > > > > > > > > On Tue, 7 May 2019 15:18:26 -0600
> > > > > > > > > > Alex Williamson  wrote:
> > > > > > > > > >
> > > > > > > > > > > On Sun,  5 May 2019 21:49:04 -0400
> > > > > > > > > > > Yan Zhao  wrote:
> > > > > > > > > >
> > > > > > > > > > > > +  Errno:
> > > > > > > > > > > > +  If vendor driver wants to claim a mdev device 
> > > > > > > > > > > > incompatible to all other mdev
> > > > > > > > > > > > +  devices, it should not register version attribute 
> > > > > > > > > > > > for this mdev device. But if
> > > > > > > > > > > > +  a vendor driver has already registered version 
> > > > > > > > > > > > attribute and it wants to claim
> > > > > > > > > > > > +  a mdev device incompatible to all other mdev 
> > > > > > > > > > > > devices, it needs to return
> > > > > > > > > > > > +  -ENODEV on access to this mdev device's version 
> > > > > > > > > > > > attribute.
> > > > > > > > > > > > +  If a mdev device is only incompatible to certain 
> > > > > > > > > > > > mdev devices, write of
> > > > > > > > > > > > +  incompatible mdev devices's version strings to its 
> > > > > > > > > > > > version attribute should
> > > > > > > > > > > > +  return -EINVAL;
> > > > > > > > > > >
> > > > > > > > > > > I think it's best not to define the specific errno 
> > > > > > > > > > > returned for a
> > > > > > > > > > > specific situation, let the vendor driver decide, 
> > > > > > > > > > > userspace simply
> > > > > > > > > > > needs to know that an errno on read indicates the device 
> > > > > > > > > > > does not
> > > > > > > > > > > support migration version comparison and that an errno on 
> > > > > > > > > > > write
> > > > > > > > > > > indicates the devices are incompatible or the target 
> > > > > > > > > > > doesn't support
> > > > > > > > > > > migration versions.
> > > > > > > > > >
> > > > > > > > > > I think I have to disagree here: It's probably valuable to 
> > > > > > > > > > have an
> > > > > > > > > > agreed error for 'cannot migrate at all' vs 'cannot migrate 
> > > > > > > > > > between
> > > > > > > > > > those two particular devices'. Userspace might want to do 
> > > > > > > > > > different
> > > > > > > > > > things (e.g. trying with different device pairs).
> > > > > > > > >
> > > > > > > > > Trying to stuff these things down an errno seems a bad idea; 
> > > > > > > > > we can't
> > > > > > > > > get much information that way.
> > > > > > > >
> > > > > > >

Re: [libvirt] [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

2019-05-14 Thread Yan Zhao
On Mon, May 13, 2019 at 09:28:04PM +0800, Erik Skultety wrote:
> On Fri, May 10, 2019 at 11:48:38AM +0200, Cornelia Huck wrote:
> > On Fri, 10 May 2019 10:36:09 +0100
> > "Dr. David Alan Gilbert"  wrote:
> >
> > > * Cornelia Huck (coh...@redhat.com) wrote:
> > > > On Thu, 9 May 2019 17:48:26 +0100
> > > > "Dr. David Alan Gilbert"  wrote:
> > > >
> > > > > * Cornelia Huck (coh...@redhat.com) wrote:
> > > > > > On Thu, 9 May 2019 16:48:57 +0100
> > > > > > "Dr. David Alan Gilbert"  wrote:
> > > > > >
> > > > > > > * Cornelia Huck (coh...@redhat.com) wrote:
> > > > > > > > On Tue, 7 May 2019 15:18:26 -0600
> > > > > > > > Alex Williamson  wrote:
> > > > > > > >
> > > > > > > > > On Sun,  5 May 2019 21:49:04 -0400
> > > > > > > > > Yan Zhao  wrote:
> > > > > > > >
> > > > > > > > > > +  Errno:
> > > > > > > > > > +  If vendor driver wants to claim a mdev device 
> > > > > > > > > > incompatible to all other mdev
> > > > > > > > > > +  devices, it should not register version attribute for 
> > > > > > > > > > this mdev device. But if
> > > > > > > > > > +  a vendor driver has already registered version attribute 
> > > > > > > > > > and it wants to claim
> > > > > > > > > > +  a mdev device incompatible to all other mdev devices, it 
> > > > > > > > > > needs to return
> > > > > > > > > > +  -ENODEV on access to this mdev device's version 
> > > > > > > > > > attribute.
> > > > > > > > > > +  If a mdev device is only incompatible to certain mdev 
> > > > > > > > > > devices, write of
> > > > > > > > > > +  incompatible mdev devices's version strings to its 
> > > > > > > > > > version attribute should
> > > > > > > > > > +  return -EINVAL;
> > > > > > > > >
> > > > > > > > > I think it's best not to define the specific errno returned 
> > > > > > > > > for a
> > > > > > > > > specific situation, let the vendor driver decide, userspace 
> > > > > > > > > simply
> > > > > > > > > needs to know that an errno on read indicates the device does 
> > > > > > > > > not
> > > > > > > > > support migration version comparison and that an errno on 
> > > > > > > > > write
> > > > > > > > > indicates the devices are incompatible or the target doesn't 
> > > > > > > > > support
> > > > > > > > > migration versions.
> > > > > > > >
> > > > > > > > I think I have to disagree here: It's probably valuable to have 
> > > > > > > > an
> > > > > > > > agreed error for 'cannot migrate at all' vs 'cannot migrate 
> > > > > > > > between
> > > > > > > > those two particular devices'. Userspace might want to do 
> > > > > > > > different
> > > > > > > > things (e.g. trying with different device pairs).
> > > > > > >
> > > > > > > Trying to stuff these things down an errno seems a bad idea; we 
> > > > > > > can't
> > > > > > > get much information that way.
> > > > > >
> > > > > > So, what would be a reasonable approach? Userspace should first read
> > > > > > the version attributes on both devices (to find out whether 
> > > > > > migration
> > > > > > is supported at all), and only then figure out via writing whether 
> > > > > > they
> > > > > > are compatible?
> > > > > >
> > > > > > (Or just go ahead and try, if it does not care about the reason.)
> > > > >
> > > > > Well, I'm OK with something like writing to test whether it's
> > > > > compatible, it's just we need a better way of saying 'no'.
> > > > > I'm no

Re: [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

2019-05-12 Thread Yan Zhao
On Fri, May 10, 2019 at 05:48:38PM +0800, Cornelia Huck wrote:
> On Fri, 10 May 2019 10:36:09 +0100
> "Dr. David Alan Gilbert"  wrote:
> 
> > * Cornelia Huck (coh...@redhat.com) wrote:
> > > On Thu, 9 May 2019 17:48:26 +0100
> > > "Dr. David Alan Gilbert"  wrote:
> > >   
> > > > * Cornelia Huck (coh...@redhat.com) wrote:  
> > > > > On Thu, 9 May 2019 16:48:57 +0100
> > > > > "Dr. David Alan Gilbert"  wrote:
> > > > > 
> > > > > > * Cornelia Huck (coh...@redhat.com) wrote:
> > > > > > > On Tue, 7 May 2019 15:18:26 -0600
> > > > > > > Alex Williamson  wrote:
> > > > > > >   
> > > > > > > > On Sun,  5 May 2019 21:49:04 -0400
> > > > > > > > Yan Zhao  wrote:  
> > > > > > >   
> > > > > > > > > +  Errno:
> > > > > > > > > +  If vendor driver wants to claim a mdev device incompatible 
> > > > > > > > > to all other mdev
> > > > > > > > > +  devices, it should not register version attribute for this 
> > > > > > > > > mdev device. But if
> > > > > > > > > +  a vendor driver has already registered version attribute 
> > > > > > > > > and it wants to claim
> > > > > > > > > +  a mdev device incompatible to all other mdev devices, it 
> > > > > > > > > needs to return
> > > > > > > > > +  -ENODEV on access to this mdev device's version attribute.
> > > > > > > > > +  If a mdev device is only incompatible to certain mdev 
> > > > > > > > > devices, write of
> > > > > > > > > +  incompatible mdev devices's version strings to its version 
> > > > > > > > > attribute should
> > > > > > > > > +  return -EINVAL;
> > > > > > > > 
> > > > > > > > I think it's best not to define the specific errno returned for 
> > > > > > > > a
> > > > > > > > specific situation, let the vendor driver decide, userspace 
> > > > > > > > simply
> > > > > > > > needs to know that an errno on read indicates the device does 
> > > > > > > > not
> > > > > > > > support migration version comparison and that an errno on write
> > > > > > > > indicates the devices are incompatible or the target doesn't 
> > > > > > > > support
> > > > > > > > migration versions.  
> > > > > > > 
> > > > > > > I think I have to disagree here: It's probably valuable to have an
> > > > > > > agreed error for 'cannot migrate at all' vs 'cannot migrate 
> > > > > > > between
> > > > > > > those two particular devices'. Userspace might want to do 
> > > > > > > different
> > > > > > > things (e.g. trying with different device pairs).  
> > > > > > 
> > > > > > Trying to stuff these things down an errno seems a bad idea; we 
> > > > > > can't
> > > > > > get much information that way.
> > > > > 
> > > > > So, what would be a reasonable approach? Userspace should first read
> > > > > the version attributes on both devices (to find out whether migration
> > > > > is supported at all), and only then figure out via writing whether 
> > > > > they
> > > > > are compatible?
> > > > > 
> > > > > (Or just go ahead and try, if it does not care about the reason.)
> > > > 
> > > > Well, I'm OK with something like writing to test whether it's
> > > > compatible, it's just we need a better way of saying 'no'.
> > > > I'm not sure if that involves reading back from somewhere after
> > > > the write or what.  
> > > 
> > > Hm, so I basically see two ways of doing that:
> > > - standardize on some error codes... problem: error codes can be hard
> > >   to fit to reasons
> > > - make the error available in some attribute that can be read
> > > 
> > > I'm not sure how we can serialize the readback with the last write,
> > > though (this looks

Re: [libvirt] [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

2019-05-09 Thread Yan Zhao
On Thu, May 09, 2019 at 11:24:49PM +0800, Cornelia Huck wrote:
> On Wed, 8 May 2019 07:57:05 -0400
> Yan Zhao  wrote:
> 
> > On Tue, May 07, 2019 at 05:19:54PM +0800, Cornelia Huck wrote:
> > > On Sun,  5 May 2019 21:49:04 -0400
> > > Yan Zhao  wrote:
> > >   
> > > > version attribute is used to check two mdev devices' compatibility.
> > > > 
> > > > The key point of this version attribute is that it's rw.
> > > > User space has no need to understand internal of device version and no
> > > > need to compare versions by itself.
> > > > Compared to reading version strings from both two mdev devices being
> > > > checked, user space only reads from one mdev device's version attribute.
> > > > After getting its version string, user space writes this string into the
> > > > other mdev device's version attribute. Vendor driver of mdev device
> > > > whose version attribute being written will check device compatibility of
> > > > the two mdev devices for user space and return success for compatibility
> > > > or errno for incompatibility.  
> > > 
> > > I'm still missing a bit _what_ is actually supposed to be
> > > compatible/incompatible. I'd assume some internal state descriptions
> > > (even if this is not actually limited to migration).
> > >  
> > right.
> > originally, I thought this attribute should only contain a device's hardware
> > compatibility info. But it seems more reasonable to also include the vendor
> > specific software migration version, because general VFIO migration code
> > cannot know the version of vendor specific software migration code until
> > migration data is being transferred to the target vm. So renaming it to
> > migration_version is more appropriate.
> > :)
> 
> Nod.
> 
> (...)
> 
> > > > @@ -246,6 +249,143 @@ Directories and files under the sysfs for Each 
> > > > Physical Device
> > > >This attribute should show the number of devices of type  
> > > > that can be
> > > >created.
> > > >  
> > > > +* version
> > > > +
> > > > +  This attribute is rw, and is optional.
> > > > +  It is used to check device compatibility between two mdev devices 
> > > > and is
> > > > +  accessed in pairs between the two mdev devices being checked.
> > > > +  The intent of this attribute is to make an mdev device's version 
> > > > opaque to
> > > > +  user space, so instead of reading two mdev devices' version strings 
> > > > and
> > > > +  comparing in userspace, user space should only read one mdev 
> > > > device's version
> > > > +  attribute, and writes this version string into the other mdev 
> > > > device's version
> > > > +  attribute. Then vendor driver of mdev device whose version attribute 
> > > > being
> > > > +  written would check the incoming version string and tell user space 
> > > > whether
> > > > +  the two mdev devices are compatible via return value. That's why this
> > > > +  attribute is writable.  
> > > 
> > > I would reword this a bit:
> > > 
> > > "This attribute provides a way to check device compatibility between
> > > two mdev devices from userspace. The intended usage is for userspace to
> > > read the version attribute from one mdev device and then writing that
> > > value to the version attribute of the other mdev device. The second
> > > mdev device indicates compatibility via the return code of the write
> > > operation. This makes compatibility between mdev devices completely
> > > vendor-defined and opaque to userspace."
> > > 
> > > We still should explain _what_ compatibility we're talking about here,
> > > though.
> > >   
> > Thanks. It's much better than mine:) 
> > Then I'll change compatibility --> migration compatibility.
> 
> Ok, with that it should be clear enough.
> 
> > 
> > > > +
> > > > +  when reading this attribute, it should show device version string of
> > > > +  the device of type .
> > > > +
> > > > +  This string is private to vendor driver itself. Vendor driver is 
> > > > able to
> > > > +  freely define format and length of device version string.
> > > > +  e.g. It can use a combination of pciid of parent device + mdev type.
> > > > +
> > > > 

Re: [libvirt] [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

2019-05-09 Thread Yan Zhao
On Wed, May 08, 2019 at 11:27:47PM +0800, Boris Fiuczynski wrote:
> On 5/8/19 11:22 PM, Alex Williamson wrote:
> >>> I thought there was a request to make this more specific to migration
> >>> by renaming it to something like migration_version.  Also, as an
> >>>   
> >> so this attribute may not only include a mdev device's parent device info 
> >> and
> >> mdev type, but also include numeric software version of vendor specific
> >> migration code, right?
> > It's a vendor defined string, it should be considered opaque to the
> > user, the vendor can include whatever they feel is relevant.
> > 
> Would a vendor also be allowed to provide a string expressing required 
> features as well as containing backend resource requirements which need 
> to be compatible for a successful migration? Somehow a bit like a cpu 
> model... maybe even as json or xml...
> I am asking this with vfio-ap in mind. In that context checking 
> compatibility of two vfio-ap mdev devices is not as simple as checking 
> if version A is smaller or equal to version B.
>
I think so. The vendor driver is allowed to put whatever content it
considers necessary into the migration_version string.
The vendor driver only needs to ensure that, in the target mdev device, the
write(2) operation on its migration_version attribute correctly fails or
succeeds based on the input string.
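
As a rough sketch of what such a check could look like on the vendor side
(written as plain C so it can be tried in userspace, and not taken from the GVT
patch): the "vendor-device-swversion" hex format, the struct and function
names, and the policy of accepting only an equal or older software version are
all assumptions made up for this example.

```c
#include <errno.h>
#include <stdio.h>

/* Hypothetical decoded form of a "vendor-device-swversion" string. */
struct mig_version {
	unsigned int vendor;
	unsigned int device;
	unsigned int sw;
};

/* Loose parser: three hex fields separated by '-', optional trailing '\n'. */
static int parse_version(const char *s, struct mig_version *v)
{
	int used = 0;

	if (sscanf(s, "%x-%x-%x%n", &v->vendor, &v->device, &v->sw, &used) != 3)
		return -EINVAL;
	if (s[used] == '\n')
		used++;
	return s[used] == '\0' ? 0 : -EINVAL;
}

/*
 * Returns 0 if a device described by 'incoming' may migrate to the device
 * described by 'local', -EINVAL otherwise.
 */
int check_migration_version(const char *local, const char *incoming)
{
	struct mig_version l, r;

	if (parse_version(local, &l) || parse_version(incoming, &r))
		return -EINVAL;

	/* Assumed policy: same hardware, source software not newer. */
	if (l.vendor != r.vendor || l.device != r.device)
		return -EINVAL;

	return r.sw <= l.sw ? 0 : -EINVAL;
}
```

A string carrying feature lists or resource requirements, as asked about for
vfio-ap, would need a different parser and policy, but the write(2) result
would be produced in the same way.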

Thanks
Yan
> -- 
> Mit freundlichen Grüßen/Kind regards
> Boris Fiuczynski
> 
> IBM Deutschland Research & Development GmbH
> Vorsitzender des Aufsichtsrats: Matthias Hartmann
> Geschäftsführung: Dirk Wittkopp
> Sitz der Gesellschaft: Böblingen
> Registergericht: Amtsgericht Stuttgart, HRB 243294
> 


Re: [libvirt] [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

2019-05-08 Thread Yan Zhao
On Thu, May 09, 2019 at 11:38:34AM +0800, Alex Williamson wrote:
> On Wed, 8 May 2019 23:10:55 -0400
> Yan Zhao  wrote:
> 
> > On Thu, May 09, 2019 at 05:22:42AM +0800, Alex Williamson wrote:
> > > On Wed, 8 May 2019 07:27:40 -0400
> > > Yan Zhao  wrote:
> > >   
> > > > On Wed, May 08, 2019 at 05:18:26AM +0800, Alex Williamson wrote:  
> > > > > On Sun,  5 May 2019 21:49:04 -0400
> > > > > Yan Zhao  wrote:
> > > > > 
> > > > > > version attribute is used to check two mdev devices' compatibility.
> > > > > > 
> > > > > > The key point of this version attribute is that it's rw.
> > > > > > > User space has no need to understand internal of device version and no
> > > > > > > need to compare versions by itself.
> > > > > > > Compared to reading version strings from both two mdev devices being
> > > > > > > checked, user space only reads from one mdev device's version attribute.
> > > > > > > After getting its version string, user space writes this string into the
> > > > > > > other mdev device's version attribute. Vendor driver of mdev device
> > > > > > > whose version attribute being written will check device compatibility of
> > > > > > > the two mdev devices for user space and return success for compatibility
> > > > > > > or errno for incompatibility.
> > > > > > > So two readings of version attributes + checking in user space are now
> > > > > > > changed to one reading + one writing of version attributes + checking in
> > > > > > > vendor driver.
> > > > > > > Format and length of version strings are now private to vendor driver
> > > > > > > who can define them freely.
> > > > > > 
> > > > > > >          __ user space
> > > > > > >         /\           \
> > > > > > >        /              \write
> > > > > > >       / read           \
> > > > > > >  ____/____           ___\|/___
> > > > > > > | version |         | version |-->check compatibility
> > > > > > > -----------         -----------
> > > > > > > mdev device A       mdev device B
> > > > > > 
> > > > > > > This version attribute is optional. If a mdev device does not provide
> > > > > > > with a version attribute, this mdev device is incompatible to all other
> > > > > > > mdev devices.
> > > > > > 
> > > > > > Live migration is able to take advantage of this version attribute.
> > > > > > Before user space actually starts live migration, it can first check
> > > > > > whether two mdev devices are compatible.
> > > > > > 
> > > > > > v2:
> > > > > > 1. added detailed intent and usage
> > > > > > > 2. made definition of version string completely private to vendor driver
> > > > > > >    (Alex Williamson)
> > > > > > 3. abandoned changes to sample mdev drivers (Alex Williamson)
> > > > > > 4. mandatory --> optional (Cornelia Huck)
> > > > > > 5. added description for errno (Cornelia Huck)
> > > > > > 
> > > > > > Cc: Alex Williamson 
> > > > > > Cc: Erik Skultety 
> > > > > > Cc: "Dr. David Alan Gilbert" 
> > > > > > Cc: Cornelia Huck 
> > > > > > Cc: "Tian, Kevin" 
> > > > > > Cc: Zhenyu Wang 
> > > > > > Cc: "Wang, Zhi A" 
> > > > > > Cc: Neo Jia 
> > > > > > Cc: Kirti Wankhede 
> > > > > > Cc: Daniel P. Berrangé 
> > > > > > Cc: Christophe de Dinechin 
> > > > > > 
> > > > > > Signed-off-by: Yan Zhao 
> > > > > > ---
> > > > > > >  Documentation/vfio-mediated-device.txt | 140 +
> > > > > >  1 file changed, 140 insertions(+)
> > > > > > 
> > > > > >

Re: [libvirt] [PATCH v2 1/2] vfio/mdev: add version attribute for mdev device

2019-05-08 Thread Yan Zhao
On Thu, May 09, 2019 at 05:22:42AM +0800, Alex Williamson wrote:
> On Wed, 8 May 2019 07:27:40 -0400
> Yan Zhao  wrote:
> 
> > On Wed, May 08, 2019 at 05:18:26AM +0800, Alex Williamson wrote:
> > > On Sun,  5 May 2019 21:49:04 -0400
> > > Yan Zhao  wrote:
> > >   
> > > > version attribute is used to check two mdev devices' compatibility.
> > > > 
> > > > The key point of this version attribute is that it's rw.
> > > > User space has no need to understand internal of device version and no
> > > > need to compare versions by itself.
> > > > Compared to reading version strings from both two mdev devices being
> > > > checked, user space only reads from one mdev device's version attribute.
> > > > After getting its version string, user space writes this string into the
> > > > other mdev device's version attribute. Vendor driver of mdev device
> > > > whose version attribute being written will check device compatibility of
> > > > the two mdev devices for user space and return success for compatibility
> > > > or errno for incompatibility.
> > > > So two readings of version attributes + checking in user space are now
> > > > changed to one reading + one writing of version attributes + checking in
> > > > vendor driver.
> > > > Format and length of version strings are now private to vendor driver
> > > > who can define them freely.
> > > > 
> > > > >          __ user space
> > > > >         /\           \
> > > > >        /              \write
> > > > >       / read           \
> > > > >  ____/____           ___\|/___
> > > > > | version |         | version |-->check compatibility
> > > > > -----------         -----------
> > > > > mdev device A       mdev device B
> > > > 
> > > > This version attribute is optional. If a mdev device does not provide
> > > > with a version attribute, this mdev device is incompatible to all other
> > > > mdev devices.
> > > > 
> > > > Live migration is able to take advantage of this version attribute.
> > > > Before user space actually starts live migration, it can first check
> > > > whether two mdev devices are compatible.
> > > > 
> > > > v2:
> > > > 1. added detailed intent and usage
> > > > 2. made definition of version string completely private to vendor driver
> > > > >    (Alex Williamson)
> > > > 3. abandoned changes to sample mdev drivers (Alex Williamson)
> > > > 4. mandatory --> optional (Cornelia Huck)
> > > > 5. added description for errno (Cornelia Huck)
> > > > 
> > > > Cc: Alex Williamson 
> > > > Cc: Erik Skultety 
> > > > Cc: "Dr. David Alan Gilbert" 
> > > > Cc: Cornelia Huck 
> > > > Cc: "Tian, Kevin" 
> > > > Cc: Zhenyu Wang 
> > > > Cc: "Wang, Zhi A" 
> > > > Cc: Neo Jia 
> > > > Cc: Kirti Wankhede 
> > > > Cc: Daniel P. Berrangé 
> > > > Cc: Christophe de Dinechin 
> > > > 
> > > > Signed-off-by: Yan Zhao 
> > > > ---
> > > >  Documentation/vfio-mediated-device.txt | 140 +
> > > >  1 file changed, 140 insertions(+)
> > > > 
> > > > > diff --git a/Documentation/vfio-mediated-device.txt b/Documentation/vfio-mediated-device.txt
> > > > index c3f69bcaf96e..013a764968eb 100644
> > > > --- a/Documentation/vfio-mediated-device.txt
> > > > +++ b/Documentation/vfio-mediated-device.txt
> > > > @@ -202,6 +202,7 @@ Directories and files under the sysfs for Each 
> > > > Physical Device
> > > >| |   |--- available_instances
> > > >| |   |--- device_api
> > > >| |   |--- description
> > > > +  | |   |--- version
> > > >| |   |--- [devices]
> > > >| |--- []
> > > >| |   |--- create
> > > > @@ -209,6 +210,7 @@ Directories and files under the sysfs for Each 
> > > > Physical Device
> > > >| |   |--- available_instances
> > > >| |   |--- device_api
> > > >| |   |--- description
> > > > +  | |   |--- version
> > > >| |   |--- [devices]
> > > >| |--- []
> > > >|

Re: [libvirt] [PATCH v2 2/2] drm/i915/gvt: export mdev device version to sysfs for Intel vGPU

2019-05-08 Thread Yan Zhao
On Wed, May 08, 2019 at 06:50:33PM +0800, Dr. David Alan Gilbert wrote:
> * Yan Zhao (yan.y.z...@intel.com) wrote:
> > This feature implements the version attribute for Intel's vGPU mdev
> > devices.
> > 
> > version attribute is rw.
> > It's used to check device compatibility for two mdev devices.
> > version string format and length are private for vendor driver. vendor
> > driver is able to define them freely.
> > 
> > For Intel vGPU of gen8 and gen9, the mdev device version
> > consists of 3 fields: "vendor id" + "device id" + "mdev type".
> > 
> > Reading from a vGPU's version attribute, a string is returned in below
> > > format: <vendor id>-<device id>-<mdev type>. e.g.
> > 8086-193b-i915-GVTg_V5_2.
> > 
> > Writing a string to a vGPU's version attribute will trigger GVT to check
> > whether a vGPU identified by the written string is compatible with
> > current vGPU owning this version attribute. errno is returned if the two
> > vGPUs are incompatible. The length of written string is returned in
> > compatible case.
> > 
> > For other platforms, and for GVT not supporting vGPU live migration
> > feature, errnos are returned when read/write of mdev devices' version
> > attributes.
> > 
> > For old GVT versions where no version attributes exposed in sysfs, it is
> > regarded as not supporting vGPU live migration.
> > 
> > For future platforms, besides the current 2 fields in vendor proprietary
> > part, more fields may be added to identify Intel vGPU well for live
> > migration purpose.
> > 
> > v2:
> > 1. removed 32 common part of version string
> > (Alex Williamson)
> > 2. do not register version attribute for GVT not supporting live
> > migration.(Cornelia Huck)
> > 3. for platforms out of gen8, gen9, return -EINVAL --> -ENODEV for
> > incompatible. (Cornelia Huck)
> > 
> > Cc: Alex Williamson 
> > Cc: Erik Skultety 
> > Cc: "Dr. David Alan Gilbert" 
> > Cc: Cornelia Huck 
> > Cc: "Tian, Kevin" 
> > Cc: Zhenyu Wang 
> > Cc: "Wang, Zhi A" 
> > > Cc: Neo Jia 
> > Cc: Kirti Wankhede 
> > 
> > Signed-off-by: Yan Zhao 
> > ---
> >  drivers/gpu/drm/i915/gvt/Makefile |  2 +-
> >  drivers/gpu/drm/i915/gvt/device_version.c | 87 +++
> >  drivers/gpu/drm/i915/gvt/gvt.c| 51 +
> >  drivers/gpu/drm/i915/gvt/gvt.h|  6 ++
> >  4 files changed, 145 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/gpu/drm/i915/gvt/device_version.c
> > 
> > > diff --git a/drivers/gpu/drm/i915/gvt/Makefile b/drivers/gpu/drm/i915/gvt/Makefile
> > index 271fb46d4dd0..54e209a23899 100644
> > --- a/drivers/gpu/drm/i915/gvt/Makefile
> > +++ b/drivers/gpu/drm/i915/gvt/Makefile
> > @@ -3,7 +3,7 @@ GVT_DIR := gvt
> > >  GVT_SOURCE := gvt.o aperture_gm.o handlers.o vgpu.o trace_points.o firmware.o \
> > > interrupt.o gtt.o cfg_space.o opregion.o mmio.o display.o edid.o \
> > > execlist.o scheduler.o sched_policy.o mmio_context.o cmd_parser.o debugfs.o \
> > -   fb_decoder.o dmabuf.o page_track.o
> > +   fb_decoder.o dmabuf.o page_track.o device_version.o
> >  
> >  ccflags-y  += -I$(src) -I$(src)/$(GVT_DIR)
> > >  i915-y += $(addprefix $(GVT_DIR)/, $(GVT_SOURCE))
> > > diff --git a/drivers/gpu/drm/i915/gvt/device_version.c b/drivers/gpu/drm/i915/gvt/device_version.c
> > new file mode 100644
> > index ..bd4cdcbdba95
> > --- /dev/null
> > +++ b/drivers/gpu/drm/i915/gvt/device_version.c
> > @@ -0,0 +1,87 @@
> > +/*
> > + * Copyright(c) 2011-2017 Intel Corporation. All rights reserved.
> > + *
> > + * Permission is hereby granted, free of charge, to any person obtaining a
> > > + * copy of this software and associated documentation files (the "Software"),
> > > + * to deal in the Software without restriction, including without limitation
> > > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > + * Software is furnished to do so, subject to the following conditions:
> > > + *
> > > + * The above copyright notice and this permission notice (including the next
> > > + * paragraph) shall be included in all copies or substantial portions of the
> > > + * Software.
> > + *
> > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRA

Re: [libvirt] [PATCH v2 2/2] drm/i915/gvt: export mdev device version to sysfs for Intel vGPU

2019-05-08 Thread Yan Zhao
On Tue, May 07, 2019 at 05:27:53PM +0800, Cornelia Huck wrote:
> On Sun,  5 May 2019 21:51:02 -0400
> Yan Zhao  wrote:
> 
> > This feature implements the version attribute for Intel's vGPU mdev
> > devices.
> > 
> > version attribute is rw.
> > It's used to check device compatibility for two mdev devices.
> > version string format and length are private for vendor driver. vendor
> > driver is able to define them freely.
> > 
> > For Intel vGPU of gen8 and gen9, the mdev device version
> > consists of 3 fields: "vendor id" + "device id" + "mdev type".
> > 
> > Reading from a vGPU's version attribute, a string is returned in below
> > > format: <vendor id>-<device id>-<mdev type>. e.g.
> > 8086-193b-i915-GVTg_V5_2.
> > 
> > Writing a string to a vGPU's version attribute will trigger GVT to check
> > whether a vGPU identified by the written string is compatible with
> > current vGPU owning this version attribute. errno is returned if the two
> > vGPUs are incompatible. The length of written string is returned in
> > compatible case.
> > 
> > For other platforms, and for GVT not supporting vGPU live migration
> > feature, errnos are returned when read/write of mdev devices' version
> > attributes.
> > 
> > For old GVT versions where no version attributes exposed in sysfs, it is
> > regarded as not supporting vGPU live migration.
> > 
> > For future platforms, besides the current 2 fields in vendor proprietary
> > part, more fields may be added to identify Intel vGPU well for live
> > migration purpose.
> > 
> > v2:
> > 1. removed 32 common part of version string
> > (Alex Williamson)
> > 2. do not register version attribute for GVT not supporting live
> > migration.(Cornelia Huck)
> > 3. for platforms out of gen8, gen9, return -EINVAL --> -ENODEV for
> > incompatible. (Cornelia Huck)
> 
> Should go below '---'.
>
got it. will change it in next revision.

> > 
> > Cc: Alex Williamson 
> > Cc: Erik Skultety 
> > Cc: "Dr. David Alan Gilbert" 
> > Cc: Cornelia Huck 
> > Cc: "Tian, Kevin" 
> > Cc: Zhenyu Wang 
> > Cc: "Wang, Zhi A" 
> > > Cc: Neo Jia 
> > Cc: Kirti Wankhede 
> > 
> > Signed-off-by: Yan Zhao 
> > ---
> >  drivers/gpu/drm/i915/gvt/Makefile |  2 +-
> >  drivers/gpu/drm/i915/gvt/device_version.c | 87 +++
> >  drivers/gpu/drm/i915/gvt/gvt.c| 51 +
> >  drivers/gpu/drm/i915/gvt/gvt.h|  6 ++
> >  4 files changed, 145 insertions(+), 1 deletion(-)
> >  create mode 100644 drivers/gpu/drm/i915/gvt/device_version.c
> > 
> 
> (...)
> 
> > > diff --git a/drivers/gpu/drm/i915/gvt/device_version.c b/drivers/gpu/drm/i915/gvt/device_version.c
> > new file mode 100644
> > index ..bd4cdcbdba95
> > --- /dev/null
> > +++ b/drivers/gpu/drm/i915/gvt/device_version.c
> > @@ -0,0 +1,87 @@
> > +/*
> > + * Copyright(c) 2011-2017 Intel Corporation. All rights reserved.
> > + *
> > + * Permission is hereby granted, free of charge, to any person obtaining a
> > > + * copy of this software and associated documentation files (the "Software"),
> > > + * to deal in the Software without restriction, including without limitation
> > > + * the rights to use, copy, modify, merge, publish, distribute, sublicense,
> > > + * and/or sell copies of the Software, and to permit persons to whom the
> > > + * Software is furnished to do so, subject to the following conditions:
> > > + *
> > > + * The above copyright notice and this permission notice (including the next
> > > + * paragraph) shall be included in all copies or substantial portions of the
> > > + * Software.
> > > + *
> > > + * THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
> > > + * IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
> > > + * FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.  IN NO EVENT SHALL
> > > + * THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
> > > + * LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
> > > + * OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
> > > + * SOFTWARE.
> > + *
> > + * Authors:
> > + *Yan Zhao 
> > + */
> > +#include 
> > +#include "i915_drv.h"
> > +
> > +static bool is_compatible(const char *self, const 
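
The archived message breaks off here, so the body of is_compatible() is not
available. As a rough user-space illustration of the three-field string
described in the commit message (vendor id, device id, mdev type), the sketch
below splits a string such as 8086-193b-i915-GVTg_V5_2 and compares two of
them. The struct, the function names, the field sizes, and the
"all three fields must be equal" rule are assumptions made for this example;
they are not taken from the patch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical holder for the three fields of a version string. */
struct vgpu_version {
    char vendor_id[8];
    char device_id[8];
    char mdev_type[64];
};

/* Copy len bytes into dst as a NUL-terminated string; reject empty or
 * oversized fields. */
static bool copy_field(char *dst, size_t dst_size, const char *start, size_t len)
{
    if (len == 0 || len >= dst_size)
        return false;
    memcpy(dst, start, len);
    dst[len] = '\0';
    return true;
}

/* Split on the first two '-' only: the mdev type may itself contain '-',
 * as in "i915-GVTg_V5_2". */
bool parse_vgpu_version(const char *s, struct vgpu_version *out)
{
    const char *first = strchr(s, '-');
    if (first == NULL)
        return false;
    const char *second = strchr(first + 1, '-');
    if (second == NULL)
        return false;

    return copy_field(out->vendor_id, sizeof(out->vendor_id),
                      s, (size_t)(first - s)) &&
           copy_field(out->device_id, sizeof(out->device_id),
                      first + 1, (size_t)(second - first - 1)) &&
           copy_field(out->mdev_type, sizeof(out->mdev_type),
                      second + 1, strlen(second + 1));
}

/* Assumed rule for this sketch: compatible only if every field matches. */
bool versions_match(const char *self, const char *remote)
{
    struct vgpu_version a, b;

    if (!parse_vgpu_version(self, &a) || !parse_vgpu_version(remote, &b))
        return false;
    return strcmp(a.vendor_id, b.vendor_id) == 0 &&
           strcmp(a.device_id, b.device_id) == 0 &&
           strcmp(a.mdev_type, b.mdev_type) == 0;
}
```

A real vendor driver is free to apply a looser rule, since the string format
is private to it; strict equality is simply the easiest policy to show.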
