RE: [RFC 00/18] vfio: Adopt iommufd

2022-04-28 Thread Tian, Kevin
> From: Daniel P. Berrangé 
> Sent: Friday, April 29, 2022 12:20 AM
> 
> On Thu, Apr 28, 2022 at 08:24:48AM -0600, Alex Williamson wrote:
> > On Thu, 28 Apr 2022 03:21:45 +0000
> > "Tian, Kevin"  wrote:
> >
> > > > From: Alex Williamson 
> > > > Sent: Wednesday, April 27, 2022 12:22 AM
> > > > > >
> > > > > > My expectation would be that libvirt uses:
> > > > > >
> > > > > >  -object iommufd,id=iommufd0,fd=NNN
> > > > > >  -device vfio-pci,fd=MMM,iommufd=iommufd0
> > > > > >
> > > > > > Whereas simple QEMU command line would be:
> > > > > >
> > > > > >  -object iommufd,id=iommufd0
> > > > > >  -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
> > > > > >
> > > > > > The iommufd object would open /dev/iommufd itself.  Creating an
> > > > > > implicit iommufd object is somewhat problematic because one of the
> > > > > > things I forgot to highlight in my previous description is that the
> > > > > > iommufd object is meant to be shared across not only various vfio
> > > > > > devices (platform, ccw, ap, nvme, etc), but also across subsystems,
> > > > > > ex. vdpa.
> > > > >
> > > > > Out of curiosity - in concept one iommufd is sufficient to support all
> > > > > ioas requirements across subsystems, while having multiple iommufd's
> > > > > instead loses the benefit of centralized accounting. The latter will also
> > > > > cause some trouble when we start virtualizing ENQCMD, which requires
> > > > > VM-wide PASID virtualization and thus further needs to share that
> > > > > information across iommufd's. Not unsolvable, but there is really no gain
> > > > > in adding such complexity. So I'm curious whether Qemu provides
> > > > > a way to restrict that a certain object type can only have one instance,
> > > > > to discourage such a multi-iommufd attempt?
> > > >
> > > > I don't see any reason for QEMU to restrict iommufd objects.  The QEMU
> > > > philosophy seems to be to let users create whatever configuration they
> > > > want.  For libvirt though, the assumption would be that a single
> > > > iommufd object can be used across subsystems, so libvirt would never
> > > > automatically create multiple objects.
> > >
> > > I like the flexibility that the object approach gives in your proposal.
> > > But with the aforementioned complexity in mind (and no foreseen benefit), I wonder
> >
> > What's the actual complexity?  Front-end/backend splits are very common
> > in QEMU.  We're making the object connection via name, why is it
> > significantly more complicated to allow multiple iommufd objects?  On
> > the contrary, it seems to me that we'd need to go out of our way to add
> > code to block multiple iommufd objects.

Probably it's just a hypothetical concern that came up when I thought about
the need to manage certain global information (e.g. PASID virtualization)
across iommufd's down the road. With your and Daniel's replies, I think we'll
first try to follow the common practice in Qemu, given there are
more positive reasons to do so than the hypothetical concern itself.
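
For illustration, the kind of sharing Alex describes would look like this on
the QEMU command line (a sketch; the host addresses are illustrative, and
only vfio-pci is shown since the other subsystem wirings did not exist yet):

  -object iommufd,id=iommufd0
  -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
  -device vfio-pci,iommufd=iommufd0,host=0000:03:00.0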

> >
> > > whether an alternative approach which treats iommufd as a global
> > > property instead of an object is acceptable in Qemu, i.e.:
> > >
> > > -iommufd on/off
> > > -device vfio-pci,iommufd,[fd=MMM/host=0000:02:00.0]
> > >
> > > All devices with iommufd specified then implicitly share a single iommufd
> > > object within Qemu.
> >
> > QEMU requires key-value pairs AFAIK, so the above doesn't work, then
> > we're just back to the iommufd=on/off.
> >
> > > This still allows vfio devices to be specified via fd but just requires
> > > Libvirt to grant file permission on /dev/iommu. Is it a worthwhile tradeoff
> > > to be considered, or just not a typical way in Qemu philosophy, e.g. any
> > > object associated with a device must be explicitly specified?
> >
> > Avoiding QEMU opening files was a significant focus of my alternate
> > proposal.  Also note that we must be able to support hotplug, so we
> > need to be able to dynamically add and remove the iommufd object, I
> > don't see that a global property allows for that.  Implicit
> > associations of devices to shared resources don't seem particularly
> > desirable to me.  Thanks,
> 
> Adding new global properties/options is rather an anti-pattern for QEMU
> these days. Using -object is the right approach. If you only want to
> allow for one of them, just document this requirement. We've got other
> objects which are singletons like all the confidential guest classes
> for each arch.
> 

Good to know there is such a last resort. As said, we'll try to avoid this
restriction and follow Alex's proposal, unless unexpectedly unreasonable
complexities arise later.

Thanks
Kevin


RE: [RFC 00/18] vfio: Adopt iommufd

2022-04-27 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Wednesday, April 27, 2022 12:22 AM
> > >
> > > My expectation would be that libvirt uses:
> > >
> > >  -object iommufd,id=iommufd0,fd=NNN
> > >  -device vfio-pci,fd=MMM,iommufd=iommufd0
> > >
> > > Whereas simple QEMU command line would be:
> > >
> > >  -object iommufd,id=iommufd0
> > >  -device vfio-pci,iommufd=iommufd0,host=0000:02:00.0
> > >
> > > The iommufd object would open /dev/iommufd itself.  Creating an
> > > implicit iommufd object is somewhat problematic because one of the
> > > things I forgot to highlight in my previous description is that the
> > > iommufd object is meant to be shared across not only various vfio
> > > devices (platform, ccw, ap, nvme, etc), but also across subsystems, ex.
> > > vdpa.
> >
> > Out of curiosity - in concept one iommufd is sufficient to support all
> > ioas requirements across subsystems, while having multiple iommufd's
> > instead loses the benefit of centralized accounting. The latter will also
> > cause some trouble when we start virtualizing ENQCMD, which requires
> > VM-wide PASID virtualization and thus further needs to share that
> > information across iommufd's. Not unsolvable, but there is really no gain
> > in adding such complexity. So I'm curious whether Qemu provides
> > a way to restrict that a certain object type can only have one instance,
> > to discourage such a multi-iommufd attempt?
> 
> I don't see any reason for QEMU to restrict iommufd objects.  The QEMU
> philosophy seems to be to let users create whatever configuration they
> want.  For libvirt though, the assumption would be that a single
> iommufd object can be used across subsystems, so libvirt would never
> automatically create multiple objects.

I like the flexibility that the object approach gives in your proposal.
But with the aforementioned complexity in mind (and no foreseen benefit), I wonder
whether an alternative approach which treats iommufd as a global
property instead of an object is acceptable in Qemu, i.e.:

-iommufd on/off
-device vfio-pci,iommufd,[fd=MMM/host=0000:02:00.0]

All devices with iommufd specified then implicitly share a single iommufd
object within Qemu.

This still allows vfio devices to be specified via fd but just requires Libvirt
to grant file permission on /dev/iommu. Is it a worthwhile tradeoff to be
considered, or just not a typical way in Qemu philosophy, e.g. any object
associated with a device must be explicitly specified?
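
For illustration, the extra grant this alternative asks of Libvirt could be
as small as the following (a sketch; whether libvirt would use an ACL, chown,
or something else is an assumption):

  # hypothetical: allow the qemu process user to open /dev/iommu directly
  setfacl -m u:qemu:rw /dev/iommu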

Thanks
Kevin


RE: [RFC 00/18] vfio: Adopt iommufd

2022-04-26 Thread Tian, Kevin
> From: Alex Williamson 
> Sent: Monday, April 25, 2022 10:38 PM
> 
> On Mon, 25 Apr 2022 11:10:14 +0100
> Daniel P. Berrangé  wrote:
> 
> > On Fri, Apr 22, 2022 at 04:09:43PM -0600, Alex Williamson wrote:
> > > [Cc +libvirt folks]
> > >
> > > On Thu, 14 Apr 2022 03:46:52 -0700
> > > Yi Liu  wrote:
> > >
> > > > With the introduction of iommufd[1], the linux kernel provides a generic
> > > > interface for userspace drivers to propagate their DMA mappings to kernel
> > > > for assigned devices. This series does the porting of the VFIO devices
> > > > onto the /dev/iommu uapi and lets it coexist with the legacy implementation.
> > > > Other devices like vdpa, vfio mdev and etc. are not considered yet.
> >
> > snip
> >
> > > > The selection of the backend is made on a device basis using the new
> > > > iommufd option (on/off/auto). By default the iommufd backend is selected
> > > > if supported by the host and by QEMU (iommufd KConfig). This option is
> > > > currently available only for the vfio-pci device. For other types of
> > > > devices, it does not yet exist and the legacy BE is chosen by default.
> > >
> > > I've discussed this a bit with Eric, but let me propose a different
> > > command line interface.  Libvirt generally likes to pass file
> > > descriptors to QEMU rather than grant it access to those files
> > > directly.  This was problematic with vfio-pci because libvirt can't
> > > easily know when QEMU will want to grab another /dev/vfio/vfio
> > > container.  Therefore we abandoned this approach and instead libvirt
> > > grants file permissions.
> > >
> > > However, with iommufd there's no reason that QEMU ever needs more than
> > > a single instance of /dev/iommufd, and we're using per-device vfio file
> > > descriptors, so it seems like a good time to revisit this.
> >
> > I assume access to '/dev/iommufd' gives the process somewhat elevated
> > privileges, such that you don't want to unconditionally give QEMU
> > access to this device ?
> 
> It's not that much dissimilar to /dev/vfio/vfio, it's an unprivileged
> interface which should have limited scope for abuse, but more so here
> the goal would be to de-privilege QEMU one step further so that it
> cannot open the device file itself.
> 
> > > The interface I was considering would be to add an iommufd object to
> > > QEMU, so we might have a:
> > >
> > > -device iommufd[,fd=#][,id=foo]
> > >
> > > For non-libvirt usage this would have the ability to open /dev/iommufd
> > > itself if an fd is not provided.  This object could be shared with
> > > other iommufd users in the VM and maybe we'd allow multiple instances
> > > for more esoteric use cases.  [NB, maybe this should be a -object rather
> > > than -device since the iommufd is not a guest visible device?]
> >
> > Yes,  -object would be the right answer for something that's purely
> > a host side backend impl selector.
> >
> > > The vfio-pci device might then become:
> > >
> > > -device vfio-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
> > >
> > > So essentially we can specify the device via host, sysfsdev, or passing
> > > an fd to the vfio device file.  When an iommufd object is specified,
> > > "foo" in the example above, each of those options would use the
> > > vfio-device access mechanism, essentially the same as iommufd=on in
> > > your example.  With the fd passing option, an iommufd object would be
> > > required and necessarily use device level access.
> > >
> > > In your example, the iommufd=auto seems especially troublesome for
> > > libvirt because QEMU is going to have different locked memory
> > > requirements based on whether we're using type1 or iommufd, where the
> > > latter resolves the duplicate accounting issues.  libvirt needs to know

Based on the current plan there is probably a transition window between the
point where the first vfio device type (vfio-pci) gains iommufd support
and the point where all vfio types support iommufd. Libvirt can figure
out whether to use iommufd by checking the presence of
/dev/vfio/devices/vfioX. But what would be the resource limit policy
in Libvirt in such a transition window, when both type1 and iommufd might
be used? Or do we just expect Libvirt to support iommufd only after the
transition window ends, to avoid handling such a mess?
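
For illustration, the probe could be as simple as the following (a sketch,
assuming the per-device /dev/vfio/devices/vfioX naming from this series; the
fallback policy is hypothetical):

  if ls /dev/vfio/devices/vfio* >/dev/null 2>&1; then
      echo "per-device vfio cdev present: iommufd backend usable"
  else
      echo "only legacy /dev/vfio/vfio container: assume type1 accounting"
  fi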

> > > deterministically which backend is being used, which this proposal seems
> > > to provide, while at the same time bringing us more in line with fd
> > > passing.  Thoughts?  Thanks,
> >
> > Yep, I agree that libvirt needs to have more direct control over this.
> > This is also even more important if there are notable feature differences
> > in the 2 backends.
> >
> > I wonder if anyone has considered an even more distinct impl, whereby
> > we have a completely different device type on the backend, eg
> >
> >   -device vfio-iommu-pci[,host=DDDD:BB:DD.f][,sysfsdev=/sys/path/to/device][,fd=#][,iommufd=foo]
> >
> > If a vendor wants to fully 

RE: device compatibility interface for live migration with assigned devices

2020-09-11 Thread Tian, Kevin
> From: Cornelia Huck 
> Sent: Friday, September 11, 2020 6:08 PM
> 
> On Fri, 11 Sep 2020 08:56:00 +0800
> Yan Zhao  wrote:
> 
> > On Thu, Sep 10, 2020 at 12:02:44PM -0600, Alex Williamson wrote:
> > > On Thu, 10 Sep 2020 13:50:11 +0100
> > > Sean Mooney  wrote:
> > >
> > > > On Thu, 2020-09-10 at 14:38 +0200, Cornelia Huck wrote:
> > > > > On Wed, 9 Sep 2020 10:13:09 +0800
> > > > > Yan Zhao  wrote:
> > > > >
> > > > > > > > still, I'd like to put it more explicitly to ensure it's not
> > > > > > > > missed:
> > > > > > > > the reason we want to specify compatible_type as a trait and check
> > > > > > > > whether target compatible_type is the superset of source
> > > > > > > > compatible_type is for the consideration of backward compatibility.
> > > > > > > > e.g.
> > > > > > > > an old generation device may have a mdev type xxx-v4-yyy, while a
> > > > > > > > newer generation device may be of mdev type xxx-v5-yyy.
> > > > > > > > with the compatible_type traits, the old generation device is still
> > > > > > > > able to be regarded as compatible to the newer generation device
> > > > > > > > even though their mdev types are not equal.
> > > > > > >
> > > > > > > If you want to support migration from v4 to v5, can't the (presumably
> > > > > > > newer) driver that supports v5 simply register the v4 type as well, so
> > > > > > > that the mdev can be created as v4? (Just like QEMU versioned machine
> > > > > > > types work.)
> > > > > >
> > > > > > yes, it should work in some conditions.
> > > > > > but it may not be that good in some cases when v5 and v4 in the name
> > > > > > string of the mdev type identify hardware generation (e.g. v4 for gen8,
> > > > > > and v5 for gen9)
> > > > > >
> > > > > > e.g.
> > > > > > (1). when the src mdev type is v4 and the target mdev type is v5, as
> > > > > > software does not support it initially, and v4 and v5 identify
> > > > > > hardware differences.
> > > > >
> > > > > My first hunch here is: Don't introduce types that may be compatible
> > > > > later. Either make them compatible, or make them distinct by design,
> > > > > and possibly add a different, compatible type later.
> > > > >
> > > > > > then after software upgrade, v5 is now compatible to v4; should the
> > > > > > software now downgrade the mdev type from v5 to v4?
> > > > > > not sure if moving hardware generation info into a separate attribute
> > > > > > from the mdev type name is better, e.g. remove v4, v5 in the mdev type,
> > > > > > while using compatible_pci_ids to identify compatibility.
> > > > >
> > > > > If the generations are compatible, don't mention it in the mdev type.
> > > > > If they aren't, use distinct types, so that management software doesn't
> > > > > have to guess. At least that would be my naive approach here.
> 
> [*]
> 
> > > > yep that is what i would prefer to see too.
> > > > >
> > > > > >
> > > > > > (2) the name string of the mdev type is composed of "driver_name +
> > > > > > type_name".
> > > > > > in some devices, e.g. qat, different generations of devices are binding
> > > > > > to drivers of different names, e.g. "qat-v4", "qat-v5".
> > > > > > then though type_name is equal, the mdev type is not equal, e.g.
> > > > > > "qat-v4-type1", "qat-v5-type1".
> > > > >
> > > > > I guess that shows a shortcoming of that "driver_name + type_name"
> > > > > approach? Or maybe I'm just confused.
> > > > yes, i really don't like having the version in the mdev-type name.
> > > > i would strongly prefer just qat-type-1, where qat is just there as a way
> > > > of namespacing. although symmetric-crypto, asymmetric-crypto and
> > > > compression would be better names than type-1, type-2, type-3, if that is
> > > > what they would end up mapping to, e.g. qat-compression or qat-aes is a
> > > > much better name than type-1.
> > > > higher layers of software are unlikely to parse the mdev names, but as a
> > > > human looking at them it's much easier to understand if the names are
> > > > meaningful. the qat prefix i think is important however, to make sure that
> > > > your mdev-types don't collide with other vendors' mdev types. so i would
> > > > encourage all vendors to prefix their mdev types with either the device
> > > > name or the vendor.
> > >
> > > +1 to all this, the mdev type is meant to indicate a software
> > > compatible interface, if different hardware versions can be software
> > > compatible, then don't make the job of finding a compatible device
> > > harder.  The full type is a combination of the vendor driver name plus
> > > the vendor provided type name specifically in order to provide a type
> > > namespace per vendor driver.  That's done at the mdev core level.
> > > Thanks,
> >
> > hi Alex,
> > got it. so do you suggest that vendors use a consistent driver name over
> > generations of devices?
> > for qat, they create different modules for each generation. This
> > practice is not good if they want to support migration between devices
> > of different generations, 

RE: [PATCH v5 0/4] introduction of migration_version attribute for VFIO live migration

2020-04-21 Thread Tian, Kevin
> From: Yan Zhao
> Sent: Tuesday, April 21, 2020 10:37 AM
> 
> On Tue, Apr 21, 2020 at 06:56:00AM +0800, Alex Williamson wrote:
> > On Sun, 19 Apr 2020 21:24:57 -0400
> > Yan Zhao  wrote:
> >
> > > On Fri, Apr 17, 2020 at 07:24:57PM +0800, Cornelia Huck wrote:
> > > > On Fri, 17 Apr 2020 05:52:02 -0400
> > > > Yan Zhao  wrote:
> > > >
> > > > > On Fri, Apr 17, 2020 at 04:44:50PM +0800, Cornelia Huck wrote:
> > > > > > On Mon, 13 Apr 2020 01:52:01 -0400
> > > > > > Yan Zhao  wrote:
> > > > > >
> > > > > > > This patchset introduces a migration_version attribute under sysfs
> > > > > > > of VFIO Mediated devices.
> > > > > > >
> > > > > > > This migration_version attribute is used to check migration
> > > > > > > compatibility between two mdev devices.
> > > > > > >
> > > > > > > Currently, it has two locations:
> > > > > > > (1) under the mdev_type node,
> > > > > > > which can be used even before device creation, but only for mdev
> > > > > > > devices of the same mdev type.
> > > > > > > (2) under the mdev device node,
> > > > > > > which can only be used after the mdev devices are created, but the
> > > > > > > src and target mdev devices need not be of the same mdev type.
> > > > > > > (The second location is newly added in v5, in order to keep
> > > > > > > consistent with the migration_version node for migratable
> > > > > > > pass-through devices)
> > > > > >
> > > > > > What is the relationship between those two attributes?
> > > > > >
> > > > > (1) is for mdev devices specifically, and (2) is provided to keep the
> > > > > same sysfs interface as with non-mdev cases. so (2) is for both mdev
> > > > > devices and non-mdev devices.
> > > > >
> > > > > in future, if we enable vfio-pci vendor ops (i.e. a non-mdev device
> > > > > is bound to vfio-pci, but is able to register a migration region and do
> > > > > migration transactions from a vendor provided affiliate driver),
> > > > > the vendor driver would export (2) directly, under the device node.
> > > > > It is not able to provide (1) as there are no mdev devices involved.
> > > >
> > > > Ok, creating an alternate attribute for non-mdev devices makes sense.
> > > > However, wouldn't that rather be a case (3)? The change here only
> > > > refers to mdev devices.
> > > >
> > > as you pointed out below, (3) and (2) serve the same purpose.
> > > and I think a possible usage is to migrate between a non-mdev device and
> > > an mdev device. so I think it's better for them both to use (2) rather
> > > than creating (3).
> >
> > An mdev type is meant to define a software compatible interface, so in
> > the case of mdev->mdev migration, doesn't migrating to a different type
> > fail the most basic of compatibility tests that we expect userspace to
> > perform?  IOW, if two mdev types are migration compatible, it seems a
> > prerequisite to that is that they provide the same software interface,
> > which means they should be the same mdev type.
> >
> > In the hybrid cases of mdev->phys or phys->mdev, how does a management
> > tool begin to even guess what might be compatible?  Are we expecting
> > libvirt to probe every device with this attribute in the system?  Is
> > there going to be a new class hierarchy created to enumerate all
> > possible migrate-able devices?
> >
yes, a management tool needs to guess and test migration compatibility
between two devices. But I think it's not a problem only for
mdev->phys or phys->mdev. even for mdev->mdev, the management tool needs
to first assume that the two mdevs have the same type of parent devices
(e.g. their pciids are equal). otherwise, it's still enumerating
possibilities.
> 
> on the other hand, for two mdevs,
> mdev1 from pdev1, its mdev_type is 1/2 of pdev1;
> mdev2 from pdev2, its mdev_type is 1/4 of pdev2;
> if pdev2 is exactly 2 times pdev1, why not allow migration between
> mdev1 <-> mdev2?

How could the management tool figure out that 1/2 of pdev1 is equivalent
to 1/4 of pdev2? If we really want to allow such a thing to happen, the best
choice is to report the same mdev type on both pdev1 and pdev2.
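
Under that scheme the compatibility check stays trivial, roughly as follows
(a sketch using the standard mdev sysfs layout; the UUID variables are
illustrative placeholders):

  # each mdev device exposes an mdev_type link to the type it was created from
  src=$(basename $(readlink /sys/bus/mdev/devices/$SRC_UUID/mdev_type))
  dst=$(basename $(readlink /sys/bus/mdev/devices/$DST_UUID/mdev_type))
  [ "$src" = "$dst" ] && echo "same mdev type: migration candidate"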

btw mdev<->phys just brings trouble to the upper stack, as Alex pointed out.
Can we simplify the requirement by allowing only mdev<->mdev and
phys<->phys migration? If a customer does want to migrate between an
mdev and a phys device, he could wrap the physical device into an mdev
instance (with the same type as the source mdev) instead of using vendor
ops. Doing so does add some burden, but if mdev<->phys is not the dominant
usage then such a tradeoff might be worthwhile...

Thanks
Kevin

> 
> 
> > I agree that there was a gap in the previous proposal for non-mdev
> > devices, but I think this brings a lot of questions that we need to
> > puzzle through and libvirt will need to re-evaluate how they might
> > decide to pick a migration target device.  For example, I'm sure
> > libvirt would reject any policy decisions regarding picking a physical
> > device versus an mdev device.  Had we 

Re: [libvirt] [PATCH 0/6] VFIO mdev aggregated resources handling

2019-11-19 Thread Tian, Kevin
> From: Alex Williamson
> Sent: Wednesday, November 20, 2019 6:58 AM
> 
> On Fri, 15 Nov 2019 04:24:35 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Alex Williamson
> > > Sent: Thursday, November 7, 2019 2:45 AM
> > >
> > > On Wed, 6 Nov 2019 12:20:31 +0800
> > > Zhenyu Wang  wrote:
> > >
> > > > On 2019.11.05 14:10:42 -0700, Alex Williamson wrote:
> > > > > On Thu, 24 Oct 2019 13:08:23 +0800
> > > > > Zhenyu Wang  wrote:
> > > > >
> > > > > > Hi,
> > > > > >
> > > > > > This is a refresh of a previous send of this series. I got the
> > > > > > impression that some SIOV drivers would still deploy their own create
> > > > > > and config method, so I stopped effort on this. But it seems this
> > > > > > would still be useful for some other SIOV driver which may simply want
> > > > > > the capability to aggregate resources. So here's the refreshed series.
> > > > > >
> > > > > > The current mdev device create interface depends on a fixed mdev
> > > > > > type, which gets a uuid from the user to create an instance of the
> > > > > > mdev device. If the user wants to use a customized amount of resources
> > > > > > for an mdev device, then the only option is to create a new mdev type
> > > > > > for that, which may not be flexible. This requirement comes not only
> > > > > > from the need to allocate flexible resources for KVMGT, but also from
> > > > > > Intel scalable IO virtualization which would use vfio/mdev to be able
> > > > > > to allocate arbitrary resources on an mdev instance. More info on
> > > > > > [1] [2] [3].
> > > > > >
> > > > > > To allow creating user defined resources for mdev, it tries to extend
> > > > > > the mdev create interface by adding a new "aggregate=xxx" parameter
> > > > > > following the UUID; for a target mdev type, if aggregation is
> > > > > > supported, it can create a new mdev device which contains resources
> > > > > > combined from a number of instances, e.g.
> > > > > >
> > > > > > echo ",aggregate=10" > create
> > > > > >
> > > > > > A VM manager, e.g. libvirt, can check the mdev type for an
> > > > > > "aggregation" attribute which indicates support for this setting. If
> > > > > > no "aggregation" attribute is found for the mdev type, the previous
> > > > > > behavior is still kept for one instance allocation. And a new sysfs
> > > > > > attribute "aggregated_instances" is created for each mdev device to
> > > > > > show the allocated number.
> > > > >
> > > > > Given discussions we've had recently around libvirt interacting with
> > > > > mdev, I think that libvirt would rather have an abstract interface via
> > > > > mdevctl[1].  Therefore can you evaluate how mdevctl would support this
> > > > > creation extension?  It seems like it would fit within the existing
> > > > > mdev and mdevctl framework if aggregation were simply a sysfs attribute
> > > > > for the device.  For example, the mdevctl steps might look like this:
> > > > >
> > > > > mdevctl define -u UUID -p PARENT -t TYPE
> > > > > mdevctl modify -u UUID --addattr=mdev/aggregation --value=2
> > > > > mdevctl start -u UUID
> >
> > Hi, Alex, can you elaborate why a sysfs attribute is more friendly
> > to mdevctl? what is the complexity of having mdevctl pass an
> > additional parameter at creation time, as this series originally
> > proposed? Just want to clearly understand the limitation of the
> > parameter way. :-)
> 
> We could also flip this question around, vfio-ap already uses sysfs to
> finish composing a device after it's created, therefore why shouldn't
> aggregation use this existing mechanism.  Extending the creation
> interface is a more fundamental change than simply standardizing an
> optional sysfs namespace entry.
> 
> > > > >
> > > > > When mdevctl starts the mdev, it will first create it using the
> > > > > existing mechanism, then apply the aggregation attribute, which can
> > > > > consume the necessary a

Re: [libvirt] [PATCH 0/6] VFIO mdev aggregated resources handling

2019-11-15 Thread Tian, Kevin
> From: Alex Williamson
> Sent: Thursday, November 7, 2019 2:45 AM
> 
> On Wed, 6 Nov 2019 12:20:31 +0800
> Zhenyu Wang  wrote:
> 
> > On 2019.11.05 14:10:42 -0700, Alex Williamson wrote:
> > > On Thu, 24 Oct 2019 13:08:23 +0800
> > > Zhenyu Wang  wrote:
> > >
> > > > Hi,
> > > >
> > > > This is a refresh of a previous send of this series. I got the
> > > > impression that some SIOV drivers would still deploy their own create
> > > > and config method, so I stopped effort on this. But it seems this would
> > > > still be useful for some other SIOV driver which may simply want the
> > > > capability to aggregate resources. So here's the refreshed series.
> > > >
> > > > The current mdev device create interface depends on a fixed mdev type,
> > > > which gets a uuid from the user to create an instance of the mdev
> > > > device. If the user wants to use a customized amount of resources for
> > > > an mdev device, then the only option is to create a new mdev type for
> > > > that, which may not be flexible. This requirement comes not only from
> > > > the need to allocate flexible resources for KVMGT, but also from Intel
> > > > scalable IO virtualization which would use vfio/mdev to be able to
> > > > allocate arbitrary resources on an mdev instance. More info on [1] [2] [3].
> > > >
> > > > To allow creating user defined resources for mdev, it tries to extend
> > > > the mdev create interface by adding a new "aggregate=xxx" parameter
> > > > following the UUID; for a target mdev type, if aggregation is
> > > > supported, it can create a new mdev device which contains resources
> > > > combined from a number of instances, e.g.
> > > >
> > > > echo ",aggregate=10" > create
> > > >
> > > > A VM manager, e.g. libvirt, can check the mdev type for an
> > > > "aggregation" attribute which indicates support for this setting. If no
> > > > "aggregation" attribute is found for the mdev type, the previous
> > > > behavior is still kept for one instance allocation. And a new sysfs
> > > > attribute "aggregated_instances" is created for each mdev device to
> > > > show the allocated number.
> > >
> > > Given discussions we've had recently around libvirt interacting with
> > > mdev, I think that libvirt would rather have an abstract interface via
> > > mdevctl[1].  Therefore can you evaluate how mdevctl would support this
> > > creation extension?  It seems like it would fit within the existing
> > > mdev and mdevctl framework if aggregation were simply a sysfs attribute
> > > for the device.  For example, the mdevctl steps might look like this:
> > >
> > > mdevctl define -u UUID -p PARENT -t TYPE
> > > mdevctl modify -u UUID --addattr=mdev/aggregation --value=2
> > > mdevctl start -u UUID

Hi, Alex, can you elaborate why a sysfs attribute is more friendly
to mdevctl? what is the complexity of having mdevctl pass an
additional parameter at creation time, as this series originally
proposed? Just want to clearly understand the limitation of the
parameter way. :-)
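
To make the two options concrete, they would compare roughly as follows (a
sketch; the UUID, parent and type values are illustrative placeholders):

  UUID=$(uuidgen); PARENT=0000:00:02.0; TYPE=i915-GVTg_V5_4

  # (1) creation-time parameter, as this series proposes:
  echo "$UUID,aggregate=10" > \
      /sys/bus/pci/devices/$PARENT/mdev_supported_types/$TYPE/create

  # (2) post-creation sysfs attribute via mdevctl, as quoted above:
  mdevctl define -u $UUID -p $PARENT -t $TYPE
  mdevctl modify -u $UUID --addattr=mdev/aggregation --value=10
  mdevctl start -u $UUID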

> > >
> > > When mdevctl starts the mdev, it will first create it using the
> > > existing mechanism, then apply the aggregation attribute, which can
> > > consume the necessary additional instances from the parent device, or
> > > return an error, which would unwind and return a failure code to the
> > > caller (libvirt).  I think the vendor driver would then have freedom to
> > > decide when the attribute could be modified, for instance it would be
> > > entirely reasonable to return -EBUSY if the user attempts to modify the
> > > attribute while the mdev device is in-use.  Effectively aggregation
> > > simply becomes a standardized attribute with common meaning.  Thoughts?
> > > [cc libvirt folks for their impression] Thanks,
> >
> > I think one problem is that before mdevctl starts to create the mdev you
> > don't know what the vendor attributes are, as we apply mdev attributes
> > after create. You may need some lookup depending on the parent.. I think
> > making aggregation like other vendor attributes for mdev might be the
> > simplest way, but do we want to define its behavior formally? e.g. as
> > previously discussed, it should show maximum instances for aggregation, etc.
> 
> Yes, we'd still want to standardize how we enable and discover
> aggregation since we expect multiple users.  Even if libvirt were to
> use mdevctl as its mdev interface, higher level tools should have an
> introspection mechanism available.  Possibly the sysfs interfaces
> proposed in this series remain largely the same, but I think perhaps
> the implementation of them moves out to the vendor driver.  In fact,
> perhaps the only change to mdev core is to define the standard.  For
> example, the "aggregation" attribute on the type is potentially simply
> a defined, optional, per type attribute, similar to "name" and
> "description".  For "aggregated_instances" we already have the
> mdev_attr_groups of the mdev_parent_ops, we could define an
> attribute_group with .name = "mdev" as a set of standardized
> attributes, such that vendors could provide both their own vendor
> specific attributes and per 

Re: [libvirt] [PATCH v2 1/1] Add attribute single_usage_restriction for mdev type-id

2018-10-10 Thread Tian, Kevin
> From: Alex Williamson
> Sent: Thursday, October 11, 2018 4:39 AM
> 
> On Tue, 9 Oct 2018 01:40:17 +0530
> Kirti Wankhede  wrote:
> 
> > Generally a single instance of an mdev device, a share of a physical
> > device, is assigned to a user space application or a VM. There are cases
> > when multiple instances of mdev devices of the same or different types
> > are required by a user space application or VM. For example, in the case
> > of vGPU, multiple mdev devices of the type which represents the whole
> > GPU can be assigned to one instance of an application or VM.
> >
> > All types of mdev devices may not support assigning multiple mdev
> > devices to a user space application. In that case the vendor driver can
> > fail the open() call of the mdev device. But there is no way for the
> > user space application to know about the configuration supported by the
> > vendor driver.
> >
> > To expose the supported configuration, the vendor driver should add a
> > 'single_usage_restriction' attribute to the <type-id> directory. Returning
> > Y for this attribute indicates the vendor driver has a restriction of a
> > single mdev device of a particular <type-id> assigned to one user space
> > application. Returning N indicates that multiple mdev devices of a
> > particular <type-id> can be assigned to one user space application.
> >
> > The user space application should read whether the
> > 'single_usage_restriction' attribute is present in the <type-id> directory
> > of all mdev devices which are going to be used. If all read N then the
> > user space application can proceed with multiple mdev devices.
> >
> > This is an optional and readonly attribute.
> >
> > Signed-off-by: Kirti Wankhede 
> > Reviewed-by: Neo Jia 
> > ---
> >  Documentation/ABI/testing/sysfs-bus-vfio-mdev | 16 ++++++++++++++++
> >  1 file changed, 16 insertions(+)
> >
> > diff --git a/Documentation/ABI/testing/sysfs-bus-vfio-mdev b/Documentation/ABI/testing/sysfs-bus-vfio-mdev
> > index 452dbe39270e..3aca352a70e5 100644
> > --- a/Documentation/ABI/testing/sysfs-bus-vfio-mdev
> > +++ b/Documentation/ABI/testing/sysfs-bus-vfio-mdev
> > @@ -85,6 +85,22 @@ Users:
> > a particular <type-id> that can help in understanding the
> > features provided by that type of mediated device.
> >
> > +What:  /sys/.../mdev_supported_types/<type-id>/single_usage_restriction
> > +Date:   October 2018
> > +Contact:Kirti Wankhede 
> > +Description:
> > +   Reading this attribute will return Y or N. Returning Y indicates
> > +   the vendor driver has a restriction of a single mdev device of
> > +   this particular <type-id> assigned to one user space application.
> > +   Returning N indicates that multiple mdev devices of a particular
> > +   <type-id> can be assigned to one user space application.
> > +   This is an optional and readonly attribute.
> > +Users:
> > +   User space application should read if 'single_usage_restriction'
> > +   attribute is present in the <type-id> directory of all mdev devices
> > +   which are going to be used. If all read N then the user space
> > +   application can proceed with multiple mdev devices.
> 
> But we don't say what userspace should do when this optional attribute
> is not present.  Do we know of any cases other than the NVIDIA GRID
> vGPU drivers that have this restriction?  Intel folks, are multiple
> GVT-g mdevs currently allowed in a VM?  I don't think the libvirt
> algorithm is going to be as simple as suggested here and we should
> probably understand what it really needs to be.

technically I don't see a restriction on the GVT-g side, i.e. multiple GVT-g
mdevs can be assigned to the same VM. But the fact is that the Intel GPU is
integrated, thus there is just one per platform. Then the guest i915 driver
may have problems operating multiple vGPUs if it makes some assumption about
the integrated part. I don't think we have verified such a configuration. Zhenyu?

> 
> There's also a scope issue that's unclear here, the verbiage above
> suggests that I can't combine a 'single_usage_restriction=Y' mdev with
> any other mdev, but clearly there's no dependency between adding both
> an NVIDIA and Intel vGPU to the same VM, right?  Or NVIDIA and any of
> the mdev sample drivers.  The restriction is across mdev types and
> parent devices in this case, but I think it stops at the vendor driver,
> right?  How do we make that more clear, both in wording and perhaps
> implicit in the attribute itself?  Thanks,
> 
> Alex



Re: [libvirt] [PATCH v2 0/4] New mdev type handling for aggregated resources

2018-10-07 Thread Tian, Kevin
Hi, Zhenyu,

curious about the progress of this series. Is there still some open issue
remaining, or is a new version coming soon?

Thanks
Kevin

> From: Zhenyu Wang [mailto:zhen...@linux.intel.com]
> Sent: Friday, July 20, 2018 10:19 AM
> 
> The current mdev device create interface depends on a fixed mdev type,
> which gets a uuid from the user to create an instance of the mdev device.
> If the user wants to use a customized amount of resources for an mdev
> device, then the only option is to create a new mdev type for that, which
> may not be flexible. This requirement comes not only from the need to
> allocate flexible resources for KVMGT, but also from Intel scalable IO
> virtualization which would use vfio/mdev to be able to allocate arbitrary
> resources on an mdev instance. More info on [1] [2] [3].
> 
> To allow creating user defined resources for mdev, it tries to extend the
> mdev create interface by adding a new "instances=xxx" parameter following
> the uuid; for a target mdev type, if aggregation is supported, it can
> create a new mdev device which contains resources combined from a number
> of instances, e.g.
> 
> echo ",instances=10" > create
> 
> A VM manager, e.g. libvirt, can check the mdev type for an "aggregation"
> attribute which indicates support for this setting. If no "aggregation"
> attribute is found for the mdev type, the previous behavior is still kept
> for one instance allocation. And a new sysfs attribute "instances" is
> created for each mdev device to show the allocated number.
> 
> This tries to create a new KVMGT type with minimal vGPU resources which
> can be combined with the "instances=x" setting to allocate the resources
> the user wants.
> 
> References:
> [1] https://software.intel.com/en-us/download/intel-virtualization-technology-for-directed-io-architecture-specification
> [2] https://software.intel.com/en-us/download/intel-scalable-io-virtualization-technical-specification
> [3] https://schd.ws/hosted_files/lc32018/00/LC3-SIOV-final.pdf
> 
> v2:
>   - Add new create_with_instances driver hook
>   - Update doc for new attributes
> 
> Zhenyu Wang (4):
>   vfio/mdev: Add new instances parameter for mdev create
>   vfio/mdev: Add mdev device instances attribute
>   drm/i915/gvt: Add new aggregation type support
>   Documentation/vfio-mediated-device.txt: update for aggregation
> attribute
> 
>  Documentation/vfio-mediated-device.txt | 39 +++---
>  drivers/gpu/drm/i915/gvt/gvt.c | 26 +---
>  drivers/gpu/drm/i915/gvt/gvt.h | 14 ---
>  drivers/gpu/drm/i915/gvt/kvmgt.c   | 30 +++---
>  drivers/gpu/drm/i915/gvt/vgpu.c| 56 ++
>  drivers/vfio/mdev/mdev_core.c  | 19 +++--
>  drivers/vfio/mdev/mdev_private.h   |  6 ++-
>  drivers/vfio/mdev/mdev_sysfs.c | 42 ---
>  include/linux/mdev.h   | 10 +
>  9 files changed, 203 insertions(+), 39 deletions(-)
> 
> --
> 2.18.0




Re: [libvirt] [RFC PATCH v1 1/1] Add attribute multiple_mdev_support for mdev type-id

2018-09-27 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Friday, September 28, 2018 4:01 AM
> 
> On Fri, 28 Sep 2018 00:53:00 +0530
> Kirti Wankhede  wrote:
> 
> > On 9/27/2018 9:29 PM, Alex Williamson wrote:
> > > On Thu, 27 Sep 2018 06:48:27 +0000
> > > "Tian, Kevin"  wrote:
> > >
> > >>> From: Kirti Wankhede
> > >>> Sent: Thursday, September 27, 2018 2:22 PM
> > >>>
> > >>> Generally a single instance of an mdev device, a share of a physical
> > >>> device, is assigned to a user space application or a VM. There are
> > >>> cases when multiple instances of mdev devices of the same or different
> > >>> types are required by a user space application or VM. For example, in
> > >>> the case of vGPU, multiple mdev devices of the type which represents
> > >>> the whole GPU can be assigned to one instance of an application or VM.
> > >>>
> > >>> All types of mdev devices may not support assigning multiple mdev
> > >>> devices to a user space application. In that case the vendor driver
> > >>> can fail the open() call of the mdev device. But there is no way for
> > >>> the user space application to know about the configuration supported
> > >>> by the vendor driver.
> > >>>
> > >>> To expose the supported configuration, the vendor driver should add a
> > >>> 'multiple_mdev_support' attribute to the type-id directory if the
> > >>> vendor driver supports assigning multiple mdev devices of a particular
> > >>> type-id to one instance of a user space application or a VM.
> > >>>
> > >>> The user space application should read whether the
> > >>> 'multiple_mdev_support' attribute is present in the type-id directory
> > >>> of all mdev devices which are going to be used. If all read 1 then the
> > >>> user space application can proceed with multiple mdev devices.
> > >>>
> > >>> This is an optional and readonly attribute.
> > >>
> > >> I didn't get what the exact problem is from the above description.
> > >> Isn't it the basic point behind the mdev concept that the parent driver
> > >> can create multiple mdev instances on a physical device? VFIO usually
> > >> doesn't care whether multiple devices (pci or mdev) are assigned to the
> > >> same process/VM or not. Can you give some example of the actual problem
> > >> behind this change?
> > >
> > > The situation here is that mdev imposes no restrictions regarding how
> > > mdev devices can be used by the user.  Multiple mdevs, regardless of
> > > type, can be combined and used in any way the user sees fit, at least
> > > that's the theory.  However, in practice, certain vendor drivers impose
> > > a limitation that prevents multiple mdev devices from being used in the
> > > same VM.  This is done by returning an error when the user attempts to
> > > open those additional devices.  Now we're in the very predictable
> > > situation that we want to consider that some of the vendor's mdev types
> > > might be combined in the same userspace instance, while others still
> > > impose a restriction, and how can we optionally expose such per mdev
> > > type restrictions to the userspace so we have something better than
> > > "try it and see if it works".
> > >
> > > The below proposal assumes that a single instance per VM is the norm
> > > and that we add something to the API to indicate multiple instances are
> > > supported.  However, that's really never been the position of the mdev
> > > core, where there's no concept of limiting devices per user instance;
> > > it's a vendor driver restriction.  So I think the question is then why
> > > should we burden vendor drivers who do not impose a restriction with an
> > > additional field?  Does the below extension provide a better backwards
> > > compatibility scenario?
> >
> > A vendor driver which does not want to impose a restriction doesn't need
> > to provide this attribute. In that case, the behavior would remain the
> > same as it is now, i.e. multiple mdev instances can be used by one
> > instance of an application.
> >
> >
> > > With the current code, I think it's reasonable for userspace to assume
> > > there are no mdev limits per userspace instance, those that exist are
> > > vendor specific.  The below proposal is for an optional field, what
> > > does the manageme

Re: [libvirt] [RFC PATCH v1 1/1] Add attribute multiple_mdev_support for mdev type-id

2018-09-27 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Friday, September 28, 2018 9:27 AM
> 
> > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > Sent: Friday, September 28, 2018 4:01 AM
> >
> > On Fri, 28 Sep 2018 00:53:00 +0530
> > Kirti Wankhede  wrote:
> >
> > > On 9/27/2018 9:29 PM, Alex Williamson wrote:
> > > > On Thu, 27 Sep 2018 06:48:27 +0000
> > > > "Tian, Kevin"  wrote:
> > > >
> > > >>> From: Kirti Wankhede
> > > >>> Sent: Thursday, September 27, 2018 2:22 PM
> > > >>>
> > > >>> Generally a single instance of an mdev device, a share of a physical
> > > >>> device, is assigned to a user space application or a VM. There are
> > > >>> cases when multiple instances of mdev devices of the same or
> > > >>> different types are required by a user space application or VM. For
> > > >>> example, in the case of vGPU, multiple mdev devices of the type which
> > > >>> represents the whole GPU can be assigned to one instance of an
> > > >>> application or VM.
> > > >>>
> > > >>> All types of mdev devices may not support assigning multiple mdev
> > > >>> devices to a user space application. In that case the vendor driver
> > > >>> can fail the open() call of the mdev device. But there is no way for
> > > >>> the user space application to know about the configuration supported
> > > >>> by the vendor driver.
> > > >>>
> > > >>> To expose the supported configuration, the vendor driver should add
> > > >>> a 'multiple_mdev_support' attribute to the type-id directory if the
> > > >>> vendor driver supports assigning multiple mdev devices of a
> > > >>> particular type-id to one instance of a user space application or a
> > > >>> VM.
> > > >>>
> > > >>> The user space application should read whether the
> > > >>> 'multiple_mdev_support' attribute is present in the type-id
> > > >>> directory of all mdev devices which are going to be used. If all
> > > >>> read 1 then the user space application can proceed with multiple
> > > >>> mdev devices.
> > > >>>
> > > >>> This is an optional and readonly attribute.
> > > >>
> > > >> I didn't get what the exact problem is from the above description.
> > > >> Isn't it the basic point behind the mdev concept that the parent
> > > >> driver can create multiple mdev instances on a physical device? VFIO
> > > >> usually doesn't care whether multiple devices (pci or mdev) are
> > > >> assigned to the same process/VM or not. Can you give some example of
> > > >> the actual problem behind this change?
> > > >
> > > > The situation here is that mdev imposes no restrictions regarding how
> > > > mdev devices can be used by the user.  Multiple mdevs, regardless of
> > > > type, can be combined and used in any way the user sees fit, at least
> > > > that's the theory.  However, in practice, certain vendor drivers
> > > > impose a limitation that prevents multiple mdev devices from being
> > > > used in the same VM.  This is done by returning an error when the user
> > > > attempts to open those additional devices.  Now we're in the very
> > > > predictable situation that we want to consider that some of the
> > > > vendor's mdev types might be combined in the same userspace instance,
> > > > while others still impose a restriction, and how can we optionally
> > > > expose such per mdev type restrictions to the userspace so we have
> > > > something better than "try it and see if it works".
> > > >
> > > > The below proposal assumes that a single instance per VM is the norm
> > > > and that we add something to the API to indicate multiple instances
> > > > are supported.  However, that's really never been the position of the
> > > > mdev core, where there's no concept of limiting devices per user
> > > > instance; it's a vendor driver restriction.  So I think the question
> > > > is then why should we burden vendor drivers who do not impose a
> > > > restriction with an additional field?  Does the below extension
> > > > provide a better backwards compatibility scenario?

Re: [libvirt] [RFC PATCH v1 1/1] Add attribute multiple_mdev_support for mdev type-id

2018-09-27 Thread Tian, Kevin
> From: Kirti Wankhede
> Sent: Thursday, September 27, 2018 2:22 PM
> 
> Generally a single instance of an mdev device, a share of a physical
> device, is assigned to a user space application or a VM. There are cases
> when multiple instances of mdev devices of the same or different types
> are required by a user space application or VM. For example, in the case
> of vGPU, multiple mdev devices of the type which represents the whole GPU
> can be assigned to one instance of an application or VM.
>
> All types of mdev devices may not support assigning multiple mdev devices
> to a user space application. In that case the vendor driver can fail the
> open() call of the mdev device. But there is no way for the user space
> application to know about the configuration supported by the vendor
> driver.
> 
> To expose the supported configuration, the vendor driver should add a
> 'multiple_mdev_support' attribute to the type-id directory if the vendor
> driver supports assigning multiple mdev devices of a particular type-id
> to one instance of a user space application or a VM.
>
> The user space application should read whether the 'multiple_mdev_support'
> attribute is present in the type-id directory of all mdev devices which
> are going to be used. If all read 1 then the user space application can
> proceed with multiple mdev devices.
>
> This is an optional and readonly attribute.
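
For reference, the probe the proposal expects userspace to perform would be
roughly the following (a sketch; the parent device and type name are
illustrative placeholders):

  typedir=/sys/class/mdev_bus/0000:00:02.0/mdev_supported_types/vendor-type1
  if [ "$(cat $typedir/multiple_mdev_support 2>/dev/null)" = "1" ]; then
      echo "multiple mdevs of this type may be assigned to one VM"
  fi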

I didn't get what the exact problem is from the above description. Isn't it
the basic point behind the mdev concept that the parent driver can create
multiple mdev instances on a physical device? VFIO usually doesn't care
whether multiple devices (pci or mdev) are assigned to the same process/VM
or not. Can you give some example of the actual problem behind this change?

Thanks
Kevin

> 
> Signed-off-by: Neo Jia 
> Signed-off-by: Kirti Wankhede 
> ---
>  Documentation/ABI/testing/sysfs-bus-vfio-mdev | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/Documentation/ABI/testing/sysfs-bus-vfio-mdev b/Documentation/ABI/testing/sysfs-bus-vfio-mdev
> index 452dbe39270e..69e1291479ce 100644
> --- a/Documentation/ABI/testing/sysfs-bus-vfio-mdev
> +++ b/Documentation/ABI/testing/sysfs-bus-vfio-mdev
> @@ -85,6 +85,19 @@ Users:
>   a particular <type-id> that can help in understanding the
>   features provided by that type of mediated device.
> 
> +What:   /sys/.../mdev_supported_types/<type-id>/multiple_mdev_support
> +Date:   September 2018
> +Contact:Kirti Wankhede 
> +Description:
> + Reading this attribute will return 0 or 1. Returning 1 indicates
> + multiple mdev devices of a particular <type-id> assigned to one
> + user space application is supported by the vendor driver. This is
> + an optional and readonly attribute.
> +Users:
> + Userspace applications interested in knowing if multiple mdev
> + devices of a particular <type-id> can be assigned to one
> + instance of application.
> +
> +
>  What:   /sys/...///
>  Date:   October 2016
>  Contact:Kirti Wankhede 
> --
> 2.7.0




Re: [libvirt] Matching the type of mediated devices in the migration

2018-08-22 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Thursday, August 23, 2018 11:47 AM
> 
> On Wed, 22 Aug 2018 02:30:12 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > Sent: Wednesday, August 22, 2018 10:08 AM
> > >
> > > On Wed, 22 Aug 2018 01:27:05 +
> > > "Tian, Kevin"  wrote:
> > >
> > > > > From: Wang, Zhi A
> > > > > Sent: Wednesday, August 22, 2018 2:43 AM
> > > > > >
> > > > > > Are there any suggestions how we can deal with security issues?
> > > > > > Allowing userspace to provide a data stream representing the
> > > > > > internal state of a virtual device model living within the kernel
> > > > > > seems troublesome.  If we need to trust the data stream, do we need
> > > > > > to somehow make the operation more privileged than what a vfio user
> > > > > > might have otherwise?  Does the data stream need to be somehow
> > > > > > signed and how might we do that?  How can we build in protection
> > > > > > against an untrusted restore image?  Thanks,
> > > >
> > > > imo it is not necessary. restoring mdev state should be handled as if
> > > > the guest is programming the mdev.
> > >
> > > To me this suggests that a state save/restore is just an algorithm
> > > executed by userspace using the existing vfio device accesses.  This is
> > > not at all what we've been discussing for migration.  I believe the
> >
> > not an algorithm by userspace. It's the kernel driver that applies the
> > audit when receiving the opaque state data.
> 
> And a kernel driver receiving and processing opaque state data from a
> user doesn't raise security concerns for you?

opaque is from the userspace p.o.v.; the kernel driver understands the actual
format and thus can audit it when restoring the state.

> 
> > > interface we've been hashing out exposes opaque device state through a
> > > vfio region.  We therefore must assume that that opaque data contains
> > > not only device state, but also emulation state, similar to what we see
> > > for any QEMU device.  Not only is there internal emulation state, but
> > > we have no guarantee that the device state goes through the same
> > > auditing as it does through the vfio interface.  Since this device and
> > > emulation state live inside the kernel and not just within the user's
> > > own process, a malicious user can do far more than shoot themselves.  It
> > > would be one thing if devices were IOMMU isolated, but they're not,
> > > they're isolated through vendor and device specific mechanisms, and for
> > > all we know the parameters of that isolation are included in the
> > > restore state.  I don't see how we can say this is not an issue.
> >
> > I didn't quite get this. My understanding is that the isolation
> > configuration is completed when an mdev is created on the DEST machine
> > given a type definition. The state image contains just runtime data
> > reflecting what the guest driver does on the SRC machine. Restoring such
> > state shouldn't change the isolation policy.
> 
> Let's invent an example where the mdev vendor driver has a set of
> pinned pages which are the current working set for the device at the
> time of migration.  Information about that pinning might be included in
> the opaque migration state.  If a malicious user discovers this, they
> can potentially also craft a modified state which can exploit the host
> kernel isolation.

pinned pages may not be a good example. the pin knowledge could be
reconstructed when restoring the state (e.g. in GVT-g pinning is triggered
by shadowing the GPU page table, which has to be recreated on DEST).

> 
> > > > Then all the audits/security checks
> > > > enforced in the normal emulation path should still apply. The vendor
> > > > driver may choose to audit every state restore operation one-by-one,
> > > > or do it altogether at a synchronization point (e.g. when the mdev is
> > > > re-scheduled, similar to what we did before VMENTRY).
> > >
> > > Giving the vendor driver the choice of whether to be secure or not is
> > > exactly what I'm trying to propose we spend some time thinking about.
> > > For instance, what if instead of allowing the user to load device state
> > > through a region, the kernel could side load it using something similar

Re: [libvirt] Matching the type of mediated devices in the migration

2018-08-21 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Wednesday, August 22, 2018 10:08 AM
> 
> On Wed, 22 Aug 2018 01:27:05 +0000
> "Tian, Kevin"  wrote:
> 
> > > From: Wang, Zhi A
> > > Sent: Wednesday, August 22, 2018 2:43 AM
> > > >
> > > > Are there any suggestions how we can deal with security issues?
> > > > Allowing userspace to provide a data stream representing the internal
> > > > state of a virtual device model living within the kernel seems
> > > > troublesome.  If we need to trust the data stream, do we need to
> > > > somehow make the operation more privileged than what a vfio user might
> > > > have otherwise?  Does the data stream need to be somehow signed and how
> > > > might we do that?  How can we build in protection against an untrusted
> > > > restore image?  Thanks,
> >
> > imo it is not necessary. restoring mdev state should be handled as if
> > the guest is programming the mdev.
> 
> To me this suggests that a state save/restore is just an algorithm
> executed by userspace using the existing vfio device accesses.  This is
> not at all what we've been discussing for migration.  I believe the

not an algorithm by userspace. It's the kernel driver that applies the
audit when receiving the opaque state data.

> interface we've been hashing out exposes opaque device state through a
> vfio region.  We therefore must assume that that opaque data contains
> not only device state, but also emulation state, similar to what we see
> for any QEMU device.  Not only is there internal emulation state, but
> we have no guarantee that the device state goes through the same
> auditing as it does through the vfio interface.  Since this device and
> emulation state live inside the kernel and not just within the user's
> own process, a malicious user can do far more than shoot themselves.  It
> would be one thing if devices were IOMMU isolated, but they're not,
> they're isolated through vendor and device specific mechanisms, and for
> all we know the parameters of that isolation are included in the
> restore state.  I don't see how we can say this is not an issue.

I didn't quite get this. My understanding is that the isolation configuration
is completed when an mdev is created on the DEST machine given a type
definition. The state image contains just runtime data reflecting what the
guest driver does on the SRC machine. Restoring such state shouldn't
change the isolation policy.

> 
> > Then all the audits/security checks
> > enforced in the normal emulation path should still apply. The vendor
> > driver may choose to audit every state restore operation one-by-one,
> > or do it altogether at a synchronization point (e.g. when the mdev is
> > rescheduled, similar to what we did before VMENTRY).
> 
> Giving the vendor driver the choice of whether to be secure or not is
> exactly what I'm trying to propose we spend some time thinking about.
> For instance, what if instead of allowing the user to load device state
> through a region, the kernel could side load it using something similar
> to the firmware loading path.  The user could be provided with a file
> name token that they push through the vfio interface to trigger the
> state loading from a location with proper file level ACLs such that the
> image can be considered trusted.  Unfortunately the collateral is that
> libvirt would need to become the secure delivery entity, somehow
> stripping this section of the migration stream into a file and
> providing a token for the user to ask the kernel to load it.  What are
> some other options?  Could save/restore be done simply as an
> algorithmic script matched to stack of data, as I read into your first
> statement above?  I have doubts that we can achieve the internal state
> we need, or maybe even the performance we need using such a process.
> Thanks,
> 

For GVT-g I think we invoke the same common functions as used in the
emulation path to recover vGPU state, e.g. the GTT read/write handlers, etc.
Zhi can correct me if I'm wrong.

Can you elaborate on the difference between device state and emulation
state which you mentioned earlier? We may need to look at a concrete
example to understand the actual problem here.
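
(As an aside, the side-loading flow proposed above might look roughly like
the following from userspace. The state file location and the 'load_state'
attribute are hypothetical illustrations only; no such vfio interface
exists.)

  # libvirt strips the device-state section out of the migration stream
  # and stores it in a root-owned file with restrictive permissions
  $ install -m 0600 my-vgpu.state /var/lib/libvirt/mdev-state/

  # the vfio user then passes only a token naming that file; the kernel,
  # not the user, reads and validates the image
  $ UUID=0695d332-7831-493f-9e71-1c85c8911a08
  $ echo my-vgpu.state > /sys/bus/mdev/devices/$UUID/load_state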

Thanks
Kevin



Re: [libvirt] Matching the type of mediated devices in the migration

2018-08-21 Thread Tian, Kevin
> From: Wang, Zhi A
> Sent: Wednesday, August 22, 2018 2:43 AM
> >
> > Are there any suggestions how we can deal with security issues?
> > Allowing userspace to provide a data stream representing the internal
> > state of a virtual device model living within the kernel seems
> > troublesome.  If we need to trust the data stream, do we need to
> > somehow make the operation more privileged than what a vfio user
> might
> > have otherwise?  Does the data stream need to be somehow signed and
> how
> > might we do that?  How can we build in protection against an untrusted
> > restore image?  Thanks,

IMO it is not necessary. Restoring mdev state should be handled as if the
guest is programming the mdev. Then all the audits/security checks
enforced in the normal emulation path should still apply. The vendor driver
may choose to audit every state restore operation one-by-one, or do it
altogether at a synchronization point (e.g. when the mdev is rescheduled,
similar to what we did before VMENTRY).

> What a good point!
> 
> I dug into the kernel module security case, which seems similar to this
> case. The security of loading a kernel module relies on root privilege and
> a signature.
> 
> For root privilege, QEMU could run as non-root under libvirtd, so this
> wouldn't be an option.
> 
> For signatures, I am wondering if there are any similar cases in other
> kernel components, like KVM or other modules which provide ioctls to
> userspace. Maybe they don't even load binaries from userspace, but
> they could suffer from a DDoS flood from userspace. Maybe some ioctls or
> interfaces in the kernel should only allow signed/trusted userspace
> applications to call them (previously it was "allow signed kernel modules
> to load").
> 
> Thanks,
> Zhi.
> 
> >
> > Alex
> >



Re: [libvirt] [PATCH 0/3] sample: vfio mdev display devices.

2018-04-25 Thread Tian, Kevin
> From: Alex Williamson
> Sent: Thursday, April 19, 2018 2:32 AM
> 
> That almost begins to look reasonable, but then we can only expose this
> for mdev devices, what if we were to hack a back door into a directly
> assigned GPU that tracks the location of active display in the
> framebuffer and implement the GFX_PLANE interface for that?  We have no
> sysfs representation for either the template or the actual device for
> anything other than mdev.  This inconsistency with physically assigned
> devices has been one of my arguments against enhancing mdev sysfs.
> 

One possible option is to wrap a directly assigned GPU into an mdev. The
parent driver could be a dummy PCI driver which does basic PCI
initialization and then provides hooks for vendor-specific hacks.

Thanks
Kevin



Re: [libvirt] [RFC]Add new mdev interface for QoS

2017-08-01 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Wednesday, August 2, 2017 6:26 AM
> 
> On Tue, 1 Aug 2017 13:54:27 +0800
> "Gao, Ping A"  wrote:
> 
> > On 2017/7/28 0:00, Gao, Ping A wrote:
> > > On 2017/7/27 0:43, Alex Williamson wrote:
> > >> [cc +libvir-list]
> > >>
> > >> On Wed, 26 Jul 2017 21:16:59 +0800
> > >> "Gao, Ping A"  wrote:
> > >>
> > >>> The vfio-mdev framework provides the capability to let different
> > >>> guests share the same physical device through mediated sharing. As a
> > >>> result it brings a requirement about how to control the device
> > >>> sharing: we need a QoS-related interface for mdev to manage virtual
> > >>> device resources.
> > >>>
> > >>> E.g. in practical use, vGPUs assigned to different guests almost
> > >>> always have different performance requirements: some guests may need
> > >>> higher priority for real-time usage, others may need a larger portion
> > >>> of the GPU resource to get higher 3D performance. Correspondingly we
> > >>> can define interfaces like weight/cap for overall budget control and
> > >>> priority for single-submission control.
> > >>>
> > >>> So I suggest adding some common vendor-agnostic attributes in the
> > >>> mdev core sysfs for QoS purposes.
> > >> I think what you're asking for is just some standardization of a QoS
> > >> attribute_group which a vendor can optionally include within the
> > >> existing mdev_parent_ops.mdev_attr_groups.  The mdev core will
> > >> transparently enable this, but it really only provides the standard,
> > >> all of the support code is left for the vendor.  I'm fine with that,
> > >> but of course the trouble with any sort of standardization is arriving
> > >> at an agreed upon standard.  Are there QoS knobs that are generic
> > >> across any mdev device type?  Are there others that are more specific
> > >> to vGPU?  Are there existing examples of this that we can steal their
> > >> specification?
> > > Yes, you are right, standardized QoS knobs are exactly what I wanted.
> > > Only when they become part of the mdev framework and libvirt can such
> > > a critical feature as QoS be leveraged by cloud usage. HW vendors then
> > > only need to focus on implementing the corresponding QoS algorithm in
> > > their back-end driver.
> > >
> > > The vfio-mdev framework provides the capability to share a device that
> > > lacks HW virtualization support among guests, no matter the device
> > > type. Mediated sharing is essentially a time-sharing multiplexing
> > > method; from this point of view, QoS can be taken as a generic way to
> > > control the time assignment for the virtual mdev devices that occupy
> > > the HW. As a result we can define QoS knobs generic across any device
> > > type this way. Even if the HW has some kind of built-in QoS support, I
> > > think it's not a problem for the back-end driver to convert the
> > > standard mdev QoS definition to its own specification and reach the
> > > same performance expectation. Since there seem to be no examples for
> > > us to follow, we need to define it from scratch.
> > >
> > > I propose universal QoS control interfaces like below:
> > >
> > > Cap: limits the maximum percentage of time an mdev device can own the
> > > physical device, e.g. cap=60 means the mdev device cannot take more
> > > than 60% of the total physical resource.
> > >
> > > Weight: defines proportional control of the mdev device resource
> > > between guests; it's orthogonal to Cap and targets load balancing.
> > > E.g. if guest 1 should take double the mdev device resource compared
> > > with guest 2, set the weight ratio to 2:1.
> > >
> > > Priority: the guest with higher priority gets execution first,
> > > targeting real-time usage and speeding up interactive response.
> > >
> > > The above QoS interfaces cover both overall budget control and
> > > single-submission control. I will send out a detailed design later
> > > once we get aligned.
> >
> > Hi Alex,
> > Any comments about the interface mentioned above?
> 
> Not really.
> 
> Kirti, are there any QoS knobs that would be interesting
> for NVIDIA devices?
> 
> Implementing libvirt support at the same time might be an interesting
> exercise if we don't have a second user in the kernel to validate
> against.  We could at least have two communities reviewing the feature
> then.  Thanks,
> 

We planned to introduce new vdev types to indirectly validate some
features (e.g. weight and cap) in our device model, which however will
not exercise the to-be-proposed sysfs interface. Yes, we can check/extend
libvirt simultaneously to draw a whole picture of all required changes in
the stack...
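
(For illustration only: the knobs proposed above might surface as per-mdev
sysfs attributes roughly like below. The paths and attribute names are
hypothetical, not an existing interface.)

  $ MDEV=/sys/bus/mdev/devices/0695d332-7831-493f-9e71-1c85c8911a08
  $ echo 60 > $MDEV/cap       # at most 60% of the physical device
  $ echo 2 > $MDEV/weight     # 2:1 share versus a weight-1 guest
  $ echo 1 > $MDEV/priority   # this vGPU's submissions execute first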

Thanks
Kevin


Re: [libvirt] summary of current vfio mdev upstreaming status

2016-09-29 Thread Tian, Kevin
> From: Neo Jia [mailto:c...@nvidia.com]
> Sent: Thursday, September 29, 2016 5:17 PM
> 
> On Thu, Sep 29, 2016 at 04:55:39PM +0800, Jike Song wrote:
> > Hi all,
> >
> > In order to have a clear understanding about the VFIO mdev upstreaming
> > status, I'd like to summarize it. Please share your opinions on this,
> > and correct my misunderstandings.
> >
> > The whole vfio mdev series can be logically divided into several parts;
> > they work together to provide the mdev support.
> 
> Hi Jike,
> 
> Thanks for summarizing this, but I will defer to Kirti to comment on the
> actual upstream status of her patches; a couple of things to note though:
> 
> 1) iommu type1 patches have been extensively reviewed by Alex already and we
> have one action item left to implement which is already queued up in v8 
> patchset.
> 
> 2) regarding the sysfs interface and libvirt discussion, I would like to
> hear what kind of attributes Intel folks have so far, as Daniel is asking
> about adding a class "gpu" which would pull in several attributes as
> mandatory.
> 

I have replied to Daniel that Intel doesn't plan to support those attributes.
We think the default mdev attributes are enough for our requirements. You
may put them into the 'description' attribute for informative purposes...

If both sides agree on this assumption, along with Kirti's reply that you
don't require grouping mdevs any more, it'd be more promising to push the
whole mdev bits upstream. :-)

Thanks
Kevin



Re: [libvirt] [Qemu-devel] [RFC v2] libvirt vGPU QEMU integration

2016-09-29 Thread Tian, Kevin
> From: Daniel P. Berrange [mailto:berra...@redhat.com]
> Sent: Thursday, September 29, 2016 10:39 PM
> 
> On Thu, Sep 29, 2016 at 02:35:48PM +, Tian, Kevin wrote:
> > > From: Daniel P. Berrange [mailto:berra...@redhat.com]
> > > Sent: Thursday, September 29, 2016 4:06 PM
> > >
> > > On Wed, Sep 28, 2016 at 12:48:33PM -0700, Neo Jia wrote:
> > > > On Tue, Sep 20, 2016 at 10:47:53AM +0100, Daniel P. Berrange wrote:
> > > > > On Tue, Sep 20, 2016 at 02:05:52AM +0530, Kirti Wankhede wrote:
> > > > > >
> > > > > > Hi libvirt experts,
> > > > > >
> > > > > > Thanks for valuable input on v1 version of RFC.
> > > > > >
> > > > > > Quick brief: the VFIO-based mediated device framework provides a
> > > > > > way for vendors to virtualize their devices without SR-IOV, like
> > > > > > NVIDIA vGPU, Intel KVMGT and IBM's channel IO.
> > > > > > functionalities for mediated devices which are currently being used 
> > > > > > for
> > > > > > pass through devices. This framework introduces a set of new sysfs 
> > > > > > files
> > > > > > for device creation and its life cycle management.
> > > > > >
> > > > > > Here is the summary of discussion on v1:
> > > > > > 1. Discover mediated device:
> > > > > > As part of physical device initialization process, vendor driver 
> > > > > > will
> > > > > > register their physical devices, which will be used to create 
> > > > > > virtual
> > > > > > device (mediated device, aka mdev) to the mediated framework.
> > > > > >
> > > > > > Vendor driver should specify mdev_supported_types in directory 
> > > > > > format.
> > > > > > This format is class based, for example, display class directory 
> > > > > > format
> > > > > > should be as below. We need to define such set for each class of 
> > > > > > devices
> > > > > > which would be supported by mediated device framework.
> > > > > >
> > > > > >  --- mdev_destroy
> > > > > >  --- mdev_supported_types
> > > > > >  |-- 11
> > > > > >  |   |-- create
> > > > > >  |   |-- name
> > > > > >  |   |-- fb_length
> > > > > >  |   |-- resolution
> > > > > >  |   |-- heads
> > > > > >  |   |-- max_instances
> > > > > >  |   |-- params
> > > > > >  |   |-- requires_group
> > > > > >  |-- 12
> > > > > >  |   |-- create
> > > > > >  |   |-- name
> > > > > >  |   |-- fb_length
> > > > > >  |   |-- resolution
> > > > > >  |   |-- heads
> > > > > >  |   |-- max_instances
> > > > > >  |   |-- params
> > > > > >  |   |-- requires_group
> > > > > >  |-- 13
> > > > > >  |-- create
> > > > > >  |-- name
> > > > > >  |-- fb_length
> > > > > >  |-- resolution
> > > > > >  |-- heads
> > > > > >  |-- max_instances
> > > > > >  |-- params
> > > > > >  |-- requires_group
> > > > > >
> > > > > >
> > > > > > In the above example directory '11' represents a type id of mdev 
> > > > > > device.
> > > > > > 'name', 'fb_length', 'resolution', 'heads', 'max_instance' and
> > > > > > 'requires_group' would be Read-Only files that the vendor would
> > > > > > provide to describe that type.
> > > > > >
> > > > > > 'create':
> > > > > > Write-only file. Mandatory.
> > > > > > Accepts string to create mediated device.
> > > > > >
> > > > > > 'name':
> > > > > > Read-Only file. Mandatory.
> > > > > > Returns string, the name of that type id.
> > > > >
> > > > > Presumably this is a human-targeted title/descr

Re: [libvirt] [Qemu-devel] [RFC v2] libvirt vGPU QEMU integration

2016-09-29 Thread Tian, Kevin
> From: Daniel P. Berrange [mailto:berra...@redhat.com]
> Sent: Thursday, September 29, 2016 4:06 PM
> 
> On Wed, Sep 28, 2016 at 12:48:33PM -0700, Neo Jia wrote:
> > On Tue, Sep 20, 2016 at 10:47:53AM +0100, Daniel P. Berrange wrote:
> > > On Tue, Sep 20, 2016 at 02:05:52AM +0530, Kirti Wankhede wrote:
> > > >
> > > > Hi libvirt experts,
> > > >
> > > > Thanks for valuable input on v1 version of RFC.
> > > >
> > > > Quick brief: the VFIO-based mediated device framework provides a way
> > > > for vendors to virtualize their devices without SR-IOV, like NVIDIA
> > > > vGPU, Intel KVMGT and IBM's channel IO.
> > > > functionalities for mediated devices which are currently being used for
> > > > pass through devices. This framework introduces a set of new sysfs files
> > > > for device creation and its life cycle management.
> > > >
> > > > Here is the summary of discussion on v1:
> > > > 1. Discover mediated device:
> > > > As part of physical device initialization process, vendor driver will
> > > > register their physical devices, which will be used to create virtual
> > > > device (mediated device, aka mdev) to the mediated framework.
> > > >
> > > > Vendor driver should specify mdev_supported_types in directory format.
> > > > This format is class based, for example, display class directory format
> > > > should be as below. We need to define such set for each class of devices
> > > > which would be supported by mediated device framework.
> > > >
> > > >  --- mdev_destroy
> > > >  --- mdev_supported_types
> > > >  |-- 11
> > > >  |   |-- create
> > > >  |   |-- name
> > > >  |   |-- fb_length
> > > >  |   |-- resolution
> > > >  |   |-- heads
> > > >  |   |-- max_instances
> > > >  |   |-- params
> > > >  |   |-- requires_group
> > > >  |-- 12
> > > >  |   |-- create
> > > >  |   |-- name
> > > >  |   |-- fb_length
> > > >  |   |-- resolution
> > > >  |   |-- heads
> > > >  |   |-- max_instances
> > > >  |   |-- params
> > > >  |   |-- requires_group
> > > >  |-- 13
> > > >  |-- create
> > > >  |-- name
> > > >  |-- fb_length
> > > >  |-- resolution
> > > >  |-- heads
> > > >  |-- max_instances
> > > >  |-- params
> > > >  |-- requires_group
> > > >
> > > >
> > > > In the above example directory '11' represents a type id of mdev device.
> > > > 'name', 'fb_length', 'resolution', 'heads', 'max_instance' and
> > > > 'requires_group' would be Read-Only files that the vendor would
> > > > provide to describe that type.
> > > >
> > > > 'create':
> > > > Write-only file. Mandatory.
> > > > Accepts string to create mediated device.
> > > >
> > > > 'name':
> > > > Read-Only file. Mandatory.
> > > > Returns string, the name of that type id.
> > >
> > > Presumably this is a human-targeted title/description of
> > > the device.
> > >
> > > >
> > > > 'fb_length':
> > > > Read-only file. Mandatory.
> > > > Returns {K,M,G}, size of framebuffer.
> > > >
> > > > 'resolution':
> > > > Read-Only file. Mandatory.
> > > > Returns 'hres x vres' format. Maximum supported resolution.
> > > >
> > > > 'heads':
> > > > Read-Only file. Mandatory.
> > > > Returns integer. Number of maximum heads supported.
> > >
> > > None of these should be mandatory as that makes the mdev
> > > useless for non-GPU devices.
> > >
> > > I'd expect to see a 'class' or 'type' attribute in the
> > > directory which tells you what kind of mdev it is. A
> > > valid 'class' value would be 'gpu'. The fb_length,
> > > resolution, and heads parameters would only be mandatory
> > > when class==gpu.
> > >
> >
> > Hi Daniel,
> >
> > Here you are proposing to add a class named "gpu", which will make all
> > those gpu-related attributes mandatory, so that libvirt can allow the
> > user to better parse/present a particular mdev configuration?
> >
> > I am just wondering if there is another option where we just make all
> > those attributes that an mdev device can have optional, but still
> > meaningful to libvirt, so libvirt can still parse/recognize them as a
> > class "mdev".
> 
> 'mdev' isn't a class - mdev is the name of the kernel module. The class
> refers to the broad capability of the device. class would be things
> like "gpu", "nic", "fpga" or other such things. The point of the class
> is to identify which other attributes will be considered mandatory.
> 
> 

Thanks Daniel. This class definition makes sense to me.

However I'm not sure whether we should define such common mandatory
attributes of a 'gpu' class now. Intel will go with a power-of-2 sharing
type definition... the actual type names are to be finalized, but an
example looks like below:

[GVTG-SKL-x2]: available instances (2)
[GVTG-SKL-x4]: available instances (4)
[GVTG-SKL-x8]: available instances (8)
...
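
(As an illustration, creating an instance of one of these types through the
sysfs interface discussed in this thread might look like below; the parent
device path is made up.)

  $ TYPE=/sys/bus/pci/devices/0000:00:02.0/mdev_supported_types/GVTG-SKL-x4
  $ cat $TYPE/available_instances     # e.g. prints 4
  $ uuidgen > $TYPE/create            # consumes one instance
  $ cat $TYPE/available_instances     # now prints 3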

Users can create different types of vGPUs simultaneously. A GVTG-SKL-x2 type
vGPU will

Re: [libvirt] [RFC v2] libvirt vGPU QEMU integration

2016-09-28 Thread Tian, Kevin
> From: Neo Jia [mailto:c...@nvidia.com]
> Sent: Thursday, September 29, 2016 3:23 AM
> 
> On Thu, Sep 22, 2016 at 03:26:38PM +0100, Daniel P. Berrange wrote:
> > On Thu, Sep 22, 2016 at 08:19:21AM -0600, Alex Williamson wrote:
> > > On Thu, 22 Sep 2016 09:41:20 +0530
> > > Kirti Wankhede  wrote:
> > >
> > > > > My concern is that a type id seems arbitrary but we're specifying 
> > > > > that
> > > > > it be unique.  We already have something unique, the name.  So 
> > > > > why try
> > > > > to make the type id unique as well?  A vendor can accidentally 
> > > > > create
> > > > > their vendor driver so that a given name means something very
> > > > > specific.  On the other hand they need to be extremely deliberate 
> > > > > to
> > > > > coordinate that a type id means a unique thing across all their 
> > > > > product
> > > > > lines.
> > > > >
> > > > 
> > > >  Let me clarify, type id should be unique in the list of
> > > >  mdev_supported_types. You can't have two directories with the same
> > > >  name.
> > > > >>>
> > > > >>> Of course, but does that mean it's only unique to the machine I'm
> > > > >>> currently running on?  Let's say I have a Tesla P100 on my system 
> > > > >>> and
> > > > >>> type-id 11 is named "GRID-M60-0B".  At some point in the future I
> > > > >>> replace the Tesla P100 with a Q1000 (made up).  Is type-id 11 on 
> > > > >>> that
> > > > >>> new card still going to be a "GRID-M60-0B"?  If not then we've based
> > > > >>> our XML on the wrong attribute.  If the new device does not support
> > > > >>> "GRID-M60-0B" then we should generate an error, not simply 
> > > > >>> initialize
> > > > >>> whatever type-id 11 happens to be on this new card.
> > > > >>>
> > > > >>
> > > > >> If there are 2 M60 in the system then you would find '11' type 
> > > > >> directory
> > > > >> in mdev_supported_types of both M60. If you have P100, '11' type 
> > > > >> would
> > > > >> not be there in its mdev_supported_types, it will have different 
> > > > >> types.
> > > > >>
> > > > >> For example, if you replace M60 with P100, but XML is not updated. 
> > > > >> XML
> > > > >> have type '11'. When libvirt would try to create mdev device, libvirt
> > > > >> would have to find 'create' file in sysfs in following directory 
> > > > >> format:
> > > > >>
> > > > >>  --- mdev_supported_types
> > > > >>  |-- 11
> > > > >>  |   |-- create
> > > > >>
> > > > >> but now for P100, '11' directory is not there, so libvirt should 
> > > > >> throw
> > > > >> error on not able to find '11' directory.
> > > > >
> > > > > This really seems like an accident waiting to happen.  What happens
> > > > > when the user replaces their M60 with an Intel XYZ device that happens
> > > > > to expose a type 11 mdev class gpu device?  How is libvirt supposed to
> > > > > know that the XML used to refer to a GRID-M60-0B and now it's an
> > > > > INTEL-IGD-XYZ?  Doesn't basing the XML entry on the name and removing
> > > > > yet another arbitrary requirement that we have some sort of globally
> > > > > unique type-id database make a lot of sense?  The same issue applies
> > > > > for simple debug-ability, if I'm reviewing the XML for a domain and 
> > > > > the
> > > > > name is the primary index for the mdev device, I know what it is.
> > > > > Seeing type-id='11' is meaningless.
> > > > >
> > > >
> > > > Let me clarify again, type '11' is a string that vendor driver would
> > > > define (see my previous reply below) it could be "11" or "GRID-M60-0B".
> > > > If 2 vendors used same string we can't control that. right?
> > > >
> > > >
> > > >  Let's remove 'id' from the type id in the XML if that is the
> > > >  concern. Supported types are going to be defined by the vendor
> > > >  driver, so let the vendor driver decide what to use for the
> > > >  directory name, and the same should be used in the device XML
> > > >  file; it could be '11' or "GRID M60-0B":
> > > > 
> > > >  <device>
> > > >    <name>my-vgpu</name>
> > > >    <parent>pci__86_00_0</parent>
> > > >    <capability type="mdev">
> > > >      <type id="GRID M60-0B"/>
> > > >    </capability>
> > > >  </device>

Re: [libvirt] [RFC v2] libvirt vGPU QEMU integration

2016-09-22 Thread Tian, Kevin
> From: Kirti Wankhede [mailto:kwankh...@nvidia.com]
> Sent: Wednesday, September 21, 2016 12:23 AM
> >
> >>>  I have
> >>> a hard time believing that a given vendor can even allocate unique type
> >>> ids for their own devices.  Unique type id across vendors is not
> >>> practical.  So which attribute are we actually using to identify which
> >>> type of mdev device we need and why would we ever specify additional
> >>> attributes like fb_length?  Doesn't the vendor guarantee that "GRID
> >>> M60-0B" has a fixed setup of those attributes?
> >>>
> >>
> >> Specifying attributes here is not our requirement. Yes we have fixed set
> >> of attributes for "GRID-M60-0B" and on.
> >> We are defining the attributed here for "display" class for all other
> >> vendor of gpu can use.
> >>

Hi Kirti,

We decided to go with the above type-based interface for KVMGT, with a fixed
setup of attributes too. If both Intel and NVIDIA solutions use such a fixed
manner, can we go with a proposal which doesn't include the 'class' extension
for now? Later, if there is a concrete usage requiring such class-specific
attribute settings, it will be easy to extend following the discussion in
this thread. I'm thinking about how we can converge the discussion here into
something simple enough (and extensible) to accelerate upstreaming of both
the Intel and NVIDIA solutions...

Thanks
Kevin



Re: [libvirt] [RFC v2] libvirt vGPU QEMU integration

2016-09-21 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Thursday, September 22, 2016 11:02 AM
> 
> On Thu, 22 Sep 2016 02:33:01 +0000
> "Tian, Kevin" <kevin.t...@intel.com> wrote:
> 
> > > From: Alex Williamson [mailto:alex.william...@redhat.com]
> > > Sent: Wednesday, September 21, 2016 12:37 PM
> > >
> > > On Wed, 21 Sep 2016 03:56:21 +
> > > "Tian, Kevin" <kevin.t...@intel.com> wrote:
> > >
> > > > > From: Kirti Wankhede [mailto:kwankh...@nvidia.com]
> > > > > Sent: Tuesday, September 20, 2016 10:22 PM
> > > > > >>
> > > > > >> 'max_instance':
> > > > > >> Read-Only file. Mandatory.
> > > > > >> Returns integer.  Returns the maximum number of mdev devices
> > > > > >> that could be created at the moment when this file is read. This
> > > > > >> count would be updated by the vendor driver. Before creating an
> > > > > >> mdev device of this type, check that max_instance is > 0.
> > > > > >
> > > > > > Didn't we discuss this being something like
> > > > > > "available_instances"
> > > > > > with a "devices" sub-directory linking current devices of that type?
> > > > > >
> > > > >
> > > > > I'm ok with 'available_instances' as name of this file.
> > > > > Yes, mdev device directory will have links to devices of that type. I
> > > > > think that is part of mdev core module discussion. I'm trying to list
> > > > > sysfs files for libvirt interface here to focus more on libvirt
> > > > > interface. Hence didn't add that. I'll make sure to add all such 
> > > > > details
> > > > > in future.
> > > > >
> > > > >
> > > >
> > > > Definitely you need to provide those details since they also form
> > > > part of the libvirt interface, e.g. 'destroy', which we discussed
> > > > should be in the mdev directory.
> > >
> > > $ find /sys/devices -name remove | wc -l
> > > 24
> > > $ find /sys/devices -name destroy | wc -l
> > > 0
> > >
> > > 'remove' is the sysfs deconstructor we should use.
> >
> > Makes sense.
> >
> > >
> > > > Another example is the class-specific attributes such
> > > > as 'resolution', etc. which you listed under the 'display' class. All
> > > > those attributes should be moved to the mdev directory; only the
> > > > class ID is required under each type.
> > >
> > > Can you elaborate on what you're proposing here?  If we don't have
> > > attributes like 'resolution' under the type-id then we can't describe
> > > to the user the features of the mdev before we create it.  I don't
> > > think anybody wants a 'create it and find out' type of interface.
> > > Thanks,
> > >
> >
> > I think such a way would be racy. What about two clients creating mdev
> > devices simultaneously? How do we guarantee their configuration of
> > class-specific attributes is not mutually impacted, since before creation
> > any such configuration would be global under that type?
> 
> A simple mutex in the sysfs store attribute to serialize, return write
> error if insufficient resources to create additional devices...  Easy.
> Let's not exaggerate the issue.
> 
> > My feeling is that we'd better keep create simple: just write a UUID
> > to "type-id/create". Any class-specific attributes, if we really want to
> > support them, should be individually configurable, with the required
> > resources allocated incrementally on top of the basic resources allocated
> > at create stage. Then libvirt can set those attributes between creating
> > and opening an mdev device. If this direction is accepted, then naturally
> > such attributes should be put under the mdev directory. Possibly we
> > don't need a class description under type-id; libvirt just checks directly
> > whether any known class is represented in each mdev directory (the vendor
> > driver will expose it on demand), and then operates on attributes under
> > that class.
> 
> It seems like this just moves the hypothetical race to the point where
> we're adjusting individual attributes for the mdev device, but now the
> problem is worse because there's no simple locking or serialization
> that guarantees we can get all the attributes we want.  We might get
> half the attributes for one devi

Re: [libvirt] [RFC v2] libvirt vGPU QEMU integration

2016-09-21 Thread Tian, Kevin
> From: Tian, Kevin
> Sent: Thursday, September 22, 2016 10:33 AM
> 
> >
> > > Another example is the class-specific attributes such
> > > as 'resolution', etc. which you listed under the 'display' class. All
> > > those attributes should be moved to the mdev directory; only the
> > > class ID is required under each type.
> >
> > Can you elaborate on what you're proposing here?  If we don't have
> > attributes like 'resolution' under the type-id then we can't describe
> > to the user the features of the mdev before we create it.  I don't
> > think anybody wants a 'create it and find out' type of interface.
> > Thanks,
> >
> 
> I think such a way would be racy. What about two clients creating mdev
> devices simultaneously? How do we guarantee their configuration of
> class-specific attributes is not mutually impacted, since before creation
> any such configuration would be global under that type?
> 
> My feeling is that we'd better keep create simple: just write a UUID
> to "type-id/create". Any class-specific attributes, if we really want to
> support them, should be individually configurable, with the required
> resources allocated incrementally on top of the basic resources allocated
> at create stage. Then libvirt can set those attributes between creating
> and opening an mdev device. If this direction is accepted, then naturally
> such attributes should be put under the mdev directory. Possibly we
> don't need a class description under type-id; libvirt just checks directly
> whether any known class is represented in each mdev directory (the vendor
> driver will expose it on demand), and then operates on attributes under
> that class.
> 

I have a better understanding of your concern now. The user needs to know
and set a list of attributes before requesting to create a new mdev
device; from this angle all attributes should be enumerated per type-id.
But as I explained above, writing multiple sysfs nodes to create an
mdev device is racy. We either need a way to guarantee transactional
operations on those nodes, or have the type-id directory only expose the
supported attributes, while programming those attributes has to go
through the per-mdev directory after the mdev device is created...

Thanks
Kevin



Re: [libvirt] [RFC v2] libvirt vGPU QEMU integration

2016-09-21 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Wednesday, September 21, 2016 12:43 PM
> 
> On Wed, 21 Sep 2016 04:10:53 +0000
> "Tian, Kevin" <kevin.t...@intel.com> wrote:
> 
> > > From: Kirti Wankhede [mailto:kwankh...@nvidia.com]
> > > Sent: Wednesday, September 21, 2016 12:23 AM
> > >
> > >
> > > On 9/20/2016 8:13 PM, Alex Williamson wrote:
> > > > On Tue, 20 Sep 2016 19:51:58 +0530
> > > > Kirti Wankhede <kwankh...@nvidia.com> wrote:
> > > >
> > > >> On 9/20/2016 3:06 AM, Alex Williamson wrote:
> > > >>> On Tue, 20 Sep 2016 02:05:52 +0530
> > > >>> Kirti Wankhede <kwankh...@nvidia.com> wrote:
> > > >>>
> > > >>>> Hi libvirt experts,
> > > >>>>
> > > >>>>
> > > >>>> 'create':
> > > >>>> Write-only file. Mandatory.
> > > >>>> Accepts string to create mediated device.
> > > >>>>
> > > >>>> 'name':
> > > >>>> Read-Only file. Mandatory.
> > > >>>> Returns string, the name of that type id.
> > > >>>>
> > > >>>> 'fb_length':
> > > >>>> Read-only file. Mandatory.
> > > >>>> Returns {K,M,G}, size of framebuffer.
> > > >>>
> > > >>> This can't be mandatory, it's only relevant to vGPU devices, vGPUs are
> > > >>> just one user of mediated devices.
> > > >>>
> > > >>
> > > >> As per our discussion in BOF, we would define class and its attributes.
> > > >> As I mentioned above these are attributes of "display" class.
> > > >
> > > > Just as I didn't know here, how does libvirt know the class of a given
> > > > type id?
> > > >
> > >
> > > Yes, we need some way to identify the class as Daniel mentioned in his
> > > suggestion. Add a file 'class' in mdev_supported_types that would return
> > > the class.
> > >
> > >
> >
> > 'display' class or 'gpu' class? Display represents only part of GPU
> > functionality. I assume you want to provide a unique class ID
> > for each type, instead of allowing multiple classes each identifying
> > a subset of controllable GPU resources (e.g. 'display', 'render', etc.)...
> 
> Not sure where you're going with this, we're not creating a programming
> interface for the device, that exists through the vfio API.  I believe
> the intention here is simply a user level classification for the
> purpose of being able to interpret the attributes provided in a logical
> way.

I'm not against that purpose.

My main point is that 'display' is not an accurate classification here.
If we allow classification at the 'display' level, it implies that
finer-grained classifications may be further defined, upon which
vendor-specific GPU resources are allowed to be configured, which might be
overkill. Here we just need a general classification like 'gpu' to cover
all possible vendor-agnostic attributes beyond display.

> 
> > For a unique class ID, I once considered whether it could be directly
> > inherited from the class of the parent device, since typically a vendor
> > driver creates mdev devices using the same type as the physical device.
> > But later I realized one potential value of keeping a class field for
> > mdev types, e.g. if we want to extend mdev to FPGAs which could be
> > programmed as different classes of functionality. :-)
> 
> And how do we generically determine the class of a parent device?  We
> cannot assume the parent device is PCI.  Thanks,
> 

Yes, this is also a good point.

Thanks
Kevin



Re: [libvirt] [RFC v2] libvirt vGPU QEMU integration

2016-09-21 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Wednesday, September 21, 2016 12:37 PM
> 
> On Wed, 21 Sep 2016 03:56:21 +0000
> "Tian, Kevin" <kevin.t...@intel.com> wrote:
> 
> > > From: Kirti Wankhede [mailto:kwankh...@nvidia.com]
> > > Sent: Tuesday, September 20, 2016 10:22 PM
> > > >>
> > > >> 'max_instance':
> > > >> Read-Only file. Mandatory.
> > > >> Returns integer.  Returns the maximum number of mdev devices that
> > > >> could be created at the moment when this file is read. This count
> > > >> would be updated by the vendor driver. Before creating an mdev
> > > >> device of this type, check that max_instance is > 0.
> > > >
> > > > Didn't we discuss this being something like "available_instances"
> > > > with a "devices" sub-directory linking current devices of that type?
> > > >
> > >
> > > I'm ok with 'available_instances' as name of this file.
> > > Yes, mdev device directory will have links to devices of that type. I
> > > think that is part of mdev core module discussion. I'm trying to list
> > > sysfs files for libvirt interface here to focus more on libvirt
> > > interface. Hence didn't add that. I'll make sure to add all such details
> > > in future.
> > >
> > >
> >
> > Definitely you need to provide those details since they also form part
> > of the libvirt interface, e.g. 'destroy', which we discussed should be
> > in the mdev directory.
> 
> $ find /sys/devices -name remove | wc -l
> 24
> $ find /sys/devices -name destroy | wc -l
> 0
> 
> 'remove' is the sysfs deconstructor we should use.

Makes sense.
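
(Assuming mdev adopts the same convention, teardown would then be the
standard sysfs 'remove' rather than a custom 'destroy', e.g.:)

  $ UUID=0695d332-7831-493f-9e71-1c85c8911a08
  $ echo 1 > /sys/bus/mdev/devices/$UUID/remove   # deconstruct the mdev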

> 
> > Another example is the class-specific attributes such
> > as 'resolution', etc. which you listed under the 'display' class. All
> > those attributes should be moved to the mdev directory; only the class
> > ID is required under each type.
> 
> Can you elaborate on what you're proposing here?  If we don't have
> attributes like 'resolution' under the type-id then we can't describe
> to the user the features of the mdev before we create it.  I don't
> think anybody wants a 'create it and find out' type of interface.
> Thanks,
> 

I think such a way would be racy. What about two clients creating mdev
devices simultaneously? How do we guarantee their configuration of
class-specific attributes is not mutually impacted, since before creation
any such configuration would be global under that type?

My feeling is that we'd better keep create simple: just write a UUID
to "type-id/create". Any class-specific attributes, if we really want to
support them, should be individually configurable, with the required
resources allocated incrementally on top of the basic resources allocated
at create stage. Then libvirt can set those attributes between creating
and opening an mdev device. If this direction is accepted, then naturally
such attributes should be put under the mdev directory. Possibly we
don't need a class description under type-id; libvirt just checks directly
whether any known class is represented in each mdev directory (the vendor
driver will expose it on demand), and then operates on attributes under
that class.
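
(A sketch of that ordering, with a hypothetical 'resolution' attribute
standing in for any class-specific knob; the paths are made up.)

  # 1) keep create simple: write only a UUID to the type's create node
  $ TYPE=/sys/bus/pci/devices/0000:00:02.0/mdev_supported_types/11
  $ UUID=$(uuidgen)
  $ echo $UUID > $TYPE/create
  # 2) configure class-specific attributes per-mdev, between create and open
  $ echo 1920x1080 > /sys/bus/mdev/devices/$UUID/resolution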

Thanks
Kevin



Re: [libvirt] [RFC v2] libvirt vGPU QEMU integration

2016-09-20 Thread Tian, Kevin
> From: Kirti Wankhede [mailto:kwankh...@nvidia.com]
> Sent: Wednesday, September 21, 2016 12:23 AM
> 
> 
> On 9/20/2016 8:13 PM, Alex Williamson wrote:
> > On Tue, 20 Sep 2016 19:51:58 +0530
> > Kirti Wankhede  wrote:
> >
> >> On 9/20/2016 3:06 AM, Alex Williamson wrote:
> >>> On Tue, 20 Sep 2016 02:05:52 +0530
> >>> Kirti Wankhede  wrote:
> >>>
>  Hi libvirt experts,
> 
> 
>  'create':
>  Write-only file. Mandatory.
>  Accepts string to create mediated device.
> 
>  'name':
>  Read-Only file. Mandatory.
>  Returns string, the name of that type id.
> 
>  'fb_length':
>  Read-only file. Mandatory.
>  Returns {K,M,G}, size of framebuffer.
> >>>
> >>> This can't be mandatory, it's only relevant to vGPU devices, vGPUs are
> >>> just one user of mediated devices.
> >>>
> >>
> >> As per our discussion in BOF, we would define class and its attributes.
> >> As I mentioned above these are attributes of "display" class.
> >
> > Just as I didn't know here, how does libvirt know the class of a given
> > type id?
> >
> 
> Yes, we need some way to identify the class as Daniel mentioned in his
> suggestion. Add a file 'class' in mdev_supported_types that would return
> the class.
> 
> 

'display' class or 'gpu' class? Display represents only part of GPU
functionality. I assume you want to provide a unique class ID
for each type, instead of allowing multiple classes each identifying
a subset of controllable GPU resources (e.g. 'display', 'render', etc.)...

For a unique class ID, I once considered whether it could be directly
inherited from the class of the parent device, since typically a vendor
driver creates mdev devices using the same type as the physical device.
But later I realized one potential value of keeping a class field for
mdev types, e.g. if we want to extend mdev to FPGAs which could be
programmed as different classes of functionality. :-)

Thanks
Kevin



Re: [libvirt] [RFC v2] libvirt vGPU QEMU integration

2016-09-20 Thread Tian, Kevin
> From: Kirti Wankhede [mailto:kwankh...@nvidia.com]
> Sent: Tuesday, September 20, 2016 10:22 PM
> >>
> >> 'max_instance':
> >> Read-Only file. Mandatory.
> >> Returns integer.  Returns the maximum number of mdev devices that
> >> could be created at the moment when this file is read. This count
> >> would be updated by the vendor driver. Before creating an mdev device
> >> of this type, check that max_instance is > 0.
> >
> > Didn't we discuss this being something like "available_instances"
> > with a "devices" sub-directory linking current devices of that type?
> >
> 
> I'm ok with 'available_instances' as name of this file.
> Yes, mdev device directory will have links to devices of that type. I
> think that is part of mdev core module discussion. I'm trying to list
> sysfs files for libvirt interface here to focus more on libvirt
> interface. Hence didn't add that. I'll make sure to add all such details
> in future.
> 
> 

Definitely you need to provide those details since they also form part of
the libvirt interface, e.g. 'destroy', which we discussed should be in the
mdev directory. Another example is the class-specific attributes such
as 'resolution', etc. which you listed under the 'display' class. All those
attributes should be moved to the mdev directory; only the class ID is
required under each type.

Thanks
Kevin



Re: [libvirt] [RFC v2] libvirt vGPU QEMU integration

2016-09-19 Thread Tian, Kevin
> From: Kirti Wankhede [mailto:kwankh...@nvidia.com]
> Sent: Tuesday, September 20, 2016 4:36 AM
> 
> 
> Hi libvirt experts,
> 
> Thanks for valuable input on v1 version of RFC.
> 
> Quick brief: the VFIO-based mediated device framework provides a way for
> vendors to virtualize their devices without SR-IOV, like NVIDIA vGPU,
> Intel KVMGT and IBM's channel IO.
> functionalities for mediated devices which are currently being used for
> pass through devices. This framework introduces a set of new sysfs files
> for device creation and its life cycle management.
> 
> Here is the summary of discussion on v1:
> 1. Discover mediated device:
> As part of physical device initialization process, vendor driver will
> register their physical devices, which will be used to create virtual
> device (mediated device, aka mdev) to the mediated framework.
> 
> Vendor driver should specify mdev_supported_types in directory format.
> This format is class based, for example, display class directory format
> should be as below. We need to define such set for each class of devices
> which would be supported by mediated device framework.
> 
>  --- mdev_destroy
>  --- mdev_supported_types
>  |-- 11
>  |   |-- create
>  |   |-- name
>  |   |-- fb_length
>  |   |-- resolution
>  |   |-- heads
>  |   |-- max_instances
>  |   |-- params
>  |   |-- requires_group
>  |-- 12
>  |   |-- create
>  |   |-- name
>  |   |-- fb_length
>  |   |-- resolution
>  |   |-- heads
>  |   |-- max_instances
>  |   |-- params
>  |   |-- requires_group
>  |-- 13
>  |-- create
>  |-- name
>  |-- fb_length
>  |-- resolution
>  |-- heads
>  |-- max_instances
>  |-- params
>  |-- requires_group
> 
> 

Per previous discussion, I think the layout below is more general and clearer:

  --- mdev_supported_types
  |-- ID (or name)
  |   |-- create
  |   |-- available_instances
  |   |-- requires_group
  |-- ID (or name)
  |   |-- create
  |   |-- available_instances
  |   |-- requires_group
 ...

Then under each mdev directory:
  |-- type (link to type directory)
  |-- destroy
  |-- online
  |-- reset

Group association may be done under mdev directory too, per Alex's earlier
suggestion.

Optional vendor-agnostic attributes can be extended in the future, e.g.:
  |-- priority
  |-- weight

Vendor-specific or class-specific parameters (a black string or explicit
attributes) could also be extended per-mdev directory, if necessary.
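
(To make the proposal concrete, a client driving this layout might do
something like below. This is only a sketch against the layout above, not
an implemented interface, and the parent device path is made up.)

  $ PARENT=/sys/bus/pci/devices/0000:00:02.0
  $ TYPE=$PARENT/mdev_supported_types/GVTG-SKL-x4
  $ cat $TYPE/available_instances                  # check capacity first
  $ UUID=$(uuidgen)
  $ echo $UUID > $TYPE/create                      # instantiate the mdev
  $ readlink /sys/bus/mdev/devices/$UUID/type      # link back to its type
  $ echo 1 > /sys/bus/mdev/devices/$UUID/online    # bring it up
  $ echo 1 > /sys/bus/mdev/devices/$UUID/destroy   # tear it down again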

Thanks
Kevin



Re: [libvirt] [RFC v2] libvirt vGPU QEMU integration

2016-09-19 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Tuesday, September 20, 2016 5:36 AM
> 
> > In the above example directory '11' represents a type id of mdev device.
> > 'name', 'fb_length', 'resolution', 'heads', 'max_instance' and
> > 'requires_group' would be Read-Only files that the vendor would provide
> > to describe that type.
> >
> > 'create':
> > Write-only file. Mandatory.
> > Accepts string to create mediated device.
> >
> > 'name':
> > Read-Only file. Mandatory.
> > Returns string, the name of that type id.
> >
> > 'fb_length':
> > Read-only file. Mandatory.
> > Returns {K,M,G}, size of framebuffer.
> 
> This can't be mandatory, it's only relevant to vGPU devices, vGPUs are
> just one user of mediated devices.

Agreed. At least it doesn't make sense for s390 passthrough usage.

> 
> >
> > 'max_instance':
> > Read-Only file. Mandatory.
> > Returns integer.  Returns the maximum number of mdev devices that could
> > be created at the moment when this file is read. This count would be
> > updated by the vendor driver. Before creating an mdev device of this
> > type, check that max_instance is > 0.
> 
> Didn't we discuss this being something like "available_instances"
> with a "devices" sub-directory linking current devices of that type?

Right. "available_instances" which is dynamically changed under two scenarios:

a) create a new mdev under one type decrements "available_instances" of this
type by 1;
b) create a new mdev under one type decrements "available_instances" under
other types by type-specific number, if vendor supports creating multiple types
simultaneously like in Intel solution;
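
(So a management tool would re-read the attribute across all of a parent's
types after each create, e.g. with the hypothetical paths used above:)

  $ TYPES=/sys/bus/pci/devices/0000:00:02.0/mdev_supported_types
  $ uuidgen > $TYPES/GVTG-SKL-x2/create   # scenario a): this type loses one
  $ for t in $TYPES/*; do echo "$(basename $t): $(cat $t/available_instances)"; done
                                          # scenario b): other types may drop too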

> 
> > 'params'
> > Write-Only file. Optional.
> > String input. Libvirt would pass the string given in XML file to
> > this file and then create mdev device. Set empty string to clear params.
> > For example, set parameter 'frame_rate_limiter=0' to disable frame rate
> > limiter for performance benchmarking, then create device of type 11. The
> > device created would have that parameter set by vendor driver.
> 
> Might as well just call it "magic", there's absolutely no ability to
> introspect what parameters are allowed or even valid here.  Can all
> parameters be changed dynamically?  Do parameter changes apply to
> existing devices or only future devices?  This is a poorly specified
> interface.  I'd prefer this be done via module options on the vendor
> driver.

I guess it might be mdev-specific in NVIDIA's case, so a module option
is not suitable. Then this magic string should be placed under the mdev
directory instead of here; otherwise there'll be a race condition.
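
(i.e. something like below, rather than a global per-type node; the per-mdev
'params' attribute is hypothetical:)

  $ UUID=0695d332-7831-493f-9e71-1c85c8911a08
  $ echo frame_rate_limiter=0 > /sys/bus/mdev/devices/$UUID/params
  # applies only to this mdev, so concurrent creates cannot trample each other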

> 
> >
> > 2. Create/destroy mediated device
> >
> > With above example, vGPU device XML would look like:
> >
> >
> >  my-vgpu
> >  pci__86_00_0
> >  
> >
> >1
> >'frame_rate_limiter=0'
> >  
> >
> >
> > 'type id' is mandatory.
> 
> I was under the impression that the vendor supplied "name" was the one
> unique identifier.  We're certainly not going to create a registrar to
> hand out type ids to each vendor, so what makes type id unique?  I have
> a hard time believing that a given vendor can even allocate unique type
> ids for their own devices.  Unique type id across vendors is not
> practical.  So which attribute are we actually using to identify which
> type of mdev device we need and why would we ever specify additional
> attributes like fb_length?  Doesn't the vendor guarantee that "GRID
> M60-0B" has a fixed setup of those attributes?

I'm a bit confused about type ID and type name here. Just keeping one
should be enough (either a number or a descriptive name), which is
per-device scope, not necessarily unique across vendors. As long as the
vendor defines the ID or name to be unique across its own generations, that
should be sufficient (I guess cross-vendor migration is not an interest here).

> 
> > 'params' is optional field. User should set this field if extra
> > parameters need to be set for a particular vGPU device. Libvirt don't
> > need to parse these params. These are meant for vendor driver.
> 
> ie. magic black hole.  nope.

A curious question here: though it's not a mandatory requirement for
Intel's GPU virtualization solution now, allowing such an extension might
be useful in the long term. Has libvirt ever supported a similar black
string in other usages? I think it's a problem for public cloud usage,
where any configuration on the managed device should be controlled by the
orchestration layer (e.g. OpenStack) automatically. It's not appropriate
to allow a cloud tenant to specify such a parameter, so in that manner
any configuration parameter should be explicitly understood from the
upper-level stack down to libvirt, and then vendor-specific parameters
would be hard to handle. On the other hand, if we consider the enterprise
virtualization scenario, which is more proprietary, an administrator may
add such vendor-specific parameters as a black string through libvirt, as long

Re: [libvirt] [Qemu-devel] [PATCH v7 0/4] Add Mediated device support

2016-09-07 Thread Tian, Kevin
> From: Alex Williamson
> Sent: Wednesday, September 07, 2016 5:29 AM
> 
> On Wed, 7 Sep 2016 01:05:11 +0530
> Kirti Wankhede  wrote:
> 
> > On 9/6/2016 11:10 PM, Alex Williamson wrote:
> > > On Sat, 3 Sep 2016 22:04:56 +0530
> > > Kirti Wankhede  wrote:
> > >
> > >> On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
> > >>>
> > >>>
> > >>> On 02/09/2016 20:33, Kirti Wankhede wrote:
> >   We could even do:
> > >>
> > >> echo $UUID1:$GROUPA > create
> > >>
> > >> where $GROUPA is the group ID of a previously created mdev device 
> > >> into
> > >> which $UUID1 is to be created and added to the same group.
> >  
> > >>>
> > >>> From the point of view of libvirt, I think I prefer Alex's idea.
> > >>> <group> could be an additional element in the nodedev-create XML:
> > >>>
> > >>> <device>
> > >>>   <name>my-vgpu</name>
> > >>>   <parent>pci__86_00_0</parent>
> > >>>   <capability type="mdev">
> > >>>     <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
> > >>>     <group>group1</group>
> > >>>   </capability>
> > >>> </device>
> > >>>
> > >>> (should group also be a UUID?)
> > >>>
> > >>
> > >> No, this should be a unique number in a system, similar to iommu_group.
> > >
> > > Sorry, just trying to catch up on this thread after a long weekend.
> > >
> > > We're talking about iommu groups here, we're not creating any sort of
> > > parallel grouping specific to mdev devices.
> >
> > I thought we were talking about a group of mdev devices and not an
> > iommu group. IIRC, there were concerns about it (this would be similar
> > to UUID+instance) and that would (ab)use iommu groups.
> 
> What constraints does a group, which is not an iommu group, place on the
> usage of the mdev devices?  What happens if we put two mdev devices in
> the same "mdev group" and then assign them to separate VMs/users?  I
> believe that the answer is that this theoretical "mdev group" doesn't
> actually impose any constraints on the devices within the group or how
> they're used.
> 
> vfio knows about iommu groups and we consider an iommu group to be the
> unit of ownership for userspace.  Therefore by placing multiple mdev
> devices within the same iommu group we can be assured that there's only
> one user for that group.  Furthermore, the specific case for this
> association on NVIDIA is to couple the hardware peer-to-peer resources
> for the individual mdev devices.  Therefore this particular grouping
> does imply a lack of isolation between those mdev devices involved in
> the group.
> 
> For mdev devices which are actually isolated from one another, where
> they don't poke these p2p holes, placing them in the same iommu group
> is definitely an abuse of the interface and is going to lead to
> problems with a single iommu context.  But how does libvirt know that
> one type of mdev device needs to be grouped while another type doesn't?

Can we introduce an attribute under a specific type to indicate such a p2p
requirement, so libvirt knows the additional grouping action is needed?
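
(With such an attribute, libvirt's flow could look roughly like below,
reusing the 'requires_group' node from the earlier sysfs proposals and the
'$UUID:$GROUP' create syntax from Alex's example; the paths are made up.)

  $ TYPE=/sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11
  $ cat $TYPE/requires_group                  # does this type need grouping?
  1
  $ UUID1=$(uuidgen); UUID2=$(uuidgen)
  $ echo $UUID1 > $TYPE/create                # first mdev: host picks a group
  $ GROUP=$(basename $(readlink /sys/bus/mdev/devices/$UUID1/iommu_group))
  $ echo $UUID2:$GROUP > $TYPE/create         # second mdev joins that group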

> 
> There's really not much that I like about using iommu groups in this
> way, it's just that they seem to solve this particular problem of
> enforcing how such a group can be used and imposing a second form of
> grouping onto the vfio infrastructure seems much too complex.
> 
> > I'm thinking about your suggestion, but would also like to know your
> > thoughts on how the sysfs interface would look. It's still not clear to
> > me. Or would it be better to have grouping at the mdev layer?
> 
> In previous replies I had proposed that a group could be an additional
> argument when we write the mdev UUID to the create entry in sysfs.
> This is specifically why I listed only the UUID when creating the first
> mdev device and UUID:group when creating the second.  The user would
> need to go determine the group ID allocated for the first entry to
> specify creating the second within that same group.
> 
> I have no love for this proposal, it's functional but not elegant and
> again leaves libvirt lost in trying to determine which devices need to
> be grouped together and which have no business being grouped together.
> 
> Let's think through this further and let me make a couple assumptions
> to get started:
> 
> 1) iommu groups are the way that we want to group NVIDIA vGPUs because:
>   a) The peer-to-peer resources represent an isolation gap between
>  mdev devices, iommu groups represent sets of isolated devices.
>   b) The 1:1 mapping of an iommu group to a user matches the NVIDIA
>  device model.
>   c) iommu_group_for_each_dev() gives the vendor driver the
>  functionality it needs to perform a first-open/last-close
>  device walk for configuring these p2p resources.
> 
> 2) iommu groups as used by mdev devices should contain the minimum
> number of devices in order to provide the maximum iommu context
> flexibility.
> 
> Do we agree on these?  The corollary is that NVIDIA is going to suffer
> reduced iommu granularity exactly because of the requirement to 

Re: [libvirt] [Qemu-devel] [PATCH v7 0/4] Add Mediated device support

2016-09-07 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Wednesday, September 07, 2016 1:41 AM
> 
> On Sat, 3 Sep 2016 22:04:56 +0530
> Kirti Wankhede  wrote:
> 
> > On 9/3/2016 3:18 AM, Paolo Bonzini wrote:
> > >
> > >
> > > On 02/09/2016 20:33, Kirti Wankhede wrote:
> > >>  We could even do:
> > 
> >  echo $UUID1:$GROUPA > create
> > 
> >  where $GROUPA is the group ID of a previously created mdev device into
> >  which $UUID1 is to be created and added to the same group.
> > >> 
> > >
> > > From the point of view of libvirt, I think I prefer Alex's idea.
> > > <group> could be an additional element in the nodedev-create XML:
> > >
> > > <device>
> > >   <name>my-vgpu</name>
> > >   <parent>pci__86_00_0</parent>
> > >   <capability type="mdev">
> > >     <uuid>0695d332-7831-493f-9e71-1c85c8911a08</uuid>
> > >     <group>group1</group>
> > >   </capability>
> > > </device>
> > >
> > > (should group also be a UUID?)
> > >
> >
> > No, this should be a unique number in a system, similar to iommu_group.
> 
> Sorry, just trying to catch up on this thread after a long weekend.
> 
> We're talking about iommu groups here, we're not creating any sort of
> parallel grouping specific to mdev devices.  This is why my example
> created a device and then required the user to go find the group number
> given to that device in order to create another device within the same
> group.  iommu group numbering is not within the user's control and is
> not a uuid.  libvirt can refer to the group as anything it wants in the
> xml, but the host group number is allocated by the host, not under user
> control, is not persistent.  libvirt would just be giving it a name to
> know which devices are part of the same group.  Perhaps the runtime xml
> would fill in the group number once created.
> 
> There were also a lot of unanswered questions in my proposal, it's not
> clear that there's a standard algorithm for when mdev devices need to
> be grouped together.  Should we even allow groups to span multiple host
> devices?  Should they be allowed to span devices from different
> vendors?

I think we should limit the scope of an iommu group for mdev here, so that
a group only contains mdevs belonging to the same parent device. Groups
spanning multiple host devices (regardless of whether they come from
different vendors) are formed based on physical isolation granularity;
better not to mix the two levels together. I'm not sure whether NVIDIA has
a requirement to start all vGPUs together even when they come from
different parent devices. Hopefully not...

> 
> If we imagine a scenario of a group composed of a mix of Intel and
> NVIDIA vGPUs, what happens when an Intel device is opened first?  The
> NVIDIA driver wouldn't know about this, but it would know when the
> first NVIDIA device is opened and be able to establish p2p for the
> NVIDIA devices at that point.  Can we do what we need with that model?
> What if libvirt is asked to hot-add an NVIDIA vGPU?  It would need to
> do a create on the NVIDIA parent device with the existing group id, at
> which point the NVIDIA vendor driver could fail the device create if
> the p2p setup has already been done.  The Intel vendor driver might
> allow it.  Similar to open, the last close of the mdev device for a
> given vendor (which might not be the last close of mdev devices within
> the group) would need to trigger the offline process for that vendor.

I assume the iommu group represents the minimal isolation granularity. At
a higher level we have the VFIO container, which could deliver both Intel
vGPUs and NVIDIA vGPUs to the same VM. Intel vGPUs would each have their
own iommu group, while NVIDIA vGPUs of the same parent device may share
one group.
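
To illustrate the hierarchy described here (device counts and group numbers
are hypothetical):

  VM
  └── VFIO container
      ├── iommu group 10: Intel vGPU #1 (its own group)
      ├── iommu group 11: Intel vGPU #2 (its own group)
      └── iommu group 12: NVIDIA vGPU #1 + #2 (same parent device)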

Thanks
Kevin



Re: [libvirt] [PATCH v7 0/4] Add Mediated device support

2016-08-31 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Wednesday, August 31, 2016 11:49 PM
> 
> > >
> > > IGD doesn't have such a peer-to-peer resource setup requirement, so
> > > it's sufficient to create/destroy an mdev instance in a single action on
> > > IGD. However, I'd expect we still keep the "start/stop" interface
> > > (maybe not exposed as a sysfs node, instead being a VFIO API), as
> > > required to support future live migration usage. We've got a prototype
> > > working for KVMGT today.
> 
> Great!
> 

By the way, here is a link to a KVMGT live migration demo:

https://www.youtube.com/watch?v=y2SkU5JODIY

Thanks
Kevin



Re: [libvirt] [PATCH v7 0/4] Add Mediated device support

2016-08-31 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Wednesday, August 31, 2016 11:49 PM
> 
> On Wed, 31 Aug 2016 15:04:13 +0800
> Jike Song <jike.s...@intel.com> wrote:
> 
> > On 08/31/2016 02:12 PM, Tian, Kevin wrote:
> > >> From: Alex Williamson [mailto:alex.william...@redhat.com]
> > >> Sent: Wednesday, August 31, 2016 12:17 AM
> > >>
> > >> Hi folks,
> > >>
> > >> At KVM Forum we had a BoF session primarily around the mediated device
> > >> sysfs interface.  I'd like to share what I think we agreed on and the
> > >> "problem areas" that still need some work so we can get the thoughts
> > >> and ideas from those who weren't able to attend.
> > >>
> > >> DanPB expressed some concern about the mdev_supported_types sysfs
> > >> interface, which exposes a flat csv file with fields like "type",
> > >> "number of instance", "vendor string", and then a bunch of type
> > >> specific fields like "framebuffer size", "resolution", "frame rate
> > >> limit", etc.  This is not entirely machine parsing friendly and sort of
> > >> abuses the sysfs concept of one value per file.  Example output taken
> > >> from Neo's libvirt RFC:
> > >>
> > >> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
> > >> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
> > >> 11  ,"GRID M60-0B",  16,   2,  45, 512M,2560x1600
> > >> 12  ,"GRID M60-0Q",  16,   2,  60, 512M,2560x1600
> > >> 13  ,"GRID M60-1B",   8,   2,  45,1024M,2560x1600
> > >> 14  ,"GRID M60-1Q",   8,   2,  60,1024M,2560x1600
> > >> 15  ,"GRID M60-2B",   4,   2,  45,2048M,2560x1600
> > >> 16  ,"GRID M60-2Q",   4,   4,  60,2048M,2560x1600
> > >> 17  ,"GRID M60-4Q",   2,   4,  60,4096M,3840x2160
> > >> 18  ,"GRID M60-8Q",   1,   4,  60,8192M,3840x2160
> > >>
> > >> The create/destroy then looks like this:
> > >>
> > >> echo "$mdev_UUID:vendor_specific_argument_list" >
> > >>  /sys/bus/pci/devices/.../mdev_create
> > >>
> > >> echo "$mdev_UUID:vendor_specific_argument_list" >
> > >>  /sys/bus/pci/devices/.../mdev_destroy
> > >>
> > >> "vendor_specific_argument_list" is nebulous.
> > >>
> > >> So the idea to fix this is to explode this into a directory structure,
> > >> something like:
> > >>
> > >> ├── mdev_destroy
> > >> └── mdev_supported_types
> > >> ├── 11
> > >> │   ├── create
> > >> │   ├── description
> > >> │   └── max_instances
> > >> ├── 12
> > >> │   ├── create
> > >> │   ├── description
> > >> │   └── max_instances
> > >> └── 13
> > >> ├── create
> > >> ├── description
> > >> └── max_instances
> > >>
> > >> Note that I'm only exposing the minimal attributes here for simplicity,
> > >> the other attributes would be included in separate files and we would
> > >> require vendors to create standard attributes for common device classes.
> > >
> > > I like this idea. All standard attributes are reflected in this
> > > hierarchy. In the meantime, can we still allow an optional vendor string
> > > in the create interface? libvirt doesn't need to know its meaning, but it
> > > would allow an upper layer to do some vendor-specific tweaking if
> > > necessary.
> > >
> >
> > Not sure whether this can be done within the MDEV framework (attrs
> > provided by the vendor driver of course), or whether it must be within
> > the vendor driver.
> 
> The purpose of the sub-directories is that libvirt doesn't need to pass
> arbitrary, vendor strings to the create function, the attributes of the
> mdev device created are defined by the attributes in the sysfs
> directory where the create is done.  The user only provides a uuid for
> the device.  Arbitrary vendor parameters are a barrier, libvirt may not
> need to know the meaning, but would need to know when and how to apply
> them.

Re: [libvirt] [PATCH v7 0/4] Add Mediated device support

2016-08-31 Thread Tian, Kevin
> From: Alex Williamson [mailto:alex.william...@redhat.com]
> Sent: Wednesday, August 31, 2016 12:17 AM
> 
> Hi folks,
> 
> At KVM Forum we had a BoF session primarily around the mediated device
> sysfs interface.  I'd like to share what I think we agreed on and the
> "problem areas" that still need some work so we can get the thoughts
> and ideas from those who weren't able to attend.
> 
> DanPB expressed some concern about the mdev_supported_types sysfs
> interface, which exposes a flat csv file with fields like "type",
> "number of instance", "vendor string", and then a bunch of type
> specific fields like "framebuffer size", "resolution", "frame rate
> limit", etc.  This is not entirely machine parsing friendly and sort of
> abuses the sysfs concept of one value per file.  Example output taken
> from Neo's libvirt RFC:
> 
> cat /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types
> # vgpu_type_id, vgpu_type, max_instance, num_heads, frl_config, framebuffer, max_resolution
> 11  ,"GRID M60-0B",  16,   2,  45, 512M,2560x1600
> 12  ,"GRID M60-0Q",  16,   2,  60, 512M,2560x1600
> 13  ,"GRID M60-1B",   8,   2,  45,1024M,2560x1600
> 14  ,"GRID M60-1Q",   8,   2,  60,1024M,2560x1600
> 15  ,"GRID M60-2B",   4,   2,  45,2048M,2560x1600
> 16  ,"GRID M60-2Q",   4,   4,  60,2048M,2560x1600
> 17  ,"GRID M60-4Q",   2,   4,  60,4096M,3840x2160
> 18  ,"GRID M60-8Q",   1,   4,  60,8192M,3840x2160
> 
> The create/destroy then looks like this:
> 
> echo "$mdev_UUID:vendor_specific_argument_list" >
>   /sys/bus/pci/devices/.../mdev_create
> 
> echo "$mdev_UUID:vendor_specific_argument_list" >
>   /sys/bus/pci/devices/.../mdev_destroy
> 
> "vendor_specific_argument_list" is nebulous.
> 
> So the idea to fix this is to explode this into a directory structure,
> something like:
> 
> ├── mdev_destroy
> └── mdev_supported_types
> ├── 11
> │   ├── create
> │   ├── description
> │   └── max_instances
> ├── 12
> │   ├── create
> │   ├── description
> │   └── max_instances
> └── 13
> ├── create
> ├── description
> └── max_instances
> 
> Note that I'm only exposing the minimal attributes here for simplicity,
> the other attributes would be included in separate files and we would
> require vendors to create standard attributes for common device classes.

I like this idea. All standard attributes are reflected in this hierarchy.
In the meantime, can we still allow an optional vendor string in the create
interface? libvirt doesn't need to know its meaning, but it would allow an
upper layer to do some vendor-specific tweaking if necessary.
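
One hypothetical shape such an optional vendor string could take (the
separator and option name are invented purely for illustration):

  # libvirt passes the vendor portion through opaquely; only the vendor
  # driver parses what follows the comma:
  echo "$UUID,vendor_opts=frl=45" > .../mdev_supported_types/11/create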

> 
> For vGPUs like NVIDIA where we don't support multiple types
> concurrently, this directory structure would update as mdev devices are
> created, removing no longer available types.  I carried forward

Or keep the type listed, with max_instances cleared to zero.

> max_instances here, but perhaps we really want to copy SR-IOV and
> report a max and current allocation.  Creation and deletion is

Right, cur/max_instances look reasonable.

> simplified as we can simply "echo $UUID > create" per type.  I don't
> understand why destroy had a parameter list, so here I imagine we can
> simply do the same... in fact, I'd actually rather see a "remove" sysfs
> entry under each mdev device, so we remove it at the device rather than
> in some central location (any objections?).

OK by me.
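
A minimal sketch of the create/remove flow just agreed on (paths are
illustrative; the per-type directories follow the proposal above):

  UUID=$(uuidgen)
  # Create under the chosen type directory; only a uuid is needed:
  echo $UUID > /sys/bus/pci/devices/0000:86:00.0/mdev_supported_types/11/create
  # Remove at the device itself rather than at a central location:
  echo 1 > /sys/bus/mdev/devices/$UUID/remove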

> 
> We discussed how this might look with Intel devices which do allow
> mixed vGPU types concurrently.  We believe, but need confirmation, that
> the vendor driver could still make a finite set of supported types,
> perhaps with additional module options to the vendor driver to enable
> more "exotic" types.  So for instance if IGD vGPUs are based on
> power-of-2 portions of the framebuffer size, then the vendor driver
> could list types with 32MB, 64MB, 128MB, etc in useful and popular
> sizes.  As vGPUs are allocated, the larger sizes may become unavailable.

Yes, Intel can provide such a type definition. One thing I'm not sure
about is the impact across listed types, i.e. when creating a new instance
under a given type, max_instances under the other types would be
dynamically decremented based on available resources. Would that be a
problem for libvirt or an upper-level stack, since the natural
interpretation of max_instances is a static number?
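
For instance (paths and numbers purely illustrative), libvirt could observe
the cross-type coupling like this:

  for t in /sys/bus/pci/devices/0000:00:02.0/mdev_supported_types/*; do
    echo "$t: $(cat $t/max_instances)"
  done
  # After creating one 128MB instance, re-running the loop might show
  # max_instances under the 256MB/512MB types silently decremented.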

An alternative is to make max_instances configurable, so libvirt has a
chance to define a pool of available instances with different types before
creating any instance. For example, initially the IGD driver may report
max_instances only for the minimal sharing granularity:
128MB:
max_instances (8)
256MB:
max_instances (0)
512MB:
max_instances (0)

Then libvirt can configure more types
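
A hypothetical shape for that configuration step (the writable
max_instances interface is invented purely for illustration):

  # Rebalance the pool before any instance exists; the driver would
  # reject writes that exceed the remaining physical resources:
  echo 4 > .../mdev_supported_types/256MB/max_instances
  echo 2 > .../mdev_supported_types/512MB/max_instances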