Re: kvm PCI assignment & VFIO ramblings

2011-08-23 Thread aafabbri



On 8/23/11 4:04 AM, "Joerg Roedel"  wrote:

> On Mon, Aug 22, 2011 at 08:52:18PM -0400, aafabbri wrote:
>> You have to enforce group/iommu domain assignment whether you have the
>> existing uiommu API, or if you change it to your proposed
>> ioctl(inherit_iommu) API.
>> 
>> The only change needed to VFIO here should be to make uiommu fd assignment
>> happen on the groups instead of on device fds.  That operation fails or
>> succeeds according to the group semantics (all-or-none assignment/same
>> uiommu).
> 
> That makes uiommu basically the same as the meta-groups, right?

Yes, functionality seems the same, thus my suggestion to keep uiommu
explicit.  Is there some need for group-groups besides defining sets of
groups which share IOMMU resources?

I do all this stuff (bringing up sets of devices which may share IOMMU
domain) dynamically from C applications.  I don't really want some static
(boot-time or sysfs fiddling) supergroup config unless there is a good
reason KVM/power needs it.

As you say in your next email, doing it all from ioctls is very easy,
programmatically.
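
For illustration, the kind of flow I mean, sketched in C.  The ioctl name
below is purely hypothetical (a group-level uiommu assignment doesn't exist
today); the point is just that it's a handful of calls from a C app:

  #include <fcntl.h>
  #include <sys/ioctl.h>

  /* Hypothetical request code for the proposed group-level assignment. */
  #define VFIO_GROUP_SET_UIOMMU  0x3b01

  static int share_domain(const char *grp1_path, const char *grp2_path)
  {
          int uiommu_fd = open("/dev/uiommu", O_RDWR);  /* one IOMMU domain */
          int grp1_fd = open(grp1_path, O_RDWR);
          int grp2_fd = open(grp2_path, O_RDWR);

          /* Each assignment succeeds or fails for the whole group
           * (all-or-none, same-uiommu semantics). */
          if (ioctl(grp1_fd, VFIO_GROUP_SET_UIOMMU, uiommu_fd) < 0 ||
              ioctl(grp2_fd, VFIO_GROUP_SET_UIOMMU, uiommu_fd) < 0)
                  return -1;

          return uiommu_fd;     /* both groups now share this domain */
  }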

-Aaron Fabbri



Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread aafabbri



On 8/22/11 2:49 PM, "Benjamin Herrenschmidt" wrote:

> 
>>> I wouldn't use uiommu for that.
>> 
>> Any particular reason besides saving a file descriptor?
>> 
>> We use it today, and it seems like a cleaner API than what you propose
>> changing it to.
> 
> Well for one, we are back to square one vs. grouping constraints.

I'm not following you.

You have to enforce group/iommu domain assignment whether you have the
existing uiommu API, or if you change it to your proposed
ioctl(inherit_iommu) API.

The only change needed to VFIO here should be to make uiommu fd assignment
happen on the groups instead of on device fds.  That operation fails or
succeeds according to the group semantics (all-or-none assignment/same
uiommu).

I think the question is: do we force a 1:1 iommu/group mapping, or do we
allow arbitrary mappings (satisfying group constraints) as we do today?

I'm saying I'm an existing user who wants the arbitrary iommu/group mapping
ability and definitely think the uiommu approach is cleaner than the
ioctl(inherit_iommu) approach.  We considered that approach before but it
seemed less clean so we went with the explicit uiommu context.
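
For contrast, the explicit flow we use today looks roughly like this; the
assignment ioctl is shown with a stand-in name rather than the exact
spelling in the current driver:

  #include <fcntl.h>
  #include <sys/ioctl.h>

  /* Stand-in name/number for the existing per-device uiommu assignment. */
  #define VFIO_DEVICE_SET_UIOMMU  0x3b02

  static void attach_devices(void)
  {
          int uiommu_fd = open("/dev/uiommu", O_RDWR);  /* one shared domain */
          int dev1_fd = open("/dev/vfio0", O_RDWR);
          int dev2_fd = open("/dev/vfio1", O_RDWR);

          /* Both devices explicitly join the same IOMMU domain, so they
           * can share DMA mappings (e.g. common network buffers) without
           * burning extra IOMMU hw resources. */
          ioctl(dev1_fd, VFIO_DEVICE_SET_UIOMMU, uiommu_fd);
          ioctl(dev2_fd, VFIO_DEVICE_SET_UIOMMU, uiommu_fd);
  }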

>  .../...
> 
>> If we in singleton-group land were building our own "groups" which were sets
>> of devices sharing the IOMMU domains we wanted, I suppose we could do away
>> with uiommu fds, but it sounds like the current proposal would create 20
>> singleton groups (x86 iommu w/o PCI bridges => all devices are partitionable
>> endpoints).  Asking me to ioctl(inherit) them together into a blob sounds
>> worse than the current explicit uiommu API.
> 
> I'd rather have an API to create super-groups (groups of groups)
> statically and then you can use such groups as normal groups using the
> same interface. That create/management process could be done via a
> simple command line utility or via sysfs banging, whatever...






Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread aafabbri



On 8/22/11 1:49 PM, "Benjamin Herrenschmidt" wrote:

> On Mon, 2011-08-22 at 13:29 -0700, aafabbri wrote:
> 
>>> Each device fd would then support a
>>> similar set of ioctls and mapping (mmio/pio/config) interface as current
>>> vfio, except for the obvious domain and dma ioctls superseded by the
>>> group fd.
>>> 
>>> Another valid model might be that /dev/vfio/$GROUP is created for all
>>> groups when the vfio module is loaded.  The group fd would allow open()
>>> and some set of iommu querying and device enumeration ioctls, but would
>>> error on dma mapping and retrieving device fds until all of the group
>>> devices are bound to the vfio driver.
>>> 
>>> In either case, the uiommu interface is removed entirely since dma
>>> mapping is done via the group fd.
>> 
>> The loss in generality is unfortunate. I'd like to be able to support
>> arbitrary iommu domain <-> device assignment.  One way to do this would be
>> to keep uiommu, but to return an error if someone tries to assign more than
>> one uiommu context to devices in the same group.
> 
> I wouldn't use uiommu for that.

Any particular reason besides saving a file descriptor?

We use it today, and it seems like a cleaner API than what you propose
changing it to.

> If the HW or underlying kernel drivers
> support it, what I'd suggest is that you have an (optional) ioctl to
> bind two groups (you have to have both opened already) or for one group
> to "capture" another one.

You'll need other rules there too: "both opened already, but zero mappings
performed yet, since those would already have instantiated a default IOMMU
domain".

Keep in mind the only case I'm using is singleton groups, a.k.a. devices.

Since what I want is to specify which devices can do things like share
network buffers (in a way that conserves IOMMU hw resources), it seems
cleanest to expose this explicitly, versus some "inherit iommu domain from
another device" ioctl.  What happens if I do something like this:

int dev1_fd = open("/dev/vfio0", O_RDWR);
int dev2_fd = open("/dev/vfio1", O_RDWR);

/* proposed "inherit" ioctl (hypothetical name): dev2 joins dev1's domain */
ioctl(dev2_fd, VFIO_INHERIT_IOMMU, dev1_fd);

int error = close(dev1_fd);

There are other gross cases as well.

> 
> The binding means under the hood the iommus get shared, with the
> lifetime being that of the "owning" group.

So what happens in the close() above?  EBUSY?  Reset all children?  Still
seems less clean than having an explicit iommu fd.  Without some benefit I'm
not sure why we'd want to change this API.

If we in singleton-group land were building our own "groups" which were sets
of devices sharing the IOMMU domains we wanted, I suppose we could do away
with uiommu fds, but it sounds like the current proposal would create 20
singleton groups (x86 iommu w/o PCI bridges => all devices are partitionable
endpoints).  Asking me to ioctl(inherit) them together into a blob sounds
worse than the current explicit uiommu API.

Thanks,
Aaron

> 
> Another option is to make that a static configuration API via special
> ioctls (or even netlink if you really like it), to change the grouping
> on architectures that allow it.
> 
> Cheers.
> Ben.
> 
>> 
>> -Aaron
>> 
>>> As necessary in the future, we can
>>> define a more high performance dma mapping interface for streaming dma
>>> via the group fd.  I expect we'll also include architecture specific
>>> group ioctls to describe features and capabilities of the iommu.  The
>>> group fd will need to prevent concurrent open()s to maintain a 1:1 group
>>> to userspace process ownership model.
>>> 
>>> Also on the table is supporting non-PCI devices with vfio.  To do this,
>>> we need to generalize the read/write/mmap and irq eventfd interfaces.
>>> We could keep the same model of segmenting the device fd address space,
>>> perhaps adding ioctls to define the segment offset bit position or we
>>> could split each region into its own fd (VFIO_GET_PCI_BAR_FD(0),
>>> VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
>>> suffering some degree of fd bloat (group fd, device fd(s), interrupt
>>> event fd(s), per resource fd, etc).  For interrupts we can overload
>>> VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI
>>> devices support MSI?).
>>> 
>>> For qemu, these changes imply we'd only support a model where we have a
>>> 1:1 group to iommu domain.  The current vfio driver could probably
>>> become vfio-pci as we might end up with more target specific vfio
>>> drivers for non-pci.  PCI should be able to maintain a simple -device
>>> vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
>>> need to come up with extra options when we need to expose groups to
>>> guest for pvdma.
>>> 
>>> Hope that captures it, feel free to jump in with corrections and
>>> suggestions.  Thanks,
>>> 
>>> Alex
>>> 
> 
> 



Re: kvm PCI assignment & VFIO ramblings

2011-08-22 Thread aafabbri



On 8/20/11 9:51 AM, "Alex Williamson"  wrote:

> We had an extremely productive VFIO BoF on Monday.  Here's my attempt to
> capture the plan that I think we agreed to:
> 
> We need to address both the description and enforcement of device
> groups.  Groups are formed any time the iommu does not have resolution
> between a set of devices.  On x86, this typically happens when a
> PCI-to-PCI bridge exists between the set of devices and the iommu.  For
> Power, partitionable endpoints define a group.  Grouping information
> needs to be exposed for both userspace and kernel internal usage.  This
> will be a sysfs attribute setup by the iommu drivers.  Perhaps:
> 
> # cat /sys/devices/pci0000:00/0000:00:19.0/iommu_group
> 42
> 
> (I use a PCI example here, but the attribute should not be PCI specific)
> 
> From there we have a few options.  In the BoF we discussed a model where
> binding a device to vfio creates a /dev/vfio/$GROUP character device
> file.  This "group" fd provides dma mapping ioctls as well as
> ioctls to enumerate and return a "device" fd for each attached member of
> the group (similar to KVM_CREATE_VCPU).  We enforce grouping by
> returning an error on open() of the group fd if there are members of the
> group not bound to the vfio driver.

Sounds reasonable.
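
Something like the sketch below is how I read that model from userspace;
all of the names and numbers are placeholders for the proposal (the group
number and BDF just echo your sysfs example), not an existing interface:

  #include <fcntl.h>
  #include <stddef.h>
  #include <sys/ioctl.h>

  /* Placeholder request codes for the proposed group-fd model. */
  #define VFIO_GROUP_GET_DEVICE_FD  0x3b10
  #define VFIO_GROUP_DMA_MAP        0x3b11

  struct dma_map_args { unsigned long long vaddr, iova, size; };

  static int map_buffer(void *buf, size_t len)
  {
          struct dma_map_args map;
          int grp_fd, dev_fd;

          /* open() fails unless every member of group 42 is bound to vfio */
          grp_fd = open("/dev/vfio/42", O_RDWR);

          /* fetch a member device fd, KVM_CREATE_VCPU-style; dev_fd would
           * then drive the mmio/pio/config access for that device */
          dev_fd = ioctl(grp_fd, VFIO_GROUP_GET_DEVICE_FD, "0000:00:19.0");

          /* dma mapping goes through the group fd; no separate uiommu fd */
          map.vaddr = (unsigned long long)(unsigned long)buf;
          map.iova = 0;
          map.size = len;
          return ioctl(grp_fd, VFIO_GROUP_DMA_MAP, &map);
  }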

> Each device fd would then support a
> similar set of ioctls and mapping (mmio/pio/config) interface as current
> vfio, except for the obvious domain and dma ioctls superseded by the
> group fd.
> 
> Another valid model might be that /dev/vfio/$GROUP is created for all
> groups when the vfio module is loaded.  The group fd would allow open()
> and some set of iommu querying and device enumeration ioctls, but would
> error on dma mapping and retrieving device fds until all of the group
> devices are bound to the vfio driver.
> 
> In either case, the uiommu interface is removed entirely since dma
> mapping is done via the group fd.

The loss in generality is unfortunate. I'd like to be able to support
arbitrary iommu domain <-> device assignment.  One way to do this would be
to keep uiommu, but to return an error if someone tries to assign more than
one uiommu context to devices in the same group.
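
To be concrete about the check I have in mind, a rough kernel-side sketch
(all structure and field names are invented here purely for illustration):

  #include <errno.h>

  struct uiommu_ctx;
  struct vfio_group  { struct uiommu_ctx *uiommu; };
  struct vfio_device { struct vfio_group *group; struct uiommu_ctx *uiommu; };

  /* Reject a second uiommu context within one group. */
  static int vfio_device_set_uiommu(struct vfio_device *vdev,
                                    struct uiommu_ctx *ctx)
  {
          struct vfio_group *grp = vdev->group;

          if (grp->uiommu && grp->uiommu != ctx)
                  return -EBUSY;        /* group already bound to another */

          grp->uiommu = ctx;
          vdev->uiommu = ctx;
          return 0;
  }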


-Aaron

> As necessary in the future, we can
> define a more high performance dma mapping interface for streaming dma
> via the group fd.  I expect we'll also include architecture specific
> group ioctls to describe features and capabilities of the iommu.  The
> group fd will need to prevent concurrent open()s to maintain a 1:1 group
> to userspace process ownership model.
> 
> Also on the table is supporting non-PCI devices with vfio.  To do this,
> we need to generalize the read/write/mmap and irq eventfd interfaces.
> We could keep the same model of segmenting the device fd address space,
> perhaps adding ioctls to define the segment offset bit position or we
> could split each region into its own fd (VFIO_GET_PCI_BAR_FD(0),
> VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already
> suffering some degree of fd bloat (group fd, device fd(s), interrupt
> event fd(s), per resource fd, etc).  For interrupts we can overload
> VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI
> devices support MSI?).
> 
> For qemu, these changes imply we'd only support a model where we have a
> 1:1 group to iommu domain.  The current vfio driver could probably
> become vfio-pci as we might end up with more target specific vfio
> drivers for non-pci.  PCI should be able to maintain a simple -device
> vfio-pci,host=bb:dd.f to enable hotplug of individual devices.  We'll
> need to come up with extra options when we need to expose groups to
> guest for pvdma.
> 
> Hope that captures it, feel free to jump in with corrections and
> suggestions.  Thanks,
> 
> Alex
> 
