Re: kvm PCI assignment & VFIO ramblings
On Sun, Aug 28, 2011 at 05:04:32PM +0300, Avi Kivity wrote:
> On 08/28/2011 04:56 PM, Joerg Roedel wrote:
>> This can't be secured by a lock, because it introduces a potential
>> A->B<-->B->A lock problem when two processes try to take each other's
>> mm. It could probably be solved by a task->real_mm pointer, haven't
>> thought about this yet...
>
> Or a workqueue - you get a kernel thread context with a bit of boilerplate.

Right, a workqueue might do the trick. We'll evaluate that. Thanks for the idea :)

Joerg

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
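For illustration, a minimal sketch of the workqueue idea: the mm surgery runs from a kernel thread, where use_mm() is safe. The names vfio_mm_work, vfio_mm_fixup and vfio_schedule_mm_fixup are invented for this example, and the actual fixup step (revoking the MMIO mappings) is left as a placeholder.

#include <linux/kernel.h>
#include <linux/workqueue.h>
#include <linux/mmu_context.h>	/* use_mm()/unuse_mm() */
#include <linux/sched.h>
#include <linux/slab.h>

struct vfio_mm_work {
	struct work_struct work;
	struct mm_struct *mm;	/* target process's mm, reference held */
};

static void vfio_mm_fixup(struct work_struct *work)
{
	struct vfio_mm_work *vw = container_of(work, struct vfio_mm_work, work);

	use_mm(vw->mm);		/* safe here: we run in kernel-thread context */
	/* ... zap or replace the MMIO mappings in vw->mm ... */
	unuse_mm(vw->mm);

	mmput(vw->mm);
	kfree(vw);
}

static int vfio_schedule_mm_fixup(struct mm_struct *mm)
{
	struct vfio_mm_work *vw = kzalloc(sizeof(*vw), GFP_KERNEL);

	if (!vw)
		return -ENOMEM;

	atomic_inc(&mm->mm_users);	/* reference dropped in vfio_mm_fixup() */
	vw->mm = mm;
	INIT_WORK(&vw->work, vfio_mm_fixup);
	schedule_work(&vw->work);
	return 0;
}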
Re: kvm PCI assignment & VFIO ramblings
On Fri, Aug 26, 2011 at 12:04:22PM -0600, Alex Williamson wrote: > On Thu, 2011-08-25 at 20:05 +0200, Joerg Roedel wrote: > > If we really expect segment numbers that need the full 16 bit then this > > would be the way to go. Otherwise I would prefer returning the group-id > > directly and partition the group-id space for the error values (s32 with > > negative numbers being errors). > > It's unlikely to have segments using the top bit, but it would be broken > for an iommu driver to define it's group numbers using pci s:b:d.f if we > don't have that bit available. Ben/David, do PEs have an identifier of > a convenient size? I'd guess any hardware based identifier is going to > use a full unsigned bit width. Okay, if we want to go the secure way I am fine with the "int *group" parameter. Another option is to just return u64 and use the extended number space for errors. But that is even worse as an interface, I think. Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Fri, Aug 26, 2011 at 01:17:05PM -0700, Aaron Fabbri wrote:
[snip]
> Yes. In essence, I'd rather not have to run any other admin processes.
> Doing things programmatically, on the fly, from each process, is the
> cleanest model right now.

The "persistent group" model doesn't necessarily prevent that. There's no reason your program can't use the administrative interface as well as the "use" interface, and I don't see that making the admin interface separate and persistent makes this any harder.

--
David Gibson                   | I'll have my music baroque, and my code
david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_
                               | _way_ _around_! http://www.ozlabs.org/~dgibson
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On 08/28/2011 04:56 PM, Joerg Roedel wrote:
> On Sun, Aug 28, 2011 at 04:14:00PM +0300, Avi Kivity wrote:
>> On 08/26/2011 12:24 PM, Roedel, Joerg wrote:
>>> The biggest problem with this approach is that it has to happen in the
>>> context of the given process. Linux can't really modify an mm which
>>> belongs to another context in a safe way.
>>
>> Is use_mm() insufficient?
>
> Yes, it introduces a set of race conditions when a process that already
> has an mm wants to take over another process's mm temporarily (and when
> use_mm is modified to actually provide this functionality). It is only
> safe when used from kernel-thread context. One example:
>
>     Process A        Process B                 Process C
>         .                .                         .
>         .       <-- takes A->mm and                .
>         .           assigns it as B->mm            .
>         .                .               --> wants to take B->mm,
>         .                .                   but gets A->mm now

Good catch.

> This can't be secured by a lock, because it introduces a potential
> A->B<-->B->A lock problem when two processes try to take each other's
> mm. It could probably be solved by a task->real_mm pointer, haven't
> thought about this yet...

Or a workqueue - you get a kernel thread context with a bit of boilerplate.

--
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Sun, Aug 28, 2011 at 04:14:00PM +0300, Avi Kivity wrote:
> On 08/26/2011 12:24 PM, Roedel, Joerg wrote:
>> The biggest problem with this approach is that it has to happen in the
>> context of the given process. Linux can't really modify an mm which
>> belongs to another context in a safe way.
>
> Is use_mm() insufficient?

Yes, it introduces a set of race conditions when a process that already has an mm wants to take over another process's mm temporarily (and when use_mm is modified to actually provide this functionality). It is only safe when used from kernel-thread context. One example:

    Process A        Process B                 Process C
        .                .                         .
        .       <-- takes A->mm and                .
        .           assigns it as B->mm            .
        .                .               --> wants to take B->mm,
        .                .                   but gets A->mm now

This can't be secured by a lock, because it introduces a potential A->B<-->B->A lock problem when two processes try to take each other's mm. It could probably be solved by a task->real_mm pointer, haven't thought about this yet...

Joerg

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On 08/26/2011 12:24 PM, Roedel, Joerg wrote:
> > As I see it there are two options: (a) make subsequent accesses from
> > userspace or the guest result in either a SIGBUS that userspace must
> > either deal with or die, or (b) replace the mapping with a dummy RO
> > mapping containing 0xff, with any trapped writes emulated as nops.
>
> The biggest problem with this approach is that it has to happen in the
> context of the given process. Linux can't really modify an mm which
> belongs to another context in a safe way.

Is use_mm() insufficient?

--
error compiling committee.c: too many arguments to function
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
* Aaron Fabbri (aafab...@cisco.com) wrote: > On 8/26/11 12:35 PM, "Chris Wright" wrote: > > * Aaron Fabbri (aafab...@cisco.com) wrote: > >> Each process will open vfio devices on the fly, and they need to be able to > >> share IOMMU resources. > > > > How do you share IOMMU resources w/ multiple processes, are the processes > > sharing memory? > > Sorry, bad wording. I share IOMMU domains *within* each process. Ah, got it. Thanks. > E.g. If one process has 3 devices and another has 10, I can get by with two > iommu domains (and can share buffers among devices within each process). > > If I ever need to share devices across processes, the shared memory case > might be interesting. > > > > >> So I need the ability to dynamically bring up devices and assign them to a > >> group. The number of actual devices and how they map to iommu domains is > >> not known ahead of time. We have a single piece of silicon that can expose > >> hundreds of pci devices. > > > > This does not seem fundamentally different from the KVM use case. > > > > We have 2 kinds of groupings. > > > > 1) low-level system or topoolgy grouping > > > >Some may have multiple devices in a single group > > > >* the PCIe-PCI bridge example > >* the POWER partitionable endpoint > > > >Many will not > > > >* singleton group, e.g. typical x86 PCIe function (majority of > > assigned devices) > > > >Not sure it makes sense to have these administratively defined as > >opposed to system defined. > > > > 2) logical grouping > > > >* multiple low-level groups (singleton or otherwise) attached to same > > process, allowing things like single set of io page tables where > > applicable. > > > >These are nominally adminstratively defined. In the KVM case, there > >is likely a privileged task (i.e. libvirtd) involved w/ making the > >device available to the guest and can do things like group merging. > >In your userspace case, perhaps it should be directly exposed. > > Yes. In essence, I'd rather not have to run any other admin processes. > Doing things programmatically, on the fly, from each process, is the > cleanest model right now. I don't see an issue w/ this. As long it can not add devices to the system defined groups, it's not a privileged operation. So we still need the iommu domain concept exposed in some form to logically put groups into a single iommu domain (if desired). In fact, I believe Alex covered this in his most recent recap: ...The group fd will provide interfaces for enumerating the devices in the group, returning a file descriptor for each device in the group (the "device fd"), binding groups together, and returning a file descriptor for iommu operations (the "iommu fd"). thanks, -chris -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
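As a rough illustration of the per-process flow Alex's recap describes (open groups, merge them, get an iommu fd and device fds), here is a hypothetical userspace sketch. The /dev/vfio paths and the VFIO_GROUP_* ioctl names and numbers are placeholders invented for this example, not a proposed ABI.

#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Placeholder ioctl definitions, for illustration only. */
#define VFIO_GROUP_MERGE          _IOW(';', 100, int)
#define VFIO_GROUP_GET_IOMMU_FD   _IO(';', 101)
#define VFIO_GROUP_GET_DEVICE_FD  _IOW(';', 102, char *)

int main(void)
{
	/* Two singleton groups this (non-root) process has been given access to. */
	int grp_a = open("/dev/vfio/8", O_RDWR);
	int grp_b = open("/dev/vfio/16", O_RDWR);

	/* Bind the groups together so they share one iommu domain / io page table. */
	ioctl(grp_a, VFIO_GROUP_MERGE, grp_b);

	/* One iommu fd now covers DMA mappings for devices in both groups. */
	int iommu = ioctl(grp_a, VFIO_GROUP_GET_IOMMU_FD);

	/* Per-device fd for config space, BARs and interrupts. */
	int dev = ioctl(grp_a, VFIO_GROUP_GET_DEVICE_FD, "0000:00:01.0");

	/* ... establish DMA mappings via 'iommu', program the device via 'dev' ... */

	close(dev);
	close(iommu);
	close(grp_b);
	close(grp_a);
	return 0;
}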
Re: kvm PCI assignment & VFIO ramblings
On 8/26/11 12:35 PM, "Chris Wright" wrote: > * Aaron Fabbri (aafab...@cisco.com) wrote: >> On 8/26/11 7:07 AM, "Alexander Graf" wrote: >>> Forget the KVM case for a moment and think of a user space device driver. I >>> as >>> a user am not root. But I as a user when having access to /dev/vfioX want to >>> be able to access the device and manage it - and only it. The admin of that >>> box needs to set it up properly for me to be able to access it. >>> >>> So having two steps is really the correct way to go: >>> >>> * create VFIO group >>> * use VFIO group >>> >>> because the two are done by completely different users. >> >> This is not the case for my userspace drivers using VFIO today. >> >> Each process will open vfio devices on the fly, and they need to be able to >> share IOMMU resources. > > How do you share IOMMU resources w/ multiple processes, are the processes > sharing memory? Sorry, bad wording. I share IOMMU domains *within* each process. E.g. If one process has 3 devices and another has 10, I can get by with two iommu domains (and can share buffers among devices within each process). If I ever need to share devices across processes, the shared memory case might be interesting. > >> So I need the ability to dynamically bring up devices and assign them to a >> group. The number of actual devices and how they map to iommu domains is >> not known ahead of time. We have a single piece of silicon that can expose >> hundreds of pci devices. > > This does not seem fundamentally different from the KVM use case. > > We have 2 kinds of groupings. > > 1) low-level system or topoolgy grouping > >Some may have multiple devices in a single group > >* the PCIe-PCI bridge example >* the POWER partitionable endpoint > >Many will not > >* singleton group, e.g. typical x86 PCIe function (majority of > assigned devices) > >Not sure it makes sense to have these administratively defined as >opposed to system defined. > > 2) logical grouping > >* multiple low-level groups (singleton or otherwise) attached to same > process, allowing things like single set of io page tables where > applicable. > >These are nominally adminstratively defined. In the KVM case, there >is likely a privileged task (i.e. libvirtd) involved w/ making the >device available to the guest and can do things like group merging. >In your userspace case, perhaps it should be directly exposed. Yes. In essence, I'd rather not have to run any other admin processes. Doing things programmatically, on the fly, from each process, is the cleanest model right now. > >> In my case, the only administrative task would be to give my processes/users >> access to the vfio groups (which are initially singletons), and the >> application actually opens them and needs the ability to merge groups >> together to conserve IOMMU resources (assuming we're not going to expose >> uiommu). > > I agree, we definitely need to expose _some_ way to do this. > > thanks, > -chris -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
* Aaron Fabbri (aafab...@cisco.com) wrote: > On 8/26/11 7:07 AM, "Alexander Graf" wrote: > > Forget the KVM case for a moment and think of a user space device driver. I > > as > > a user am not root. But I as a user when having access to /dev/vfioX want to > > be able to access the device and manage it - and only it. The admin of that > > box needs to set it up properly for me to be able to access it. > > > > So having two steps is really the correct way to go: > > > > * create VFIO group > > * use VFIO group > > > > because the two are done by completely different users. > > This is not the case for my userspace drivers using VFIO today. > > Each process will open vfio devices on the fly, and they need to be able to > share IOMMU resources. How do you share IOMMU resources w/ multiple processes, are the processes sharing memory? > So I need the ability to dynamically bring up devices and assign them to a > group. The number of actual devices and how they map to iommu domains is > not known ahead of time. We have a single piece of silicon that can expose > hundreds of pci devices. This does not seem fundamentally different from the KVM use case. We have 2 kinds of groupings. 1) low-level system or topoolgy grouping Some may have multiple devices in a single group * the PCIe-PCI bridge example * the POWER partitionable endpoint Many will not * singleton group, e.g. typical x86 PCIe function (majority of assigned devices) Not sure it makes sense to have these administratively defined as opposed to system defined. 2) logical grouping * multiple low-level groups (singleton or otherwise) attached to same process, allowing things like single set of io page tables where applicable. These are nominally adminstratively defined. In the KVM case, there is likely a privileged task (i.e. libvirtd) involved w/ making the device available to the guest and can do things like group merging. In your userspace case, perhaps it should be directly exposed. > In my case, the only administrative task would be to give my processes/users > access to the vfio groups (which are initially singletons), and the > application actually opens them and needs the ability to merge groups > together to conserve IOMMU resources (assuming we're not going to expose > uiommu). I agree, we definitely need to expose _some_ way to do this. thanks, -chris -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Thu, 2011-08-25 at 20:05 +0200, Joerg Roedel wrote:
> On Thu, Aug 25, 2011 at 11:20:30AM -0600, Alex Williamson wrote:
> > On Thu, 2011-08-25 at 12:54 +0200, Roedel, Joerg wrote:
> > > We need to solve this differently. ARM is starting to use the iommu-api
> > > too and this definitely does not work there. One possible solution might
> > > be to make the iommu-ops per-bus.
> >
> > That sounds good. Is anyone working on it? It seems like it doesn't
> > hurt to use this in the interim, we may just be watching the wrong bus
> > and never add any sysfs group info.
>
> I'll cook something up for RFC over the weekend.
>
> > > Also the return type should not be long but something that fits into
> > > 32bit on all platforms. Since you use -ENODEV, probably s32 is a good
> > > choice.
> >
> > The convenience of using seg|bus|dev|fn was too much to resist, too bad
> > it requires a full 32bits. Maybe I'll change it to:
> > int iommu_device_group(struct device *dev, unsigned int *group)
>
> If we really expect segment numbers that need the full 16 bit then this
> would be the way to go. Otherwise I would prefer returning the group-id
> directly and partition the group-id space for the error values (s32 with
> negative numbers being errors).

It's unlikely to have segments using the top bit, but it would be broken for an iommu driver to define its group numbers using pci s:b:d.f if we don't have that bit available. Ben/David, do PEs have an identifier of a convenient size? I'd guess any hardware based identifier is going to use a full unsigned bit width. Thanks,

Alex

--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
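For illustration, a sketch of the out-parameter variant being discussed: errors come back as a negative int, and the group id keeps its full 32 bits. The device_group callback name is an assumption, and the driver-side example leaves out the upstream-bridge walk and multifunction handling shown in the full patch quoted elsewhere in this thread.

/* In the generic iommu layer: */
int iommu_device_group(struct device *dev, unsigned int *groupid)
{
	if (iommu_ops->device_group)
		return iommu_ops->device_group(dev, groupid);

	return -ENODEV;
}

/* A driver callback then fills in the id, e.g. built from seg|bus|devfn
 * (bridge/quirk handling omitted here): */
static int intel_iommu_device_group(struct device *dev, unsigned int *groupid)
{
	struct pci_dev *pdev = to_pci_dev(dev);

	if (iommu_no_mapping(dev))
		return -ENODEV;

	*groupid = (pci_domain_nr(pdev->bus) << 16) |
		   (pdev->bus->number << 8) | pdev->devfn;
	return 0;
}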
Re: kvm PCI assignment & VFIO ramblings
On 8/26/11 7:07 AM, "Alexander Graf" wrote: > > > Forget the KVM case for a moment and think of a user space device driver. I as > a user am not root. But I as a user when having access to /dev/vfioX want to > be able to access the device and manage it - and only it. The admin of that > box needs to set it up properly for me to be able to access it. > > So having two steps is really the correct way to go: > > * create VFIO group > * use VFIO group > > because the two are done by completely different users. This is not the case for my userspace drivers using VFIO today. Each process will open vfio devices on the fly, and they need to be able to share IOMMU resources. So I need the ability to dynamically bring up devices and assign them to a group. The number of actual devices and how they map to iommu domains is not known ahead of time. We have a single piece of silicon that can expose hundreds of pci devices. In my case, the only administrative task would be to give my processes/users access to the vfio groups (which are initially singletons), and the application actually opens them and needs the ability to merge groups together to conserve IOMMU resources (assuming we're not going to expose uiommu). -Aaron -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On 26.08.2011, at 10:24, Joerg Roedel wrote: > On Fri, Aug 26, 2011 at 09:07:35AM -0500, Alexander Graf wrote: >> On 26.08.2011, at 04:33, Roedel, Joerg wrote: >>> >>> The reason is that you mean the usability for the programmer and I mean >>> it for the actual user of qemu :) >> >> No, we mean the actual user of qemu. The reason being that making a >> device available for any user space application is an administrative >> task. >> >> Forget the KVM case for a moment and think of a user space device >> driver. I as a user am not root. But I as a user when having access to >> /dev/vfioX want to be able to access the device and manage it - and >> only it. The admin of that box needs to set it up properly for me to >> be able to access it. > > Right, and that task is being performed by attaching the device(s) in > question to the vfio driver. The rights-management happens on the > /dev/vfio/$group file. Yup :) > >> So having two steps is really the correct way to go: >> >> * create VFIO group >> * use VFIO group >> >> because the two are done by completely different users. It's similar >> to how tun/tap works in Linux too. Of course nothing keeps you from >> also creating a group on the fly, but it shouldn't be the only >> interface available. The persistent setup is definitely more useful. > > I see the use-case. But to make it as easy as possible for the end-user > we can do both. > > So the user of (qemu again) does this: > > # vfio-ctl attach 00:01.0 > vfio-ctl: attached to group 8 > # vfio-ctl attach 00:02.0 > vfio-ctl: attached to group 16 > $ qemu -device vfio-pci,host=00:01.0 -device vfio,host=00:01.0 ... > > which should cover the usecase you prefer. Qemu still creates the > meta-group that allow the devices to share the same page-table. But what > should also be possible is: > > # qemu -device vfio-pci,host=00:01.0 -device vfio-pci,host=00:02.0 > > In that case qemu detects that the devices are not yet bound to vfio and > will do so and also unbinds them afterwards (essentially the developer > use-case). I agree. The same it works with tun today. You can either have qemu spawn a tun device dynamically or have a preallocated one you use. If you run qemu as a user (which I always do), I preallocate a tun device and attach qemu to it. > Your interface which requires pre-binding of devices into one group by > the administrator only makes sense if you want to force userspace to > use certain devices (which do not belong to the same hw-group) only > together. But I don't see a usecase for defining such constraints (yet). Agreed. As long as the kernel backend can always figure out the hw-groups, we're good :) Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Fri, Aug 26, 2011 at 09:07:35AM -0500, Alexander Graf wrote: > On 26.08.2011, at 04:33, Roedel, Joerg wrote: > > > > The reason is that you mean the usability for the programmer and I mean > > it for the actual user of qemu :) > > No, we mean the actual user of qemu. The reason being that making a > device available for any user space application is an administrative > task. > > Forget the KVM case for a moment and think of a user space device > driver. I as a user am not root. But I as a user when having access to > /dev/vfioX want to be able to access the device and manage it - and > only it. The admin of that box needs to set it up properly for me to > be able to access it. Right, and that task is being performed by attaching the device(s) in question to the vfio driver. The rights-management happens on the /dev/vfio/$group file. > So having two steps is really the correct way to go: > > * create VFIO group > * use VFIO group > > because the two are done by completely different users. It's similar > to how tun/tap works in Linux too. Of course nothing keeps you from > also creating a group on the fly, but it shouldn't be the only > interface available. The persistent setup is definitely more useful. I see the use-case. But to make it as easy as possible for the end-user we can do both. So the user of (qemu again) does this: # vfio-ctl attach 00:01.0 vfio-ctl: attached to group 8 # vfio-ctl attach 00:02.0 vfio-ctl: attached to group 16 $ qemu -device vfio-pci,host=00:01.0 -device vfio,host=00:01.0 ... which should cover the usecase you prefer. Qemu still creates the meta-group that allow the devices to share the same page-table. But what should also be possible is: # qemu -device vfio-pci,host=00:01.0 -device vfio-pci,host=00:02.0 In that case qemu detects that the devices are not yet bound to vfio and will do so and also unbinds them afterwards (essentially the developer use-case). Your interface which requires pre-binding of devices into one group by the administrator only makes sense if you want to force userspace to use certain devices (which do not belong to the same hw-group) only together. But I don't see a usecase for defining such constraints (yet). Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
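A rough sketch of the "qemu binds the device itself" path described above, using the standard sysfs driver bind/unbind files. Error handling and the new_id/driver-matching step are omitted, and "vfio-pci" as the driver name is an assumption.

#include <limits.h>
#include <stdio.h>

static int write_str(const char *path, const char *val)
{
	FILE *f = fopen(path, "w");

	if (!f)
		return -1;
	fputs(val, f);
	return fclose(f);
}

/* bdf is e.g. "0000:00:01.0" */
static int bind_to_vfio(const char *bdf)
{
	char path[PATH_MAX];

	/* Detach from whatever host driver currently owns the device... */
	snprintf(path, sizeof(path), "/sys/bus/pci/devices/%s/driver/unbind", bdf);
	write_str(path, bdf);	/* may legitimately fail if no driver is bound */

	/* ...and hand it to the vfio driver (assumes the driver already
	 * matches the device id, e.g. via new_id). */
	return write_str("/sys/bus/pci/drivers/vfio-pci/bind", bdf);
}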
Re: kvm PCI assignment & VFIO ramblings
On 26.08.2011, at 04:33, Roedel, Joerg wrote: > On Fri, Aug 26, 2011 at 12:20:00AM -0400, David Gibson wrote: >> On Wed, Aug 24, 2011 at 01:03:32PM +0200, Roedel, Joerg wrote: >>> On Wed, Aug 24, 2011 at 05:33:00AM -0400, David Gibson wrote: On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote: >>> > I don't see a reason to make this meta-grouping static. It would harm > flexibility on x86. I think it makes things easier on power but there > are options on that platform to get the dynamic solution too. I think several people are misreading what Ben means by "static". I would prefer to say 'persistent', in that the meta-groups lifetime is not tied to an fd, but they can be freely created, altered and removed during runtime. >>> >>> Even if it can be altered at runtime, from a usability perspective it is >>> certainly the best to handle these groups directly in qemu. Or are there >>> strong reasons to do it somewhere else? >> >> Funny, Ben and I think usability demands it be the other way around. > > The reason is that you mean the usability for the programmer and I mean > it for the actual user of qemu :) No, we mean the actual user of qemu. The reason being that making a device available for any user space application is an administrative task. Forget the KVM case for a moment and think of a user space device driver. I as a user am not root. But I as a user when having access to /dev/vfioX want to be able to access the device and manage it - and only it. The admin of that box needs to set it up properly for me to be able to access it. So having two steps is really the correct way to go: * create VFIO group * use VFIO group because the two are done by completely different users. It's similar to how tun/tap works in Linux too. Of course nothing keeps you from also creating a group on the fly, but it shouldn't be the only interface available. The persistent setup is definitely more useful. > >> If the meta-groups are transient - that is lifetime tied to an fd - >> then any program that wants to use meta-groups *must* know the >> interfaces for creating one, whatever they are. >> >> But if they're persistent, the admin can use other tools to create the >> meta-group then just hand it to a program to use, since the interfaces >> for _using_ a meta-group are identical to those for an atomic group. >> >> This doesn't preclude a program from being meta-group aware, and >> creating its own if it wants to, of course. My guess is that qemu >> would not want to build its own meta-groups, but libvirt probably >> would. > > Doing it in libvirt makes it really hard for a plain user of qemu to > assign more than one device to a guest. What I want it that a user just > types > > qemu -device vfio,host=00:01.0 -device vfio,host=00:02.0 ... > > and it just works. Qemu creates the meta-groups and they are > automatically destroyed when qemu exits. That the programs are not aware > of meta-groups is not a big problem because all software using vfio > needs still to be written :) > > Btw, with this concept the programmer can still decide to not use > meta-groups and just multiplex the mappings to all open device-fds it > uses. What I want to see is: # vfio-create 00:01.0 /dev/vfio0 # vftio-create -a /dev/vfio0 00:02.0 /dev/vfio0 $ qemu -vfio dev=/dev/vfio0,id=vfio0 -device vfio,vfio=vfio0.0 -device vfio,vfio=vfio0.1 Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Fri, Aug 26, 2011 at 12:20:00AM -0400, David Gibson wrote: > On Wed, Aug 24, 2011 at 01:03:32PM +0200, Roedel, Joerg wrote: > > On Wed, Aug 24, 2011 at 05:33:00AM -0400, David Gibson wrote: > > > On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote: > > > > > > I don't see a reason to make this meta-grouping static. It would harm > > > > flexibility on x86. I think it makes things easier on power but there > > > > are options on that platform to get the dynamic solution too. > > > > > > I think several people are misreading what Ben means by "static". I > > > would prefer to say 'persistent', in that the meta-groups lifetime is > > > not tied to an fd, but they can be freely created, altered and removed > > > during runtime. > > > > Even if it can be altered at runtime, from a usability perspective it is > > certainly the best to handle these groups directly in qemu. Or are there > > strong reasons to do it somewhere else? > > Funny, Ben and I think usability demands it be the other way around. The reason is that you mean the usability for the programmer and I mean it for the actual user of qemu :) > If the meta-groups are transient - that is lifetime tied to an fd - > then any program that wants to use meta-groups *must* know the > interfaces for creating one, whatever they are. > > But if they're persistent, the admin can use other tools to create the > meta-group then just hand it to a program to use, since the interfaces > for _using_ a meta-group are identical to those for an atomic group. > > This doesn't preclude a program from being meta-group aware, and > creating its own if it wants to, of course. My guess is that qemu > would not want to build its own meta-groups, but libvirt probably > would. Doing it in libvirt makes it really hard for a plain user of qemu to assign more than one device to a guest. What I want it that a user just types qemu -device vfio,host=00:01.0 -device vfio,host=00:02.0 ... and it just works. Qemu creates the meta-groups and they are automatically destroyed when qemu exits. That the programs are not aware of meta-groups is not a big problem because all software using vfio needs still to be written :) Btw, with this concept the programmer can still decide to not use meta-groups and just multiplex the mappings to all open device-fds it uses. Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Fri, Aug 26, 2011 at 12:24:23AM -0400, David Gibson wrote:
> On Thu, Aug 25, 2011 at 08:25:45AM -0500, Alexander Graf wrote:
> > On 25.08.2011, at 07:31, Roedel, Joerg wrote:
> > > For mmio we could stop the guest and replace the mmio region with a
> > > region that is filled with 0xff, no?
> >
> > Sure, but that happens in user space. The question is how does
> > kernel space enforce an MMIO region to not be mapped after the
> > hotplug event occurred? Keep in mind that user space is pretty much
> > untrusted here - it doesn't have to be QEMU. It could just as well
> > be a generic user space driver. And that can just ignore hotplug
> > events.
>
> We're saying you hard yank the mapping from the userspace process.
> That is, you invalidate all its PTEs mapping the MMIO space, and don't
> let it fault them back in.
>
> As I see it there are two options: (a) make subsequent accesses from
> userspace or the guest result in either a SIGBUS that userspace must
> either deal with or die, or (b) replace the mapping with a dummy RO
> mapping containing 0xff, with any trapped writes emulated as nops.

The biggest problem with this approach is that it has to happen in the context of the given process. Linux can't really modify an mm which belongs to another context in a safe way.

The more I think about this, the more I come to the conclusion that it would be best to just kill the process accessing the device if it is manually de-assigned from vfio. It should be a non-standard path anyway, so it doesn't make a lot of sense to implement complicated handling semantics for it, no?

Joerg

--
AMD Operating System Research Center
Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach
General Managers: Alberto Bozzo, Andrew Bowd
Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632
--
To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Thu, Aug 25, 2011 at 08:25:45AM -0500, Alexander Graf wrote: > > On 25.08.2011, at 07:31, Roedel, Joerg wrote: > > > On Wed, Aug 24, 2011 at 11:07:46AM -0400, Alex Williamson wrote: > >> On Wed, 2011-08-24 at 10:52 +0200, Roedel, Joerg wrote: > > > > [...] > > >> We need to try the polite method of attempting to hot unplug the device > >> from qemu first, which the current vfio code already implements. We can > >> then escalate if it doesn't respond. The current code calls abort in > >> qemu if the guest doesn't respond, but I agree we should also be > >> enforcing this at the kernel interface. I think the problem with the > >> hard-unplug is that we don't have a good revoke mechanism for the mmio > >> mmaps. > > > > For mmio we could stop the guest and replace the mmio region with a > > region that is filled with 0xff, no? > > Sure, but that happens in user space. The question is how does > kernel space enforce an MMIO region to not be mapped after the > hotplug event occured? Keep in mind that user space is pretty much > untrusted here - it doesn't have to be QEMU. It could just as well > be a generic user space driver. And that can just ignore hotplug > events. We're saying you hard yank the mapping from the userspace process. That is, you invalidate all its PTEs mapping the MMIO space, and don't let it fault them back in. As I see it there are two options: (a) make subsequent accesses from userspace or the guest result in either a SIGBUS that userspace must either deal with or die, or (b) replace the mapping with a dummy RO mapping containing 0xff, with any trapped writes emulated as nops. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
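A minimal sketch of option (a): on revoke, yank the PTEs covering the userspace BAR mapping and make later faults fail with SIGBUS. The vfio_pci_device structure, its revoked flag and the single tracked vma are made-up names; a real version would also need the right mmap_sem/mm locking, which is exactly the "whose mm context" problem discussed elsewhere in this thread.

#include <linux/mm.h>

struct vfio_pci_device {
	struct vm_area_struct *bar_vma;	/* remembered at mmap() time */
	unsigned long bar_pfn;		/* first pfn of the mapped BAR */
	bool revoked;
};

static int vfio_pci_bar_fault(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct vfio_pci_device *vdev = vma->vm_private_data;

	if (vdev->revoked)
		return VM_FAULT_SIGBUS;	/* option (a): access after revoke dies */

	/* normal path: map the faulting page of the BAR */
	vm_insert_pfn(vma, (unsigned long)vmf->virtual_address,
		      vdev->bar_pfn + vmf->pgoff);
	return VM_FAULT_NOPAGE;
}

static void vfio_pci_revoke_bar_mappings(struct vfio_pci_device *vdev)
{
	struct vm_area_struct *vma = vdev->bar_vma;

	/* Hard-yank the PTEs; further faults hit the check above. */
	vdev->revoked = true;
	zap_vma_ptes(vma, vma->vm_start, vma->vm_end - vma->vm_start);
}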
Re: kvm PCI assignment & VFIO ramblings
On Wed, Aug 24, 2011 at 01:03:32PM +0200, Roedel, Joerg wrote: > On Wed, Aug 24, 2011 at 05:33:00AM -0400, David Gibson wrote: > > On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote: > > > > I don't see a reason to make this meta-grouping static. It would harm > > > flexibility on x86. I think it makes things easier on power but there > > > are options on that platform to get the dynamic solution too. > > > > I think several people are misreading what Ben means by "static". I > > would prefer to say 'persistent', in that the meta-groups lifetime is > > not tied to an fd, but they can be freely created, altered and removed > > during runtime. > > Even if it can be altered at runtime, from a usability perspective it is > certainly the best to handle these groups directly in qemu. Or are there > strong reasons to do it somewhere else? Funny, Ben and I think usability demands it be the other way around. If the meta-groups are transient - that is lifetime tied to an fd - then any program that wants to use meta-groups *must* know the interfaces for creating one, whatever they are. But if they're persistent, the admin can use other tools to create the meta-group then just hand it to a program to use, since the interfaces for _using_ a meta-group are identical to those for an atomic group. This doesn't preclude a program from being meta-group aware, and creating its own if it wants to, of course. My guess is that qemu would not want to build its own meta-groups, but libvirt probably would. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Thu, Aug 25, 2011 at 11:20:30AM -0600, Alex Williamson wrote: > On Thu, 2011-08-25 at 12:54 +0200, Roedel, Joerg wrote: > > We need to solve this differently. ARM is starting to use the iommu-api > > too and this definitly does not work there. One possible solution might > > be to make the iommu-ops per-bus. > > That sounds good. Is anyone working on it? It seems like it doesn't > hurt to use this in the interim, we may just be watching the wrong bus > and never add any sysfs group info. I'll cook something up for RFC over the weekend. > > Also the return type should not be long but something that fits into > > 32bit on all platforms. Since you use -ENODEV, probably s32 is a good > > choice. > > The convenience of using seg|bus|dev|fn was too much to resist, too bad > it requires a full 32bits. Maybe I'll change it to: > int iommu_device_group(struct device *dev, unsigned int *group) If we really expect segment numbers that need the full 16 bit then this would be the way to go. Otherwise I would prefer returning the group-id directly and partition the group-id space for the error values (s32 with negative numbers being errors). > > > @@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str) > > > printk(KERN_INFO > > > "Intel-IOMMU: disable supported super page\n"); > > > intel_iommu_superpage = 0; > > > + } else if (!strncmp(str, "no_mf_groups", 12)) { > > > + printk(KERN_INFO > > > + "Intel-IOMMU: disable separate groups for > > > multifunction devices\n"); > > > + intel_iommu_no_mf_groups = 1; > > > > This should really be a global iommu option and not be VT-d specific. > > You think? It's meaningless on benh's power systems. But it is not meaningless on AMD-Vi systems :) There should be one option for both. On the other hand this requires an iommu= parameter on ia64, but thats probably not that bad. > > This looks like code duplication in the VT-d driver. It doesn't need to > > be generalized now, but we should keep in mind to do a more general > > solution later. > > Maybe it is beneficial if the IOMMU drivers only setup the number in > > dev->arch.iommu.groupid and the iommu-api fetches it from there then. > > But as I said, this is some more work and does not need to be done for > > this patch(-set). > > The iommu-api reaches into dev->arch.iommu.groupid? I figured we should > at least start out with a lightweight, optional interface without the > overhead of predefining groupids setup by bus notification callbacks in > each iommu driver. Thanks, As I said, this is just an idea for an later optimization. It is fine for now as it is in this patch. Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Thu, 2011-08-25 at 12:54 +0200, Roedel, Joerg wrote: > Hi Alex, > > On Wed, Aug 24, 2011 at 05:13:49PM -0400, Alex Williamson wrote: > > Is this roughly what you're thinking of for the iommu_group component? > > Adding a dev_to_group iommu ops callback let's us consolidate the sysfs > > support in the iommu base. Would AMD-Vi do something similar (or > > exactly the same) for group #s? Thanks, > > The concept looks good, I have some comments, though. On AMD-Vi the > implementation would look a bit different because there is a > data-structure were the information can be gathered from, so no need for > PCI bus scanning there. > > > diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c > > index 6e6b6a1..6b54c1a 100644 > > --- a/drivers/base/iommu.c > > +++ b/drivers/base/iommu.c > > @@ -17,20 +17,56 @@ > > */ > > > > #include > > +#include > > #include > > #include > > #include > > #include > > #include > > +#include > > > > static struct iommu_ops *iommu_ops; > > > > +static ssize_t show_iommu_group(struct device *dev, > > + struct device_attribute *attr, char *buf) > > +{ > > + return sprintf(buf, "%lx", iommu_dev_to_group(dev)); > > Probably add a 0x prefix so userspace knows the format? I think I'll probably change it to %u. Seems common to have decimal in sysfs and doesn't get confusing if we cat it with a string. As a bonus, it abstracts that vt-d is just stuffing a PCI device address in there, which nobody should ever rely on. > > +} > > +static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL); > > + > > +static int add_iommu_group(struct device *dev, void *unused) > > +{ > > + if (iommu_dev_to_group(dev) >= 0) > > + return device_create_file(dev, &dev_attr_iommu_group); > > + > > + return 0; > > +} > > + > > +static int device_notifier(struct notifier_block *nb, > > + unsigned long action, void *data) > > +{ > > + struct device *dev = data; > > + > > + if (action == BUS_NOTIFY_ADD_DEVICE) > > + return add_iommu_group(dev, NULL); > > + > > + return 0; > > +} > > + > > +static struct notifier_block device_nb = { > > + .notifier_call = device_notifier, > > +}; > > + > > void register_iommu(struct iommu_ops *ops) > > { > > if (iommu_ops) > > BUG(); > > > > iommu_ops = ops; > > + > > + /* FIXME - non-PCI, really want for_each_bus() */ > > + bus_register_notifier(&pci_bus_type, &device_nb); > > + bus_for_each_dev(&pci_bus_type, NULL, NULL, add_iommu_group); > > } > > We need to solve this differently. ARM is starting to use the iommu-api > too and this definitly does not work there. One possible solution might > be to make the iommu-ops per-bus. That sounds good. Is anyone working on it? It seems like it doesn't hurt to use this in the interim, we may just be watching the wrong bus and never add any sysfs group info. > > bool iommu_found(void) > > @@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain, > > } > > EXPORT_SYMBOL_GPL(iommu_domain_has_cap); > > > > +long iommu_dev_to_group(struct device *dev) > > +{ > > + if (iommu_ops->dev_to_group) > > + return iommu_ops->dev_to_group(dev); > > + return -ENODEV; > > +} > > +EXPORT_SYMBOL_GPL(iommu_dev_to_group); > > Please rename this to iommu_device_group(). The dev_to_group name > suggests a conversion but it is actually just a property of the device. Ok. > Also the return type should not be long but something that fits into > 32bit on all platforms. Since you use -ENODEV, probably s32 is a good > choice. The convenience of using seg|bus|dev|fn was too much to resist, too bad it requires a full 32bits. 
Maybe I'll change it to: int iommu_device_group(struct device *dev, unsigned int *group) > > + > > int iommu_map(struct iommu_domain *domain, unsigned long iova, > > phys_addr_t paddr, int gfp_order, int prot) > > { > > diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c > > index f02c34d..477259c 100644 > > --- a/drivers/pci/intel-iommu.c > > +++ b/drivers/pci/intel-iommu.c > > @@ -404,6 +404,7 @@ static int dmar_map_gfx = 1; > > static int dmar_forcedac; > > static int intel_iommu_strict; > > static int intel_iommu_superpage = 1; > > +static int intel_iommu_no_mf_groups; > > > > #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1)) > > static DEFINE_SPINLOCK(device_domain_lock); > > @@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str) > > printk(KERN_INFO > > "Intel-IOMMU: disable supported super page\n"); > > intel_iommu_superpage = 0; > > + } else if (!strncmp(str, "no_mf_groups", 12)) { > > + printk(KERN_INFO > > + "Intel-IOMMU: disable separate groups for > > multifunction devices\n"); > > + intel_iommu_no_mf_groups = 1; > > This should r
Re: kvm PCI assignment & VFIO ramblings
On Thu, Aug 25, 2011 at 11:38:09AM -0400, Don Dutile wrote: > On 08/25/2011 06:54 AM, Roedel, Joerg wrote: > > We need to solve this differently. ARM is starting to use the iommu-api > > too and this definitly does not work there. One possible solution might > > be to make the iommu-ops per-bus. > > > When you think of a system where there isn't just one bus-type > with iommu support, it makes more sense. > Additionally, it also allows the long-term architecture to use different types > of IOMMUs on each bus segment -- think per-PCIe-switch/bridge IOMMUs -- > esp. 'tuned' IOMMUs -- ones better geared for networks, ones better geared > for direct-attach disk hba's. Not sure how likely it is to have different types of IOMMUs within a given bus-type. But if they become reality we can multiplex in the iommu-api without much hassle :) For now, something like bus_set_iommu() or bus_register_iommu() would provide a nice way to do bus-specific setups for a given iommu implementation. Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
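For illustration, a sketch of what the per-bus registration Joerg mentions might look like. bus_set_iommu() does not exist at this point; the signature, the iommu_ops pointer in struct bus_type, and the example callers are all assumptions. device_nb and add_iommu_group refer to the notifier and helper from the patch quoted in this thread.

/* Hypothetical: struct bus_type grows an iommu_ops pointer. */
int bus_set_iommu(struct bus_type *bus, struct iommu_ops *ops)
{
	if (bus->iommu_ops != NULL)
		return -EBUSY;		/* one iommu implementation per bus */

	bus->iommu_ops = ops;

	/* Per-bus setup replaces the global register_iommu() plus the
	 * PCI-only notifier: watch this bus for new devices and create
	 * the per-device group attributes as they appear. */
	bus_register_notifier(bus, &device_nb);
	return bus_for_each_dev(bus, NULL, NULL, add_iommu_group);
}

/* Illustrative callers from driver init paths:
 *	bus_set_iommu(&pci_bus_type, &intel_iommu_ops);
 *	bus_set_iommu(&platform_bus_type, &arm_iommu_ops);
 */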
Re: kvm PCI assignment & VFIO ramblings
On 08/25/2011 06:54 AM, Roedel, Joerg wrote: Hi Alex, On Wed, Aug 24, 2011 at 05:13:49PM -0400, Alex Williamson wrote: Is this roughly what you're thinking of for the iommu_group component? Adding a dev_to_group iommu ops callback let's us consolidate the sysfs support in the iommu base. Would AMD-Vi do something similar (or exactly the same) for group #s? Thanks, The concept looks good, I have some comments, though. On AMD-Vi the implementation would look a bit different because there is a data-structure were the information can be gathered from, so no need for PCI bus scanning there. diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c index 6e6b6a1..6b54c1a 100644 --- a/drivers/base/iommu.c +++ b/drivers/base/iommu.c @@ -17,20 +17,56 @@ */ #include +#include #include #include #include #include #include +#include static struct iommu_ops *iommu_ops; +static ssize_t show_iommu_group(struct device *dev, + struct device_attribute *attr, char *buf) +{ + return sprintf(buf, "%lx", iommu_dev_to_group(dev)); Probably add a 0x prefix so userspace knows the format? +} +static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL); + +static int add_iommu_group(struct device *dev, void *unused) +{ + if (iommu_dev_to_group(dev)>= 0) + return device_create_file(dev,&dev_attr_iommu_group); + + return 0; +} + +static int device_notifier(struct notifier_block *nb, + unsigned long action, void *data) +{ + struct device *dev = data; + + if (action == BUS_NOTIFY_ADD_DEVICE) + return add_iommu_group(dev, NULL); + + return 0; +} + +static struct notifier_block device_nb = { + .notifier_call = device_notifier, +}; + void register_iommu(struct iommu_ops *ops) { if (iommu_ops) BUG(); iommu_ops = ops; + + /* FIXME - non-PCI, really want for_each_bus() */ + bus_register_notifier(&pci_bus_type,&device_nb); + bus_for_each_dev(&pci_bus_type, NULL, NULL, add_iommu_group); } We need to solve this differently. ARM is starting to use the iommu-api too and this definitly does not work there. One possible solution might be to make the iommu-ops per-bus. When you think of a system where there isn't just one bus-type with iommu support, it makes more sense. Additionally, it also allows the long-term architecture to use different types of IOMMUs on each bus segment -- think per-PCIe-switch/bridge IOMMUs -- esp. 'tuned' IOMMUs -- ones better geared for networks, ones better geared for direct-attach disk hba's. bool iommu_found(void) @@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain, } EXPORT_SYMBOL_GPL(iommu_domain_has_cap); +long iommu_dev_to_group(struct device *dev) +{ + if (iommu_ops->dev_to_group) + return iommu_ops->dev_to_group(dev); + return -ENODEV; +} +EXPORT_SYMBOL_GPL(iommu_dev_to_group); Please rename this to iommu_device_group(). The dev_to_group name suggests a conversion but it is actually just a property of the device. Also the return type should not be long but something that fits into 32bit on all platforms. Since you use -ENODEV, probably s32 is a good choice. 
+ int iommu_map(struct iommu_domain *domain, unsigned long iova, phys_addr_t paddr, int gfp_order, int prot) { diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c index f02c34d..477259c 100644 --- a/drivers/pci/intel-iommu.c +++ b/drivers/pci/intel-iommu.c @@ -404,6 +404,7 @@ static int dmar_map_gfx = 1; static int dmar_forcedac; static int intel_iommu_strict; static int intel_iommu_superpage = 1; +static int intel_iommu_no_mf_groups; #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1)) static DEFINE_SPINLOCK(device_domain_lock); @@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str) printk(KERN_INFO "Intel-IOMMU: disable supported super page\n"); intel_iommu_superpage = 0; + } else if (!strncmp(str, "no_mf_groups", 12)) { + printk(KERN_INFO + "Intel-IOMMU: disable separate groups for multifunction devices\n"); + intel_iommu_no_mf_groups = 1; This should really be a global iommu option and not be VT-d specific. str += strcspn(str, ","); @@ -3902,6 +3907,52 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain, return 0; } +/* Group numbers are arbitrary. Device with the same group number + * indicate the iommu cannot differentiate between them. To avoid + * tracking used groups we just use the seg|bus|devfn of the lowest + * level we're able to differentiate devices */ +static long intel_iommu_dev_to_group(struct device *dev) +{ + struct pci_dev *pdev = to_pci_dev(dev); + struct pci_d
Re: kvm PCI assignment & VFIO ramblings
On Wed, Aug 24, 2011 at 11:07:46AM -0400, Alex Williamson wrote: > On Wed, 2011-08-24 at 10:52 +0200, Roedel, Joerg wrote: > > On Tue, Aug 23, 2011 at 01:08:29PM -0400, Alex Williamson wrote: > > > On Tue, 2011-08-23 at 15:14 +0200, Roedel, Joerg wrote: > > > > > > Handling it through fds is a good idea. This makes sure that everything > > > > belongs to one process. I am not really sure yet if we go the way to > > > > just bind plain groups together or if we create meta-groups. The > > > > meta-groups thing seems somewhat cleaner, though. > > > > > > I'm leaning towards binding because we need to make it dynamic, but I > > > don't really have a good picture of the lifecycle of a meta-group. > > > > In my view the life-cycle of the meta-group is a subrange of the > > qemu-instance's life-cycle. > > I guess I mean the lifecycle of a super-group that's actually exposed as > a new group in sysfs. Who creates it? How? How are groups dynamically > added and removed from the super-group? The group merging makes sense > to me because it's largely just an optimization that qemu will try to > merge groups. If it works, great. If not, it manages them separately. > When all the devices from a group are unplugged, unmerge the group if > necessary. Right. The super-group thing is an optimization. > We need to try the polite method of attempting to hot unplug the device > from qemu first, which the current vfio code already implements. We can > then escalate if it doesn't respond. The current code calls abort in > qemu if the guest doesn't respond, but I agree we should also be > enforcing this at the kernel interface. I think the problem with the > hard-unplug is that we don't have a good revoke mechanism for the mmio > mmaps. For mmio we could stop the guest and replace the mmio region with a region that is filled with 0xff, no? Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
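A sketch of the 0xff idea (option (b) from earlier in the thread): back the revoked region with a single read-only dummy page filled with 0xff, so reads see all-ones and trapped writes can be dropped. The names and the RO-vma assumption are invented for this example.

#include <linux/mm.h>
#include <linux/gfp.h>
#include <linux/string.h>

static struct page *vfio_dummy_page;	/* one shared page of 0xff */

static int vfio_alloc_dummy_page(void)
{
	vfio_dummy_page = alloc_page(GFP_KERNEL);
	if (!vfio_dummy_page)
		return -ENOMEM;

	memset(page_address(vfio_dummy_page), 0xff, PAGE_SIZE);
	return 0;
}

/* Fault handler installed once the device has been yanked: every page of
 * the old BAR mapping resolves to the same 0xff page. */
static int vfio_pci_fault_dummy(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	get_page(vfio_dummy_page);
	vmf->page = vfio_dummy_page;
	return 0;	/* core mm maps vmf->page per vma->vm_page_prot (RO here) */
}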
Re: kvm PCI assignment & VFIO ramblings
On Wed, Aug 24, 2011 at 10:56:13AM -0400, Alex Williamson wrote: > On Wed, 2011-08-24 at 10:43 +0200, Joerg Roedel wrote: > > A side-note: Might it be better to expose assigned devices in a guest on > > a seperate bus? This will make it easier to emulate an IOMMU for the > > guest inside qemu. > > I think we want that option, sure. A lot of guests aren't going to > support hotplugging buses though, so I think our default, map the entire > guest model should still be using bus 0. The ACPI gets a lot more > complicated for that model too; dynamic SSDTs? Thanks, Ok, if only AMD-Vi should be emulated then it is not strictly necessary. For this IOMMU we can specify that devices on the same bus belong to different IOMMUs. So we can implement an IOMMU that handles internal qemu-devices and one that handles pass-through devices. Not sure if this is possible with VT-d too. Okay VT-d emulation would also require that the devices emulation of a PCIe bridge, no? Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
Hi Alex, On Wed, Aug 24, 2011 at 05:13:49PM -0400, Alex Williamson wrote: > Is this roughly what you're thinking of for the iommu_group component? > Adding a dev_to_group iommu ops callback let's us consolidate the sysfs > support in the iommu base. Would AMD-Vi do something similar (or > exactly the same) for group #s? Thanks, The concept looks good, I have some comments, though. On AMD-Vi the implementation would look a bit different because there is a data-structure were the information can be gathered from, so no need for PCI bus scanning there. > diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c > index 6e6b6a1..6b54c1a 100644 > --- a/drivers/base/iommu.c > +++ b/drivers/base/iommu.c > @@ -17,20 +17,56 @@ > */ > > #include > +#include > #include > #include > #include > #include > #include > +#include > > static struct iommu_ops *iommu_ops; > > +static ssize_t show_iommu_group(struct device *dev, > + struct device_attribute *attr, char *buf) > +{ > + return sprintf(buf, "%lx", iommu_dev_to_group(dev)); Probably add a 0x prefix so userspace knows the format? > +} > +static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL); > + > +static int add_iommu_group(struct device *dev, void *unused) > +{ > + if (iommu_dev_to_group(dev) >= 0) > + return device_create_file(dev, &dev_attr_iommu_group); > + > + return 0; > +} > + > +static int device_notifier(struct notifier_block *nb, > +unsigned long action, void *data) > +{ > + struct device *dev = data; > + > + if (action == BUS_NOTIFY_ADD_DEVICE) > + return add_iommu_group(dev, NULL); > + > + return 0; > +} > + > +static struct notifier_block device_nb = { > + .notifier_call = device_notifier, > +}; > + > void register_iommu(struct iommu_ops *ops) > { > if (iommu_ops) > BUG(); > > iommu_ops = ops; > + > + /* FIXME - non-PCI, really want for_each_bus() */ > + bus_register_notifier(&pci_bus_type, &device_nb); > + bus_for_each_dev(&pci_bus_type, NULL, NULL, add_iommu_group); > } We need to solve this differently. ARM is starting to use the iommu-api too and this definitly does not work there. One possible solution might be to make the iommu-ops per-bus. > bool iommu_found(void) > @@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain, > } > EXPORT_SYMBOL_GPL(iommu_domain_has_cap); > > +long iommu_dev_to_group(struct device *dev) > +{ > + if (iommu_ops->dev_to_group) > + return iommu_ops->dev_to_group(dev); > + return -ENODEV; > +} > +EXPORT_SYMBOL_GPL(iommu_dev_to_group); Please rename this to iommu_device_group(). The dev_to_group name suggests a conversion but it is actually just a property of the device. Also the return type should not be long but something that fits into 32bit on all platforms. Since you use -ENODEV, probably s32 is a good choice. 
> + > int iommu_map(struct iommu_domain *domain, unsigned long iova, > phys_addr_t paddr, int gfp_order, int prot) > { > diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c > index f02c34d..477259c 100644 > --- a/drivers/pci/intel-iommu.c > +++ b/drivers/pci/intel-iommu.c > @@ -404,6 +404,7 @@ static int dmar_map_gfx = 1; > static int dmar_forcedac; > static int intel_iommu_strict; > static int intel_iommu_superpage = 1; > +static int intel_iommu_no_mf_groups; > > #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1)) > static DEFINE_SPINLOCK(device_domain_lock); > @@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str) > printk(KERN_INFO > "Intel-IOMMU: disable supported super page\n"); > intel_iommu_superpage = 0; > + } else if (!strncmp(str, "no_mf_groups", 12)) { > + printk(KERN_INFO > + "Intel-IOMMU: disable separate groups for > multifunction devices\n"); > + intel_iommu_no_mf_groups = 1; This should really be a global iommu option and not be VT-d specific. > > str += strcspn(str, ","); > @@ -3902,6 +3907,52 @@ static int intel_iommu_domain_has_cap(struct > iommu_domain *domain, > return 0; > } > > +/* Group numbers are arbitrary. Device with the same group number > + * indicate the iommu cannot differentiate between them. To avoid > + * tracking used groups we just use the seg|bus|devfn of the lowest > + * level we're able to differentiate devices */ > +static long intel_iommu_dev_to_group(struct device *dev) > +{ > + struct pci_dev *pdev = to_pci_dev(dev); > + struct pci_dev *bridge; > + union { > + struct { > + u8 devfn; > + u8 bus; > + u16 segment; > + } pci; > + u32 group; > + } id; > + > + if (iommu_no_mapping(dev))
Re: kvm PCI assignment & VFIO ramblings
Joerg, Is this roughly what you're thinking of for the iommu_group component? Adding a dev_to_group iommu ops callback let's us consolidate the sysfs support in the iommu base. Would AMD-Vi do something similar (or exactly the same) for group #s? Thanks, Alex Signed-off-by: Alex Williamson diff --git a/drivers/base/iommu.c b/drivers/base/iommu.c index 6e6b6a1..6b54c1a 100644 --- a/drivers/base/iommu.c +++ b/drivers/base/iommu.c @@ -17,20 +17,56 @@ */ #include +#include #include #include #include #include #include +#include static struct iommu_ops *iommu_ops; +static ssize_t show_iommu_group(struct device *dev, + struct device_attribute *attr, char *buf) +{ + return sprintf(buf, "%lx", iommu_dev_to_group(dev)); +} +static DEVICE_ATTR(iommu_group, S_IRUGO, show_iommu_group, NULL); + +static int add_iommu_group(struct device *dev, void *unused) +{ + if (iommu_dev_to_group(dev) >= 0) + return device_create_file(dev, &dev_attr_iommu_group); + + return 0; +} + +static int device_notifier(struct notifier_block *nb, + unsigned long action, void *data) +{ + struct device *dev = data; + + if (action == BUS_NOTIFY_ADD_DEVICE) + return add_iommu_group(dev, NULL); + + return 0; +} + +static struct notifier_block device_nb = { + .notifier_call = device_notifier, +}; + void register_iommu(struct iommu_ops *ops) { if (iommu_ops) BUG(); iommu_ops = ops; + + /* FIXME - non-PCI, really want for_each_bus() */ + bus_register_notifier(&pci_bus_type, &device_nb); + bus_for_each_dev(&pci_bus_type, NULL, NULL, add_iommu_group); } bool iommu_found(void) @@ -94,6 +130,14 @@ int iommu_domain_has_cap(struct iommu_domain *domain, } EXPORT_SYMBOL_GPL(iommu_domain_has_cap); +long iommu_dev_to_group(struct device *dev) +{ + if (iommu_ops->dev_to_group) + return iommu_ops->dev_to_group(dev); + return -ENODEV; +} +EXPORT_SYMBOL_GPL(iommu_dev_to_group); + int iommu_map(struct iommu_domain *domain, unsigned long iova, phys_addr_t paddr, int gfp_order, int prot) { diff --git a/drivers/pci/intel-iommu.c b/drivers/pci/intel-iommu.c index f02c34d..477259c 100644 --- a/drivers/pci/intel-iommu.c +++ b/drivers/pci/intel-iommu.c @@ -404,6 +404,7 @@ static int dmar_map_gfx = 1; static int dmar_forcedac; static int intel_iommu_strict; static int intel_iommu_superpage = 1; +static int intel_iommu_no_mf_groups; #define DUMMY_DEVICE_DOMAIN_INFO ((struct device_domain_info *)(-1)) static DEFINE_SPINLOCK(device_domain_lock); @@ -438,6 +439,10 @@ static int __init intel_iommu_setup(char *str) printk(KERN_INFO "Intel-IOMMU: disable supported super page\n"); intel_iommu_superpage = 0; + } else if (!strncmp(str, "no_mf_groups", 12)) { + printk(KERN_INFO + "Intel-IOMMU: disable separate groups for multifunction devices\n"); + intel_iommu_no_mf_groups = 1; } str += strcspn(str, ","); @@ -3902,6 +3907,52 @@ static int intel_iommu_domain_has_cap(struct iommu_domain *domain, return 0; } +/* Group numbers are arbitrary. Device with the same group number + * indicate the iommu cannot differentiate between them. 
To avoid + * tracking used groups we just use the seg|bus|devfn of the lowest + * level we're able to differentiate devices */ +static long intel_iommu_dev_to_group(struct device *dev) +{ + struct pci_dev *pdev = to_pci_dev(dev); + struct pci_dev *bridge; + union { + struct { + u8 devfn; + u8 bus; + u16 segment; + } pci; + u32 group; + } id; + + if (iommu_no_mapping(dev)) + return -ENODEV; + + id.pci.segment = pci_domain_nr(pdev->bus); + id.pci.bus = pdev->bus->number; + id.pci.devfn = pdev->devfn; + + if (!device_to_iommu(id.pci.segment, id.pci.bus, id.pci.devfn)) + return -ENODEV; + + bridge = pci_find_upstream_pcie_bridge(pdev); + if (bridge) { + if (pci_is_pcie(bridge)) { + id.pci.bus = bridge->subordinate->number; + id.pci.devfn = 0; + } else { + id.pci.bus = bridge->bus->number; + id.pci.devfn = bridge->devfn; + } + } + + /* Virtual functions always get their own group */ + if (!pdev->is_virtfn && intel_iommu_no_mf_groups) + id.pci.devfn = PCI_DEVFN(PCI_SLOT(id.pci.devfn), 0); + + /* FIXME - seg # >= 0x8000 on 32b */ + return id.group; +} + static struct iommu_ops intel_iommu_ops = { .domain_init= in
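As a worked example of the packing in intel_iommu_dev_to_group() and of the 32-bit concern in the FIXME, the following standalone userspace snippet mirrors the union above (assuming a little-endian layout) and shows how a segment number of 0x8000 or higher goes negative when the group id is squeezed through a signed 32-bit value. The program is purely illustrative.

#include <stdio.h>
#include <stdint.h>

union group_id {
	struct {
		uint8_t  devfn;
		uint8_t  bus;
		uint16_t segment;
	} pci;
	uint32_t group;
};

int main(void)
{
	/* Hypothetical device 8000:19:00.0 */
	union group_id id = {
		.pci = { .devfn = 0x00, .bus = 0x19, .segment = 0x8000 },
	};

	/* On little-endian this packs as segment<<16 | bus<<8 | devfn. */
	printf("group = 0x%08x\n", id.group);

	/* Interpreted as a signed 32-bit value it is negative, which would
	 * collide with error codes if errors share the return value. */
	printf("as s32 = %d\n", (int32_t)id.group);
	return 0;
}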
Re: kvm PCI assignment & VFIO ramblings
On Wed, 2011-08-24 at 10:52 +0200, Roedel, Joerg wrote: > On Tue, Aug 23, 2011 at 01:08:29PM -0400, Alex Williamson wrote: > > On Tue, 2011-08-23 at 15:14 +0200, Roedel, Joerg wrote: > > > > Handling it through fds is a good idea. This makes sure that everything > > > belongs to one process. I am not really sure yet if we go the way to > > > just bind plain groups together or if we create meta-groups. The > > > meta-groups thing seems somewhat cleaner, though. > > > > I'm leaning towards binding because we need to make it dynamic, but I > > don't really have a good picture of the lifecycle of a meta-group. > > In my view the life-cycle of the meta-group is a subrange of the > qemu-instance's life-cycle. I guess I mean the lifecycle of a super-group that's actually exposed as a new group in sysfs. Who creates it? How? How are groups dynamically added and removed from the super-group? The group merging makes sense to me because it's largely just an optimization that qemu will try to merge groups. If it works, great. If not, it manages them separately. When all the devices from a group are unplugged, unmerge the group if necessary. > > > Putting the process to sleep (which would be uninterruptible) seems bad. > > > The process would sleep until the guest releases the device-group, which > > > can take days or months. > > > The best thing (and the most intrusive :-) ) is to change PCI core to > > > allow unbindings to fail, I think. But this probably further complicates > > > the way to upstream VFIO... > > > > Yes, it's not ideal but I think it's sufficient for now and if we later > > get support for returning an error from release, we can set a timeout > > after notifying the user to make use of that. Thanks, > > Ben had the idea of just forcing to hard-unplug this device from the > guest. Thats probably the best way to deal with that, I think. VFIO > sends a notification to qemu that the device is gone and qemu informs > the guest in some way about it. We need to try the polite method of attempting to hot unplug the device from qemu first, which the current vfio code already implements. We can then escalate if it doesn't respond. The current code calls abort in qemu if the guest doesn't respond, but I agree we should also be enforcing this at the kernel interface. I think the problem with the hard-unplug is that we don't have a good revoke mechanism for the mmio mmaps. Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
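To make that lifecycle concrete, a rough userspace sketch of the "merge if possible, otherwise manage separately" policy might look like the following. The VFIO_GROUP_MERGE/VFIO_GROUP_UNMERGE ioctls, their numbering, and the /dev/vfio/<group> paths are assumptions for illustration; no such ABI is defined yet.

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* Hypothetical ioctls -- not a defined ABI */
#define VFIO_GROUP_MERGE	_IOW(';', 100, int)
#define VFIO_GROUP_UNMERGE	_IOW(';', 101, int)

int main(void)
{
	/* Hypothetical group numbers for two devices being assigned */
	int a = open("/dev/vfio/26", O_RDWR);
	int b = open("/dev/vfio/42", O_RDWR);

	if (a < 0 || b < 0)
		return 1;

	/* Try to share one iommu domain; fall back to separate handling. */
	if (ioctl(a, VFIO_GROUP_MERGE, &b) < 0)
		perror("merge failed, managing group separately");

	/* ... guest runs; all devices of group b get hot-unplugged ... */

	/* Unmerge so the group can be released independently. */
	ioctl(a, VFIO_GROUP_UNMERGE, &b);
	return 0;
}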
Re: kvm PCI assignment & VFIO ramblings
On Wed, 2011-08-24 at 10:43 +0200, Joerg Roedel wrote: > On Tue, Aug 23, 2011 at 03:30:06PM -0400, Alex Williamson wrote: > > On Tue, 2011-08-23 at 07:01 +1000, Benjamin Herrenschmidt wrote: > > > > Could be tho in what form ? returning sysfs pathes ? > > > > I'm at a loss there, please suggest. I think we need an ioctl that > > returns some kind of array of devices within the group and another that > > maybe takes an index from that array and returns an fd for that device. > > A sysfs path string might be a reasonable array element, but it sounds > > like a pain to work with. > > Limiting to PCI we can just pass the BDF as the argument to optain the > device-fd. For a more generic solution we need a unique identifier in > some way which is unique across all 'struct device' instances in the > system. As far as I know we don't have that yet (besides the sysfs-path) > so we either add that or stick with bus-specific solutions. > > > > 1:1 process has the advantage of linking to an -mm which makes the whole > > > mmu notifier business doable. How do you want to track down mappings and > > > do the second level translation in the case of explicit map/unmap (like > > > on power) if you are not tied to an mm_struct ? > > > > Right, I threw away the mmu notifier code that was originally part of > > vfio because we can't do anything useful with it yet on x86. I > > definitely don't want to prevent it where it makes sense though. Maybe > > we just record current->mm on open and restrict subsequent opens to the > > same. > > Hmm, I think we need io-page-fault support in the iommu-api then. Yeah, when we can handle iommu page faults, this gets more interesting. > > > Another aspect I don't see discussed is how we represent these things to > > > the guest. > > > > > > On Power for example, I have a requirement that a given iommu domain is > > > represented by a single dma window property in the device-tree. What > > > that means is that that property needs to be either in the node of the > > > device itself if there's only one device in the group or in a parent > > > node (ie a bridge or host bridge) if there are multiple devices. > > > > > > Now I do -not- want to go down the path of simulating P2P bridges, > > > besides we'll quickly run out of bus numbers if we go there. > > > > > > For us the most simple and logical approach (which is also what pHyp > > > uses and what Linux handles well) is really to expose a given PCI host > > > bridge per group to the guest. Believe it or not, it makes things > > > easier :-) > > > > I'm all for easier. Why does exposing the bridge use less bus numbers > > than emulating a bridge? > > > > On x86, I want to maintain that our default assignment is at the device > > level. A user should be able to pick single or multiple devices from > > across several groups and have them all show up as individual, > > hotpluggable devices on bus 0 in the guest. Not surprisingly, we've > > also seen cases where users try to attach a bridge to the guest, > > assuming they'll get all the devices below the bridge, so I'd be in > > favor of making this "just work" if possible too, though we may have to > > prevent hotplug of those. > > A side-note: Might it be better to expose assigned devices in a guest on > a seperate bus? This will make it easier to emulate an IOMMU for the > guest inside qemu. I think we want that option, sure. A lot of guests aren't going to support hotplugging buses though, so I think our default, map the entire guest model should still be using bus 0. 
The ACPI gets a lot more complicated for that model too; dynamic SSDTs? Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
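A sketch of what the "pass the BDF to obtain the device-fd" idea quoted above could look like from userspace. The VFIO_GROUP_GET_DEVICE_FD name, its numbering, and the group path are hypothetical.

#include <fcntl.h>
#include <stdio.h>
#include <sys/ioctl.h>

/* Hypothetical: ask the group for an fd to one of its devices by PCI address */
#define VFIO_GROUP_GET_DEVICE_FD	_IOW(';', 110, char *)

int main(void)
{
	int group = open("/dev/vfio/26", O_RDWR);	/* hypothetical group */
	int dev;

	if (group < 0)
		return 1;

	dev = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:00:19.0");
	if (dev < 0) {
		perror("device not in this group");
		return 1;
	}

	/* dev now refers to the individual device for region/irq access */
	return 0;
}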
Re: kvm PCI assignment & VFIO ramblings
On Wed, 2011-08-24 at 09:51 +1000, Benjamin Herrenschmidt wrote: > > > For us the most simple and logical approach (which is also what pHyp > > > uses and what Linux handles well) is really to expose a given PCI host > > > bridge per group to the guest. Believe it or not, it makes things > > > easier :-) > > > > I'm all for easier. Why does exposing the bridge use less bus numbers > > than emulating a bridge? > > Because a host bridge doesn't look like a PCI to PCI bridge at all for > us. It's an entire separate domain with it's own bus number space > (unlike most x86 setups). Ok, I missed the "host" bridge. > In fact we have some problems afaik in qemu today with the concept of > PCI domains, for example, I think qemu has assumptions about a single > shared IO space domain which isn't true for us (each PCI host bridge > provides a distinct IO space domain starting at 0). We'll have to fix > that, but it's not a huge deal. Yep, I've seen similar on ia64 systems. > So for each "group" we'd expose in the guest an entire separate PCI > domain space with its own IO, MMIO etc... spaces, handed off from a > single device-tree "host bridge" which doesn't itself appear in the > config space, doesn't need any emulation of any config space etc... > > > On x86, I want to maintain that our default assignment is at the device > > level. A user should be able to pick single or multiple devices from > > across several groups and have them all show up as individual, > > hotpluggable devices on bus 0 in the guest. Not surprisingly, we've > > also seen cases where users try to attach a bridge to the guest, > > assuming they'll get all the devices below the bridge, so I'd be in > > favor of making this "just work" if possible too, though we may have to > > prevent hotplug of those. > > > > Given the device requirement on x86 and since everything is a PCI device > > on x86, I'd like to keep a qemu command line something like -device > > vfio,host=00:19.0. I assume that some of the iommu properties, such as > > dma window size/address, will be query-able through an architecture > > specific (or general if possible) ioctl on the vfio group fd. I hope > > that will help the specification, but I don't fully understand what all > > remains. Thanks, > > Well, for iommu there's a couple of different issues here but yes, > basically on one side we'll have some kind of ioctl to know what segment > of the device(s) DMA address space is assigned to the group and we'll > need to represent that to the guest via a device-tree property in some > kind of "parent" node of all the devices in that group. > > We -might- be able to implement some kind of hotplug of individual > devices of a group under such a PHB (PCI Host Bridge), I don't know for > sure yet, some of that PAPR stuff is pretty arcane, but basically, for > all intend and purpose, we really want a group to be represented as a > PHB in the guest. > > We cannot arbitrary have individual devices of separate groups be > represented in the guest as siblings on a single simulated PCI bus. I think the vfio kernel layer we're describing easily supports both. This is just a matter of adding qemu-vfio code to expose different topologies based on group iommu capabilities and mapping mode. Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Wed, Aug 24, 2011 at 05:33:00AM -0400, David Gibson wrote: > On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote: > > I don't see a reason to make this meta-grouping static. It would harm > > flexibility on x86. I think it makes things easier on power but there > > are options on that platform to get the dynamic solution too. > > I think several people are misreading what Ben means by "static". I > would prefer to say 'persistent', in that the meta-groups lifetime is > not tied to an fd, but they can be freely created, altered and removed > during runtime. Even if it can be altered at runtime, from a usability perspective it is certainly best to handle these groups directly in qemu. Or are there strong reasons to do it somewhere else? Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Wed, Aug 24, 2011 at 11:14:26AM +0200, Roedel, Joerg wrote: > On Tue, Aug 23, 2011 at 12:54:27PM -0400, aafabbri wrote: > > On 8/23/11 4:04 AM, "Joerg Roedel" wrote: > > > That is makes uiommu basically the same as the meta-groups, right? > > > > Yes, functionality seems the same, thus my suggestion to keep uiommu > > explicit. Is there some need for group-groups besides defining sets of > > groups which share IOMMU resources? > > > > I do all this stuff (bringing up sets of devices which may share IOMMU > > domain) dynamically from C applications. I don't really want some static > > (boot-time or sysfs fiddling) supergroup config unless there is a good > > reason KVM/power needs it. > > > > As you say in your next email, doing it all from ioctls is very easy, > > programmatically. > > I don't see a reason to make this meta-grouping static. It would harm > flexibility on x86. I think it makes things easier on power but there > are options on that platform to get the dynamic solution too. I think several people are misreading what Ben means by "static". I would prefer to say 'persistent', in that the meta-groups lifetime is not tied to an fd, but they can be freely created, altered and removed during runtime. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Tue, Aug 23, 2011 at 12:54:27PM -0400, aafabbri wrote: > On 8/23/11 4:04 AM, "Joerg Roedel" wrote: > > That is makes uiommu basically the same as the meta-groups, right? > > Yes, functionality seems the same, thus my suggestion to keep uiommu > explicit. Is there some need for group-groups besides defining sets of > groups which share IOMMU resources? > > I do all this stuff (bringing up sets of devices which may share IOMMU > domain) dynamically from C applications. I don't really want some static > (boot-time or sysfs fiddling) supergroup config unless there is a good > reason KVM/power needs it. > > As you say in your next email, doing it all from ioctls is very easy, > programmatically. I don't see a reason to make this meta-grouping static. It would harm flexibility on x86. I think it makes things easier on power but there are options on that platform to get the dynamic solution too. Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Tue, Aug 23, 2011 at 01:33:14PM -0400, Aaron Fabbri wrote: > On 8/23/11 10:01 AM, "Alex Williamson" wrote: > > The iommu domain would probably be allocated when the first device is > > bound to vfio. As each device is bound, it gets attached to the group. > > DMAs are done via an ioctl on the group. > > > > I think group + uiommu leads to effectively reliving most of the > > problems with the current code. The only benefit is the group > > assignment to enforce hardware restrictions. We still have the problem > > that uiommu open() = iommu_domain_alloc(), whose properties are > > meaningless without attached devices (groups). Which I think leads to > > the same awkward model of attaching groups to define the domain, then we > > end up doing mappings via the group to enforce ordering. > > Is there a better way to allow groups to share an IOMMU domain? > > Maybe, instead of having an ioctl to allow a group A to inherit the same > iommu domain as group B, we could have an ioctl to fully merge two groups > (could be what Ben was thinking): > > A.ioctl(MERGE_TO_GROUP, B) > > The group A now goes away and its devices join group B. If A ever had an > iommu domain assigned (and buffers mapped?) we fail. > > Groups cannot get smaller (they are defined as minimum granularity of an > IOMMU, initially). They can get bigger if you want to share IOMMU > resources, though. > > Any downsides to this approach? As long as this is a 2-way road its fine. There must be a way to split the groups again after the guest exits. But then we are again at the super-groups (aka meta-groups, aka uiommu) point. Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Tue, Aug 23, 2011 at 07:35:37PM -0400, Benjamin Herrenschmidt wrote: > On Tue, 2011-08-23 at 15:18 +0200, Roedel, Joerg wrote: > > Hmm, good idea. But as far as I know the hotplug-event needs to be in > > the guest _before_ the device is actually unplugged (so that the guest > > can unbind its driver first). That somehow brings back the sleep-idea > > and the timeout in the .release function. > > That's for normal assisted hotplug, but don't we support hard hotplug ? > I mean, things like cardbus, thunderbolt (if we ever support that) > etc... will need it and some platforms do support hard hotplug of PCIe > devices. > > (That's why drivers should never spin on MMIO waiting for a 1 bit to > clear without a timeout :-) Right, that's probably the best semantics for this issue then. The worst thing that happens is that the admin crashes the guest. Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
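The aside about never spinning on MMIO without a timeout is the pattern that makes hard hotplug survivable. A minimal kernel-style sketch, with a made-up register bit and helper name:

#include <linux/io.h>
#include <linux/delay.h>
#include <linux/errno.h>
#include <linux/jiffies.h>

#define DEMO_STATUS_BUSY	0x1	/* hypothetical status bit */

/* Poll a status register until the busy bit clears, but give up after
 * 100ms: a hard-unplugged device reads back as all 1s and the bit would
 * never "clear", so an unbounded loop would wedge the driver. */
static int demo_wait_idle(void __iomem *status_reg)
{
	unsigned long timeout = jiffies + msecs_to_jiffies(100);

	while (readl(status_reg) & DEMO_STATUS_BUSY) {
		if (time_after(jiffies, timeout))
			return -ETIMEDOUT;
		udelay(10);
	}

	return 0;
}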
Re: kvm PCI assignment & VFIO ramblings
On Tue, Aug 23, 2011 at 01:08:29PM -0400, Alex Williamson wrote: > On Tue, 2011-08-23 at 15:14 +0200, Roedel, Joerg wrote: > > Handling it through fds is a good idea. This makes sure that everything > > belongs to one process. I am not really sure yet if we go the way to > > just bind plain groups together or if we create meta-groups. The > > meta-groups thing seems somewhat cleaner, though. > > I'm leaning towards binding because we need to make it dynamic, but I > don't really have a good picture of the lifecycle of a meta-group. In my view the life-cycle of the meta-group is a subrange of the qemu-instance's life-cycle. > > Putting the process to sleep (which would be uninterruptible) seems bad. > > The process would sleep until the guest releases the device-group, which > > can take days or months. > > The best thing (and the most intrusive :-) ) is to change PCI core to > > allow unbindings to fail, I think. But this probably further complicates > > the way to upstream VFIO... > > Yes, it's not ideal but I think it's sufficient for now and if we later > get support for returning an error from release, we can set a timeout > after notifying the user to make use of that. Thanks, Ben had the idea of just forcing to hard-unplug this device from the guest. Thats probably the best way to deal with that, I think. VFIO sends a notification to qemu that the device is gone and qemu informs the guest in some way about it. Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Tue, Aug 23, 2011 at 03:30:06PM -0400, Alex Williamson wrote: > On Tue, 2011-08-23 at 07:01 +1000, Benjamin Herrenschmidt wrote: > > Could be tho in what form ? returning sysfs pathes ? > > I'm at a loss there, please suggest. I think we need an ioctl that > returns some kind of array of devices within the group and another that > maybe takes an index from that array and returns an fd for that device. > A sysfs path string might be a reasonable array element, but it sounds > like a pain to work with. Limiting to PCI we can just pass the BDF as the argument to optain the device-fd. For a more generic solution we need a unique identifier in some way which is unique across all 'struct device' instances in the system. As far as I know we don't have that yet (besides the sysfs-path) so we either add that or stick with bus-specific solutions. > > 1:1 process has the advantage of linking to an -mm which makes the whole > > mmu notifier business doable. How do you want to track down mappings and > > do the second level translation in the case of explicit map/unmap (like > > on power) if you are not tied to an mm_struct ? > > Right, I threw away the mmu notifier code that was originally part of > vfio because we can't do anything useful with it yet on x86. I > definitely don't want to prevent it where it makes sense though. Maybe > we just record current->mm on open and restrict subsequent opens to the > same. Hmm, I think we need io-page-fault support in the iommu-api then. > > Another aspect I don't see discussed is how we represent these things to > > the guest. > > > > On Power for example, I have a requirement that a given iommu domain is > > represented by a single dma window property in the device-tree. What > > that means is that that property needs to be either in the node of the > > device itself if there's only one device in the group or in a parent > > node (ie a bridge or host bridge) if there are multiple devices. > > > > Now I do -not- want to go down the path of simulating P2P bridges, > > besides we'll quickly run out of bus numbers if we go there. > > > > For us the most simple and logical approach (which is also what pHyp > > uses and what Linux handles well) is really to expose a given PCI host > > bridge per group to the guest. Believe it or not, it makes things > > easier :-) > > I'm all for easier. Why does exposing the bridge use less bus numbers > than emulating a bridge? > > On x86, I want to maintain that our default assignment is at the device > level. A user should be able to pick single or multiple devices from > across several groups and have them all show up as individual, > hotpluggable devices on bus 0 in the guest. Not surprisingly, we've > also seen cases where users try to attach a bridge to the guest, > assuming they'll get all the devices below the bridge, so I'd be in > favor of making this "just work" if possible too, though we may have to > prevent hotplug of those. A side-note: Might it be better to expose assigned devices in a guest on a seperate bus? This will make it easier to emulate an IOMMU for the guest inside qemu. Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On 23.08.2011, at 18:51, Benjamin Herrenschmidt wrote: > >>> For us the most simple and logical approach (which is also what pHyp >>> uses and what Linux handles well) is really to expose a given PCI host >>> bridge per group to the guest. Believe it or not, it makes things >>> easier :-) >> >> I'm all for easier. Why does exposing the bridge use less bus numbers >> than emulating a bridge? > > Because a host bridge doesn't look like a PCI to PCI bridge at all for > us. It's an entire separate domain with it's own bus number space > (unlike most x86 setups). > > In fact we have some problems afaik in qemu today with the concept of > PCI domains, for example, I think qemu has assumptions about a single > shared IO space domain which isn't true for us (each PCI host bridge > provides a distinct IO space domain starting at 0). We'll have to fix > that, but it's not a huge deal. > > So for each "group" we'd expose in the guest an entire separate PCI > domain space with its own IO, MMIO etc... spaces, handed off from a > single device-tree "host bridge" which doesn't itself appear in the > config space, doesn't need any emulation of any config space etc... > >> On x86, I want to maintain that our default assignment is at the device >> level. A user should be able to pick single or multiple devices from >> across several groups and have them all show up as individual, >> hotpluggable devices on bus 0 in the guest. Not surprisingly, we've >> also seen cases where users try to attach a bridge to the guest, >> assuming they'll get all the devices below the bridge, so I'd be in >> favor of making this "just work" if possible too, though we may have to >> prevent hotplug of those. >> >> Given the device requirement on x86 and since everything is a PCI device >> on x86, I'd like to keep a qemu command line something like -device >> vfio,host=00:19.0. I assume that some of the iommu properties, such as >> dma window size/address, will be query-able through an architecture >> specific (or general if possible) ioctl on the vfio group fd. I hope >> that will help the specification, but I don't fully understand what all >> remains. Thanks, > > Well, for iommu there's a couple of different issues here but yes, > basically on one side we'll have some kind of ioctl to know what segment > of the device(s) DMA address space is assigned to the group and we'll > need to represent that to the guest via a device-tree property in some > kind of "parent" node of all the devices in that group. > > We -might- be able to implement some kind of hotplug of individual > devices of a group under such a PHB (PCI Host Bridge), I don't know for > sure yet, some of that PAPR stuff is pretty arcane, but basically, for > all intend and purpose, we really want a group to be represented as a > PHB in the guest. > > We cannot arbitrary have individual devices of separate groups be > represented in the guest as siblings on a single simulated PCI bus. So would it make sense for you to go the same route that we need to go on embedded power, with a separate VFIO style interface that simply exports memory ranges and irq bindings, but doesn't know anything about PCI? For e500, we'll be using something like that to pass through a full PCI bus into the system. Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
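For a feel of what such a bus-agnostic interface would have to describe, here is a rough sketch of per-device information limited to memory ranges and interrupt bindings, with no config space. All structure and field names are invented for illustration; nothing here is an existing ABI.

#include <stdint.h>

/* Hypothetical descriptors -- layout invented for illustration */
struct vfio_region_desc {
	uint64_t offset;	/* where to mmap()/pread() this range via the device fd */
	uint64_t size;
	uint32_t flags;		/* readable / writable / mmap-able */
};

struct vfio_irq_desc {
	uint32_t hwirq;		/* platform interrupt line */
	uint32_t flags;		/* edge vs. level, maskable, ... */
};

struct vfio_platform_device_info {
	uint32_t num_regions;	/* followed by that many vfio_region_desc */
	uint32_t num_irqs;	/* followed by that many vfio_irq_desc */
};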
Re: kvm PCI assignment & VFIO ramblings
On 23.08.2011, at 18:41, Benjamin Herrenschmidt wrote: > On Tue, 2011-08-23 at 10:23 -0600, Alex Williamson wrote: >> >> Yeah. Joerg's idea of binding groups internally (pass the fd of one >> group to another via ioctl) is one option. The tricky part will be >> implementing it to support hot unplug of any group from the >> supergroup. >> I believe Ben had a suggestion that supergroups could be created in >> sysfs, but I don't know what the mechanism to do that looks like. It >> would also be an extra management step to dynamically bind and unbind >> groups to the supergroup around hotplug. Thanks, > > I don't really care that much what the method for creating them is, to > be honest, I just prefer this concept of "meta groups" or "super groups" > or "synthetic groups" (whatever you want to name them) to having a > separate uiommu file descriptor. > > The one reason I have a slight preference for creating them "statically" > using some kind of separate interface (again, I don't care whether it's > sysfs, netlink, etc...) is that it means things like qemu don't have to > care about them. > > In general, apps that want to use vfio can just get passed the path to > such a group or the /dev/ path or the group number (whatever we chose as > the way to identify a group), and don't need to know anything about > "super groups", how to manipulate them, create them, possible > constraints etc... > > Now, libvirt might want to know about that other API in order to provide > control on the creation of these things, but that's a different issue. > > By "static" I mean they persist, they aren't tied to the lifetime of an > fd. > > Now that's purely a preference on my side because I believe it will make > life easier for actual programs wanting to use vfio to not have to care > about those super-groups, but as I said earlier, I don't actually care > that much :-) Oh I think it's one of the building blocks we need for a sane user space device exposure API. If I want to pass user X a few devices that are all behind a single IOMMU, I just chown that device node to user X and be done with it. The user space tool actually using the VFIO interface wouldn't be in configuration business then - and it really shouldn't. That's what system configuration is there for :). But I'm fairly sure we managed to persuade Alex that this is the right path on the BOF :) Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
> > For us the most simple and logical approach (which is also what pHyp > > uses and what Linux handles well) is really to expose a given PCI host > > bridge per group to the guest. Believe it or not, it makes things > > easier :-) > > I'm all for easier. Why does exposing the bridge use less bus numbers > than emulating a bridge? Because a host bridge doesn't look like a PCI to PCI bridge at all for us. It's an entire separate domain with it's own bus number space (unlike most x86 setups). In fact we have some problems afaik in qemu today with the concept of PCI domains, for example, I think qemu has assumptions about a single shared IO space domain which isn't true for us (each PCI host bridge provides a distinct IO space domain starting at 0). We'll have to fix that, but it's not a huge deal. So for each "group" we'd expose in the guest an entire separate PCI domain space with its own IO, MMIO etc... spaces, handed off from a single device-tree "host bridge" which doesn't itself appear in the config space, doesn't need any emulation of any config space etc... > On x86, I want to maintain that our default assignment is at the device > level. A user should be able to pick single or multiple devices from > across several groups and have them all show up as individual, > hotpluggable devices on bus 0 in the guest. Not surprisingly, we've > also seen cases where users try to attach a bridge to the guest, > assuming they'll get all the devices below the bridge, so I'd be in > favor of making this "just work" if possible too, though we may have to > prevent hotplug of those. > > Given the device requirement on x86 and since everything is a PCI device > on x86, I'd like to keep a qemu command line something like -device > vfio,host=00:19.0. I assume that some of the iommu properties, such as > dma window size/address, will be query-able through an architecture > specific (or general if possible) ioctl on the vfio group fd. I hope > that will help the specification, but I don't fully understand what all > remains. Thanks, Well, for iommu there's a couple of different issues here but yes, basically on one side we'll have some kind of ioctl to know what segment of the device(s) DMA address space is assigned to the group and we'll need to represent that to the guest via a device-tree property in some kind of "parent" node of all the devices in that group. We -might- be able to implement some kind of hotplug of individual devices of a group under such a PHB (PCI Host Bridge), I don't know for sure yet, some of that PAPR stuff is pretty arcane, but basically, for all intend and purpose, we really want a group to be represented as a PHB in the guest. We cannot arbitrary have individual devices of separate groups be represented in the guest as siblings on a single simulated PCI bus. Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Tue, 2011-08-23 at 10:23 -0600, Alex Williamson wrote: > > Yeah. Joerg's idea of binding groups internally (pass the fd of one > group to another via ioctl) is one option. The tricky part will be > implementing it to support hot unplug of any group from the > supergroup. > I believe Ben had a suggestion that supergroups could be created in > sysfs, but I don't know what the mechanism to do that looks like. It > would also be an extra management step to dynamically bind and unbind > groups to the supergroup around hotplug. Thanks, I don't really care that much what the method for creating them is, to be honest, I just prefer this concept of "meta groups" or "super groups" or "synthetic groups" (whatever you want to name them) to having a separate uiommu file descriptor. The one reason I have a slight preference for creating them "statically" using some kind of separate interface (again, I don't care whether it's sysfs, netlink, etc...) is that it means things like qemu don't have to care about them. In general, apps that want to use vfio can just get passed the path to such a group or the /dev/ path or the group number (whatever we chose as the way to identify a group), and don't need to know anything about "super groups", how to manipulate them, create them, possible constraints etc... Now, libvirt might want to know about that other API in order to provide control on the creation of these things, but that's a different issue. By "static" I mean they persist, they aren't tied to the lifetime of an fd. Now that's purely a preference on my side because I believe it will make life easier for actual programs wanting to use vfio to not have to care about those super-groups, but as I said earlier, I don't actually care that much :-) Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Tue, 2011-08-23 at 15:18 +0200, Roedel, Joerg wrote: > On Mon, Aug 22, 2011 at 05:03:53PM -0400, Benjamin Herrenschmidt wrote: > > > > > I am in favour of /dev/vfio/$GROUP. If multiple devices should be > > > assigned to a guest, there can also be an ioctl to bind a group to an > > > address-space of another group (certainly needs some care to not allow > > > that both groups belong to different processes). > > > > > > Btw, a problem we havn't talked about yet entirely is > > > driver-deassignment. User space can decide to de-assign the device from > > > vfio while a fd is open on it. With PCI there is no way to let this fail > > > (the .release function returns void last time i checked). Is this a > > > problem, and yes, how we handle that? > > > > We can treat it as a hard unplug (like a cardbus gone away). > > > > IE. Dispose of the direct mappings (switch to MMIO emulation) and return > > all ff's from reads (& ignore writes). > > > > Then send an unplug event via whatever mechanism the platform provides > > (ACPI hotplug controller on x86 for example, we haven't quite sorted out > > what to do on power for hotplug yet). > > Hmm, good idea. But as far as I know the hotplug-event needs to be in > the guest _before_ the device is actually unplugged (so that the guest > can unbind its driver first). That somehow brings back the sleep-idea > and the timeout in the .release function. That's for normal assisted hotplug, but don't we support hard hotplug ? I mean, things like cardbus, thunderbolt (if we ever support that) etc... will need it and some platforms do support hard hotplug of PCIe devices. (That's why drivers should never spin on MMIO waiting for a 1 bit to clear without a timeout :-) Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
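The "return all ff's from reads and ignore writes" behaviour described above is straightforward to sketch. The wrapper below is illustrative only; the structure, the revocation flag, and the helpers are invented and it does not use qemu's actual memory API.

#include <stdbool.h>
#include <stdint.h>

struct assigned_region {
	void *mmio;		/* direct mapping; NULL once revoked */
	bool unplugged;
};

/* Once the backing device is hard-unplugged, reads return all 1s (as a
 * master abort would) and writes are silently dropped. */
static uint32_t region_read32(struct assigned_region *r, uint64_t off)
{
	if (r->unplugged || !r->mmio)
		return 0xffffffff;
	return *(volatile uint32_t *)((char *)r->mmio + off);
}

static void region_write32(struct assigned_region *r, uint64_t off, uint32_t val)
{
	if (r->unplugged || !r->mmio)
		return;
	*(volatile uint32_t *)((char *)r->mmio + off) = val;
}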
Re: kvm PCI assignment & VFIO ramblings
On Tue, 2011-08-23 at 07:01 +1000, Benjamin Herrenschmidt wrote: > On Mon, 2011-08-22 at 09:45 -0600, Alex Williamson wrote: > > > Yes, that's the idea. An open question I have towards the configuration > > side is whether we might add iommu driver specific options to the > > groups. For instance on x86 where we typically have B:D.F granularity, > > should we have an option not to trust multi-function devices and use a > > B:D granularity for grouping? > > Or even B or range of busses... if you want to enforce strict isolation > you really can't trust anything below a bus level :-) > > > Right, we can also combine models. Binding a device to vfio > > creates /dev/vfio$GROUP, which only allows a subset of ioctls and no > > device access until all the group devices are also bound. I think > > the /dev/vfio/$GROUP might help provide an enumeration interface as well > > though, which could be useful. > > Could be tho in what form ? returning sysfs pathes ? I'm at a loss there, please suggest. I think we need an ioctl that returns some kind of array of devices within the group and another that maybe takes an index from that array and returns an fd for that device. A sysfs path string might be a reasonable array element, but it sounds like a pain to work with. > > 1:1 group<->process is probably too strong. Not allowing concurrent > > open()s on the group file enforces a single userspace entity is > > responsible for that group. Device fds can be passed to other > > processes, but only retrieved via the group fd. I suppose we could even > > branch off the dma interface into a different fd, but it seems like we > > would logically want to serialize dma mappings at each iommu group > > anyway. I'm open to alternatives, this just seemed an easy way to do > > it. Restricting on UID implies that we require isolated qemu instances > > to run as different UIDs. I know that's a goal, but I don't know if we > > want to make it an assumption in the group security model. > > 1:1 process has the advantage of linking to an -mm which makes the whole > mmu notifier business doable. How do you want to track down mappings and > do the second level translation in the case of explicit map/unmap (like > on power) if you are not tied to an mm_struct ? Right, I threw away the mmu notifier code that was originally part of vfio because we can't do anything useful with it yet on x86. I definitely don't want to prevent it where it makes sense though. Maybe we just record current->mm on open and restrict subsequent opens to the same. > > Yes. I'm not sure there's a good ROI to prioritize that model. We have > > to assume >1 device per guest is a typical model and that the iotlb is > > large enough that we might improve thrashing to see both a resource and > > performance benefit from it. I'm open to suggestions for how we could > > include it though. > > Sharing may or may not be possible depending on setups so yes, it's a > bit tricky. > > My preference is to have a static interface (and that's actually where > your pet netlink might make some sense :-) to create "synthetic" groups > made of other groups if the arch allows it. But that might not be the > best approach. In another email I also proposed an option for a group to > "capture" another one... I already made some comments on this in a different thread, so I won't repeat here. > > > If that's > > > not what you're saying, how would the domains - now made up of a > > > user's selection of groups, rather than individual devices - be > > > configured? 
> > > > > > > Hope that captures it, feel free to jump in with corrections and > > > > suggestions. Thanks, > > > > > Another aspect I don't see discussed is how we represent these things to > the guest. > > On Power for example, I have a requirement that a given iommu domain is > represented by a single dma window property in the device-tree. What > that means is that that property needs to be either in the node of the > device itself if there's only one device in the group or in a parent > node (ie a bridge or host bridge) if there are multiple devices. > > Now I do -not- want to go down the path of simulating P2P bridges, > besides we'll quickly run out of bus numbers if we go there. > > For us the most simple and logical approach (which is also what pHyp > uses and what Linux handles well) is really to expose a given PCI host > bridge per group to the guest. Believe it or not, it makes things > easier :-) I'm all for easier. Why does exposing the bridge use less bus numbers than emulating a bridge? On x86, I want to maintain that our default assignment is at the device level. A user should be able to pick single or multiple devices from across several groups and have them all show up as individual, hotpluggable devices on bus 0 in the guest. Not surprisingly, we've also seen cases where users try to attach a bridge to the guest, assuming they'll get all the devices below the bridge, so I'd be in favor of making this "just work" if possible too, though we may have to prevent hotplug of those. Given the device requirement on x86 and since everything is a PCI device on x86, I'd like to keep a qemu command line something like -device vfio,host=00:19.0. I assume that some of the iommu properties, such as dma window size/address, will be query-able through an architecture specific (or general if possible) ioctl on the vfio group fd. I hope that will help the specification, but I don't fully understand what all remains. Thanks, Alex
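One possible shape for the enumeration interface mentioned at the top of this message: one ioctl returns the devices within a group, another turns an index from that array into a device fd. The structures and ioctl numbers below are purely illustrative, not a defined ABI.

#include <stdint.h>
#include <sys/ioctl.h>

/* Hypothetical ABI sketch */
struct vfio_group_device {
	char	sysfs_path[256];	/* e.g. /sys/bus/pci/devices/0000:00:19.0 */
};

struct vfio_group_devices {
	uint32_t		count;		/* in: capacity, out: devices in group */
	struct vfio_group_device devices[];	/* filled by the kernel */
};

#define VFIO_GROUP_GET_DEVICES		_IOWR(';', 120, struct vfio_group_devices)
#define VFIO_GROUP_GET_DEVICE_FD_IDX	_IOW(';', 121, uint32_t)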
Re: kvm PCI assignment & VFIO ramblings
On Tue, 2011-08-23 at 10:33 -0700, Aaron Fabbri wrote: > > > On 8/23/11 10:01 AM, "Alex Williamson" wrote: > > > On Tue, 2011-08-23 at 16:54 +1000, Benjamin Herrenschmidt wrote: > >> On Mon, 2011-08-22 at 17:52 -0700, aafabbri wrote: > >> > >>> I'm not following you. > >>> > >>> You have to enforce group/iommu domain assignment whether you have the > >>> existing uiommu API, or if you change it to your proposed > >>> ioctl(inherit_iommu) API. > >>> > >>> The only change needed to VFIO here should be to make uiommu fd assignment > >>> happen on the groups instead of on device fds. That operation fails or > >>> succeeds according to the group semantics (all-or-none assignment/same > >>> uiommu). > >> > >> Ok, so I missed that part where you change uiommu to operate on group > >> fd's rather than device fd's, my apologies if you actually wrote that > >> down :-) It might be obvious ... bare with me I just flew back from the > >> US and I am badly jet lagged ... > > > > I missed it too, the model I'm proposing entirely removes the uiommu > > concept. > > > >> So I see what you mean, however... > >> > >>> I think the question is: do we force 1:1 iommu/group mapping, or do we > >>> allow > >>> arbitrary mapping (satisfying group constraints) as we do today. > >>> > >>> I'm saying I'm an existing user who wants the arbitrary iommu/group > >>> mapping > >>> ability and definitely think the uiommu approach is cleaner than the > >>> ioctl(inherit_iommu) approach. We considered that approach before but it > >>> seemed less clean so we went with the explicit uiommu context. > >> > >> Possibly, the question that interest me the most is what interface will > >> KVM end up using. I'm also not terribly fan with the (perceived) > >> discrepancy between using uiommu to create groups but using the group fd > >> to actually do the mappings, at least if that is still the plan. > > > > Current code: uiommu creates the domain, we bind a vfio device to that > > domain via a SET_UIOMMU_DOMAIN ioctl on the vfio device, then do > > mappings via MAP_DMA on the vfio device (affecting all the vfio devices > > bound to the domain) > > > > My current proposal: "groups" are predefined. groups ~= iommu domain. > > This is my main objection. I'd rather not lose the ability to have multiple > devices (which are all predefined as singleton groups on x86 w/o PCI > bridges) share IOMMU resources. Otherwise, 20 devices sharing buffers would > require 20x the IOMMU/ioTLB resources. KVM doesn't care about this case? We do care, I just wasn't prioritizing it as heavily since I think the typical model is probably closer to 1 device per guest. > > The iommu domain would probably be allocated when the first device is > > bound to vfio. As each device is bound, it gets attached to the group. > > DMAs are done via an ioctl on the group. > > > > I think group + uiommu leads to effectively reliving most of the > > problems with the current code. The only benefit is the group > > assignment to enforce hardware restrictions. We still have the problem > > that uiommu open() = iommu_domain_alloc(), whose properties are > > meaningless without attached devices (groups). Which I think leads to > > the same awkward model of attaching groups to define the domain, then we > > end up doing mappings via the group to enforce ordering. > > Is there a better way to allow groups to share an IOMMU domain? 
> > Maybe, instead of having an ioctl to allow a group A to inherit the same > iommu domain as group B, we could have an ioctl to fully merge two groups > (could be what Ben was thinking): > > A.ioctl(MERGE_TO_GROUP, B) > > The group A now goes away and its devices join group B. If A ever had an > iommu domain assigned (and buffers mapped?) we fail. > > Groups cannot get smaller (they are defined as minimum granularity of an > IOMMU, initially). They can get bigger if you want to share IOMMU > resources, though. > > Any downsides to this approach? That's sort of the way I'm picturing it. When groups are bound together, they effectively form a pool, where all the groups are peers. When the MERGE/BIND ioctl is called on group A and passed the group B fd, A can check compatibility of the domain associated with B, unbind devices from the B domain and attach them to the A domain. The B domain would then be freed and it would bump the refcnt on the A domain. If we need to remove A from the pool, we call UNMERGE/UNBIND on B with the A fd, it will remove the A devices from the shared object, disassociate A with the shared object, re-alloc a domain for A and rebind A devices to that domain. This is where it seems like it might be helpful to make a GET_IOMMU_FD ioctl so that an iommu object is ubiquitous and persistent across the pool. Operations on any group fd work on the pool as a whole. Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More major
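A rough kernel-side sketch of the merge step in this pool model, expressed with the existing iommu API. The vfio_group/vfio_device structures are invented, and locking, reference counting and the error path are elided.

#include <linux/iommu.h>
#include <linux/list.h>

struct vfio_device {
	struct device		*dev;
	struct list_head	next;
};

struct vfio_group {
	struct iommu_domain	*domain;	/* shared by merged groups */
	struct list_head	devices;
};

/* Move group B's devices into group A's domain; on failure a real
 * implementation would reattach the already-moved devices to B. */
static int vfio_group_merge(struct vfio_group *a, struct vfio_group *b)
{
	struct vfio_device *vdev;
	int ret;

	list_for_each_entry(vdev, &b->devices, next) {
		iommu_detach_device(b->domain, vdev->dev);
		ret = iommu_attach_device(a->domain, vdev->dev);
		if (ret)
			return ret;
	}

	iommu_domain_free(b->domain);
	b->domain = a->domain;	/* B now shares A's domain */

	return 0;
}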
Re: kvm PCI assignment & VFIO ramblings
On 8/23/11 10:01 AM, "Alex Williamson" wrote: > On Tue, 2011-08-23 at 16:54 +1000, Benjamin Herrenschmidt wrote: >> On Mon, 2011-08-22 at 17:52 -0700, aafabbri wrote: >> >>> I'm not following you. >>> >>> You have to enforce group/iommu domain assignment whether you have the >>> existing uiommu API, or if you change it to your proposed >>> ioctl(inherit_iommu) API. >>> >>> The only change needed to VFIO here should be to make uiommu fd assignment >>> happen on the groups instead of on device fds. That operation fails or >>> succeeds according to the group semantics (all-or-none assignment/same >>> uiommu). >> >> Ok, so I missed that part where you change uiommu to operate on group >> fd's rather than device fd's, my apologies if you actually wrote that >> down :-) It might be obvious ... bare with me I just flew back from the >> US and I am badly jet lagged ... > > I missed it too, the model I'm proposing entirely removes the uiommu > concept. > >> So I see what you mean, however... >> >>> I think the question is: do we force 1:1 iommu/group mapping, or do we allow >>> arbitrary mapping (satisfying group constraints) as we do today. >>> >>> I'm saying I'm an existing user who wants the arbitrary iommu/group mapping >>> ability and definitely think the uiommu approach is cleaner than the >>> ioctl(inherit_iommu) approach. We considered that approach before but it >>> seemed less clean so we went with the explicit uiommu context. >> >> Possibly, the question that interest me the most is what interface will >> KVM end up using. I'm also not terribly fan with the (perceived) >> discrepancy between using uiommu to create groups but using the group fd >> to actually do the mappings, at least if that is still the plan. > > Current code: uiommu creates the domain, we bind a vfio device to that > domain via a SET_UIOMMU_DOMAIN ioctl on the vfio device, then do > mappings via MAP_DMA on the vfio device (affecting all the vfio devices > bound to the domain) > > My current proposal: "groups" are predefined. groups ~= iommu domain. This is my main objection. I'd rather not lose the ability to have multiple devices (which are all predefined as singleton groups on x86 w/o PCI bridges) share IOMMU resources. Otherwise, 20 devices sharing buffers would require 20x the IOMMU/ioTLB resources. KVM doesn't care about this case? > The iommu domain would probably be allocated when the first device is > bound to vfio. As each device is bound, it gets attached to the group. > DMAs are done via an ioctl on the group. > > I think group + uiommu leads to effectively reliving most of the > problems with the current code. The only benefit is the group > assignment to enforce hardware restrictions. We still have the problem > that uiommu open() = iommu_domain_alloc(), whose properties are > meaningless without attached devices (groups). Which I think leads to > the same awkward model of attaching groups to define the domain, then we > end up doing mappings via the group to enforce ordering. Is there a better way to allow groups to share an IOMMU domain? Maybe, instead of having an ioctl to allow a group A to inherit the same iommu domain as group B, we could have an ioctl to fully merge two groups (could be what Ben was thinking): A.ioctl(MERGE_TO_GROUP, B) The group A now goes away and its devices join group B. If A ever had an iommu domain assigned (and buffers mapped?) we fail. Groups cannot get smaller (they are defined as minimum granularity of an IOMMU, initially). 
They can get bigger if you want to share IOMMU resources, though. Any downsides to this approach? -AF > >> If the separate uiommu interface is kept, then anything that wants to be >> able to benefit from the ability to put multiple devices (or existing >> groups) into such a "meta group" would need to be explicitly modified to >> deal with the uiommu APIs. >> >> I tend to prefer such "meta groups" as being something you create >> statically using a configuration interface, either via sysfs, netlink or >> ioctl's to a "control" vfio device driven by a simple command line tool >> (which can have the configuration stored in /etc and re-apply it at >> boot). > > I cringe anytime there's a mention of "static". IMHO, we have to > support hotplug. That means "meta groups" change dynamically. Maybe > this supports the idea that we should be able to retrieve a new fd from > the group to do mappings. Any groups bound together will return the > same fd and the fd will persist so long as any member of the group is > open. > >> That way, any program capable of exploiting VFIO "groups" will >> automatically be able to exploit those "meta groups" (or groups of >> groups) as well as long as they are supported on the system. >> >> If we ever have system specific constraints as to how such groups can be >> created, then it can all be handled at the level of that configuration >> tool without impact on whatever programs know how to exploit them via >> the VFIO interfaces.
Re: kvm PCI assignment & VFIO ramblings
On Tue, 2011-08-23 at 15:14 +0200, Roedel, Joerg wrote: > On Mon, Aug 22, 2011 at 03:17:00PM -0400, Alex Williamson wrote: > > On Mon, 2011-08-22 at 19:25 +0200, Joerg Roedel wrote: > > > > I am in favour of /dev/vfio/$GROUP. If multiple devices should be > > > assigned to a guest, there can also be an ioctl to bind a group to an > > > address-space of another group (certainly needs some care to not allow > > > that both groups belong to different processes). > > > > That's an interesting idea. Maybe an interface similar to the current > > uiommu interface, where you open() the 2nd group fd and pass the fd via > > ioctl to the primary group. IOMMUs that don't support this would fail > > the attach device callback, which would fail the ioctl to bind them. It > > will need to be designed so any group can be removed from the super-set > > and the remaining group(s) still works. This feels like something that > > can be added after we get an initial implementation. > > Handling it through fds is a good idea. This makes sure that everything > belongs to one process. I am not really sure yet if we go the way to > just bind plain groups together or if we create meta-groups. The > meta-groups thing seems somewhat cleaner, though. I'm leaning towards binding because we need to make it dynamic, but I don't really have a good picture of the lifecycle of a meta-group. > > > Btw, a problem we havn't talked about yet entirely is > > > driver-deassignment. User space can decide to de-assign the device from > > > vfio while a fd is open on it. With PCI there is no way to let this fail > > > (the .release function returns void last time i checked). Is this a > > > problem, and yes, how we handle that? > > > > The current vfio has the same problem, we can't unbind a device from > > vfio while it's attached to a guest. I think we'd use the same solution > > too; send out a netlink packet for a device removal and have the .remove > > call sleep on a wait_event(, refcnt == 0). We could also set a timeout > > and SIGBUS the PIDs holding the device if they don't return it > > willingly. Thanks, > > Putting the process to sleep (which would be uninterruptible) seems bad. > The process would sleep until the guest releases the device-group, which > can take days or months. > The best thing (and the most intrusive :-) ) is to change PCI core to > allow unbindings to fail, I think. But this probably further complicates > the way to upstream VFIO... Yes, it's not ideal but I think it's sufficient for now and if we later get support for returning an error from release, we can set a timeout after notifying the user to make use of that. Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
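A sketch of the "notify, wait, then escalate" release path described above. The vfio_dev structure and the notification/SIGBUS helpers are invented stubs; the wait_event pattern is the point.

#include <linux/atomic.h>
#include <linux/jiffies.h>
#include <linux/wait.h>

struct vfio_dev {
	atomic_t		refcnt;		/* open userspace references */
	wait_queue_head_t	release_wait;
};

/* Stubs standing in for the real notification paths */
static void vfio_netlink_remove_event(struct vfio_dev *vdev) { }
static void vfio_sigbus_users(struct vfio_dev *vdev) { }

/* Called from the bus .remove path when the admin unbinds the device. */
static void vfio_wait_for_release(struct vfio_dev *vdev)
{
	/* Politely ask userspace (e.g. qemu) to give the device back. */
	vfio_netlink_remove_event(vdev);

	/* Wait for all fds to close, but not forever. */
	if (!wait_event_timeout(vdev->release_wait,
				atomic_read(&vdev->refcnt) == 0,
				msecs_to_jiffies(10000))) {
		/* Escalate: SIGBUS the holders, then wait for them to go away. */
		vfio_sigbus_users(vdev);
		wait_event(vdev->release_wait,
			   atomic_read(&vdev->refcnt) == 0);
	}
}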
Re: kvm PCI assignment & VFIO ramblings
On Tue, 2011-08-23 at 16:54 +1000, Benjamin Herrenschmidt wrote: > On Mon, 2011-08-22 at 17:52 -0700, aafabbri wrote: > > > I'm not following you. > > > > You have to enforce group/iommu domain assignment whether you have the > > existing uiommu API, or if you change it to your proposed > > ioctl(inherit_iommu) API. > > > > The only change needed to VFIO here should be to make uiommu fd assignment > > happen on the groups instead of on device fds. That operation fails or > > succeeds according to the group semantics (all-or-none assignment/same > > uiommu). > > Ok, so I missed that part where you change uiommu to operate on group > fd's rather than device fd's, my apologies if you actually wrote that > down :-) It might be obvious ... bare with me I just flew back from the > US and I am badly jet lagged ... I missed it too, the model I'm proposing entirely removes the uiommu concept. > So I see what you mean, however... > > > I think the question is: do we force 1:1 iommu/group mapping, or do we allow > > arbitrary mapping (satisfying group constraints) as we do today. > > > > I'm saying I'm an existing user who wants the arbitrary iommu/group mapping > > ability and definitely think the uiommu approach is cleaner than the > > ioctl(inherit_iommu) approach. We considered that approach before but it > > seemed less clean so we went with the explicit uiommu context. > > Possibly, the question that interest me the most is what interface will > KVM end up using. I'm also not terribly fan with the (perceived) > discrepancy between using uiommu to create groups but using the group fd > to actually do the mappings, at least if that is still the plan. Current code: uiommu creates the domain, we bind a vfio device to that domain via a SET_UIOMMU_DOMAIN ioctl on the vfio device, then do mappings via MAP_DMA on the vfio device (affecting all the vfio devices bound to the domain) My current proposal: "groups" are predefined. groups ~= iommu domain. The iommu domain would probably be allocated when the first device is bound to vfio. As each device is bound, it gets attached to the group. DMAs are done via an ioctl on the group. I think group + uiommu leads to effectively reliving most of the problems with the current code. The only benefit is the group assignment to enforce hardware restrictions. We still have the problem that uiommu open() = iommu_domain_alloc(), whose properties are meaningless without attached devices (groups). Which I think leads to the same awkward model of attaching groups to define the domain, then we end up doing mappings via the group to enforce ordering. > If the separate uiommu interface is kept, then anything that wants to be > able to benefit from the ability to put multiple devices (or existing > groups) into such a "meta group" would need to be explicitly modified to > deal with the uiommu APIs. > > I tend to prefer such "meta groups" as being something you create > statically using a configuration interface, either via sysfs, netlink or > ioctl's to a "control" vfio device driven by a simple command line tool > (which can have the configuration stored in /etc and re-apply it at > boot). I cringe anytime there's a mention of "static". IMHO, we have to support hotplug. That means "meta groups" change dynamically. Maybe this supports the idea that we should be able to retrieve a new fd from the group to do mappings. Any groups bound together will return the same fd and the fd will persist so long as any member of the group is open. 
> That way, any program capable of exploiting VFIO "groups" will > automatically be able to exploit those "meta groups" (or groups of > groups) as well as long as they are supported on the system. > > If we ever have system specific constraints as to how such groups can be > created, then it can all be handled at the level of that configuration > tool without impact on whatever programs know how to exploit them via > the VFIO interfaces. I'd prefer to have the constraints be represented in the ioctl to bind groups. It works or not and the platform gets to define what it considers compatible. Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
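As a sketch of the flow Alex outlines (the group fd owns the iommu domain, device fds are retrieved through it, and DMA mappings are made on the group), something like the following; every ioctl name, number and struct layout here is made up for illustration only.

    #include <fcntl.h>
    #include <stdint.h>
    #include <stdio.h>
    #include <sys/ioctl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define VFIO_GROUP_GET_DEVICE_FD 0x3b01   /* placeholder numbers only */
    #define VFIO_GROUP_MAP_DMA       0x3b02

    struct dma_map {                          /* illustrative layout only */
            uint64_t vaddr;
            uint64_t iova;
            uint64_t size;
    };

    int main(void)
    {
            /* open() fails unless every device in group 42 is bound to vfio */
            int group = open("/dev/vfio/42", O_RDWR);
            if (group < 0) {
                    perror("open group");
                    return 1;
            }

            /* device fds are handed out by the group, not by separate nodes */
            int dev = ioctl(group, VFIO_GROUP_GET_DEVICE_FD, "0000:06:0d.0");

            void *buf = mmap(NULL, 1 << 20, PROT_READ | PROT_WRITE,
                             MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

            struct dma_map map = {
                    .vaddr = (uintptr_t)buf,
                    .iova  = 0x100000,
                    .size  = 1 << 20,
            };

            /* mappings are made on the group fd, so they cover the whole
             * iommu domain backing the group */
            ioctl(group, VFIO_GROUP_MAP_DMA, &map);

            close(dev);
            close(group);
            return 0;
    }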
Re: kvm PCI assignment & VFIO ramblings
On 8/23/11 4:04 AM, "Joerg Roedel" wrote: > On Mon, Aug 22, 2011 at 08:52:18PM -0400, aafabbri wrote: >> You have to enforce group/iommu domain assignment whether you have the >> existing uiommu API, or if you change it to your proposed >> ioctl(inherit_iommu) API. >> >> The only change needed to VFIO here should be to make uiommu fd assignment >> happen on the groups instead of on device fds. That operation fails or >> succeeds according to the group semantics (all-or-none assignment/same >> uiommu). > > That is makes uiommu basically the same as the meta-groups, right? Yes, functionality seems the same, thus my suggestion to keep uiommu explicit. Is there some need for group-groups besides defining sets of groups which share IOMMU resources? I do all this stuff (bringing up sets of devices which may share IOMMU domain) dynamically from C applications. I don't really want some static (boot-time or sysfs fiddling) supergroup config unless there is a good reason KVM/power needs it. As you say in your next email, doing it all from ioctls is very easy, programmatically. -Aaron Fabbri -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Tue, 2011-08-23 at 12:38 +1000, David Gibson wrote: > On Mon, Aug 22, 2011 at 09:45:48AM -0600, Alex Williamson wrote: > > On Mon, 2011-08-22 at 15:55 +1000, David Gibson wrote: > > > On Sat, Aug 20, 2011 at 09:51:39AM -0700, Alex Williamson wrote: > > > > We had an extremely productive VFIO BoF on Monday. Here's my attempt to > > > > capture the plan that I think we agreed to: > > > > > > > > We need to address both the description and enforcement of device > > > > groups. Groups are formed any time the iommu does not have resolution > > > > between a set of devices. On x86, this typically happens when a > > > > PCI-to-PCI bridge exists between the set of devices and the iommu. For > > > > Power, partitionable endpoints define a group. Grouping information > > > > needs to be exposed for both userspace and kernel internal usage. This > > > > will be a sysfs attribute setup by the iommu drivers. Perhaps: > > > > > > > > # cat /sys/devices/pci:00/:00:19.0/iommu_group > > > > 42 > > > > > > > > (I use a PCI example here, but attribute should not be PCI specific) > > > > > > Ok. Am I correct in thinking these group IDs are representing the > > > minimum granularity, and are therefore always static, defined only by > > > the connected hardware, not by configuration? > > > > Yes, that's the idea. An open question I have towards the configuration > > side is whether we might add iommu driver specific options to the > > groups. For instance on x86 where we typically have B:D.F granularity, > > should we have an option not to trust multi-function devices and use a > > B:D granularity for grouping? > > Right. And likewise I can see a place for configuration parameters > like the present 'allow_unsafe_irqs'. But these would be more-or-less > global options which affected the overall granularity, rather than > detailed configuration such as explicitly binding some devices into a > group, yes? Yes, currently the interrupt remapping support is a global iommu capability. I suppose it's possible that this could be an iommu option, where the iommu driver would not advertise a group if the interrupt remapping constraint isn't met. > > > > >From there we have a few options. In the BoF we discussed a model > > > > >where > > > > binding a device to vfio creates a /dev/vfio$GROUP character device > > > > file. This "group" fd provides provides dma mapping ioctls as well as > > > > ioctls to enumerate and return a "device" fd for each attached member of > > > > the group (similar to KVM_CREATE_VCPU). We enforce grouping by > > > > returning an error on open() of the group fd if there are members of the > > > > group not bound to the vfio driver. Each device fd would then support a > > > > similar set of ioctls and mapping (mmio/pio/config) interface as current > > > > vfio, except for the obvious domain and dma ioctls superseded by the > > > > group fd. > > > > > > It seems a slightly strange distinction that the group device appears > > > when any device in the group is bound to vfio, but only becomes usable > > > when all devices are bound. > > > > > > > Another valid model might be that /dev/vfio/$GROUP is created for all > > > > groups when the vfio module is loaded. The group fd would allow open() > > > > and some set of iommu querying and device enumeration ioctls, but would > > > > error on dma mapping and retrieving device fds until all of the group > > > > devices are bound to the vfio driver. > > > > > > Which is why I marginally prefer this model, although it's not a big > > > deal. 
> > > > Right, we can also combine models. Binding a device to vfio > > creates /dev/vfio$GROUP, which only allows a subset of ioctls and no > > device access until all the group devices are also bound. I think > > the /dev/vfio/$GROUP might help provide an enumeration interface as well > > though, which could be useful. > > I'm not entirely sure what you mean here. But, that's now several > weak votes in favour of the always-present group devices, and none in > favour of the created-when-first-device-bound model, so I suggest we > take the /dev/vfio/$GROUP as our tentative approach. Yep > > > > In either case, the uiommu interface is removed entirely since dma > > > > mapping is done via the group fd. As necessary in the future, we can > > > > define a more high performance dma mapping interface for streaming dma > > > > via the group fd. I expect we'll also include architecture specific > > > > group ioctls to describe features and capabilities of the iommu. The > > > > group fd will need to prevent concurrent open()s to maintain a 1:1 group > > > > to userspace process ownership model. > > > > > > A 1:1 group<->process correspondance seems wrong to me. But there are > > > many ways you could legitimately write the userspace side of the code, > > > many of them involving some sort of concurrency. Implementing that > > > concurrency as multiple processes (using explicit shared memory
Re: kvm PCI assignment & VFIO ramblings
On Mon, Aug 22, 2011 at 03:17:00PM -0400, Alex Williamson wrote: > On Mon, 2011-08-22 at 19:25 +0200, Joerg Roedel wrote: > > I am in favour of /dev/vfio/$GROUP. If multiple devices should be > > assigned to a guest, there can also be an ioctl to bind a group to an > > address-space of another group (certainly needs some care to not allow > > that both groups belong to different processes). > > That's an interesting idea. Maybe an interface similar to the current > uiommu interface, where you open() the 2nd group fd and pass the fd via > ioctl to the primary group. IOMMUs that don't support this would fail > the attach device callback, which would fail the ioctl to bind them. It > will need to be designed so any group can be removed from the super-set > and the remaining group(s) still works. This feels like something that > can be added after we get an initial implementation. Handling it through fds is a good idea. This makes sure that everything belongs to one process. I am not really sure yet if we go the way to just bind plain groups together or if we create meta-groups. The meta-groups thing seems somewhat cleaner, though. > > Btw, a problem we havn't talked about yet entirely is > > driver-deassignment. User space can decide to de-assign the device from > > vfio while a fd is open on it. With PCI there is no way to let this fail > > (the .release function returns void last time i checked). Is this a > > problem, and yes, how we handle that? > > The current vfio has the same problem, we can't unbind a device from > vfio while it's attached to a guest. I think we'd use the same solution > too; send out a netlink packet for a device removal and have the .remove > call sleep on a wait_event(, refcnt == 0). We could also set a timeout > and SIGBUS the PIDs holding the device if they don't return it > willingly. Thanks, Putting the process to sleep (which would be uninterruptible) seems bad. The process would sleep until the guest releases the device-group, which can take days or months. The best thing (and the most intrusive :-) ) is to change PCI core to allow unbindings to fail, I think. But this probably further complicates the way to upstream VFIO... Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
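A rough kernel-side sketch of the release path described here: notify userspace via netlink, then let .remove sleep until the reference count drops, with a timeout as the escape hatch. struct vfio_dev and the vfio_* helpers are invented for illustration; only wait_event_timeout() and the PCI hooks are real.

    #include <linux/atomic.h>
    #include <linux/jiffies.h>
    #include <linux/pci.h>
    #include <linux/wait.h>

    static void vfio_pci_remove(struct pci_dev *pdev)
    {
            struct vfio_dev *vdev = pci_get_drvdata(pdev);

            /* ask userspace to give the device back */
            vfio_netlink_notify_remove(vdev);

            /* .remove cannot fail, so all we can do is wait for the last
             * reference; after the timeout we could escalate, e.g. SIGBUS
             * the holders as suggested above. */
            if (!wait_event_timeout(vdev->release_wait,
                                    atomic_read(&vdev->refcnt) == 0,
                                    msecs_to_jiffies(60 * 1000)))
                    dev_warn(&pdev->dev, "device still in use after timeout\n");

            vfio_free_device(vdev);
    }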
Re: kvm PCI assignment & VFIO ramblings
On Mon, Aug 22, 2011 at 05:03:53PM -0400, Benjamin Herrenschmidt wrote: > > > I am in favour of /dev/vfio/$GROUP. If multiple devices should be > > assigned to a guest, there can also be an ioctl to bind a group to an > > address-space of another group (certainly needs some care to not allow > > that both groups belong to different processes). > > > > Btw, a problem we havn't talked about yet entirely is > > driver-deassignment. User space can decide to de-assign the device from > > vfio while a fd is open on it. With PCI there is no way to let this fail > > (the .release function returns void last time i checked). Is this a > > problem, and yes, how we handle that? > > We can treat it as a hard unplug (like a cardbus gone away). > > IE. Dispose of the direct mappings (switch to MMIO emulation) and return > all ff's from reads (& ignore writes). > > Then send an unplug event via whatever mechanism the platform provides > (ACPI hotplug controller on x86 for example, we haven't quite sorted out > what to do on power for hotplug yet). Hmm, good idea. But as far as I know the hotplug-event needs to be in the guest _before_ the device is actually unplugged (so that the guest can unbind its driver first). That somehow brings back the sleep-idea and the timeout in the .release function. Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Tue, Aug 23, 2011 at 02:54:43AM -0400, Benjamin Herrenschmidt wrote: > Possibly, the question that interest me the most is what interface will > KVM end up using. I'm also not terribly fan with the (perceived) > discrepancy between using uiommu to create groups but using the group fd > to actually do the mappings, at least if that is still the plan. > > If the separate uiommu interface is kept, then anything that wants to be > able to benefit from the ability to put multiple devices (or existing > groups) into such a "meta group" would need to be explicitly modified to > deal with the uiommu APIs. > > I tend to prefer such "meta groups" as being something you create > statically using a configuration interface, either via sysfs, netlink or > ioctl's to a "control" vfio device driven by a simple command line tool > (which can have the configuration stored in /etc and re-apply it at > boot). Hmm, I don't think that these groups are static for the system's run-time. They only exist for the lifetime of a guest by default, at least on x86. That's why I prefer to do this grouping using VFIO and not some sysfs interface (which would be the third interface besides the ioctls and netlink a VFIO user needs to be aware of). Doing this in the ioctl interface just makes things easier. Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Mon, Aug 22, 2011 at 08:52:18PM -0400, aafabbri wrote: > You have to enforce group/iommu domain assignment whether you have the > existing uiommu API, or if you change it to your proposed > ioctl(inherit_iommu) API. > > The only change needed to VFIO here should be to make uiommu fd assignment > happen on the groups instead of on device fds. That operation fails or > succeeds according to the group semantics (all-or-none assignment/same > uiommu). That makes uiommu basically the same as the meta-groups, right? Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Mon, 2011-08-22 at 17:52 -0700, aafabbri wrote: > I'm not following you. > > You have to enforce group/iommu domain assignment whether you have the > existing uiommu API, or if you change it to your proposed > ioctl(inherit_iommu) API. > > The only change needed to VFIO here should be to make uiommu fd assignment > happen on the groups instead of on device fds. That operation fails or > succeeds according to the group semantics (all-or-none assignment/same > uiommu). Ok, so I missed that part where you change uiommu to operate on group fd's rather than device fd's, my apologies if you actually wrote that down :-) It might be obvious ... bare with me I just flew back from the US and I am badly jet lagged ... So I see what you mean, however... > I think the question is: do we force 1:1 iommu/group mapping, or do we allow > arbitrary mapping (satisfying group constraints) as we do today. > > I'm saying I'm an existing user who wants the arbitrary iommu/group mapping > ability and definitely think the uiommu approach is cleaner than the > ioctl(inherit_iommu) approach. We considered that approach before but it > seemed less clean so we went with the explicit uiommu context. Possibly, the question that interest me the most is what interface will KVM end up using. I'm also not terribly fan with the (perceived) discrepancy between using uiommu to create groups but using the group fd to actually do the mappings, at least if that is still the plan. If the separate uiommu interface is kept, then anything that wants to be able to benefit from the ability to put multiple devices (or existing groups) into such a "meta group" would need to be explicitly modified to deal with the uiommu APIs. I tend to prefer such "meta groups" as being something you create statically using a configuration interface, either via sysfs, netlink or ioctl's to a "control" vfio device driven by a simple command line tool (which can have the configuration stored in /etc and re-apply it at boot). That way, any program capable of exploiting VFIO "groups" will automatically be able to exploit those "meta groups" (or groups of groups) as well as long as they are supported on the system. If we ever have system specific constraints as to how such groups can be created, then it can all be handled at the level of that configuration tool without impact on whatever programs know how to exploit them via the VFIO interfaces. > > .../... > > > >> If we in singleton-group land were building our own "groups" which were > >> sets > >> of devices sharing the IOMMU domains we wanted, I suppose we could do away > >> with uiommu fds, but it sounds like the current proposal would create 20 > >> singleton groups (x86 iommu w/o PCI bridges => all devices are > >> partitionable > >> endpoints). Asking me to ioctl(inherit) them together into a blob sounds > >> worse than the current explicit uiommu API. > > > > I'd rather have an API to create super-groups (groups of groups) > > statically and then you can use such groups as normal groups using the > > same interface. That create/management process could be done via a > > simple command line utility or via sysfs banging, whatever... Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Mon, Aug 22, 2011 at 09:45:48AM -0600, Alex Williamson wrote: > On Mon, 2011-08-22 at 15:55 +1000, David Gibson wrote: > > On Sat, Aug 20, 2011 at 09:51:39AM -0700, Alex Williamson wrote: > > > We had an extremely productive VFIO BoF on Monday. Here's my attempt to > > > capture the plan that I think we agreed to: > > > > > > We need to address both the description and enforcement of device > > > groups. Groups are formed any time the iommu does not have resolution > > > between a set of devices. On x86, this typically happens when a > > > PCI-to-PCI bridge exists between the set of devices and the iommu. For > > > Power, partitionable endpoints define a group. Grouping information > > > needs to be exposed for both userspace and kernel internal usage. This > > > will be a sysfs attribute setup by the iommu drivers. Perhaps: > > > > > > # cat /sys/devices/pci:00/:00:19.0/iommu_group > > > 42 > > > > > > (I use a PCI example here, but attribute should not be PCI specific) > > > > Ok. Am I correct in thinking these group IDs are representing the > > minimum granularity, and are therefore always static, defined only by > > the connected hardware, not by configuration? > > Yes, that's the idea. An open question I have towards the configuration > side is whether we might add iommu driver specific options to the > groups. For instance on x86 where we typically have B:D.F granularity, > should we have an option not to trust multi-function devices and use a > B:D granularity for grouping? Right. And likewise I can see a place for configuration parameters like the present 'allow_unsafe_irqs'. But these would be more-or-less global options which affected the overall granularity, rather than detailed configuration such as explicitly binding some devices into a group, yes? > > > >From there we have a few options. In the BoF we discussed a model where > > > binding a device to vfio creates a /dev/vfio$GROUP character device > > > file. This "group" fd provides provides dma mapping ioctls as well as > > > ioctls to enumerate and return a "device" fd for each attached member of > > > the group (similar to KVM_CREATE_VCPU). We enforce grouping by > > > returning an error on open() of the group fd if there are members of the > > > group not bound to the vfio driver. Each device fd would then support a > > > similar set of ioctls and mapping (mmio/pio/config) interface as current > > > vfio, except for the obvious domain and dma ioctls superseded by the > > > group fd. > > > > It seems a slightly strange distinction that the group device appears > > when any device in the group is bound to vfio, but only becomes usable > > when all devices are bound. > > > > > Another valid model might be that /dev/vfio/$GROUP is created for all > > > groups when the vfio module is loaded. The group fd would allow open() > > > and some set of iommu querying and device enumeration ioctls, but would > > > error on dma mapping and retrieving device fds until all of the group > > > devices are bound to the vfio driver. > > > > Which is why I marginally prefer this model, although it's not a big > > deal. > > Right, we can also combine models. Binding a device to vfio > creates /dev/vfio$GROUP, which only allows a subset of ioctls and no > device access until all the group devices are also bound. I think > the /dev/vfio/$GROUP might help provide an enumeration interface as well > though, which could be useful. I'm not entirely sure what you mean here. 
But, that's now several weak votes in favour of the always-present group devices, and none in favour of the created-when-first-device-bound model, so I suggest we take the /dev/vfio/$GROUP as our tentative approach. > > > In either case, the uiommu interface is removed entirely since dma > > > mapping is done via the group fd. As necessary in the future, we can > > > define a more high performance dma mapping interface for streaming dma > > > via the group fd. I expect we'll also include architecture specific > > > group ioctls to describe features and capabilities of the iommu. The > > > group fd will need to prevent concurrent open()s to maintain a 1:1 group > > > to userspace process ownership model. > > > > A 1:1 group<->process correspondance seems wrong to me. But there are > > many ways you could legitimately write the userspace side of the code, > > many of them involving some sort of concurrency. Implementing that > > concurrency as multiple processes (using explicit shared memory and/or > > other IPC mechanisms to co-ordinate) seems a valid choice that we > > shouldn't arbitrarily prohibit. > > > > Obviously, only one UID may be permitted to have the group open at a > > time, and I think that's enough to prevent them doing any worse than > > shooting themselves in the foot. > > 1:1 group<->process is probably too strong. Not allowing concurrent > open()s on the group file enforces a single userspace entity is > responsible for that group
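One way to implement "a single userspace entity is responsible for the group" without tying ownership to a UID or to one process is simply to refuse concurrent open()s of the group chardev. A sketch, with struct vfio_group and its fields invented for illustration:

    #include <linux/atomic.h>
    #include <linux/cdev.h>
    #include <linux/fs.h>
    #include <linux/kernel.h>

    static int vfio_group_open(struct inode *inode, struct file *filep)
    {
            struct vfio_group *group = container_of(inode->i_cdev,
                                                    struct vfio_group, cdev);

            /* the 0 -> 1 transition succeeds exactly once until release() */
            if (atomic_cmpxchg(&group->opened, 0, 1) != 0)
                    return -EBUSY;

            filep->private_data = group;
            return 0;
    }

    static int vfio_group_release(struct inode *inode, struct file *filep)
    {
            struct vfio_group *group = filep->private_data;

            atomic_set(&group->opened, 0);
            return 0;
    }

Device fds obtained through the group fd can still be handed to other processes, so concurrency inside one logical owner remains possible.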
Re: kvm PCI assignment & VFIO ramblings
On 8/22/11 2:49 PM, "Benjamin Herrenschmidt" wrote: > >>> I wouldn't use uiommu for that. >> >> Any particular reason besides saving a file descriptor? >> >> We use it today, and it seems like a cleaner API than what you propose >> changing it to. > > Well for one, we are back to square one vs. grouping constraints. I'm not following you. You have to enforce group/iommu domain assignment whether you have the existing uiommu API, or if you change it to your proposed ioctl(inherit_iommu) API. The only change needed to VFIO here should be to make uiommu fd assignment happen on the groups instead of on device fds. That operation fails or succeeds according to the group semantics (all-or-none assignment/same uiommu). I think the question is: do we force 1:1 iommu/group mapping, or do we allow arbitrary mapping (satisfying group constraints) as we do today. I'm saying I'm an existing user who wants the arbitrary iommu/group mapping ability and definitely think the uiommu approach is cleaner than the ioctl(inherit_iommu) approach. We considered that approach before but it seemed less clean so we went with the explicit uiommu context. > .../... > >> If we in singleton-group land were building our own "groups" which were sets >> of devices sharing the IOMMU domains we wanted, I suppose we could do away >> with uiommu fds, but it sounds like the current proposal would create 20 >> singleton groups (x86 iommu w/o PCI bridges => all devices are partitionable >> endpoints). Asking me to ioctl(inherit) them together into a blob sounds >> worse than the current explicit uiommu API. > > I'd rather have an API to create super-groups (groups of groups) > statically and then you can use such groups as normal groups using the > same interface. That create/management process could be done via a > simple command line utility or via sysfs banging, whatever... -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
> > I wouldn't use uiommu for that. > > Any particular reason besides saving a file descriptor? > > We use it today, and it seems like a cleaner API than what you propose > changing it to. Well for one, we are back to square one vs. grouping constraints. .../... > If we in singleton-group land were building our own "groups" which were sets > of devices sharing the IOMMU domains we wanted, I suppose we could do away > with uiommu fds, but it sounds like the current proposal would create 20 > singleton groups (x86 iommu w/o PCI bridges => all devices are partitionable > endpoints). Asking me to ioctl(inherit) them together into a blob sounds > worse than the current explicit uiommu API. I'd rather have an API to create super-groups (groups of groups) statically and then you can use such groups as normal groups using the same interface. That create/management process could be done via a simple command line utility or via sysfs banging, whatever... Cheers, Ben. > Thanks, > Aaron > > > > > Another option is to make that static configuration APIs via special > > ioctls (or even netlink if you really like it), to change the grouping > > on architectures that allow it. > > > > Cheers. > > Ben. > > > >> > >> -Aaron > >> > >>> As necessary in the future, we can > >>> define a more high performance dma mapping interface for streaming dma > >>> via the group fd. I expect we'll also include architecture specific > >>> group ioctls to describe features and capabilities of the iommu. The > >>> group fd will need to prevent concurrent open()s to maintain a 1:1 group > >>> to userspace process ownership model. > >>> > >>> Also on the table is supporting non-PCI devices with vfio. To do this, > >>> we need to generalize the read/write/mmap and irq eventfd interfaces. > >>> We could keep the same model of segmenting the device fd address space, > >>> perhaps adding ioctls to define the segment offset bit position or we > >>> could split each region into it's own fd (VFIO_GET_PCI_BAR_FD(0), > >>> VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already > >>> suffering some degree of fd bloat (group fd, device fd(s), interrupt > >>> event fd(s), per resource fd, etc). For interrupts we can overload > >>> VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI > >>> devices support MSI?). > >>> > >>> For qemu, these changes imply we'd only support a model where we have a > >>> 1:1 group to iommu domain. The current vfio driver could probably > >>> become vfio-pci as we might end up with more target specific vfio > >>> drivers for non-pci. PCI should be able to maintain a simple -device > >>> vfio-pci,host=bb:dd.f to enable hotplug of individual devices. We'll > >>> need to come up with extra options when we need to expose groups to > >>> guest for pvdma. > >>> > >>> Hope that captures it, feel free to jump in with corrections and > >>> suggestions. Thanks, > >>> > >>> Alex > >>> > > > > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On 8/22/11 1:49 PM, "Benjamin Herrenschmidt" wrote: > On Mon, 2011-08-22 at 13:29 -0700, aafabbri wrote: > >>> Each device fd would then support a >>> similar set of ioctls and mapping (mmio/pio/config) interface as current >>> vfio, except for the obvious domain and dma ioctls superseded by the >>> group fd. >>> >>> Another valid model might be that /dev/vfio/$GROUP is created for all >>> groups when the vfio module is loaded. The group fd would allow open() >>> and some set of iommu querying and device enumeration ioctls, but would >>> error on dma mapping and retrieving device fds until all of the group >>> devices are bound to the vfio driver. >>> >>> In either case, the uiommu interface is removed entirely since dma >>> mapping is done via the group fd. >> >> The loss in generality is unfortunate. I'd like to be able to support >> arbitrary iommu domain <-> device assignment. One way to do this would be >> to keep uiommu, but to return an error if someone tries to assign more than >> one uiommu context to devices in the same group. > > I wouldn't use uiommu for that. Any particular reason besides saving a file descriptor? We use it today, and it seems like a cleaner API than what you propose changing it to. > If the HW or underlying kernel drivers > support it, what I'd suggest is that you have an (optional) ioctl to > bind two groups (you have to have both opened already) or for one group > to "capture" another one. You'll need other rules there too.. "both opened already, but zero mappings performed yet as they would have instantiated a default IOMMU domain". Keep in mind the only case I'm using is singleton groups, a.k.a. devices. Since what I want is to specify which devices can do things like share network buffers (in a way that conserves IOMMU hw resources), it seems cleanest to expose this explicitly, versus some "inherit iommu domain from another device" ioctl. What happens if I do something like this: dev1_fd = open ("/dev/vfio0") dev2_fd = open ("/dev/vfio1") dev2_fd.inherit_iommu(dev1_fd) error = close(dev1_fd) There are other gross cases as well. > > The binding means under the hood the iommus get shared, with the > lifetime being that of the "owning" group. So what happens in the close() above? EINUSE? Reset all children? Still seems less clean than having an explicit iommu fd. Without some benefit I'm not sure why we'd want to change this API. If we in singleton-group land were building our own "groups" which were sets of devices sharing the IOMMU domains we wanted, I suppose we could do away with uiommu fds, but it sounds like the current proposal would create 20 singleton groups (x86 iommu w/o PCI bridges => all devices are partitionable endpoints). Asking me to ioctl(inherit) them together into a blob sounds worse than the current explicit uiommu API. Thanks, Aaron > > Another option is to make that static configuration APIs via special > ioctls (or even netlink if you really like it), to change the grouping > on architectures that allow it. > > Cheers. > Ben. > >> >> -Aaron >> >>> As necessary in the future, we can >>> define a more high performance dma mapping interface for streaming dma >>> via the group fd. I expect we'll also include architecture specific >>> group ioctls to describe features and capabilities of the iommu. The >>> group fd will need to prevent concurrent open()s to maintain a 1:1 group >>> to userspace process ownership model. >>> >>> Also on the table is supporting non-PCI devices with vfio. 
To do this, >>> we need to generalize the read/write/mmap and irq eventfd interfaces. >>> We could keep the same model of segmenting the device fd address space, >>> perhaps adding ioctls to define the segment offset bit position or we >>> could split each region into it's own fd (VFIO_GET_PCI_BAR_FD(0), >>> VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already >>> suffering some degree of fd bloat (group fd, device fd(s), interrupt >>> event fd(s), per resource fd, etc). For interrupts we can overload >>> VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI >>> devices support MSI?). >>> >>> For qemu, these changes imply we'd only support a model where we have a >>> 1:1 group to iommu domain. The current vfio driver could probably >>> become vfio-pci as we might end up with more target specific vfio >>> drivers for non-pci. PCI should be able to maintain a simple -device >>> vfio-pci,host=bb:dd.f to enable hotplug of individual devices. We'll >>> need to come up with extra options when we need to expose groups to >>> guest for pvdma. >>> >>> Hope that captures it, feel free to jump in with corrections and >>> suggestions. Thanks, >>> >>> Alex >>> > > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
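Aaron's problem sequence above, spelled out in C; /dev/vfio0, /dev/vfio1 and the inherit ioctl are the hypothetical interface being debated, not an existing one.

    #include <fcntl.h>
    #include <sys/ioctl.h>
    #include <unistd.h>

    #define VFIO_INHERIT_IOMMU 0x3b10       /* hypothetical ioctl number */

    void example(void)
    {
            int dev1_fd = open("/dev/vfio0", O_RDWR);
            int dev2_fd = open("/dev/vfio1", O_RDWR);

            /* dev2 now shares dev1's iommu domain */
            ioctl(dev2_fd, VFIO_INHERIT_IOMMU, dev1_fd);

            /* What should this do?  Fail with -EBUSY, tear down dev2's
             * mappings as well, or keep the shared domain alive until
             * dev2_fd also goes away?  That lifetime ambiguity is what the
             * explicit uiommu fd avoids. */
            close(dev1_fd);

            close(dev2_fd);
    }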
Re: kvm PCI assignment & VFIO ramblings
> I am in favour of /dev/vfio/$GROUP. If multiple devices should be > assigned to a guest, there can also be an ioctl to bind a group to an > address-space of another group (certainly needs some care to not allow > that both groups belong to different processes). > > Btw, a problem we havn't talked about yet entirely is > driver-deassignment. User space can decide to de-assign the device from > vfio while a fd is open on it. With PCI there is no way to let this fail > (the .release function returns void last time i checked). Is this a > problem, and yes, how we handle that? We can treat it as a hard unplug (like a cardbus gone away). IE. Dispose of the direct mappings (switch to MMIO emulation) and return all ff's from reads (& ignore writes). Then send an unplug event via whatever mechanism the platform provides (ACPI hotplug controller on x86 for example, we haven't quite sorted out what to do on power for hotplug yet). Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Mon, 2011-08-22 at 09:45 -0600, Alex Williamson wrote: > Yes, that's the idea. An open question I have towards the configuration > side is whether we might add iommu driver specific options to the > groups. For instance on x86 where we typically have B:D.F granularity, > should we have an option not to trust multi-function devices and use a > B:D granularity for grouping? Or even B or range of busses... if you want to enforce strict isolation you really can't trust anything below a bus level :-) > Right, we can also combine models. Binding a device to vfio > creates /dev/vfio$GROUP, which only allows a subset of ioctls and no > device access until all the group devices are also bound. I think > the /dev/vfio/$GROUP might help provide an enumeration interface as well > though, which could be useful. Could be tho in what form ? returning sysfs pathes ? > 1:1 group<->process is probably too strong. Not allowing concurrent > open()s on the group file enforces a single userspace entity is > responsible for that group. Device fds can be passed to other > processes, but only retrieved via the group fd. I suppose we could even > branch off the dma interface into a different fd, but it seems like we > would logically want to serialize dma mappings at each iommu group > anyway. I'm open to alternatives, this just seemed an easy way to do > it. Restricting on UID implies that we require isolated qemu instances > to run as different UIDs. I know that's a goal, but I don't know if we > want to make it an assumption in the group security model. 1:1 process has the advantage of linking to an -mm which makes the whole mmu notifier business doable. How do you want to track down mappings and do the second level translation in the case of explicit map/unmap (like on power) if you are not tied to an mm_struct ? > Yes. I'm not sure there's a good ROI to prioritize that model. We have > to assume >1 device per guest is a typical model and that the iotlb is > large enough that we might improve thrashing to see both a resource and > performance benefit from it. I'm open to suggestions for how we could > include it though. Sharing may or may not be possible depending on setups so yes, it's a bit tricky. My preference is to have a static interface (and that's actually where your pet netlink might make some sense :-) to create "synthetic" groups made of other groups if the arch allows it. But that might not be the best approach. In another email I also proposed an option for a group to "capture" another one... > > If that's > > not what you're saying, how would the domains - now made up of a > > user's selection of groups, rather than individual devices - be > > configured? > > > > > Hope that captures it, feel free to jump in with corrections and > > > suggestions. Thanks, > > Another aspect I don't see discussed is how we represent these things to the guest. On Power for example, I have a requirement that a given iommu domain is represented by a single dma window property in the device-tree. What that means is that that property needs to be either in the node of the device itself if there's only one device in the group or in a parent node (ie a bridge or host bridge) if there are multiple devices. Now I do -not- want to go down the path of simulating P2P bridges, besides we'll quickly run out of bus numbers if we go there. For us the most simple and logical approach (which is also what pHyp uses and what Linux handles well) is really to expose a given PCI host bridge per group to the guest. 
Believe it or not, it makes things easier :-) Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Mon, 2011-08-22 at 09:30 +0300, Avi Kivity wrote: > On 08/20/2011 07:51 PM, Alex Williamson wrote: > > We need to address both the description and enforcement of device > > groups. Groups are formed any time the iommu does not have resolution > > between a set of devices. On x86, this typically happens when a > > PCI-to-PCI bridge exists between the set of devices and the iommu. For > > Power, partitionable endpoints define a group. Grouping information > > needs to be exposed for both userspace and kernel internal usage. This > > will be a sysfs attribute setup by the iommu drivers. Perhaps: > > > > # cat /sys/devices/pci:00/:00:19.0/iommu_group > > 42 > > > > $ readlink /sys/devices/pci:00/:00:19.0/iommu_group > ../../../path/to/device/which/represents/the/resource/constraint > > (the pci-to-pci bridge on x86, or whatever node represents partitionable > endpoints on power) The constraint might not necessarily be a device. The PCI bridge is just an example. There are other possible constraints. On POWER for example, it could be a limit in how far I can segment the DMA address space, forcing me to arbitrarily put devices together, or it could be a similar constraint related to how the MMIO space is broken up. So either that remains a path in which case we do have a separate set of sysfs nodes representing the groups themselves which may or may not itself contain a pointer to the "constraining" device, or we just make that an arbitrary number (in my case the PE#) Cheers, Ben -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
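The two representations weighed here, an iommu_group symlink pointing at whatever object embodies the constraint versus a bare number such as a PE#, look different to a consumer. A small sketch that copes with either form; the paths are only examples.

    #include <limits.h>
    #include <stdio.h>
    #include <unistd.h>

    void show_group(const char *attr)  /* e.g. ".../0000:00:19.0/iommu_group" */
    {
            char target[PATH_MAX];
            ssize_t n = readlink(attr, target, sizeof(target) - 1);

            if (n > 0) {                    /* symlink form */
                    target[n] = '\0';
                    printf("group object: %s\n", target);
                    return;
            }

            FILE *f = fopen(attr, "r");     /* plain-number form */
            int id;
            if (f && fscanf(f, "%d", &id) == 1)
                    printf("group id: %d\n", id);
            if (f)
                    fclose(f);
    }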
Re: kvm PCI assignment & VFIO ramblings
On Mon, 2011-08-22 at 13:29 -0700, aafabbri wrote: > > Each device fd would then support a > > similar set of ioctls and mapping (mmio/pio/config) interface as current > > vfio, except for the obvious domain and dma ioctls superseded by the > > group fd. > > > > Another valid model might be that /dev/vfio/$GROUP is created for all > > groups when the vfio module is loaded. The group fd would allow open() > > and some set of iommu querying and device enumeration ioctls, but would > > error on dma mapping and retrieving device fds until all of the group > > devices are bound to the vfio driver. > > > > In either case, the uiommu interface is removed entirely since dma > > mapping is done via the group fd. > > The loss in generality is unfortunate. I'd like to be able to support > arbitrary iommu domain <-> device assignment. One way to do this would be > to keep uiommu, but to return an error if someone tries to assign more than > one uiommu context to devices in the same group. I wouldn't use uiommu for that. If the HW or underlying kernel drivers support it, what I'd suggest is that you have an (optional) ioctl to bind two groups (you have to have both opened already) or for one group to "capture" another one. The binding means under the hood the iommus get shared, with the lifetime being that of the "owning" group. Another option is to make that static configuration APIs via special ioctls (or even netlink if you really like it), to change the grouping on architectures that allow it. Cheers. Ben. > > -Aaron > > > As necessary in the future, we can > > define a more high performance dma mapping interface for streaming dma > > via the group fd. I expect we'll also include architecture specific > > group ioctls to describe features and capabilities of the iommu. The > > group fd will need to prevent concurrent open()s to maintain a 1:1 group > > to userspace process ownership model. > > > > Also on the table is supporting non-PCI devices with vfio. To do this, > > we need to generalize the read/write/mmap and irq eventfd interfaces. > > We could keep the same model of segmenting the device fd address space, > > perhaps adding ioctls to define the segment offset bit position or we > > could split each region into it's own fd (VFIO_GET_PCI_BAR_FD(0), > > VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already > > suffering some degree of fd bloat (group fd, device fd(s), interrupt > > event fd(s), per resource fd, etc). For interrupts we can overload > > VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI > > devices support MSI?). > > > > For qemu, these changes imply we'd only support a model where we have a > > 1:1 group to iommu domain. The current vfio driver could probably > > become vfio-pci as we might end up with more target specific vfio > > drivers for non-pci. PCI should be able to maintain a simple -device > > vfio-pci,host=bb:dd.f to enable hotplug of individual devices. We'll > > need to come up with extra options when we need to expose groups to > > guest for pvdma. > > > > Hope that captures it, feel free to jump in with corrections and > > suggestions. Thanks, > > > > Alex > > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On 8/20/11 9:51 AM, "Alex Williamson" wrote: > We had an extremely productive VFIO BoF on Monday. Here's my attempt to > capture the plan that I think we agreed to: > > We need to address both the description and enforcement of device > groups. Groups are formed any time the iommu does not have resolution > between a set of devices. On x86, this typically happens when a > PCI-to-PCI bridge exists between the set of devices and the iommu. For > Power, partitionable endpoints define a group. Grouping information > needs to be exposed for both userspace and kernel internal usage. This > will be a sysfs attribute setup by the iommu drivers. Perhaps: > > # cat /sys/devices/pci:00/:00:19.0/iommu_group > 42 > > (I use a PCI example here, but attribute should not be PCI specific) > > From there we have a few options. In the BoF we discussed a model where > binding a device to vfio creates a /dev/vfio$GROUP character device > file. This "group" fd provides provides dma mapping ioctls as well as > ioctls to enumerate and return a "device" fd for each attached member of > the group (similar to KVM_CREATE_VCPU). We enforce grouping by > returning an error on open() of the group fd if there are members of the > group not bound to the vfio driver. Sounds reasonable. > Each device fd would then support a > similar set of ioctls and mapping (mmio/pio/config) interface as current > vfio, except for the obvious domain and dma ioctls superseded by the > group fd. > > Another valid model might be that /dev/vfio/$GROUP is created for all > groups when the vfio module is loaded. The group fd would allow open() > and some set of iommu querying and device enumeration ioctls, but would > error on dma mapping and retrieving device fds until all of the group > devices are bound to the vfio driver. > > In either case, the uiommu interface is removed entirely since dma > mapping is done via the group fd. The loss in generality is unfortunate. I'd like to be able to support arbitrary iommu domain <-> device assignment. One way to do this would be to keep uiommu, but to return an error if someone tries to assign more than one uiommu context to devices in the same group. -Aaron > As necessary in the future, we can > define a more high performance dma mapping interface for streaming dma > via the group fd. I expect we'll also include architecture specific > group ioctls to describe features and capabilities of the iommu. The > group fd will need to prevent concurrent open()s to maintain a 1:1 group > to userspace process ownership model. > > Also on the table is supporting non-PCI devices with vfio. To do this, > we need to generalize the read/write/mmap and irq eventfd interfaces. > We could keep the same model of segmenting the device fd address space, > perhaps adding ioctls to define the segment offset bit position or we > could split each region into it's own fd (VFIO_GET_PCI_BAR_FD(0), > VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already > suffering some degree of fd bloat (group fd, device fd(s), interrupt > event fd(s), per resource fd, etc). For interrupts we can overload > VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI > devices support MSI?). > > For qemu, these changes imply we'd only support a model where we have a > 1:1 group to iommu domain. The current vfio driver could probably > become vfio-pci as we might end up with more target specific vfio > drivers for non-pci. 
PCI should be able to maintain a simple -device > vfio-pci,host=bb:dd.f to enable hotplug of individual devices. We'll > need to come up with extra options when we need to expose groups to > guest for pvdma. > > Hope that captures it, feel free to jump in with corrections and > suggestions. Thanks, > > Alex > -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
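Aaron's fallback suggestion above, keep uiommu but fail when a second uiommu context is assigned within one group, could be enforced with a check along these lines; every structure and field here is invented for illustration.

    #include <linux/errno.h>
    #include <linux/mutex.h>

    static int vfio_set_uiommu(struct vfio_device *vdev, struct uiommu_ctx *ctx)
    {
            struct vfio_group *group = vdev->group;
            int ret = 0;

            mutex_lock(&group->lock);
            if (group->uiommu && group->uiommu != ctx)
                    ret = -EBUSY;          /* group already bound to another context */
            else
                    group->uiommu = ctx;   /* every member shares this context */
            mutex_unlock(&group->lock);

            return ret;
    }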
Re: kvm PCI assignment & VFIO ramblings
On Mon, 2011-08-22 at 19:25 +0200, Joerg Roedel wrote: > On Sat, Aug 20, 2011 at 12:51:39PM -0400, Alex Williamson wrote: > > We had an extremely productive VFIO BoF on Monday. Here's my attempt to > > capture the plan that I think we agreed to: > > > > We need to address both the description and enforcement of device > > groups. Groups are formed any time the iommu does not have resolution > > between a set of devices. On x86, this typically happens when a > > PCI-to-PCI bridge exists between the set of devices and the iommu. For > > Power, partitionable endpoints define a group. Grouping information > > needs to be exposed for both userspace and kernel internal usage. This > > will be a sysfs attribute setup by the iommu drivers. Perhaps: > > > > # cat /sys/devices/pci:00/:00:19.0/iommu_group > > 42 > > Right, that is mainly for libvirt to provide that information to the > user in a meaningful way. So userspace is aware that other devices might > not work anymore when it assigns one to a guest. > > > > > (I use a PCI example here, but attribute should not be PCI specific) > > > > From there we have a few options. In the BoF we discussed a model where > > binding a device to vfio creates a /dev/vfio$GROUP character device > > file. This "group" fd provides provides dma mapping ioctls as well as > > ioctls to enumerate and return a "device" fd for each attached member of > > the group (similar to KVM_CREATE_VCPU). We enforce grouping by > > returning an error on open() of the group fd if there are members of the > > group not bound to the vfio driver. Each device fd would then support a > > similar set of ioctls and mapping (mmio/pio/config) interface as current > > vfio, except for the obvious domain and dma ioctls superseded by the > > group fd. > > > > Another valid model might be that /dev/vfio/$GROUP is created for all > > groups when the vfio module is loaded. The group fd would allow open() > > and some set of iommu querying and device enumeration ioctls, but would > > error on dma mapping and retrieving device fds until all of the group > > devices are bound to the vfio driver. > > I am in favour of /dev/vfio/$GROUP. If multiple devices should be > assigned to a guest, there can also be an ioctl to bind a group to an > address-space of another group (certainly needs some care to not allow > that both groups belong to different processes). That's an interesting idea. Maybe an interface similar to the current uiommu interface, where you open() the 2nd group fd and pass the fd via ioctl to the primary group. IOMMUs that don't support this would fail the attach device callback, which would fail the ioctl to bind them. It will need to be designed so any group can be removed from the super-set and the remaining group(s) still works. This feels like something that can be added after we get an initial implementation. > Btw, a problem we havn't talked about yet entirely is > driver-deassignment. User space can decide to de-assign the device from > vfio while a fd is open on it. With PCI there is no way to let this fail > (the .release function returns void last time i checked). Is this a > problem, and yes, how we handle that? The current vfio has the same problem, we can't unbind a device from vfio while it's attached to a guest. I think we'd use the same solution too; send out a netlink packet for a device removal and have the .remove call sleep on a wait_event(, refcnt == 0). We could also set a timeout and SIGBUS the PIDs holding the device if they don't return it willingly. 
Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Sat, Aug 20, 2011 at 12:51:39PM -0400, Alex Williamson wrote: > We had an extremely productive VFIO BoF on Monday. Here's my attempt to > capture the plan that I think we agreed to: > > We need to address both the description and enforcement of device > groups. Groups are formed any time the iommu does not have resolution > between a set of devices. On x86, this typically happens when a > PCI-to-PCI bridge exists between the set of devices and the iommu. For > Power, partitionable endpoints define a group. Grouping information > needs to be exposed for both userspace and kernel internal usage. This > will be a sysfs attribute setup by the iommu drivers. Perhaps: > > # cat /sys/devices/pci:00/:00:19.0/iommu_group > 42 Right, that is mainly for libvirt to provide that information to the user in a meaningful way. So userspace is aware that other devices might not work anymore when it assigns one to a guest. > > (I use a PCI example here, but attribute should not be PCI specific) > > From there we have a few options. In the BoF we discussed a model where > binding a device to vfio creates a /dev/vfio$GROUP character device > file. This "group" fd provides provides dma mapping ioctls as well as > ioctls to enumerate and return a "device" fd for each attached member of > the group (similar to KVM_CREATE_VCPU). We enforce grouping by > returning an error on open() of the group fd if there are members of the > group not bound to the vfio driver. Each device fd would then support a > similar set of ioctls and mapping (mmio/pio/config) interface as current > vfio, except for the obvious domain and dma ioctls superseded by the > group fd. > > Another valid model might be that /dev/vfio/$GROUP is created for all > groups when the vfio module is loaded. The group fd would allow open() > and some set of iommu querying and device enumeration ioctls, but would > error on dma mapping and retrieving device fds until all of the group > devices are bound to the vfio driver. I am in favour of /dev/vfio/$GROUP. If multiple devices should be assigned to a guest, there can also be an ioctl to bind a group to an address-space of another group (certainly needs some care to not allow that both groups belong to different processes). Btw, a problem we havn't talked about yet entirely is driver-deassignment. User space can decide to de-assign the device from vfio while a fd is open on it. With PCI there is no way to let this fail (the .release function returns void last time i checked). Is this a problem, and yes, how we handle that? Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
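For the libvirt-style consumer Joerg mentions, discovering which other devices would be dragged along with an assignment is just a walk over the proposed sysfs attribute. A minimal sketch, assuming the attribute holds a bare group number as in the example above:

    #include <dirent.h>
    #include <stdio.h>

    int main(void)
    {
            DIR *d = opendir("/sys/bus/pci/devices");
            struct dirent *e;
            char path[512];
            int group;

            if (!d)
                    return 1;

            while ((e = readdir(d)) != NULL) {
                    if (e->d_name[0] == '.')
                            continue;
                    snprintf(path, sizeof(path),
                             "/sys/bus/pci/devices/%s/iommu_group", e->d_name);
                    FILE *f = fopen(path, "r");
                    if (!f)
                            continue;      /* no group exposed for this device */
                    if (fscanf(f, "%d", &group) == 1)
                            printf("%s -> group %d\n", e->d_name, group);
                    fclose(f);
            }
            closedir(d);
            return 0;
    }

Two devices printing the same group number cannot be split between a guest and the host.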
Re: kvm PCI assignment & VFIO ramblings
On Mon, 2011-08-22 at 15:55 +1000, David Gibson wrote: > On Sat, Aug 20, 2011 at 09:51:39AM -0700, Alex Williamson wrote: > > We had an extremely productive VFIO BoF on Monday. Here's my attempt to > > capture the plan that I think we agreed to: > > > > We need to address both the description and enforcement of device > > groups. Groups are formed any time the iommu does not have resolution > > between a set of devices. On x86, this typically happens when a > > PCI-to-PCI bridge exists between the set of devices and the iommu. For > > Power, partitionable endpoints define a group. Grouping information > > needs to be exposed for both userspace and kernel internal usage. This > > will be a sysfs attribute setup by the iommu drivers. Perhaps: > > > > # cat /sys/devices/pci:00/:00:19.0/iommu_group > > 42 > > > > (I use a PCI example here, but attribute should not be PCI specific) > > Ok. Am I correct in thinking these group IDs are representing the > minimum granularity, and are therefore always static, defined only by > the connected hardware, not by configuration? Yes, that's the idea. An open question I have towards the configuration side is whether we might add iommu driver specific options to the groups. For instance on x86 where we typically have B:D.F granularity, should we have an option not to trust multi-function devices and use a B:D granularity for grouping? > > >From there we have a few options. In the BoF we discussed a model where > > binding a device to vfio creates a /dev/vfio$GROUP character device > > file. This "group" fd provides provides dma mapping ioctls as well as > > ioctls to enumerate and return a "device" fd for each attached member of > > the group (similar to KVM_CREATE_VCPU). We enforce grouping by > > returning an error on open() of the group fd if there are members of the > > group not bound to the vfio driver. Each device fd would then support a > > similar set of ioctls and mapping (mmio/pio/config) interface as current > > vfio, except for the obvious domain and dma ioctls superseded by the > > group fd. > > It seems a slightly strange distinction that the group device appears > when any device in the group is bound to vfio, but only becomes usable > when all devices are bound. > > > Another valid model might be that /dev/vfio/$GROUP is created for all > > groups when the vfio module is loaded. The group fd would allow open() > > and some set of iommu querying and device enumeration ioctls, but would > > error on dma mapping and retrieving device fds until all of the group > > devices are bound to the vfio driver. > > Which is why I marginally prefer this model, although it's not a big > deal. Right, we can also combine models. Binding a device to vfio creates /dev/vfio$GROUP, which only allows a subset of ioctls and no device access until all the group devices are also bound. I think the /dev/vfio/$GROUP might help provide an enumeration interface as well though, which could be useful. > > In either case, the uiommu interface is removed entirely since dma > > mapping is done via the group fd. As necessary in the future, we can > > define a more high performance dma mapping interface for streaming dma > > via the group fd. I expect we'll also include architecture specific > > group ioctls to describe features and capabilities of the iommu. The > > group fd will need to prevent concurrent open()s to maintain a 1:1 group > > to userspace process ownership model. > > A 1:1 group<->process correspondance seems wrong to me. 
But there are > many ways you could legitimately write the userspace side of the code, > many of them involving some sort of concurrency. Implementing that > concurrency as multiple processes (using explicit shared memory and/or > other IPC mechanisms to co-ordinate) seems a valid choice that we > shouldn't arbitrarily prohibit. > > Obviously, only one UID may be permitted to have the group open at a > time, and I think that's enough to prevent them doing any worse than > shooting themselves in the foot. 1:1 group<->process is probably too strong. Not allowing concurrent open()s on the group file enforces a single userspace entity is responsible for that group. Device fds can be passed to other processes, but only retrieved via the group fd. I suppose we could even branch off the dma interface into a different fd, but it seems like we would logically want to serialize dma mappings at each iommu group anyway. I'm open to alternatives, this just seemed an easy way to do it. Restricting on UID implies that we require isolated qemu instances to run as different UIDs. I know that's a goal, but I don't know if we want to make it an assumption in the group security model. > > Also on the table is supporting non-PCI devices with vfio. To do this, > > we need to generalize the read/write/mmap and irq eventfd interfaces. > > We could keep the same model of segmenting the device fd address space, > > perhaps adding ioc
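The no-concurrent-open()s rule is cheap to enforce in the group chardev itself; a minimal sketch, with the vfio_group structure and the inode-to-group lookup as placeholders:

    #include <linux/fs.h>
    #include <linux/atomic.h>
    #include <linux/errno.h>

    struct vfio_group {
            atomic_t opened;   /* 0 = free, 1 = claimed by some userspace entity */
            /* ... iommu domain, member devices, etc. ... */
    };

    struct vfio_group *vfio_group_from_inode(struct inode *inode); /* placeholder */

    static int vfio_group_open(struct inode *inode, struct file *filp)
    {
            struct vfio_group *group = vfio_group_from_inode(inode);

            /* Only one open() at a time; device fds handed out from here
             * can still be passed to other processes if the owner wants. */
            if (atomic_cmpxchg(&group->opened, 0, 1) != 0)
                    return -EBUSY;

            filp->private_data = group;
            return 0;
    }

    static int vfio_group_release(struct inode *inode, struct file *filp)
    {
            struct vfio_group *group = filp->private_data;

            atomic_set(&group->opened, 0);
            return 0;
    }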
Re: kvm PCI assignment & VFIO ramblings
On Mon, Aug 22, 2011 at 09:17:41AM -0400, Avi Kivity wrote: > On 08/22/2011 04:15 PM, Roedel, Joerg wrote: > > On Mon, Aug 22, 2011 at 09:06:07AM -0400, Avi Kivity wrote: > > > On 08/22/2011 03:55 PM, Roedel, Joerg wrote: > > > > > > Well, I don't think its really meaningless, but we need some way to > > > > communicate the information about device groups to userspace. > > > > > > I mean the contents of the group descriptor. There are enough 42s in > > > the kernel, it's better if we can replace a synthetic number with > > > something meaningful. > > > > If we only look at PCI than a Segment:Bus:Dev.Fn Number would be > > sufficient, of course. But the idea was to make it generic enough so > > that it works with !PCI too. > > > > We could make it an arch defined string instead of a symlink. So it > doesn't return 42, rather something that can be used by the admin to > figure out what the problem was. Well, ok, it would certainly differ from the in-kernel representation then and introduce new architecture dependencies into libvirt. But if the 'group-string' is more meaningful to users then its certainly good. Suggestions? Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
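Whether the attribute ends up holding a bare number or an arch-defined string, the sysfs plumbing is the same; a sketch of the per-device attribute, with iommu_group_name() standing in for whatever hook the iommu driver provides (a PE label on Power, something RID/bridge-based on x86):

    #include <linux/kernel.h>
    #include <linux/device.h>
    #include <linux/sysfs.h>
    #include <linux/stat.h>

    const char *iommu_group_name(struct device *dev);  /* placeholder hook */

    static ssize_t iommu_group_show(struct device *dev,
                                    struct device_attribute *attr, char *buf)
    {
            return sprintf(buf, "%s\n", iommu_group_name(dev));
    }
    static DEVICE_ATTR(iommu_group, S_IRUGO, iommu_group_show, NULL);

    /* The iommu driver would call, for each device it can isolate:
     *
     *      device_create_file(dev, &dev_attr_iommu_group);
     */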
Re: kvm PCI assignment & VFIO ramblings
On 08/22/2011 04:15 PM, Roedel, Joerg wrote: On Mon, Aug 22, 2011 at 09:06:07AM -0400, Avi Kivity wrote: > On 08/22/2011 03:55 PM, Roedel, Joerg wrote: > > Well, I don't think its really meaningless, but we need some way to > > communicate the information about device groups to userspace. > > I mean the contents of the group descriptor. There are enough 42s in > the kernel, it's better if we can replace a synthetic number with > something meaningful. If we only look at PCI than a Segment:Bus:Dev.Fn Number would be sufficient, of course. But the idea was to make it generic enough so that it works with !PCI too. We could make it an arch defined string instead of a symlink. So it doesn't return 42, rather something that can be used by the admin to figure out what the problem was. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Mon, Aug 22, 2011 at 09:06:07AM -0400, Avi Kivity wrote: > On 08/22/2011 03:55 PM, Roedel, Joerg wrote: > > Well, I don't think its really meaningless, but we need some way to > > communicate the information about device groups to userspace. > > I mean the contents of the group descriptor. There are enough 42s in > the kernel, it's better if we can replace a synthetic number with > something meaningful. If we only look at PCI than a Segment:Bus:Dev.Fn Number would be sufficient, of course. But the idea was to make it generic enough so that it works with !PCI too. Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On 08/22/2011 03:55 PM, Roedel, Joerg wrote: On Mon, Aug 22, 2011 at 08:42:35AM -0400, Avi Kivity wrote: > On 08/22/2011 03:36 PM, Roedel, Joerg wrote: > > On the AMD IOMMU side this information is stored in the IVRS ACPI table. > > Not sure about the VT-d side, though. > > I see. There is no sysfs node representing it? No. It also doesn't exist as a 'struct pci_dev'. This caused problems in the AMD IOMMU driver in the past and I needed to fix that. There I know that from :) Well, too bad. > I'd rather not add another meaningless identifier. Well, I don't think its really meaningless, but we need some way to communicate the information about device groups to userspace. I mean the contents of the group descriptor. There are enough 42s in the kernel, it's better if we can replace a synthetic number with something meaningful. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Mon, Aug 22, 2011 at 08:42:35AM -0400, Avi Kivity wrote: > On 08/22/2011 03:36 PM, Roedel, Joerg wrote: > > On the AMD IOMMU side this information is stored in the IVRS ACPI table. > > Not sure about the VT-d side, though. > > I see. There is no sysfs node representing it? No. It also doesn't exist as a 'struct pci_dev'. This caused problems in the AMD IOMMU driver in the past and I needed to fix that. There I know that from :) > I'd rather not add another meaningless identifier. Well, I don't think its really meaningless, but we need some way to communicate the information about device groups to userspace. Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On 08/22/2011 03:36 PM, Roedel, Joerg wrote: On Mon, Aug 22, 2011 at 06:51:35AM -0400, Avi Kivity wrote: > On 08/22/2011 01:46 PM, Joerg Roedel wrote: > > That does not work. The bridge in question may not even be visible as a > > PCI device, so you can't link to it. This is the case on a few PCIe > > cards which only have a PCIx chip and a PCIe-2-PCIx bridge to implement > > the PCIe interface (yes, I have seen those cards). > > How does the kernel detect that devices behind the invisible bridge must > be assigned as a unit? On the AMD IOMMU side this information is stored in the IVRS ACPI table. Not sure about the VT-d side, though. I see. There is no sysfs node representing it? I'd rather not add another meaningless identifier. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Mon, Aug 22, 2011 at 06:51:35AM -0400, Avi Kivity wrote: > On 08/22/2011 01:46 PM, Joerg Roedel wrote: > > That does not work. The bridge in question may not even be visible as a > > PCI device, so you can't link to it. This is the case on a few PCIe > > cards which only have a PCIx chip and a PCIe-2-PCIx bridge to implement > > the PCIe interface (yes, I have seen those cards). > > How does the kernel detect that devices behind the invisible bridge must > be assigned as a unit? On the AMD IOMMU side this information is stored in the IVRS ACPI table. Not sure about the VT-d side, though. Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On 08/22/2011 01:46 PM, Joerg Roedel wrote: > $ readlink /sys/devices/pci:00/:00:19.0/iommu_group > ../../../path/to/device/which/represents/the/resource/constraint > > (the pci-to-pci bridge on x86, or whatever node represents partitionable > endpoints on power) That does not work. The bridge in question may not even be visible as a PCI device, so you can't link to it. This is the case on a few PCIe cards which only have a PCIx chip and a PCIe-2-PCIx bridge to implement the PCIe interface (yes, I have seen those cards). How does the kernel detect that devices behind the invisible bridge must be assigned as a unit? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Mon, Aug 22, 2011 at 02:30:26AM -0400, Avi Kivity wrote: > On 08/20/2011 07:51 PM, Alex Williamson wrote: > > We need to address both the description and enforcement of device > > groups. Groups are formed any time the iommu does not have resolution > > between a set of devices. On x86, this typically happens when a > > PCI-to-PCI bridge exists between the set of devices and the iommu. For > > Power, partitionable endpoints define a group. Grouping information > > needs to be exposed for both userspace and kernel internal usage. This > > will be a sysfs attribute setup by the iommu drivers. Perhaps: > > > > # cat /sys/devices/pci:00/:00:19.0/iommu_group > > 42 > > > > $ readlink /sys/devices/pci:00/:00:19.0/iommu_group > ../../../path/to/device/which/represents/the/resource/constraint > > (the pci-to-pci bridge on x86, or whatever node represents partitionable > endpoints on power) That does not work. The bridge in question may not even be visible as a PCI device, so you can't link to it. This is the case on a few PCIe cards which only have a PCIx chip and a PCIe-2-PCIx bridge to implement the PCIe interface (yes, I have seen those cards). Regards, Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo, Andrew Bowd Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On 08/20/2011 07:51 PM, Alex Williamson wrote: We need to address both the description and enforcement of device groups. Groups are formed any time the iommu does not have resolution between a set of devices. On x86, this typically happens when a PCI-to-PCI bridge exists between the set of devices and the iommu. For Power, partitionable endpoints define a group. Grouping information needs to be exposed for both userspace and kernel internal usage. This will be a sysfs attribute setup by the iommu drivers. Perhaps: # cat /sys/devices/pci:00/:00:19.0/iommu_group 42 $ readlink /sys/devices/pci:00/:00:19.0/iommu_group ../../../path/to/device/which/represents/the/resource/constraint (the pci-to-pci bridge on x86, or whatever node represents partitionable endpoints on power) -- I have a truly marvellous patch that fixes the bug which this signature is too narrow to contain. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Sat, Aug 20, 2011 at 09:51:39AM -0700, Alex Williamson wrote: > We had an extremely productive VFIO BoF on Monday. Here's my attempt to > capture the plan that I think we agreed to: > > We need to address both the description and enforcement of device > groups. Groups are formed any time the iommu does not have resolution > between a set of devices. On x86, this typically happens when a > PCI-to-PCI bridge exists between the set of devices and the iommu. For > Power, partitionable endpoints define a group. Grouping information > needs to be exposed for both userspace and kernel internal usage. This > will be a sysfs attribute setup by the iommu drivers. Perhaps: > > # cat /sys/devices/pci:00/:00:19.0/iommu_group > 42 > > (I use a PCI example here, but attribute should not be PCI specific) Ok. Am I correct in thinking these group IDs are representing the minimum granularity, and are therefore always static, defined only by the connected hardware, not by configuration? > >From there we have a few options. In the BoF we discussed a model where > binding a device to vfio creates a /dev/vfio$GROUP character device > file. This "group" fd provides provides dma mapping ioctls as well as > ioctls to enumerate and return a "device" fd for each attached member of > the group (similar to KVM_CREATE_VCPU). We enforce grouping by > returning an error on open() of the group fd if there are members of the > group not bound to the vfio driver. Each device fd would then support a > similar set of ioctls and mapping (mmio/pio/config) interface as current > vfio, except for the obvious domain and dma ioctls superseded by the > group fd. It seems a slightly strange distinction that the group device appears when any device in the group is bound to vfio, but only becomes usable when all devices are bound. > Another valid model might be that /dev/vfio/$GROUP is created for all > groups when the vfio module is loaded. The group fd would allow open() > and some set of iommu querying and device enumeration ioctls, but would > error on dma mapping and retrieving device fds until all of the group > devices are bound to the vfio driver. Which is why I marginally prefer this model, although it's not a big deal. > In either case, the uiommu interface is removed entirely since dma > mapping is done via the group fd. As necessary in the future, we can > define a more high performance dma mapping interface for streaming dma > via the group fd. I expect we'll also include architecture specific > group ioctls to describe features and capabilities of the iommu. The > group fd will need to prevent concurrent open()s to maintain a 1:1 group > to userspace process ownership model. A 1:1 group<->process correspondance seems wrong to me. But there are many ways you could legitimately write the userspace side of the code, many of them involving some sort of concurrency. Implementing that concurrency as multiple processes (using explicit shared memory and/or other IPC mechanisms to co-ordinate) seems a valid choice that we shouldn't arbitrarily prohibit. Obviously, only one UID may be permitted to have the group open at a time, and I think that's enough to prevent them doing any worse than shooting themselves in the foot. > Also on the table is supporting non-PCI devices with vfio. To do this, > we need to generalize the read/write/mmap and irq eventfd interfaces. 
> We could keep the same model of segmenting the device fd address space, > perhaps adding ioctls to define the segment offset bit position or we > could split each region into it's own fd (VFIO_GET_PCI_BAR_FD(0), > VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already > suffering some degree of fd bloat (group fd, device fd(s), interrupt > event fd(s), per resource fd, etc). For interrupts we can overload > VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq Sounds reasonable. > (do non-PCI > devices support MSI?). They can. Obviously they might not have exactly the same semantics as PCI MSIs, but I know we have SoC systems with (non-PCI) on-die devices whose interrupts are treated by the (also on-die) root interrupt controller in the same way as PCI MSIs. > For qemu, these changes imply we'd only support a model where we have a > 1:1 group to iommu domain. The current vfio driver could probably > become vfio-pci as we might end up with more target specific vfio > drivers for non-pci. PCI should be able to maintain a simple -device > vfio-pci,host=bb:dd.f to enable hotplug of individual devices. We'll > need to come up with extra options when we need to expose groups to > guest for pvdma. Are you saying that you'd no longer support the current x86 usage of putting all of one guest's devices into a single domain? If that's not what you're saying, how would the domains - now made up of a user's selection of groups, rather than individual devices - be configured? > Hope that captures it, feel free to jump in with corr
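For what the fd-splitting alternative might look like from a user: VFIO_GET_PCI_BAR_FD and VFIO_SET_IRQ_EVENTFD are the names floated above, but the request codes and argument layouts below are pure guesses.

    #include <sys/eventfd.h>
    #include <sys/ioctl.h>
    #include <linux/ioctl.h>
    #include <sys/mman.h>
    #include <stddef.h>

    #define VFIO_GET_PCI_BAR_FD  _IOW(';', 110, int)   /* invented request codes */
    #define VFIO_SET_IRQ_EVENTFD _IOW(';', 111, int)

    /* 'dev' is a device fd obtained from the group fd. */
    static void *map_bar0_and_wire_irq(int dev, size_t len)
    {
            int bar = 0;
            int barfd = ioctl(dev, VFIO_GET_PCI_BAR_FD, &bar);
            int efd = eventfd(0, 0);
            void *mmio;

            if (barfd < 0 || efd < 0)
                    return NULL;

            /* INTx (or a non-PCI irq) is delivered by signalling the
             * eventfd, which qemu can poll or hand to kvm as an irqfd. */
            if (ioctl(dev, VFIO_SET_IRQ_EVENTFD, &efd) < 0)
                    return NULL;

            /* With one fd per region, BAR0 simply maps at offset 0. */
            mmio = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, barfd, 0);
            return mmio == MAP_FAILED ? NULL : mmio;
    }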
Re: kvm PCI assignment & VFIO ramblings
We had an extremely productive VFIO BoF on Monday. Here's my attempt to capture the plan that I think we agreed to: We need to address both the description and enforcement of device groups. Groups are formed any time the iommu does not have resolution between a set of devices. On x86, this typically happens when a PCI-to-PCI bridge exists between the set of devices and the iommu. For Power, partitionable endpoints define a group. Grouping information needs to be exposed for both userspace and kernel internal usage. This will be a sysfs attribute setup by the iommu drivers. Perhaps: # cat /sys/devices/pci:00/:00:19.0/iommu_group 42 (I use a PCI example here, but attribute should not be PCI specific) >From there we have a few options. In the BoF we discussed a model where binding a device to vfio creates a /dev/vfio$GROUP character device file. This "group" fd provides provides dma mapping ioctls as well as ioctls to enumerate and return a "device" fd for each attached member of the group (similar to KVM_CREATE_VCPU). We enforce grouping by returning an error on open() of the group fd if there are members of the group not bound to the vfio driver. Each device fd would then support a similar set of ioctls and mapping (mmio/pio/config) interface as current vfio, except for the obvious domain and dma ioctls superseded by the group fd. Another valid model might be that /dev/vfio/$GROUP is created for all groups when the vfio module is loaded. The group fd would allow open() and some set of iommu querying and device enumeration ioctls, but would error on dma mapping and retrieving device fds until all of the group devices are bound to the vfio driver. In either case, the uiommu interface is removed entirely since dma mapping is done via the group fd. As necessary in the future, we can define a more high performance dma mapping interface for streaming dma via the group fd. I expect we'll also include architecture specific group ioctls to describe features and capabilities of the iommu. The group fd will need to prevent concurrent open()s to maintain a 1:1 group to userspace process ownership model. Also on the table is supporting non-PCI devices with vfio. To do this, we need to generalize the read/write/mmap and irq eventfd interfaces. We could keep the same model of segmenting the device fd address space, perhaps adding ioctls to define the segment offset bit position or we could split each region into it's own fd (VFIO_GET_PCI_BAR_FD(0), VFIO_GET_PCI_CONFIG_FD(), VFIO_GET_MMIO_FD(3)), though we're already suffering some degree of fd bloat (group fd, device fd(s), interrupt event fd(s), per resource fd, etc). For interrupts we can overload VFIO_SET_IRQ_EVENTFD to be either PCI INTx or non-PCI irq (do non-PCI devices support MSI?). For qemu, these changes imply we'd only support a model where we have a 1:1 group to iommu domain. The current vfio driver could probably become vfio-pci as we might end up with more target specific vfio drivers for non-pci. PCI should be able to maintain a simple -device vfio-pci,host=bb:dd.f to enable hotplug of individual devices. We'll need to come up with extra options when we need to expose groups to guest for pvdma. Hope that captures it, feel free to jump in with corrections and suggestions. Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
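Putting the pieces of this proposal together, the userspace flow might look something like the sketch below. The sysfs attribute and /dev/vfio/$GROUP follow the text above; VFIO_GROUP_GET_DEVICE_FD and its calling convention are assumptions, chosen only to mirror the KVM_CREATE_VCPU comparison.

    #include <fcntl.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <linux/ioctl.h>

    #define VFIO_GROUP_GET_DEVICE_FD _IOW(';', 101, char *)  /* invented */

    int main(void)
    {
            char group[32], path[128];
            FILE *f;
            int grp, dev;

            /* 1. Find which group the device lives in. */
            f = fopen("/sys/bus/pci/devices/0000:00:19.0/iommu_group", "r");
            if (!f || !fgets(group, sizeof(group), f))
                    return 1;
            fclose(f);
            group[strcspn(group, "\n")] = '\0';

            /* 2. Open the group chardev; depending on the model chosen,
             *    this either fails or only offers query ioctls until every
             *    device in the group is bound to vfio. */
            snprintf(path, sizeof(path), "/dev/vfio/%s", group);
            grp = open(path, O_RDWR);
            if (grp < 0)
                    return 1;

            /* 3. Pull out a device fd for one member (KVM_CREATE_VCPU style). */
            dev = ioctl(grp, VFIO_GROUP_GET_DEVICE_FD, "0000:00:19.0");
            if (dev < 0)
                    return 1;

            /* 4. dma mapping ioctls go to 'grp'; mmio/pio/config access and
             *    interrupt eventfds go to 'dev'. */
            return 0;
    }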
Re: kvm PCI assignment & VFIO ramblings
> Mostly correct, yes. x86 isn't immune to the group problem, it shows up > for us any time there's a PCIe-to-PCI bridge in the device hierarchy. > We lose resolution of devices behind the bridge. As you state though, I > think of this as only a constraint on what we're able to do with those > devices. > > Perhaps part of the differences is that on x86 the constraints don't > really effect how we expose devices to the guest. We need to hold > unused devices in the group hostage and use the same iommu domain for > any devices assigned, but that's not visible to the guest. AIUI, POWER > probably needs to expose the bridge (or at least an emulated bridge) to > the guest, any devices in the group need to show up behind that bridge, Yes, pretty much, essentially because a group must have as shared iommu domain and so due to the way our PV representation works, that means the iommu DMA window is to be exposed by a bridge that covers all the devices of that group. > some kind of pvDMA needs to be associated with that group, there might > be MMIO segments and IOVA windows, etc. The MMIO segments are mostly transparent to the guest, we just tell it where the BARs are and it leaves them alone, at least that's how it works under pHyp. Currently on our qemu/vfio expriments, we do let the guest do the BAR assignment via the emulated stuff using a hack to work around the guest expectation that the BARs have been already setup (I can fill you on the details if you really care but it's not very interesting). It works because we only ever used that on setups where we had a device == a group, but it's nasty. But in any case, because they are going to be always in separate pages, it's not too hard for KVM to remap them wherewver we want so MMIO is basically a non-issue. > Effectively you want to > transplant the entire group into the guest. Is that right? Thanks, Well, at least we want to have a bridge for the group (it could and probably should be a host bridge, ie, an entire PCI domain, that's a lot easier than trying to mess around with virtual P2P bridges). >From there, I don't care if we need to expose explicitly each device of that group one by one. IE. It would be a nice "optimziation" to have the ability to just specify the group and have qemu pick them all up but it doesn't really matter in the grand scheme of things. Currently, we do expose individual devices, but again, it's hacks and it won't work on many setups etc... with horrid consequences :-) We need to sort that before we can even think of merging that code on our side. Cheers, Ben. > Alex > > -- > To unsubscribe from this list: send the line "unsubscribe kvm" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Mon, 2011-08-08 at 11:28 +0300, Avi Kivity wrote: > On 08/03/2011 05:04 AM, David Gibson wrote: > > I still don't understand the distinction you're making. We're saying > > the group is "owned" by a given user or guest in the sense that no-one > > else may use anything in the group (including host drivers). At that > > point none, some or all of the devices in the group may actually be > > used by the guest. > > > > You seem to be making a distinction between "owned by" and "assigned > > to" and "used by" and I really don't see what it is. > > > > Alex (and I) think that we should work with device/function granularity, > as is common with other archs, and that the group thing is just a > constraint on which functions may be assigned where, while you think > that we should work at group granularity, with 1-function groups for > archs which don't have constraints. > > Is this an accurate way of putting it? Mostly correct, yes. x86 isn't immune to the group problem, it shows up for us any time there's a PCIe-to-PCI bridge in the device hierarchy. We lose resolution of devices behind the bridge. As you state though, I think of this as only a constraint on what we're able to do with those devices. Perhaps part of the differences is that on x86 the constraints don't really effect how we expose devices to the guest. We need to hold unused devices in the group hostage and use the same iommu domain for any devices assigned, but that's not visible to the guest. AIUI, POWER probably needs to expose the bridge (or at least an emulated bridge) to the guest, any devices in the group need to show up behind that bridge, some kind of pvDMA needs to be associated with that group, there might be MMIO segments and IOVA windows, etc. Effectively you want to transplant the entire group into the guest. Is that right? Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On 08/03/2011 05:04 AM, David Gibson wrote: I still don't understand the distinction you're making. We're saying the group is "owned" by a given user or guest in the sense that no-one else may use anything in the group (including host drivers). At that point none, some or all of the devices in the group may actually be used by the guest. You seem to be making a distinction between "owned by" and "assigned to" and "used by" and I really don't see what it is. Alex (and I) think that we should work with device/function granularity, as is common with other archs, and that the group thing is just a constraint on which functions may be assigned where, while you think that we should work at group granularity, with 1-function groups for archs which don't have constraints. Is this an accurate way of putting it? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Fri, Aug 05, 2011 at 09:10:09AM -0600, Alex Williamson wrote: > On Fri, 2011-08-05 at 20:42 +1000, Benjamin Herrenschmidt wrote: > > Right. In fact to try to clarify the problem for everybody, I think we > > can distinguish two different classes of "constraints" that can > > influence the grouping of devices: > > > > 1- Hard constraints. These are typically devices using the same RID or > > where the RID cannot be reliably guaranteed (the later is the case with > > some PCIe-PCIX bridges which will take ownership of "some" transactions > > such as split but not all). Devices like that must be in the same > > domain. This is where PowerPC adds to what x86 does today the concept > > that the domains are pre-existing, since we use the RID for error > > isolation & MMIO segmenting as well. so we need to create those domains > > at boot time. > > > > 2- Softer constraints. Those constraints derive from the fact that not > > applying them risks enabling the guest to create side effects outside of > > its "sandbox". To some extent, there can be "degrees" of badness between > > the various things that can cause such constraints. Examples are shared > > LSIs (since trusting DisINTx can be chancy, see earlier discussions), > > potentially any set of functions in the same device can be problematic > > due to the possibility to get backdoor access to the BARs etc... > > This is what I've been trying to get to, hardware constraints vs system > policy constraints. > > > Now, what I derive from the discussion we've had so far, is that we need > > to find a proper fix for #1, but Alex and Avi seem to prefer that #2 > > remains a matter of libvirt/user doing the right thing (basically > > keeping a loaded gun aimed at the user's foot with a very very very > > sweet trigger but heh, let's not start a flamewar here :-) > > Doesn't your own uncertainty of whether or not to allow this lead to the > same conclusion, that it belongs in userspace policy? I don't think we > want to make white lists of which devices we trust to do DisINTx > correctly part of the kernel interface, do we? Thanks, Yes, but the overall point is that both the hard and soft constraints are much easier to handle if a group or iommu domain or whatever is a persistent entity that can be set up once-per-boot by the admin with whatever degree of safety they want, rather than a transient entity tied to an fd's lifetime, which must be set up correctly, every time, by the thing establishing it. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Fri, 2011-08-05 at 15:44 +0200, Joerg Roedel wrote: > On Fri, Aug 05, 2011 at 08:42:38PM +1000, Benjamin Herrenschmidt wrote: > > > Right. In fact to try to clarify the problem for everybody, I think we > > can distinguish two different classes of "constraints" that can > > influence the grouping of devices: > > > > 1- Hard constraints. These are typically devices using the same RID or > > where the RID cannot be reliably guaranteed (the later is the case with > > some PCIe-PCIX bridges which will take ownership of "some" transactions > > such as split but not all). Devices like that must be in the same > > domain. This is where PowerPC adds to what x86 does today the concept > > that the domains are pre-existing, since we use the RID for error > > isolation & MMIO segmenting as well. so we need to create those domains > > at boot time. > > Domains (in the iommu-sense) are created at boot time on x86 today. > Every device needs at least a domain to provide dma-mapping > functionality to the drivers. So all the grouping is done too at > boot-time. This is specific to the iommu-drivers today but can be > generalized I think. Ok, let's go there then. > > 2- Softer constraints. Those constraints derive from the fact that not > > applying them risks enabling the guest to create side effects outside of > > its "sandbox". To some extent, there can be "degrees" of badness between > > the various things that can cause such constraints. Examples are shared > > LSIs (since trusting DisINTx can be chancy, see earlier discussions), > > potentially any set of functions in the same device can be problematic > > due to the possibility to get backdoor access to the BARs etc... > > Hmm, there is no sane way to handle such constraints in a safe way, > right? We can either blacklist devices which are know to have such > backdoors or we just ignore the problem. Arguably they probably all do have such backdoors. A debug register, JTAG register, ... My point is you don't really know unless you get manufacturer guarantee that there is no undocumented register somewhere or way to change the microcode so that it does it etc The more complex the devices, the less likely to have a guarantee. The "safe" way is what pHyp does and basically boils down to only allowing pass-through of entire 'slots', ie, things that are behind a P2P bridge (virtual one typically, ie, a PCIe switch) and disallowing pass-through with shared interrupts. That way, even if the guest can move the BARs around, it cannot make them overlap somebody else device because the parent bridge restricts the portion of MMIO space that is forwarded down to that device anyway. > > Now, what I derive from the discussion we've had so far, is that we need > > to find a proper fix for #1, but Alex and Avi seem to prefer that #2 > > remains a matter of libvirt/user doing the right thing (basically > > keeping a loaded gun aimed at the user's foot with a very very very > > sweet trigger but heh, let's not start a flamewar here :-) > > > > So let's try to find a proper solution for #1 now, and leave #2 alone > > for the time being. > > Yes, and the solution for #1 should be entirely in the kernel. The > question is how to do that. Probably the most sane way is to introduce a > concept of device ownership. The ownership can either be a kernel driver > or a userspace process. Giving ownership of a device to userspace is > only possible if all devices in the same group are unbound from its > respective drivers. 
This is a very intrusive concept, no idea if it > has a chance of acceptance :-) > But the advantage is clearly that this allows better semantics in the > IOMMU drivers and a more stable handover of devices from host drivers to > kvm guests. I tend to think around those lines too, but the ownership concept doesn't necessarily have to be core-kernel enforced itself, it can be in VFIO. If we have a common API to expose the "domain number", it can perfectly be a matter of VFIO itself not allowing to do pass-through until it has attached its stub driver to all the devices with that domain number, and it can handle exclusion of iommu domains from there. > > Maybe the right option is for x86 to move toward pre-existing domains > > like powerpc does, or maybe we can just expose some kind of ID. > > As I said, the domains are created a iommu driver initialization time > (usually boot time). But the groups are internal to the iommu drivers > and not visible somewhere else. That's what we need to fix :-) > > Ah you started answering to my above questions :-) > > > > We could do what you propose. It depends what we want to do with > > domains. Practically speaking, we could make domains pre-existing (with > > the ability to group several PEs into larger domains) or we could keep > > the concepts different, possibly with the limitation that on powerpc, a > > domain == a PE. > > > > I suppose we -could- make arbitrary domains on ppc as well by making the > > v
Re: kvm PCI assignment & VFIO ramblings
On Fri, 2011-08-05 at 20:42 +1000, Benjamin Herrenschmidt wrote: > Right. In fact to try to clarify the problem for everybody, I think we > can distinguish two different classes of "constraints" that can > influence the grouping of devices: > > 1- Hard constraints. These are typically devices using the same RID or > where the RID cannot be reliably guaranteed (the later is the case with > some PCIe-PCIX bridges which will take ownership of "some" transactions > such as split but not all). Devices like that must be in the same > domain. This is where PowerPC adds to what x86 does today the concept > that the domains are pre-existing, since we use the RID for error > isolation & MMIO segmenting as well. so we need to create those domains > at boot time. > > 2- Softer constraints. Those constraints derive from the fact that not > applying them risks enabling the guest to create side effects outside of > its "sandbox". To some extent, there can be "degrees" of badness between > the various things that can cause such constraints. Examples are shared > LSIs (since trusting DisINTx can be chancy, see earlier discussions), > potentially any set of functions in the same device can be problematic > due to the possibility to get backdoor access to the BARs etc... This is what I've been trying to get to, hardware constraints vs system policy constraints. > Now, what I derive from the discussion we've had so far, is that we need > to find a proper fix for #1, but Alex and Avi seem to prefer that #2 > remains a matter of libvirt/user doing the right thing (basically > keeping a loaded gun aimed at the user's foot with a very very very > sweet trigger but heh, let's not start a flamewar here :-) Doesn't your own uncertainty of whether or not to allow this lead to the same conclusion, that it belongs in userspace policy? I don't think we want to make white lists of which devices we trust to do DisINTx correctly part of the kernel interface, do we? Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Fri, Aug 05, 2011 at 08:42:38PM +1000, Benjamin Herrenschmidt wrote: > Right. In fact to try to clarify the problem for everybody, I think we > can distinguish two different classes of "constraints" that can > influence the grouping of devices: > > 1- Hard constraints. These are typically devices using the same RID or > where the RID cannot be reliably guaranteed (the later is the case with > some PCIe-PCIX bridges which will take ownership of "some" transactions > such as split but not all). Devices like that must be in the same > domain. This is where PowerPC adds to what x86 does today the concept > that the domains are pre-existing, since we use the RID for error > isolation & MMIO segmenting as well. so we need to create those domains > at boot time. Domains (in the iommu-sense) are created at boot time on x86 today. Every device needs at least a domain to provide dma-mapping functionality to the drivers. So all the grouping is done too at boot-time. This is specific to the iommu-drivers today but can be generalized I think. > 2- Softer constraints. Those constraints derive from the fact that not > applying them risks enabling the guest to create side effects outside of > its "sandbox". To some extent, there can be "degrees" of badness between > the various things that can cause such constraints. Examples are shared > LSIs (since trusting DisINTx can be chancy, see earlier discussions), > potentially any set of functions in the same device can be problematic > due to the possibility to get backdoor access to the BARs etc... Hmm, there is no sane way to handle such constraints in a safe way, right? We can either blacklist devices which are know to have such backdoors or we just ignore the problem. > Now, what I derive from the discussion we've had so far, is that we need > to find a proper fix for #1, but Alex and Avi seem to prefer that #2 > remains a matter of libvirt/user doing the right thing (basically > keeping a loaded gun aimed at the user's foot with a very very very > sweet trigger but heh, let's not start a flamewar here :-) > > So let's try to find a proper solution for #1 now, and leave #2 alone > for the time being. Yes, and the solution for #1 should be entirely in the kernel. The question is how to do that. Probably the most sane way is to introduce a concept of device ownership. The ownership can either be a kernel driver or a userspace process. Giving ownership of a device to userspace is only possible if all devices in the same group are unbound from its respective drivers. This is a very intrusive concept, no idea if it has a chance of acceptance :-) But the advantage is clearly that this allows better semantics in the IOMMU drivers and a more stable handover of devices from host drivers to kvm guests. > Maybe the right option is for x86 to move toward pre-existing domains > like powerpc does, or maybe we can just expose some kind of ID. As I said, the domains are created a iommu driver initialization time (usually boot time). But the groups are internal to the iommu drivers and not visible somewhere else. > Ah you started answering to my above questions :-) > > We could do what you propose. It depends what we want to do with > domains. Practically speaking, we could make domains pre-existing (with > the ability to group several PEs into larger domains) or we could keep > the concepts different, possibly with the limitation that on powerpc, a > domain == a PE. 
> > I suppose we -could- make arbitrary domains on ppc as well by making the > various PE's iommu's in HW point to the same in-memory table, but that's > a bit nasty in practice due to the way we manage those, and it would to > some extent increase the risk of a failing device/driver stomping on > another one and thus taking it down with itself. IE. isolation of errors > is an important feature for us. These arbitrary domains exist in the iommu-api. It would be good to emulate them on Power too. Can't you put a PE into an isolated error-domain when something goes wrong with it? This should provide the same isolation as before. What you derive the group number from is your business :-) On x86 it is certainly the best to use the RID these devices share together with the PCI segment number. Regards, Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
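On the x86 side, a group number derived from the alias RID plus the segment could be as simple as the sketch below (segment, bus and devfn exactly fill a u32). Finding the right alias device, i.e. the one whose requester-id the iommu actually sees, is left as a placeholder.

    #include <linux/pci.h>

    static u32 pci_device_group_id(struct pci_dev *pdev)
    {
            /* Placeholder: walk up to the device whose requester-id the
             * iommu actually sees (e.g. a PCIe-to-PCI bridge), if any. */
            struct pci_dev *alias = pdev;

            return ((u32)pci_domain_nr(alias->bus) << 16) |
                   ((u32)alias->bus->number << 8) |
                   alias->devfn;
    }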
Re: kvm PCI assignment & VFIO ramblings
On Fri, Aug 05, 2011 at 08:26:11PM +1000, Benjamin Herrenschmidt wrote: > On Thu, 2011-08-04 at 12:41 +0200, Joerg Roedel wrote: > > On Mon, Aug 01, 2011 at 02:27:36PM -0600, Alex Williamson wrote: > > > It's not clear to me how we could skip it. With VT-d, we'd have to > > > implement an emulated interrupt remapper and hope that the guest picks > > > unused indexes in the host interrupt remapping table before it could do > > > anything useful with direct access to the MSI-X table. Maybe AMD IOMMU > > > makes this easier? > > > > AMD IOMMU provides remapping tables per-device, and not a global one. > > But that does not make direct guest-access to the MSI-X table safe. The > > table contains the table contains the interrupt-type and the vector > > which is used as an index into the remapping table by the IOMMU. So when > > the guest writes into its MSI-X table the remapping-table in the host > > needs to be updated too. > > Right, you need paravirt to avoid filtering :-) Or a shadow MSI-X table like done on x86. How to handle this seems to be platform specific. As you indicate there is a standardized paravirt interface for that on Power. > IE the problem is two fold: > > - Getting the right value in the table / remapper so things work > (paravirt) > > - Protecting against the guest somewhat managing to change the value in > the table (either directly or via a backdoor access to its own config > space). > > The later for us comes from the HW PE filtering of the MSI transactions. Right. The second part of the problem can be avoided with interrupt-remapping/filtering hardware in the IOMMUs. Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
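The "shadow MSI-X table" approach referred to here boils down to trapping the guest's writes, remembering the guest's view for reads, and programming the real table (and the iommu remapping entry) with host-chosen values. A very rough, platform-neutral sketch; both helpers are placeholders:

    #include <stdint.h>

    struct msix_shadow_entry {
            uint64_t guest_addr;  /* the address/data pair the guest wrote */
            uint32_t guest_data;
            uint64_t host_addr;   /* what actually lands in the hardware table */
            uint32_t host_data;
    };

    void msix_update_remap_entry(struct msix_shadow_entry *e);  /* placeholder */
    void msix_program_hw_entry(struct msix_shadow_entry *e);    /* placeholder */

    /* Called when a guest write to the virtualized MSI-X table is trapped. */
    void msix_shadow_write(struct msix_shadow_entry *e,
                           uint64_t guest_addr, uint32_t guest_data)
    {
            /* Keep the guest's view so its reads can be emulated faithfully. */
            e->guest_addr = guest_addr;
            e->guest_data = guest_data;

            /* The host picks the real vector/address, updates the per-device
             * remapping table accordingly, and only then touches the device. */
            msix_update_remap_entry(e);
            msix_program_hw_entry(e);
    }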
Re: kvm PCI assignment & VFIO ramblings
On Thu, 2011-08-04 at 12:27 +0200, Joerg Roedel wrote: > Hi Ben, > > thanks for your detailed introduction to the requirements for POWER. Its > good to know that the granularity problem is not x86-only. I'm happy to see your reply :-) I had the feeling I was a bit alone here... > On Sat, Jul 30, 2011 at 09:58:53AM +1000, Benjamin Herrenschmidt wrote: > > In IBM POWER land, we call this a "partitionable endpoint" (the term > > "endpoint" here is historic, such a PE can be made of several PCIe > > "endpoints"). I think "partitionable" is a pretty good name tho to > > represent the constraints, so I'll call this a "partitionable group" > > from now on. > > On x86 this is mostly an issue of the IOMMU and which set of devices use > the same request-id. I used to call that an alias-group because the > devices have a request-id alias to the pci-bridge. Right. In fact to try to clarify the problem for everybody, I think we can distinguish two different classes of "constraints" that can influence the grouping of devices: 1- Hard constraints. These are typically devices using the same RID or where the RID cannot be reliably guaranteed (the later is the case with some PCIe-PCIX bridges which will take ownership of "some" transactions such as split but not all). Devices like that must be in the same domain. This is where PowerPC adds to what x86 does today the concept that the domains are pre-existing, since we use the RID for error isolation & MMIO segmenting as well. so we need to create those domains at boot time. 2- Softer constraints. Those constraints derive from the fact that not applying them risks enabling the guest to create side effects outside of its "sandbox". To some extent, there can be "degrees" of badness between the various things that can cause such constraints. Examples are shared LSIs (since trusting DisINTx can be chancy, see earlier discussions), potentially any set of functions in the same device can be problematic due to the possibility to get backdoor access to the BARs etc... Now, what I derive from the discussion we've had so far, is that we need to find a proper fix for #1, but Alex and Avi seem to prefer that #2 remains a matter of libvirt/user doing the right thing (basically keeping a loaded gun aimed at the user's foot with a very very very sweet trigger but heh, let's not start a flamewar here :-) So let's try to find a proper solution for #1 now, and leave #2 alone for the time being. Maybe the right option is for x86 to move toward pre-existing domains like powerpc does, or maybe we can just expose some kind of ID. Because #1 is a mix of generic constraints (nasty bridges) and very platform specific ones (whatever capacity limits in our MMIO segmenting forced us to put two devices in the same hard domain on power), I believe it's really something the kernel must solve, not libvirt nor qemu user or anything else. I am open to suggestions here. I can easily expose my PE# (it's just a number) somewhere in sysfs, in fact I'm considering doing it in the PCI devices sysfs directory, simply because it can/will be useful for other things such as error reporting, so we could maybe build on that. The crux for me is really the need for pre-existence of the iommu domains as my PE's imply a shared iommu space. > > - The -minimum- granularity of pass-through is not always a single > > device and not always under SW control > > Correct. > > > - Having a magic heuristic in libvirt to figure out those constraints is > > WRONG. 
This reeks of XFree 4 PCI layer trying to duplicate the kernel > > knowledge of PCI resource management and getting it wrong in many many > > cases, something that took years to fix essentially by ripping it all > > out. This is kernel knowledge and thus we need the kernel to expose in a > > way or another what those constraints are, what those "partitionable > > groups" are. > > I agree. Managing the ownership of a group should be done in the kernel. > Doing this in userspace is just too dangerous. > > The problem to be solved here is how to present these PEs inside the > kernel and to userspace. I thought a bit about making this visbible > through the iommu-api for in-kernel users. That is probably the most > logical place. Ah you started answering to my above questions :-) We could do what you propose. It depends what we want to do with domains. Practically speaking, we could make domains pre-existing (with the ability to group several PEs into larger domains) or we could keep the concepts different, possibly with the limitation that on powerpc, a domain == a PE. I suppose we -could- make arbitrary domains on ppc as well by making the various PE's iommu's in HW point to the same in-memory table, but that's a bit nasty in practice due to the way we manage those, and it would to some extent increase the risk of a failing device/driver stomping on another one and thus taking it down with itself. IE. isolation of errors is an important feature for us. S
Re: kvm PCI assignment & VFIO ramblings
On Thu, 2011-08-04 at 12:41 +0200, Joerg Roedel wrote: > On Mon, Aug 01, 2011 at 02:27:36PM -0600, Alex Williamson wrote: > > It's not clear to me how we could skip it. With VT-d, we'd have to > > implement an emulated interrupt remapper and hope that the guest picks > > unused indexes in the host interrupt remapping table before it could do > > anything useful with direct access to the MSI-X table. Maybe AMD IOMMU > > makes this easier? > > AMD IOMMU provides remapping tables per-device, and not a global one. > But that does not make direct guest-access to the MSI-X table safe. The > table contains the table contains the interrupt-type and the vector > which is used as an index into the remapping table by the IOMMU. So when > the guest writes into its MSI-X table the remapping-table in the host > needs to be updated too. Right, you need paravirt to avoid filtering :-) IE the problem is two fold: - Getting the right value in the table / remapper so things work (paravirt) - Protecting against the guest somewhat managing to change the value in the table (either directly or via a backdoor access to its own config space). The later for us comes from the HW PE filtering of the MSI transactions. Cheers, Ben. -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Mon, Aug 01, 2011 at 02:27:36PM -0600, Alex Williamson wrote: > It's not clear to me how we could skip it. With VT-d, we'd have to > implement an emulated interrupt remapper and hope that the guest picks > unused indexes in the host interrupt remapping table before it could do > anything useful with direct access to the MSI-X table. Maybe AMD IOMMU > makes this easier? AMD IOMMU provides remapping tables per-device, and not a global one. But that does not make direct guest-access to the MSI-X table safe. The table contains the interrupt-type and the vector which is used as an index into the remapping table by the IOMMU. So when the guest writes into its MSI-X table the remapping-table in the host needs to be updated too. Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote: > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote: > > - The -minimum- granularity of pass-through is not always a single > > device and not always under SW control > > But IMHO, we need to preserve the granularity of exposing a device to a > guest as a single device. That might mean some devices are held hostage > by an agent on the host. Thats true. There is a difference between unassign a group from the host and make single devices in that PE visible to the guest. But we need to make sure that no device in a PE is used by the host while at least one device is assigned to a guest. Unlike the other proposals to handle this in libvirt, I think this belongs into the kernel. Doing this in userspace may break the entire system if done wrong. For example, if one device from e PE is assigned to a guest while another one is not unbound from its host driver, the driver may get very confused when DMA just stops working. This may crash the entire system or lead to silent data corruption in the guest. The behavior is basically undefined then. The kernel must not not allow that. Joerg -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
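The in-kernel rule being argued for here (no device of a PE may stay host-used while any member is assigned) reduces to a check like the sketch below before the group is handed to userspace; the group iterator and the stub-driver pointer are placeholders.

    #include <linux/device.h>
    #include <linux/errno.h>

    struct vfio_group;                                  /* opaque here */
    extern struct device_driver *vfio_stub_driver;      /* placeholder */
    int vfio_group_for_each_dev(struct vfio_group *group, void *data,
                                int (*fn)(struct device *, void *)); /* placeholder */

    static int vfio_dev_viable(struct device *dev, void *data)
    {
            /* A member is acceptable only if it has no driver at all or is
             * already bound to the vfio stub; anything else means the host
             * is still using part of the group. */
            if (!dev->driver || dev->driver == vfio_stub_driver)
                    return 0;
            return -EBUSY;
    }

    /* Gate dma mapping ioctls / device-fd handout on this returning 0. */
    static int vfio_group_viable(struct vfio_group *group)
    {
            return vfio_group_for_each_dev(group, NULL, vfio_dev_viable);
    }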
Re: kvm PCI assignment & VFIO ramblings
Hi Ben, thanks for your detailed introduction to the requirements for POWER. Its good to know that the granularity problem is not x86-only. On Sat, Jul 30, 2011 at 09:58:53AM +1000, Benjamin Herrenschmidt wrote: > In IBM POWER land, we call this a "partitionable endpoint" (the term > "endpoint" here is historic, such a PE can be made of several PCIe > "endpoints"). I think "partitionable" is a pretty good name tho to > represent the constraints, so I'll call this a "partitionable group" > from now on. On x86 this is mostly an issue of the IOMMU and which set of devices use the same request-id. I used to call that an alias-group because the devices have a request-id alias to the pci-bridge. > - The -minimum- granularity of pass-through is not always a single > device and not always under SW control Correct. > - Having a magic heuristic in libvirt to figure out those constraints is > WRONG. This reeks of XFree 4 PCI layer trying to duplicate the kernel > knowledge of PCI resource management and getting it wrong in many many > cases, something that took years to fix essentially by ripping it all > out. This is kernel knowledge and thus we need the kernel to expose in a > way or another what those constraints are, what those "partitionable > groups" are. I agree. Managing the ownership of a group should be done in the kernel. Doing this in userspace is just too dangerous. The problem to be solved here is how to present these PEs inside the kernel and to userspace. I thought a bit about making this visbible through the iommu-api for in-kernel users. That is probably the most logical place. For userspace I would like to propose a new device attribute in sysfs. This attribute contains the group number. All devices with the same group number belong to the same PE. Libvirt needs to scan the whole device tree to build the groups but that is probalbly not a big deal. Joerg > > - That does -not- mean that we cannot specify for each individual device > within such a group where we want to put it in qemu (what devfn etc...). > As long as there is a clear understanding that the "ownership" of the > device goes with the group, this is somewhat orthogonal to how they are > represented in qemu. (Not completely... if the iommu is exposed to the > guest ,via paravirt for example, some of these constraints must be > exposed but I'll talk about that more later). > > The interface currently proposed for VFIO (and associated uiommu) > doesn't handle that problem at all. Instead, it is entirely centered > around a specific "feature" of the VTd iommu's for creating arbitrary > domains with arbitrary devices (tho those devices -do- have the same > constraints exposed above, don't try to put 2 legacy PCI devices behind > the same bridge into 2 different domains !), but the API totally ignores > the problem, leaves it to libvirt "magic foo" and focuses on something > that is both quite secondary in the grand scheme of things, and quite > x86 VTd specific in the implementation and API definition. > > Now, I'm not saying these programmable iommu domains aren't a nice > feature and that we shouldn't exploit them when available, but as it is, > it is too much a central part of the API. > > I'll talk a little bit more about recent POWER iommu's here to > illustrate where I'm coming from with my idea of groups: > > On p7ioc (the IO chip used on recent P7 machines), there -is- a concept > of domain and a per-RID filtering. 
However it differs from VTd in a few > ways: > > The "domains" (aka PEs) encompass more than just an iommu filtering > scheme. The MMIO space and PIO space are also segmented, and those > segments assigned to domains. Interrupts (well, MSI ports at least) are > assigned to domains. Inbound PCIe error messages are targeted to > domains, etc... > > Basically, the PEs provide a very strong isolation feature which > includes errors, and has the ability to immediately "isolate" a PE on > the first occurence of an error. For example, if an inbound PCIe error > is signaled by a device on a PE or such a device does a DMA to a > non-authorized address, the whole PE gets into error state. All > subsequent stores (both DMA and MMIO) are swallowed and reads return all > 1's, interrupts are blocked. This is designed to prevent any propagation > of bad data, which is a very important feature in large high reliability > systems. > > Software then has the ability to selectively turn back on MMIO and/or > DMA, perform diagnostics, reset devices etc... > > Because the domains encompass more than just DMA, but also segment the > MMIO space, it is not practical at all to dynamically reconfigure them > at runtime to "move" devices into domains. The firmware or early kernel > code (it depends) will assign devices BARs using an algorithm that keeps > them within PE segment boundaries, etc > > Additionally (and this is indeed a "restriction" compared to VTd, though > I expect our future IO chips to lift it to so
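The sysfs proposal above (a per-device attribute carrying the group number) could be consumed from userspace along the lines of the sketch below. The attribute name "group" and its location under /sys/bus/pci/devices/ are assumptions made purely for illustration; no such attribute exists yet and the merged interface may well look different.

/*
 * Sketch: list PCI devices together with the proposed group-number
 * attribute so a tool like libvirt could bucket them into PEs.
 * Build with: cc -o pci-groups pci-groups.c
 */
#include <glob.h>
#include <libgen.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
	glob_t g;
	size_t i;

	/* hypothetical attribute: /sys/bus/pci/devices/<bdf>/group */
	if (glob("/sys/bus/pci/devices/*/group", 0, NULL, &g) != 0) {
		fprintf(stderr, "no group attributes found\n");
		return 1;
	}

	for (i = 0; i < g.gl_pathc; i++) {
		char buf[32] = "";
		FILE *f = fopen(g.gl_pathv[i], "r");

		if (!f)
			continue;
		if (fgets(buf, sizeof(buf), f)) {
			char *path = strdup(g.gl_pathv[i]);

			buf[strcspn(buf, "\n")] = '\0';
			/* print "<group> <bdf>" pairs; sorting on the first
			 * column yields the PE membership lists */
			printf("%s %s\n", buf, basename(dirname(path)));
			free(path);
		}
		fclose(f);
	}
	globfree(&g);
	return 0;
}

Piping the output through sort(1) gives one block per group, which is all the "scan the whole device tree" step really needs.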
Re: kvm PCI assignment & VFIO ramblings
On Tue, Aug 02, 2011 at 09:44:49PM -0600, Alex Williamson wrote: > On Wed, 2011-08-03 at 12:04 +1000, David Gibson wrote: > > On Tue, Aug 02, 2011 at 12:35:19PM -0600, Alex Williamson wrote: > > > On Tue, 2011-08-02 at 12:14 -0600, Alex Williamson wrote: > > > > On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote: > > > > > On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote: > > > > > > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote: > > > > > [snip] > > > > > > On x86, the USB controllers don't typically live behind a > > > > > > PCIe-to-PCI > > > > > > bridge, so don't suffer the source identifier problem, but they do > > > > > > often > > > > > > share an interrupt. But even then, we can count on most modern > > > > > > devices > > > > > > supporting PCI2.3, and thus the DisINTx feature, which allows us to > > > > > > share interrupts. In any case, yes, it's more rare but we need to > > > > > > know > > > > > > how to handle devices behind PCI bridges. However I disagree that > > > > > > we > > > > > > need to assign all the devices behind such a bridge to the guest. > > > > > > There's a difference between removing the device from the host and > > > > > > exposing the device to the guest. > > > > > > > > > > I think you're arguing only over details of what words to use for > > > > > what, rather than anything of substance here. The point is that an > > > > > entire partitionable group must be assigned to "host" (in which case > > > > > kernel drivers may bind to it) or to a particular guest partition (or > > > > > at least to a single UID on the host). Which of the assigned devices > > > > > the partition actually uses is another matter of course, as is at > > > > > exactly which level they become "de-exposed" if you don't want to use > > > > > all of then. > > > > Well first we need to define what a partitionable group is, whether it's > > > > based on hardware requirements or user policy. And while I agree that > > > > we need unique ownership of a partition, I disagree that qemu is > > > > necessarily the owner of the entire partition vs individual devices. > > > Sorry, I didn't intend to have such circular logic. "... I disagree > > > that qemu is necessarily the owner of the entire partition vs granted > > > access to devices within the partition". Thanks, > > I still don't understand the distinction you're making. We're saying > > the group is "owned" by a given user or guest in the sense that no-one > > else may use anything in the group (including host drivers). At that > > point none, some or all of the devices in the group may actually be > > used by the guest. > > You seem to be making a distinction between "owned by" and "assigned > > to" and "used by" and I really don't see what it is. > How does a qemu instance that uses none of the devices in a group still > own that group? ?? In the same way that you still own a file you don't have open..? > Aren't we at that point free to move the group to a > different qemu instance or return ownership to the host? Of course. But until you actually do that, the group is still notionally owned by the guest. > Who does that? The admin. Possibly by poking sysfs, or possibly by frobbing some character device, or maybe something else. Naturally libvirt or whatever could also do this. 
> In my mental model, there's an intermediary that "owns" the group and > just as kernel drivers bind to devices when the host owns the group, > qemu is a userspace device driver that binds to sets of devices when the > intermediary owns it. Obviously I'm thinking libvirt, but it doesn't > have to be. Thanks, Well sure, but I really don't see how such an intermediary fits into the kernel's model of ownership. So, first, take a step back and look at what sort of entities can "own" a group (or device or whatever). I notice that when I've said "owned by the guest" you seem to have read this as "owned by qemu" which is not necessarily the same thing. What I had in mind is that each group is either owned by "host", in which case host kernel drivers can bind to it, or it's in "guest mode" in which case it has a user, group and mode and can be bound by user drivers (and therefore guests) with the right permission. From the kernel's perspective there is therefore no distinction between "owned by qemu" and "owned by libvirt". -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
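David's host-mode/guest-mode split could be captured by a state machine as small as the one below. All of it is hypothetical: the struct, the field names and the transition rules are just a way of restating the model, not a proposed kernel interface.

/*
 * Sketch of the ownership model described above: a group is either in
 * "host" mode (kernel drivers may bind to its devices) or in "guest"
 * mode (a uid/gid/mode triple gates userspace drivers such as qemu).
 * Everything here is illustrative only.
 */
#include <linux/types.h>
#include <linux/errno.h>

enum grp_owner { GRP_OWNER_HOST, GRP_OWNER_GUEST };

struct dev_group {
	enum grp_owner	owner;
	uid_t		uid;	/* meaningful only in guest mode */
	gid_t		gid;
	umode_t		mode;
	unsigned int	host_drivers_bound;	/* bound host drivers */
};

/* switch to guest mode; refuse while any host driver is still attached */
static int group_set_guest_mode(struct dev_group *grp,
				uid_t uid, gid_t gid, umode_t mode)
{
	if (grp->host_drivers_bound)
		return -EBUSY;
	grp->owner = GRP_OWNER_GUEST;
	grp->uid = uid;
	grp->gid = gid;
	grp->mode = mode;
	return 0;
}

/* host driver bind hook: only allowed while the host owns the group */
static int group_allow_host_bind(struct dev_group *grp)
{
	if (grp->owner != GRP_OWNER_HOST)
		return -EPERM;
	grp->host_drivers_bound++;
	return 0;
}

Note that nothing in the guest-mode state records which process owns the group, only the credentials allowed to open it, which is exactly the "no distinction between owned by qemu and owned by libvirt" point.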
Re: kvm PCI assignment & VFIO ramblings
On Wed, 2011-08-03 at 12:04 +1000, David Gibson wrote: > On Tue, Aug 02, 2011 at 12:35:19PM -0600, Alex Williamson wrote: > > On Tue, 2011-08-02 at 12:14 -0600, Alex Williamson wrote: > > > On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote: > > > > On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote: > > > > > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote: > > > > [snip] > > > > > On x86, the USB controllers don't typically live behind a PCIe-to-PCI > > > > > bridge, so don't suffer the source identifier problem, but they do > > > > > often > > > > > share an interrupt. But even then, we can count on most modern > > > > > devices > > > > > supporting PCI2.3, and thus the DisINTx feature, which allows us to > > > > > share interrupts. In any case, yes, it's more rare but we need to > > > > > know > > > > > how to handle devices behind PCI bridges. However I disagree that we > > > > > need to assign all the devices behind such a bridge to the guest. > > > > > There's a difference between removing the device from the host and > > > > > exposing the device to the guest. > > > > > > > > I think you're arguing only over details of what words to use for > > > > what, rather than anything of substance here. The point is that an > > > > entire partitionable group must be assigned to "host" (in which case > > > > kernel drivers may bind to it) or to a particular guest partition (or > > > > at least to a single UID on the host). Which of the assigned devices > > > > the partition actually uses is another matter of course, as is at > > > > exactly which level they become "de-exposed" if you don't want to use > > > > all of then. > > > > > > Well first we need to define what a partitionable group is, whether it's > > > based on hardware requirements or user policy. And while I agree that > > > we need unique ownership of a partition, I disagree that qemu is > > > necessarily the owner of the entire partition vs individual devices. > > > > Sorry, I didn't intend to have such circular logic. "... I disagree > > that qemu is necessarily the owner of the entire partition vs granted > > access to devices within the partition". Thanks, > > I still don't understand the distinction you're making. We're saying > the group is "owned" by a given user or guest in the sense that no-one > else may use anything in the group (including host drivers). At that > point none, some or all of the devices in the group may actually be > used by the guest. > > You seem to be making a distinction between "owned by" and "assigned > to" and "used by" and I really don't see what it is. How does a qemu instance that uses none of the devices in a group still own that group? Aren't we at that point free to move the group to a different qemu instance or return ownership to the host? Who does that? In my mental model, there's an intermediary that "owns" the group and just as kernel drivers bind to devices when the host owns the group, qemu is a userspace device driver that binds to sets of devices when the intermediary owns it. Obviously I'm thinking libvirt, but it doesn't have to be. Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Tue, Aug 02, 2011 at 12:35:19PM -0600, Alex Williamson wrote: > On Tue, 2011-08-02 at 12:14 -0600, Alex Williamson wrote: > > On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote: > > > On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote: > > > > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote: > > > [snip] > > > > On x86, the USB controllers don't typically live behind a PCIe-to-PCI > > > > bridge, so don't suffer the source identifier problem, but they do often > > > > share an interrupt. But even then, we can count on most modern devices > > > > supporting PCI2.3, and thus the DisINTx feature, which allows us to > > > > share interrupts. In any case, yes, it's more rare but we need to know > > > > how to handle devices behind PCI bridges. However I disagree that we > > > > need to assign all the devices behind such a bridge to the guest. > > > > There's a difference between removing the device from the host and > > > > exposing the device to the guest. > > > > > > I think you're arguing only over details of what words to use for > > > what, rather than anything of substance here. The point is that an > > > entire partitionable group must be assigned to "host" (in which case > > > kernel drivers may bind to it) or to a particular guest partition (or > > > at least to a single UID on the host). Which of the assigned devices > > > the partition actually uses is another matter of course, as is at > > > exactly which level they become "de-exposed" if you don't want to use > > > all of then. > > > > Well first we need to define what a partitionable group is, whether it's > > based on hardware requirements or user policy. And while I agree that > > we need unique ownership of a partition, I disagree that qemu is > > necessarily the owner of the entire partition vs individual devices. > > Sorry, I didn't intend to have such circular logic. "... I disagree > that qemu is necessarily the owner of the entire partition vs granted > access to devices within the partition". Thanks, I still don't understand the distinction you're making. We're saying the group is "owned" by a given user or guest in the sense that no-one else may use anything in the group (including host drivers). At that point none, some or all of the devices in the group may actually be used by the guest. You seem to be making a distinction between "owned by" and "assigned to" and "used by" and I really don't see what it is. -- David Gibson| I'll have my music baroque, and my code david AT gibson.dropbear.id.au | minimalist, thank you. NOT _the_ _other_ | _way_ _around_! http://www.ozlabs.org/~dgibson -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Tue, 2011-08-02 at 17:29 -0400, Konrad Rzeszutek Wilk wrote: > On Tue, Aug 02, 2011 at 09:34:58AM -0600, Alex Williamson wrote: > > On Tue, 2011-08-02 at 22:58 +1000, Benjamin Herrenschmidt wrote: > > > > > > Don't worry, it took me a while to get my head around the HW :-) SR-IOV > > > VFs will generally not have limitations like that no, but on the other > > > hand, they -will- still require 1 VF = 1 group, ie, you won't be able to > > > take a bunch of VFs and put them in the same 'domain'. > > > > > > I think the main deal is that VFIO/qemu sees "domains" as "guests" and > > > tries to put all devices for a given guest into a "domain". > > > > Actually, that's only a recent optimization, before that each device got > > it's own iommu domain. It's actually completely configurable on the > > qemu command line which devices get their own iommu and which share. > > The default optimizes the number of domains (one) and thus the number of > > mapping callbacks since we pin the entire guest. > > > > > On POWER, we have a different view of things were domains/groups are > > > defined to be the smallest granularity we can (down to a single VF) and > > > we give several groups to a guest (ie we avoid sharing the iommu in most > > > cases) > > > > > > This is driven by the HW design but that design is itself driven by the > > > idea that the domains/group are also error isolation groups and we don't > > > want to take all of the IOs of a guest down if one adapter in that guest > > > is having an error. > > > > > > The x86 domains are conceptually different as they are about sharing the > > > iommu page tables with the clear long term intent of then sharing those > > > page tables with the guest CPU own. We aren't going in that direction > > > (at this point at least) on POWER.. > > > > Yes and no. The x86 domains are pretty flexible and used a few > > different ways. On the host we do dynamic DMA with a domain per device, > > mapping only the inflight DMA ranges. In order to achieve the > > transparent device assignment model, we have to flip that around and map > > the entire guest. As noted, we can continue to use separate domains for > > this, but since each maps the entire guest, it doesn't add a lot of > > value and uses more resources and requires more mapping callbacks (and > > x86 doesn't have the best error containment anyway). If we had a well > > supported IOMMU model that we could adapt for pvDMA, then it would make > > sense to keep each device in it's own domain again. Thanks, > > Could you have an PV IOMMU (in the guest) that would set up those > maps? Yep, definitely. That's effectively what power wants to do. We could do it on x86, but as others have noted, the map/unmap interface isn't tuned to do this at that granularity and our target guest OS audience is effectively reduced to Linux. Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
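The two x86 models Alex contrasts (map the whole guest once versus a domain per device with only in-flight ranges mapped) look roughly like this against the generic iommu API. The exact signatures of iommu_map() and friends have changed between kernel versions, so treat the sketch as typed pseudocode rather than something tied to a particular tree.

/*
 * Illustration only: the "transparent assignment" model maps the whole
 * guest once into a shared domain; the "dynamic DMA" model gives each
 * device its own domain and maps/unmaps around every DMA, which is the
 * per-map cost a pv-iommu interface would have to pay.
 */
#include <linux/iommu.h>
#include <linux/pci.h>

/* one shared domain, entire guest pinned and mapped up front */
static int map_whole_guest(struct iommu_domain *dom,
			   struct pci_dev *pdevs[], int ndev,
			   unsigned long guest_iova, phys_addr_t guest_phys,
			   size_t guest_size)
{
	int i, ret;

	for (i = 0; i < ndev; i++) {
		ret = iommu_attach_device(dom, &pdevs[i]->dev);
		if (ret)
			return ret;
	}
	/* a single static mapping for the life of the guest */
	return iommu_map(dom, guest_iova, guest_phys, guest_size,
			 IOMMU_READ | IOMMU_WRITE);
}

/* per-device domain, called (and later undone) around each DMA */
static int map_inflight(struct iommu_domain *per_dev_dom,
			unsigned long iova, phys_addr_t paddr, size_t len)
{
	return iommu_map(per_dev_dom, iova, paddr, len,
			 IOMMU_READ | IOMMU_WRITE);
}

A guest-visible PV interface, as Konrad asks about, would essentially forward each map_inflight()-style request from the guest, which is why the per-map overhead dominates that design.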
Re: kvm PCI assignment & VFIO ramblings
On Tue, Aug 02, 2011 at 09:34:58AM -0600, Alex Williamson wrote: > On Tue, 2011-08-02 at 22:58 +1000, Benjamin Herrenschmidt wrote: > > > > Don't worry, it took me a while to get my head around the HW :-) SR-IOV > > VFs will generally not have limitations like that no, but on the other > > hand, they -will- still require 1 VF = 1 group, ie, you won't be able to > > take a bunch of VFs and put them in the same 'domain'. > > > > I think the main deal is that VFIO/qemu sees "domains" as "guests" and > > tries to put all devices for a given guest into a "domain". > > Actually, that's only a recent optimization, before that each device got > it's own iommu domain. It's actually completely configurable on the > qemu command line which devices get their own iommu and which share. > The default optimizes the number of domains (one) and thus the number of > mapping callbacks since we pin the entire guest. > > > On POWER, we have a different view of things were domains/groups are > > defined to be the smallest granularity we can (down to a single VF) and > > we give several groups to a guest (ie we avoid sharing the iommu in most > > cases) > > > > This is driven by the HW design but that design is itself driven by the > > idea that the domains/group are also error isolation groups and we don't > > want to take all of the IOs of a guest down if one adapter in that guest > > is having an error. > > > > The x86 domains are conceptually different as they are about sharing the > > iommu page tables with the clear long term intent of then sharing those > > page tables with the guest CPU own. We aren't going in that direction > > (at this point at least) on POWER.. > > Yes and no. The x86 domains are pretty flexible and used a few > different ways. On the host we do dynamic DMA with a domain per device, > mapping only the inflight DMA ranges. In order to achieve the > transparent device assignment model, we have to flip that around and map > the entire guest. As noted, we can continue to use separate domains for > this, but since each maps the entire guest, it doesn't add a lot of > value and uses more resources and requires more mapping callbacks (and > x86 doesn't have the best error containment anyway). If we had a well > supported IOMMU model that we could adapt for pvDMA, then it would make > sense to keep each device in it's own domain again. Thanks, Could you have an PV IOMMU (in the guest) that would set up those maps? -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Tue, 2011-08-02 at 12:14 -0600, Alex Williamson wrote: > On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote: > > On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote: > > > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote: > > [snip] > > > On x86, the USB controllers don't typically live behind a PCIe-to-PCI > > > bridge, so don't suffer the source identifier problem, but they do often > > > share an interrupt. But even then, we can count on most modern devices > > > supporting PCI2.3, and thus the DisINTx feature, which allows us to > > > share interrupts. In any case, yes, it's more rare but we need to know > > > how to handle devices behind PCI bridges. However I disagree that we > > > need to assign all the devices behind such a bridge to the guest. > > > There's a difference between removing the device from the host and > > > exposing the device to the guest. > > > > I think you're arguing only over details of what words to use for > > what, rather than anything of substance here. The point is that an > > entire partitionable group must be assigned to "host" (in which case > > kernel drivers may bind to it) or to a particular guest partition (or > > at least to a single UID on the host). Which of the assigned devices > > the partition actually uses is another matter of course, as is at > > exactly which level they become "de-exposed" if you don't want to use > > all of then. > > Well first we need to define what a partitionable group is, whether it's > based on hardware requirements or user policy. And while I agree that > we need unique ownership of a partition, I disagree that qemu is > necessarily the owner of the entire partition vs individual devices. Sorry, I didn't intend to have such circular logic. "... I disagree that qemu is necessarily the owner of the entire partition vs granted access to devices within the partition". Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Tue, 2011-08-02 at 18:28 +1000, David Gibson wrote: > On Sat, Jul 30, 2011 at 12:20:08PM -0600, Alex Williamson wrote: > > On Sat, 2011-07-30 at 09:58 +1000, Benjamin Herrenschmidt wrote: > [snip] > > On x86, the USB controllers don't typically live behind a PCIe-to-PCI > > bridge, so don't suffer the source identifier problem, but they do often > > share an interrupt. But even then, we can count on most modern devices > > supporting PCI2.3, and thus the DisINTx feature, which allows us to > > share interrupts. In any case, yes, it's more rare but we need to know > > how to handle devices behind PCI bridges. However I disagree that we > > need to assign all the devices behind such a bridge to the guest. > > There's a difference between removing the device from the host and > > exposing the device to the guest. > > I think you're arguing only over details of what words to use for > what, rather than anything of substance here. The point is that an > entire partitionable group must be assigned to "host" (in which case > kernel drivers may bind to it) or to a particular guest partition (or > at least to a single UID on the host). Which of the assigned devices > the partition actually uses is another matter of course, as is at > exactly which level they become "de-exposed" if you don't want to use > all of then. Well first we need to define what a partitionable group is, whether it's based on hardware requirements or user policy. And while I agree that we need unique ownership of a partition, I disagree that qemu is necessarily the owner of the entire partition vs individual devices. But feel free to dismiss it as unsubstantial. > [snip] > > > Maybe something like /sys/devgroups ? This probably warrants involving > > > more kernel people into the discussion. > > > > I don't yet buy into passing groups to qemu since I don't buy into the > > idea of always exposing all of those devices to qemu. Would it be > > sufficient to expose iommu nodes in sysfs that link to the devices > > behind them and describe properties and capabilities of the iommu > > itself? More on this at the end. > > Again, I don't think you're making a distinction of any substance. > Ben is saying the group as a whole must be set to allow partition > access, whether or not you call that "assigning". There's no reason > that passing a sysfs descriptor to qemu couldn't be the qemu > developer's quick-and-dirty method of putting the devices in, while > also allowing full assignment of the devices within the groups by > libvirt. Well, there is a reason for not passing a sysfs descriptor to qemu if qemu isn't the one defining the policy about how the members of that group are exposed. I tend to envision a userspace entity defining policy and granting devices to qemu. Do we really want separate developer vs production interfaces? > [snip] > > > Now some of this can be fixed with tweaks, and we've started doing it > > > (we have a working pass-through using VFIO, forgot to mention that, it's > > > just that we don't like what we had to do to get there). > > > > This is a result of wanting to support *unmodified* x86 guests. We > > don't have the luxury of having a predefined pvDMA spec that all x86 > > OSes adhere to. The 32bit problem is unfortunate, but the priority use > > case for assigning devices to guests is high performance I/O, which > > usually entails modern, 64bit hardware. 
I'd like to see us get to the > > point of having emulated IOMMU hardware on x86, which could then be > > backed by VFIO, but for now guest pinning is the most practical and > > useful. > > No-one's suggesting that this isn't a valid mode of operation. It's > just that right now conditionally disabling it for us is fairly ugly > because of the way the qemu code is structured. It really shouldn't be any more than skipping the cpu_register_phys_memory_client() and calling the map/unmap routines elsewhere. > [snip] > > > - I don't like too much the fact that VFIO provides yet another > > > different API to do what we already have at least 2 kernel APIs for, ie, > > > BAR mapping and config space access. At least it should be better at > > > using the backend infrastructure of the 2 others (sysfs & procfs). I > > > understand it wants to filter in some case (config space) and -maybe- > > > yet another API is the right way to go but allow me to have my doubts. > > > > The use of PCI sysfs is actually one of my complaints about current > > device assignment. To do assignment with an unprivileged guest we need > > to open the PCI sysfs config file for it, then change ownership on a > > handful of other PCI sysfs files, then there's this other pci-stub thing > > to maintain ownership, but the kvm ioctls don't actually require it and > > can grab onto any free device... We are duplicating some of that in > > VFIO, but we also put the ownership of the device behind a single device > > file. We do have the uiommu problem that we can't give an unp
Re: kvm PCI assignment & VFIO ramblings
On Tue, 2011-08-02 at 22:58 +1000, Benjamin Herrenschmidt wrote: > > Don't worry, it took me a while to get my head around the HW :-) SR-IOV > VFs will generally not have limitations like that no, but on the other > hand, they -will- still require 1 VF = 1 group, ie, you won't be able to > take a bunch of VFs and put them in the same 'domain'. > > I think the main deal is that VFIO/qemu sees "domains" as "guests" and > tries to put all devices for a given guest into a "domain". Actually, that's only a recent optimization, before that each device got it's own iommu domain. It's actually completely configurable on the qemu command line which devices get their own iommu and which share. The default optimizes the number of domains (one) and thus the number of mapping callbacks since we pin the entire guest. > On POWER, we have a different view of things were domains/groups are > defined to be the smallest granularity we can (down to a single VF) and > we give several groups to a guest (ie we avoid sharing the iommu in most > cases) > > This is driven by the HW design but that design is itself driven by the > idea that the domains/group are also error isolation groups and we don't > want to take all of the IOs of a guest down if one adapter in that guest > is having an error. > > The x86 domains are conceptually different as they are about sharing the > iommu page tables with the clear long term intent of then sharing those > page tables with the guest CPU own. We aren't going in that direction > (at this point at least) on POWER.. Yes and no. The x86 domains are pretty flexible and used a few different ways. On the host we do dynamic DMA with a domain per device, mapping only the inflight DMA ranges. In order to achieve the transparent device assignment model, we have to flip that around and map the entire guest. As noted, we can continue to use separate domains for this, but since each maps the entire guest, it doesn't add a lot of value and uses more resources and requires more mapping callbacks (and x86 doesn't have the best error containment anyway). If we had a well supported IOMMU model that we could adapt for pvDMA, then it would make sense to keep each device in it's own domain again. Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: kvm PCI assignment & VFIO ramblings
On Tue, 2011-08-02 at 11:27 +1000, Benjamin Herrenschmidt wrote: > It's a shared address space. With a basic configuration on p7ioc for > example we have MMIO going from 3G to 4G (PCI side addresses). BARs > contain the normal PCI address there. But that 1G is divided in 128 > segments of equal size which can separately be assigned to PE#'s. > > So BARs are allocated by firmware or the kernel PCI code so that devices > in different PEs don't share segments. > > Of course there's always the risk that a device can be hacked via a > sideband access to BARs to move out of it's allocated segment. That > means that the guest owning that device won't be able to access it > anymore and can potentially disturb a guest or host owning whatever is > in that other segment. Wait, what? I thought the MMIO segments were specifically so that if the device BARs moved out of the segment the guest only hurts itself and not the new segments overlapped. > The only way to enforce isolation here is to ensure that PE# are > entirely behind P2P bridges, since those would then ensure that even if > you put crap into your BARs you won't be able to walk over a neighbour. Ok, so the MMIO segments are really just a configuration nuance of the platform and being behind a P2P bridge is what allows you to hand off BARs to a guest (which needs to know the bridge window to do anything useful with them). Is that right? > I believe pHyp enforces that, for example, if you have a slot, all > devices & functions behind that slot pertain to the same PE# under pHyp. > > That means you cannot put individual functions of a device into > different PE# with pHyp. > > We plan to be a bit less restrictive here for KVM, assuming that if you > use a device that allows such a back-channel to the BARs, then it's your > problem to not trust such a device for virtualization. And most of the > time, you -will- have a P2P to protect you anyways. > > The problem doesn't exist (or is assumed as non-existing) for SR-IOV > since in that case, the VFs are meant to be virtualized, so pHyp assumes > there is no such back-channel and it can trust them to be in different > PE#. But you still need the P2P bridge to protect MMIO segments? Or do SR-IOV BARs need to be virtualized? I'm having trouble with the mental model of how you can do both. Thanks, Alex -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
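For concreteness, the p7ioc example Ben gives (a 1G MMIO window at PCI addresses 3G-4G carved into 128 equal segments) works out to 8MB per segment, and a BAR's PE# is simple arithmetic on its address, as in the sketch below. The constants describe that example configuration only.

/*
 * Worked example of the MMIO segmenting described above: which of the
 * 128 segments (and hence which PE) a given PCI address falls into.
 */
#include <stdint.h>
#include <stdio.h>

#define MMIO_BASE	0xC0000000ULL		/* 3G, PCI side    */
#define MMIO_SIZE	0x40000000ULL		/* 1G window       */
#define NR_SEGS		128
#define SEG_SIZE	(MMIO_SIZE / NR_SEGS)	/* 8MB per segment */

static int mmio_segment(uint64_t pci_addr)
{
	if (pci_addr < MMIO_BASE || pci_addr >= MMIO_BASE + MMIO_SIZE)
		return -1;
	return (pci_addr - MMIO_BASE) / SEG_SIZE;
}

int main(void)
{
	/* a BAR at 3G+24MB lands in segment 3; firmware must keep other
	 * PEs' BARs out of that same 8MB segment */
	printf("segment %d, segment size %lluMB\n",
	       mmio_segment(MMIO_BASE + 24 * 1024 * 1024),
	       (unsigned long long)(SEG_SIZE >> 20));
	return 0;
}

This also shows why the segments can't be reshuffled at runtime: moving a device to another PE means moving its BARs into a different 8MB slice of the window.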
Re: kvm PCI assignment & VFIO ramblings
On 08/02/2011 03:58 PM, Benjamin Herrenschmidt wrote: > > > > What you mean 2-level is two passes through two trees (ie 6 or 8 levels > > right ?). > > (16 or 25) 25 levels ? You mean 25 loads to get to a translation ? And you get any kind of performance out of that ? :-) Aggressive partial translation caching. Even then, performance does suffer on memory intensive workloads. The fix was transparent hugepages; that makes the page table walks much faster since they're fully cached, the partial translation caches become more effective, and the tlb itself becomes more effective. On some workloads, THP on both guest and host was faster than no-THP on bare metal. > > > > Not sure what you mean... the guest calls h-calls for every iommu page > > mapping/unmapping, yes. So the performance of these is critical. So yes, > > we'll eventually do it in kernel. We just haven't yet. > > I see. x86 traditionally doesn't do it for every request. We had some > proposals to do a pviommu that does map every request, but none reached > maturity. It's quite performance critical, you don't want to go anywhere near a full exit. On POWER we plan to handle that in "real mode" (ie MMU off) straight off the interrupt handlers, with the CPU still basically operating in guest context with HV permission. That is basically do the permission check, translation and whack the HW iommu immediately. If for some reason one step fails (!present PTE or something like that), we'd then fallback to an exit to Linux to handle it in a more "common" environment where we can handle page faults etc... I guess we can hack some kind of private interface, though I'd hoped to avoid it (and so far we succeeded - we can even get vfio to inject interrupts into kvm from the kernel without either knowing anything about the other). > > > Does the BAR value contain the segment base address? Or is that added > > > later? > > > > It's a shared address space. With a basic configuration on p7ioc for > > example we have MMIO going from 3G to 4G (PCI side addresses). BARs > > contain the normal PCI address there. But that 1G is divided in 128 > > segments of equal size which can separately be assigned to PE#'s. > > > > So BARs are allocated by firmware or the kernel PCI code so that devices > > in different PEs don't share segments. > > Okay, and config space virtualization ensures that the guest can't remap? Well, so it depends :-) With KVM we currently use whatever config space virtualization you do and so we somewhat rely on this but it's not very fool proof. I believe pHyp doesn't even bother filtering config space. As I said in another note, you can't trust adapters anyway. Plenty of them (video cards come to mind) have ways to get to their own config space via MMIO registers for example. Yes, we've seen that. So what pHyp does is that it always create PE's (aka groups) that are below a bridge. With PCIe, everything mostly is below a bridge so that's easy, but that does mean that you always have all functions of a device in the same PE (and thus in the same partition). SR-IOV is an exception to this rule since in that case the HW is designed to be trusted. That way, being behind a bridge, the bridge windows are going to define what can be forwarded to the device, and thus the system is immune to the guest putting crap into the BARs. It can't be remapped to overlap a neighbouring device. 
Note that the bridge itself isn't visible to the guest, so yes, config space is -somewhat- virtualized, typically pHyp make every pass-through PE look like a separate PCI host bridge with the devices below it. I think I see, yes. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
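Ben's "real mode" fast path for guest iommu updates (on POWER, the H_PUT_TCE hypercall that writes the guest's TCE table) would follow the shape sketched below: validate, translate, write the hardware entry, and punt to a full exit whenever anything is not trivially safe with the MMU off. Every helper and constant here is invented for illustration; it is not the eventual implementation.

/*
 * Hypothetical sketch of a real-mode H_PUT_TCE fast path: permission
 * check, guest-physical to host-physical translation, direct write of
 * the HW iommu entry, with a "too hard" fallback that forces the normal
 * exit path to handle faults and corner cases.
 */
#include <linux/kvm_host.h>

#define H_OK	0	/* illustrative return codes; the real  */
#define H_PUNT	(-1)	/* PAPR/KVM values differ               */

struct tce_table {			/* invented, minimal */
	unsigned long window_size;
};

/* invented helpers standing in for real-mode-safe lookups */
struct tce_table *rm_find_tce_table(struct kvm *kvm, unsigned long liobn);
int rm_gpa_to_hpa(struct kvm *kvm, unsigned long gpa, unsigned long *hpa);
void rm_write_tce(struct tce_table *tbl, unsigned long ioba,
		  unsigned long entry);

long rm_h_put_tce(struct kvm_vcpu *vcpu, unsigned long liobn,
		  unsigned long ioba, unsigned long tce)
{
	struct tce_table *tbl;
	unsigned long hpa;

	/* 1. permission check: does this guest own this iommu window? */
	tbl = rm_find_tce_table(vcpu->kvm, liobn);
	if (!tbl || ioba >= tbl->window_size)
		return H_PUNT;

	/* 2. translate; if the page isn't resident we can't fix that
	 *    with the MMU off, so fall back to the full exit path     */
	if (rm_gpa_to_hpa(vcpu->kvm, tce, &hpa))
		return H_PUNT;

	/* 3. whack the hardware iommu entry directly, still in real mode */
	rm_write_tce(tbl, ioba, hpa | (tce & 0x3));	/* low bits: perms? */
	return H_OK;
}

Whether the permission bits really sit in the low bits of the TCE value is itself an assumption here; the point is only the check/translate/write/fallback ordering.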