Re: Re: KVM PCI device assignment issues
Matthew Wilcox wrote:
> On Fri, Feb 13, 2009 at 04:32:47PM +, Mark McLoughlin wrote:
> > - Secondary Bus Reset (SBR) allows software to trigger a reset on all
> >   devices (and functions) behind a PCI bridge.
> >
> > - A PCI Power Management D-state transition (D3hot to D0) can be used
> >   to reset a device (all functions).
>
> That's not guaranteed according to PCI PM 1.2:
>
> 5.4.1. Software Accessible D3 (D3hot)
>
> When programmed to D0, the function may return to the D0 Initialized
> or D0 Uninitialized state without PCI RST# being asserted. This option
> is determined at design time and allows designs the option of either
> performing an internal reset or not performing an internal reset.

The No_Soft_Reset bit in the PMCSR indicates which option is chosen at
design time. Section 3.2.4 says:

  Value at Reset: Device specific
  Read/Write: Read Only

  When set (“1”), this bit indicates that devices transitioning from
  D3hot to D0 because of PowerState commands do not perform an internal
  reset. Configuration Context is preserved. Upon transition from the
  D3hot to the D0 Initialized state, no additional operating system
  intervention is required to preserve Configuration Context beyond
  writing the PowerState bits.

  When clear (“0”), devices do perform an internal reset upon
  transitioning from D3hot to D0 via software control of the PowerState
  bits. Configuration Context is lost when performing the soft reset.
  Upon transition from the D3hot to the D0 state, full reinitialization
  sequence is needed to return the device to D0 Initialized.

  Regardless of this bit, devices that transition from D3hot to D0 by a
  system or bus segment reset will return to the device state D0
  Uninitialized with only PME context preserved if PME is supported and
  enabled.

So the reset is guaranteed if the bit is 0. I checked the devices on my
machine: all of those that have PM perform an internal reset when
transitioning from D3hot to D0 (GeForce 7300, Myri-10G, E1000 82567,
ICH10 SATA, EHCI, etc.)
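For anyone wanting to repeat that check, here is a minimal sketch that decodes a raw PMCSR value and tests No_Soft_Reset, which is bit 3 of the PMCSR. The function name is made up for illustration. The setpci call in the comment assumes a pciutils build that resolves the CAP_PM capability name, and "$addr" is a placeholder for a device address.

```shell
# no_soft_reset_set PMCSR_HEX
# Succeeds (exit status 0) when bit 3 (No_Soft_Reset) is set in the given
# PMCSR value, i.e. when a D3hot -> D0 transition does NOT reset the function.
# PMCSR_HEX is a hex string without the 0x prefix, as setpci prints it.
no_soft_reset_set() {
    pmcsr=$(( 0x$1 ))
    [ $(( pmcsr & 0x8 )) -ne 0 ]
}

# Possible use on a real device (assumption: PMCSR is read at offset 4 of
# the PM capability):
#   if no_soft_reset_set "$(setpci -s "$addr" CAP_PM+4.w)"; then
#       echo "D3hot -> D0 will not reset this function"
#   fi
```

A device for which the function fails (bit clear) is one where, per the section quoted above, the D-state transition is guaranteed to perform an internal reset.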
-- To unsubscribe from this list: send the line "unsubscribe kvm" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: KVM PCI device assignment issues
* Chris Wright (chr...@redhat.com) wrote:
> * Matthew Wilcox (matt...@wil.cx) wrote:
> > I might suggest a second approach which would be to have an explicit
> > echo to the bind file ignore the list of ids. Then you wouldn't need to
> > 'echo -n "8086 10de"' to begin with.
>
> I tried that first, and it dips into the driver logic, where it wants
> to filter via ->match. Untested patch below _should_ be enough to avoid
> adding the id to begin with.

OK, after making it actually compile. Still gets trapped into generic
logic, this time in pci core. I'm starting to remember why dynid looked
like the better option.

pci_device_probe
  __pci_device_probe
    pci_match_device() <-- fails
Re: KVM PCI device assignment issues
* Matthew Wilcox (matt...@wil.cx) wrote:
> On Fri, Feb 13, 2009 at 04:32:47PM +, Mark McLoughlin wrote:
> > - Conventional PCI devices (i.e. PCI/PCI-X, not PCIe) behind the same
> >   bridge must be assigned to the same VT-d domain - i.e. given device
> >   A (:0f:1.0) and device B (:0f:2.0), if you assign
> >   device A to guest, you cannot then use device B in the host or
> >   another guest.
>
> Is that a limitation of the VT-d / IOMMU setup?

Yes. The source id will essentially show up as the bridge.

> > - Some newer PCIe devices (and newer conventional PCI devices too via
> >   PCI Advanced Features) support Function Level Reset (FLR). This
> >   allows a PCI function to be reset without affecting any other
> >   functions on that device, or any other devices. This feature is not
> >   widespread yet AFAIK - e.g. I've seen it on an audio controller,
> >   and it must also be supported by SR-IOV devices.
>
> Yes, that's definitely not very widespread yet. OTOH, we don't need to
> worry about disturbing other functions if all devices behind the same
> bridge have to be mapped to the same guest.

FLR (when it exists) would work fine for devices not behind a
conventional pci bridge.

> > Driver Unbinding
> > ================
> >
> > Before a device is assigned to a guest, we should make sure that no host
> > device driver is currently bound to the device.
> >
> > We can do that with e.g.
> >
> >   $> echo -n "8086 10de" > /sys/bus/pci/drivers/pci-stub/new_id
> >   $> echo -n :00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
> >   $> echo -n :00:19.0 > /sys/bus/pci/drivers/pci-stub/bind
> >
> > One minor problem with this scheme is that at this point you can't
> > unbind from pci-stub and trigger a re-probe and have e1000e bind to it.
> > In order to support that, we need a "remove_id" interface to remove the
> > dynamic ID.
>
> It sounds like you'd be OK with a 'remove_id' interface that only
> removes subsequently-added interfaces.
>
> I might suggest a second approach which would be to have an explicit
> echo to the bind file ignore the list of ids. Then you wouldn't need to
> 'echo -n "8086 10de"' to begin with.

I tried that first, and it dips into the driver logic, where it wants
to filter via ->match. Untested patch below _should_ be enough to avoid
adding the id to begin with.

> > Furthermore, it should be possible to do this without actually affecting
> > any of the devices - i.e. a "try to unbind and see if we oops" approach
> > clearly isn't great.
>
> Well, yes. I'd even be upset if my network or storage flickered away
> briefly while another user was starting to run KVM.
>
> > This last constraint is the most difficult and points to the logic
> > needing to be in userland management libraries. Possibly the only sane
> > kernel space support would be "try to unbind and reset; if it works then
> > the device is assignable".
>
> If we expose a 'reset' file in the /sys/bus/pci/devices/*/ directories
> for devices that are resettable, that should be enough, I would think.

Yes, currently it's only internal and it's not robust given the reset
constraints.
thanks,
-chris
--
 drivers/base/base.h |    1 +
 drivers/base/bus.c  |    2 +-
 drivers/base/dd.c   |   15 +--
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/base/base.h b/drivers/base/base.h
index 0a5f055..60dc346 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -86,6 +86,7 @@ extern void bus_remove_driver(struct device_driver *drv);
 
 extern void driver_detach(struct device_driver *drv);
 extern int driver_probe_device(struct device_driver *drv, struct device *dev);
+extern int driver_bind_probe_device(struct device_driver *drv, struct device *dev);
 
 extern void sysdev_shutdown(void);
 extern int sysdev_suspend(pm_message_t state);
diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index 83f32b8..ad28338 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -202,7 +202,7 @@ static ssize_t driver_bind(struct device_driver *drv,
 		if (dev->parent)	/* Needed for USB */
 			down(&dev->parent->sem);
 		down(&dev->sem);
-		err = driver_probe_device(drv, dev);
+		err = driver_bind_probe_device(drv, dev);
 		up(&dev->sem);
 		if (dev->parent)
 			up(&dev->parent->sem);
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 315bed8..fba6463 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -184,13 +184,14 @@ int driver_probe_done(void)
  * This function must be called with @dev->sem held. When called for a
  * USB interface, @dev->parent->sem must be held as well.
  */
-int driver_probe_device(struct device_driver *drv, struct device *dev)
+static int __driver_probe_device(struct device_driver *drv, struct device *dev,
+				 bool force)
 {
 	int ret = 0;
Re: KVM PCI device assignment issues
On Fri, Feb 13, 2009 at 04:32:47PM +, Mark McLoughlin wrote:
> Hi,

You raise some interesting points. Thanks for doing that rather than
going off and creating a big pile of patches and demanding they be
applied ;-)

> This gets confusing, so some background constraints first:
>
> - Conventional PCI devices (i.e. PCI/PCI-X, not PCIe) behind the same
>   bridge must be assigned to the same VT-d domain - i.e. given device
>   A (:0f:1.0) and device B (:0f:2.0), if you assign
>   device A to guest, you cannot then use device B in the host or
>   another guest.

Is that a limitation of the VT-d / IOMMU setup?

> - Some newer PCIe devices (and newer conventional PCI devices too via
>   PCI Advanced Features) support Function Level Reset (FLR). This
>   allows a PCI function to be reset without affecting any other
>   functions on that device, or any other devices. This feature is not
>   widespread yet AFAIK - e.g. I've seen it on an audio controller,
>   and it must also be supported by SR-IOV devices.

Yes, that's definitely not very widespread yet. OTOH, we don't need to
worry about disturbing other functions if all devices behind the same
bridge have to be mapped to the same guest.

> - Secondary Bus Reset (SBR) allows software to trigger a reset on all
>   devices (and functions) behind a PCI bridge.
>
> - A PCI Power Management D-state transition (D3hot to D0) can be used
>   to reset a device (all functions).

That's not guaranteed according to PCI PM 1.2:

5.4.1. Software Accessible D3 (D3hot)

When programmed to D0, the function may return to the D0 Initialized
or D0 Uninitialized state without PCI RST# being asserted. This option
is determined at design time and allows designs the option of either
performing an internal reset or not performing an internal reset.

- There's also the option that devices in a hotplug PCI slot can have
  their power cycled, forcing them into D3cold and then transitioning
  into D0 Uninitialised.

> - Some PCI devices don't have page aligned MMIO BARs. These devices
>   (all functions) cannot be safely assigned to guests.

We've seen patches to force page alignment on this list ... they haven't
been sufficiently beautiful to be applied yet.

> Driver Unbinding
> ================
>
> Before a device is assigned to a guest, we should make sure that no host
> device driver is currently bound to the device.
>
> We can do that with e.g.
>
>   $> echo -n "8086 10de" > /sys/bus/pci/drivers/pci-stub/new_id
>   $> echo -n :00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
>   $> echo -n :00:19.0 > /sys/bus/pci/drivers/pci-stub/bind
>
> One minor problem with this scheme is that at this point you can't
> unbind from pci-stub and trigger a re-probe and have e1000e bind to it.
> In order to support that, we need a "remove_id" interface to remove the
> dynamic ID.

It sounds like you'd be OK with a 'remove_id' interface that only
removes subsequently-added interfaces.

I might suggest a second approach which would be to have an explicit
echo to the bind file ignore the list of ids. Then you wouldn't need to
'echo -n "8086 10de"' to begin with.

> Device Reset
> ============
>
> Before assigning a device to a guest, it should be reset. The host or a
> previous guest may have left the device in an unknown state. Not
> resetting can be seen in testing to lead to e.g. "TX Unit Hang" errors
> with e1000e devices.

Really, this is the same problem that kexec has. Either the driver is
doing insufficient initialisation, or it's not doing its shutdown
properly. The former is definitely better than the latter as kexec may
be used from a position of having the driver locked solid and unable to
reset the device.

> If we're assigning devices from behind a PCI/PCI-X bridge (remember all
> devices must be assigned together), then we can use SBR to reset them
> all together. Clearly, though, one should make sure that all devices
> behind that bridge are not in use before doing the reset. We could
> implement this with a "reset" sysfs interface for pci-stub - it would
> only reset a device using SBR if all devices behind that bridge were
> bound to pci-stub.

I don't think this should be part of pci-stub, but rather part of the
PCI core. I can imagine other uses for being able to reset all devices
behind a bridge that don't involve anything to do with v12n. So I'd like
to see a /sys/class/pci_bus/*/reset (where * would not include root
busses).

> Where a conventional PCI device is on the root bus, or where a PCIe
> device is on the root bus or another bus with multiple devices, we could
> use the D-state transition reset. Since this resets all functions on a
> device, we would need a similar approach where all functions must be
> bound to pci-stub before being reset.

Even with the caveat above that D0 -> D3hot -> D0 doesn't necessarily
do a full reset, it does seem to be per-function. For example, this
passage fro
Re: KVM PCI device assignment issues
On Fri, 2009-02-13 at 08:56 -0800, Greg KH wrote:
> On Fri, Feb 13, 2009 at 04:32:47PM +, Mark McLoughlin wrote:
> > Driver Unbinding
> > ================
> >
> > Before a device is assigned to a guest, we should make sure that no host
> > device driver is currently bound to the device.
> >
> > We can do that with e.g.
> >
> >   $> echo -n "8086 10de" > /sys/bus/pci/drivers/pci-stub/new_id
> >   $> echo -n :00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
> >   $> echo -n :00:19.0 > /sys/bus/pci/drivers/pci-stub/bind
> >
> > One minor problem with this scheme is that at this point you can't
> > unbind from pci-stub and trigger a re-probe and have e1000e bind to it.
>
> Are you sure? It should work if you manually tell the e1000e driver to
> bind to it, after unbinding it from the pci-stub driver.

Yes, that works - I meant using /sys/bus/pci/drivers_probe.

The problem is that it would suck for management tools to have to
remember which device driver it was originally bound to.

> > In order to support that, we need a "remove_id" interface to remove the
> > dynamic ID.
>
> Why?

Before assignment:

  $> echo -n "8086 10de" > /sys/bus/pci/drivers/pci-stub/new_id
  $> echo -n :00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
  $> echo -n :00:19.0 > /sys/bus/pci/drivers/pci-stub/bind
  $> echo -n "8086 10de" > /sys/bus/pci/drivers/pci-stub/remove_id

After assignment:

  $> echo -n :00:19.0 > /sys/bus/pci/drivers_probe

> > What we don't support is a way to unbind permanently. Xen has a
> > pciback.hide module param which tries to achieve this, but you end up
> > with the inevitable issues around making sure pciback is loaded before
> > the device driver etc.
>
> What do you mean, unbind "permanently"? For every reboot? Or just
> within the same boot time?

Across reboots, yeah.

Cheers,
Mark.
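The "before assignment" sequence can be wrapped up so a management tool does not have to hard-code the current driver name. This is only a sketch: the function name and the sysfs-root parameter are illustrative (the root is a parameter so the steps can be tried against a scratch directory), and remove_id is the interface proposed in this thread, so the script writes to it only if the file exists.

```shell
# stub_device SYSFS_ROOT PCI_ADDR "VENDOR DEVICE"
# Hands a device over to pci-stub using the sequence from the thread.
# SYSFS_ROOT is normally /sys.
stub_device() {
    root=$1 addr=$2 ids=$3
    drivers=$root/bus/pci/drivers

    # Teach pci-stub the vendor/device pair so its ID match succeeds.
    printf '%s' "$ids" > "$drivers/pci-stub/new_id" || return 1

    # Unbind from whichever host driver currently owns the device, if any.
    # The driver name is taken from the device's "driver" symlink.
    if [ -L "$root/bus/pci/devices/$addr/driver" ]; then
        cur=$(basename "$(readlink "$root/bus/pci/devices/$addr/driver")")
        printf '%s' "$addr" > "$drivers/$cur/unbind" || return 1
    fi

    printf '%s' "$addr" > "$drivers/pci-stub/bind" || return 1

    # Proposed interface (assumption: may not exist on a given kernel).
    # Dropping the dynamic ID lets a later write to drivers_probe find
    # the original driver instead of pci-stub.
    if [ -e "$drivers/pci-stub/remove_id" ]; then
        printf '%s' "$ids" > "$drivers/pci-stub/remove_id" || return 1
    fi
}
```

With the dynamic ID removed again, the "after assignment" step stays a single write of the device address to drivers_probe, and nothing has to remember that the device used to belong to e1000e.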
Re: KVM PCI device assignment issues
On Fri, Feb 13, 2009 at 04:32:47PM +, Mark McLoughlin wrote:
> Driver Unbinding
> ================
>
> Before a device is assigned to a guest, we should make sure that no host
> device driver is currently bound to the device.
>
> We can do that with e.g.
>
>   $> echo -n "8086 10de" > /sys/bus/pci/drivers/pci-stub/new_id
>   $> echo -n :00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
>   $> echo -n :00:19.0 > /sys/bus/pci/drivers/pci-stub/bind
>
> One minor problem with this scheme is that at this point you can't
> unbind from pci-stub and trigger a re-probe and have e1000e bind to it.

Are you sure? It should work if you manually tell the e1000e driver to
bind to it, after unbinding it from the pci-stub driver.

> In order to support that, we need a "remove_id" interface to remove the
> dynamic ID.

Why?

> What we don't support is a way to unbind permanently. Xen has a
> pciback.hide module param which tries to achieve this, but you end up
> with the inevitable issues around making sure pciback is loaded before
> the device driver etc.

What do you mean, unbind "permanently"? For every reboot? Or just
within the same boot time?

thanks,

greg k-h
KVM PCI device assignment issues
Hi,

KVM has support for PCI device assignment using VT-d and AMD IOMMU, but
there are a number of inter-related issues that need some further
discussion:

  - Unbinding devices from any existing device driver before assignment

  - Resetting devices before and after assignment

  - Helping users figure out which devices can actually be assigned

This gets confusing, so some background constraints first:

  - Conventional PCI devices (i.e. PCI/PCI-X, not PCIe) behind the same
    bridge must be assigned to the same VT-d domain - i.e. given device
    A (:0f:1.0) and device B (:0f:2.0), if you assign
    device A to guest, you cannot then use device B in the host or
    another guest.

  - Some newer PCIe devices (and newer conventional PCI devices too via
    PCI Advanced Features) support Function Level Reset (FLR). This
    allows a PCI function to be reset without affecting any other
    functions on that device, or any other devices. This feature is not
    widespread yet AFAIK - e.g. I've seen it on an audio controller,
    and it must also be supported by SR-IOV devices.

  - Secondary Bus Reset (SBR) allows software to trigger a reset on all
    devices (and functions) behind a PCI bridge.

  - A PCI Power Management D-state transition (D3hot to D0) can be used
    to reset a device (all functions).

  - Some PCI devices don't have page aligned MMIO BARs. These devices
    (all functions) cannot be safely assigned to guests.

Driver Unbinding
================

Before a device is assigned to a guest, we should make sure that no host
device driver is currently bound to the device.

We can do that with e.g.

  $> echo -n "8086 10de" > /sys/bus/pci/drivers/pci-stub/new_id
  $> echo -n :00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
  $> echo -n :00:19.0 > /sys/bus/pci/drivers/pci-stub/bind

One minor problem with this scheme is that at this point you can't
unbind from pci-stub and trigger a re-probe and have e1000e bind to it.
In order to support that, we need a "remove_id" interface to remove the
dynamic ID.
What we don't support is a way to unbind permanently. Xen has a
pciback.hide module param which tries to achieve this, but you end up
with the inevitable issues around making sure pciback is loaded before
the device driver etc. Permanent unbinding isn't necessarily needed, but
it might help provide a solution to some of the nastier issues below.

Device Reset
============

Before assigning a device to a guest, it should be reset. The host or a
previous guest may have left the device in an unknown state. Not
resetting can be seen in testing to lead to e.g. "TX Unit Hang" errors
with e1000e devices.

FLR is without doubt the preferable solution here. KVM already
implements this. However, the range of devices which support FLR is
currently quite limited.

If we're assigning devices from behind a PCI/PCI-X bridge (remember all
devices must be assigned together), then we can use SBR to reset them
all together. Clearly, though, one should make sure that all devices
behind that bridge are not in use before doing the reset. We could
implement this with a "reset" sysfs interface for pci-stub - it would
only reset a device using SBR if all devices behind that bridge were
bound to pci-stub.

Where a conventional PCI device is on the root bus, or where a PCIe
device is on the root bus or another bus with multiple devices, we could
use the D-state transition reset. Since this resets all functions on a
device, we would need a similar approach where all functions must be
bound to pci-stub before being reset.

Furthermore, we would need to prevent pci-stub from resetting a device
it is bound to where the device is already assigned to a guest. To
achieve this, we would want KVM to explicitly call in to pci-stub to
mark a device as in use.
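The "all devices behind that bridge are bound to pci-stub" precondition for an SBR could be checked from userspace roughly as follows. This is a sketch under stated assumptions: the function name is invented, it relies on child functions appearing as domain:bus:slot.func subdirectories of the bridge's sysfs device directory, it only looks at direct children (nested bridges are not followed), and it does not cover the "already assigned to a guest" case described above.

```shell
# all_children_stubbed BRIDGE_DIR
# Succeeds only if every PCI function directly below the bridge is bound
# to pci-stub. BRIDGE_DIR is the bridge's sysfs device directory.
all_children_stubbed() {
    for child in "$1"/[0-9a-f][0-9a-f][0-9a-f][0-9a-f]:*; do
        [ -d "$child" ] || continue          # glob matched nothing
        [ -L "$child/driver" ] || return 1   # no driver bound at all
        drv=$(basename "$(readlink "$child/driver")")
        [ "$drv" = "pci-stub" ] || return 1  # owned by a host driver
    done
    return 0
}
```

An unbound function is treated as a failure here, because the rule stated above is "bound to pci-stub" rather than "not bound to anything else". A bridge with no children passes vacuously.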
The alternatives to such an approach are:

  a) Only support FLR capable devices

  b) Cross our fingers and hope that things work without a device reset

  c) Allow a driver to be permanently unbound from a device and require
     the user to reboot after unbinding before assigning

Filtering
=========

In order to support a sane user interface in management tools, it should
be possible to list all PCI devices available on a host and filter out
those which cannot be assigned to a guest.

Furthermore, it should be possible to do this without actually affecting
any of the devices - i.e. a "try to unbind and see if we oops" approach
clearly isn't great.

Finally, some management tools would like to be able to do this
filtering given the constraint of a device being reserved for a
currently inactive guest.

This last constraint is the most difficult and points to the logic
needing to be in userland management libraries. Possibly the only sane
kernel space support would be "try to unbind and reset; if it works then
the device is assignable".

Conclusions
===========

Only supporting devices with FLR restricts our user pool far too
severely. Permanent unbind