Re: Re: KVM PCI device assignment issues

2009-02-24 Thread Zhao, Yu

Matthew Wilcox wrote:
> On Fri, Feb 13, 2009 at 04:32:47PM +, Mark McLoughlin wrote:
> >   - Secondary Bus Reset (SBR) allows software to trigger a reset on all
> > devices (and functions) behind a PCI bridge.
> >
> >   - A PCI Power Management D-state transition (D3hot to D0) can be used
> > to reset a device (all functions).
>
> That's not guaranteed according to PCI PM 1.2:
>
> 5.4.1. Software Accessible D3 (D3hot)
>
>   When programmed to D0, the function may return to the D0 Initialized
>   or D0 Uninitialized state without PCI RST# being asserted. This option
>   is determined at design time and allows designs the option of either
>   performing an internal reset or not performing an internal reset.


The No_Soft_Reset bit in the PMCSR indicates which option is chosen at 
design time:


Section 3.2.4. says:
Value at Reset: Device specific
Read/Write: Read Only
When set (“1”), this bit indicates that devices transitioning from D3hot 
to D0 because of PowerState commands do not perform an internal reset. 
Configuration Context is preserved. Upon transition from the D3hot to 
the D0 Initialized state, no additional operating system intervention is 
required to preserve Configuration Context beyond writing the PowerState 
bits. When clear (“0”), devices do perform an internal reset upon 
transitioning from D3hot to D0 via software control of the PowerState 
bits. Configuration Context is lost when performing the soft reset. Upon 
 transition from the D3hot to the D0 state, full reinitialization 
sequence is needed to return the device to D0 Initialized. Regardless of 
this bit, devices that transition from D3hot to D0 by a system or bus 
segment reset will return to the device state D0 Uninitialized with only 
PME context preserved if PME is supported and enabled.


So the reset is guaranteed if the bit is 0.

And I checked the devices on my machine: all of those that have PM perform an
internal reset when transitioning from D3hot to D0 (GeForce 7300, Myri-10G,
E1000 82567, ICH10 SATA, EHCI, etc.)
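
For anyone wanting to repeat that check, here is a minimal sketch of the bit
test. It assumes the 16-bit PMCSR value has already been read as bare hex
digits, for example with pciutils' setpci (the CAP_PM register name in the
usage comment is an assumption about the installed setpci version). The
function name is made up for illustration.

```shell
# Succeed (exit 0) if a D3hot -> D0 transition performs an internal reset,
# i.e. No_Soft_Reset (bit 3 of the PMCSR) is clear.
# $1: PMCSR value as hex digits without a 0x prefix, e.g. "0008".
pmcsr_resets_on_d0() {
    pmcsr_val=$(( 0x$1 ))
    [ $(( pmcsr_val & 0x0008 )) -eq 0 ]
}

# Hypothetical usage (PMCSR sits at offset 4 of the PM capability):
#   if pmcsr_resets_on_d0 "$(setpci -s <device> CAP_PM+4.w)"; then
#       echo "D3hot -> D0 resets this function"
#   fi
```

The other PMCSR bits (PowerState, PME_En, PME_Status) are masked out, so the
current power state of the device does not affect the answer.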

--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: KVM PCI device assignment issues

2009-02-13 Thread Chris Wright
* Chris Wright (chr...@redhat.com) wrote:
> * Matthew Wilcox (matt...@wil.cx) wrote:
> > I might suggest a second approach which would be to have an explicit
> > echo to the bind file ignore the list of ids.  Then you wouldn't need to
> > 'echo -n "8086 10de"' to begin with.
> 
> I tried that first, and it dips into the driver logic, where it wants
> to filter via ->match.  Untested patch below _should_ be enough to avoid
> adding the id to begin with.

OK, after making it actually compile.  Still gets trapped into generic
logic, this time in pci core.  I'm starting to remember why dynid looked
like the better option.

pci_device_probe
  __pci_device_probe
pci_match_device()  <-- fails


Re: KVM PCI device assignment issues

2009-02-13 Thread Chris Wright
* Matthew Wilcox (matt...@wil.cx) wrote:
> On Fri, Feb 13, 2009 at 04:32:47PM +, Mark McLoughlin wrote:
> >   - Conventional PCI devices (i.e. PCI/PCI-X, not PCIe) behind the same 
> > bridge must be assigned to the same VT-d domain - i.e given device 
> > A (:0f:1.0) and device B (and :0f:2.0), if you assign 
> > device A to guest, you cannot then use device B in the host or 
> > another guest.
> 
> Is that a limitation of the VT-d / IOMMU setup?

Yes.  The source id will essentially show up as the bridge.
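
As a rough illustration of what that means for tooling, the devices that
would share a source id can be found by walking sysfs: a device's real
directory sits inside its parent bridge's directory, next to its siblings.
The sketch below assumes that layout; the function name is hypothetical, and
the result only matters when the parent really is a conventional PCI bridge
(on the root bus it simply lists every root-bus device).

```shell
# Print the PCI addresses of the other devices that share a parent bridge
# with the given device.
# $1: device directory, normally /sys/bus/pci/devices/<address>
pci_siblings() {
    sib_dev=$(readlink -f "$1") || return 1
    sib_bridge=$(dirname "$sib_dev")
    # Match only names of the form domain:bus:dev.fn, so that other sysfs
    # entries in the bridge directory are skipped.
    for sib_d in "$sib_bridge"/????:??:??.?; do
        if [ -d "$sib_d" ] && [ "$sib_d" != "$sib_dev" ]; then
            basename "$sib_d"
        fi
    done
    return 0
}
```

Everything this prints would have to go to the same guest as the device
itself.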

> >   - Some newer PCIe devices (and newer conventional PCI devices too via 
> > PCI Advanced Features) support Function Level Reset (FLR). This 
> > allows a PCI function to be reset without affecting any other 
> > functions on that device, or any other devices. This feature is not 
> > widespread yet AFAIK - e.g. I've seen it on an audio controller, 
> > and it must also be supported by SR-IOV devices.
> 
> Yes, that's definitely not very widespread yet.  OTOH, we don't need to
> worry about disturbing other functions if all devices behind the same
> bridge have to be mapped to the same guest.

FLR (when it exists) would work fine for devices not behind a conventional
pci bridge.

> > Driver Unbinding
> > 
> > 
> > Before a device is assigned to a guest, we should make sure that no host
> > device driver is currently bound to the device.
> > 
> > We can do that with e.g.
> > 
> >  $> echo -n "8086 10de"  > /sys/bus/pci/drivers/pci-stub/new_id
> >  $> echo -n :00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
> >  $> echo -n :00:19.0 > /sys/bus/pci/drivers/pci-stub/bind
> > 
> > One minor problem with this scheme is that at this point you can't
> > unbind from pci-stub and trigger a re-probe and have e1000e bind to it.
> > In order to support that, we need a "remove_id" interface to remove the
> > dynamic ID.
> 
> It sounds like you'd be OK with a 'remove_id' interface that only
> removes subsequently-added interfaces.
> 
> I might suggest a second approach which would be to have an explicit
> echo to the bind file ignore the list of ids.  Then you wouldn't need to
> 'echo -n "8086 10de"' to begin with.

I tried that first, and it dips into the driver logic, where it wants
to filter via ->match.  Untested patch below _should_ be enough to avoid
adding the id to begin with.

> > Furthermore, it should be possible to do this without actually affecting
> > any of the devices - i.e. a "try to unbind and see if we oops" approach
> > clearly isn't great.
> 
> Well, yes.  I'd even be upset if my network or storage flickered away
> briefly while another user was starting to run KVM.
> 
> > This last constraint is the most difficult and points to the logic
> > needing to be in userland management libraries. Possibly the only sane
> > kernel space support would be "try to unbind and reset; if it works then
> > the device is assignable".
> 
> If we expose a 'reset' file in the /sys/bus/pci/devices/*/ directories
> for devices that are resettable, that should be enough, I would think.

Yes, currently it's only internal and it's not robust given the reset
constraints.
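
If such a per-device 'reset' file were exposed, a management tool could
filter on it without touching the hardware. A minimal sketch, assuming the
attribute proposed above exists and is simply absent for devices that cannot
be reset (both are assumptions, as is the function name):

```shell
# Print the addresses of PCI devices that expose a 'reset' attribute.
# $1: sysfs root, normally /sys
resettable_devices() {
    for rst_d in "$1"/bus/pci/devices/*; do
        if [ -e "$rst_d/reset" ]; then
            basename "$rst_d"
        fi
    done
    return 0
}
```

This only reads directory entries, so it avoids the "try it and see"
approach.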

thanks,
-chris
--

 drivers/base/base.h |1 +
 drivers/base/bus.c  |2 +-
 drivers/base/dd.c   |   15 +--
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/drivers/base/base.h b/drivers/base/base.h
index 0a5f055..60dc346 100644
--- a/drivers/base/base.h
+++ b/drivers/base/base.h
@@ -86,6 +86,7 @@ extern void bus_remove_driver(struct device_driver *drv);
 
 extern void driver_detach(struct device_driver *drv);
 extern int driver_probe_device(struct device_driver *drv, struct device *dev);
+extern int driver_bind_probe_device(struct device_driver *drv, struct device *dev);
 
 extern void sysdev_shutdown(void);
 extern int sysdev_suspend(pm_message_t state);
diff --git a/drivers/base/bus.c b/drivers/base/bus.c
index 83f32b8..ad28338 100644
--- a/drivers/base/bus.c
+++ b/drivers/base/bus.c
@@ -202,7 +202,7 @@ static ssize_t driver_bind(struct device_driver *drv,
 		if (dev->parent)	/* Needed for USB */
 			down(&dev->parent->sem);
 		down(&dev->sem);
-		err = driver_probe_device(drv, dev);
+		err = driver_bind_probe_device(drv, dev);
 		up(&dev->sem);
 		if (dev->parent)
 			up(&dev->parent->sem);
diff --git a/drivers/base/dd.c b/drivers/base/dd.c
index 315bed8..fba6463 100644
--- a/drivers/base/dd.c
+++ b/drivers/base/dd.c
@@ -184,13 +184,14 @@ int driver_probe_done(void)
  * This function must be called with @dev->sem held.  When called for a
  * USB interface, @dev->parent->sem must be held as well.
  */
-int driver_probe_device(struct device_driver *drv, struct device *dev)
+static int __driver_probe_device(struct device_driver *drv, struct device *dev,
+				 bool force)
 {
 	int ret = 0;
 
  

Re: KVM PCI device assignment issues

2009-02-13 Thread Matthew Wilcox
On Fri, Feb 13, 2009 at 04:32:47PM +, Mark McLoughlin wrote:
> Hi,

You raise some interesting points.  Thanks for doing that rather than
going off and creating a big pile of patches and demanding they be
applied ;-)

> This gets confusing, so some background constraints first:
> 
>   - Conventional PCI devices (i.e. PCI/PCI-X, not PCIe) behind the same 
> bridge must be assigned to the same VT-d domain - i.e given device 
> A (:0f:1.0) and device B (and :0f:2.0), if you assign 
> device A to guest, you cannot then use device B in the host or 
> another guest.

Is that a limitation of the VT-d / IOMMU setup?

>   - Some newer PCIe devices (and newer conventional PCI devices too via 
> PCI Advanced Features) support Function Level Reset (FLR). This 
> allows a PCI function to be reset without affecting any other 
> functions on that device, or any other devices. This feature is not 
> widespread yet AFAIK - e.g. I've seen it on an audio controller, 
> and it must also be supported by SR-IOV devices.

Yes, that's definitely not very widespread yet.  OTOH, we don't need to
worry about disturbing other functions if all devices behind the same
bridge have to be mapped to the same guest.

>   - Secondary Bus Reset (SBR) allows software to trigger a reset on all 
> devices (and functions) behind a PCI bridge.
> 
>   - A PCI Power Management D-state transition (D3hot to D0) can be used 
> to reset a device (all functions).

That's not guaranteed according to PCI PM 1.2:

5.4.1. Software Accessible D3 (D3hot)

  When programmed to D0, the function may return to the D0 Initialized
  or D0 Uninitialized state without PCI RST# being asserted. This option
  is determined at design time and allows designs the option of either
  performing an internal reset or not performing an internal reset.

-

There's also the option that devices in a hotplug PCI slot can have
their power cycled, forcing them into D3cold and then transitioning into
D0 Uninitialised.

>   - Some PCI devices don't have page aligned MMIO BARs. These devices 
> (all functions) cannot be safely assigned to guests.

We've seen patches to force page alignment on this list ... they haven't
been sufficiently beautiful to be applied yet.

> Driver Unbinding
> 
> 
> Before a device is assigned to a guest, we should make sure that no host
> device driver is currently bound to the device.
> 
> We can do that with e.g.
> 
>  $> echo -n "8086 10de"  > /sys/bus/pci/drivers/pci-stub/new_id
>  $> echo -n :00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
>  $> echo -n :00:19.0 > /sys/bus/pci/drivers/pci-stub/bind
> 
> One minor problem with this scheme is that at this point you can't
> unbind from pci-stub and trigger a re-probe and have e1000e bind to it.
> In order to support that, we need a "remove_id" interface to remove the
> dynamic ID.

It sounds like you'd be OK with a 'remove_id' interface that only
removes subsequently-added interfaces.

I might suggest a second approach which would be to have an explicit
echo to the bind file ignore the list of ids.  Then you wouldn't need to
'echo -n "8086 10de"' to begin with.

> Device Reset
> 
> 
> Before assigning a device to a guest, it should be reset. The host or a
> previous guest may have left the device in an unknown state. Not
> resetting can be seen in testing to lead to e.g. "TX Unit Hang" errors
> with e1000e devices.

Really, this is the same problem that kexec has.  Either the driver is
doing insufficient initialisation, or it's not doing its shutdown
properly.  The former is definitely better than the latter as kexec may
be used from a position of having the driver locked solid and unable to
reset the device.

> If we're assigning devices from behind a PCI/PCI-x bridge (remember all
> devices must be assigned together), then we can use SBR to reset them
> all together. Clearly, though, one should make sure that all devices
> behind that bridge are not in use before doing the reset. We could
> implement this with a "reset" sysfs interface for pci-stub - it would
> only reset a device using SBR if all devices behind that bridge were
> bound to pci-stub.

I don't think this should be part of pci-stub, but rather part of the
PCI core.  I can imagine other uses for being able to reset all devices
behind a bridge that don't involve anything to do with v12n.  So I'd
like to see a /sys/class/pci_bus/*/reset (where * would not include root
busses).

> Where a conventional PCI device is on the root bus, or where a PCIe
> device is on the root bus or another bus with multiple devices, we could
> use the D-state transition reset. Since this resets all functions on a
> device, we would need a similar approach where all functions must be
> bound to pci-stub before being reset.

Even with the caveat above about D0 -> D3hot -> D0 doesn't necessarily
do a full reset, it does seem to be per-function.  For example, this
passage fro

Re: KVM PCI device assignment issues

2009-02-13 Thread Mark McLoughlin
On Fri, 2009-02-13 at 08:56 -0800, Greg KH wrote:
> On Fri, Feb 13, 2009 at 04:32:47PM +, Mark McLoughlin wrote:
> > Driver Unbinding
> > 
> > 
> > Before a device is assigned to a guest, we should make sure that no host
> > device driver is currently bound to the device.
> > 
> > We can do that with e.g.
> > 
> >  $> echo -n "8086 10de"  > /sys/bus/pci/drivers/pci-stub/new_id
> >  $> echo -n :00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
> >  $> echo -n :00:19.0 > /sys/bus/pci/drivers/pci-stub/bind
> > 
> > One minor problem with this scheme is that at this point you can't
> > unbind from pci-stub and trigger a re-probe and have e1000e bind to it.
> 
> Are you sure?  It should work if you manually tell the e1000e driver to
> bind to it, after unbinding it from the pci-stub driver.

Yes, that works - I meant using /sys/bus/pci/drivers_probe. The problem
is that it would suck for management tools to have to remember which
device driver it was originally bound to.

> > In order to support that, we need a "remove_id" interface to remove the
> > dynamic ID.
> 
> Why?

Before assignment:

 $> echo -n "8086 10de"  > /sys/bus/pci/drivers/pci-stub/new_id
 $> echo -n :00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
 $> echo -n :00:19.0 > /sys/bus/pci/drivers/pci-stub/bind
 $> echo -n "8086 10de"  > /sys/bus/pci/drivers/pci-stub/remove_id

After assignment:

 $> echo -n :00:19.0 > /sys/bus/pci/drivers_probe
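
Wrapped up as a sketch for a management tool, the two halves might look like
the following. It assumes the proposed remove_id file exists, looks up the
current host driver through the device's 'driver' symlink so the tool does
not have to remember it, and uses printf %s in place of echo -n. The function
names and the sysfs-root parameter are made up for illustration, and the
explicit unbind from pci-stub before the re-probe is an assumption.

```shell
# Before assignment: move a device from its host driver to pci-stub.
# $1: sysfs root (normally /sys)  $2: "vendor device" ID pair  $3: PCI address
assign_to_stub() {
    as_sys=$1 as_ids=$2 as_addr=$3
    as_stub=$as_sys/bus/pci/drivers/pci-stub
    as_link=$as_sys/bus/pci/devices/$as_addr/driver

    printf %s "$as_ids" > "$as_stub/new_id" || return 1
    # Unbind from whatever host driver currently owns the device, if any.
    if [ -L "$as_link" ]; then
        as_cur=$(basename "$(readlink "$as_link")")
        printf %s "$as_addr" > "$as_sys/bus/pci/drivers/$as_cur/unbind" || return 1
    fi
    printf %s "$as_addr" > "$as_stub/bind" || return 1
    # Drop the dynamic ID again so a later re-probe does not pick pci-stub.
    printf %s "$as_ids" > "$as_stub/remove_id"
}

# After assignment: release the device and let the PCI core re-probe it.
# $1: sysfs root (normally /sys)  $2: PCI address
release_from_stub() {
    rs_sys=$1 rs_addr=$2
    printf %s "$rs_addr" > "$rs_sys/bus/pci/drivers/pci-stub/unbind" || return 1
    printf %s "$rs_addr" > "$rs_sys/bus/pci/drivers_probe"
}
```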

> > What we don't support is a way to unbind permanently. Xen has a
> > pciback.hide module param which tries to achieve this, but you end up
> > with the inevitable issues around making sure pciback is loaded before
> > the device driver etc.
> 
> What do you mean, unbind "permanently"?  For every reboot?  Or just
> within the same boot time?

Across reboots, yeah.

Cheers,
Mark.



Re: KVM PCI device assignment issues

2009-02-13 Thread Greg KH
On Fri, Feb 13, 2009 at 04:32:47PM +, Mark McLoughlin wrote:
> Driver Unbinding
> 
> 
> Before a device is assigned to a guest, we should make sure that no host
> device driver is currently bound to the device.
> 
> We can do that with e.g.
> 
>  $> echo -n "8086 10de"  > /sys/bus/pci/drivers/pci-stub/new_id
>  $> echo -n :00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
>  $> echo -n :00:19.0 > /sys/bus/pci/drivers/pci-stub/bind
> 
> One minor problem with this scheme is that at this point you can't
> unbind from pci-stub and trigger a re-probe and have e1000e bind to it.

Are you sure?  It should work if you manually tell the e1000e driver to
bind to it, after unbinding it from the pci-stub driver.

> In order to support that, we need a "remove_id" interface to remove the
> dynamic ID.

Why?

> What we don't support is a way to unbind permanently. Xen has a
> pciback.hide module param which tries to achieve this, but you end up
> with the inevitable issues around making sure pciback is loaded before
> the device driver etc.

What do you mean, unbind "permanently"?  For every reboot?  Or just
within the same boot time?

thanks,

greg k-h


KVM PCI device assignment issues

2009-02-13 Thread Mark McLoughlin
Hi,

KVM has support for PCI device assignment using VT-d and AMD IOMMU, but
there are a number of inter-related issues that need some further
discussion:

  - Unbinding devices from any existing device driver before assignment

  - Resetting devices before and after assignment

  - Helping users figure out which devices can actually be assigned

This gets confusing, so some background constraints first:

  - Conventional PCI devices (i.e. PCI/PCI-X, not PCIe) behind the same
    bridge must be assigned to the same VT-d domain - i.e. given device
    A (:0f:1.0) and device B (:0f:2.0), if you assign device A to a
    guest, you cannot then use device B in the host or another guest.

  - Some newer PCIe devices (and newer conventional PCI devices too via
    PCI Advanced Features) support Function Level Reset (FLR). This
    allows a PCI function to be reset without affecting any other
    functions on that device, or any other devices. This feature is not
    widespread yet AFAIK - e.g. I've seen it on an audio controller,
    and it must also be supported by SR-IOV devices.

  - Secondary Bus Reset (SBR) allows software to trigger a reset on all
    devices (and functions) behind a PCI bridge.

  - A PCI Power Management D-state transition (D3hot to D0) can be used
    to reset a device (all functions).

  - Some PCI devices don't have page aligned MMIO BARs. These devices
    (all functions) cannot be safely assigned to guests.
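
To make the last constraint concrete, here is a rough sketch of a BAR
alignment check against a device's sysfs 'resource' file, where each line
holds "start end flags" in hex. It assumes 4096-byte pages, that the first
six lines are the BARs, and that flag bit 0x200 marks a memory resource. It
is deliberately conservative in also requiring the BAR size to be a whole
number of pages. The function name is hypothetical.

```shell
# Succeed (exit 0) if every MMIO BAR starts on a page boundary and spans
# whole pages.
# $1: path to a device's 'resource' file
bars_page_aligned() {
    bar_n=0
    while [ "$bar_n" -lt 6 ] && read -r bar_start bar_end bar_flags; do
        bar_n=$(( bar_n + 1 ))
        # Skip unused BARs and anything that is not a memory resource.
        if [ $(( $bar_end )) -eq 0 ] || [ $(( $bar_flags & 0x200 )) -eq 0 ]; then
            continue
        fi
        bar_size=$(( $bar_end - $bar_start + 1 ))
        if [ $(( $bar_start % 4096 )) -ne 0 ] || [ $(( bar_size % 4096 )) -ne 0 ]; then
            return 1
        fi
    done < "$1"
    return 0
}
```

Since the constraint applies to all functions of a device, a tool would run
this for each function and reject the device if any one of them fails.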

Driver Unbinding
================

Before a device is assigned to a guest, we should make sure that no host
device driver is currently bound to the device.

We can do that with e.g.

 $> echo -n "8086 10de"  > /sys/bus/pci/drivers/pci-stub/new_id
 $> echo -n :00:19.0 > /sys/bus/pci/drivers/e1000e/unbind
 $> echo -n :00:19.0 > /sys/bus/pci/drivers/pci-stub/bind

One minor problem with this scheme is that at this point you can't
unbind from pci-stub and trigger a re-probe and have e1000e bind to it.
In order to support that, we need a "remove_id" interface to remove the
dynamic ID.

What we don't support is a way to unbind permanently. Xen has a
pciback.hide module param which tries to achieve this, but you end up
with the inevitable issues around making sure pciback is loaded before
the device driver etc.

Permanent unbinding isn't necessarily needed, but it might help provide
a solution to some of the nastier issues below.

Device Reset
============

Before assigning a device to a guest, it should be reset. The host or a
previous guest may have left the device in an unknown state. Not
resetting can be seen in testing to lead to e.g. "TX Unit Hang" errors
with e1000e devices.

FLR is without doubt the preferable solution here. KVM already
implements this. However, the range of devices which support FLR is
currently quite limited.

If we're assigning devices from behind a PCI/PCI-x bridge (remember all
devices must be assigned together), then we can use SBR to reset them
all together. Clearly, though, one should make sure that all devices
behind that bridge are not in use before doing the reset. We could
implement this with a "reset" sysfs interface for pci-stub - it would
only reset a device using SBR if all devices behind that bridge were
bound to pci-stub.

Where a conventional PCI device is on the root bus, or where a PCIe
device is on the root bus or another bus with multiple devices, we could
use the D-state transition reset. Since this resets all functions on a
device, we would need a similar approach where all functions must be
bound to pci-stub before being reset.

Furthermore, we would need to prevent pci-stub from resetting a device
it is bound to where the device is already assigned to a guest. To
achieve this, we would want KVM to explicitly call in to pci-stub to
mark a device as in use.
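
The decision described in this section can be summarised in a small sketch.
The five 0/1 inputs are hypothetical and would have to be gathered elsewhere.
The last one (the No_Soft_Reset bit from the PM capability) is an extra
assumption, added because a D-state transition only resets the device when
that bit is clear.

```shell
# Print the reset method to use: flr, sbr, pm or none.
# $1: device supports FLR
# $2: device sits behind a conventional PCI/PCI-X bridge
# $3: all devices behind that bridge are bound to pci-stub
# $4: all functions of the device are bound to pci-stub
# $5: No_Soft_Reset bit is set
choose_reset() {
    if [ "$1" = 1 ]; then
        echo flr
    elif [ "$2" = 1 ] && [ "$3" = 1 ]; then
        echo sbr
    elif [ "$2" = 0 ] && [ "$4" = 1 ] && [ "$5" = 0 ]; then
        echo pm
    else
        echo none
    fi
}
```

A result of "none" means the device cannot be reset safely under these
rules. The check that a device is not already assigned to a guest is left
out here.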

The alternatives to such an approach are:

  a) Only support FLR capable devices

  b) Cross our fingers and hope that devices work without a device reset

  c) Allow a driver to be permanently unbound from a device and require 
 the user to reboot after unbinding before assigning

Filtering
=========

In order to support a sane user interface in management tools, it should
be possible to list all PCI devices available on a host and filter
out those which cannot be assigned to a guest.

Furthermore, it should be possible to do this without actually affecting
any of the devices - i.e. a "try to unbind and see if we oops" approach
clearly isn't great.

Finally, some management tools would like to be able to do this
filtering given the constraint of a device being reserved for a
currently inactive guest.

This last constraint is the most difficult and points to the logic
needing to be in userland management libraries. Possibly the only sane
kernel space support would be "try to unbind and reset; if it works then
the device is assignable".

Conclusions
===========

Only supporting devices with FLR restricts our user pool far too
severely.

Permanent unbind