On Sun, Jan 06, 2013 at 08:57:50AM -0700, Alex Williamson wrote: > On Sun, 2013-01-06 at 15:23 +0200, Michael S. Tsirkin wrote: > > On Wed, Jan 02, 2013 at 08:49:42AM -0700, Alex Williamson wrote: > > > On Fri, 2012-12-21 at 08:46 -0700, Alex Williamson wrote: > > > > On Fri, 2012-12-21 at 14:17 +0200, Michael S. Tsirkin wrote: > > > > > On Thu, Dec 20, 2012 at 03:15:38PM -0700, Alex Williamson wrote: > > > > > > On Thu, 2012-12-20 at 18:38 +0200, Michael S. Tsirkin wrote: > > > > > > > On Thu, Dec 20, 2012 at 09:05:50AM -0700, Alex Williamson wrote: > > > > > > > > When a guest enables MSIX on a device we evaluate the MSIX > > > > > > > > vector > > > > > > > > table, typically find no unmasked vectors and don't switch the > > > > > > > > device > > > > > > > > to MSIX mode. This generally works fine and the device will be > > > > > > > > switched once the guest enables and therefore unmasks a vector. > > > > > > > > Unfortunately some drivers enable MSIX, then use interfaces to > > > > > > > > send > > > > > > > > commands between VF & PF or PF & firmware that act based on the > > > > > > > > host > > > > > > > > state of the device. These therefore break when MSIX is managed > > > > > > > > lazily. This change re-enables the previous test used to > > > > > > > > enable MSIX > > > > > > > > (see qemu-kvm a6b402c9), which basically guesses whether a > > > > > > > > vector > > > > > > > > will be used based on the data field of the vector table. > > > > > > > > > > > > > > > > Cc: qemu-sta...@nongnu.org > > > > > > > > Signed-off-by: Alex Williamson <alex.william...@redhat.com> > > > > > > > > > > > > > > Same question: can't we enable and mask MSIX through config sysfs? > > > > > > > In this case it can be done in userspace ... > > > > > > > > > > > > In this case userspace could do this, but I think it's still > > > > > > incredibly > > > > > > dangerous. Kernel space drivers can also directly enable MSI-X on a > > > > > > device, but you might get shot for writing one that did. > > > > > > > > > > What would be the reason for the kernel driver to do this? > > > > > > > > Maybe they don't know how many vectors to use until they enable MSI-X > > > > and query some firmware interface. It's a hypothetical situation, I'm > > > > just trying to illustrate that if a kernel driver did want to do this, > > > > they'd have to develop interfaces to allow it, not just manually poke > > > > their MSI-X enable bit. > > > > > > > > > > We should > > > > > > follow the rules, play be the existing kernel interfaces, and work > > > > > > to > > > > > > eventually improve those interfaces. Thanks, > > > > > > > > > > > > Alex > > > > > > > > > > I'm not against adding an interface for this long term but we have > > > > > existing kernels to support too. IMHO it would be nicer than > > > > > the data hack which relies on non-documented guest behaviour > > > > > that might change without warning in the future. > > > > > > > > We've unwittingly used the data hack for years and only ripped it out > > > > because it was undocumented. The patch below adds documentation for it, > > > > so at least we have a more clear understanding of why it was there if we > > > > want to try to rip it out again. This fully supports existing kernels > > > > and as I mention below, we might be able to do better with limiting how > > > > many vectors we enabled, but I think this is the right initial fix and > > > > right fix for stable and we can continue to experiment from here. > > > > > > Happy new year. I'd like to close on this as we do currently have a > > > regression for devices that cannot handle MSI-X being lazily enabled. > > > The option here is to document and revert to the old style > > > initialization behavior where we look at the data field of the vector to > > > get a hint whether the guest intends to make use of the vector. This > > > gives us the same behavior as we had previously, but still allows > > > vectors to be added, so we maintain the current FreeBSD support. This > > > much needs to go to stable. > > > > By the way, could you clarify what exactly happens with FreeBSD? > > FreeBSD enables MSIX in config space without setting anything in the > MSIX table address/data fields. Thus previously, when we only looked at > the data field and did not add vectors dynamically, FreeBSD wouldn't get > any vectors setup. This change therefore does not effect FreeBSD since > it will still lazily enable vectors. This would be addressed by a > follow-on patch since we don't know of any drivers that care about the > lazy setup on FreeBSD.
Hopefully this can be addressed by the long term solution, would be nice to have behaviour match spec instead of relying on guest-specific behaviour. > > > field is not 100% reliable in giving us the number of vectors the guest > > > actually intends to use. Instead we'd like to enable MSI-X with no > > > vectors and add vectors as the guest unmasks them. The host Linux MSI > > > API currently doesn't allow this, so I think the next best thing is to > > > enable MSI-X with a single vector in the case where MSI-X is enabled but > > > no vectors are unmasked. This conserves vectors on the host though we > > > do potentially allow spurious interrupts through the enabled vector > > > (though we previously enabled multiple vectors using the above data > > > method without problems). > > > > Hmm I think this can lose interrupts if they are sent before > > there is a handler. The data hack only sends interrupts when > > there's a handler so it's safe. > > The data hack is looking at the result of pci_enable_msix, in the case > of a Linux guest. The data fields are setup, but interrupt handlers are > not setup until the guest calls request_irq, so those would also be > spurious interrupts. But are the vectors unmasked? I need to look at the code - if not they get queued in the hardware. > > > The alternative that you're proposing to this longer term solution is to > > > manually mask all vectors in the physical MSI-X vector table from > > > userspace > > > > This masking is not necessary I think: all vectors are masked > > after reset IIRC. > > Assuming the device supports a reset and the host doesn't restore > anything in the vector table. We more or less require function level reset for assignment, don't we? > > > then manually enable MSI-X on the physical device (through > > > pci-sysfs resource and config access respectively). This puts the > > > physical device is a state that better matches the guest view of the > > > devices, but I'm doubtful that the risk is worth the reward. This adds > > > a new state to the qemu MSI-X model where we have entirely host kernel > > > managed physical IRQ state, except for this. It also creates a > > > synchronization problem that the physical device moves to a new > > > interrupt state outside of the control of the host kernel, possibly > > > bypassing any quirks for the host platform. > > > > Hmm as long as all vectors are masked, MSIX is a per device state so I > > don't see why any quirks would be needed. > > > > > Another option is to modify the host MSI API to allow the interface we > > > want, splitting enabling MSI-X from vector allocation. That of course > > > has a much longer lead time. > > > > It's a requirement to make it really robust though, isn't it? > > It's required to get the kind of control we'd like, including the > ability to add vectors without tearing down MSIX and starting over. > > > > We can certainly continue the discussion on this, but we need a fix for > > > stable and I don't think either of these longer term methods are known > > > to have the reliability or simplicity of reverting to the previous > > > initialization criteria. Please ack if you agree. Thanks, > > > > > > Alex > > > > I agree with your approach to stable, so ACK. > > > > For long term, I think we should start with fixing up > > things in the kernel. Then for old kernels, qemu can do > > the userspace hack (or some other hack if we find a better one). > > Ok, thanks. Logically it seems easy to split pci_enable_msix into > enable + add vectors, but I expect it gets a lot more complicated across > numerous architectures and things like interrupt remappers. I'll try to > start looking into this. Thanks, > > Alex > > > > > > > > > --- > > > > > > > > hw/kvm/pci-assign.c | 17 +++++++++++++++-- > > > > > > > > 1 file changed, 15 insertions(+), 2 deletions(-) > > > > > > > > > > > > > > > > I think we might be able to do a little better than this, but I > > > > > > > > think > > > > > > > > this is the right fix for stable and we can build on it to > > > > > > > > perhaps only > > > > > > > > enable a single vector. > > > > > > > > > > > > > > > > diff --git a/hw/kvm/pci-assign.c b/hw/kvm/pci-assign.c > > > > > > > > index e80dad0..12a219b 100644 > > > > > > > > --- a/hw/kvm/pci-assign.c > > > > > > > > +++ b/hw/kvm/pci-assign.c > > > > > > > > @@ -1025,6 +1025,19 @@ static bool > > > > > > > > assigned_dev_msix_masked(MSIXTableEntry *entry) > > > > > > > > return (entry->ctrl & cpu_to_le32(0x1)) != 0; > > > > > > > > } > > > > > > > > > > > > > > > > +/* > > > > > > > > + * When MSI-X is first enabled the vector table typically has > > > > > > > > all the > > > > > > > > + * vectors masked, so we can't use that as the obvious test to > > > > > > > > figure out > > > > > > > > + * how many vectors to initially enable. Instead we look at > > > > > > > > the data field > > > > > > > > + * because this is what worked for pci-assign for a long time. > > > > > > > > This makes > > > > > > > > + * sure the physical MSI-X state tracks the guest's view, > > > > > > > > which is important > > > > > > > > + * for some VF/PF and PF/fw communication channels. > > > > > > > > + */ > > > > > > > > +static bool assigned_dev_msix_skipped(MSIXTableEntry *entry) > > > > > > > > +{ > > > > > > > > + return !entry->data; > > > > > > > > +} > > > > > > > > + > > > > > > > > static int assigned_dev_update_msix_mmio(PCIDevice *pci_dev) > > > > > > > > { > > > > > > > > AssignedDevice *adev = DO_UPCAST(AssignedDevice, dev, > > > > > > > > pci_dev); > > > > > > > > @@ -1035,7 +1048,7 @@ static int > > > > > > > > assigned_dev_update_msix_mmio(PCIDevice *pci_dev) > > > > > > > > > > > > > > > > /* Get the usable entry number for allocating */ > > > > > > > > for (i = 0; i < adev->msix_max; i++, entry++) { > > > > > > > > - if (assigned_dev_msix_masked(entry)) { > > > > > > > > + if (assigned_dev_msix_skipped(entry)) { > > > > > > > > continue; > > > > > > > > } > > > > > > > > entries_nr++; > > > > > > > > @@ -1064,7 +1077,7 @@ static int > > > > > > > > assigned_dev_update_msix_mmio(PCIDevice *pci_dev) > > > > > > > > for (i = 0; i < adev->msix_max; i++, entry++) { > > > > > > > > adev->msi_virq[i] = -1; > > > > > > > > > > > > > > > > - if (assigned_dev_msix_masked(entry)) { > > > > > > > > + if (assigned_dev_msix_skipped(entry)) { > > > > > > > > continue; > > > > > > > > } > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > > >