On Mon, Jan 09, 2017 at 10:42:41AM -0500, Dan Streetman wrote: > On Mon, Jan 9, 2017 at 9:59 AM, Boris Ostrovsky > <boris.ostrov...@oracle.com> wrote: > > On 01/06/2017 08:06 PM, Konrad Rzeszutek Wilk wrote: > >> On Thu, Jan 05, 2017 at 02:28:56PM -0500, Dan Streetman wrote: > >>> Do not read a pci device's msi message data to see if a pirq was > >>> previously configured for the device's msi/msix, as the old pirq was > >>> unmapped and may now be in use by another pci device. The previous > >>> pirq should never be re-used; instead a new pirq should always be > >>> allocated from the hypervisor. > >> Won't this cause a starvation problem? That is we will run out of PIRQs > >> as we are not reusing them? > > > > Don't we free the pirq when we unmap it? > > I think this is actually a bit worse than I initially thought. After > looking a bit closer, and I think there's an asymmetry with pirq > allocation:
Lets include Stefano, Thank you for digging in this! This has quite the deja-vu feeling as I believe I hit this at some point in the past and posted some possible ways of fixing this. But sadly I can't find the thread. > > tl;dr: > > pci_enable_msix_range() -> each MSIX (or MSI) now has a pirq > allocated, and reserved in the hypervisor > > request_irq() -> an event channel is opened for the specific pirq, and > maps the pirq with the hypervisor > > free_irq() -> the event channel is closed, and the pirq is unmapped, > but that unmap function also frees the pirq! The hypervisor can/will > give it away to the next call to get_free_pirq. However, the pci > msi/msix data area still contains the pirq number, and the next call > to request_irq() will re-use the pirq. > > pci_disable_msix() -> this has no effect on the pirq in the hypervisor > (it's already been freed), and it also doesn't clear anything from the > msi/msix data area, so the pirq is still cached there. > > > It seems like the hypervisor needs to be fixed to *not* unmap the pirq > when the event channel is closed - or at least, not to change it to > IRQ_UNBOUND state? And, the pci_disable_msix call should eventually > call into something in the Xen guest kernel that actually does the > pirq unmapping, and clear it from the msi data area (i.e. > pci_write_msi_msg) The problem is that Xen changes have sailed a long long time ago. > > Alternately, if the hypervisor doesn't change, then the call into the > hypervisor to actually allocate a pirq needs to move from the > pci_enable_msix_range() call to the request_irq() call? So that when > the msi/msix range is enabled, it doesn't actually reserve any pirq's > for any of the vectors; each request_irq/free_irq pair do the pirq > allocate-and-map/unmap... Or a third one: We keep an pirq->device lookup and inhibit free_irq() from actually calling evtchn_close() until the pci_disable_msix() is done? > > > longer details: > > The chain of function calls starts in the initial call to configure > the msi vectors, which eventually calls __pci_enable_msix_range (or > msi_) which then eventually reaches xen_hvm_setup_msi_irqs(), which > either tries to re-use any cached pirq in the MSI data area, or (for > the first time setup) allocates a new pirq from the hypervisor via > PHYSDEVOP_get_free_pirq. That pirq is then reserved from the > hypervisor's perspective, but it's not mapped to anything in the guest > kernel. > > Then, the driver calls request_irq to actually start using the irq, > which calls __setup_irq to irq_startup to startup_pirq. The > startup_pirq call actually creates the evtchn and binds it to the > previously allocated pirq via EVTCHNOP_bind_pirq. > > At this point, the pirq is bound to a guest kernel evtchn (and irq) > and is in use. But then, when the driver doesn't want it anymore, it > calls free_irq, and that calls irq_shutdown to shutdown_pirq; and that > function closes the evtchn via EVTCHNOP_close. > > Inside the hypervisor, in xen/common/event_channel.c in > evtchn_close(), if the channel is type ECS_PIRQ (which our pirq > channel is) then it unmaps the pirq mapping via > unmap_domain_pirq_emuirq. This unmaps the pirq, but also puts it back > to state IRQ_UNBOUND, which makes it available for the hypervisor to > give away to anyone requesting a new pirq! > > > > > > > > > -boris > > > >>> The xen_hvm_setup_msi_irqs() function currently checks the pci device's > >>> msi descriptor message data for each msi/msix vector it sets up, and if > >>> it finds the vector was previously configured with a pirq, and that pirq > >>> is mapped to an irq, it re-uses the pirq instead of requesting a new pirq > >>> from the hypervisor. However, that pirq was unmapped when the pci device > >>> disabled its msi/msix, and it cannot be re-used; it may have been given > >>> to a different pci device. > >> Hm, but this implies that we do keep track of it. > >> > >> > >> while (true) > >> do > >> rmmod nvme > >> modprobe nvme > >> done > >> > >> Things go boom without this patch. But with this patch does this > >> still work? As in we don't run out of PIRQs? > >> > >> Thanks. > >>> This exact situation is happening in a Xen guest where multiple NVMe > >>> controllers (pci devices) are present. The NVMe driver configures each > >>> pci device's msi/msix twice; first to configure a single vector (to > >>> talk to the controller for its configuration info), and then it disables > >>> that msi/msix and re-configures with all the msi/msix it needs. When > >>> multiple NVMe controllers are present, this happens concurrently on all > >>> of them, and in the time between controller A calling pci_disable_msix() > >>> and then calling pci_enable_msix_range(), controller B enables its msix > >>> and gets controller A's pirq allocated from the hypervisor. Then when > >>> controller A re-configures its msix, its first vector tries to re-use > >>> the same pirq that it had before; but that pirq was allocated to > >>> controller B, and thus the Xen event channel for controller A's re-used > >>> pirq fails to map its irq to that pirq; the hypervisor already has the > >>> pirq mapped elsewhere. > >>> > >>> Signed-off-by: Dan Streetman <dan.street...@canonical.com> > >>> --- > >>> arch/x86/pci/xen.c | 23 +++++++---------------- > >>> 1 file changed, 7 insertions(+), 16 deletions(-) > >>> > >>> diff --git a/arch/x86/pci/xen.c b/arch/x86/pci/xen.c > >>> index bedfab9..a00a6c0 100644 > >>> --- a/arch/x86/pci/xen.c > >>> +++ b/arch/x86/pci/xen.c > >>> @@ -234,23 +234,14 @@ static int xen_hvm_setup_msi_irqs(struct pci_dev > >>> *dev, int nvec, int type) > >>> return 1; > >>> > >>> for_each_pci_msi_entry(msidesc, dev) { > >>> - __pci_read_msi_msg(msidesc, &msg); > >>> - pirq = MSI_ADDR_EXT_DEST_ID(msg.address_hi) | > >>> - ((msg.address_lo >> MSI_ADDR_DEST_ID_SHIFT) & 0xff); > >>> - if (msg.data != XEN_PIRQ_MSI_DATA || > >>> - xen_irq_from_pirq(pirq) < 0) { > >>> - pirq = xen_allocate_pirq_msi(dev, msidesc); > >>> - if (pirq < 0) { > >>> - irq = -ENODEV; > >>> - goto error; > >>> - } > >>> - xen_msi_compose_msg(dev, pirq, &msg); > >>> - __pci_write_msi_msg(msidesc, &msg); > >>> - dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", > >>> pirq); > >>> - } else { > >>> - dev_dbg(&dev->dev, > >>> - "xen: msi already bound to pirq=%d\n", pirq); > >>> + pirq = xen_allocate_pirq_msi(dev, msidesc); > >>> + if (pirq < 0) { > >>> + irq = -ENODEV; > >>> + goto error; > >>> } > >>> + xen_msi_compose_msg(dev, pirq, &msg); > >>> + __pci_write_msi_msg(msidesc, &msg); > >>> + dev_dbg(&dev->dev, "xen: msi bound to pirq=%d\n", pirq); > >>> irq = xen_bind_pirq_msi_to_irq(dev, msidesc, pirq, > >>> (type == PCI_CAP_ID_MSI) ? > >>> nvec : 1, > >>> (type == PCI_CAP_ID_MSIX) ? > >>> -- > >>> 2.9.3 > >>> > > > > > > _______________________________________________ > > Xen-devel mailing list > > Xen-devel@lists.xen.org > > https://lists.xen.org/xen-devel _______________________________________________ Xen-devel mailing list Xen-devel@lists.xen.org https://lists.xen.org/xen-devel