Re: [PATCH] intel-iommu: Synchronize gcmd value with global command register
(2013/04/03 17:24), David Woodhouse wrote:
> On Wed, 2013-04-03 at 16:11 +0900, Takao Indoh wrote:
>> (2013/04/02 23:05), Joerg Roedel wrote:
>>> On Mon, Apr 01, 2013 at 02:45:18PM +0900, Takao Indoh wrote:

enable_IR
intel_enable_irq_remapping
iommu_disable_irq_remapping <== IRES/QIES/TES disabled here
dmar_disable_qi <== do nothing
dmar_enable_qi <== QIES enabled
intel_setup_irq_remapping <== IRES enabled

>>>
>>> But what we want to do here in the kdump case is to disable translation
>>> too, right? Because the former kernel might have translation and
>>> irq-remapping enabled and the kdump kernel might be compiled without
>>> support for dma-remapping. So if we don't disable translation here too
>>> the kdump kernel is unable to do DMA.
>>
>> Yeah, you are right. I forgot such a case.
>
> If you disable translation and there's some device still doing DMA, it's
> going to scribble over random areas of memory. You really want to have
> translation enabled and all the page tables *cleared*, during kexec. I
> think it's fair to insist that the secondary kernel should use the IOMMU
> if the first one did.
>
>> To be honest, I also expected the side effect of this patch. As I wrote
>> in the previous mail, I'm working on a kdump problem with iommu, that is,
>> ongoing DMA causes a DMAR fault in the 2nd kernel and sometimes kdump
>> fails due to this fault.
>
> Here you've lost me. The DMAR fault is caught and reported, and how does
> this lead to a kdump failure? Are you using dodgy hardware that just
> keeps *trying* after an abort, and floods the system with a storm of
> DMAR faults? We've occasionally spoken about working around such a
> problem by setting a bit to make subsequent faults *silent*. Would that
> work?

There are several cases.
- DMAR fault messages flood and the second kernel does not boot. I recently saw a similar report: https://lkml.org/lkml/2013/3/8/120
- The igb driver detects an error on link-up and kdump via network fails.
- On a certain platform, though kdump itself works, a PCIe error like Unexpected Completion is detected and the hardware gets degraded.

Thanks,
Takao Indoh

>
>> What we have to do is stop DMA transactions
>> before DMA-remapping is disabled in the 2nd kernel.
>
> The IOMMU is there to stop DMA transactions. That is its *job*. :)
>
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v2] irq: add quirk for broken interrupt remapping on 55XX chipsets
[+cc David and iommu list, Yinghai, Jiang]

On Mon, Mar 4, 2013 at 12:04 PM, Neil Horman wrote:
> A few years back Intel published a spec update:
> http://www.intel.com/content/dam/doc/specification-update/5520-and-5500-chipset-ioh-specification-update.pdf
> for the 5520 and 5500 chipsets, which contains an erratum (specifically
> erratum 53) noting that these chipsets can't properly do interrupt
> remapping; as a result they recommend that interrupt remapping be
> disabled in the BIOS. While many vendors have a BIOS update to do
> exactly that, not all do, and of course not all users update their BIOS
> to a level that corrects the problem. As a result, occasionally
> interrupts can arrive at a cpu even after affinity for that interrupt
> has been moved, leading to lost or spurious interrupts (usually
> characterized by the message:
> kernel: do_IRQ: 7.71 No irq handler for vector (irq -1)
>
> There have been several incidents recently of people seeing this error,
> and investigation has shown that they have systems for which the BIOS
> level is such that this feature was not properly turned off. As such, it
> would be good to give them a reminder that their systems are vulnerable
> to this problem.
>
> Signed-off-by: Neil Horman
> CC: Prarit Bhargava
> CC: Don Zickus
> CC: Don Dutile
> CC: Bjorn Helgaas
> CC: Asit Mallick
> CC: linux-...@vger.kernel.org
>
> ---
>
> Change notes:
>
> v2)
>
> * Moved the quirk to the x86 arch, since consensus seems to be that the
> 55XX chipset series is x86 only. I decided however to keep the quirk as
> a regular quirk, not an early_quirk. Early quirks have no way currently
> to determine if BIOS has properly disabled the feature in the iommu, at
> least not without significant hacking, and since it's quite possible
> this will be a short-lived quirk, should Don Z's workaround code prove
> successful (and it looks like it may well), I don't think that's
> necessary.
>
> * Removed the WARNING banner from the quirk, and added the HW_ERR token
> to the string. I opted to leave the newlines in place however, as I
> really couldn't find a way to keep the text on a single line that is
> still legible from a code perspective. I think there's enough language
> in there that using cscope on just about any substring will turn it up,
> and again, this may be a short-lived quirk.
> ---
> arch/x86/kernel/quirks.c | 18 ++
> include/linux/pci_ids.h  |  2 ++
> 2 files changed, 20 insertions(+)
>
> diff --git a/arch/x86/kernel/quirks.c b/arch/x86/kernel/quirks.c
> index 26ee48a..a718ea2 100644
> --- a/arch/x86/kernel/quirks.c
> +++ b/arch/x86/kernel/quirks.c
> @@ -5,6 +5,7 @@
>  #include
>
>  #include
> +#include "../../../drivers/iommu/irq_remapping.h"
>
>  #if defined(CONFIG_X86_IO_APIC) && defined(CONFIG_SMP) && defined(CONFIG_PCI)
>
> @@ -567,3 +568,20 @@ DECLARE_PCI_FIXUP_FINAL(PCI_VENDOR_ID_AMD, PCI_DEVICE_ID_AMD_15H_NB_F5,
>  quirk_amd_nb_node);
>
>  #endif
> +
> +static void intel_remapping_check(struct pci_dev *dev)
> +{
> +    u8 revision;
> +
> +    pci_read_config_byte(dev, PCI_REVISION_ID, &revision);
> +
> +    if ((revision == 0x13) && irq_remapping_enabled) {
> +        pr_warn(HW_ERR "This system BIOS has enabled interrupt remapping\n"
> +                "on a chipset that contains an errata making that\n"
> +                "feature unstable. Please reboot with nointremap\n"
> +                "added to the kernel command line and contact\n"
> +                "your BIOS vendor for an update");
> +    }
> +}
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5520_IOHUB, intel_remapping_check);
> +DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_5500_IOHUB, intel_remapping_check);

This started as an IOMMU change, and I'm not an expert in that area, so I added David and the IOMMU list. I'd rather have him deal with this than me.
Is this something we can just *fix* in the kernel, e.g., by turning off interrupt remapping ourselves, or does it have to be done before the OS boots?

> diff --git a/include/linux/pci_ids.h b/include/linux/pci_ids.h
> index 31717bd..54027a6 100644
> --- a/include/linux/pci_ids.h
> +++ b/include/linux/pci_ids.h
> @@ -2732,6 +2732,8 @@
>  #define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_RANK_REV2 0x2db2
>  #define PCI_DEVICE_ID_INTEL_LYNNFIELD_MC_CH2_TC_REV2 0x2db3
>  #define PCI_DEVICE_ID_INTEL_82855PM_HB 0x3340
> +#define PCI_DEVICE_ID_INTEL_5500_IOHUB 0x3403
> +#define PCI_DEVICE_ID_INTEL_5520_IOHUB 0x3406

For constants only used in one place, we just use the bare constant (0x3403) in the quirk rather than editing pci_ids.h (see comment at the top of that file).

>  #define PCI_DEVICE_ID_INTEL_IOAT_TBG4 0x3429
>  #define PCI_DEVICE_ID_INTEL_IOAT_TBG5 0x342a
>  #define PCI_DEVICE_ID_INTEL_IOAT_TBG6 0x342b
> --
> 1.7.11.7
>
> --
> To unsubscribe from this list: send the line
[PATCH] iommu/amd: Add workaround to properly clear the IOMMU status register
From: Suravee Suthikulpanit

The IOMMU interrupt handler (bottom half) must clear the PPR log interrupt and event log interrupt bits to re-enable the interrupt. This is done by writing 1 to the memory-mapped register to clear the bit. Due to a hardware bug, if the driver tries to clear this bit while the IOMMU hardware is also setting it, the conflict can result in the bit remaining set. If the interrupt handling code does not make sure to clear this bit, subsequent changes in the event/PPR logs will no longer generate interrupts, which would result in buffer overflow. After clearing the bits, the driver must read back the register to verify.

In the current interrupt handling scheme, there are as many threads as the number of IOMMUs. Each thread is created and assigned to an IOMMU at the time of registering the interrupt handlers (request_threaded_irq). When an IOMMU generates an interrupt, the irq handler (top half) wakes up the corresponding thread to process the event and PPR logs of all IOMMUs, starting from the 1st IOMMU. In a system with multiple IOMMUs, this handling scheme complicates the synchronization of the IOMMU data structures and status registers, as there could be multiple threads competing for the same IOMMU while another IOMMU is left unhandled.

In order to simplify the implementation of the workaround, this patch proposes a different interrupt handling scheme in which each thread only manages interrupts of the corresponding IOMMU. This can be achieved by passing the struct amd_iommu when registering the interrupt handlers. This structure is unique for each IOMMU and can be used by the bottom-half thread to identify the IOMMU to be handled instead of calling for_each_iommu. Besides that, this also eliminates the need to lock the IOMMU for processing the event and PPR logs.
Signed-off-by: Suravee Suthikulpanit
---
 drivers/iommu/amd_iommu.c      | 59
 drivers/iommu/amd_iommu_init.c |  2 +-
 2 files changed, 31 insertions(+), 30 deletions(-)

diff --git a/drivers/iommu/amd_iommu.c b/drivers/iommu/amd_iommu.c
index e5db3e5..3548d63 100644
--- a/drivers/iommu/amd_iommu.c
+++ b/drivers/iommu/amd_iommu.c
@@ -803,12 +803,6 @@ retry:
 static void iommu_poll_events(struct amd_iommu *iommu)
 {
     u32 head, tail;
-    unsigned long flags;
-
-    /* enable event interrupts again */
-    writel(MMIO_STATUS_EVT_INT_MASK, iommu->mmio_base + MMIO_STATUS_OFFSET);
-
-    spin_lock_irqsave(&iommu->lock, flags);

     head = readl(iommu->mmio_base + MMIO_EVT_HEAD_OFFSET);
     tail = readl(iommu->mmio_base + MMIO_EVT_TAIL_OFFSET);
@@ -819,8 +813,6 @@ static void iommu_poll_events(struct amd_iommu *iommu)
     }

     writel(head, iommu->mmio_base + MMIO_EVT_HEAD_OFFSET);
-
-    spin_unlock_irqrestore(&iommu->lock, flags);
 }

 static void iommu_handle_ppr_entry(struct amd_iommu *iommu, u64 *raw)
@@ -845,17 +837,11 @@ static void iommu_handle_ppr_entry(struct amd_iommu *iommu, u64 *raw)
 static void iommu_poll_ppr_log(struct amd_iommu *iommu)
 {
-    unsigned long flags;
     u32 head, tail;

     if (iommu->ppr_log == NULL)
         return;

-    /* enable ppr interrupts again */
-    writel(MMIO_STATUS_PPR_INT_MASK, iommu->mmio_base + MMIO_STATUS_OFFSET);
-
-    spin_lock_irqsave(&iommu->lock, flags);
-
     head = readl(iommu->mmio_base + MMIO_PPR_HEAD_OFFSET);
     tail = readl(iommu->mmio_base + MMIO_PPR_TAIL_OFFSET);
@@ -891,34 +877,49 @@ static void iommu_poll_ppr_log(struct amd_iommu *iommu)
         head = (head + PPR_ENTRY_SIZE) % PPR_LOG_SIZE;
         writel(head, iommu->mmio_base + MMIO_PPR_HEAD_OFFSET);

-        /*
-         * Release iommu->lock because ppr-handling might need to
-         * re-acquire it
-         */
-        spin_unlock_irqrestore(&iommu->lock, flags);
-
         /* Handle PPR entry */
         iommu_handle_ppr_entry(iommu, entry);

-        spin_lock_irqsave(&iommu->lock, flags);
-
         /* Refresh ring-buffer information */
         head = readl(iommu->mmio_base + MMIO_PPR_HEAD_OFFSET);
         tail = readl(iommu->mmio_base + MMIO_PPR_TAIL_OFFSET);
     }
-
-    spin_unlock_irqrestore(&iommu->lock, flags);
 }

 irqreturn_t amd_iommu_int_thread(int irq, void *data)
 {
-    struct amd_iommu *iommu;
+    struct amd_iommu *iommu = (struct amd_iommu *) data;
+    u32 status = readl(iommu->mmio_base + MMIO_STATUS_OFFSET);

-    for_each_iommu(iommu) {
-        iommu_poll_events(iommu);
-        iommu_poll_ppr_log(iommu);
-    }
+    while (status & (MMIO_STATUS_EVT_INT_MASK | MMIO_STATUS_PPR_INT_MASK)) {
+        if (status & MMIO_STATUS_EVT_INT_MASK) {
+            pr_devel("AMD-Vi: Processing IOMMU Event Log\n");
+            iommu_poll_events(iommu);
+        }
+        if (s
Re: RFC: vfio API changes needed for powerpc
On 04/02/2013 10:12:31 PM, Alex Williamson wrote:
On Tue, 2013-04-02 at 17:44 -0500, Scott Wood wrote:
> On 04/02/2013 04:32:04 PM, Alex Williamson wrote:
> > On Tue, 2013-04-02 at 15:57 -0500, Scott Wood wrote:
> > > On 04/02/2013 03:32:17 PM, Alex Williamson wrote:
> > > > On x86 the interrupt remapper handles this transparently when MSI
> > > > is enabled and userspace never gets direct access to the device MSI
> > > > address/data registers.
> > >
> > > x86 has a totally different mechanism here, as far as I understand --
> > > even before you get into restrictions on mappings.
> >
> > So what control will userspace have over programming the actual MSI
> > vectors on PAMU?
>
> Not sure what you mean -- PAMU doesn't get explicitly involved in
> MSIs. It's just another 4K page mapping (per relevant MSI bank). If
> you want isolation, you need to make sure that an MSI group is only
> used by one VFIO group, and that you're on a chip that has alias pages
> with just one MSI bank register each (newer chips do, but the first
> chip to have a PAMU didn't).

How does a user figure this out?

The user's involvement could be limited to setting a policy knob of whether that degree of isolation is required (if required and unavailable, all devices using an MSI bank would be forced into the same group). We'd need to do something with MSI allocation so that we avoid using an MSI bank with more than one IOMMU group where possible. I'm not sure about the details yet, or how practical this is. There might need to be some MSI bank assignment done as part of the VFIO device binding process, if there are going to be more VFIO groups than there are MSI banks (reserving one bank for host use).

-Scott
Re: RFC: vfio API changes needed for powerpc
On 04/02/2013 10:37:20 PM, Alex Williamson wrote: On Tue, 2013-04-02 at 17:50 -0500, Scott Wood wrote: > On 04/02/2013 04:38:45 PM, Alex Williamson wrote: > > On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote: > > > On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood > > wrote: > > > >> >C. Explicit mapping using normal DMA map. The last idea > > is that > > > >> >we would introduce a new ioctl to give user-space an fd > > to > > > >> >the MSI bank, which could be mmapped. The flow would be > > > >> >something like this: > > > >> > -for each group user space calls new ioctl > > > >> > VFIO_GROUP_GET_MSI_FD > > > >> > -user space mmaps the fd, getting a vaddr > > > >> > -user space does a normal DMA map for desired iova > > > >> >This approach makes everything explicit, but adds a new > > ioctl > > > >> >applicable most likely only to the PAMU (type2 iommu). > > > >> > > > >> And the DMA_MAP of that mmap then allows userspace to select the > > window > > > >> used? This one seems like a lot of overhead, adding a new > > ioctl, new > > > >> fd, mmap, special mapping path, etc. > > > > > > > > > > > > There's going to be special stuff no matter what. This would > > keep it > > > > separated from the IOMMU map code. > > > > > > > > I'm not sure what you mean by "overhead" here... the runtime > > overhead of > > > > setting things up is not particularly relevant as long as it's > > reasonable. > > > > If you mean development and maintenance effort, keeping things > > well > > > > separated should help. > > > > > > We don't need to change DMA_MAP. If we can simply add a new "type > > 2" > > > ioctl that allows user space to set which windows are MSIs, it > > seems vastly > > > less complex than an ioctl to supply a new fd, mmap of it, etc. > > > > > > So maybe 2 ioctls: > > > VFIO_IOMMU_GET_MSI_COUNT > > Do you mean a count of actual MSIs or a count of MSI banks used by the > whole VFIO group? 
I hope the latter, which would clarify how this is distinct from DEVICE_GET_IRQ_INFO.

Is hotplug even on the table? Presumably dynamically adding a device could bring along additional MSI banks?

I'm not sure -- maybe we could say that hotplug can add banks, but not remove them or change the order, so userspace would just need to check if the number of banks changed, and map the extras.

The current VFIO MSI support has the host handling everything about MSI. The user never programs an MSI vector to the physical device, they set up everything through ioctl. On interrupt, we simply trigger an eventfd and leave it to things like KVM irqfd or QEMU to do the right thing in a virtual machine. Here the MSI vector has to go through a PAMU window to hit the correct MSI bank. So that means it has some component of the iova involved, which we're proposing here is controlled by userspace (whether that vector uses an offset from 0x1000 or 0x depending on which window slot is used to make the MSI bank).

I assume we're still working in a model where the physical interrupt fires into the host and a host-based interrupt handler triggers an eventfd, right?

Yes (subject to possible future optimizations).

So that means the vector also has host components so we trigger the correct ISR. How is that coordinated?

Everything but the iova component needs to come from the host MSI allocator.

Would it be possible for userspace to simply leave room for MSI bank mapping (how much room could be determined by something like VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can DMA_MAP starting at the 0x0 address of the aperture, growing up, and VFIO will map banks on demand at the top of the aperture, growing down? Wouldn't that avoid a lot of issues with userspace needing to know anything about MSI banks (other than count) and coordinating irq numbers and enabling handlers?
This would restrict a (possibly unlikely) use case where the user wants to map something near the top of the aperture but has another place MSIs can go (or is willing to live without MSIs). Otherwise it could be workable, as long as we can require an explicit MSI enabling on a device to happen after the aperture and subwindow count are set up. I'm not sure it would really buy anything over having userspace iterate over the MSI bank count, though -- it would probably be a bit more complicated.

> > On x86 MSI count is very
> > device specific, which means it would be a VFIO_DEVICE_* ioctl (actually
> > VFIO_DEVICE_GET_IRQ_INFO does this for us on x86). The trouble with it
> > being a device ioctl is that you need to get the device FD, but the
> > IOMMU protection needs to be established before you can get that... so
> > there's an ordering problem if you need it from the device before
> > configuring the IOMMU. Thanks,
>
> What do you mean by "IOMMU protection needs
Re: RFC: vfio API changes needed for powerpc
On 04/03/2013 02:43:06 PM, Stuart Yoder wrote:
On Wed, Apr 3, 2013 at 2:18 PM, Scott Wood wrote:
> On 04/03/2013 02:09:45 PM, Stuart Yoder wrote:
>>
>> > Would it be possible for userspace to simply leave room for MSI bank
>> > mapping (how much room could be determined by something like
>> > VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can
>> > DMA_MAP starting at the 0x0 address of the aperture, growing up, and
>> > VFIO will map banks on demand at the top of the aperture, growing down?
>> > Wouldn't that avoid a lot of issues with userspace needing to know
>> > anything about MSI banks (other than count) and coordinating irq numbers
>> > and enabling handlers?
>>
>> This is basically option #A in the original proposals sent. I like
>> this approach, in that it is simpler and keeps user space mostly out of
>> this...which is consistent with how things are done on x86. User space
>> just needs to know how many windows to leave at the top of the aperture.
>> The kernel then has the flexibility to use those windows how it wants.
>>
>> But one question, is when should the kernel actually map (and unmap)
>> the MSI banks.
>
> I think userspace should explicitly request it. Userspace still wouldn't
> need to know anything but the count:
>
> count = VFIO_IOMMU_GET_MSI_BANK_COUNT
> VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
> VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
> // do other DMA maps now, or later, or not at all, doesn't matter
> for (i = 0; i < count; i++)
>     VFIO_IOMMU_MAP_MSI_BANK(iova, i);
> // The kernel now knows where each bank has been mapped, and can update
> // PCI config space appropriately.

And the overall aperture enable/disable would occur on the first dma/msi map() and last dma/msi unmap()?

Yes. We may want the optional ability to do an overall enable/disable for reasons we discussed a while ago, but in the absence of an explicit disable the domain would be enabled on first map.
-Scott
Re: RFC: vfio API changes needed for powerpc
On 04/03/2013 02:09:45 PM, Stuart Yoder wrote:
> Would it be possible for userspace to simply leave room for MSI bank
> mapping (how much room could be determined by something like
> VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can
> DMA_MAP starting at the 0x0 address of the aperture, growing up, and
> VFIO will map banks on demand at the top of the aperture, growing down?
> Wouldn't that avoid a lot of issues with userspace needing to know
> anything about MSI banks (other than count) and coordinating irq numbers
> and enabling handlers?

This is basically option #A in the original proposals sent. I like this approach, in that it is simpler and keeps user space mostly out of this...which is consistent with how things are done on x86. User space just needs to know how many windows to leave at the top of the aperture. The kernel then has the flexibility to use those windows how it wants.

But one question, is when should the kernel actually map (and unmap) the MSI banks. One thing we need to do is enable the aperture...and current thinking is that is done on the first DMA_MAP. Similarly when the last mapping is unmapped we could also unmap the MSI banks. Sequence would be something like:

VFIO_GROUP_SET_CONTAINER            // add groups to the container
VFIO_SET_IOMMU(VFIO_FSL_PAMU)       // set iommu model
cnt = VFIO_IOMMU_GET_MSI_BANK_COUNT // returns max # of MSI banks
VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)  // set overall aperture
VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)   // set # of windows, including MSI banks
VFIO_IOMMU_MAP_DMA                  // map the guest's memory
  ---> kernel enables aperture and maps needed MSI banks here

Maps them where? What if there is more than one explicit DMA mapping? What if DMA mappings are changed during operation?

-Scott
Re: RFC: vfio API changes needed for powerpc
On Wed, Apr 3, 2013 at 2:18 PM, Scott Wood wrote:
> On 04/03/2013 02:09:45 PM, Stuart Yoder wrote:
>>
>> > Would it be possible for userspace to simply leave room for MSI bank
>> > mapping (how much room could be determined by something like
>> > VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can
>> > DMA_MAP starting at the 0x0 address of the aperture, growing up, and
>> > VFIO will map banks on demand at the top of the aperture, growing down?
>> > Wouldn't that avoid a lot of issues with userspace needing to know
>> > anything about MSI banks (other than count) and coordinating irq numbers
>> > and enabling handlers?
>>
>> This is basically option #A in the original proposals sent. I like
>> this approach, in that it is simpler and keeps user space mostly out of
>> this...which is consistent with how things are done on x86. User space
>> just needs to know how many windows to leave at the top of the aperture.
>> The kernel then has the flexibility to use those windows how it wants.
>>
>> But one question, is when should the kernel actually map (and unmap)
>> the MSI banks.
>
> I think userspace should explicitly request it. Userspace still wouldn't
> need to know anything but the count:
>
> count = VFIO_IOMMU_GET_MSI_BANK_COUNT
> VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
> VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
> // do other DMA maps now, or later, or not at all, doesn't matter
> for (i = 0; i < count; i++)
>     VFIO_IOMMU_MAP_MSI_BANK(iova, i);
> // The kernel now knows where each bank has been mapped, and can update
> // PCI config space appropriately.

And the overall aperture enable/disable would occur on the first dma/msi map() and last dma/msi unmap()?

Stuart
Re: RFC: vfio API changes needed for powerpc
On Wed, 2013-04-03 at 14:09 -0500, Stuart Yoder wrote: > > Would is be possible for userspace to simply leave room for MSI bank > > mapping (how much room could be determined by something like > > VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can > > DMA_MAP starting at the 0x0 address of the aperture, growing up, and > > VFIO will map banks on demand at the top of the aperture, growing down? > > Wouldn't that avoid a lot of issues with userspace needing to know > > anything about MSI banks (other than count) and coordinating irq numbers > > and enabling handlers? > > This is basically option #A in the original proposals sent. I like > this approach, in that it > is simpler and keeps user space mostly out of this...which is > consistent with how things are done > on x86. User space just needs to know how many windows to leave at > the top of the aperture. > The kernel then has the flexibility to use those windows how it wants. > > But one question, is when should the kernel actually map (and unmap) > the MSI banks. One thing we need to do is enable the aperture...and current > thinking is that is done on the first DMA_MAP. Similarly when the last > mapping > is unmapped we could also umap the MSI banks. > > Sequence would be something like: > > VFIO_GROUP_SET_CONTAINER // add groups to the container > > VFIO_SET_IOMMU(VFIO_FSL_PAMU)// set iommu model > > cnt = VFIO_IOMMU_GET_MSI_BANK_COUNT// returns max # of MSI banks > > VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY) // set overall aperture > > VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS) // set # of windows, > including MSI banks > > VFIO_IOMMU_MAP_DMA// map the guest's memory >---> kernel enables aperture and maps needed MSI banks here > > VFIO_DEVICE_SET_IRQS >---> kernel sets actual MSI addr/data in physical > device here (I think) You could also make use of the IOMMU_ENABLE/DISABLE entry points that Alexey plans to use. 
Ideally I'd think that you'd want to enable the required MSI banks for a device on DEVICE_SET_IRQs. That's effectively what happens on x86. Perhaps some information stored in the domain structure would let architecture hooks in MSI setup enable those mappings for you? Thanks,

Alex
Re: RFC: vfio API changes needed for powerpc
On 04/03/2013 02:09:45 PM, Stuart Yoder wrote:
> Would it be possible for userspace to simply leave room for MSI bank
> mapping (how much room could be determined by something like
> VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can
> DMA_MAP starting at the 0x0 address of the aperture, growing up, and
> VFIO will map banks on demand at the top of the aperture, growing down?
> Wouldn't that avoid a lot of issues with userspace needing to know
> anything about MSI banks (other than count) and coordinating irq numbers
> and enabling handlers?

This is basically option #A in the original proposals sent. I like this approach, in that it is simpler and keeps user space mostly out of this...which is consistent with how things are done on x86. User space just needs to know how many windows to leave at the top of the aperture. The kernel then has the flexibility to use those windows how it wants.

But one question, is when should the kernel actually map (and unmap) the MSI banks.

I think userspace should explicitly request it. Userspace still wouldn't need to know anything but the count:

count = VFIO_IOMMU_GET_MSI_BANK_COUNT
VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)
VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)
// do other DMA maps now, or later, or not at all, doesn't matter
for (i = 0; i < count; i++)
    VFIO_IOMMU_MAP_MSI_BANK(iova, i);
// The kernel now knows where each bank has been mapped, and can update
// PCI config space appropriately.

One thing we need to do is enable the aperture...and current thinking is that is done on the first DMA_MAP.

What if there are no other mappings required?

-Scott
Re: RFC: vfio API changes needed for powerpc
> Would it be possible for userspace to simply leave room for MSI bank
> mapping (how much room could be determined by something like
> VFIO_IOMMU_GET_MSI_BANK_COUNT) then document the API that userspace can
> DMA_MAP starting at the 0x0 address of the aperture, growing up, and
> VFIO will map banks on demand at the top of the aperture, growing down?
> Wouldn't that avoid a lot of issues with userspace needing to know
> anything about MSI banks (other than count) and coordinating irq numbers
> and enabling handlers?

This is basically option #A in the original proposals sent. I like this approach, in that it is simpler and keeps user space mostly out of this...which is consistent with how things are done on x86. User space just needs to know how many windows to leave at the top of the aperture. The kernel then has the flexibility to use those windows how it wants.

But one question, is when should the kernel actually map (and unmap) the MSI banks. One thing we need to do is enable the aperture...and current thinking is that is done on the first DMA_MAP. Similarly when the last mapping is unmapped we could also unmap the MSI banks.

Sequence would be something like:

VFIO_GROUP_SET_CONTAINER            // add groups to the container
VFIO_SET_IOMMU(VFIO_FSL_PAMU)       // set iommu model
cnt = VFIO_IOMMU_GET_MSI_BANK_COUNT // returns max # of MSI banks
VFIO_IOMMU_SET_ATTR(ATTR_GEOMETRY)  // set overall aperture
VFIO_IOMMU_SET_ATTR(ATTR_WINDOWS)   // set # of windows, including MSI banks
VFIO_IOMMU_MAP_DMA                  // map the guest's memory
  ---> kernel enables aperture and maps needed MSI banks here

VFIO_DEVICE_SET_IRQS
  ---> kernel sets actual MSI addr/data in physical device here (I think)

Stuart
Re: RFC: vfio API changes needed for powerpc
On 04/03/2013 01:32:26 PM, Stuart Yoder wrote:
On Tue, Apr 2, 2013 at 5:50 PM, Scott Wood wrote:
> On 04/02/2013 04:38:45 PM, Alex Williamson wrote:
>>
>> On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote:
>> > VFIO_IOMMU_MAP_MSI(iova, size)
>
> Not sure how you mean "size" to be used -- for MPIC it would be 4K per
> bank, and you can only map one bank at a time (which bank you're mapping
> should be a parameter, if only so that the kernel doesn't have to keep
> iteration state for you).

The intent was for user space to tell the kernel which windows to use for MSI. So I envisioned a total size of window-size * msi-bank-count.

Size doesn't tell the kernel *which* banks to use, only how many. If it already knows which banks are used by the group, then it also knows how many are used. And size is misleading because the mapping is not generally going to be contiguous.

-Scott
Re: RFC: vfio API changes needed for powerpc
On Tue, Apr 2, 2013 at 5:50 PM, Scott Wood wrote:
> On 04/02/2013 04:38:45 PM, Alex Williamson wrote:
>>
>> On Tue, 2013-04-02 at 16:08 -0500, Stuart Yoder wrote:
>> > On Tue, Apr 2, 2013 at 3:57 PM, Scott Wood wrote:
>> > >> >    C.  Explicit mapping using normal DMA map.  The last idea is
>> > >> >    that we would introduce a new ioctl to give user-space an fd
>> > >> >    to the MSI bank, which could be mmapped.  The flow would be
>> > >> >    something like this:
>> > >> >       -for each group user space calls new ioctl
>> > >> >        VFIO_GROUP_GET_MSI_FD
>> > >> >       -user space mmaps the fd, getting a vaddr
>> > >> >       -user space does a normal DMA map for desired iova
>> > >> >    This approach makes everything explicit, but adds a new
>> > >> >    ioctl applicable most likely only to the PAMU (type2 iommu).
>> > >>
>> > >> And the DMA_MAP of that mmap then allows userspace to select the
>> > >> window used?  This one seems like a lot of overhead, adding a new
>> > >> ioctl, new fd, mmap, special mapping path, etc.
>> > >
>> > > There's going to be special stuff no matter what.  This would keep
>> > > it separated from the IOMMU map code.
>> > >
>> > > I'm not sure what you mean by "overhead" here... the runtime
>> > > overhead of setting things up is not particularly relevant as long
>> > > as it's reasonable.  If you mean development and maintenance
>> > > effort, keeping things well separated should help.
>> >
>> > We don't need to change DMA_MAP.  If we can simply add a new "type 2"
>> > ioctl that allows user space to set which windows are MSIs, it seems
>> > vastly less complex than an ioctl to supply a new fd, mmap of it, etc.
>> >
>> > So maybe 2 ioctls:
>> >    VFIO_IOMMU_GET_MSI_COUNT
>
> Do you mean a count of actual MSIs or a count of MSI banks used by the
> whole VFIO group?

I meant # of MSI banks, so VFIO_IOMMU_GET_MSI_BANK_COUNT would be better.

>> >    VFIO_IOMMU_MAP_MSI(iova, size)
>
> Not sure how you mean "size" to be used -- for MPIC it would be 4K per
> bank, and you can only map one bank at a time (which bank you're mapping
> should be a parameter, if only so that the kernel doesn't have to keep
> iteration state for you).

The intent was for user space to tell the kernel which windows to use
for MSI.  So I envisioned a total size of window-size * msi-bank-count.

Stuart
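The two-ioctl proposal above, adjusted for Scott's point that the kernel needs to know *which* bank is being mapped rather than a total size, could look roughly like this from user space. This is a hypothetical sketch only: VFIO_IOMMU_GET_MSI_BANK_COUNT and VFIO_IOMMU_MAP_MSI_BANK are names under discussion, not part of the VFIO ABI, and the struct layout is invented for illustration.

```c
#include <sys/ioctl.h>
#include <linux/types.h>

/* Hypothetical: neither ioctl exists in the VFIO ABI; the names and
 * the struct layout are placeholders for the API being discussed. */
struct vfio_msi_bank_map {
	__u32 argsz;
	__u32 flags;
	__u32 bank;	/* which MSI bank to map -- per-bank, not size-based */
	__u64 iova;	/* iova at which to map this bank's 4K window */
};

static void setup_msi_windows(int container_fd, __u64 base_iova)
{
	int bank_count, i;

	/* Ask how many MSI banks the group's devices actually use */
	bank_count = ioctl(container_fd, VFIO_IOMMU_GET_MSI_BANK_COUNT);

	/* Map each bank explicitly; the windows need not be contiguous */
	for (i = 0; i < bank_count; i++) {
		struct vfio_msi_bank_map map = {
			.argsz = sizeof(map),
			.bank  = i,
			.iova  = base_iova + i * 4096,
		};
		ioctl(container_fd, VFIO_IOMMU_MAP_MSI_BANK, &map);
	}
}
```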
Re: RFC: vfio API changes needed for powerpc
>> > Type1 is arbitrary.  It might as well be named "brown" and this one
>> > can be "blue".
>>
>> The difference is that "type1" seems to refer to hardware that can do
>> arbitrary 4K page mappings, possibly constrained by an aperture but
>> nothing else.  More than one IOMMU can reasonably fit that.  The odds
>> that another IOMMU would have exactly the same restrictions as PAMU
>> seem smaller in comparison.
>>
>> In any case, if you had to deal with some Intel-only quirk, would it
>> make sense to call it a "type1 attribute"?  I'm not advocating one way
>> or the other on whether an abstraction is viable here (though Stuart
>> seems to think it's "highly unlikely anything but a PAMU will comply"),
>> just that if it is to be abstracted rather than a hardware-specific
>> interface, we need to document what is and is not part of the
>> abstraction.  Otherwise a non-PAMU-specific user won't know what they
>> can rely on, and someone adding support for a new windowed IOMMU won't
>> know if theirs is close enough, or they need to introduce a "type3".
>
> So Alexey named the SPAPR IOMMU something related to spapr...
> surprisingly enough.  I'm fine with that.  If you think it's unique
> enough, name it something appropriately.  I haven't seen the code and
> don't know the architecture sufficiently to have an opinion.

The only reason I suggested "type 2" is that I thought that was the
convention... we would enumerate different iommus.  I think that calling
it "pamu" is better and more clear.

Stuart
Re: [PATCH 5/5 v11] iommu/fsl: Freescale PAMU driver and iommu implementation.
On Tue, 2013-04-02 at 18:18 +0200, Joerg Roedel wrote:
> Cc'ing Alex Williamson
>
> Alex, can you please review the iommu-group part of this patch?

Sure, it looks pretty reasonable.  AIUI, all PCI devices are below some
kind of host bridge that is either new and supports partitioning or old
and doesn't.  I don't know if that's a visibility or isolation
requirement, perhaps PCI ACS-ish.  In the new host bridge case, each
device gets a group.  This seems not to have any quirks for
multifunction devices though.  On AMD and Intel IOMMUs we test
multifunction device ACS support to determine whether all the functions
should be in the same group.  Is there any reason to trust multifunction
devices on PAMU?

I also find it curious what happens to the iommu group of the host
bridge.  In the partitionable case the host bridge group is removed, in
the non-partitionable case the host bridge group becomes the group for
the children, removing the host bridge.  It's unique to PAMU so far that
these host bridges are even in an iommu group (x86 only adds pci
devices), but I don't see it as necessarily wrong leaving it in either
scenario.  Does it solve some problem to remove them from the groups?

Thanks,

Alex
RE: [PATCH 5/5 v11] iommu/fsl: Freescale PAMU driver and iommu implementation.
> -----Original Message-----
> From: Sethi Varun-B16395
> Sent: Wednesday, April 03, 2013 12:12 AM
> To: Wood Scott-B07421; Timur Tabi
> Cc: Joerg Roedel; lkml; Kumar Gala; Yoder Stuart-B08248;
> iommu@lists.linux-foundation.org; Benjamin Herrenschmidt;
> linuxppc-...@lists.ozlabs.org
> Subject: RE: [PATCH 5/5 v11] iommu/fsl: Freescale PAMU driver and iommu
> implementation.
>
> > -----Original Message-----
> > From: Wood Scott-B07421
> > Sent: Wednesday, April 03, 2013 7:23 AM
> > To: Timur Tabi
> > Cc: Joerg Roedel; Sethi Varun-B16395; lkml; Kumar Gala; Yoder Stuart-
> > B08248; iommu@lists.linux-foundation.org; Benjamin Herrenschmidt;
> > linuxppc-...@lists.ozlabs.org
> > Subject: Re: [PATCH 5/5 v11] iommu/fsl: Freescale PAMU driver and iommu
> > implementation.
> >
> > On 04/02/2013 08:35:54 PM, Timur Tabi wrote:
> > > On Tue, Apr 2, 2013 at 11:18 AM, Joerg Roedel wrote:
> > > >
> > > > > +	panic("\n");
> > > >
> > > > A kernel panic seems like an over-reaction to an access violation.
> > >
> > > We have no way of determining what code caused the violation, so we
> > > can't just kill the process.  I agree it seems like overkill, but what
> > > else should we do?  Does the IOMMU layer have a way for the IOMMU
> > > driver to stop the device that caused the problem?
> >
> > At a minimum, log a message and continue.  Probably turn off the LIODN,
> > at least if it continues to be noisy (otherwise we could get stuck in an
> > interrupt storm as you note).  Possibly let the user know somehow,
> > especially if it's a VFIO domain.
>
> [Sethi Varun-B16395] We can definitely log the message and disable the
> LIODN (to avoid an interrupt storm), but we definitely need a mechanism
> to inform the vfio subsystem about the error.  Also, disabling the LIODN
> may not be a viable option with the new LIODN allocation scheme (where
> the LIODN would be associated with a domain).

I think for phase 1 of this we should just log the error and shut down
DMA as you described.  We can implement more full-featured error
management, like notifying vfio or the VM somehow, in the future.

Stuart
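Stuart's "log the error and shut down DMA" phase-1 plan could look roughly like the following inside the access-violation ISR, in place of the panic(). This is only a sketch: pamu_disable_liodn() is assumed as the counterpart of pamu_enable_liodn() from this patch set, and avs1_to_liodn() is a placeholder for however the faulting LIODN is decoded from the AVS1 register.

```c
/* Sketch: instead of panic(), log the fault and disable the offending
 * LIODN so a misbehaving device cannot cause an interrupt storm.
 * avs1_to_liodn() is a placeholder decode; pamu_disable_liodn() is an
 * assumed counterpart of pamu_enable_liodn(). */
if (pics & PAMU_ACCESS_VIOLATION_STAT) {
	u32 avs1 = in_be32(p + PAMU_AVS1);
	int liodn = avs1_to_liodn(avs1);	/* placeholder */

	pr_err("fsl-pamu: access violation on liodn %d, disabling it\n",
	       liodn);
	pamu_disable_liodn(liodn);		/* assumed helper */

	/* Write the status back to clear the interrupt condition */
	out_be32(p + PAMU_PICS, pics);
}
```

Notifying the vfio subsystem, as Sethi requests, would still need a separate eventfd-style mechanism and is not shown here.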
Re: [PATCH] intel-iommu: Synchronize gcmd value with global command register
On Wed, 2013-04-03 at 16:11 +0900, Takao Indoh wrote:
> (2013/04/02 23:05), Joerg Roedel wrote:
> > On Mon, Apr 01, 2013 at 02:45:18PM +0900, Takao Indoh wrote:
> >>
> >> enable_IR
> >>   intel_enable_irq_remapping
> >>     iommu_disable_irq_remapping  <== IRES/QIES/TES disabled here
> >>     dmar_disable_qi              <== do nothing
> >>     dmar_enable_qi               <== QIES enabled
> >>     intel_setup_irq_remapping    <== IRES enabled
> >
> > But what we want to do here in the kdump case is to disable translation
> > too, right? Because the former kernel might have translation and
> > irq-remapping enabled and the kdump kernel might be compiled without
> > support for dma-remapping. So if we don't disable translation here too
> > the kdump kernel is unable to do DMA.
>
> Yeah, you are right. I forgot such a case.

If you disable translation and there's some device still doing DMA, it's
going to scribble over random areas of memory.  You really want to have
translation enabled and all the page tables *cleared*, during kexec.  I
think it's fair to insist that the secondary kernel should use the IOMMU
if the first one did.

> To be honest, I also expected the side effect of this patch. As I wrote
> in the previous mail, I'm working on kdump problem with iommu, that is,
> ongoing DMA causes DMAR fault in 2nd kernel and sometimes kdump fails
> due to this fault.

Here you've lost me.  The DMAR fault is caught and reported, and how does
this lead to a kdump failure?  Are you using dodgy hardware that just
keeps *trying* after an abort, and floods the system with a storm of
DMAR faults?  We've occasionally spoken about working around such a
problem by setting a bit to make subsequent faults *silent*.  Would that
work?

> What we have to do is stopping DMA transaction
> before DMA-remapping is disabled in 2nd kernel.

The IOMMU is there to stop DMA transactions.  That is its *job*.  :)

--
David Woodhouse                            Open Source Technology Centre
david.woodho...@intel.com                              Intel Corporation
Re: [PATCH 4/5 v11] iommu/fsl: Add additional iommu attributes required by the PAMU driver.
On Wed, Apr 03, 2013 at 05:21:16AM +, Sethi Varun-B16395 wrote:
> > I would prefer these PAMU specific enum and struct to be in a pamu-
> > specific iommu-header.
>
> [Sethi Varun-B16395] But, these would be used by the IOMMU API users
> (e.g. VFIO), they shouldn't depend on PAMU specific headers.

Drivers that use PAMU specifics (like the PAMU specific VFIO parts) can
also include the PAMU specific header.

	Joerg
Re: [PATCH] intel-iommu: Synchronize gcmd value with global command register
(2013/04/02 23:05), Joerg Roedel wrote:
> On Mon, Apr 01, 2013 at 02:45:18PM +0900, Takao Indoh wrote:
>>
>> enable_IR
>>   intel_enable_irq_remapping
>>     iommu_disable_irq_remapping  <== IRES/QIES/TES disabled here
>>     dmar_disable_qi              <== do nothing
>>     dmar_enable_qi               <== QIES enabled
>>     intel_setup_irq_remapping    <== IRES enabled
>
> But what we want to do here in the kdump case is to disable translation
> too, right? Because the former kernel might have translation and
> irq-remapping enabled and the kdump kernel might be compiled without
> support for dma-remapping. So if we don't disable translation here too
> the kdump kernel is unable to do DMA.

Yeah, you are right. I forgot such a case.

To be honest, I also expected the side effect of this patch. As I wrote
in the previous mail, I'm working on a kdump problem with the iommu, that
is, ongoing DMA causes a DMAR fault in the 2nd kernel and sometimes kdump
fails due to this fault. What we have to do is stop DMA transactions
before DMA-remapping is disabled in the 2nd kernel. To do this, we need
to reset devices before intel_enable_irq_remapping() is called, but that
is very difficult because struct pci_dev is not prepared at this stage.
After applying this patch, DMA-remapping is disabled in init_dmars where
struct pci_dev is ready, so it's easier to handle. For example, we can
add a pci quirk to reset devices.

So, disabling translation in the 2nd kernel is necessary, and it's better
if we can do it after struct pci_dev is prepared. (after subsys_initcall?)
Any idea?

Thanks,
Takao Indoh

> I agree that disabling translation should be a bit more explicit instead
> of the current code.
>
> Joerg
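Takao's idea of a PCI quirk that resets devices once struct pci_dev is available could be sketched as below. This is an assumption-laden sketch, not part of the patch under discussion: the vendor/device IDs are placeholders, and whether the "early" fixup phase runs at the right point relative to init_dmars() is exactly the open question he raises.

```c
/* Hypothetical sketch: quiesce a device known to keep DMA in flight
 * across kexec, before the kdump kernel touches DMA remapping.
 * PCI_VENDOR_ID_EXAMPLE / PCI_DEVICE_ID_EXAMPLE are placeholders. */
static void quirk_kdump_reset(struct pci_dev *pdev)
{
	if (!is_kdump_kernel())
		return;

	/* Issues a function-level reset (or best available fallback),
	 * stopping any DMA the first kernel's driver left running */
	pci_reset_function(pdev);
}
DECLARE_PCI_FIXUP_EARLY(PCI_VENDOR_ID_EXAMPLE, PCI_DEVICE_ID_EXAMPLE,
			quirk_kdump_reset);
```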
RE: [PATCH 5/5 v11] iommu/fsl: Freescale PAMU driver and iommu implementation.
> -----Original Message-----
> From: Joerg Roedel [mailto:j...@8bytes.org]
> Sent: Tuesday, April 02, 2013 9:48 PM
> To: Sethi Varun-B16395
> Cc: Yoder Stuart-B08248; Wood Scott-B07421; iommu@lists.linux-
> foundation.org; linuxppc-...@lists.ozlabs.org; linux-
> ker...@vger.kernel.org; ga...@kernel.crashing.org;
> b...@kernel.crashing.org; Alex Williamson
> Subject: Re: [PATCH 5/5 v11] iommu/fsl: Freescale PAMU driver and iommu
> implementation.
>
> Cc'ing Alex Williamson
>
> Alex, can you please review the iommu-group part of this patch?
>
> My comments so far are below:
>
> On Fri, Mar 29, 2013 at 01:24:02AM +0530, Varun Sethi wrote:
> > +config FSL_PAMU
> > +	bool "Freescale IOMMU support"
> > +	depends on PPC_E500MC
> > +	select IOMMU_API
> > +	select GENERIC_ALLOCATOR
> > +	help
> > +	  Freescale PAMU support.
>
> A bit lame for a help text. Can you elaborate more what PAMU is and when
> it should be enabled?

[Sethi Varun-B16395] Will update the description.

> > +int pamu_enable_liodn(int liodn)
> > +{
> > +	struct paace *ppaace;
> > +
> > +	ppaace = pamu_get_ppaace(liodn);
> > +	if (!ppaace) {
> > +		pr_err("Invalid primary paace entry\n");
> > +		return -ENOENT;
> > +	}
> > +
> > +	if (!get_bf(ppaace->addr_bitfields, PPAACE_AF_WSE)) {
> > +		pr_err("liodn %d not configured\n", liodn);
> > +		return -EINVAL;
> > +	}
> > +
> > +	/* Ensure that all other stores to the ppaace complete first */
> > +	mb();
> > +
> > +	ppaace->addr_bitfields |= PAACE_V_VALID;
> > +	mb();
>
> Why is it sufficient to set the bit in a variable when enabling liodn
> but when disabling it set_bf needs to be called? This looks a bit
> asymmetric.

[Sethi Varun-B16395] Will make it symmetric :)

> > +/* Derive the window size encoding for a particular PAACE entry */
> > +static unsigned int map_addrspace_size_to_wse(phys_addr_t addrspace_size)
> > +{
> > +	/* Bug if not a power of 2 */
> > +	BUG_ON((addrspace_size & (addrspace_size - 1)));
>
> Please use is_power_of_2 here.
>
> > +
> > +	/* window size is 2^(WSE+1) bytes */
> > +	return __ffs(addrspace_size >> PAMU_PAGE_SHIFT) + PAMU_PAGE_SHIFT - 1;
>
> The PAMU_PAGE_SHIFT shifting and adding looks redundant.
>
> > +	if ((win_size & (win_size - 1)) || win_size < PAMU_PAGE_SIZE) {
> > +		pr_err("window size too small or not a power of two %llx\n", win_size);
> > +		return -EINVAL;
> > +	}
> > +
> > +	if (win_addr & (win_size - 1)) {
> > +		pr_err("window address is not aligned with window size\n");
> > +		return -EINVAL;
> > +	}
>
> Again, use is_power_of_2 instead of hand-coding.

[Sethi Varun-B16395] ok

> > +	if (~stashid != 0)
> > +		set_bf(paace->impl_attr, PAACE_IA_CID, stashid);
> > +
> > +	smp_wmb();
> > +
> > +	if (enable)
> > +		paace->addr_bitfields |= PAACE_V_VALID;
>
> Haven't you written a helper function to set this bit?

[Sethi Varun-B16395] We already have a PAACE entry with us here, so we
can directly manipulate it.

> > +irqreturn_t pamu_av_isr(int irq, void *arg)
> > +{
> > +	struct pamu_isr_data *data = arg;
> > +	phys_addr_t phys;
> > +	unsigned int i, j;
> > +
> > +	pr_emerg("fsl-pamu: access violation interrupt\n");
> > +
> > +	for (i = 0; i < data->count; i++) {
> > +		void __iomem *p = data->pamu_reg_base + i * PAMU_OFFSET;
> > +		u32 pics = in_be32(p + PAMU_PICS);
> > +
> > +		if (pics & PAMU_ACCESS_VIOLATION_STAT) {
> > +			pr_emerg("POES1=%08x\n", in_be32(p + PAMU_POES1));
> > +			pr_emerg("POES2=%08x\n", in_be32(p + PAMU_POES2));
> > +			pr_emerg("AVS1=%08x\n", in_be32(p + PAMU_AVS1));
> > +			pr_emerg("AVS2=%08x\n", in_be32(p + PAMU_AVS2));
> > +			pr_emerg("AVA=%016llx\n", make64(in_be32(p + PAMU_AVAH),
> > +				in_be32(p + PAMU_AVAL)));
> > +			pr_emerg("UDAD=%08x\n", in_be32(p + PAMU_UDAD));
> > +			pr_emerg("POEA=%016llx\n", make64(in_be32(p + PAMU_POEAH),
> > +				in_be32(p + PAMU_POEAL)));
> > +
> > +			phys = make64(in_be32(p + PAMU_POEAH),
> > +				in_be32(p + PAMU_POEAL));
> > +
> > +			/* Assume that POEA points to a PAACE */
> > +			if (phys) {
> > +				u32 *paace = phys_to_virt(phys);
> > +
> > +				/* Only the first four words are relevant */
> > +				for (j = 0; j < 4; j++)
> > +					pr_emerg("PAACE[%u]=%08x\n", j, in_be32(paace + j));
> > +			}
> > +		}
> > +	}
> > +
> > +	panic("\n");
>
> A kernel panic seems like an over-reaction to an access violation.
> Besides the device that caused the violation the system should still
> work, no?

[Sethi Varun-B16395] Well, if devic