Re: [PATCH] iommu/vt-d: Increase DMAR_UNITS_SUPPORTED
On 2022/5/6 03:46, Steve Wahl wrote:
> Increase DMAR_UNITS_SUPPORTED to support 64 sockets with 10 DMAR units
> each, for a total of 640.
>
> If the available hardware exceeds DMAR_UNITS_SUPPORTED (previously set
> to MAX_IO_APICS, or 128), it causes these messages: "DMAR: Failed to
> allocate seq_id", "DMAR: Parse DMAR table failure.", and "x2apic: IRQ
> remapping doesn't support X2APIC mode x2apic disabled"; and the system
> fails to boot.
>
> Signed-off-by: Steve Wahl
> Reviewed-by: Mike Travis
> ---
>
> Note that we could not find a reason for connecting
> DMAR_UNITS_SUPPORTED to MAX_IO_APICS as was done previously. Perhaps it
> seemed like the two would continue to match on earlier processors.
> There doesn't appear to be kernel code that assumes that the value of
> one is related to the other.

+Kevin

This maximum value was introduced by the commit below, and I don't see
any hardware/software restriction that prevents us from enlarging it
after ten years.

    commit 1b198bb04ad72669d4bd6575fc9945ed595bfee0
    Author: Mike Travis
    Date:   Mon Mar 5 15:05:16 2012 -0800

        x86/iommu/intel: Increase the number of iommus supported to
        MAX_IO_APICS

        The number of IOMMUs supported should be the same as the number
        of IO APICS. This limit comes into play when the IOMMUs are
        identity mapped, thus the number of possible IOMMUs in the
        "static identity" (si) domain should be this same number.

[...]

>  include/linux/dmar.h | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/include/linux/dmar.h b/include/linux/dmar.h
> index 45e903d84733..9d4867b8f42e 100644
> --- a/include/linux/dmar.h
> +++ b/include/linux/dmar.h
> @@ -19,7 +19,7 @@ struct acpi_dmar_header;
>
>  #ifdef CONFIG_X86
> -# define DMAR_UNITS_SUPPORTED	MAX_IO_APICS
> +# define DMAR_UNITS_SUPPORTED	640
>  #else
>  # define DMAR_UNITS_SUPPORTED	64
>  #endif

Best regards,
baolu
Re: [PATCH v5 10/12] iommu: Prepare IOMMU domain for IOPF
On 2022/5/5 21:38, Jean-Philippe Brucker wrote:
> Hi Baolu,
>
> On Thu, May 05, 2022 at 04:31:38PM +0800, Baolu Lu wrote:
>> On 2022/5/4 02:20, Jean-Philippe Brucker wrote:
>>>> diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
>>>> index 7cae631c1baa..33449523afbe 100644
>>>> --- a/drivers/iommu/iommu.c
>>>> +++ b/drivers/iommu/iommu.c
>>>> @@ -3174,3 +3174,24 @@ void iommu_detach_device_pasid(struct iommu_domain *domain,
>>>>
>>>>  	iommu_group_put(group);
>>>>  }
>>>> +
>>>> +struct iommu_domain *iommu_get_domain_for_dev_pasid(struct device *dev,
>>>> +						    ioasid_t pasid)
>>>> +{
>>>> +	struct iommu_domain *domain;
>>>> +	struct iommu_group *group;
>>>> +
>>>> +	if (!pasid_valid(pasid))
>>>> +		return NULL;
>>>> +
>>>> +	group = iommu_group_get(dev);
>>>> +	if (!group)
>>>> +		return NULL;
>>>> +
>>>> +	mutex_lock(&group->mutex);
>>>
>>> Unfortunately this still causes the deadlock when unbind() flushes
>>> the IOPF queue while holding the group mutex.
>>
>> Sorry, I didn't get your point here. Do you mean unbind() could hold
>> the group mutex before calling this helper? The group mutex is only
>> available in iommu.c. The unbind() has no means to hold this lock.
>> Or did I miss anything?
>
> I wasn't clear; it's iommu_detach_device_pasid() that holds the
> group->mutex:
>
>  iommu_sva_unbind_device()        |
>  iommu_detach_device_pasid()      |
>   mutex_lock(&group->mutex)       |
>   domain->ops->detach_dev_pasid() | iopf_handle_group()
>    iopf_queue_flush_dev()         |  iopf_queue_flush_dev()
>     ... wait for IOPF work        |   iommu_get_domain_for_dev_pasid()
>                                   |    mutex_lock(&group->mutex)
>                                   |     ... deadlock

Ah! Yes. Thank you for the clarification.

> Thanks,
> Jean

>> Best regards,
>> baolu

>>> If we make this function private to IOPF, then we can get rid of this
>>> mutex_lock(). It's OK because:
>>>
>>> * xarray protects its internal state with RCU, so we can call
>>>   xa_load() outside the lock.
>>>
>>> * The domain obtained from xa_load is finalized. Its content is valid
>>>   because xarray stores the domain using rcu_assign_pointer(), which
>>>   has a release memory barrier, which pairs with data dependencies in
>>>   IOPF (domain->sva_ioas etc). We'll need to be careful about this
>>>   when allowing other users to install a fault handler. Should be
>>>   fine as long as the handler and data are installed before the
>>>   domain is added to pasid_array.
>>>
>>> * We know the domain is valid the whole time IOPF is using it,
>>>   because unbind() waits for pending faults.
>>>
>>> We just need a comment explaining the last point, something like:
>>>
>>> 	/*
>>> 	 * Safe to fetch outside the group mutex because:
>>> 	 * - xarray protects its internal state with RCU
>>> 	 * - the domain obtained is either NULL or fully formed
>>> 	 * - the IOPF work is the only caller and is flushed before the
>>> 	 *   domain is freed.
>>> 	 */
>>
>> Agreed. The mutex is needed only when the domain could possibly be
>> freed before unbind(). In that case, we need this mutex and to get a
>> reference from the domain. As we have dropped the domain user
>> reference, this lock is unnecessary.
>
> Thanks,
> Jean

>>>> +	domain = xa_load(&group->pasid_array, pasid);
>>>> +	mutex_unlock(&group->mutex);
>>>> +	iommu_group_put(group);
>>>> +
>>>> +	return domain;
>>>> +}

Best regards,
baolu
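To make the suggested direction concrete, here is a sketch of the
IOPF-private lookup without the group mutex, assembled from the diff and
the comment proposed above; the function name
iopf_get_domain_for_dev_pasid() is a placeholder, and the sketch assumes
the IOPF work is the only caller and that unbind() flushes pending
faults first:

/* Sketch only: called exclusively from IOPF work, per the discussion above. */
static struct iommu_domain *
iopf_get_domain_for_dev_pasid(struct device *dev, ioasid_t pasid)
{
	struct iommu_domain *domain;
	struct iommu_group *group;

	if (!pasid_valid(pasid))
		return NULL;

	group = iommu_group_get(dev);
	if (!group)
		return NULL;

	/*
	 * Safe to fetch outside the group mutex because:
	 * - xarray protects its internal state with RCU
	 * - the domain obtained is either NULL or fully formed
	 * - the IOPF work is the only caller and is flushed before the
	 *   domain is freed.
	 */
	domain = xa_load(&group->pasid_array, pasid);
	iommu_group_put(group);

	return domain;
}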
Re: [PATCH v3] iommu/mediatek: Fix NULL pointer dereference when printing dev_name
On Thu, 2022-05-05 at 21:27 +0800, Miles Chen wrote:
> When larbdev is NULL (in the case I hit, the node is incorrectly set
> iommus = <&iommu0 NUM>), it will cause device_link_add() fail and
> kernel crashes when we try to print dev_name(larbdev).
>
> Let's fail the probe if a larbdev is NULL to avoid invalid inputs from
> dts.
>
> It should work for normal correct setting and avoid the crash caused
> by my incorrect setting.
>
> Error log:
> [   18.189042][  T301] Unable to handle kernel NULL pointer
> dereference at virtual address 0050
> ...
> [   18.344519][  T301] pstate: a045 (NzCv daif +PAN -UAO)
> [   18.345213][  T301] pc : mtk_iommu_probe_device+0xf8/0x118 [mtk_iommu]
> [   18.346050][  T301] lr : mtk_iommu_probe_device+0xd0/0x118 [mtk_iommu]
> [   18.346884][  T301] sp : ffc00a5635e0
> [   18.347392][  T301] x29: ffc00a5635e0 x28: ffd44a46c1d8
> [   18.348156][  T301] x27: ff80c39a8000 x26: ffd44a80cc38
> [   18.348917][  T301] x25:  x24: ffd44a80cc38
> [   18.349677][  T301] x23: ffd44e4da4c6 x22: ffd44a80cc38
> [   18.350438][  T301] x21: ff80cecd1880 x20: 
> [   18.351198][  T301] x19: ff80c439f010 x18: ffc00a50d0c0
> [   18.351959][  T301] x17:  x16: 0004
> [   18.352719][  T301] x15: 0004 x14: ffd44eb5d420
> [   18.353480][  T301] x13: 0ad2 x12: 0003
> [   18.354241][  T301] x11: fad2 x10: c000fad2
> [   18.355003][  T301] x9 : a0d288d8d7142d00 x8 : a0d288d8d7142d00
> [   18.355763][  T301] x7 : ffd44c2bc640 x6 : 
> [   18.356524][  T301] x5 : 0080 x4 : 0001
> [   18.357284][  T301] x3 :  x2 : 0005
> [   18.358045][  T301] x1 :  x0 : 
> [   18.360208][  T301] Hardware name: MT6873 (DT)
> [   18.360771][  T301] Call trace:
> [   18.361168][  T301]  dump_backtrace+0xf8/0x1f0
> [   18.361737][  T301]  dump_stack_lvl+0xa8/0x11c
> [   18.362305][  T301]  dump_stack+0x1c/0x2c
> [   18.362816][  T301]  mrdump_common_die+0x184/0x40c [mrdump]
> [   18.363575][  T301]  ipanic_die+0x24/0x38 [mrdump]
> [   18.364230][  T301]  atomic_notifier_call_chain+0x128/0x2b8
> [   18.364937][  T301]  die+0x16c/0x568
> [   18.365394][  T301]  __do_kernel_fault+0x1e8/0x214
> [   18.365402][  T301]  do_page_fault+0xb8/0x678
> [   18.366934][  T301]  do_translation_fault+0x48/0x64
> [   18.368645][  T301]  do_mem_abort+0x68/0x148
> [   18.368652][  T301]  el1_abort+0x40/0x64
> [   18.368660][  T301]  el1h_64_sync_handler+0x54/0x88
> [   18.368668][  T301]  el1h_64_sync+0x68/0x6c
> [   18.368673][  T301]  mtk_iommu_probe_device+0xf8/0x118 [mtk_iommu]
> ...
>
> Cc: Robin Murphy
> Cc: Yong Wu
> Reported-by: kernel test robot
> Fixes: 635319a4a744 ("media: iommu/mediatek: Add device_link between
> the consumer and the larb devices")
> Signed-off-by: Miles Chen

Reviewed-by: Yong Wu

And this doesn't conflict with the MT8195 patchset. Thanks.
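For reference, a sketch of the shape of the fix the changelog
describes, in the style of mtk_iommu_probe_device(); the surrounding
variable names are taken from context and the exact placement in the
real patch may differ:

	/*
	 * Sketch: reject a missing larb device up front instead of
	 * crashing in device_link_add()/dev_name() later.
	 */
	larbdev = data->larb_imu[larbid].dev;
	if (!larbdev)
		return ERR_PTR(-EINVAL);

	link = device_link_add(dev, larbdev,
			       DL_FLAG_PM_RUNTIME | DL_FLAG_STATELESS);
	if (!link)
		dev_err(dev, "Unable to link %s\n", dev_name(larbdev));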
Re: [PATCH v4 05/11] iommu/sva: Assign a PASID to mm on PASID allocation and free it on mm exit
On 2022/5/3 15:49, Jean-Philippe Brucker wrote:
> On Sat, Apr 30, 2022 at 03:33:17PM +0800, Baolu Lu wrote:
>> Jean, another quick question about iommu_sva_bind_device():
>>
>>   /**
>>    * iommu_sva_bind_device() - Bind a process address space to a device
>>    * @dev: the device
>>    * @mm: the mm to bind, caller must hold a reference to it
>>    * @drvdata: opaque data pointer to pass to bind callback
>>
>> This interface requires the caller to take a reference to mm. Which
>> reference should it take, mm->mm_count or mm->mm_users? It's better
>> to make this explicit in the comment.
>
> Agreed, it's mm_users as required by mmu_notifier_register()

Thanks! I will add this in my refactoring patch.

Best regards,
baolu
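As an illustration of the distinction, a sketch of what a caller of
iommu_sva_bind_device() is expected to do: mmget()/mmput() manage
mm->mm_users (the address space), which is what mmu_notifier_register()
requires, while mmgrab()/mmdrop() only pin the mm structure itself. The
driver function name here is hypothetical:

/* Sketch: pin the address space (mm_users), not just the mm (mm_count). */
static int my_driver_sva_bind(struct device *dev)
{
	struct mm_struct *mm = current->mm;
	struct iommu_sva *handle;

	if (!mmget_not_zero(mm))	/* take an mm_users reference */
		return -EINVAL;

	handle = iommu_sva_bind_device(dev, mm, NULL);
	if (IS_ERR(handle)) {
		mmput(mm);		/* drop mm_users on failure */
		return PTR_ERR(handle);
	}

	/*
	 * ... use the PASID; on teardown:
	 * iommu_sva_unbind_device(handle); mmput(mm);
	 */
	return 0;
}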
[PATCH v3 4/4] iommu/vt-d: Remove hard coding PGSNP bit in PASID entries
As enforce_cache_coherency has been introduced into the
iommu_domain_ops, the kernel component which owns the iommu domain is
able to opt in to its requirement for force snooping support. The iommu
driver has no need to hard code the page snoop control bit in the PASID
table entries anymore.

Signed-off-by: Lu Baolu
Reviewed-by: Kevin Tian
---
 drivers/iommu/intel/pasid.c | 3 ---
 1 file changed, 3 deletions(-)

diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index 41a0e3b02c79..0abfa7fc7fb0 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -710,9 +710,6 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
 	pasid_set_fault_enable(pte);
 	pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap));
 
-	if (domain->domain.type == IOMMU_DOMAIN_UNMANAGED)
-		pasid_set_pgsnp(pte);
-
 	/*
 	 * Since it is a second level only translation setup, we should
 	 * set SRE bit as well (addresses are expected to be GPAs).
-- 
2.25.1
[PATCH v3 3/4] iommu/vt-d: Remove domain_update_iommu_snooping()
The IOMMU force snooping capability is not required to be consistent
among all the IOMMUs anymore. Remove the force snooping capability
check in the IOMMU hot-add path; domain_update_iommu_snooping() thereby
becomes dead code.

Signed-off-by: Lu Baolu
Reviewed-by: Jason Gunthorpe
Reviewed-by: Kevin Tian
---
 drivers/iommu/intel/iommu.c | 34 +-
 1 file changed, 1 insertion(+), 33 deletions(-)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 048ebfbd5fcb..444d51a18c93 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -533,33 +533,6 @@ static void domain_update_iommu_coherency(struct dmar_domain *domain)
 	rcu_read_unlock();
 }
 
-static bool domain_update_iommu_snooping(struct intel_iommu *skip)
-{
-	struct dmar_drhd_unit *drhd;
-	struct intel_iommu *iommu;
-	bool ret = true;
-
-	rcu_read_lock();
-	for_each_active_iommu(iommu, drhd) {
-		if (iommu != skip) {
-			/*
-			 * If the hardware is operating in the scalable mode,
-			 * the snooping control is always supported since we
-			 * always set PASID-table-entry.PGSNP bit if the domain
-			 * is managed outside (UNMANAGED).
-			 */
-			if (!sm_supported(iommu) &&
-			    !ecap_sc_support(iommu->ecap)) {
-				ret = false;
-				break;
-			}
-		}
-	}
-	rcu_read_unlock();
-
-	return ret;
-}
-
 static int domain_update_iommu_superpage(struct dmar_domain *domain,
 					 struct intel_iommu *skip)
 {
@@ -3606,12 +3579,7 @@ static int intel_iommu_add(struct dmar_drhd_unit *dmaru)
 			iommu->name);
 		return -ENXIO;
 	}
-	if (!ecap_sc_support(iommu->ecap) &&
-	    domain_update_iommu_snooping(iommu)) {
-		pr_warn("%s: Doesn't support snooping.\n",
-			iommu->name);
-		return -ENXIO;
-	}
+
 	sp = domain_update_iommu_superpage(NULL, iommu) - 1;
 	if (sp >= 0 && !(cap_super_page_val(iommu->cap) & (1 << sp))) {
 		pr_warn("%s: Doesn't support large page.\n",
-- 
2.25.1
[PATCH v3 2/4] iommu/vt-d: Check domain force_snooping against attached devices
As domain->force_snooping only impacts the devices attached to the
domain, there's no need to check against all IOMMU units. On the other
hand, force_snooping could be set on a domain no matter whether it has
been attached or not, and once set it is an immutable flag. If no
device is attached, the operation always succeeds; then this empty
domain can only be attached to a device whose IOMMU supports snoop
control.

Signed-off-by: Lu Baolu
---
 include/linux/intel-iommu.h |  1 +
 drivers/iommu/intel/pasid.h |  2 ++
 drivers/iommu/intel/iommu.c | 53 ++++++++++++++++++++++++++++++++---
 drivers/iommu/intel/pasid.c | 23 ++++++++++++++++
 4 files changed, 76 insertions(+), 3 deletions(-)

diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h
index 72e5d7900e71..4f29139bbfc3 100644
--- a/include/linux/intel-iommu.h
+++ b/include/linux/intel-iommu.h
@@ -540,6 +540,7 @@ struct dmar_domain {
 	u8 has_iotlb_device: 1;
 	u8 iommu_coherency: 1;	/* indicate coherency of iommu access */
 	u8 force_snooping : 1;	/* Create IOPTEs with snoop control */
+	u8 set_pte_snp:1;
 
 	struct list_head devices;	/* all devices' list */
 	struct iova_domain iovad;	/* iova's that belong to this domain */
diff --git a/drivers/iommu/intel/pasid.h b/drivers/iommu/intel/pasid.h
index ab4408c824a5..583ea67fc783 100644
--- a/drivers/iommu/intel/pasid.h
+++ b/drivers/iommu/intel/pasid.h
@@ -123,4 +123,6 @@ void intel_pasid_tear_down_entry(struct intel_iommu *iommu,
 				 bool fault_ignore);
 int vcmd_alloc_pasid(struct intel_iommu *iommu, u32 *pasid);
 void vcmd_free_pasid(struct intel_iommu *iommu, u32 pasid);
+void intel_pasid_setup_page_snoop_control(struct intel_iommu *iommu,
+					  struct device *dev, u32 pasid);
 #endif /* __INTEL_PASID_H */
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index b4802f4055a0..048ebfbd5fcb 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -2459,7 +2459,7 @@ static int domain_setup_first_level(struct intel_iommu *iommu,
 	if (level == 5)
 		flags |= PASID_FLAG_FL5LP;
 
-	if (domain->domain.type == IOMMU_DOMAIN_UNMANAGED)
+	if (domain->force_snooping)
 		flags |= PASID_FLAG_PAGE_SNOOP;
 
 	return intel_pasid_setup_first_level(iommu, dev, (pgd_t *)pgd, pasid,
@@ -,7 +,7 @@ static int intel_iommu_map(struct iommu_domain *domain,
 		prot |= DMA_PTE_READ;
 	if (iommu_prot & IOMMU_WRITE)
 		prot |= DMA_PTE_WRITE;
-	if (dmar_domain->force_snooping)
+	if (dmar_domain->set_pte_snp)
 		prot |= DMA_PTE_SNP;
 
 	max_addr = iova + size;
@@ -4567,13 +4567,60 @@ static phys_addr_t intel_iommu_iova_to_phys(struct iommu_domain *domain,
 	return phys;
 }
 
+static bool domain_support_force_snooping(struct dmar_domain *domain)
+{
+	struct device_domain_info *info;
+	bool support = true;
+
+	assert_spin_locked(&device_domain_lock);
+	list_for_each_entry(info, &domain->devices, link) {
+		if (!ecap_sc_support(info->iommu->ecap)) {
+			support = false;
+			break;
+		}
+	}
+
+	return support;
+}
+
+static void domain_set_force_snooping(struct dmar_domain *domain)
+{
+	struct device_domain_info *info;
+
+	assert_spin_locked(&device_domain_lock);
+
+	/*
+	 * Second level page table supports per-PTE snoop control. The
+	 * iommu_map() interface will handle this by setting SNP bit.
+	 */
+	if (!domain_use_first_level(domain)) {
+		domain->set_pte_snp = true;
+		return;
+	}
+
+	list_for_each_entry(info, &domain->devices, link)
+		intel_pasid_setup_page_snoop_control(info->iommu, info->dev,
+						     PASID_RID2PASID);
+}
+
 static bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain)
 {
 	struct dmar_domain *dmar_domain = to_dmar_domain(domain);
+	unsigned long flags;
 
-	if (!domain_update_iommu_snooping(NULL))
+	if (dmar_domain->force_snooping)
+		return true;
+
+	spin_lock_irqsave(&device_domain_lock, flags);
+	if (!domain_support_force_snooping(dmar_domain)) {
+		spin_unlock_irqrestore(&device_domain_lock, flags);
 		return false;
+	}
+
+	domain_set_force_snooping(dmar_domain);
 	dmar_domain->force_snooping = true;
+	spin_unlock_irqrestore(&device_domain_lock, flags);
+
 	return true;
 }
 
diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c
index f8d215d85695..41a0e3b02c79 100644
--- a/drivers/iommu/intel/pasid.c
+++ b/drivers/iommu/intel/pasid.c
@@ -762,3 +762,26 @@ int intel_pasid_setup_pass_through(struct intel_iommu *iommu,
 
 	return 0;
 }
+
+/*
+ * Set the page snoop
[PATCH v3 1/4] iommu/vt-d: Block force-snoop domain attaching if no SC support
In the attach_dev callback of the default domain ops, if the domain has
been set force_snooping, but the iommu hardware of the device does not
support SC (Snoop Control) capability, the callback should block it and
return a corresponding error code.

Signed-off-by: Lu Baolu
Reviewed-by: Jason Gunthorpe
Reviewed-by: Kevin Tian
---
 drivers/iommu/intel/iommu.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 2990f80c5e08..b4802f4055a0 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4367,6 +4367,9 @@ static int prepare_domain_attach_device(struct iommu_domain *domain,
 	if (!iommu)
 		return -ENODEV;
 
+	if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap))
+		return -EOPNOTSUPP;
+
 	/* check if this iommu agaw is sufficient for max mapped address */
 	addr_width = agaw_to_width(iommu->agaw);
 	if (addr_width > cap_mgaw(iommu->cap))
-- 
2.25.1
[PATCH v3 0/4] iommu/vt-d: Force snooping improvement
Hi folks,

Previously, the IOMMU capability of enforcing cache coherency was
queried through iommu_capable(IOMMU_CAP_CACHE_COHERENCY). This is a
global capability, hence the IOMMU driver reports support for this
capability only when all IOMMUs in the system have this support.

Commit 6043257b1de06 ("iommu: Introduce the domain op
enforce_cache_coherency()") converts this into a per-domain
test-and-set option, and the previous
iommu_capable(IOMMU_CAP_CACHE_COHERENCY) is deprecated.

This is a follow-up series which improves the Intel IOMMU driver to
better support the per-domain scheme.

Best regards,
baolu

Change log:

v3:
 - Hold the device_domain_lock when checking and setting force
   snooping.
 - Refine the commit messages.

v2:
 - https://lore.kernel.org/linux-iommu/20220505010710.1477739-1-baolu...@linux.intel.com/
 - Check whether force_snooping has already been set in
   intel_iommu_enforce_cache_coherency().
 - Set the PGSNP pasid bit field during domain attach if
   force_snooping is set.
 - Remove redundant list_empty() checks.
 - Add dmar_domain->set_pte_snp and set it if force snooping is
   enforced on a domain with second-level translation.

v1:
 - https://lore.kernel.org/linux-iommu/20220501112434.874236-1-baolu...@linux.intel.com
 - Initial post.

Lu Baolu (4):
  iommu/vt-d: Block force-snoop domain attaching if no SC support
  iommu/vt-d: Check domain force_snooping against attached devices
  iommu/vt-d: Remove domain_update_iommu_snooping()
  iommu/vt-d: Remove hard coding PGSNP bit in PASID entries

 include/linux/intel-iommu.h |  1 +
 drivers/iommu/intel/pasid.h |  2 +
 drivers/iommu/intel/iommu.c | 90 ++++++++++++++++++++++++-----------
 drivers/iommu/intel/pasid.c | 26 +++++++++--
 4 files changed, 80 insertions(+), 39 deletions(-)

-- 
2.25.1
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Thu, May 05, 2022 at 04:07:28PM -0300, Jason Gunthorpe wrote:
> On Mon, May 02, 2022 at 05:30:05PM +1000, David Gibson wrote:
>
>>> It is a bit more CPU work since maps in the lower range would have
>>> to be copied over, but conceptually the model matches the HW
>>> nesting.
>>
>> Ah.. ok. IIUC what you're saying is that the kernel-side IOASes have
>> fixed windows, but we fake dynamic windows in the userspace
>> implementation by flipping the devices over to a new IOAS with the
>> new windows. Is that right?
>
> Yes
>
>> Where exactly would the windows be specified? My understanding was
>> that when creating a back-end specific IOAS, that would typically be
>> for the case where you're using a user / guest managed IO pagetable,
>> with the backend specifying the format for that. In the ppc case
>> we'd need to specify the windows, but we'd still need the
>> IOAS_MAP/UNMAP operations to manage the mappings. The PAPR vIOMMU is
>> paravirtualized, so all updates come via hypercalls, so there's no
>> user/guest managed data structure.
>
> When the iommu_domain is created I want to have a
> iommu-driver-specific struct, so PPC can customize its iommu_domain
> however it likes.

This requires that the client be aware of the host side IOMMU model.
That's true in VFIO now, and it's nasty; I was really hoping we could
*stop* doing that.

Note that I'm talking here *purely* about the non-optimized case where
all updates to the host side IO pagetables are handled by IOAS_MAP /
IOAS_COPY, with no direct hardware access to user or guest managed IO
pagetables. The optimized case obviously requires end-to-end agreement
on the pagetable format amongst other domain properties.

What I'm hoping is that qemu (or whatever) can use this non-optimized
case as a fallback where it doesn't need to know the properties of
whatever host side IOMMU models there are. It just requests what it
needs based on the vIOMMU properties it needs to replicate, and the
host kernel either can supply it or can't.

In many cases it should be perfectly possible to emulate a PPC style
vIOMMU on an x86 host, because the x86 IOMMU has such a colossal
aperture that it will encompass wherever the ppc apertures end up.
Similarly we could simulate an x86-style no-vIOMMU guest on a ppc host
(currently somewhere between awkward and impossible) by placing the
host apertures to cover guest memory.

Admittedly those are pretty niche cases, but allowing for them gives
us flexibility for the future. Emulating an ARM SMMUv3 guest on an ARM
SMMU v4 or v5 or v.whatever host is likely to be a real case in the
future, and AFAICT, ARM are much less conservative than x86 about
maintaining similar hw interfaces over time. That's why I think
considering these ppc cases will give a more robust interface for
other future possibilities as well.

>> That should work from the point of view of the userspace and guest
>> side interfaces. It might be fiddly from the point of view of the
>> back end. The ppc iommu doesn't really have the notion of
>> configurable domains - instead the address spaces are the hardware
>> or firmware fixed PEs, so they have a fixed set of devices. At the
>> bare metal level it's possible to sort of do domains by making the
>> actual pagetable pointers for several PEs point to a common place.
>
> I'm not sure I understand this - a domain is just a storage container
> for an IO page table, if the HW has IOPTEs then it should be able to
> have a domain?
>
> Making page table pointers point to a common IOPTE tree is exactly
> what iommu_domains are for - why is that "sort of" for ppc?

Ok, fair enough; it's only "sort of" in the sense that the hw specs /
docs don't present any equivalent concept.

>> However, in the future, nested KVM under PowerVM is likely to be the
>> norm. In that situation the L1 as well as the L2 only has the
>> paravirtualized interfaces, which don't have any notion of domains,
>> only PEs. All updates take place via hypercalls which explicitly
>> specify a PE (strictly speaking they take a "Logical IO Bus Number"
>> (LIOBN), but those generally map one to one with PEs), so it can't
>> use shared pointer tricks either.
>
> How do the paravirtualized interfaces deal with the page table? Does
> it call a map/unmap hypercall instead of providing guest IOPTEs?

Sort of. The main interface is H_PUT_TCE ("TCE" - Translation Control
Entry - being IBMese for an IOPTE). This takes an LIOBN (which selects
which PE and aperture), an IOVA and a TCE value - which is a guest
physical address plus some permission bits. There are some variants
for performance that can set a batch of IOPTEs from a buffer, or clear
a range of IOPTEs, but they're just faster ways of doing the same
thing as a bunch of H_PUT_TCE calls.

You can consider that a map/unmap hypercall, but the size of the
mapping is fixed (the IO pagesize which was previously set for the
aperture).

> Assuming yes,
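To make the paravirtualized flow concrete, a sketch of how an
IOAS_MAP-style request could decompose into per-IOPTE H_PUT_TCE
hypercalls; the LIOBN/IOVA/TCE arguments come from the description
above, while h_put_tce(), TCE_READ and TCE_WRITE are stand-in names,
not the actual PAPR definitions:

/* Illustrative only: map [iova, iova + len) to guest physical pages. */
static long papr_map_range(unsigned long liobn, unsigned long iova,
			   unsigned long gpa, unsigned long len,
			   unsigned long io_pgsize)
{
	unsigned long off;
	long rc;

	for (off = 0; off < len; off += io_pgsize) {
		/* A TCE is a guest physical address plus permission bits. */
		unsigned long tce = (gpa + off) | TCE_READ | TCE_WRITE;

		/* The LIOBN selects the PE and aperture being updated. */
		rc = h_put_tce(liobn, iova + off, tce);
		if (rc)
			return rc;
	}

	return 0;
}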
RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
> From: Jason Gunthorpe
> Sent: Thursday, May 5, 2022 10:08 PM
>
> On Thu, May 05, 2022 at 07:40:37AM +0000, Tian, Kevin wrote:
>
>> In concept this is an iommu property instead of a domain property.
>
> Not really, domains shouldn't be changing behaviors once they are
> created. If a domain supports dirty tracking and I attach a new
> device then it still must support dirty tracking.

That sort of suggests that userspace should specify whether a domain
supports dirty tracking when it's created. But how does userspace know
that it should create the domain in this way in the first place? Live
migration is triggered on demand and it may not happen in the lifetime
of a VM. And if the user always creates domains to allow dirty
tracking by default, how does it know that a failed attach is due to
missing dirty tracking support in the IOMMU, so that it can create
another domain which disables dirty tracking and retry the attach?

In any case IMHO having a device capability still sounds reasonable
even in the above model, so userspace can create a domain with the
right property based on a potential list of devices to be attached.
Once the domain is created, further attached devices must be
compatible with the domain property.

> I suppose we may need something here because we need to control when
> domains are re-used if they don't have the right properties in case
> the system iommu's are discontiguous somehow.
>
> ie iommufd should be able to assert that dirty tracking is desired
> and an existing non-dirty tracking capable domain will not be
> automatically re-used.
>
> We don't really have the right infrastructure to do this currently.
>
>> From this angle IMHO it's more reasonable to report this IOMMU
>> property to userspace via a device capability. If all devices
>> attached to a hwpt claim IOMMU dirty tracking capability, the user
>> can call set_dirty_tracking() on the hwpt object.
>
> Inherent domain properties need to be immutable or, at least one-way,
> like enforced coherent, or it just all stops making any kind of
> sense.
>
>> Once dirty tracking is enabled on a hwpt, further attaching a device
>> which doesn't claim this capability is simply rejected.
>
> It would be OK to do as enforced coherent does and flip a domain
> permanently into dirty-tracking enabled, or specify a flag at domain
> creation time.

Either way I think a device capability is useful for the user to
decide between the one-way flip and specifying a flag at domain
creation.

Thanks
Kevin
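For illustration, a sketch of the one-way flip being discussed,
patterned after how enforce_cache_coherency() is implemented in the
Intel driver elsewhere in this digest; every name here (the op, the
capability bit, the locking) is an assumption about a possible design,
not existing code:

/* Sketch: dirty tracking can be turned on once, never turned back off. */
static bool example_enforce_dirty_tracking(struct iommu_domain *domain)
{
	struct example_domain *d = to_example_domain(domain);
	struct example_device_info *info;

	mutex_lock(&d->devices_lock);
	list_for_each_entry(info, &d->devices, link) {
		/* Hypothetical per-device IOMMU dirty-tracking capability. */
		if (!info->supports_dirty_log) {
			mutex_unlock(&d->devices_lock);
			return false;
		}
	}
	d->dirty_tracking = true;	/* one-way: later attaches must support it */
	mutex_unlock(&d->devices_lock);

	return true;
}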
RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
> From: Jason Gunthorpe
> Sent: Thursday, May 5, 2022 9:55 PM
>
> On Thu, May 05, 2022 at 11:03:18AM +0000, Tian, Kevin wrote:
>
>> iiuc the purpose of 'write-protection' here is to capture in-flight
>> dirty pages in the said race window until unmap and iotlb
>> invalidation are completed.
>
> No, the purpose is to perform "unmap" without destroying the dirty
> bit in the process.
>
> If an IOMMU architecture has a way to render the page unmapped and
> flush back the dirty bit without destroying it, then it doesn't
> require a write protect pass.

Yes, I see the point now. As you said, let's consider it only when
there is a real use case requiring such atomicity.
RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
> From: Joao Martins
> Sent: Thursday, May 5, 2022 7:51 PM
>
> On 5/5/22 12:03, Tian, Kevin wrote:
>>> From: Joao Martins
>>> Sent: Thursday, May 5, 2022 6:07 PM
>>>
>>> On 5/5/22 08:42, Tian, Kevin wrote:
>>>>> From: Jason Gunthorpe
>>>>> Sent: Tuesday, May 3, 2022 2:53 AM
>>>>>
>>>>> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote:
>>>>>> On Fri, 29 Apr 2022 05:45:20 +0000 "Tian, Kevin" wrote:
>>>>>>>> From: Joao Martins
>>>>>>>> 3) Unmapping an IOVA range while returning its dirty bit prior
>>>>>>>> to unmap. This case is specific to the non-nested vIOMMU case
>>>>>>>> where an erroneous guest (or device) is DMAing to an address
>>>>>>>> being unmapped at the same time.
>>>>>>>
>>>>>>> an erroneous attempt like above cannot anticipate which DMAs
>>>>>>> can succeed in that window thus the end behavior is undefined.
>>>>>>> For an undefined behavior nothing will be broken by losing some
>>>>>>> bits dirtied in the window between reading back dirty bits of
>>>>>>> the range and actually calling unmap. From guest p.o.v. all
>>>>>>> those are black-box hardware logic to serve a virtual iotlb
>>>>>>> invalidation request which just cannot be completed in one
>>>>>>> cycle.
>>>>>>>
>>>>>>> Hence in reality probably this is not required except to meet
>>>>>>> the vfio compat requirement. Just in concept returning dirty
>>>>>>> bits at unmap is more accurate.
>>>>>>>
>>>>>>> I'm slightly inclined to abandon it in the iommufd uAPI.
>>>>>>
>>>>>> Sorry, I'm not following why an unmap with returned dirty bitmap
>>>>>> operation is specific to a vIOMMU case, or in fact indicative of
>>>>>> some sort of erroneous, racy behavior of guest or device.
>>>>>
>>>>> It is being compared against the alternative, which is to
>>>>> explicitly query dirty then do a normal unmap as two system calls
>>>>> and permit a race.
>>>>>
>>>>> The only case with any difference is if the guest is racing DMA
>>>>> with the unmap - in which case it is already indeterminate for
>>>>> the guest if the DMA will be completed or not.
>>>>>
>>>>> eg on the vIOMMU case if the guest races DMA with unmap then we
>>>>> are already fine with throwing away that DMA because that is how
>>>>> the race resolves during non-migration situations, so resolving
>>>>> it as throwing away the DMA during migration is OK too.
>>>>>
>>>>>> We need the flexibility to support memory hot-unplug operations
>>>>>> during migration,
>>>>>
>>>>> I would have thought that hot-unplug during migration would
>>>>> simply discard all the data - how does it use the dirty bitmap?
>>>>>
>>>>>> This was implemented as a single operation specifically to avoid
>>>>>> races where ongoing access may be available after retrieving a
>>>>>> snapshot of the bitmap. Thanks,
>>>>>
>>>>> The issue is the cost.
>>>>>
>>>>> On a real iommu eliminating the race is expensive as we have to
>>>>> write protect the pages before querying dirty, which seems to be
>>>>> an extra IOTLB flush.
>>>>>
>>>>> It is not clear if paying this cost to become atomic is actually
>>>>> something any use case needs.
>>>>>
>>>>> So, I suggest we think about a 3rd op 'write protect and clear
>>>>> dirties' that will be followed by a normal unmap - the extra op
>>>>> will have the extra overhead and userspace can decide if it wants
>>>>> to pay or not vs the non-atomic read dirties operation. And lets
>>>>> have a use case where this must be atomic before we implement
>>>>> it..
>>>>
>>>> and write-protection also relies on the support of I/O page
>>>> fault...
>>>
>>> /I think/ all IOMMUs in this series already support
>>> permission/unrecoverable I/O page faults for a long time IIUC.
>>>
>>> The earlier suggestion was just to discard the I/O page fault after
>>> write-protection happens. fwiw, some IOMMUs also support
>>> suppressing the event notification (like AMD).
>>
>> iiuc the purpose of 'write-protection' here is to capture in-flight
>> dirty pages in the said race window until unmap and iotlb
>> invalidation are completed.
>
> But then we depend on PRS being there on the device, because without
> it, DMA is aborted on the target on a read-only IOVA prior to the
> page fault, thus the page is not going to be dirty anyways.
>
>> *unrecoverable* faults are not expected to be used in a feature path
>> as occurrence of such faults may lead to severe reaction in iommu
>> drivers e.g. completely block DMA from the device causing such
>> faults.
>
> Unless I totally misunderstood ... the latter is actually what we
> were suggesting here /in the context of unmapping a GIOVA/ (*).
>
> The wrprotect() was there to ensure we get an atomic dirty state of
> the IOVA range afterwards, by blocking DMA (as opposed to sort of
> mediating DMA). The I/O page fault is not supposed to happen unless
> there's rogue DMA AIUI.

You are right. It's me misunderstanding the proposal here.

> TBH, the same could be said for normal DMA
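To summarize the alternatives discussed, a sketch of the
userspace-visible sequences; the operation names are placeholders for
whatever iommufd would expose, not existing uAPI:

/*
 * Non-atomic pair: a DMA write landing between the two calls can dirty
 * a page that the harvested bitmap missed.
 */
ioas_read_dirty_bitmap(ioas, iova, len, bitmap);	/* placeholder op */
ioas_unmap(ioas, iova, len);

/*
 * Proposed atomic variant: write-protect first so nothing can dirty
 * the range after the bitmap is harvested, then do a normal unmap.
 */
ioas_wrprotect_and_read_dirty(ioas, iova, len, bitmap);	/* placeholder op */
ioas_unmap(ioas, iova, len);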
Re: [PATCH v2] iommu/sva: Fix PASID use-after-free issue
On 2022/4/29 9:39 AM, Zhangfei Gao wrote:
> On 2022/4/29 2:00 AM, Fenghua Yu wrote:
>> The PASID is being freed too early. It needs to stay around until
>> after device drivers that might be using it have had a chance to
>> clear it out of the hardware.
>>
>> As a reminder:
>>
>> mmget()/mmput()   refcount the mm's address space
>> mmgrab()/mmdrop() refcount the mm itself
>>
>> The PASID is currently tied to the life of the mm's address space
>> and freed in __mmput(). This makes logical sense because the PASID
>> can't be used once the address space is gone.
>>
>> But, this misses an important point: even after the address space is
>> gone, the PASID will still be programmed into a device. Device
>> drivers might, for instance, still need to flush operations that are
>> outstanding and need to use that PASID. They do this at
>> file->release() time.
>>
>> Device drivers call the IOMMU driver to hold a reference on the mm
>> itself and drop it at file->release() time. But, the IOMMU driver
>> holds a reference on the mm itself, not the address space. The
>> address space (and the PASID) is long gone by the time the driver
>> tries to clean up. This is effectively a use-after-free bug on the
>> PASID.
>>
>> To fix this, move the PASID free operation from __mmput() to
>> __mmdrop(). This ensures that the IOMMU driver's existing mmgrab()
>> keeps the PASID allocated until it drops its mm reference.
>>
>> Fixes: 701fac40384f ("iommu/sva: Assign a PASID to mm on PASID
>> allocation and free it on mm exit")
>>
>> Reported-by: Zhangfei Gao
>> Tested-by: Zhangfei Gao
>> Tested-by: Zhangfei Gao
>
> Use the formal email, thanks
>
>> Suggested-by: Jean-Philippe Brucker
>> Suggested-by: Jacob Pan
>> Reviewed-by: Jean-Philippe Brucker
>> Signed-off-by: Fenghua Yu

Hi, will this be taken for 5.18?

Thanks

>> ---
>> v2:
>> - Dave Hansen rewrites the change log.
>> - Add Tested-by: Zhangfei Gao
>> - Add Reviewed-by: Jean-Philippe Brucker
>>
>> The original patch was posted and discussed in:
>> https://lore.kernel.org/lkml/ymdzffx7fn586...@fyu1.sc.intel.com/
>>
>>  kernel/fork.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/kernel/fork.c b/kernel/fork.c
>> index 9796897560ab..35a3beff140b 100644
>> --- a/kernel/fork.c
>> +++ b/kernel/fork.c
>> @@ -792,6 +792,7 @@ void __mmdrop(struct mm_struct *mm)
>>  	mmu_notifier_subscriptions_destroy(mm);
>>  	check_mm(mm);
>>  	put_user_ns(mm->user_ns);
>> +	mm_pasid_drop(mm);
>>  	free_mm(mm);
>>  }
>>  EXPORT_SYMBOL_GPL(__mmdrop);
>> @@ -1190,7 +1191,6 @@ static inline void __mmput(struct mm_struct *mm)
>>  	}
>>  	if (mm->binfmt)
>>  		module_put(mm->binfmt->module);
>> -	mm_pasid_drop(mm);
>>  	mmdrop(mm);
>>  }
[PATCH v6 29/29] x86/tsc: Switch to perf-based hardlockup detector if TSC become unstable
The HPET-based hardlockup detector relies on the TSC to determine
whether an observed NMI interrupt originated from the HPET timer.
Hence, this detector can no longer be used when the TSC is unstable. In
such a case, permanently stop the HPET-based hardlockup detector and
start the perf-based detector.

Cc: Andi Kleen
Cc: Stephane Eranian
Cc: "Ravi V. Shankar"
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Suggested-by: Thomas Gleixner
Reviewed-by: Tony Luck
Signed-off-by: Ricardo Neri
---
Changes since v5:
 * Relocated the declaration of hardlockup_detector_switch_to_perf() to
   x86/nmi.h. It does not depend on HPET.
 * Removed function stub. The shim hardlockup detector is always built
   for x86.

Changes since v4:
 * Added a stub version of hardlockup_detector_switch_to_perf() for
   !CONFIG_HPET_TIMER. (lkp)
 * Reconfigure the whole lockup detector instead of unconditionally
   starting the perf-based hardlockup detector.

Changes since v3:
 * None

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 arch/x86/include/asm/nmi.h     | 6 ++++++
 arch/x86/kernel/tsc.c          | 2 ++
 arch/x86/kernel/watchdog_hld.c | 6 ++++++
 3 files changed, 14 insertions(+)

diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h
index 4a0d5b562c91..47752ff67d8b 100644
--- a/arch/x86/include/asm/nmi.h
+++ b/arch/x86/include/asm/nmi.h
@@ -63,4 +63,10 @@ void stop_nmi(void);
 void restart_nmi(void);
 void local_touch_nmi(void);
 
+#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR
+void hardlockup_detector_switch_to_perf(void);
+#else
+static inline void hardlockup_detector_switch_to_perf(void) { }
+#endif
+
 #endif /* _ASM_X86_NMI_H */
diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index cc1843044d88..74772ffc79d1 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1176,6 +1176,8 @@ void mark_tsc_unstable(char *reason)
 
 	clocksource_mark_unstable(&clocksource_tsc_early);
 	clocksource_mark_unstable(&clocksource_tsc);
+
+	hardlockup_detector_switch_to_perf();
 }
 EXPORT_SYMBOL_GPL(mark_tsc_unstable);
diff --git a/arch/x86/kernel/watchdog_hld.c b/arch/x86/kernel/watchdog_hld.c
index ef11f0af4ef5..7940977c6312 100644
--- a/arch/x86/kernel/watchdog_hld.c
+++ b/arch/x86/kernel/watchdog_hld.c
@@ -83,3 +83,9 @@ void watchdog_nmi_start(void)
 	if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
 		hardlockup_detector_hpet_start();
 }
+
+void hardlockup_detector_switch_to_perf(void)
+{
+	detector_type = X86_HARDLOCKUP_DETECTOR_PERF;
+	lockup_detector_reconfigure();
+}
-- 
2.17.1
[PATCH v6 26/29] x86/watchdog: Add a shim hardlockup detector
The generic hardlockup detector is based on perf. It also provides a
set of weak functions that CPU architectures can override. Add a shim
hardlockup detector for x86 that overrides such functions and can
select between the perf and HPET implementations of the detector.

For clarity, add the intermediate Kconfig symbol
X86_HARDLOCKUP_DETECTOR that is selected whenever the core of the
hardlockup detector is selected.

Cc: Andi Kleen
Cc: Stephane Eranian
Cc: "Ravi V. Shankar"
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Suggested-by: Nicholas Piggin
Reviewed-by: Tony Luck
Signed-off-by: Ricardo Neri
---
Changes since v5:
 * Added watchdog_nmi_start() to be used when tsc_khz is recalibrated.
 * Always build the x86-specific hardlockup detector shim; not only
   when the HPET-based detector is selected.
 * Corrected a typo in comment in watchdog_nmi_probe(). (Ani)
 * Removed useless local ret variable in watchdog_nmi_enable(). (Ani)

Changes since v4:
 * Use a switch to enable and disable the various available detectors.
   (Andi)

Changes since v3:
 * Fixed style in multi-line comment. (Randy Dunlap)

Changes since v2:
 * Pass cpu number as argument to hardlockup_detector_[enable|disable].
   (Thomas Gleixner)

Changes since v1:
 * Introduced this patch: Added an x86-specific shim hardlockup
   detector. (Nicholas Piggin)
---
 arch/x86/Kconfig.debug         |  3 ++
 arch/x86/kernel/Makefile       |  2 +
 arch/x86/kernel/watchdog_hld.c | 85 ++++++++++++++++++++++++++++++++++
 3 files changed, 90 insertions(+)
 create mode 100644 arch/x86/kernel/watchdog_hld.c

diff --git a/arch/x86/Kconfig.debug b/arch/x86/Kconfig.debug
index bc34239589db..599001157847 100644
--- a/arch/x86/Kconfig.debug
+++ b/arch/x86/Kconfig.debug
@@ -6,6 +6,9 @@ config TRACE_IRQFLAGS_NMI_SUPPORT
 config EARLY_PRINTK_USB
 	bool
 
+config X86_HARDLOCKUP_DETECTOR
+	def_bool y if HARDLOCKUP_DETECTOR_CORE
+
 config X86_VERBOSE_BOOTUP
 	bool "Enable verbose x86 bootup info messages"
 	default y
diff --git a/arch/x86/kernel/Makefile b/arch/x86/kernel/Makefile
index c700b00a2d86..af3d54e4c836 100644
--- a/arch/x86/kernel/Makefile
+++ b/arch/x86/kernel/Makefile
@@ -114,6 +114,8 @@ obj-$(CONFIG_KGDB)		+= kgdb.o
 obj-$(CONFIG_VM86)		+= vm86_32.o
 obj-$(CONFIG_EARLY_PRINTK)	+= early_printk.o
 
+obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR) += watchdog_hld.o
+
 obj-$(CONFIG_HPET_TIMER)	+= hpet.o
 obj-$(CONFIG_X86_HARDLOCKUP_DETECTOR_HPET) += watchdog_hld_hpet.o
diff --git a/arch/x86/kernel/watchdog_hld.c b/arch/x86/kernel/watchdog_hld.c
new file mode 100644
index ..ef11f0af4ef5
--- /dev/null
+++ b/arch/x86/kernel/watchdog_hld.c
@@ -0,0 +1,85 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * A shim hardlockup detector. It overrides the weak stubs of the
+ * generic implementation to select between the perf- or the hpet-based
+ * implementation.
+ *
+ * Copyright (C) Intel Corporation 2022
+ */
+
+#include <linux/nmi.h>
+#include <asm/hpet.h>
+
+enum x86_hardlockup_detector {
+	X86_HARDLOCKUP_DETECTOR_PERF,
+	X86_HARDLOCKUP_DETECTOR_HPET,
+};
+
+static enum x86_hardlockup_detector detector_type __read_mostly;
+
+int watchdog_nmi_enable(unsigned int cpu)
+{
+	switch (detector_type) {
+	case X86_HARDLOCKUP_DETECTOR_PERF:
+		hardlockup_detector_perf_enable();
+		break;
+	case X86_HARDLOCKUP_DETECTOR_HPET:
+		hardlockup_detector_hpet_enable(cpu);
+		break;
+	default:
+		return -ENODEV;
+	}
+
+	return 0;
+}
+
+void watchdog_nmi_disable(unsigned int cpu)
+{
+	switch (detector_type) {
+	case X86_HARDLOCKUP_DETECTOR_PERF:
+		hardlockup_detector_perf_disable();
+		break;
+	case X86_HARDLOCKUP_DETECTOR_HPET:
+		hardlockup_detector_hpet_disable(cpu);
+		break;
+	}
+}
+
+int __init watchdog_nmi_probe(void)
+{
+	int ret;
+
+	/*
+	 * Try first with the HPET hardlockup detector. It will only
+	 * succeed if selected at build time and requested in the
+	 * nmi_watchdog command-line parameter. This ensures that the
+	 * perf-based detector is used by default, if selected at
+	 * build time.
+	 */
+	ret = hardlockup_detector_hpet_init();
+	if (!ret) {
+		detector_type = X86_HARDLOCKUP_DETECTOR_HPET;
+		return ret;
+	}
+
+	ret = hardlockup_detector_perf_init();
+	if (!ret) {
+		detector_type = X86_HARDLOCKUP_DETECTOR_PERF;
+		return ret;
+	}
+
+	return 0;
+}
+
+void watchdog_nmi_stop(void)
+{
+	/* Only the HPET lockup detector defines a stop function. */
+	if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
+		hardlockup_detector_hpet_stop();
+}
+
+void watchdog_nmi_start(void)
+{
+	/* Only the HPET lockup detector defines a start function. */
+	if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET)
+		hardlockup_detector_hpet_start();
+}
-- 
2.17.1
[PATCH v6 23/29] x86/watchdog/hardlockup/hpet: Determine if HPET timer caused NMI
It is not possible to determine the source of a non-maskable interrupt
(NMI) in x86. When dealing with an HPET channel, the only direct method
to determine whether it caused an NMI would be to read the Interrupt
Status register. However, reading HPET registers is slow and,
therefore, not to be done while in NMI context. Furthermore, the status
is not available if the HPET channel is programmed to deliver an MSI
interrupt.

An indirect manner to infer if an incoming NMI was caused by the HPET
channel of the detector is to use the time-stamp counter (TSC). Compute
the value that the TSC is expected to have at the next interrupt of the
HPET channel and compare it with the value it has when the interrupt
does happen. If the actual value falls within a small error window,
assume that the HPET channel of the detector is the source of the NMI.

Let tsc_delta be the difference between the value the TSC has now and
the value it will have when the next HPET channel interrupt happens.
Define the error window as a percentage of tsc_delta.

Below is a table that characterizes the error in the expected TSC value
when the HPET channel fires on a variety of systems. It presents the
error as a percentage of tsc_delta and in microseconds. The table
summarizes the error of 4096 interrupts of the HPET channel collected
after the system has been up for 5 minutes, as well as since boot.

The maximum observed error on any system is 0.045%. When the error
since boot is considered, the maximum observed error is 0.198%.

To find the most common error value, the collected data is grouped into
buckets of 0.01 percentage points of the error and 10ns, respectively.
The most common error on any system is 0.01317%.

Allow a maximum error that is twice as big as the maximum error
observed in these experiments: 0.4%.

                      watchdog_thresh       1s                10s                60s
Error wrt expected
TSC value                             %       us         %       us        %       us

AMD EPYC 7742 64-Core Processor
Abs max since boot              0.04517   451.74   0.00171   171.04   0.00034   201.89
Abs max                         0.04517   451.74   0.00171   171.04   0.00034   201.89
Mode                            0.00002     0.18   0.00002     2.07  -0.00003   -19.20

Intel(R) Xeon(R) CPU E7-8890 - INTEL_FAM6_HASWELL_X
Abs max since boot              0.00811    81.15   0.00462   462.40   0.00014    81.65
Abs max                         0.00811    81.15   0.00084    84.31   0.00014    81.65
Mode                           -0.00422   -42.16  -0.00043   -42.50  -0.00007   -40.40

Intel(R) Xeon(R) Platinum 8170M - INTEL_FAM6_SKYLAKE_X
Abs max since boot              0.10530  1053.04   0.01324  1324.27   0.00407  2443.25
Abs max                         0.01166   116.59   0.00114   114.11   0.00024   143.47
Mode                           -0.01023  -102.32  -0.00103  -102.44  -0.00022  -132.38

Intel(R) Xeon(R) CPU E5-2699A v4 - INTEL_FAM6_BROADWELL_X
Abs max since boot              0.00010    99.34   0.00099    98.83   0.00016    97.50
Abs max                         0.00010    99.34   0.00099    98.83   0.00016    97.50
Mode                           -0.00007   -74.29  -0.00074   -73.99  -0.00012   -73.12

Intel(R) Xeon(R) Gold 5318H - INTEL_FAM6_COOPERLAKE_X
Abs max since boot              0.11262  1126.17   0.01109  1109.17   0.00409  2455.73
Abs max                         0.01073   107.31   0.00109   109.02   0.00019   115.34
Mode                           -0.00953   -95.26  -0.00094   -93.63  -0.00015   -90.42

Intel(R) Xeon(R) Platinum 8360Y - INTEL_FAM6_ICELAKE_X
Abs max since boot              0.19853  1985.30   0.00784   783.53  -0.00017  -104.77
Abs max                         0.01550   155.02   0.00158   157.56   0.00020   117.74
Mode                           -0.01317  -131.65  -0.00136  -136.42  -0.00018  -105.06

Cc: Andi Kleen
Cc: Stephane Eranian
Cc: "Ravi V. Shankar"
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Suggested-by: Andi Kleen
Reviewed-by: Tony Luck
Signed-off-by: Ricardo Neri
---
NOTE: The error characterization data is repeated here from the cover
letter.
---
Changes since v5:
 * Reworked is_hpet_hld_interrupt() to reduce indentation.
 * Use time_in_range64() to compare the actual TSC value vs the
   expected value. This makes it more readable. (Tony)
 * Reduced the error window of the expected TSC value at the time of
   the HPET channel expiration.
 * Described better the heuristics used to determine if the HPET
   channel caused the NMI. (Tony)
 * Added a table to characterize the error in the expected TSC value
   when the HPET channel fires.
 * Removed references to groups of monitored CPUs. Instead, use
   tsc_khz directly.

Changes since v4:
 * Compute the TSC expected value at the next HPET interrupt based on
   the number of monitored packages and not the number of monitored
   CPUs.

Changes since v3:
 * None

Changes since v2:
 * Reworked condition to check if the expected TSC value is within the
   error margin to avoid an unnecessary conditional. (Peter Zijlstra)
 * Removed TSC error margin from struct hld_data; use a global
   variable instead. (Peter Zijlstra)

Changes since v1:
 * Introduced this patch.
---
 arch/x86/include/asm/hpet.h | 3 ++
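A condensed sketch of the heuristic described above, assuming the
detector precomputes the TSC value expected at the next HPET interrupt;
the variable and function names are illustrative, and the real
is_hpet_hld_interrupt() may differ in detail:

/* Sketch: did this NMI plausibly come from the detector's HPET channel? */
static bool hpet_hld_caused_nmi(u64 tsc_next, u64 tsc_delta)
{
	u64 now = rdtsc();
	/* Error window: 0.4% of the TSC ticks between two HPET NMIs. */
	u64 error = tsc_delta * 4 / 1000;

	/* Within [tsc_next - error, tsc_next + error]? Assume the HPET. */
	return time_in_range64(now, tsc_next - error, tsc_next + error);
}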
[PATCH v6 27/29] watchdog: Expose lockup_detector_reconfigure()
When there are multiple implementations of the NMI watchdog, there may
be situations in which switching from one to another is needed. If the
time-stamp counter becomes unstable, the HPET-based NMI watchdog can no
longer be used. Similarly, the HPET-based NMI watchdog relies on
tsc_khz and needs to be informed when it is refined.

Reloading the NMI watchdog or switching to another hardlockup detector
can be done cleanly by updating the arch-specific stub and then
reconfiguring the whole lockup detector. Expose
lockup_detector_reconfigure() to achieve this goal.

Cc: Andi Kleen
Cc: Nicholas Piggin
Cc: Stephane Eranian
Cc: "Ravi V. Shankar"
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck
Signed-off-by: Ricardo Neri
---
Changes since v5:
 * None

Changes since v4:
 * Switching to the perf-based lockup detector under the hood is hacky.
   Instead, reconfigure the whole lockup detector.

Changes since v3:
 * None

Changes since v2:
 * Introduced this patch.

Changes since v1:
 * N/A
---
 include/linux/nmi.h | 2 ++
 kernel/watchdog.c   | 4 ++--
 2 files changed, 4 insertions(+), 2 deletions(-)

diff --git a/include/linux/nmi.h b/include/linux/nmi.h
index cf12380e51b3..73827a477288 100644
--- a/include/linux/nmi.h
+++ b/include/linux/nmi.h
@@ -16,6 +16,7 @@ void lockup_detector_init(void);
 void lockup_detector_soft_poweroff(void);
 void lockup_detector_cleanup(void);
 bool is_hardlockup(void);
+void lockup_detector_reconfigure(void);
 
 extern int watchdog_user_enabled;
 extern int nmi_watchdog_user_enabled;
@@ -37,6 +38,7 @@ extern int sysctl_hardlockup_all_cpu_backtrace;
 static inline void lockup_detector_init(void) { }
 static inline void lockup_detector_soft_poweroff(void) { }
 static inline void lockup_detector_cleanup(void) { }
+static inline void lockup_detector_reconfigure(void) { }
 #endif /* !CONFIG_LOCKUP_DETECTOR */
 
 #ifdef CONFIG_SOFTLOCKUP_DETECTOR
diff --git a/kernel/watchdog.c b/kernel/watchdog.c
index 6443841a755f..e5b67544f8c8 100644
--- a/kernel/watchdog.c
+++ b/kernel/watchdog.c
@@ -537,7 +537,7 @@ int lockup_detector_offline_cpu(unsigned int cpu)
 	return 0;
 }
 
-static void lockup_detector_reconfigure(void)
+void lockup_detector_reconfigure(void)
 {
 	cpus_read_lock();
 	watchdog_nmi_stop();
@@ -579,7 +579,7 @@ static __init void lockup_detector_setup(void)
 }
 
 #else /* CONFIG_SOFTLOCKUP_DETECTOR */
-static void lockup_detector_reconfigure(void)
+void lockup_detector_reconfigure(void)
 {
 	cpus_read_lock();
 	watchdog_nmi_stop();
-- 
2.17.1
[PATCH v6 28/29] x86/tsc: Restart NMI watchdog after refining tsc_khz
The HPET hardlockup detector relies on tsc_khz to estimate the value
that the TSC will have when its HPET channel fires. A refined tsc_khz
helps to better estimate the expected TSC value; using the early value
of tsc_khz may lead to a large error in the expected TSC value.

Restarting the NMI watchdog detector has the effect of kicking its HPET
channel and making use of the refined tsc_khz. When the HPET hardlockup
detector is not in use, restarting the NMI watchdog is a noop.

Cc: Andi Kleen
Cc: Stephane Eranian
Cc: "Ravi V. Shankar"
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Signed-off-by: Ricardo Neri
---
Changes since v5:
 * Introduced this patch.

Changes since v4:
 * N/A

Changes since v3:
 * N/A

Changes since v2:
 * N/A

Changes since v1:
 * N/A
---
 arch/x86/kernel/tsc.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c
index cafacb2e58cc..cc1843044d88 100644
--- a/arch/x86/kernel/tsc.c
+++ b/arch/x86/kernel/tsc.c
@@ -1386,6 +1386,12 @@ static void tsc_refine_calibration_work(struct work_struct *work)
 	/* Inform the TSC deadline clockevent devices about the recalibration */
 	lapic_update_tsc_freq();
 
+	/*
+	 * If in use, the HPET hardlockup detector relies on tsc_khz.
+	 * Reconfigure it to make use of the refined tsc_khz.
+	 */
+	lockup_detector_reconfigure();
+
 	/* Update the sched_clock() rate to match the clocksource one */
 	for_each_possible_cpu(cpu)
 		set_cyc2ns_scale(tsc_khz, cpu, tsc_stop);
-- 
2.17.1
[PATCH v6 25/29] watchdog/hardlockup/hpet: Only enable the HPET watchdog via a boot parameter
Keep the HPET-based hardlockup detector disabled unless explicitly
enabled via a command-line argument. If such an argument is not given,
the initialization of the HPET-based hardlockup detector fails and the
NMI watchdog falls back to the perf-based implementation.

Implement the command-line parsing using an early_param, as
__setup("nmi_watchdog=") only parses generic options.

Cc: Andi Kleen
Cc: Stephane Eranian
Cc: "Ravi V. Shankar"
Cc: iommu@lists.linux-foundation.org
Cc: linuxppc-...@lists.ozlabs.org
Cc: x...@kernel.org
Reviewed-by: Tony Luck
Signed-off-by: Ricardo Neri
---
Changes since v5:
 * None

Changes since v4:
 * None

Changes since v3:
 * None

Changes since v2:
 * Do not imply that using nmi_watchdog=hpet means the detector is
   enabled. Instead, print a warning in such case.

Changes since v1:
 * Added documentation to the function handling the nmi_watchdog
   kernel command-line argument.
---
 .../admin-guide/kernel-parameters.txt |  8 ++++++-
 arch/x86/kernel/watchdog_hld_hpet.c   | 22 +++++++++++++++++++
 2 files changed, 29 insertions(+), 1 deletion(-)

diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt
index 269be339d738..89eae950fdb8 100644
--- a/Documentation/admin-guide/kernel-parameters.txt
+++ b/Documentation/admin-guide/kernel-parameters.txt
@@ -3370,7 +3370,7 @@
 			Format: [state][,regs][,debounce][,die]
 
 	nmi_watchdog=	[KNL,BUGS=X86] Debugging features for SMP kernels
-			Format: [panic,][nopanic,][num]
+			Format: [panic,][nopanic,][num,][hpet]
 			Valid num: 0 or 1
 			0 - turn hardlockup detector in nmi_watchdog off
 			1 - turn hardlockup detector in nmi_watchdog on
@@ -3381,6 +3381,12 @@
 			please see 'nowatchdog'.
 			This is useful when you use a panic=... timeout and
 			need the box quickly up again.
+			When hpet is specified, the NMI watchdog will be driven
+			by an HPET timer, if available in the system. Otherwise,
+			it falls back to the default implementation (perf or
+			architecture-specific). Specifying hpet has no effect
+			if the NMI watchdog is not enabled (either at build time
+			or via the command line).
 
 			These settings can be accessed at runtime via the
 			nmi_watchdog and hardlockup_panic sysctls.
diff --git a/arch/x86/kernel/watchdog_hld_hpet.c b/arch/x86/kernel/watchdog_hld_hpet.c
index 3effdbf29095..4413d5fb94f4 100644
--- a/arch/x86/kernel/watchdog_hld_hpet.c
+++ b/arch/x86/kernel/watchdog_hld_hpet.c
@@ -379,6 +379,28 @@ void hardlockup_detector_hpet_start(void)
 	enable_timer(hld_data);
 }
 
+/**
+ * hardlockup_detector_hpet_setup() - Parse command-line parameters
+ * @str: A string containing the kernel command line
+ *
+ * Parse the nmi_watchdog parameter from the kernel command line. If
+ * selected by the user, use this implementation to detect hardlockups.
+ */
+static int __init hardlockup_detector_hpet_setup(char *str)
+{
+	if (!str)
+		return -EINVAL;
+
+	if (parse_option_str(str, "hpet"))
+		hardlockup_use_hpet = true;
+
+	if (!nmi_watchdog_user_enabled && hardlockup_use_hpet)
+		pr_err("Selecting HPET NMI watchdog has no effect with NMI watchdog disabled\n");
+
+	return 0;
+}
+early_param("nmi_watchdog", hardlockup_detector_hpet_setup);
+
 /**
  * hardlockup_detector_hpet_init() - Initialize the hardlockup detector
  *
-- 
2.17.1
[PATCH v6 22/29] x86/watchdog/hardlockup: Add an HPET-based hardlockup detector
Implement a hardlockup detector that uses an HPET channel as the source of the non-maskable interrupt. Implement the basic functionality to start, stop, and configure the timer. Designate as the handling CPU one of the CPUs that the detector monitors. Use it to service the NMI from the HPET channel. When servicing the HPET NMI, issue an inter-processor interrupt to the rest of the monitored CPUs. Only enable the detector if IPI shorthands are enabled in the system. During operation, the HPET registers are only accessed to kick the timer. This operation can be avoided if a periodic HPET channel is added to the detector. To configure the HPET channel interrupt, the detector relies on the interrupt subsystem to configure the deliver mode as NMI (as requested in hpet_hld_get_timer()) throughout the IRQ hierarchy. This covers systems with and without interrupt remapping enabled. The detector is not functional at this stage. A subsequent changeset will invoke the interfaces implemented in this changeset go start, stop, and reconfigure the detector. Another subsequent changeset implements logic to determine if the HPET timer caused the NMI. For now, implement a stub function. Cc: Andi Kleen Cc: Stephane Eranian Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- Changes since v5: * Squashed a previously separate patch to support interrupt remapping into this patch. There is no need to handle interrupt remapping separately. All the necessary plumbing is done in the interrupt subsytem. Now it uses request_irq(). * Use IPI shorthands to send an NMI to the CPUs being monitored. (Thomas) * Added extra check to only use the HPET hardlockup detector if the IPI shorthands are enabled. (Thomas) * Relocated flushing of outstanding interrupts from enable_timer() to disable_timer(). On some systems, making any change in the configuration of the HPET channel causes it to issue an interrupt. * Added a new cpumask to function as a per-cpu test bit to determine if a CPU should check for hardlockups. * Dropped pointless X86_64 || X86_32 check in Kconfig. (Tony) * Dropped pointless dependency on CONFIG_HPET. * Added dependency on CONFIG_GENERIC_MSI_IRQ, needed to build the [|IR]- HPET-MSI irq_chip. * Added hardlockup_detector_hpet_start() to be used when tsc_khz is recalibrated. * Reworked the periodic setting the HPET channel. Rather than changing it every time the channel is disabled or enabled, do it only once. While at here, wrap the code in an initial setup function. * Implemented hardlockup_detector_hpet_start() to be called when tsc_khz is refined. * Enhanced inline comments for clarity. * Added missing #include files. * Relocated function declarations to not depend on CONFIG_HPET_TIMER. Changes since v4: * Dropped hpet_hld_data.enabled_cpus and instead use cpumask_weight(). * Renamed hpet_hld_data.cpu_monitored_mask to hld_data_data.cpu_monitored_mask and converted it to cpumask_var_t. * Flushed out any outstanding interrupt before enabling the HPET channel. * Removed unnecessary MSI_DATA_LEVEL_ASSERT from the MSI message. * Added comments in hardlockup_detector_nmi_handler() to explain how CPUs are targeted for an IPI. * Updated code to only issue an IPI when needed (i.e., there are monitored CPUs to be inspected via an IPI). * Reworked hardlockup_detector_hpet_init() for readability. * Now reserve the cpumasks in the hardlockup detector code and not in the generic HPET code. 
* Handled the case of watchdog_thresh = 0 when disabling the detector. * Made this detector available to i386. * Reworked logic to kick the timer to remove a local variable. (Andi) * Added a comment on what type of timer channel will be assigned to the detector. (Andi) * Reworded prompt comment in Kconfig. (Andi) * Removed unneeded switch to level interrupt mode when disabling the timer. (Andi) * Disabled the HPET timer to avoid a race between an incoming interrupt and an update of the MSI destination ID. (Ashok) * Corrected a typo in an inline comment. (Tony) * Made the HPET hardlockup detector depend on HARDLOCKUP_DETECTOR instead of selecting it. Changes since v3: * Fixed typo in Kconfig.debug. (Randy Dunlap) * Added missing slab.h to include the definition of kfree to fix a build break. Changes since v2: * Removed use of struct cpumask in favor of a variable length array in conjunction with kzalloc. (Peter Zijlstra) * Removed redundant documentation of functions. (Thomas Gleixner) * Added CPU as argument to hardlockup_detector_hpet_enable()/disable(). (Thomas Gleixner). Changes since v1: * Do not target CPUs in a round-robin manner. Instead, the HPET timer always targets the same CPU; other CPUs are monitored via an interprocessor interrupt. * Dropped support for IO APIC interrupts and instead use only MSI interrupts. * Removed
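To illustrate the flow described above, here is a simplified sketch of the NMI handler. This is illustration only, not the patch code: hld_data, is_hpet_hld_interrupt(), kick_timer(), and monitored_cpumask are placeholders for the detector's actual state and helpers.

#include <linux/cpumask.h>
#include <linux/nmi.h>
#include <linux/smp.h>
#include <asm/apic.h>

/* Placeholders for the detector's state and helpers (not a real API). */
extern struct hpet_hld_data *hld_data;
extern bool is_hpet_hld_interrupt(struct hpet_hld_data *hdata);
extern void kick_timer(struct hpet_hld_data *hdata);

static int hardlockup_detector_nmi_handler(unsigned int type,
					   struct pt_regs *regs)
{
	/*
	 * Only the handling CPU receives the HPET NMI. When the channel
	 * fired, kick the timer and ask the remaining monitored CPUs to
	 * check themselves via an NMI IPI shorthand.
	 */
	if (is_hpet_hld_interrupt(hld_data)) {
		kick_timer(hld_data);
		if (cpumask_weight(hld_data->monitored_cpumask) > 1)
			apic->send_IPI_allbutself(NMI_VECTOR);
	}

	/* Both the handling CPU and the IPI'd CPUs look for hardlockups. */
	if (cpumask_test_cpu(smp_processor_id(), hld_data->monitored_cpumask))
		inspect_for_hardlockups(regs);

	return NMI_HANDLED;
}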
[PATCH v6 24/29] watchdog/hardlockup: Use parse_option_str() to handle "nmi_watchdog"
Prepare hardlockup_panic_setup() to handle a comma-separated list of options. Thus, it can continue parsing its own command-line options while ignoring parameters that are relevant only to specific implementations of the hardlockup detector. Such implementations may use an early_param to parse their own options. Cc: Andi Kleen Cc: Nicholas Piggin Cc: Stephane Eranian Cc: "Ravi V. Shankar" Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- Changes since v5: * Corrected typo in commit message. (Tony) Changes since v4: * None Changes since v3: * None Changes since v2: * Introduced this patch. Changes since v1: * None --- kernel/watchdog.c | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/kernel/watchdog.c b/kernel/watchdog.c index 9166220457bc..6443841a755f 100644 --- a/kernel/watchdog.c +++ b/kernel/watchdog.c @@ -73,13 +73,13 @@ void __init hardlockup_detector_disable(void) static int __init hardlockup_panic_setup(char *str) { - if (!strncmp(str, "panic", 5)) + if (parse_option_str(str, "panic")) hardlockup_panic = 1; - else if (!strncmp(str, "nopanic", 7)) + else if (parse_option_str(str, "nopanic")) hardlockup_panic = 0; - else if (!strncmp(str, "0", 1)) + else if (parse_option_str(str, "0")) nmi_watchdog_user_enabled = 0; - else if (!strncmp(str, "1", 1)) + else if (parse_option_str(str, "1")) nmi_watchdog_user_enabled = 1; return 1; } -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
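For illustration, parse_option_str() (lib/cmdline.c) matches whole comma-separated tokens, so a hypothetical command line nmi_watchdog=panic,hpet behaves as sketched below; the "hpet" token is an assumed example of an implementation-specific option left for its own early_param:

#include <linux/kernel.h>

static void __init example(void)
{
	const char *str = "panic,hpet";	/* what hardlockup_panic_setup() gets */
	bool v;

	v = parse_option_str(str, "panic");	/* true: honored by the generic code */
	v = parse_option_str(str, "hpet");	/* true: left for a detector's early_param */
	v = parse_option_str(str, "pan");	/* false: partial tokens do not match */
	(void)v;
}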
[PATCH v6 20/29] init/main: Delay initialization of the lockup detector after smp_init()
Certain implementations of the hardlockup detector require support for Inter-Processor Interrupt shorthands. On x86, support for these can only be determined after all the possible CPUs have booted once (in smp_init()). Other architectures may not need such a check. lockup_detector_init() only initializes the data structures of the lockup detector. Hence, it has no dependencies on smp_init() and can safely run after it. Cc: Andi Kleen Cc: Nicholas Piggin Cc: Andrew Morton Cc: Stephane Eranian Cc: "Ravi V. Shankar" Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- Changes since v5: * Introduced this patch Changes since v4: * N/A Changes since v3: * N/A Changes since v2: * N/A Changes since v1: * N/A --- init/main.c | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/init/main.c b/init/main.c index 98182c3c2c4b..62c52c9e4c2b 100644 --- a/init/main.c +++ b/init/main.c @@ -1600,9 +1600,11 @@ static noinline void __init kernel_init_freeable(void) rcu_init_tasks_generic(); do_pre_smp_initcalls(); - lockup_detector_init(); smp_init(); + + lockup_detector_init(); + sched_init_smp(); padata_init(); -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v6 21/29] x86/nmi: Add an NMI_WATCHDOG NMI handler category
Add NMI_WATCHDOG as a new category of NMI handler. This new category is to be used with the HPET-based hardlockup detector. This detector does not have a direct way of checking if the HPET timer is the source of the NMI. Instead, it indirectly estimates it using the time-stamp counter. Therefore, we may have false positives if another NMI occurs within the estimated time window. For this reason, we want the handler of the detector to be called after all the NMI_LOCAL handlers. A simple way of achieving this is with a new NMI handler category. Cc: Andi Kleen Cc: Andrew Morton Cc: "Ravi V. Shankar" Cc: Stephane Eranian Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- Changes since v5: * Updated to call instrumentation_end() as per f051f6979550 ("x86/nmi: Protect NMI entry against instrumentation") Changes since v4: * None Changes since v3: * None Changes since v2: * Introduced this patch. Changes since v1: * N/A --- arch/x86/include/asm/nmi.h | 1 + arch/x86/kernel/nmi.c | 10 ++++++++++ 2 files changed, 11 insertions(+) diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h index 1cb9c17a4cb4..4a0d5b562c91 100644 --- a/arch/x86/include/asm/nmi.h +++ b/arch/x86/include/asm/nmi.h @@ -28,6 +28,7 @@ enum { NMI_UNKNOWN, NMI_SERR, NMI_IO_CHECK, + NMI_WATCHDOG, NMI_MAX }; diff --git a/arch/x86/kernel/nmi.c b/arch/x86/kernel/nmi.c index e73f7df362f5..fde387e0812a 100644 --- a/arch/x86/kernel/nmi.c +++ b/arch/x86/kernel/nmi.c @@ -61,6 +61,10 @@ static struct nmi_desc nmi_desc[NMI_MAX] = .lock = __RAW_SPIN_LOCK_UNLOCKED(&nmi_desc[3].lock), .head = LIST_HEAD_INIT(nmi_desc[3].head), }, + { + .lock = __RAW_SPIN_LOCK_UNLOCKED(&nmi_desc[4].lock), + .head = LIST_HEAD_INIT(nmi_desc[4].head), + }, }; @@ -168,6 +172,8 @@ int __register_nmi_handler(unsigned int type, struct nmiaction *action) */ WARN_ON_ONCE(type == NMI_SERR && !list_empty(&desc->head)); WARN_ON_ONCE(type == NMI_IO_CHECK && !list_empty(&desc->head)); + WARN_ON_ONCE(type == NMI_WATCHDOG && !list_empty(&desc->head)); + /* * some handlers need to be executed first otherwise a fake @@ -379,6 +385,10 @@ static noinstr void default_do_nmi(struct pt_regs *regs) } raw_spin_unlock(&nmi_reason_lock); + handled = nmi_handle(NMI_WATCHDOG, regs); + if (handled == NMI_HANDLED) + goto out; + /* * Only one NMI can be latched at a time. To handle * this we may process multiple nmi handlers at once to -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
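A user of the new category would register roughly as in the sketch below (handler and registration names are illustrative). Because default_do_nmi() consults NMI_WATCHDOG only after the NMI_LOCAL and platform handlers, the handler runs last, which reduces the chance that the TSC-based estimate misattributes an NMI another handler would have claimed:

#include <asm/nmi.h>

static int hld_nmi_handler(unsigned int type, struct pt_regs *regs)
{
	/* e.g., decide via the TSC whether the HPET channel fired */
	inspect_for_hardlockups(regs);
	return NMI_HANDLED;
}

static int __init hld_register(void)
{
	/* The category holds a single handler (see the WARN_ON_ONCE above) */
	return register_nmi_handler(NMI_WATCHDOG, hld_nmi_handler, 0,
				    "hld_nmi");
}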
[PATCH v6 18/29] watchdog/hardlockup: Define a generic function to detect hardlockups
The procedure to detect hardlockups is independent of the underlying mechanism that generates the non-maskable interrupt used to drive the detector. Thus, it can be put in a separate, generic function. In this manner, it can be invoked by various implementations of the NMI watchdog. For this purpose, move the bulk of watchdog_overflow_callback() to the new function inspect_for_hardlockups(). This function can then be called from the applicable NMI handlers. No functional changes. Cc: Andi Kleen Cc: Nicholas Piggin Cc: Andrew Morton Cc: Stephane Eranian Cc: "Ravi V. Shankar" Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- Changes since v5: * None Changes since v4: * None Changes since v3: * None Changes since v2: * None Changes since v1: * None --- include/linux/nmi.h | 1 + kernel/watchdog_hld.c | 18 +++++++++++-------- 2 files changed, 12 insertions(+), 7 deletions(-) diff --git a/include/linux/nmi.h b/include/linux/nmi.h index 750c7f395ca9..1b68f48ad440 100644 --- a/include/linux/nmi.h +++ b/include/linux/nmi.h @@ -207,6 +207,7 @@ int proc_nmi_watchdog(struct ctl_table *, int , void *, size_t *, loff_t *); int proc_soft_watchdog(struct ctl_table *, int , void *, size_t *, loff_t *); int proc_watchdog_thresh(struct ctl_table *, int , void *, size_t *, loff_t *); int proc_watchdog_cpumask(struct ctl_table *, int, void *, size_t *, loff_t *); +void inspect_for_hardlockups(struct pt_regs *regs); #ifdef CONFIG_HAVE_ACPI_APEI_NMI #include <asm/nmi.h> diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c index 247bf0b1582c..b352e507b17f 100644 --- a/kernel/watchdog_hld.c +++ b/kernel/watchdog_hld.c @@ -106,14 +106,8 @@ static struct perf_event_attr wd_hw_attr = { .disabled = 1, }; -/* Callback function for perf event subsystem */ -static void watchdog_overflow_callback(struct perf_event *event, - struct perf_sample_data *data, - struct pt_regs *regs) +void inspect_for_hardlockups(struct pt_regs *regs) { - /* Ensure the watchdog never gets throttled */ - event->hw.interrupts = 0; - if (__this_cpu_read(watchdog_nmi_touch) == true) { __this_cpu_write(watchdog_nmi_touch, false); return; @@ -163,6 +157,16 @@ static void watchdog_overflow_callback(struct perf_event *event, return; } +/* Callback function for perf event subsystem */ +static void watchdog_overflow_callback(struct perf_event *event, + struct perf_sample_data *data, + struct pt_regs *regs) +{ + /* Ensure the watchdog never gets throttled */ + event->hw.interrupts = 0; + inspect_for_hardlockups(regs); +} + static int hardlockup_detector_event_create(void) { unsigned int cpu = smp_processor_id(); -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v6 19/29] watchdog/hardlockup: Decouple the hardlockup detector from perf
The current default implementation of the hardlockup detector assumes that it is implemented using perf events. However, the hardlockup detector can be driven by other sources of non-maskable interrupts (e.g., a properly configured timer). Group and wrap in #ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF all the code specific to perf: create and manage perf events, stop and start the perf-based detector. The generic portion of the detector (monitor the timers' thresholds, check timestamps and detect hardlockups as well as the implementation of arch_touch_nmi_watchdog()) is now selected with the new intermediate config symbol CONFIG_HARDLOCKUP_DETECTOR_CORE. The perf-based implementation of the detector selects the new intermediate symbol. Other implementations should do the same. Cc: Andi Kleen Cc: Nicholas Piggin Cc: Andrew Morton Cc: Stephane Eranian Cc: "Ravi V. Shankar" Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- Changes since v5: * None Changes since v4: * None Changes since v3: * Squashed into this patch a previous patch to make arch_touch_nmi_watchdog() part of the core detector code. Changes since v2: * Undid split of the generic hardlockup detector into a separate file. (Thomas Gleixner) * Added a new intermediate symbol CONFIG_HARDLOCKUP_DETECTOR_CORE to select generic parts of the detector (Paul E. McKenney, Thomas Gleixner). Changes since v1: * Make the generic detector code with CONFIG_HARDLOCKUP_DETECTOR. --- include/linux/nmi.h | 5 - kernel/Makefile | 2 +- kernel/watchdog_hld.c | 32 lib/Kconfig.debug | 4 4 files changed, 29 insertions(+), 14 deletions(-) diff --git a/include/linux/nmi.h b/include/linux/nmi.h index 1b68f48ad440..cf12380e51b3 100644 --- a/include/linux/nmi.h +++ b/include/linux/nmi.h @@ -94,8 +94,11 @@ static inline void hardlockup_detector_disable(void) {} # define NMI_WATCHDOG_SYSCTL_PERM 0444 #endif -#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF) +#if defined(CONFIG_HARDLOCKUP_DETECTOR_CORE) extern void arch_touch_nmi_watchdog(void); +#endif + +#if defined(CONFIG_HARDLOCKUP_DETECTOR_PERF) extern void hardlockup_detector_perf_stop(void); extern void hardlockup_detector_perf_restart(void); extern void hardlockup_detector_perf_disable(void); diff --git a/kernel/Makefile b/kernel/Makefile index 847a82bfe0e3..27e75b735ef7 100644 --- a/kernel/Makefile +++ b/kernel/Makefile @@ -95,7 +95,7 @@ obj-$(CONFIG_FAIL_FUNCTION) += fail_function.o obj-$(CONFIG_KGDB) += debug/ obj-$(CONFIG_DETECT_HUNG_TASK) += hung_task.o obj-$(CONFIG_LOCKUP_DETECTOR) += watchdog.o -obj-$(CONFIG_HARDLOCKUP_DETECTOR_PERF) += watchdog_hld.o +obj-$(CONFIG_HARDLOCKUP_DETECTOR_CORE) += watchdog_hld.o obj-$(CONFIG_SECCOMP) += seccomp.o obj-$(CONFIG_RELAY) += relay.o obj-$(CONFIG_SYSCTL) += utsname_sysctl.o diff --git a/kernel/watchdog_hld.c b/kernel/watchdog_hld.c index b352e507b17f..bb6435978c46 100644 --- a/kernel/watchdog_hld.c +++ b/kernel/watchdog_hld.c @@ -22,12 +22,8 @@ static DEFINE_PER_CPU(bool, hard_watchdog_warn); static DEFINE_PER_CPU(bool, watchdog_nmi_touch); -static DEFINE_PER_CPU(struct perf_event *, watchdog_ev); -static DEFINE_PER_CPU(struct perf_event *, dead_event); -static struct cpumask dead_events_mask; static unsigned long hardlockup_allcpu_dumped; -static atomic_t watchdog_cpus = ATOMIC_INIT(0); notrace void arch_touch_nmi_watchdog(void) { @@ -98,14 +94,6 @@ static inline bool watchdog_check_timestamp(void) } #endif -static struct perf_event_attr wd_hw_attr = { .type = 
PERF_TYPE_HARDWARE, - .config = PERF_COUNT_HW_CPU_CYCLES, - .size = sizeof(struct perf_event_attr), - .pinned = 1, - .disabled = 1, -}; - void inspect_for_hardlockups(struct pt_regs *regs) { if (__this_cpu_read(watchdog_nmi_touch) == true) { @@ -157,6 +145,24 @@ void inspect_for_hardlockups(struct pt_regs *regs) return; } +#ifdef CONFIG_HARDLOCKUP_DETECTOR_PERF +#undef pr_fmt +#define pr_fmt(fmt) "NMI perf watchdog: " fmt + +static DEFINE_PER_CPU(struct perf_event *, watchdog_ev); +static DEFINE_PER_CPU(struct perf_event *, dead_event); +static struct cpumask dead_events_mask; + +static atomic_t watchdog_cpus = ATOMIC_INIT(0); + +static struct perf_event_attr wd_hw_attr = { + .type = PERF_TYPE_HARDWARE, + .config = PERF_COUNT_HW_CPU_CYCLES, + .size = sizeof(struct perf_event_attr), + .pinned = 1, + .disabled = 1, +}; + /* Callback function for perf event subsystem */ static void watchdog_overflow_callback(struct perf_event *event, struct perf_sample_data *data, @@ -298,3 +304,5 @@ int __init hardlockup_detector_perf_init(void) } return ret; } + +#endif /*
[PATCH v6 17/29] x86/hpet: Reserve an HPET channel for the hardlockup detector
The HPET hardlockup detector needs a dedicated HPET channel. Hence, create a new HPET_MODE_NMI_WATCHDOG mode category to indicate that it cannot be used for other purposes. Using MSI interrupts greatly simplifies the implementation of the detector. Specifically, it helps to avoid the complexities of routing the interrupt via the IO-APIC (e.g., potential race conditions that arise from re-programming the IO-APIC while also servicing an NMI). Therefore, only reserve the timer if it supports Front Side Bus interrupt delivery. HPET channels are reserved at various stages. First, from x86_late_time_init(), hpet_time_init() checks if the HPET timer supports Legacy Replacement Routing. If this is the case, channels 0 and 1 are reserved as HPET_MODE_LEGACY. At a later stage, from lockup_detector_init(), reserve the HPET channel for the hardlockup detector. Then, the HPET clocksource reserves the channels it needs and then the remaining channels are given to the HPET char driver via hpet_alloc(). Hence, the channel assigned to the HPET hardlockup detector depends on whether the first two channels are reserved for legacy mode. Lastly, only reserve the channel for the hardlockup detector if enabled in the kernel command line. Cc: Andi Kleen Cc: Stephane Eranian Cc: "Ravi V. Shankar" Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- Changes since v5: * Added a check for the allowed maximum frequency of the HPET. * Added hpet_hld_free_timer() to properly free the reserved HPET channel if the initialization is not completed. * Call hpet_assign_irq() with as_nmi = true. * Relocated declarations of functions and data structures of the detector to not depend on CONFIG_HPET_TIMER. Changes since v4: * Reworked timer reservation to use Thomas' rework on HPET channel management. * Removed hard-coded channel number for the hardlockup detector. * Provided more details on the sequence of HPET channel reservations. (Thomas Gleixner) * Only reserve a channel for the hardlockup detector if enabled via kernel command line. The function reserving the channel is called from hardlockup detector. (Thomas Gleixner) * Shorten the name of hpet_hardlockup_detector_get_timer() to hpet_hld_get_timer(). (Andi) * Simplify error handling when a channel is not found. 
(Tony) Changes since v3: * None Changes since v2: * None Changes since v1: * None --- arch/x86/include/asm/hpet.h | 22 arch/x86/kernel/hpet.c | 105 2 files changed, 127 insertions(+) diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h index 486e001413c7..5762bd0169a1 100644 --- a/arch/x86/include/asm/hpet.h +++ b/arch/x86/include/asm/hpet.h @@ -103,4 +103,26 @@ static inline int is_hpet_enabled(void) { return 0; } #define default_setup_hpet_msi NULL #endif + +#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET +/** + * struct hpet_hld_data - Data needed to operate the detector + * @has_periodic: The HPET channel supports periodic mode + * @channel: HPET channel assigned to the detector + * @channel_priv: Private data of the assigned channel + * @ticks_per_second: Frequency of the HPET timer + * @irq: IRQ number assigned to the HPET channel + */ +struct hpet_hld_data { + bool has_periodic; + u32 channel; + struct hpet_channel *channel_priv; + u64 ticks_per_second; + int irq; +}; + +extern struct hpet_hld_data *hpet_hld_get_timer(void); +extern void hpet_hld_free_timer(struct hpet_hld_data *hdata); +#endif /* CONFIG_X86_HARDLOCKUP_DETECTOR_HPET */ + #endif /* _ASM_X86_HPET_H */ diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c index 02d25e00e93f..ee9275c013f5 100644 --- a/arch/x86/kernel/hpet.c +++ b/arch/x86/kernel/hpet.c @@ -20,6 +20,7 @@ enum hpet_mode { HPET_MODE_LEGACY, HPET_MODE_CLOCKEVT, HPET_MODE_DEVICE, + HPET_MODE_NMI_WATCHDOG, }; struct hpet_channel { @@ -216,6 +217,7 @@ static void __init hpet_reserve_platform_timers(void) break; case HPET_MODE_CLOCKEVT: case HPET_MODE_LEGACY: + case HPET_MODE_NMI_WATCHDOG: hpet_reserve_timer(&hd, hc->num); break; } @@ -1496,3 +1498,106 @@ irqreturn_t hpet_rtc_interrupt(int irq, void *dev_id) } EXPORT_SYMBOL_GPL(hpet_rtc_interrupt); #endif + +#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR_HPET + +/* + * We program the timer in 32-bit mode to reduce the number of register + * accesses. The maximum value of watchdog_thresh is 60 seconds. The HPET counter + * should not wrap around more frequently than that. Thus, the frequency of the + * HPET timer must be less than 71.582788 MHz.
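The 71.582788 MHz bound in the comment above follows directly from the 32-bit programming and the 60-second maximum threshold; a quick standalone check:

#include <stdio.h>

/*
 * A 32-bit HPET counter must not wrap within the maximum watchdog
 * period of 60 seconds, which caps the usable HPET frequency.
 */
int main(void)
{
	unsigned long long span = 1ULL << 32;	/* counter values before wrap */
	double max_hz = (double)span / 60.0;	/* ticks per second at the limit */

	printf("%.6f MHz\n", max_hz / 1e6);	/* prints 71.582788 */
	return 0;
}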
[PATCH v6 08/29] iommu/vt-d: Rework prepare_irte() to support per-IRQ delivery mode
struct irq_cfg::delivery_mode specifies the delivery mode of each IRQ separately. Configuring the delivery mode of an IRTE would require adding a third argument to prepare_irte(). Instead, simply take a pointer to the irq_cfg for which an IRTE is being configured. This change does not cause functional changes. Cc: Andi Kleen Cc: David Woodhouse Cc: "Ravi V. Shankar" Cc: Lu Baolu Cc: Stephane Eranian Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Reviewed-by: Ashok Raj Reviewed-by: Tony Luck Reviewed-by: Lu Baolu Signed-off-by: Ricardo Neri --- Changes since v5: * Only change the signature of prepare_irte(). A separate patch changes the setting of the delivery_mode. Changes since v4: * None Changes since v3: * None Changes since v2: * None Changes since v1: * Introduced this patch. --- drivers/iommu/intel/irq_remapping.c | 9 - 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c index d2764a71f91a..66d37186ec28 100644 --- a/drivers/iommu/intel/irq_remapping.c +++ b/drivers/iommu/intel/irq_remapping.c @@ -,7 +,7 @@ void intel_irq_remap_add_device(struct dmar_pci_notify_info *info) dev_set_msi_domain(>dev->dev, map_dev_to_ir(info->dev)); } -static void prepare_irte(struct irte *irte, int vector, unsigned int dest) +static void prepare_irte(struct irte *irte, struct irq_cfg *irq_cfg) { memset(irte, 0, sizeof(*irte)); @@ -1126,8 +1126,8 @@ static void prepare_irte(struct irte *irte, int vector, unsigned int dest) */ irte->trigger_mode = 0; irte->dlvry_mode = apic->delivery_mode; - irte->vector = vector; - irte->dest_id = IRTE_DEST(dest); + irte->vector = irq_cfg->vector; + irte->dest_id = IRTE_DEST(irq_cfg->dest_apicid); /* * When using the destination mode of physical APICID, only the @@ -1278,8 +1278,7 @@ static void intel_irq_remapping_prepare_irte(struct intel_ir_data *data, { struct irte *irte = >irte_entry; - prepare_irte(irte, irq_cfg->vector, irq_cfg->dest_apicid); - + prepare_irte(irte, irq_cfg); switch (info->type) { case X86_IRQ_ALLOC_TYPE_IOAPIC: /* Set source-id of interrupt request */ -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v6 01/29] irq/matrix: Expose functions to allocate the best CPU for new vectors
Certain types of interrupts, such as NMI, do not have an associated vector. They, however, target specific CPUs. Thus, when assigning the destination CPU, it is beneficial to select the one with the lowest number of vectors. Prepend the functions matrix_find_best_cpu() and matrix_find_best_cpu_managed() with the irq_ prefix and expose them for IRQ controllers to use when allocating and activating vector-less IRQs. Cc: Andi Kleen Cc: "Ravi V. Shankar" Cc: Stephane Eranian Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- Changes since v5: * Introduced this patch. Changes since v4: * N/A Changes since v3: * N/A Changes since v2: * N/A Changes since v1: * N/A --- include/linux/irq.h | 4 ++++ kernel/irq/matrix.c | 32 +++++++++++++++++++++++---------- 2 files changed, 27 insertions(+), 9 deletions(-) diff --git a/include/linux/irq.h b/include/linux/irq.h index f92788ccdba2..9e674e73d295 100644 --- a/include/linux/irq.h +++ b/include/linux/irq.h @@ -1223,6 +1223,10 @@ struct irq_matrix *irq_alloc_matrix(unsigned int matrix_bits, void irq_matrix_online(struct irq_matrix *m); void irq_matrix_offline(struct irq_matrix *m); void irq_matrix_assign_system(struct irq_matrix *m, unsigned int bit, bool replace); +unsigned int irq_matrix_find_best_cpu(struct irq_matrix *m, + const struct cpumask *msk); +unsigned int irq_matrix_find_best_cpu_managed(struct irq_matrix *m, + const struct cpumask *msk); int irq_matrix_reserve_managed(struct irq_matrix *m, const struct cpumask *msk); void irq_matrix_remove_managed(struct irq_matrix *m, const struct cpumask *msk); int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk, diff --git a/kernel/irq/matrix.c b/kernel/irq/matrix.c index 1698e77645ac..810479f608f4 100644 --- a/kernel/irq/matrix.c +++ b/kernel/irq/matrix.c @@ -125,9 +125,16 @@ static unsigned int matrix_alloc_area(struct irq_matrix *m, struct cpumap *cm, return area; } -/* Find the best CPU which has the lowest vector allocation count */ -static unsigned int matrix_find_best_cpu(struct irq_matrix *m, - const struct cpumask *msk) +/** + * irq_matrix_find_best_cpu() - Find the best CPU for an IRQ + * @m: Matrix pointer + * @msk: On which CPUs the search will be performed + * + * Find the best CPU which has the lowest vector allocation count + * Returns: The best CPU to use + */ +unsigned int irq_matrix_find_best_cpu(struct irq_matrix *m, + const struct cpumask *msk) { unsigned int cpu, best_cpu, maxavl = 0; struct cpumap *cm; @@ -146,9 +153,16 @@ static unsigned int matrix_find_best_cpu(struct irq_matrix *m, return best_cpu; } -/* Find the best CPU which has the lowest number of managed IRQs allocated */ -static unsigned int matrix_find_best_cpu_managed(struct irq_matrix *m, - const struct cpumask *msk) +/** + * irq_matrix_find_best_cpu_managed() - Find the best CPU for a managed IRQ + * @m: Matrix pointer + * @msk: On which CPUs the search will be performed + * + * Find the best CPU which has the lowest number of managed IRQs allocated + * Returns: The best CPU to use + */ +unsigned int irq_matrix_find_best_cpu_managed(struct irq_matrix *m, + const struct cpumask *msk) { unsigned int cpu, best_cpu, allocated = UINT_MAX; struct cpumap *cm; @@ -292,7 +306,7 @@ int irq_matrix_alloc_managed(struct irq_matrix *m, const struct cpumask *msk, if (cpumask_empty(msk)) return -EINVAL; - cpu = matrix_find_best_cpu_managed(m, msk); + cpu = irq_matrix_find_best_cpu_managed(m, msk); if (cpu == UINT_MAX) return -ENOSPC; @@ -381,13 
+395,13 @@ int irq_matrix_alloc(struct irq_matrix *m, const struct cpumask *msk, struct cpumap *cm; /* -* Not required in theory, but matrix_find_best_cpu() uses +* Not required in theory, but irq_matrix_find_best_cpu() uses * for_each_cpu() which ignores the cpumask on UP . */ if (cpumask_empty(msk)) return -EINVAL; - cpu = matrix_find_best_cpu(m, msk); + cpu = irq_matrix_find_best_cpu(m, msk); if (cpu == UINT_MAX) return -ENOSPC; -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v6 16/29] x86/hpet: Prepare IRQ assignments to use the X86_ALLOC_AS_NMI flag
The flag X86_IRQ_ALLOC_AS_NMI indicates that the IRQs to be allocated in an IRQ domain need to be configured as NMIs. Add an as_nmi argument to hpet_assign_irq(). Even though the HPET clock events do not need NMI IRQs, the HPET hardlockup detector does. A subsequent changeset will implement the reservation of a channel for it. Cc: Andi Kleen Cc: "Ravi V. Shankar" Cc: Stephane Eranian Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Suggested-by: Thomas Gleixner Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- Changes since v5: * Introduced this patch. Changes since v4: * N/A Changes since v3: * N/A Changes since v2: * N/A Changes since v1: * N/A --- arch/x86/kernel/hpet.c | 6 ++++-- 1 file changed, 4 insertions(+), 2 deletions(-) diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c index 2c6713b40921..02d25e00e93f 100644 --- a/arch/x86/kernel/hpet.c +++ b/arch/x86/kernel/hpet.c @@ -618,7 +618,7 @@ static inline int hpet_dev_id(struct irq_domain *domain) } static int hpet_assign_irq(struct irq_domain *domain, struct hpet_channel *hc, - int dev_num) + int dev_num, bool as_nmi) { struct irq_alloc_info info; @@ -627,6 +627,8 @@ static int hpet_assign_irq(struct irq_domain *domain, struct hpet_channel *hc, info.data = hc; info.devid = hpet_dev_id(domain); info.hwirq = dev_num; + if (as_nmi) + info.flags |= X86_IRQ_ALLOC_AS_NMI; return irq_domain_alloc_irqs(domain, 1, NUMA_NO_NODE, &info); } @@ -755,7 +757,7 @@ static void __init hpet_select_clockevents(void) sprintf(hc->name, "hpet%d", i); - irq = hpet_assign_irq(hpet_domain, hc, hc->num); + irq = hpet_assign_irq(hpet_domain, hc, hc->num, false); if (irq <= 0) continue; -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v6 15/29] x86/hpet: Add helper function hpet_set_comparator_periodic()
Programming an HPET channel as periodic requires setting the HPET_TN_SETVAL bit in the channel configuration. Plus, the comparator register must be written twice (once for the comparator value and once for the periodic value). Since this programming might be needed in several places (e.g., the HPET clocksource and the HPET-based hardlockup detector), add a helper function for this purpose. A helper function hpet_set_comparator_oneshot() could also be implemented. However, such a function would only program the comparator register and would be quite small. Hence, it is better to not bloat the code with such an obvious function. Cc: Andi Kleen Cc: Tony Luck Cc: Stephane Eranian Cc: "Ravi V. Shankar" Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Originally-by: Suravee Suthikulpanit Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- When programming the HPET channel in periodic mode, a udelay(1) between the two successive writes to HPET_Tn_CMP was introduced in commit e9e2cdb41241 ("[PATCH] clockevents: i386 drivers"). The commit message does not give any reason for such delay. The hardware specification does not seem to require it. The refactoring in this patch simply carries such delay. --- Changes since v5: * None Changes since v4: * Implement function only for periodic mode. This removed extra logic to use a non-zero period value as a proxy for periodic mode programming. (Thomas) * Added a comment on the history of the udelay() when programming the channel in periodic mode. (Ashok) Changes since v3: * Added back a missing hpet_writel() for time configuration. Changes since v2: * Introduced this patch. Changes since v1: * N/A --- arch/x86/include/asm/hpet.h | 2 ++ arch/x86/kernel/hpet.c | 49 +++++++++++++++++++++++++++++++++++++++++++------------ 2 files changed, 39 insertions(+), 12 deletions(-) diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h index be9848f0883f..486e001413c7 100644 --- a/arch/x86/include/asm/hpet.h +++ b/arch/x86/include/asm/hpet.h @@ -74,6 +74,8 @@ extern void hpet_disable(void); extern unsigned int hpet_readl(unsigned int a); extern void hpet_writel(unsigned int d, unsigned int a); extern void force_hpet_resume(void); +extern void hpet_set_comparator_periodic(int channel, unsigned int cmp, +unsigned int period); #ifdef CONFIG_HPET_EMULATE_RTC diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c index 47678e7927ff..2c6713b40921 100644 --- a/arch/x86/kernel/hpet.c +++ b/arch/x86/kernel/hpet.c @@ -294,6 +294,39 @@ static void hpet_enable_legacy_int(void) hpet_legacy_int_enabled = true; } +/** + * hpet_set_comparator_periodic() - Helper function to set periodic channel + * @channel: The HPET channel + * @cmp: The value to be written to the comparator/accumulator + * @period: Number of ticks per period + * + * Helper function for updating comparator, accumulator and period values. + * + * In periodic mode, HPET needs HPET_TN_SETVAL to be set before writing + * to the Tn_CMP to update the accumulator. Then, HPET needs a second + * write (with HPET_TN_SETVAL cleared) to Tn_CMP to set the period. + * The HPET_TN_SETVAL bit is automatically cleared after the first write. + * + * This function takes a 1 microsecond delay. However, this function is supposed + * to be called only once (or when reprogramming the timer) as it deals with a + * periodic timer channel. 
+ * + * See the following documents: + * - Intel IA-PC HPET (High Precision Event Timers) Specification + * - AMD-8111 HyperTransport I/O Hub Data Sheet, Publication # 24674 + */ +void hpet_set_comparator_periodic(int channel, unsigned int cmp, unsigned int period) +{ + unsigned int v = hpet_readl(HPET_Tn_CFG(channel)); + + hpet_writel(v | HPET_TN_SETVAL, HPET_Tn_CFG(channel)); + + hpet_writel(cmp, HPET_Tn_CMP(channel)); + + udelay(1); + hpet_writel(period, HPET_Tn_CMP(channel)); +} + static int hpet_clkevt_set_state_periodic(struct clock_event_device *evt) { unsigned int channel = clockevent_to_channel(evt)->num; @@ -306,19 +339,11 @@ static int hpet_clkevt_set_state_periodic(struct clock_event_device *evt) now = hpet_readl(HPET_COUNTER); cmp = now + (unsigned int)delta; cfg = hpet_readl(HPET_Tn_CFG(channel)); - cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_SETVAL | - HPET_TN_32BIT; + cfg |= HPET_TN_ENABLE | HPET_TN_PERIODIC | HPET_TN_32BIT; hpet_writel(cfg, HPET_Tn_CFG(channel)); - hpet_writel(cmp, HPET_Tn_CMP(channel)); - udelay(1); - /* -* HPET on AMD 81xx needs a second write (with HPET_TN_SETVAL -* cleared) to T0_CMP to set the period. The HPET_TN_SETVAL -* bit is automatically cleared after the first write. -* (See AMD-8111 HyperTransport I/O Hub Data Sheet, -* Publication # 24674) -
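A caller would use the helper roughly as follows (a sketch, not the patch code; hdata and its fields are assumed to be the detector's hpet_hld_data from patch 17):

static void setup_hld_periodic(struct hpet_hld_data *hdata,
			       unsigned int thresh)
{
	/*
	 * Ticks in one watchdog period; thresh is in seconds. The
	 * frequency bound from patch 17 keeps this within 32 bits.
	 */
	unsigned int period = hdata->ticks_per_second * thresh;
	unsigned int now = hpet_readl(HPET_COUNTER);

	/* One call covers the SETVAL dance and both comparator writes. */
	hpet_set_comparator_periodic(hdata->channel, now + period, period);
}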
[PATCH v6 03/29] x86/apic/msi: Set the delivery mode individually for each IRQ
There are no hardware restrictions preventing each MSI message from using its own delivery mode. Use the mode specified in the provided IRQ hardware configuration data. Since most of the IRQs are configured to use the delivery mode of the APIC driver in use (set in all of them to APIC_DELIVERY_MODE_FIXED), the only functional changes are where IRQs are configured to use a specific delivery mode. Changing the utility function __irq_msi_compose_msg() takes care of implementing the change in the local APIC, PCI-MSI, and DMAR-MSI irq_chips. The IO-APIC irq_chip configures the entries in the interrupt redirection table using the delivery mode specified in the corresponding MSI message. Since the MSI message is composed by a higher irq_chip in the hierarchy, it does not need to be updated. Cc: Andi Kleen Cc: "Ravi V. Shankar" Cc: Stephane Eranian Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- Changes since v5: * Introduced this patch Changes since v4: * N/A Changes since v3: * N/A Changes since v2: * N/A Changes since v1: * N/A --- arch/x86/kernel/apic/apic.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/x86/kernel/apic/apic.c b/arch/x86/kernel/apic/apic.c index 189d3a5e471a..d1e12da1e9af 100644 --- a/arch/x86/kernel/apic/apic.c +++ b/arch/x86/kernel/apic/apic.c @@ -2528,7 +2528,7 @@ void __irq_msi_compose_msg(struct irq_cfg *cfg, struct msi_msg *msg, msg->arch_addr_lo.dest_mode_logical = apic->dest_mode_logical; msg->arch_addr_lo.destid_0_7 = cfg->dest_apicid & 0xFF; - msg->arch_data.delivery_mode = APIC_DELIVERY_MODE_FIXED; + msg->arch_data.delivery_mode = cfg->delivery_mode; msg->arch_data.vector = cfg->vector; msg->address_hi = X86_MSI_BASE_ADDRESS_HIGH; -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v6 11/29] iommu/amd: Expose [set|get]_dev_entry_bit()
These functions are used to check and set specific bits in a Device Table Entry. For instance, they can be used to modify the setting of the NMIPass field. Currently, these functions are used only for ACPI-specified devices. However, when an interrupt is allocated with NMI as its delivery mode, the Device Table Entry needs to be modified accordingly in irq_remapping_alloc(). As a first step, expose these two functions. No functional changes. Cc: Andi Kleen Cc: "Ravi V. Shankar" Cc: Joerg Roedel Cc: Suravee Suthikulpanit Cc: Stephane Eranian Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Signed-off-by: Ricardo Neri --- Changes since v5: * Introduced this patch Changes since v4: * N/A Changes since v3: * N/A Changes since v2: * N/A Changes since v1: * N/A --- drivers/iommu/amd/amd_iommu.h | 3 +++ drivers/iommu/amd/init.c | 4 ++-- 2 files changed, 5 insertions(+), 2 deletions(-) diff --git a/drivers/iommu/amd/amd_iommu.h b/drivers/iommu/amd/amd_iommu.h index 1ab31074f5b3..9f3d1564c84e 100644 --- a/drivers/iommu/amd/amd_iommu.h +++ b/drivers/iommu/amd/amd_iommu.h @@ -128,4 +128,7 @@ static inline void amd_iommu_apply_ivrs_quirks(void) { } extern void amd_iommu_domain_set_pgtable(struct protection_domain *domain, u64 *root, int mode); + +extern void set_dev_entry_bit(u16 devid, u8 bit); +extern int get_dev_entry_bit(u16 devid, u8 bit); #endif diff --git a/drivers/iommu/amd/init.c b/drivers/iommu/amd/init.c index b4a798c7b347..823e76b284f1 100644 --- a/drivers/iommu/amd/init.c +++ b/drivers/iommu/amd/init.c @@ -914,7 +914,7 @@ static void iommu_enable_gt(struct amd_iommu *iommu) } /* sets a specific bit in the device table entry. */ -static void set_dev_entry_bit(u16 devid, u8 bit) +void set_dev_entry_bit(u16 devid, u8 bit) { int i = (bit >> 6) & 0x03; int _bit = bit & 0x3f; @@ -922,7 +922,7 @@ static void set_dev_entry_bit(u16 devid, u8 bit) amd_iommu_dev_table[devid].data[i] |= (1UL << _bit); } -static int get_dev_entry_bit(u16 devid, u8 bit) +int get_dev_entry_bit(u16 devid, u8 bit) { int i = (bit >> 6) & 0x03; int _bit = bit & 0x3f; -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
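The index arithmetic above splits the 256-bit DTE into four 64-bit words: bit >> 6 selects the word and bit & 0x3f the bit within it. A standalone check, assuming for illustration that DEV_ENTRY_NMI_PASS is DTE bit 90 (0x5a):

#include <stdio.h>

int main(void)
{
	unsigned int bit = 0x5a;		/* assumed DEV_ENTRY_NMI_PASS */
	unsigned int word = (bit >> 6) & 0x03;	/* 90 / 64 -> data[1] */
	unsigned int off = bit & 0x3f;		/* 90 % 64 -> bit 26 */

	printf("data[%u], bit %u\n", word, off);
	return 0;
}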
[PATCH v6 12/29] iommu/amd: Enable NMIPass when allocating an NMI irq
As per the AMD I/O Virtualization Technology (IOMMU) Specification, the AMD IOMMU only remaps fixed and arbitrated MSIs. NMIs are controlled by the NMIPass bit of a Device Table Entry. When set, the IOMMU passes through NMI interrupt messages unmapped. Otherwise, they are aborted. Furthermore, Section 2.2.5 Table 19 states that the IOMMU will also abort NMIs when the destination mode is logical. Update the NMIPass setting of a device's DTE when an NMI irq is being allocated. Only do so when the destination mode of the APIC is not logical. Cc: Andi Kleen Cc: "Ravi V. Shankar" Cc: Joerg Roedel Cc: Suravee Suthikulpanit Cc: Stephane Eranian Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Signed-off-by: Ricardo Neri --- Changes since v5: * Introduced this patch Changes since v4: * N/A Changes since v3: * N/A Changes since v2: * N/A Changes since v1: * N/A --- drivers/iommu/amd/iommu.c | 18 ++ 1 file changed, 18 insertions(+) diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c index a1ada7bff44e..4d7421b6858d 100644 --- a/drivers/iommu/amd/iommu.c +++ b/drivers/iommu/amd/iommu.c @@ -3156,6 +3156,15 @@ static int irq_remapping_alloc(struct irq_domain *domain, unsigned int virq, info->type != X86_IRQ_ALLOC_TYPE_PCI_MSIX) return -EINVAL; + if (info->flags & X86_IRQ_ALLOC_AS_NMI) { + /* Only one IRQ per NMI */ + if (nr_irqs != 1) + return -EINVAL; + + /* NMIs are aborted when the destination mode is logical. */ + if (apic->dest_mode_logical) + return -EPERM; + } /* * With IRQ remapping enabled, don't need contiguous CPU vectors * to support multiple MSI interrupts. @@ -3208,6 +3217,15 @@ static int irq_remapping_alloc(struct irq_domain *domain, unsigned int virq, goto out_free_parent; } + if (info->flags & X86_IRQ_ALLOC_AS_NMI) { + struct amd_iommu *iommu = amd_iommu_rlookup_table[devid]; + + if (!get_dev_entry_bit(devid, DEV_ENTRY_NMI_PASS)) { + set_dev_entry_bit(devid, DEV_ENTRY_NMI_PASS); + iommu_flush_dte(iommu, devid); + } + } + for (i = 0; i < nr_irqs; i++) { irq_data = irq_domain_get_irq_data(domain, virq + i); cfg = irq_data ? irqd_cfg(irq_data) : NULL; -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v6 00/29] x86: Implement an HPET-based hardlockup detector
Hi, This is the sixth version of this patchset. It again took me a while to post this version as I have been working on other projects and implemented major reworks in the series. This work has gone through several versions. I have incorporated feedback from Thomas Gleixner and others. Many of the patches have review tags (the patches for arch/x86 have review tags from the trusted x86 reviewers). I believe it is ready for review by the x86 maintainers. Thus, I have dropped the RFC qualifier. I would especially appreciate review from the IOMMU experts on the patches for the AMD IOMMU (patches 11-13) and from the powerpc experts on the refactoring of the generic hardlockup detector (patches 18-20, 24, 28). Also, I observed that the error in the expected TSC value is largest when the system has just booted; afterwards, the error becomes very small. This version therefore allows a larger-than-necessary error. I would also appreciate feedback on this particular point. Perhaps the system can switch to this detector after boot and free up the PMU counters. Please see the Testing section for details. Previous versions of the patches can be found in [1], [2], [3], [4], and [5]. Also, thanks to Thomas' feedback, this version handles the changes to the interrupt subsystem more cleanly and therefore it is not necessary to handle interrupt remapping separately [6] as in the last version. For easier review, please refer to the changelog below for a list of unchanged, updated, and new patches. == Problem statement == In CPU architectures that do not have a non-maskable interrupt (NMI) watchdog, one can be constructed using any resource capable of issuing NMIs. In x86, this is done using a counter (one per CPU) of the Performance Monitoring Unit (PMU). PMU counters, however, are scarce and used mainly to profile workloads. Using them for the NMI watchdog is overkill. Moreover, since the counter that the NMI watchdog uses cannot be shared with any other perf users, the current perf-event subsystem may return a false positive when validating certain metric groups. Certain metric groups may never get a chance to be scheduled [7], [8]. == Description of the detector == Unlike the perf-based hardlockup detector, this implementation is driven by a single source of NMIs: an HPET channel. One of the CPUs that the hardlockup detector monitors handles the HPET NMI. Upon reception, it issues a shorthand IPI to the rest of the CPUs. All CPUs will get the NMI, but only those online and being monitored will look for hardlockups. The time-stamp counter (TSC) is used to determine whether the HPET caused the NMI. When the HPET channel fires, we compute the value the TSC is expected to have the next time it fires. Once it does, we read the actual TSC value. If it falls within a small error window, we determine that the HPET channel issued the NMI. I have found experimentally that the expected TSC value consistently has an error of less than 0.2% (see further details in the Testing section). A side effect of using this detector is that in a kernel configured with NO_HZ_FULL, the CPUs specified in the nohz_full command-line argument would also receive the NMI IPI (but they would not look for hardlockups). Thus, this hardlockup detector is an opt-in feature. This detector cannot be used on systems that disable the HPET, unless it is force-enabled. On AMD systems with interrupt remapping enabled, the detector can only be used in APIC physical mode (see patch 12). 
== Main updates in this version == Thomas Gleixner pointed out several design issues: 1) In previous versions, the HPET channel targeted individual CPUs in a round-robin manner. This required changing the affinity of the NMI. Doing this with interrupt remapping is not feasible since there is no atomic, lock-less way of updating an IRTE in the interrupt remapping domain. + Solution: In this version, the affinity of the HPET timer never changes while it is in operation. When a change in affinity is needed (e.g., the handling CPU is going offline or stops being monitored), the detector is disabled first. 2) In previous versions, I issued apic::send_IPI_mask_allbutself() to trigger a hardlockup check on the monitored CPUs. This does not work as the target cpumask and the contents of the APIC ICR would be corrupted [9]. + Solution: Only use this detector when IPI shorthands [10] are enabled in the system. 3) In previous versions, I added checks in the interrupt remapping drivers to identify the HPET hardlockup detector IRQ and changed its delivery mode. This looked hacky [11]. + Solution: As per suggestion from Thomas, I added an X86_IRQ_ALLOC_AS_NMI flag to be used when allocating IRQs. Every irq_domain in the IRQ hierarchy will configure the delivery mode of the IRQ as specified in the IRQ config info. == Parts of this series ==
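The TSC check described in the cover letter can be sketched as follows (a sketch only: the field names tsc_next and tsc_ticks_per_period and the size of the error window are illustrative, not the patch's exact implementation):

#include <asm/msr.h>

/* Runs in the NMI handler on the CPU that owns the HPET channel. */
static bool nmi_came_from_hpet(struct hpet_hld_data *hdata)
{
	u64 tsc = rdtsc();
	/* Allow roughly 0.2% of the TSC ticks in one watchdog period. */
	u64 error = hdata->tsc_ticks_per_period / 512;
	bool from_hpet = tsc >= hdata->tsc_next - error &&
			 tsc <= hdata->tsc_next + error;

	/* Expected TSC value when the HPET channel fires next. */
	hdata->tsc_next = tsc + hdata->tsc_ticks_per_period;

	return from_hpet;
}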
[PATCH v6 14/29] x86/hpet: Expose hpet_writel() in header
In order to allow hpet_writel() to be used by other components (e.g., the HPET-based hardlockup detector), expose it in the HPET header file. Cc: Andi Kleen Cc: Stephane Eranian Cc: "Ravi V. Shankar" Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- Changes since v5: * None Changes since v4: * Dropped exposing hpet_readq() as it is not needed. Changes since v3: * None Changes since v2: * None Changes since v1: * None --- arch/x86/include/asm/hpet.h | 1 + arch/x86/kernel/hpet.c | 2 +- 2 files changed, 2 insertions(+), 1 deletion(-) diff --git a/arch/x86/include/asm/hpet.h b/arch/x86/include/asm/hpet.h index ab9f3dd87c80..be9848f0883f 100644 --- a/arch/x86/include/asm/hpet.h +++ b/arch/x86/include/asm/hpet.h @@ -72,6 +72,7 @@ extern int is_hpet_enabled(void); extern int hpet_enable(void); extern void hpet_disable(void); extern unsigned int hpet_readl(unsigned int a); +extern void hpet_writel(unsigned int d, unsigned int a); extern void force_hpet_resume(void); #ifdef CONFIG_HPET_EMULATE_RTC diff --git a/arch/x86/kernel/hpet.c b/arch/x86/kernel/hpet.c index 71f336425e58..47678e7927ff 100644 --- a/arch/x86/kernel/hpet.c +++ b/arch/x86/kernel/hpet.c @@ -79,7 +79,7 @@ inline unsigned int hpet_readl(unsigned int a) return readl(hpet_virt_address + a); } -static inline void hpet_writel(unsigned int d, unsigned int a) +inline void hpet_writel(unsigned int d, unsigned int a) { writel(d, hpet_virt_address + a); } -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v6 13/29] iommu/amd: Compose MSI messages for NMI irqs in non-IR format
If NMIPass is enabled in a device's DTE, the IOMMU lets NMI interrupt messages pass through unmapped. Therefore, the contents of the MSI message, not an IRTE, determine how and where the NMI is delivered. Since the IOMMU driver owns the MSI message of the NMI irq, compose it using the non-interrupt-remapping format. Also, let descendant irqchips write the MSI as appropriate for the device. Cc: Andi Kleen Cc: "Ravi V. Shankar" Cc: Joerg Roedel Cc: Suravee Suthikulpanit Cc: Stephane Eranian Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Signed-off-by: Ricardo Neri --- Changes since v5: * Introduced this patch Changes since v4: * N/A Changes since v3: * N/A Changes since v2: * N/A Changes since v1: * N/A --- drivers/iommu/amd/iommu.c | 23 ++++++++++++++++++++++- 1 file changed, 22 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c index 4d7421b6858d..6e07949b3e2a 100644 --- a/drivers/iommu/amd/iommu.c +++ b/drivers/iommu/amd/iommu.c @@ -3111,7 +3111,16 @@ static void irq_remapping_prepare_irte(struct amd_ir_data *data, case X86_IRQ_ALLOC_TYPE_HPET: case X86_IRQ_ALLOC_TYPE_PCI_MSI: case X86_IRQ_ALLOC_TYPE_PCI_MSIX: - fill_msi_msg(&data->msi_entry, irte_info->index); + if (irq_cfg->delivery_mode == APIC_DELIVERY_MODE_NMI) + /* +* The IOMMU lets NMIs pass through unmapped. Thus, the +* MSI message, not the IRTE, determines the irq +* configuration. Since we own the MSI message, +* compose it. Descendant irqchips will write it. +*/ + __irq_msi_compose_msg(irq_cfg, &data->msi_entry, true); + else + fill_msi_msg(&data->msi_entry, irte_info->index); break; default: @@ -3509,6 +3518,18 @@ static int amd_ir_set_affinity(struct irq_data *data, */ send_cleanup_vector(cfg); + /* +* When the delivery mode of an irq is NMI, the IOMMU lets the NMI +* interrupt messages pass through unmapped. Hence, changes in the +* destination are to be reflected in the NMI message itself, not the +* IRTE. Thus, descendant irqchips must set the affinity and compose +* and write the MSI message. +* +* Also, NMIs do not have an associated vector. No need for cleanup. +*/ + if (cfg->delivery_mode == APIC_DELIVERY_MODE_NMI) + return IRQ_SET_MASK_OK; + return IRQ_SET_MASK_OK_DONE; } -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v6 07/29] iommu/vt-d: Clear the redirection hint when the destination mode is physical
When the destination mode of an interrupt is physical APICID, the interrupt is delivered only to the single CPU of which the physical APICID is specified in the destination ID field. Therefore, the redirection hint is meaningless. Furthermore, on certain processors, the IOMMU does not deliver the interrupt when the delivery mode is NMI, the redirection hint is set, and the destination mode is physical. Clearing the redirection hint ensures that the NMI is delivered. Cc: Andi Kleen Cc: David Woodhouse Cc: "Ravi V. Shankar" Cc: Lu Baolu Cc: Stephane Eranian Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Suggested-by: Ashok Raj Reviewed-by: Lu Baolu Signed-off-by: Ricardo Neri --- Changes since v5: * Introduced this patch. Changes since v4: * N/A Changes since v3: * N/A Changes since v2: * N/A Changes since v1: * N/A --- drivers/iommu/intel/irq_remapping.c | 12 +++- 1 file changed, 11 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c index a67319597884..d2764a71f91a 100644 --- a/drivers/iommu/intel/irq_remapping.c +++ b/drivers/iommu/intel/irq_remapping.c @@ -1128,7 +1128,17 @@ static void prepare_irte(struct irte *irte, int vector, unsigned int dest) irte->dlvry_mode = apic->delivery_mode; irte->vector = vector; irte->dest_id = IRTE_DEST(dest); - irte->redir_hint = 1; + + /* +* When using the destination mode of physical APICID, only the +* processor specified in @dest receives the interrupt. Thus, the +* redirection hint is meaningless. +* +* Furthermore, on some processors, NMIs with physical delivery mode +* and the redirection hint set are delivered as regular interrupts +* or not delivered at all. +*/ + irte->redir_hint = apic->dest_mode_logical; } struct irq_remap_ops intel_irq_remap_ops = { -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v6 10/29] iommu/vt-d: Implement minor tweaks for NMI irqs
The Intel IOMMU interrupt remapping driver already programs the delivery mode of individual irqs correctly, as per their irq_data. Improve the handling of NMIs: allow only one irq per NMI, and skip the cleanup of irq vectors after updating affinity, as NMIs do not have associated vectors. Cc: Andi Kleen Cc: David Woodhouse Cc: "Ravi V. Shankar" Cc: Lu Baolu Cc: Stephane Eranian Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Reviewed-by: Lu Baolu Signed-off-by: Ricardo Neri --- Changes since v5: * Introduced this patch. Changes since v4: * N/A Changes since v3: * N/A Changes since v2: * N/A Changes since v1: * N/A --- drivers/iommu/intel/irq_remapping.c | 9 ++++++++- 1 file changed, 8 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c index fb2d71bea98d..791a9331e257 100644 --- a/drivers/iommu/intel/irq_remapping.c +++ b/drivers/iommu/intel/irq_remapping.c @@ -1198,8 +1198,12 @@ intel_ir_set_affinity(struct irq_data *data, const struct cpumask *mask, * After this point, all the interrupts will start arriving * at the new destination. So, time to cleanup the previous * vector allocation. +* +* Do it only for non-NMI irqs. NMIs don't have associated +* vectors. */ - send_cleanup_vector(cfg); + if (cfg->delivery_mode != APIC_DELIVERY_MODE_NMI) + send_cleanup_vector(cfg); return IRQ_SET_MASK_OK_DONE; } @@ -1352,6 +1356,9 @@ static int intel_irq_remapping_alloc(struct irq_domain *domain, if (info->type == X86_IRQ_ALLOC_TYPE_PCI_MSI) info->flags &= ~X86_IRQ_ALLOC_CONTIGUOUS_VECTORS; + if ((info->flags & X86_IRQ_ALLOC_AS_NMI) && nr_irqs != 1) + return -EINVAL; + ret = irq_domain_alloc_irqs_parent(domain, virq, nr_irqs, arg); if (ret < 0) return ret; -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v6 02/29] x86/apic: Add irq_cfg::delivery_mode
Currently, the delivery mode of all interrupts is set to the mode of the APIC driver in use. There are no restrictions in hardware to configure the delivery mode of each interrupt individually. Also, certain IRQs need to be configured with a specific delivery mode (e.g., NMI). Add a new member, delivery_mode, to struct irq_cfg. Subsequent changesets will update every irq_domain to set the delivery mode of each IRQ to that specified in its irq_cfg data. To keep the current behavior, when allocating an IRQ in the root domain (i.e., the x86_vector_domain), set the delivery mode of the IRQ as that of the APIC driver. Cc: Andi Kleen Cc: "Ravi V. Shankar" Cc: Stephane Eranian Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Reviewed-by: Ashok Raj Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- Changes since v5: * Updated indentation of the existing members of struct irq_cfg. * Reworded the commit message. Changes since v4: * Rebased to use new enumeration apic_delivery_modes. Changes since v3: * None Changes since v2: * Reduced scope to only add the interrupt delivery mode in struct irq_alloc_info. Changes since v1: * Introduced this patch. --- arch/x86/include/asm/hw_irq.h | 5 +++-- arch/x86/kernel/apic/vector.c | 9 +++++++++ 2 files changed, 12 insertions(+), 2 deletions(-) diff --git a/arch/x86/include/asm/hw_irq.h b/arch/x86/include/asm/hw_irq.h index d465ece58151..5ac5e6c603ee 100644 --- a/arch/x86/include/asm/hw_irq.h +++ b/arch/x86/include/asm/hw_irq.h @@ -88,8 +88,9 @@ struct irq_alloc_info { }; struct irq_cfg { - unsigned int dest_apicid; - unsigned int vector; + unsigned int dest_apicid; + unsigned int vector; + enum apic_delivery_modes delivery_mode; }; extern struct irq_cfg *irq_cfg(unsigned int irq); diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c index 3e6f6b448f6a..838e220e8860 100644 --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -567,6 +567,7 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq, irqd->chip_data = apicd; irqd->hwirq = virq + i; irqd_set_single_target(irqd); + /* * Prevent that any of these interrupts is invoked in * non interrupt context via e.g. generic_handle_irq() @@ -577,6 +578,14 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq, /* Don't invoke affinity setter on deactivated interrupts */ irqd_set_affinity_on_activate(irqd); + /* +* Initialize the delivery mode of this irq to match the +* default delivery mode of the APIC. Children irq domains +* may take the delivery mode from the individual irq +* configuration rather than from the APIC driver. +*/ + apicd->hw_irq_cfg.delivery_mode = apic->delivery_mode; + /* * Legacy vectors are already assigned when the IOAPIC * takes them over. They stay on the same vector. This is -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v6 09/29] iommu/vt-d: Set the IRTE delivery mode individually for each IRQ
There are no hardware requirements to use the same delivery mode for all interrupts. Use the mode specified in the provided IRQ hardware configuration data. Since all IRQs are configured to use the delivery mode of the APIC driver, the only functional changes are where IRQs are configured to use a specific delivery mode. Cc: Andi Kleen Cc: David Woodhouse Cc: "Ravi V. Shankar" Cc: Lu Baolu Cc: Stephane Eranian Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Reviewed-by: Tony Luck Reviewed-by: Lu Baolu Signed-off-by: Ricardo Neri --- Changes since v5: * Introduced this patch. Changes since v4: * N/A Changes since v3: * N/A Changes since v2: * N/A Changes since v1: * N/A --- drivers/iommu/intel/irq_remapping.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iommu/intel/irq_remapping.c b/drivers/iommu/intel/irq_remapping.c index 66d37186ec28..fb2d71bea98d 100644 --- a/drivers/iommu/intel/irq_remapping.c +++ b/drivers/iommu/intel/irq_remapping.c @@ -1125,7 +1125,7 @@ static void prepare_irte(struct irte *irte, struct irq_cfg *irq_cfg) * irq migration in the presence of interrupt-remapping. */ irte->trigger_mode = 0; - irte->dlvry_mode = apic->delivery_mode; + irte->dlvry_mode = irq_cfg->delivery_mode; irte->vector = irq_cfg->vector; irte->dest_id = IRTE_DEST(irq_cfg->dest_apicid); -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v6 06/29] x86/apic/vector: Implement support for NMI delivery mode
The flag X86_IRQ_ALLOC_AS_NMI indicates to the interrupt controller that it should configure the delivery mode of an IRQ as NMI. Implement support for such requests. This causes irq_domain children in the hierarchy to configure their irq_chips accordingly. When no specific delivery mode is requested, continue using the delivery mode of the APIC driver in use. Cc: Andi Kleen Cc: "Ravi V. Shankar" Cc: Stephane Eranian Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Suggested-by: Thomas Gleixner Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- Changes since v5: * Introduced this patch. Changes since v4: * N/A Changes since v3: * N/A Changes since v2: * N/A Changes since v1: * N/A --- arch/x86/kernel/apic/vector.c | 12 ++++++++++++ 1 file changed, 12 insertions(+) diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c index 11f881f45cec..df4d7b9f6e27 100644 --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -570,6 +570,10 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq, if ((info->flags & X86_IRQ_ALLOC_CONTIGUOUS_VECTORS) && nr_irqs > 1) return -ENOSYS; + /* Only one IRQ per NMI */ + if ((info->flags & X86_IRQ_ALLOC_AS_NMI) && nr_irqs != 1) + return -EINVAL; + /* * Catch any attempt to touch the cascade interrupt on a PIC * equipped system. @@ -610,7 +614,15 @@ static int x86_vector_alloc_irqs(struct irq_domain *domain, unsigned int virq, * default delivery mode of the APIC. Children irq domains * may take the delivery mode from the individual irq * configuration rather than from the APIC driver. +* +* Vectors are meaningless if the delivery mode is NMI. Since +* nr_irqs is 1, we can return. */ + if (info->flags & X86_IRQ_ALLOC_AS_NMI) { + apicd->hw_irq_cfg.delivery_mode = APIC_DELIVERY_MODE_NMI; + return 0; + } + apicd->hw_irq_cfg.delivery_mode = apic->delivery_mode; /* -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v6 04/29] x86/apic: Add the X86_IRQ_ALLOC_AS_NMI irq allocation flag
There are cases in which it is necessary to set the delivery mode of an interrupt as NMI. Add a new flag that callers can specify when allocating an IRQ. Cc: Andi Kleen Cc: "Ravi V. Shankar" Cc: Stephane Eranian Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Suggested-by: Thomas Gleixner Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- Changes since v5: * Introduced this patch. Changes since v4: * N/A Changes since v3: * N/A Changes since v2: * N/A Changes since v1: * N/A --- arch/x86/include/asm/irqdomain.h | 1 + 1 file changed, 1 insertion(+) diff --git a/arch/x86/include/asm/irqdomain.h b/arch/x86/include/asm/irqdomain.h index 125c23b7bad3..de1cf2e80443 100644 --- a/arch/x86/include/asm/irqdomain.h +++ b/arch/x86/include/asm/irqdomain.h @@ -10,6 +10,7 @@ enum { /* Allocate contiguous CPU vectors */ X86_IRQ_ALLOC_CONTIGUOUS_VECTORS = 0x1, X86_IRQ_ALLOC_LEGACY = 0x2, + X86_IRQ_ALLOC_AS_NMI = 0x4, }; extern int x86_fwspec_is_ioapic(struct irq_fwspec *fwspec); -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v6 05/29] x86/apic/vector: Do not allocate vectors for NMIs
Vectors are meaningless when allocating IRQs with NMI as the delivery mode. In such a case, skip the reservation of IRQ vectors. Do it in the lowest-level functions where the actual IRQ reservation takes place. Since NMIs target specific CPUs, keep the functionality to find the best CPU. Cc: Andi Kleen Cc: "Ravi V. Shankar" Cc: Stephane Eranian Cc: iommu@lists.linux-foundation.org Cc: linuxppc-...@lists.ozlabs.org Cc: x...@kernel.org Reviewed-by: Tony Luck Signed-off-by: Ricardo Neri --- Changes since v5: * Introduced this patch. Changes since v4: * N/A Changes since v3: * N/A Changes since v2: * N/A Changes since v1: * N/A --- arch/x86/kernel/apic/vector.c | 27 +++ 1 file changed, 27 insertions(+) diff --git a/arch/x86/kernel/apic/vector.c b/arch/x86/kernel/apic/vector.c index 838e220e8860..11f881f45cec 100644 --- a/arch/x86/kernel/apic/vector.c +++ b/arch/x86/kernel/apic/vector.c @@ -245,11 +245,20 @@ assign_vector_locked(struct irq_data *irqd, const struct cpumask *dest) if (apicd->move_in_progress || !hlist_unhashed(&apicd->clist)) return -EBUSY; + if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI) { + cpu = irq_matrix_find_best_cpu(vector_matrix, dest); + apicd->cpu = cpu; + vector = 0; + goto no_vector; + } + vector = irq_matrix_alloc(vector_matrix, dest, resvd, &cpu); trace_vector_alloc(irqd->irq, vector, resvd, vector); if (vector < 0) return vector; apic_update_vector(irqd, vector, cpu); + +no_vector: apic_update_irq_cfg(irqd, vector, cpu); return 0; @@ -321,12 +330,22 @@ assign_managed_vector(struct irq_data *irqd, const struct cpumask *dest) /* set_affinity might call here for nothing */ if (apicd->vector && cpumask_test_cpu(apicd->cpu, vector_searchmask)) return 0; + + if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI) { + cpu = irq_matrix_find_best_cpu_managed(vector_matrix, dest); + apicd->cpu = cpu; + vector = 0; + goto no_vector; + } + vector = irq_matrix_alloc_managed(vector_matrix, vector_searchmask, &cpu); trace_vector_alloc_managed(irqd->irq, vector, vector); if (vector < 0) return vector; apic_update_vector(irqd, vector, cpu); + +no_vector: apic_update_irq_cfg(irqd, vector, cpu); return 0; } @@ -376,6 +395,10 @@ static void x86_vector_deactivate(struct irq_domain *dom, struct irq_data *irqd) if (apicd->has_reserved) return; + /* NMI IRQs do not have associated vectors; nothing to do. */ + if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI) + return; + raw_spin_lock_irqsave(&vector_lock, flags); clear_irq_vector(irqd); if (apicd->can_reserve) @@ -472,6 +495,10 @@ static void vector_free_reserved_and_managed(struct irq_data *irqd) trace_vector_teardown(irqd->irq, apicd->is_managed, apicd->has_reserved); + /* NMI IRQs do not have associated vectors; nothing to do. */ + if (apicd->hw_irq_cfg.delivery_mode == APIC_DELIVERY_MODE_NMI) + return; + if (apicd->has_reserved) irq_matrix_remove_reserved(vector_matrix); if (apicd->is_managed) -- 2.17.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [PATCH] iommu: Reorganize __iommu_attach_group()
> From: Jason Gunthorpe > Sent: Friday, May 6, 2022 12:16 AM > > Call iommu_group_do_attach_device() only from > __iommu_group_attach_domain() which should be used to attach any > domain to > the group. > > The only unique thing __iommu_attach_group() does is to check if the group > is already attached to some caller specified domain. Put this test into > __iommu_group_is_core_domain(), matching the > __iommu_group_attach_core_domain() nomenclature. > > Change the two callers to directly call __iommu_group_attach_domain() and > test __iommu_group_is_core_domain(). > > iommu_attach_device() should trigger a WARN_ON if the group is attached > as > the caller is using the function wrong. > > Suggested-by: "Tian, Kevin" > Signed-off-by: Jason Gunthorpe Reviewed-by: Kevin Tian > --- > drivers/iommu/iommu.c | 42 +++--- > 1 file changed, 19 insertions(+), 23 deletions(-) > > This goes after "iommu: iommu_group_claim_dma_owner() must always > assign a > domain" and simplifies some of the remaining duplication. > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c > index c1bdec807ea381..09d00be887f924 100644 > --- a/drivers/iommu/iommu.c > +++ b/drivers/iommu/iommu.c > @@ -81,9 +81,10 @@ static struct iommu_domain > *__iommu_domain_alloc(struct bus_type *bus, >unsigned type); > static int __iommu_attach_device(struct iommu_domain *domain, >struct device *dev); > -static int __iommu_attach_group(struct iommu_domain *domain, > - struct iommu_group *group); > +static int __iommu_group_attach_domain(struct iommu_group *group, > +struct iommu_domain *new_domain); > static void __iommu_group_attach_core_domain(struct iommu_group > *group); > +static bool __iommu_group_is_core_domain(struct iommu_group *group); > static int iommu_create_device_direct_mappings(struct iommu_group > *group, > struct device *dev); > static struct iommu_group *iommu_group_get_for_dev(struct device *dev); > @@ -1938,10 +1939,11 @@ int iommu_attach_device(struct iommu_domain > *domain, struct device *dev) >*/ > mutex_lock(&group->mutex); > ret = -EINVAL; > - if (iommu_group_device_count(group) != 1) > + if (iommu_group_device_count(group) != 1 || > + WARN_ON(!__iommu_group_is_core_domain(group))) > goto out_unlock; > > - ret = __iommu_attach_group(domain, group); > + ret = __iommu_group_attach_domain(group, domain); > > out_unlock: > mutex_unlock(&group->mutex); > @@ -2032,31 +2034,19 @@ static int iommu_group_do_attach_device(struct > device *dev, void *data) > return __iommu_attach_device(domain, dev); > } > > -static int __iommu_attach_group(struct iommu_domain *domain, > - struct iommu_group *group) > -{ > - int ret; > - > - if (group->domain && group->domain != group->default_domain && > - group->domain != group->blocking_domain) > - return -EBUSY; > - > - ret = __iommu_group_for_each_dev(group, domain, > - iommu_group_do_attach_device); > - if (ret == 0) > - group->domain = domain; > - > - return ret; > -} > - > int iommu_attach_group(struct iommu_domain *domain, struct > iommu_group *group) > { > int ret; > > mutex_lock(&group->mutex); > - ret = __iommu_attach_group(domain, group); > - mutex_unlock(&group->mutex); > + if (!__iommu_group_is_core_domain(group)) { > + ret = -EBUSY; > + goto out_unlock; > + } > > + ret = __iommu_group_attach_domain(group, domain); > +out_unlock: > + mutex_unlock(&group->mutex); > return ret; > } > EXPORT_SYMBOL_GPL(iommu_attach_group); > @@ -2110,6 +2100,12 @@ static int __iommu_group_attach_domain(struct > iommu_group *group, > return 0; > } > > +static bool __iommu_group_is_core_domain(struct iommu_group *group) > +{ > + return !group->domain || group->domain == group->default_domain || > +group->domain == group->blocking_domain; > +} > + > /* > * Put the group's domain back to the appropriate core-owned domain - > either the > * standard kernel-mode DMA configuration or an all-DMA-blocked domain. > > base-commit: f9169ee95787fe691287eeed1a1c269ea72c8fb4 > -- > 2.36.0 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [PATCH v2] iommu: iommu_group_claim_dma_owner() must always assign a domain
> From: Jason Gunthorpe > Sent: Thursday, May 5, 2022 11:33 PM > > > /* > > > - * If the group has been claimed already, do not re-attach the default > > > - * domain. > > > + * New drivers should support default domains and so the > > > detach_dev() op > > > + * will never be called. Otherwise the NULL domain indicates the > > > + * translation for the group should be set so it will work with the > > > > translation should be 'blocked'? > > No, not blocked. > > > > + * platform DMA ops. > > > > I didn't get the meaning of the last sentence. > > It is as discussed with Robin, NULL means to return the group back to > some platform defined translation, and in some cases this means > returning back to work with the platform's DMA ops - presumably if it > isn't using the dma api. This is clearer than the original text. > > > > + /* > > > + * Changing the domain is done by calling attach on the new domain. > > > + * Drivers should implement this so that DMA is always translated by > > > > what does 'this' mean? > > The code below - attach_dev called in a loop for each dev in the group. Yes. > > > > + * either the new, old, or a blocking domain. DMA should never > > > > isn't the blocking domain passed in as the new domain? > > Sometimes. This is a statement about the required > atomicity. New/old/block are all valid translations during the > operation. Identity is not. but new/old/block are not the same type of classifications. A group has an old domain and a new domain at this transition point, and both old/new domains have a domain type (dma, unmanaged, block, identity, etc.). Mixing them together only increases confusion here. > > So, I'm going to leave this patch as-is since it has been tested now > and we need to get the series back into iommu-next ASAP. > Agree. Just hope for some improvements to the code comments. But either way given no more code change: Reviewed-by: Kevin Tian ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH] iommu/vt-d: Increase DMAR_UNITS_SUPPORTED
Increase DMAR_UNITS_SUPPORTED to support 64 sockets with 10 DMAR units each, for a total of 640. If the available hardware exceeds DMAR_UNITS_SUPPORTED (previously set to MAX_IO_APICS, or 128), it causes these messages: "DMAR: Failed to allocate seq_id", "DMAR: Parse DMAR table failure.", and "x2apic: IRQ remapping doesn't support X2APIC mode x2apic disabled"; and the system fails to boot. Signed-off-by: Steve Wahl Reviewed-by: Mike Travis --- Note that we could not find a reason for connecting DMAR_UNITS_SUPPORTED to MAX_IO_APICS as was done previously. Perhaps it seemed like the two would continue to match on earlier processors. There doesn't appear to be kernel code that assumes that the value of one is related to the other. include/linux/dmar.h | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/include/linux/dmar.h b/include/linux/dmar.h index 45e903d84733..9d4867b8f42e 100644 --- a/include/linux/dmar.h +++ b/include/linux/dmar.h @@ -19,7 +19,7 @@ struct acpi_dmar_header; #ifdef CONFIG_X86 -# define DMAR_UNITS_SUPPORTED MAX_IO_APICS +# define DMAR_UNITS_SUPPORTED 640 #else # define DMAR_UNITS_SUPPORTED 64 #endif -- 2.26.2 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
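[Editor's sketch] For context on the quoted failure: the sequence IDs come from a fixed-size bitmap sized by DMAR_UNITS_SUPPORTED. The allocator in drivers/iommu/intel/dmar.c of this era looked roughly like the sketch below (simplified from memory, so treat it as an illustration rather than the exact source):

static DECLARE_BITMAP(dmar_seq_ids, DMAR_UNITS_SUPPORTED);

static int dmar_alloc_seq_id(struct intel_iommu *iommu)
{
	iommu->seq_id = find_first_zero_bit(dmar_seq_ids,
					    DMAR_UNITS_SUPPORTED);
	if (iommu->seq_id >= DMAR_UNITS_SUPPORTED) {
		/*
		 * DMAR unit 129+ lands here with the old 128 limit,
		 * producing "DMAR: Failed to allocate seq_id" and the
		 * DMAR table parse failure that prevents boot.
		 */
		iommu->seq_id = -1;
	} else {
		set_bit(iommu->seq_id, dmar_seq_ids);
		sprintf(iommu->name, "dmar%d", iommu->seq_id);
	}

	return iommu->seq_id;
}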
Re: [PATCH v2] iommu: iommu_group_claim_dma_owner() must always assign a domain
On 2022-05-05 20:27, Jason Gunthorpe wrote: On Thu, May 05, 2022 at 07:56:59PM +0100, Robin Murphy wrote: Ack to that, there are certainly further improvements to consider once we've got known-working code into a released kernel, but let's not get ahead of ourselves just now. Yes please (I'm pretty sure we could get away with a single blocking domain per IOMMU instance, rather than per group, but I deliberately saved that one for later - right now one per group to match default domains is simpler to reason about and less churny in the context of the current ownership patches) I noticed this too.. But I thought the driver can do a better job of this. There is no reason it has to allocate memory to return an IOMMU_DOMAIN_BLOCKED domain - this struct could be a globally allocated singleton for the entire driver and that would be even better memory savings. I was thinking about the domain pointers themselves as well, but on reflection, little systems are most likely to have the simpler IOMMUs with one fixed group per instance anyway, while the kind of systems which end up with a couple of hundred groups can probably bear the cost of a couple of hundred extra pointers. So yeah, probably not worth the code cost of chasing down a group->devices->dev->iommu->blocking_domain in practice. For instance, here is a sketch for the Intel driver based on Baolu's remark that intel_iommu_detach_device() establishes a blocking behavior already on detach_dev (Baolu if you like it can you make a proper patch?): diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index df5c62ecf942b8..317945f6134d42 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -4324,6 +4324,8 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type) return domain; case IOMMU_DOMAIN_IDENTITY: return &si_domain->domain; + case IOMMU_DOMAIN_BLOCKED: + return &intel_blocking_domain; default: return NULL; } @@ -4404,12 +4406,25 @@ static int intel_iommu_attach_device(struct iommu_domain *domain, return domain_add_dev_info(to_dmar_domain(domain), dev); } -static void intel_iommu_detach_device(struct iommu_domain *domain, - struct device *dev) +static int intel_iommu_block_dev(struct iommu_domain *domain, +struct device *dev) { + if (!device_to_iommu(dev, NULL, NULL)) + return -EINVAL; + dmar_remove_one_dev_info(dev); + return 0; } +static struct iommu_domain intel_blocking_domain = { + .type = IOMMU_DOMAIN_BLOCKED, + .ops = &(const struct iommu_domain_ops){ + .attach_dev = intel_iommu_block_dev, + // Fix core code so this works + .free = NULL, TBH unless and until it becomes commonplace, I'd just have drivers provide a no-op .free callback for least surprise, if their .alloc does something funky. But either way, with domain ops now we can so easily do whatever we need, yay! Cheers, Robin. + }, +}; + static int intel_iommu_map(struct iommu_domain *domain, unsigned long iova, phys_addr_t hpa, size_t size, int iommu_prot, gfp_t gfp) @@ -4890,7 +4905,6 @@ const struct iommu_ops intel_iommu_ops = { #endif .default_domain_ops = &(const struct iommu_domain_ops) { .attach_dev = intel_iommu_attach_device, - .detach_dev = intel_iommu_detach_device, .map_pages = intel_iommu_map_pages, .unmap_pages= intel_iommu_unmap_pages, .iotlb_sync_map = intel_iommu_iotlb_sync_map, Thanks, Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
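[Editor's sketch] A minimal illustration of the no-op .free Robin suggests above, assuming the driver-global singleton from Jason's sketch; this is illustrative, not a proper patch:

/*
 * Static singleton: there is nothing to free, but providing the stub
 * avoids relying on core code tolerating a NULL .free.
 */
static void intel_blocking_domain_free(struct iommu_domain *domain)
{
}

The op would then replace the ".free = NULL" placeholder in the sketch, so the core never has to special-case statically allocated domains.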
Re: [PATCH v2] iommu: iommu_group_claim_dma_owner() must always assign a domain
On Thu, May 05, 2022 at 07:56:59PM +0100, Robin Murphy wrote: > Ack to that, there are certainly further improvements to consider once we've > got known-working code into a released kernel, but let's not get ahead of > ourselves just now. Yes please > (I'm pretty sure we could get away with a single blocking domain per IOMMU > instance, rather than per group, but I deliberately saved that one for later > - right now one per group to match default domains is simpler to reason > about and less churny in the context of the current ownership patches) I noticed this too.. But I thought the driver can do a better job of this. There is no reason it has to allocate memory to return an IOMMU_DOMAIN_BLOCKED domain - this struct could be a globally allocated singleton for the entire driver and that would be even better memory savings. For instance, here is a sketch for the Intel driver based on Baolu's remark that intel_iommu_detach_device() establishes a blocking behavior already on detach_dev (Baolu if you like it can you make a proper patch?): diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index df5c62ecf942b8..317945f6134d42 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -4324,6 +4324,8 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type) return domain; case IOMMU_DOMAIN_IDENTITY: return &si_domain->domain; + case IOMMU_DOMAIN_BLOCKED: + return &intel_blocking_domain; default: return NULL; } @@ -4404,12 +4406,25 @@ static int intel_iommu_attach_device(struct iommu_domain *domain, return domain_add_dev_info(to_dmar_domain(domain), dev); } -static void intel_iommu_detach_device(struct iommu_domain *domain, - struct device *dev) +static int intel_iommu_block_dev(struct iommu_domain *domain, +struct device *dev) { + if (!device_to_iommu(dev, NULL, NULL)) + return -EINVAL; + dmar_remove_one_dev_info(dev); + return 0; } +static struct iommu_domain intel_blocking_domain = { + .type = IOMMU_DOMAIN_BLOCKED, + .ops = &(const struct iommu_domain_ops){ + .attach_dev = intel_iommu_block_dev, + // Fix core code so this works + .free = NULL, + }, +}; + static int intel_iommu_map(struct iommu_domain *domain, unsigned long iova, phys_addr_t hpa, size_t size, int iommu_prot, gfp_t gfp) @@ -4890,7 +4905,6 @@ const struct iommu_ops intel_iommu_ops = { #endif .default_domain_ops = &(const struct iommu_domain_ops) { .attach_dev = intel_iommu_attach_device, - .detach_dev = intel_iommu_detach_device, .map_pages = intel_iommu_map_pages, .unmap_pages= intel_iommu_unmap_pages, .iotlb_sync_map = intel_iommu_iotlb_sync_map, Thanks, Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Mon, May 02, 2022 at 05:30:05PM +1000, David Gibson wrote: > > It is a bit more CPU work since maps in the lower range would have to > > be copied over, but conceptually the model matches the HW nesting. > > Ah.. ok. IIUC what you're saying is that the kernel-side IOASes have > fixed windows, but we fake dynamic windows in the userspace > implementation by flipping the devices over to a new IOAS with the new > windows. Is that right? Yes > Where exactly would the windows be specified? My understanding was > that when creating a back-end specific IOAS, that would typically be > for the case where you're using a user / guest managed IO pagetable, > with the backend specifying the format for that. In the ppc case we'd > need to specify the windows, but we'd still need the IOAS_MAP/UNMAP > operations to manage the mappings. The PAPR vIOMMU is > paravirtualized, so all updates come via hypercalls, so there's no > user/guest managed data structure. When the iommu_domain is created I want to have an iommu-driver-specific struct, so PPC can customize its iommu_domain however it likes. > That should work from the point of view of the userspace and guest > side interfaces. It might be fiddly from the point of view of the > back end. The ppc iommu doesn't really have the notion of > configurable domains - instead the address spaces are the hardware or > firmware fixed PEs, so they have a fixed set of devices. At the bare > metal level it's possible to sort of do domains by making the actual > pagetable pointers for several PEs point to a common place. I'm not sure I understand this - a domain is just a storage container for an IO page table, if the HW has IOPTEs then it should be able to have a domain? Making page table pointers point to a common IOPTE tree is exactly what iommu_domains are for - why is that "sort of" for ppc? > However, in the future, nested KVM under PowerVM is likely to be the > norm. In that situation the L1 as well as the L2 only has the > paravirtualized interfaces, which don't have any notion of domains, > only PEs. All updates take place via hypercalls which explicitly > specify a PE (strictly speaking they take a "Logical IO Bus Number" > (LIOBN), but those generally map one to one with PEs), so it can't use > shared pointer tricks either. How do the paravirtualized interfaces deal with the page table? Does it call a map/unmap hypercall instead of providing guest IOPTEs? Assuming yes, I'd expect that: The iommu_domain for nested PPC is just a log of map/unmap hypervisor calls to make. Whenever a new PE is attached to that domain it gets the logged maps replayed to set it up, and when a PE is detached the log is used to unmap everything. It is not perfectly memory efficient - and we could perhaps talk about an API modification to allow re-use of the iommufd datastructure somehow, but I think this is a good logical starting point. The PE would have to be modeled as an iommu_group. > So, here's an alternative set of interfaces that should work for ppc, > maybe you can tell me whether they also work for x86 and others: Fundamentally PPC has to fit into the iommu standard framework of group and domains, we can talk about modifications, but drifting too far away is a big problem. > * Each domain/IOAS has a concept of one or more IOVA windows, which > each have a base address, size, pagesize (granularity) and optionally > other flags/attributes. 
> * This has some bearing on hardware capabilities, but is > primarily a software notion iommu_domain has the aperture, PPC will require extending this to a list of apertures since it is currently only one window. Once a domain is created and attached to a group the aperture should be immutable. > * MAP/UNMAP operations are only permitted within an existing IOVA > window (and with addresses aligned to the window's pagesize) > * This is enforced by software whether or not it is required by > the underlying hardware > * Likewise IOAS_COPY operations are only permitted if the source and > destination windows have compatible attributes Already done, domain's aperture restricts all the iommufd operations > * A newly created kernel-managed IOAS has *no* IOVA windows Already done, the iommufd IOAS has no iommu_domains inside it at creation time. > * A CREATE_WINDOW operation is added > * This takes a size, pagesize/granularity, optional base address > and optional additional attributes > * If any of the specified attributes are incompatible with the > underlying hardware, the operation fails iommu layer has nothing called a window. The closest thing is a domain. I really don't want to try to make a new iommu layer object that is so unique and special to PPC - we have to figure out how to fit PPC into the iommu_domain model with reasonable extensions. > > > > Maybe every device gets a copy of the error notification? > > > > > > Alas, it's
Re: [PATCH v2] iommu: iommu_group_claim_dma_owner() must always assign a domain
On 2022-05-05 16:33, Jason Gunthorpe wrote: On Thu, May 05, 2022 at 10:56:28AM +0000, Tian, Kevin wrote: From: Jason Gunthorpe Sent: Thursday, May 5, 2022 3:09 AM Once the group enters 'owned' mode it can never be assigned back to the default_domain or to a NULL domain. It must always be actively assigned to worth pointing out that a NULL domain is not always translated to DMA blocked on all platforms. That was a wrong assumption before this patch. This is described in a comment below @@ -2040,7 +2037,8 @@ static int __iommu_attach_group(struct iommu_domain *domain, { int ret; - if (group->domain && group->domain != group->default_domain) + if (group->domain && group->domain != group->default_domain && + group->domain != group->blocking_domain) return -EBUSY; ret = __iommu_group_for_each_dev(group, domain, Suppose this can be also replaced by __iommu_group_attach_domain()? It can, but let's do this as a follow-up patch since it is a bit big @@ -2072,38 +2070,68 @@ static int iommu_group_do_detach_device(struct device *dev, void *data) return 0; } -static void __iommu_detach_group(struct iommu_domain *domain, -struct iommu_group *group) +static int __iommu_group_attach_domain(struct iommu_group *group, + struct iommu_domain *new_domain) { int ret; + if (group->domain == new_domain) + return 0; + /* -* If the group has been claimed already, do not re-attach the default -* domain. +* New drivers should support default domains and so the detach_dev() op +* will never be called. Otherwise the NULL domain indicates the +* translation for the group should be set so it will work with the translation should be 'blocked'? No, not blocked. +* platform DMA ops. I didn't get the meaning of the last sentence. It is as discussed with Robin, NULL means to return the group back to some platform defined translation, and in some cases this means returning back to work with the platform's DMA ops - presumably if it isn't using the dma api. + /* +* Changing the domain is done by calling attach on the new domain. +* Drivers should implement this so that DMA is always translated by what does 'this' mean? The code below - attach_dev called in a loop for each dev in the group. +* either the new, old, or a blocking domain. DMA should never isn't the blocking domain passed in as the new domain? Sometimes. This is a statement about the required atomicity. New/old/block are all valid translations during the operation. Identity is not. +* untranslated. +* +* Note that this is called in error unwind paths, attaching to a +* domain that has already been attached cannot fail. this is called in the normal path. Where does error unwind happen? Any place that checks the return and does WARN_ON Also the loop over each dev doesn't error unwind so it assumes that if the first dev succeeds then all subsequent devs will succeed. /** * iommu_group_claim_dma_owner() - Set DMA ownership of a group * @group: The group. @@ -3111,9 +3162,15 @@ int iommu_group_claim_dma_owner(struct iommu_group *group, void *owner) goto unlock_out; } + ret = __iommu_group_alloc_blocking_domain(group); + if (ret) + goto unlock_out; + + ret = __iommu_group_attach_domain(group, + group->blocking_domain); + if (ret) + goto unlock_out; group->owner = owner; Here can use __iommu_group_attach_core_domain() if calling it after setting group->owner. That is obfuscating. This function must always and only attach the blocking_domain. So, I'm going to leave this patch as-is since it has been tested now and we need to get the series back into iommu-next ASAP. 
Ack to that, there are certainly further improvements to consider once we've got known-working code into a released kernel, but let's not get ahead of ourselves just now. (I'm pretty sure we could get away with a single blocking domain per IOMMU instance, rather than per group, but I deliberately saved that one for later - right now one per group to match default domains is simpler to reason about and less churny in the context of the current ownership patches) Cheers, Robin. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v7 5/7] perf tool: Add support for HiSilicon PCIe Tune and Trace device driver
Hi Leo, Thanks for your review, some replies below. On 2022/4/30 15:35, Leo Yan wrote: On Thu, Apr 07, 2022 at 08:58:39PM +0800, Yicong Yang via iommu wrote: From: Qi Liu 'perf record' and 'perf report --dump-raw-trace' supported in this patch. Example usage: Output will contain raw PTT data and its textual representation, such as: 0 0 0x5810 [0x30]: PERF_RECORD_AUXTRACE size: 0x40 offset: 0 ref: 0xa5d50c725 idx: 0 tid: -1 cpu: 0 . . ... HISI PTT data: size 4194304 bytes . : 00 00 00 00 Prefix . 0004: 08 20 00 60 Header DW0 . 0008: ff 02 00 01 Header DW1 . 000c: 20 08 00 00 Header DW2 . 0010: 10 e7 44 ab Header DW3 . 0014: 2a a8 1e 01 Time . 0020: 00 00 00 00 Prefix . 0024: 01 00 00 60 Header DW0 . 0028: 0f 1e 00 01 Header DW1 . 002c: 04 00 00 00 Header DW2 . 0030: 40 00 81 02 Header DW3 . 0034: ee 02 00 00 Time Signed-off-by: Qi Liu Signed-off-by: Yicong Yang --- tools/perf/arch/arm/util/auxtrace.c | 76 +- tools/perf/arch/arm/util/pmu.c| 3 + tools/perf/arch/arm64/util/Build | 2 +- tools/perf/arch/arm64/util/hisi_ptt.c | 195 tools/perf/util/Build | 2 + tools/perf/util/auxtrace.c| 4 + tools/perf/util/auxtrace.h| 1 + tools/perf/util/hisi-ptt-decoder/Build| 1 + .../hisi-ptt-decoder/hisi-ptt-pkt-decoder.c | 170 ++ .../hisi-ptt-decoder/hisi-ptt-pkt-decoder.h | 28 +++ tools/perf/util/hisi_ptt.c| 218 ++ tools/perf/util/hisi_ptt.h| 28 +++ It's good to divide the big patch into smaller patches, e.g. one patch is to add PTT auxtrace (so mainly for perf record), and the second patch is to add PTT decoder for perf decoding. got it, I'll do this, thanks. 12 files changed, 724 insertions(+), 4 deletions(-) create mode 100644 tools/perf/arch/arm64/util/hisi_ptt.c create mode 100644 tools/perf/util/hisi-ptt-decoder/Build create mode 100644 tools/perf/util/hisi-ptt-decoder/hisi-ptt-pkt-decoder.c create mode 100644 tools/perf/util/hisi-ptt-decoder/hisi-ptt-pkt-decoder.h create mode 100644 tools/perf/util/hisi_ptt.c create mode 100644 tools/perf/util/hisi_ptt.h diff --git a/tools/perf/arch/arm/util/auxtrace.c b/tools/perf/arch/arm/util/auxtrace.c index 5fc6a2a3dbc5..393f5757c039 100644 --- a/tools/perf/arch/arm/util/auxtrace.c +++ b/tools/perf/arch/arm/util/auxtrace.c @@ -4,9 +4,11 @@ * Author: Mathieu Poirier */ [...] + + rewinddir(dir); + while ((dent = readdir(dir))) { + if (strstr(dent->d_name, HISI_PTT_PMU_NAME) && idx < (*nr_ptts)) { + hisi_ptt_pmus[idx] = perf_pmu__find(dent->d_name); + if (hisi_ptt_pmus[idx]) { + pr_debug2("%s %d: hisi_ptt_pmu %d type %d name %s\n", + __func__, __LINE__, idx, + hisi_ptt_pmus[idx]->type, + hisi_ptt_pmus[idx]->name); + idx++; Indentation for "idx++" is not right. will fix this, thanks. + } + Redundant new line. will fix this, thanks. 
+ } + } + +out: + closedir(dir); + return hisi_ptt_pmus; +} + struct auxtrace_record *auxtrace_record__init(struct evlist *evlist, int *err) { @@ -57,8 +112,12 @@ struct auxtrace_record struct evsel *evsel; bool found_etm = false; struct perf_pmu *found_spe = NULL; + struct perf_pmu *found_ptt = NULL; struct perf_pmu **arm_spe_pmus = NULL; + struct perf_pmu **hisi_ptt_pmus = NULL; + int nr_spes = 0; + int nr_ptts = 0; int i = 0; if (!evlist) @@ -66,13 +125,14 @@ struct auxtrace_record cs_etm_pmu = perf_pmu__find(CORESIGHT_ETM_PMU_NAME); arm_spe_pmus = find_all_arm_spe_pmus(_spes, err); + hisi_ptt_pmus = find_all_hisi_ptt_pmus(_ptts, err); evlist__for_each_entry(evlist, evsel) { if (cs_etm_pmu && evsel->core.attr.type == cs_etm_pmu->type) found_etm = true; - if (!nr_spes || found_spe) + if ((!nr_spes || found_spe) && (!nr_ptts || found_ptt)) continue; for (i = 0; i < nr_spes; i++) { @@ -81,11 +141,18 @@ struct auxtrace_record break;
[PATCH] iommu: Reorganize __iommu_attach_group()
Call iommu_group_do_attach_device() only from __iommu_group_attach_domain() which should be used to attach any domain to the group. The only unique thing __iommu_attach_group() does is to check if the group is already attached to some caller specified domain. Put this test into __iommu_group_is_core_domain(), matching the __iommu_group_attach_core_domain() nomenclature. Change the two callers to directly call __iommu_group_attach_domain() and test __iommu_group_is_core_domain(). iommu_attach_device() should trigger a WARN_ON if the group is attached as the caller is using the function wrong. Suggested-by: "Tian, Kevin" Signed-off-by: Jason Gunthorpe --- drivers/iommu/iommu.c | 42 +++--- 1 file changed, 19 insertions(+), 23 deletions(-) This goes after "iommu: iommu_group_claim_dma_owner() must always assign a domain" and simplifies some of the remaining duplication. diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index c1bdec807ea381..09d00be887f924 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -81,9 +81,10 @@ static struct iommu_domain *__iommu_domain_alloc(struct bus_type *bus, unsigned type); static int __iommu_attach_device(struct iommu_domain *domain, struct device *dev); -static int __iommu_attach_group(struct iommu_domain *domain, - struct iommu_group *group); +static int __iommu_group_attach_domain(struct iommu_group *group, + struct iommu_domain *new_domain); static void __iommu_group_attach_core_domain(struct iommu_group *group); +static bool __iommu_group_is_core_domain(struct iommu_group *group); static int iommu_create_device_direct_mappings(struct iommu_group *group, struct device *dev); static struct iommu_group *iommu_group_get_for_dev(struct device *dev); @@ -1938,10 +1939,11 @@ int iommu_attach_device(struct iommu_domain *domain, struct device *dev) */ mutex_lock(&group->mutex); ret = -EINVAL; - if (iommu_group_device_count(group) != 1) + if (iommu_group_device_count(group) != 1 || + WARN_ON(!__iommu_group_is_core_domain(group))) goto out_unlock; - ret = __iommu_attach_group(domain, group); + ret = __iommu_group_attach_domain(group, domain); out_unlock: mutex_unlock(&group->mutex); @@ -2032,31 +2034,19 @@ static int iommu_group_do_attach_device(struct device *dev, void *data) return __iommu_attach_device(domain, dev); } -static int __iommu_attach_group(struct iommu_domain *domain, - struct iommu_group *group) -{ - int ret; - - if (group->domain && group->domain != group->default_domain && - group->domain != group->blocking_domain) - return -EBUSY; - - ret = __iommu_group_for_each_dev(group, domain, -iommu_group_do_attach_device); - if (ret == 0) - group->domain = domain; - - return ret; -} - int iommu_attach_group(struct iommu_domain *domain, struct iommu_group *group) { int ret; mutex_lock(&group->mutex); - ret = __iommu_attach_group(domain, group); - mutex_unlock(&group->mutex); + if (!__iommu_group_is_core_domain(group)) { + ret = -EBUSY; + goto out_unlock; + } + ret = __iommu_group_attach_domain(group, domain); +out_unlock: + mutex_unlock(&group->mutex); return ret; } EXPORT_SYMBOL_GPL(iommu_attach_group); @@ -2110,6 +2100,12 @@ static int __iommu_group_attach_domain(struct iommu_group *group, return 0; } +static bool __iommu_group_is_core_domain(struct iommu_group *group) +{ + return !group->domain || group->domain == group->default_domain || + group->domain == group->blocking_domain; +} + /* * Put the group's domain back to the appropriate core-owned domain - either the * standard kernel-mode DMA configuration or an all-DMA-blocked domain. 
base-commit: f9169ee95787fe691287eeed1a1c269ea72c8fb4 -- 2.36.0 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v2] iommu: iommu_group_claim_dma_owner() must always assign a domain
On Thu, May 05, 2022 at 10:56:28AM +0000, Tian, Kevin wrote: > > From: Jason Gunthorpe > > Sent: Thursday, May 5, 2022 3:09 AM > > > > Once the group enters 'owned' mode it can never be assigned back to the > > default_domain or to a NULL domain. It must always be actively assigned to > > worth pointing out that a NULL domain is not always translated to DMA > blocked on all platforms. That was a wrong assumption before this patch. This is described in a comment below > > @@ -2040,7 +2037,8 @@ static int __iommu_attach_group(struct > > iommu_domain *domain, > > { > > int ret; > > > > - if (group->domain && group->domain != group->default_domain) > > + if (group->domain && group->domain != group->default_domain && > > + group->domain != group->blocking_domain) > > return -EBUSY; > > > > ret = __iommu_group_for_each_dev(group, domain, > > Suppose this can be also replaced by __iommu_group_attach_domain()? It can, but let's do this as a follow-up patch since it is a bit big > > @@ -2072,38 +2070,68 @@ static int > > iommu_group_do_detach_device(struct device *dev, void *data) > > return 0; > > } > > > > -static void __iommu_detach_group(struct iommu_domain *domain, > > -struct iommu_group *group) > > +static int __iommu_group_attach_domain(struct iommu_group *group, > > + struct iommu_domain *new_domain) > > { > > int ret; > > > > + if (group->domain == new_domain) > > + return 0; > > + > > /* > > -* If the group has been claimed already, do not re-attach the default > > -* domain. > > +* New drivers should support default domains and so the > > detach_dev() op > > +* will never be called. Otherwise the NULL domain indicates the > > +* translation for the group should be set so it will work with the > > translation should be 'blocked'? No, not blocked. > > +* platform DMA ops. > > I didn't get the meaning of the last sentence. It is as discussed with Robin, NULL means to return the group back to some platform defined translation, and in some cases this means returning back to work with the platform's DMA ops - presumably if it isn't using the dma api. > > + /* > > +* Changing the domain is done by calling attach on the new domain. > > +* Drivers should implement this so that DMA is always translated by > > what does 'this' mean? The code below - attach_dev called in a loop for each dev in the group. > > +* either the new, old, or a blocking domain. DMA should never > > isn't the blocking domain passed in as the new domain? Sometimes. This is a statement about the required atomicity. New/old/block are all valid translations during the operation. Identity is not. > > +* untranslated. > > +* > > +* Note that this is called in error unwind paths, attaching to a > > +* domain that has already been attached cannot fail. > > this is called in the normal path. Where does error unwind happen? Any place that checks the return and does WARN_ON Also the loop over each dev doesn't error unwind so it assumes that if the first dev succeeds then all subsequent devs will succeed. > > /** > > * iommu_group_claim_dma_owner() - Set DMA ownership of a group > > * @group: The group. 
> > @@ -3111,9 +3162,15 @@ int iommu_group_claim_dma_owner(struct > > iommu_group *group, void *owner) > > goto unlock_out; > > } > > > > + ret = __iommu_group_alloc_blocking_domain(group); > > + if (ret) > > + goto unlock_out; > > + > > + ret = __iommu_group_attach_domain(group, > > + group->blocking_domain); > > + if (ret) > > + goto unlock_out; > > group->owner = owner; > > Here can use __iommu_group_attach_core_domain() if calling it after > setting group->owner. That is obfuscating. This function must always and only attach the blocking_domain. So, I'm going to leave this patch as-is since it has been tested now and we need to get the series back into iommu-next ASAP. Thanks, Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 0/3] iommu/arm-smmu: Support Tegra234 SMMU
On Thu, May 05, 2022 at 03:53:08PM +0100, Will Deacon wrote: > On Thu, May 05, 2022 at 04:15:29PM +0200, Thierry Reding wrote: > > On Fri, Apr 29, 2022 at 10:22:40AM +0200, Thierry Reding wrote: > > > From: Thierry Reding > > > > > > Hi Joerg, > > > > > > this is essentially a resend of v2 with a Acked-by:s from Robin and Will > > > added. These have been on the list for quite a while now, but apparently > > > there was a misunderstanding, so neither you nor Will picked this up. > > > > > > Since Will acked these, I think it's probably best for you to pick these > > > up directly. If not, let me know and I'll work with Will to merge via > > > the ARM SMMU tree. > > > > > > Thanks, > > > Thierry > > > > > > Thierry Reding (3): > > > dt-bindings: arm-smmu: Document nvidia,memory-controller property > > > dt-bindings: arm-smmu: Add compatible for Tegra234 SOC > > > iommu/arm-smmu: Support Tegra234 SMMU > > > > > > .../devicetree/bindings/iommu/arm,smmu.yaml | 23 +-- > > > drivers/iommu/arm/arm-smmu/arm-smmu-impl.c| 3 ++- > > > 2 files changed, 23 insertions(+), 3 deletions(-) > > > > Joerg, > > > > anything left to do on this from your perspective, or can this go into > > v5.19? > > I'll pick them up in the Arm SMMU queue, as there are some other SMMU > patches kicking around and we may as well keep them all together. Sounds good, thanks! Thierry ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3] iommu/mediatek: Fix NULL pointer dereference when printing dev_name
Il 05/05/22 15:27, Miles Chen ha scritto: When larbdev is NULL (in the case I hit, the node is incorrectly set iommus = < NUM>), it will cause device_link_add() fail and kernel crashes when we try to print dev_name(larbdev). Let's fail the probe if a larbdev is NULL to avoid invalid inputs from dts. It should work for normal correct setting and avoid the crash caused by my incorrect setting. Error log: [ 18.189042][ T301] Unable to handle kernel NULL pointer dereference at virtual address 0050 ... [ 18.344519][ T301] pstate: a045 (NzCv daif +PAN -UAO) [ 18.345213][ T301] pc : mtk_iommu_probe_device+0xf8/0x118 [mtk_iommu] [ 18.346050][ T301] lr : mtk_iommu_probe_device+0xd0/0x118 [mtk_iommu] [ 18.346884][ T301] sp : ffc00a5635e0 [ 18.347392][ T301] x29: ffc00a5635e0 x28: ffd44a46c1d8 [ 18.348156][ T301] x27: ff80c39a8000 x26: ffd44a80cc38 [ 18.348917][ T301] x25: x24: ffd44a80cc38 [ 18.349677][ T301] x23: ffd44e4da4c6 x22: ffd44a80cc38 [ 18.350438][ T301] x21: ff80cecd1880 x20: [ 18.351198][ T301] x19: ff80c439f010 x18: ffc00a50d0c0 [ 18.351959][ T301] x17: x16: 0004 [ 18.352719][ T301] x15: 0004 x14: ffd44eb5d420 [ 18.353480][ T301] x13: 0ad2 x12: 0003 [ 18.354241][ T301] x11: fad2 x10: c000fad2 [ 18.355003][ T301] x9 : a0d288d8d7142d00 x8 : a0d288d8d7142d00 [ 18.355763][ T301] x7 : ffd44c2bc640 x6 : [ 18.356524][ T301] x5 : 0080 x4 : 0001 [ 18.357284][ T301] x3 : x2 : 0005 [ 18.358045][ T301] x1 : x0 : [ 18.360208][ T301] Hardware name: MT6873 (DT) [ 18.360771][ T301] Call trace: [ 18.361168][ T301] dump_backtrace+0xf8/0x1f0 [ 18.361737][ T301] dump_stack_lvl+0xa8/0x11c [ 18.362305][ T301] dump_stack+0x1c/0x2c [ 18.362816][ T301] mrdump_common_die+0x184/0x40c [mrdump] [ 18.363575][ T301] ipanic_die+0x24/0x38 [mrdump] [ 18.364230][ T301] atomic_notifier_call_chain+0x128/0x2b8 [ 18.364937][ T301] die+0x16c/0x568 [ 18.365394][ T301] __do_kernel_fault+0x1e8/0x214 [ 18.365402][ T301] do_page_fault+0xb8/0x678 [ 18.366934][ T301] do_translation_fault+0x48/0x64 [ 18.368645][ T301] do_mem_abort+0x68/0x148 [ 18.368652][ T301] el1_abort+0x40/0x64 [ 18.368660][ T301] el1h_64_sync_handler+0x54/0x88 [ 18.368668][ T301] el1h_64_sync+0x68/0x6c [ 18.368673][ T301] mtk_iommu_probe_device+0xf8/0x118 [mtk_iommu] ... Cc: Robin Murphy Cc: Yong Wu Reported-by: kernel test robot Fixes: 635319a4a744 ("media: iommu/mediatek: Add device_link between the consumer and the larb devices") Signed-off-by: Miles Chen Reviewed-by: AngeloGioacchino Del Regno ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 0/3] iommu/arm-smmu: Support Tegra234 SMMU
On Thu, May 05, 2022 at 04:15:29PM +0200, Thierry Reding wrote: > On Fri, Apr 29, 2022 at 10:22:40AM +0200, Thierry Reding wrote: > > From: Thierry Reding > > > > Hi Joerg, > > > > this is essentially a resend of v2 with a Acked-by:s from Robin and Will > > added. These have been on the list for quite a while now, but apparently > > there was a misunderstanding, so neither you nor Will picked this up. > > > > Since Will acked these, I think it's probably best for you to pick these > > up directly. If not, let me know and I'll work with Will to merge via > > the ARM SMMU tree. > > > > Thanks, > > Thierry > > > > Thierry Reding (3): > > dt-bindings: arm-smmu: Document nvidia,memory-controller property > > dt-bindings: arm-smmu: Add compatible for Tegra234 SOC > > iommu/arm-smmu: Support Tegra234 SMMU > > > > .../devicetree/bindings/iommu/arm,smmu.yaml | 23 +-- > > drivers/iommu/arm/arm-smmu/arm-smmu-impl.c| 3 ++- > > 2 files changed, 23 insertions(+), 3 deletions(-) > > Joerg, > > anything left to do on this from your perspective, or can this go into > v5.19? I'll pick them up in the Arm SMMU queue, as there are some other SMMU patches kicking around and we may as well keep them all together. Will ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 0/3] iommu/arm-smmu: Support Tegra234 SMMU
On Fri, Apr 29, 2022 at 10:22:40AM +0200, Thierry Reding wrote: > From: Thierry Reding > > Hi Joerg, > > this is essentially a resend of v2 with a Acked-by:s from Robin and Will > added. These have been on the list for quite a while now, but apparently > there was a misunderstanding, so neither you nor Will picked this up. > > Since Will acked these, I think it's probably best for you to pick these > up directly. If not, let me know and I'll work with Will to merge via > the ARM SMMU tree. > > Thanks, > Thierry > > Thierry Reding (3): > dt-bindings: arm-smmu: Document nvidia,memory-controller property > dt-bindings: arm-smmu: Add compatible for Tegra234 SOC > iommu/arm-smmu: Support Tegra234 SMMU > > .../devicetree/bindings/iommu/arm,smmu.yaml | 23 +-- > drivers/iommu/arm/arm-smmu/arm-smmu-impl.c| 3 ++- > 2 files changed, 23 insertions(+), 3 deletions(-) Joerg, anything left to do on this from your perspective, or can this go into v5.19? Thanks, Thierry ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
On Thu, May 05, 2022 at 07:40:37AM +, Tian, Kevin wrote: > In concept this is an iommu property instead of a domain property. Not really, domains shouldn't be changing behaviors once they are created. If a domain supports dirty tracking and I attach a new device then it still must support dirty tracking. I suppose we may need something here because we need to control when domains are re-used if they don't have the right properties in case the system iommu's are discontiguous somehow. ie iommufd should be able to assert that dirty tracking is desired and an existing non-dirty tracking capable domain will not be automatically re-used. We don't really have the right infrastructure to do this currently. > From this angle IMHO it's more reasonable to report this IOMMU > property to userspace via a device capability. If all devices attached > to a hwpt claim IOMMU dirty tracking capability, the user can call > set_dirty_tracking() on the hwpt object. Inherent domain properties need to be immutable or, at least one-way, like enforced coherent, or it just all stops making any kind of sense. > Once dirty tracking is enabled on a hwpt, further attaching a device > which doesn't claim this capability is simply rejected. It would be OK to do as enforced coherent does as flip a domain permanently into dirty-tracking enabled, or specify a flag at domain creation time. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
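[Editor's sketch] A sketch of the attach-time check implied above, in the style of the one-way enforced-coherency flip Jason describes. Every name here is hypothetical and assumed for illustration, not an existing kernel API:

struct example_domain {
	struct iommu_domain domain;
	bool dirty_tracking;	/* one-way: never cleared once set */
};

/* Assumed helpers for the sketch. */
static bool example_dev_supports_dirty(struct device *dev);
static int example_do_attach(struct example_domain *edom, struct device *dev);

static int example_attach_dev(struct iommu_domain *domain, struct device *dev)
{
	struct example_domain *edom =
		container_of(domain, struct example_domain, domain);

	/*
	 * Once the domain has been flipped into dirty tracking, reject
	 * devices whose IOMMU cannot track dirty bits instead of
	 * silently degrading the domain's behavior.
	 */
	if (edom->dirty_tracking && !example_dev_supports_dirty(dev))
		return -EINVAL;

	return example_do_attach(edom, dev);
}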
Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
On Thu, May 05, 2022 at 11:03:18AM +0000, Tian, Kevin wrote: > iiuc the purpose of 'write-protection' here is to capture in-fly dirty pages > in the said race window until the unmap and IOTLB invalidation are completed. No, the purpose is to perform "unmap" without destroying the dirty bit in the process. If an IOMMU architecture has a way to render the page unmapped and flush back the dirty bit/not destroy then it doesn't require a write protect pass. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
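[Editor's sketch] To make the distinction concrete: the alternative to a write-protect pass is an unmap primitive that hands the dirty state back instead of destroying it. The op below is purely illustrative; the name is assumed for this sketch and is not a merged kernel interface:

struct example_domain_ops {
	/*
	 * Unmap [iova, iova + size) and set *dirty if any torn-down
	 * IOPTE was dirty, flushing the bit back as part of the same
	 * operation so no write in the race window is lost. Returns
	 * the number of bytes unmapped.
	 */
	size_t (*unmap_read_dirty)(struct iommu_domain *domain,
				   unsigned long iova, size_t size,
				   bool *dirty);
};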
Re: [PATCH v5 10/12] iommu: Prepare IOMMU domain for IOPF
Hi Baolu, On Thu, May 05, 2022 at 04:31:38PM +0800, Baolu Lu wrote: > On 2022/5/4 02:20, Jean-Philippe Brucker wrote: > > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c > > > index 7cae631c1baa..33449523afbe 100644 > > > --- a/drivers/iommu/iommu.c > > > +++ b/drivers/iommu/iommu.c > > > @@ -3174,3 +3174,24 @@ void iommu_detach_device_pasid(struct iommu_domain *domain, > > > iommu_group_put(group); > > > } > > > + > > > +struct iommu_domain *iommu_get_domain_for_dev_pasid(struct device *dev, > > > + ioasid_t pasid) > > > +{ > > > + struct iommu_domain *domain; > > > + struct iommu_group *group; > > > + > > > + if (!pasid_valid(pasid)) > > > + return NULL; > > > + > > > + group = iommu_group_get(dev); > > > + if (!group) > > > + return NULL; > > > + > > > + mutex_lock(&group->mutex); > > Unfortunately this still causes the deadlock when unbind() flushes the > > IOPF queue while holding the group mutex. > > Sorry, I didn't get your point here. > > Do you mean unbind() could hold group mutex before calling this helper? > The group mutex is only available in iommu.c. The unbind() has no means > to hold this lock. Or, I missed anything? I wasn't clear, it's iommu_detach_device_pasid() that holds the group->mutex: iommu_sva_unbind_device() | iommu_detach_device_pasid() | mutex_lock(&group->mutex) | domain->ops->detach_dev_pasid() | iopf_handle_group() iopf_queue_flush_dev() | iommu_get_domain_for_dev_pasid() ... wait for IOPF work | mutex_lock(&group->mutex) |... deadlock Thanks, Jean > Best regards, > baolu > > > > If we make this function private to IOPF, then we can get rid of this > > mutex_lock(). It's OK because: > > > > * xarray protects its internal state with RCU, so we can call > >xa_load() outside the lock. > > > > * The domain obtained from xa_load is finalized. Its content is valid > >because xarray stores the domain using rcu_assign_pointer(), which has a > >release memory barrier, which pairs with data dependencies in IOPF > >(domain->sva_ioas etc). > > > >We'll need to be careful about this when allowing other users to install > >a fault handler. Should be fine as long as the handler and data are > >installed before the domain is added to pasid_array. > > > > * We know the domain is valid the whole time IOPF is using it, because > >unbind() waits for pending faults. > > > > We just need a comment explaining the last point, something like: > > > > /* > > * Safe to fetch outside the group mutex because: > > * - xarray protects its internal state with RCU > > * - the domain obtained is either NULL or fully formed > > * - the IOPF work is the only caller and is flushed before the > > * domain is freed. > > */ > > Agreed. The mutex is needed only when domain could possibly be freed > before unbind(). In that case, we need this mutex and get a reference > from the domain. As we have dropped the domain user reference, this > lock is unnecessary. Thanks, Jean > + domain = xa_load(&group->pasid_array, pasid); > + mutex_unlock(&group->mutex); > + iommu_group_put(group); > + > + return domain; > +} Best regards, baolu ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
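[Editor's sketch] Putting the two points in this thread together, the deadlock-free shape suggested above would look roughly like the sketch below: a lookup private to IOPF, with no group->mutex, carrying the justification comment Jean proposes. It assumes the pasid_array added by this series and is not the final patch:

/*
 * Safe to fetch outside the group mutex because:
 * - xarray protects its internal state with RCU
 * - the domain obtained is either NULL or fully formed
 * - the IOPF work is the only caller and is flushed before the
 *   domain is freed.
 */
static struct iommu_domain *
iopf_get_domain_for_dev_pasid(struct device *dev, ioasid_t pasid)
{
	struct iommu_domain *domain;
	struct iommu_group *group;

	if (!pasid_valid(pasid))
		return NULL;

	group = iommu_group_get(dev);
	if (!group)
		return NULL;

	domain = xa_load(&group->pasid_array, pasid);
	iommu_group_put(group);

	return domain;
}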
[PATCH v3] iommu/mediatek: Fix NULL pointer dereference when printing dev_name
When larbdev is NULL (in the case I hit, the node is incorrectly set iommus = < NUM>), it will cause device_link_add() fail and kernel crashes when we try to print dev_name(larbdev). Let's fail the probe if a larbdev is NULL to avoid invalid inputs from dts. It should work for normal correct setting and avoid the crash caused by my incorrect setting. Error log: [ 18.189042][ T301] Unable to handle kernel NULL pointer dereference at virtual address 0050 ... [ 18.344519][ T301] pstate: a045 (NzCv daif +PAN -UAO) [ 18.345213][ T301] pc : mtk_iommu_probe_device+0xf8/0x118 [mtk_iommu] [ 18.346050][ T301] lr : mtk_iommu_probe_device+0xd0/0x118 [mtk_iommu] [ 18.346884][ T301] sp : ffc00a5635e0 [ 18.347392][ T301] x29: ffc00a5635e0 x28: ffd44a46c1d8 [ 18.348156][ T301] x27: ff80c39a8000 x26: ffd44a80cc38 [ 18.348917][ T301] x25: x24: ffd44a80cc38 [ 18.349677][ T301] x23: ffd44e4da4c6 x22: ffd44a80cc38 [ 18.350438][ T301] x21: ff80cecd1880 x20: [ 18.351198][ T301] x19: ff80c439f010 x18: ffc00a50d0c0 [ 18.351959][ T301] x17: x16: 0004 [ 18.352719][ T301] x15: 0004 x14: ffd44eb5d420 [ 18.353480][ T301] x13: 0ad2 x12: 0003 [ 18.354241][ T301] x11: fad2 x10: c000fad2 [ 18.355003][ T301] x9 : a0d288d8d7142d00 x8 : a0d288d8d7142d00 [ 18.355763][ T301] x7 : ffd44c2bc640 x6 : [ 18.356524][ T301] x5 : 0080 x4 : 0001 [ 18.357284][ T301] x3 : x2 : 0005 [ 18.358045][ T301] x1 : x0 : [ 18.360208][ T301] Hardware name: MT6873 (DT) [ 18.360771][ T301] Call trace: [ 18.361168][ T301] dump_backtrace+0xf8/0x1f0 [ 18.361737][ T301] dump_stack_lvl+0xa8/0x11c [ 18.362305][ T301] dump_stack+0x1c/0x2c [ 18.362816][ T301] mrdump_common_die+0x184/0x40c [mrdump] [ 18.363575][ T301] ipanic_die+0x24/0x38 [mrdump] [ 18.364230][ T301] atomic_notifier_call_chain+0x128/0x2b8 [ 18.364937][ T301] die+0x16c/0x568 [ 18.365394][ T301] __do_kernel_fault+0x1e8/0x214 [ 18.365402][ T301] do_page_fault+0xb8/0x678 [ 18.366934][ T301] do_translation_fault+0x48/0x64 [ 18.368645][ T301] do_mem_abort+0x68/0x148 [ 18.368652][ T301] el1_abort+0x40/0x64 [ 18.368660][ T301] el1h_64_sync_handler+0x54/0x88 [ 18.368668][ T301] el1h_64_sync+0x68/0x6c [ 18.368673][ T301] mtk_iommu_probe_device+0xf8/0x118 [mtk_iommu] ... Cc: Robin Murphy Cc: Yong Wu Reported-by: kernel test robot Fixes: 635319a4a744 ("media: iommu/mediatek: Add device_link between the consumer and the larb devices") Signed-off-by: Miles Chen --- Change since v2 probe fail if larbdev is NULL so we do not have to worry about release logic Change since v1 fix a build warning reported by kernel test robot https://lore.kernel.org/lkml/202204231446.iykdz674-...@intel.com/ --- drivers/iommu/mtk_iommu.c| 6 ++ drivers/iommu/mtk_iommu_v1.c | 7 +++ 2 files changed, 13 insertions(+) diff --git a/drivers/iommu/mtk_iommu.c b/drivers/iommu/mtk_iommu.c index 6fd75a60abd6..155acfbce44f 100644 --- a/drivers/iommu/mtk_iommu.c +++ b/drivers/iommu/mtk_iommu.c @@ -572,6 +572,9 @@ static struct iommu_device *mtk_iommu_probe_device(struct device *dev) * All the ports in each a device should be in the same larbs. 
*/ larbid = MTK_M4U_TO_LARB(fwspec->ids[0]); + if (larbid >= MTK_LARB_NR_MAX) + return ERR_PTR(-EINVAL); + for (i = 1; i < fwspec->num_ids; i++) { larbidx = MTK_M4U_TO_LARB(fwspec->ids[i]); if (larbid != larbidx) { @@ -581,6 +584,9 @@ static struct iommu_device *mtk_iommu_probe_device(struct device *dev) } } larbdev = data->larb_imu[larbid].dev; + if (!larbdev) + return ERR_PTR(-EINVAL); + link = device_link_add(dev, larbdev, DL_FLAG_PM_RUNTIME | DL_FLAG_STATELESS); if (!link) diff --git a/drivers/iommu/mtk_iommu_v1.c b/drivers/iommu/mtk_iommu_v1.c index ecff800656e6..74563f689fbd 100644 --- a/drivers/iommu/mtk_iommu_v1.c +++ b/drivers/iommu/mtk_iommu_v1.c @@ -80,6 +80,7 @@ /* MTK generation one iommu HW only support 4K size mapping */ #define MT2701_IOMMU_PAGE_SHIFT12 #define MT2701_IOMMU_PAGE_SIZE (1UL << MT2701_IOMMU_PAGE_SHIFT) +#define MT2701_LARB_NR_MAX 3 /* * MTK m4u support 4GB iova address space, and only support 4K page @@ -457,6 +458,9 @@ static struct iommu_device *mtk_iommu_probe_device(struct device *dev) /* Link the consumer device with the smi-larb device(supplier) */ larbid = mt2701_m4u_to_larb(fwspec->ids[0]); + if (larbid >= MT2701_LARB_NR_MAX) + return ERR_PTR(-EINVAL); +
Re: [PATCH 1/7] dt-bindings: gpio: renesas,rcar-gpio: R-Car V3U is R-Car Gen4
On Mon, May 2, 2022 at 3:35 PM Geert Uytterhoeven wrote: > > Despite the name, R-Car V3U is the first member of the R-Car Gen4 > family. Hence move its compatible value to the R-Car Gen4 section. > > Signed-off-by: Geert Uytterhoeven > --- > Documentation/devicetree/bindings/gpio/renesas,rcar-gpio.yaml | 4 +--- > 1 file changed, 1 insertion(+), 3 deletions(-) > > diff --git a/Documentation/devicetree/bindings/gpio/renesas,rcar-gpio.yaml > b/Documentation/devicetree/bindings/gpio/renesas,rcar-gpio.yaml > index 0681a4790cd62e23..75e5da6a7cc04bbd 100644 > --- a/Documentation/devicetree/bindings/gpio/renesas,rcar-gpio.yaml > +++ b/Documentation/devicetree/bindings/gpio/renesas,rcar-gpio.yaml > @@ -48,11 +48,9 @@ properties: >- renesas,gpio-r8a77995 # R-Car D3 >- const: renesas,rcar-gen3-gpio # R-Car Gen3 or RZ/G2 > > - - items: > - - const: renesas,gpio-r8a779a0 # R-Car V3U > - >- items: >- enum: > + - renesas,gpio-r8a779a0 # R-Car V3U >- renesas,gpio-r8a779f0 # R-Car S4-8 >- const: renesas,rcar-gen4-gpio # R-Car Gen4 > > -- > 2.25.1 > Acked-by: Bartosz Golaszewski ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v2 4/4] iommu/vt-d: Remove hard coding PGSNP bit in PASID entries
On 2022/5/5 16:46, Tian, Kevin wrote: From: Lu Baolu Sent: Thursday, May 5, 2022 9:07 AM As enforce_cache_coherency has been introduced into the iommu_domain_ops, the kernel component which owns the iommu domain is able to opt-in its requirement for force snooping support. The iommu driver has no need to hard code the page snoop control bit in the PASID table entries anymore. Signed-off-by: Lu Baolu Reviewed-by: Kevin Tian , with one nit: --- drivers/iommu/intel/pasid.c | 3 --- 1 file changed, 3 deletions(-) diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c index 41a0e3b02c79..0abfa7fc7fb0 100644 --- a/drivers/iommu/intel/pasid.c +++ b/drivers/iommu/intel/pasid.c @@ -710,9 +710,6 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu, pasid_set_fault_enable(pte); pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap)); Probably in a separate patch but above should really be renamed to pasid_set_page_walk_snoop(). Yeah! Need a cleanup here. Above name is confusing. - if (domain->domain.type == IOMMU_DOMAIN_UNMANAGED) - pasid_set_pgsnp(pte); - /* * Since it is a second level only translation setup, we should * set SRE bit as well (addresses are expected to be GPAs). -- 2.25.1 Best regards, baolu ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
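[Editor's sketch] For reference, the opt-in path this patch relies on looks roughly like the sketch below from the domain owner's side. Only the enforce_cache_coherency op itself comes from the series; the wrapper name is illustrative:

/*
 * Ask the IOMMU driver to force snooping (e.g. set PGSNP) for this
 * domain; returns false if the hardware cannot guarantee it.
 */
static bool example_enforce_coherency(struct iommu_domain *domain)
{
	if (!domain->ops->enforce_cache_coherency)
		return false;

	return domain->ops->enforce_cache_coherency(domain);
}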
Re: [PATCH v7 2/7] hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device
Hi Leo, Thanks for the comments. Some questions and replies below. On 2022/4/30 0:00, Leo Yan wrote: > On Thu, Apr 07, 2022 at 08:58:36PM +0800, Yicong Yang via iommu wrote: >> HiSilicon PCIe tune and trace device(PTT) is a PCIe Root Complex integrated >> Endpoint(RCiEP) device, providing the capability to dynamically monitor and >> tune the PCIe traffic, and trace the TLP headers. >> >> Add the driver for the device to enable the trace function. Register PMU >> device of PTT trace, then users can use trace through perf command. The >> driver makes use of perf AUX trace and support following events to >> configure the trace: >> >> - filter: select Root port or Endpoint to trace >> - type: select the type of traced TLP headers >> - direction: select the direction of traced TLP headers >> - format: select the data format of the traced TLP headers >> >> This patch adds the driver part of PTT trace. The perf command support of >> PTT trace is added in the following patch. >> >> Signed-off-by: Yicong Yang >> Reviewed-by: Jonathan Cameron >> --- [...] >> +static int hisi_ptt_update_aux(struct hisi_ptt *hisi_ptt, int index, bool >> stop) >> +{ >> +struct hisi_ptt_trace_ctrl *ctrl = &hisi_ptt->trace_ctrl; >> +struct perf_output_handle *handle = &ctrl->handle; >> +struct perf_event *event = handle->event; >> +struct hisi_ptt_pmu_buf *buf; >> +void *addr; >> + >> +buf = perf_get_aux(handle); >> +if (!buf || !handle->size) >> +return -EINVAL; >> + >> +addr = ctrl->trace_buf[ctrl->buf_index].addr; >> + >> +memcpy(buf->base + buf->pos, addr, HISI_PTT_TRACE_BUF_SIZE); >> +memset(addr, 0, HISI_PTT_TRACE_BUF_SIZE); > > I am a bit worried about the buffer usage, at least in the below aspects: > > The first thing is the memset(), which cleans up the buffer; the > buffer size is 4MiB, which means it will consume much CPU time to > clean up the buffer, and trace_buf is mapped as non-cacheable, so the > performance would get worse. > > The second thing is that it always copies the trace data with size > HISI_PTT_TRACE_BUF_SIZE; usually, the trace module can provide a read > pointer register, so you can know the trace data length based on the > delta value between write and read pointers. > > The last thing is that ctrl->trace_buf[] works as a bounce buffer, which > means there is actually an extra copy from the bounce buffer to the AUX > buffer; is it possible to directly output PTT trace data to the AUX buffer? > > Sorry if I bring up duplicate questions that already had a similar > discussion when the patch was reviewed before. > The hardware is designed to use 4 DMA buffers and fill the buffers with the traced data one by one. When one buffer is full the device will notify the driver by interrupt and continue to trace the PCIe TLPs in the following buffer without pausing. If the interrupt status bit of the buffer about to be used is uncleared, the trace will be paused until the corresponding interrupt status bit is cleared. So the buffer size of 4 MiB is calculated such that even at the maximum flow the driver can handle the traced data and clear the status in time, so there won't be data loss. For the first thing, the performance is acceptable as we only handle 4MiB at one time. It's in thread context so we won't block others. The hardware can keep tracing in the following DMA buffer, so handling here won't pause the trace and will not cause data loss. For the second thing, this function is called in 2 places: 1) in the IRQ thread 2) when the trace is going to stop.
In the 1st case the data length will always be HISI_PTT_TRACE_BUF_SIZE as the interrupt only occurs when one buffer is full. In the 2nd case we may not have HISI_PTT_TRACE_BUF_SIZE of data; the unfilled part of the buffer is zeroed so it can be distinguished. We keep committing HISI_PTT_TRACE_BUF_SIZE of data to stay consistent with the interrupt handling and keep things simpler here. (The HISI_PTT_TRACE_WR_STS register indicates which buffer is currently in use and the offset within that buffer that is currently being written.) For the third thing, I'm not sure we can map the AUX buffer as the DMA buffer: considering the case where the buffer is full and we need to commit the AUX buffer and apply for a new one, the trace would be paused during that procedure and any TLPs on the link would be missed. But with the current approach we won't have this problem, as the tracing is still on even while we're committing the AUX buffer, since the hardware is not directly writing to it. >> +buf->pos += HISI_PTT_TRACE_BUF_SIZE; >> + >> +if (stop) { >> +perf_aux_output_end(handle, buf->pos); >> +} else if (buf->length - buf->pos < HISI_PTT_TRACE_BUF_SIZE) { >> +perf_aux_output_skip(handle, buf->length - buf->pos); >> +perf_aux_output_end(handle, buf->pos); >> + >> +buf = perf_aux_output_begin(handle, event); >> +if (!buf) >> +return -EINVAL; >> + >> +buf->pos = handle->head % buf->length; >> +if (buf->length -
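To summarize the copy path under discussion, here is a condensed sketch (names follow the patch; error handling and wrap-around are simplified): one full 4 MiB DMA buffer is committed into the perf AUX buffer, then re-armed for the hardware.

	/* Sketch only: commit one full DMA buffer into the AUX buffer. */
	static int hisi_ptt_copy_full_buffer(struct perf_output_handle *handle,
					     void *dma_buf, size_t buf_size)
	{
		struct hisi_ptt_pmu_buf *buf = perf_get_aux(handle);

		if (!buf || !handle->size)
			return -EINVAL;

		/* The IRQ fires only on a full buffer, so the length is fixed. */
		memcpy(buf->base + buf->pos, dma_buf, buf_size);
		/* Zero the DMA buffer so a partial fill at trace stop stands out. */
		memset(dma_buf, 0, buf_size);
		buf->pos += buf_size;
		return 0;
	}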
Re: [PATCH v2 2/4] iommu/vt-d: Check domain force_snooping against attached devices
On 2022/5/5 16:43, Tian, Kevin wrote: From: Lu Baolu Sent: Thursday, May 5, 2022 9:07 AM As domain->force_snooping only impacts the devices attached with the domain, there's no need to check against all IOMMU units. At the same time, for a brand new domain (hasn't been attached to any device), the force_snooping field could be set, but the attach_dev callback will return failure if it wants to attach to a device whose IOMMU has no snoop control capability. The description about the brand new domain is not very clear. I think the point here is that force_snooping could be set on a domain no matter whether it has been attached or not, and once set it is an immutable flag. If no device is attached the operation always succeeds; such an empty domain can then only be attached to a device whose IOMMU supports snoop control. Exactly. Will update this description. static bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain) { struct dmar_domain *dmar_domain = to_dmar_domain(domain); - if (!domain_update_iommu_snooping(NULL)) + if (dmar_domain->force_snooping) + return true; + + if (!domain_support_force_snooping(dmar_domain)) return false; + Who guarantees that domain->devices won't change between above condition check and following set operation? Good catch. Should lift the lock up here. + domain_set_force_snooping(dmar_domain); dmar_domain->force_snooping = true; + return true; } Thanks Kevin Best regards, baolu
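"Lifting the lock up" could look roughly like this; a minimal sketch assuming the existing device_domain_lock protects the domain's device list (the real patch may structure this differently):

	static bool intel_iommu_enforce_cache_coherency(struct iommu_domain *domain)
	{
		struct dmar_domain *dmar_domain = to_dmar_domain(domain);
		unsigned long flags;

		if (dmar_domain->force_snooping)
			return true;

		spin_lock_irqsave(&device_domain_lock, flags);
		/* Check and set under one lock so no device can attach in between. */
		if (!domain_support_force_snooping(dmar_domain)) {
			spin_unlock_irqrestore(&device_domain_lock, flags);
			return false;
		}
		domain_set_force_snooping(dmar_domain);
		dmar_domain->force_snooping = true;
		spin_unlock_irqrestore(&device_domain_lock, flags);
		return true;
	}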
Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
On 5/5/22 12:03, Tian, Kevin wrote: >> From: Joao Martins >> Sent: Thursday, May 5, 2022 6:07 PM >> >> On 5/5/22 08:42, Tian, Kevin wrote: From: Jason Gunthorpe Sent: Tuesday, May 3, 2022 2:53 AM On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote: > On Fri, 29 Apr 2022 05:45:20 + > "Tian, Kevin" wrote: >>> From: Joao Martins >>> 3) Unmapping an IOVA range while returning its dirty bit prior to >>> unmap. This case is specific for non-nested vIOMMU case where an >>> erroneous guest (or device) DMAing to an address being unmapped at the >>> same time. >> >> an erroneous attempt like above cannot anticipate which DMAs can >> succeed in that window thus the end behavior is undefined. For an >> undefined behavior nothing will be broken by losing some bits dirtied >> in the window between reading back dirty bits of the range and >> actually calling unmap. From guest p.o.v. all those are black-box >> hardware logic to serve a virtual iotlb invalidation request which just >> cannot be completed in one cycle. >> >> Hence in reality probably this is not required except to meet vfio >> compat requirement. Just in concept returning dirty bits at unmap >> is more accurate. >> >> I'm slightly inclined to abandon it in iommufd uAPI. > Sorry, I'm not following why an unmap with returned dirty bitmap > operation is specific to a vIOMMU case, or in fact indicative of some > sort of erroneous, racy behavior of guest or device. It is being compared against the alternative which is to explicitly query dirty then do a normal unmap as two system calls and permit a race. The only case with any difference is if the guest is racing DMA with the unmap - in which case it is already indeterminate for the guest if the DMA will be completed or not. eg on the vIOMMU case if the guest races DMA with unmap then we are already fine with throwing away that DMA because that is how the race resolves during non-migration situations, so resolving it as throwing away the DMA during migration is OK too. > We need the flexibility to support memory hot-unplug operations > during migration, I would have thought that hotplug during migration would simply discard all the data - how does it use the dirty bitmap? > This was implemented as a single operation specifically to avoid > races where ongoing access may be available after retrieving a > snapshot of the bitmap. Thanks, The issue is the cost. On a real iommu eliminating the race is expensive as we have to write protect the pages before query dirty, which seems to be an extra IOTLB flush. It is not clear if paying this cost to become atomic is actually something any use case needs. So, I suggest we think about a 3rd op 'write protect and clear dirties' that will be followed by a normal unmap - the extra op will have the extra overhead and userspace can decide if it wants to pay or not vs the non-atomic read dirties operation. And let's have a use case where this must be atomic before we implement it.. >>> >>> and write-protection also relies on the support of I/O page fault... >>> >> /I think/ all IOMMUs in this series already support permission/unrecoverable >> I/O page faults for a long time IIUC. >> >> The earlier suggestion was just to discard the I/O page fault after >> write-protection happens. fwiw, some IOMMUs also support suppressing >> the event notification (like AMD). > > iiuc the purpose of 'write-protection' here is to capture in-flight dirty pages > in the said race window until the unmap and IOTLB invalidation are completed.
> But then we depend on PRS being there on the device, because without it, DMA is aborted on the target on a read-only IOVA prior to the page fault, thus the page is not going to be dirty anyways. > *unrecoverable* faults are not expected to be used in a feature path > as occurrence of such faults may lead to severe reaction in iommu > drivers e.g. completely block DMA from the device causing such faults. Unless I totally misunderstood ... the latter is actually what we were suggesting here /in the context of unmapping a GIOVA/(*). The wrprotect() was there to ensure we get an atomic dirty state of the IOVA range afterwards, by blocking DMA (as opposed to sort of mediating DMA). The I/O page fault is not supposed to happen unless there's rogue DMA AIUI. TBH, the same could be said for normal DMA unmap as that does not make any sort of guarantees of stopping DMA until the IOTLB flush happens. (*) Although I am not saying the use-case of wrprotect() and mediating dirty pages you say isn't useful. I guess it is in a world where we want to support post-copy migration with VFs, which would require some form of PRI (via the PF?) of the
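For concreteness, the split being debated would look roughly like this from userspace; every ioctl name and struct below is invented for illustration, since none of this uAPI exists yet:

	/* Hypothetical iommufd sequence for an atomic unmap-with-dirty:
	 * 1) write-protect the range so in-flight writes can no longer land,
	 * 2) read the dirty bitmap, which is now stable,
	 * 3) do a normal unmap.
	 * All names are placeholders, not real uAPI. */
	ioctl(iommufd, IOMMU_HWPT_WRPROTECT, &(struct iommu_hwpt_range){
		.iova = iova, .length = length });
	ioctl(iommufd, IOMMU_HWPT_GET_DIRTY, &(struct iommu_hwpt_dirty){
		.iova = iova, .length = length, .bitmap = (uintptr_t)bitmap });
	ioctl(iommufd, IOMMU_IOAS_UNMAP, &(struct iommu_ioas_unmap){
		.iova = iova, .length = length });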
RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
> From: Joao Martins > Sent: Thursday, May 5, 2022 6:07 PM > > On 5/5/22 08:42, Tian, Kevin wrote: > >> From: Jason Gunthorpe > >> Sent: Tuesday, May 3, 2022 2:53 AM > >> > >> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote: > >>> On Fri, 29 Apr 2022 05:45:20 + > >>> "Tian, Kevin" wrote: > > From: Joao Martins > > 3) Unmapping an IOVA range while returning its dirty bit prior to > > unmap. This case is specific for non-nested vIOMMU case where an > > erroneous guest (or device) DMAing to an address being unmapped at > >> the > > same time. > > an erroneous attempt like above cannot anticipate which DMAs can > succeed in that window thus the end behavior is undefined. For an > undefined behavior nothing will be broken by losing some bits dirtied > in the window between reading back dirty bits of the range and > actually calling unmap. From guest p.o.v. all those are black-box > hardware logic to serve a virtual iotlb invalidation request which just > cannot be completed in one cycle. > > Hence in reality probably this is not required except to meet vfio > compat requirement. Just in concept returning dirty bits at unmap > is more accurate. > > I'm slightly inclined to abandon it in iommufd uAPI. > >>> > >>> Sorry, I'm not following why an unmap with returned dirty bitmap > >>> operation is specific to a vIOMMU case, or in fact indicative of some > >>> sort of erroneous, racy behavior of guest or device. > >> > >> It is being compared against the alternative which is to explicitly > >> query dirty then do a normal unmap as two system calls and permit a > >> race. > >> > >> The only case with any difference is if the guest is racing DMA with > >> the unmap - in which case it is already indeterminate for the guest if > >> the DMA will be completed or not. > >> > >> eg on the vIOMMU case if the guest races DMA with unmap then we are > >> already fine with throwing away that DMA because that is how the race > >> resolves during non-migration situations, so resolving it as throwing > >> away the DMA during migration is OK too. > >> > >>> We need the flexibility to support memory hot-unplug operations > >>> during migration, > >> > >> I would have thought that hotplug during migration would simply > >> discard all the data - how does it use the dirty bitmap? > >> > >>> This was implemented as a single operation specifically to avoid > >>> races where ongoing access may be available after retrieving a > >>> snapshot of the bitmap. Thanks, > >> > >> The issue is the cost. > >> > >> On a real iommu eliminating the race is expensive as we have to write > >> protect the pages before query dirty, which seems to be an extra IOTLB > >> flush. > >> > >> It is not clear if paying this cost to become atomic is actually > >> something any use case needs. > >> > >> So, I suggest we think about a 3rd op 'write protect and clear > >> dirties' that will be followed by a normal unmap - the extra op will > >> have the extra overhead and userspace can decide if it wants to pay or > >> not vs the non-atomic read dirties operation. And let's have a use case > >> where this must be atomic before we implement it.. > > > > and write-protection also relies on the support of I/O page fault... > > > /I think/ all IOMMUs in this series already support permission/unrecoverable > I/O page faults for a long time IIUC. > > The earlier suggestion was just to discard the I/O page fault after > write-protection happens. fwiw, some IOMMUs also support suppressing > the event notification (like AMD).
iiuc the purpose of 'write-protection' here is to capture in-flight dirty pages in the said race window until the unmap and IOTLB invalidation are completed. *unrecoverable* faults are not expected to be used in a feature path, as occurrence of such faults may lead to a severe reaction in iommu drivers, e.g. completely blocking DMA from the device causing such faults.
RE: [PATCH v2] iommu: iommu_group_claim_dma_owner() must always assign a domain
> From: Jason Gunthorpe > Sent: Thursday, May 5, 2022 3:09 AM > > Once the group enters 'owned' mode it can never be assigned back to the > default_domain or to a NULL domain. It must always be actively assigned to worth pointing out that a NULL domain is not always translated to DMA blocked on all platforms. That was a wrong assumption before this patch. > a current domain. If the caller hasn't provided a domain then the core > must provide an explicit DMA blocking domain that has no DMA map. > > Lazily create a group-global blocking DMA domain when > iommu_group_claim_dma_owner is first called and immediately assign the > group to it. This ensures that DMA is immediately fully isolated on all > IOMMU drivers. > > If the user attaches/detaches while owned then detach will set the group > back to the blocking domain. > > Slightly reorganize the call chains so that > __iommu_group_attach_core_domain() is the function that removes any > caller > configured domain and sets the domains back a core owned domain with an > appropriate lifetime. > > __iommu_group_attach_domain() is the worker function that can change the > domain assigned to a group to any target domain, including NULL. > > Add comments clarifying how the NULL vs detach_dev vs default_domain > works > based on Robin's remarks. > > This fixes an oops with VFIO and SMMUv3 because VFIO will call > iommu_detach_group() and then immediately iommu_domain_free(), but > SMMUv3 has no way to know that the domain it is holding a pointer to > has been freed. Now the iommu_detach_group() will assign the blocking > domain and SMMUv3 will no longer hold a stale domain reference. Overall I like what this patch does. Just some nits below. > > Fixes: 1ea2a07a532b ("iommu: Add DMA ownership management > interfaces") > Reported-by: Qian Cai > Tested-by: Baolu Lu > Tested-by: Nicolin Chen > Co-developed-by: Robin Murphy > Signed-off-by: Robin Murphy > Signed-off-by: Jason Gunthorpe > --- > drivers/iommu/iommu.c | 122 ++ > 1 file changed, 87 insertions(+), 35 deletions(-) > > Joerg, this should address the issue, Nicolin reproduced the original issue > and verified this fix on ARM SMMUv3. 
> > v2: > - Remove redundant detach_dev ops check in __iommu_detach_device and > make the added WARN_ON fail instead > - Check for blocking_domain in __iommu_attach_group() so VFIO can > actually attach a new group > - Update comments and spelling > - Fix missed change to new_domain in iommu_group_do_detach_device() > v1: https://lore.kernel.org/r/0-v1-6e9d2d0a759d+11b- iommu_dma_block_...@nvidia.com > > Thanks, > Jason > > diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c > index 0c42ece2585406..c1bdec807ea381 100644 > --- a/drivers/iommu/iommu.c > +++ b/drivers/iommu/iommu.c > @@ -44,6 +44,7 @@ struct iommu_group { > char *name; > int id; > struct iommu_domain *default_domain; > + struct iommu_domain *blocking_domain; > struct iommu_domain *domain; > struct list_head entry; > unsigned int owner_cnt; > @@ -82,8 +83,7 @@ static int __iommu_attach_device(struct > iommu_domain *domain, >struct device *dev); > static int __iommu_attach_group(struct iommu_domain *domain, > struct iommu_group *group); > -static void __iommu_detach_group(struct iommu_domain *domain, > - struct iommu_group *group); > +static void __iommu_group_attach_core_domain(struct iommu_group > *group); > static int iommu_create_device_direct_mappings(struct iommu_group > *group, > struct device *dev); > static struct iommu_group *iommu_group_get_for_dev(struct device *dev); > @@ -596,6 +596,8 @@ static void iommu_group_release(struct kobject > *kobj) > > if (group->default_domain) > iommu_domain_free(group->default_domain); > + if (group->blocking_domain) > + iommu_domain_free(group->blocking_domain); > > kfree(group->name); > kfree(group); > @@ -1963,9 +1965,6 @@ static void __iommu_detach_device(struct > iommu_domain *domain, > if (iommu_is_attach_deferred(dev)) > return; > > - if (unlikely(domain->ops->detach_dev == NULL)) > - return; > - > domain->ops->detach_dev(domain, dev); > trace_detach_device_from_domain(dev); > } > @@ -1979,12 +1978,10 @@ void iommu_detach_device(struct > iommu_domain *domain, struct device *dev) > return; > > mutex_lock(&group->mutex); > - if (iommu_group_device_count(group) != 1) { > - WARN_ON(1); > + if (WARN_ON(domain != group->domain) || > + WARN_ON(iommu_group_device_count(group) != 1)) > goto out_unlock; > - } > - > - __iommu_detach_group(domain, group); > + __iommu_group_attach_core_domain(group); > > out_unlock: > mutex_unlock(&group->mutex); > @@ -2040,7
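The core of the approach, reduced to a sketch (the posted patch carries more error handling; __iommu_domain_alloc() is the existing internal allocator):

	static int __iommu_group_alloc_blocking_domain(struct iommu_group *group,
						       struct device *dev)
	{
		if (group->blocking_domain)
			return 0;

		group->blocking_domain =
			__iommu_domain_alloc(dev->bus, IOMMU_DOMAIN_BLOCKED);
		if (!group->blocking_domain) {
			/*
			 * For drivers without a dedicated blocking type, an
			 * empty unmanaged domain maps nothing and blocks DMA
			 * just as well.
			 */
			group->blocking_domain =
				__iommu_domain_alloc(dev->bus,
						     IOMMU_DOMAIN_UNMANAGED);
			if (!group->blocking_domain)
				return -EINVAL;
		}
		return 0;
	}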
Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
On 5/5/22 08:42, Tian, Kevin wrote: >> From: Jason Gunthorpe >> Sent: Tuesday, May 3, 2022 2:53 AM >> >> On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote: >>> On Fri, 29 Apr 2022 05:45:20 + >>> "Tian, Kevin" wrote: > From: Joao Martins > 3) Unmapping an IOVA range while returning its dirty bit prior to > unmap. This case is specific for non-nested vIOMMU case where an > erroneous guest (or device) DMAing to an address being unmapped at >> the > same time. an erroneous attempt like above cannot anticipate which DMAs can succeed in that window thus the end behavior is undefined. For an undefined behavior nothing will be broken by losing some bits dirtied in the window between reading back dirty bits of the range and actually calling unmap. From guest p.o.v. all those are black-box hardware logic to serve a virtual iotlb invalidation request which just cannot be completed in one cycle. Hence in reality probably this is not required except to meet vfio compat requirement. Just in concept returning dirty bits at unmap is more accurate. I'm slightly inclined to abandon it in iommufd uAPI. >>> >>> Sorry, I'm not following why an unmap with returned dirty bitmap >>> operation is specific to a vIOMMU case, or in fact indicative of some >>> sort of erroneous, racy behavior of guest or device. >> >> It is being compared against the alternative which is to explicitly >> query dirty then do a normal unmap as two system calls and permit a >> race. >> >> The only case with any difference is if the guest is racing DMA with >> the unmap - in which case it is already indeterminate for the guest if >> the DMA will be completed or not. >> >> eg on the vIOMMU case if the guest races DMA with unmap then we are >> already fine with throwing away that DMA because that is how the race >> resolves during non-migration situations, so resolving it as throwing >> away the DMA during migration is OK too. >> >>> We need the flexibility to support memory hot-unplug operations >>> during migration, >> >> I would have thought that hotplug during migration would simply >> discard all the data - how does it use the dirty bitmap? >> >>> This was implemented as a single operation specifically to avoid >>> races where ongoing access may be available after retrieving a >>> snapshot of the bitmap. Thanks, >> >> The issue is the cost. >> >> On a real iommu eliminating the race is expensive as we have to write >> protect the pages before query dirty, which seems to be an extra IOTLB >> flush. >> >> It is not clear if paying this cost to become atomic is actually >> something any use case needs. >> >> So, I suggest we think about a 3rd op 'write protect and clear >> dirties' that will be followed by a normal unmap - the extra op will >> have the extra overhead and userspace can decide if it wants to pay or >> not vs the non-atomic read dirties operation. And let's have a use case >> where this must be atomic before we implement it.. > > and write-protection also relies on the support of I/O page fault... > /I think/ all IOMMUs in this series already support permission/unrecoverable I/O page faults for a long time IIUC. The earlier suggestion was just to discard the I/O page fault after write-protection happens. fwiw, some IOMMUs also support suppressing the event notification (like AMD).
Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
On 5/5/22 08:25, Shameerali Kolothum Thodi wrote: >> -Original Message- >> From: Joao Martins [mailto:joao.m.mart...@oracle.com] >> Sent: 29 April 2022 12:05 >> To: Tian, Kevin >> Cc: Joerg Roedel ; Suravee Suthikulpanit >> ; Will Deacon ; Robin >> Murphy ; Jean-Philippe Brucker >> ; zhukeqian ; >> Shameerali Kolothum Thodi ; >> David Woodhouse ; Lu Baolu >> ; Jason Gunthorpe ; Nicolin Chen >> ; Yishai Hadas ; Eric Auger >> ; Liu, Yi L ; Alex Williamson >> ; Cornelia Huck ; >> k...@vger.kernel.org; iommu@lists.linux-foundation.org >> Subject: Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add >> set_dirty_tracking_range() support >> >> On 4/29/22 09:28, Tian, Kevin wrote: From: Joao Martins Sent: Friday, April 29, 2022 5:09 AM Similar to .read_and_clear_dirty() use the page table walker helper functions and set DBM|RDONLY bit, thus switching the IOPTE to writeable-clean. >>> >>> this should not be one-off if the operation needs to be >>> applied to IOPTE. Say a map request comes right after >>> set_dirty_tracking() is called. If it's agreed to remove >>> the range op then smmu driver should record the tracking >>> status internally and then apply the modifier to all the new >>> mappings automatically before dirty tracking is disabled. >>> Otherwise the same logic needs to be kept in iommufd to >>> call set_dirty_tracking_range() explicitly for every new >>> iopt_area created within the tracking window. >> >> Gah, I totally missed that by mistake. New mappings aren't >> carrying over the "DBM is set". This needs a new io-pgtable >> quirk added post dirty-tracking toggling. >> >> I can adjust, but I am at odds on including this in a future >> iteration given that I can't really test any of this stuff. >> Might drop the driver until I have hardware/emulation I can >> use (or maybe others can take over this). It was included >> for revising the iommu core ops and whether iommufd was >> affected by it. > > [+Kunkun Jiang]. I think he is now looking into this and might have > a test setup to verify this. I'll keep him CC'ed on the next iterations. Thanks! FWIW, this should change a bit in the next iteration (simpler) by always enabling DBM from the start. The SMMUv3 set_dirty_tracking() becomes a simpler function that tests quirks (i.e. DBM set) and what not, and calls read_and_clear_dirty() without a bitmap argument to clear dirties.
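In code, the simplification described above might reduce to something like this sketch (arm_smmu_dbm_capable() is a made-up helper standing in for whatever capability test the next iteration uses):

	static int arm_smmu_set_dirty_tracking(struct iommu_domain *domain,
					       bool enable)
	{
		struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain);

		/* DBM is always enabled at map time; only check support here. */
		if (!arm_smmu_dbm_capable(smmu_domain->smmu))	/* hypothetical */
			return -EOPNOTSUPP;

		/* On enable, clear stale dirty state without collecting bits. */
		if (enable)
			return domain->ops->read_and_clear_dirty(domain, 0,
								 ULONG_MAX, NULL);
		return 0;
	}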
Re: [PATCH v2 10/37] iommu/amd: Introduce per PCI segment last_bdf
Hi Joerg, On 5/2/2022 4:24 PM, Joerg Roedel wrote: > Hi Vasant, > > On Fri, Apr 29, 2022 at 08:15:49PM +0530, Vasant Hegde wrote: >> We still need to parse IVHD to find max devices supported by each PCI segment >> (same as the way it's done today). Hence we need all these variables. > > From what I have seen for a few years now, the IVRS tables enumerate the > whole PCI segment, up to device ff:1f.7. This results in the maximum > being allocated for all data structures anyway. Therefore we can > probably think about skipping the scan to find the largest bdf and just > assume it is ff:1f.7, saving us all the size-tracking variables? With PCI segments, I think we will have segments with fewer devices than ff:1f.7. Hence we need these variables. -Vasant
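For reference, the "largest possible" device Joerg assumes corresponds to a 16-bit device ID of 0xffff, which the standard macros make easy to check:

	#include <linux/pci.h>

	/* ff:1f.7 = bus 0xff, slot 0x1f, function 7 -- the last addressable
	 * device on a segment. PCI_DEVID() packs bus and devfn into the
	 * 16-bit ID that sizes the IOMMU data structures. */
	u16 last_bdf = PCI_DEVID(0xff, PCI_DEVFN(0x1f, 7));	/* == 0xffff */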
RE: [PATCH v2 4/4] iommu/vt-d: Remove hard coding PGSNP bit in PASID entries
> From: Lu Baolu > Sent: Thursday, May 5, 2022 9:07 AM > > As enforce_cache_coherency has been introduced into the > iommu_domain_ops, > the kernel component which owns the iommu domain is able to opt-in its > requirement for force snooping support. The iommu driver has no need to > hard code the page snoop control bit in the PASID table entries anymore. > > Signed-off-by: Lu Baolu Reviewed-by: Kevin Tian, with one nit: > --- > drivers/iommu/intel/pasid.c | 3 --- > 1 file changed, 3 deletions(-) > > diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c > index 41a0e3b02c79..0abfa7fc7fb0 100644 > --- a/drivers/iommu/intel/pasid.c > +++ b/drivers/iommu/intel/pasid.c > @@ -710,9 +710,6 @@ int intel_pasid_setup_second_level(struct > intel_iommu *iommu, > pasid_set_fault_enable(pte); > pasid_set_page_snoop(pte, !!ecap_smpwc(iommu->ecap)); Probably in a separate patch but above should really be renamed to pasid_set_page_walk_snoop(). > > - if (domain->domain.type == IOMMU_DOMAIN_UNMANAGED) > - pasid_set_pgsnp(pte); > - > /* >* Since it is a second level only translation setup, we should >* set SRE bit as well (addresses are expected to be GPAs). > -- > 2.25.1
RE: [PATCH v2 3/4] iommu/vt-d: Remove domain_update_iommu_snooping()
> From: Lu Baolu > Sent: Thursday, May 5, 2022 9:07 AM > > The IOMMU force snooping capability is not required to be consistent > among all the IOMMUs anymore. Remove force snooping capability check > in the IOMMU hot-add path and domain_update_iommu_snooping() > becomes > a dead code now. > > Signed-off-by: Lu Baolu > Reviewed-by: Jason Gunthorpe Reviewed-by: Kevin Tian > --- > drivers/iommu/intel/iommu.c | 34 +- > 1 file changed, 1 insertion(+), 33 deletions(-) > > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c > index 98112228ae93..da4bfb34ae4b 100644 > --- a/drivers/iommu/intel/iommu.c > +++ b/drivers/iommu/intel/iommu.c > @@ -533,33 +533,6 @@ static void > domain_update_iommu_coherency(struct dmar_domain *domain) > rcu_read_unlock(); > } > > -static bool domain_update_iommu_snooping(struct intel_iommu *skip) > -{ > - struct dmar_drhd_unit *drhd; > - struct intel_iommu *iommu; > - bool ret = true; > - > - rcu_read_lock(); > - for_each_active_iommu(iommu, drhd) { > - if (iommu != skip) { > - /* > - * If the hardware is operating in the scalable mode, > - * the snooping control is always supported since we > - * always set PASID-table-entry.PGSNP bit if the > domain > - * is managed outside (UNMANAGED). > - */ > - if (!sm_supported(iommu) && > - !ecap_sc_support(iommu->ecap)) { > - ret = false; > - break; > - } > - } > - } > - rcu_read_unlock(); > - > - return ret; > -} > - > static int domain_update_iommu_superpage(struct dmar_domain *domain, >struct intel_iommu *skip) > { > @@ -3593,12 +3566,7 @@ static int intel_iommu_add(struct > dmar_drhd_unit *dmaru) > iommu->name); > return -ENXIO; > } > - if (!ecap_sc_support(iommu->ecap) && > - domain_update_iommu_snooping(iommu)) { > - pr_warn("%s: Doesn't support snooping.\n", > - iommu->name); > - return -ENXIO; > - } > + > sp = domain_update_iommu_superpage(NULL, iommu) - 1; > if (sp >= 0 && !(cap_super_page_val(iommu->cap) & (1 << sp))) { > pr_warn("%s: Doesn't support large page.\n", > -- > 2.25.1
RE: [PATCH v2 2/4] iommu/vt-d: Check domain force_snooping against attached devices
> From: Lu Baolu > Sent: Thursday, May 5, 2022 9:07 AM > > As domain->force_snooping only impacts the devices attached with the > domain, there's no need to check against all IOMMU units. At the same > time, for a brand new domain (hasn't been attached to any device), the > force_snooping field could be set, but the attach_dev callback will > return failure if it wants to attach to a device whose IOMMU has no > snoop control capability. The description about the brand new domain is not very clear. I think the point here is that force_snooping could be set on a domain no matter whether it has been attached or not, and once set it is an immutable flag. If no device is attached the operation always succeeds; such an empty domain can then only be attached to a device whose IOMMU supports snoop control. > static bool intel_iommu_enforce_cache_coherency(struct iommu_domain > *domain) > { > struct dmar_domain *dmar_domain = to_dmar_domain(domain); > > - if (!domain_update_iommu_snooping(NULL)) > + if (dmar_domain->force_snooping) > + return true; > + > + if (!domain_support_force_snooping(dmar_domain)) > return false; > + Who guarantees that domain->devices won't change between above condition check and following set operation? > + domain_set_force_snooping(dmar_domain); > dmar_domain->force_snooping = true; > + > return true; > } > Thanks Kevin
RE: [PATCH v2 1/4] iommu/vt-d: Block force-snoop domain attaching if no SC support
> From: Lu Baolu > Sent: Thursday, May 5, 2022 9:07 AM > > In the attach_dev callback of the default domain ops, if the domain has > been set force_snooping, but the iommu hardware of the device does not > support SC(Snoop Control) capability, the callback should block it and > return a corresponding error code. > > Signed-off-by: Lu Baolu > Reviewed-by: Jason Gunthorpe Reviewed-by: Kevin Tian > --- > drivers/iommu/intel/iommu.c | 3 +++ > 1 file changed, 3 insertions(+) > > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c > index cf43e8f9091b..d68f5bbf3e93 100644 > --- a/drivers/iommu/intel/iommu.c > +++ b/drivers/iommu/intel/iommu.c > @@ -4354,6 +4354,9 @@ static int prepare_domain_attach_device(struct > iommu_domain *domain, > if (!iommu) > return -ENODEV; > > + if (dmar_domain->force_snooping && !ecap_sc_support(iommu->ecap)) > + return -EOPNOTSUPP; > > /* check if this iommu agaw is sufficient for max mapped address */ > addr_width = agaw_to_width(iommu->agaw); > if (addr_width > cap_mgaw(iommu->cap)) > -- > 2.25.1
Re: [PATCH v5 10/12] iommu: Prepare IOMMU domain for IOPF
On 2022/5/4 02:20, Jean-Philippe Brucker wrote: diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 7cae631c1baa..33449523afbe 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -3174,3 +3174,24 @@ void iommu_detach_device_pasid(struct iommu_domain *domain, iommu_group_put(group); } + +struct iommu_domain *iommu_get_domain_for_dev_pasid(struct device *dev, + ioasid_t pasid) +{ + struct iommu_domain *domain; + struct iommu_group *group; + + if (!pasid_valid(pasid)) + return NULL; + + group = iommu_group_get(dev); + if (!group) + return NULL; + + mutex_lock(&group->mutex); Unfortunately this still causes the deadlock when unbind() flushes the IOPF queue while holding the group mutex. Sorry, I didn't get your point here. Do you mean unbind() could hold the group mutex before calling this helper? The group mutex is only available in iommu.c. The unbind() has no means to hold this lock. Or did I miss anything? Best regards, baolu If we make this function private to IOPF, then we can get rid of this mutex_lock(). It's OK because: * xarray protects its internal state with RCU, so we can call xa_load() outside the lock. * The domain obtained from xa_load is finalized. Its content is valid because xarray stores the domain using rcu_assign_pointer(), which has a release memory barrier, which pairs with data dependencies in IOPF (domain->sva_ioas etc). We'll need to be careful about this when allowing other users to install a fault handler. Should be fine as long as the handler and data are installed before the domain is added to pasid_array. * We know the domain is valid the whole time IOPF is using it, because unbind() waits for pending faults. We just need a comment explaining the last point, something like: /* * Safe to fetch outside the group mutex because: * - xarray protects its internal state with RCU * - the domain obtained is either NULL or fully formed * - the IOPF work is the only caller and is flushed before the * domain is freed. */ Thanks, Jean + domain = xa_load(&group->pasid_array, pasid); + mutex_unlock(&group->mutex); + iommu_group_put(group); + + return domain; +}
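Put together, the lock-free variant suggested above would reduce to something like this sketch, assuming the helper becomes private to the IOPF code:

	static struct iommu_domain *
	iopf_get_domain_for_dev_pasid(struct device *dev, ioasid_t pasid)
	{
		struct iommu_group *group = iommu_group_get(dev);
		struct iommu_domain *domain;

		if (!group)
			return NULL;

		/*
		 * Safe to fetch outside the group mutex because:
		 * - xarray protects its internal state with RCU
		 * - the domain obtained is either NULL or fully formed
		 * - the IOPF work is the only caller and is flushed before
		 *   the domain is freed.
		 */
		domain = xa_load(&group->pasid_array, pasid);
		iommu_group_put(group);
		return domain;
	}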
RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
> From: Jason Gunthorpe > Sent: Tuesday, May 3, 2022 2:53 AM > > On Mon, May 02, 2022 at 12:11:07PM -0600, Alex Williamson wrote: > > On Fri, 29 Apr 2022 05:45:20 + > > "Tian, Kevin" wrote: > > > > From: Joao Martins > > > > 3) Unmapping an IOVA range while returning its dirty bit prior to > > > > unmap. This case is specific for non-nested vIOMMU case where an > > > > erroneous guest (or device) DMAing to an address being unmapped at > the > > > > same time. > > > > > > an erroneous attempt like above cannot anticipate which DMAs can > > > succeed in that window thus the end behavior is undefined. For an > > > undefined behavior nothing will be broken by losing some bits dirtied > > > in the window between reading back dirty bits of the range and > > > actually calling unmap. From guest p.o.v. all those are black-box > > > hardware logic to serve a virtual iotlb invalidation request which just > > > cannot be completed in one cycle. > > > > > > Hence in reality probably this is not required except to meet vfio > > > compat requirement. Just in concept returning dirty bits at unmap > > > is more accurate. > > > > > > I'm slightly inclined to abandon it in iommufd uAPI. > > > > Sorry, I'm not following why an unmap with returned dirty bitmap > > operation is specific to a vIOMMU case, or in fact indicative of some > > sort of erroneous, racy behavior of guest or device. > > It is being compared against the alternative which is to explicitly > query dirty then do a normal unmap as two system calls and permit a > race. > > The only case with any difference is if the guest is racing DMA with > the unmap - in which case it is already indeterminate for the guest if > the DMA will be completed or not. > > eg on the vIOMMU case if the guest races DMA with unmap then we are > already fine with throwing away that DMA because that is how the race > resolves during non-migration situations, so resolving it as throwing > away the DMA during migration is OK too. > > > We need the flexibility to support memory hot-unplug operations > > during migration, > > I would have thought that hotplug during migration would simply > discard all the data - how does it use the dirty bitmap? > > > This was implemented as a single operation specifically to avoid > > races where ongoing access may be available after retrieving a > > snapshot of the bitmap. Thanks, > > The issue is the cost. > > On a real iommu eliminating the race is expensive as we have to write > protect the pages before query dirty, which seems to be an extra IOTLB > flush. > > It is not clear if paying this cost to become atomic is actually > something any use case needs. > > So, I suggest we think about a 3rd op 'write protect and clear > dirties' that will be followed by a normal unmap - the extra op will > have the extra overhead and userspace can decide if it wants to pay or > not vs the non-atomic read dirties operation. And let's have a use case > where this must be atomic before we implement it.. and write-protection also relies on the support of I/O page fault... > > The downside is we lose a little bit of efficiency by unbundling > these steps, the upside is that it doesn't require quite as many > special iommu_domain/etc paths. > > (Also Joao, you should probably have a read and do not clear dirty > operation with the idea that the next operation will be unmap - then > maybe we can avoid IOTLB flushing..) > > Jason
RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
> From: Jason Gunthorpe > Sent: Friday, April 29, 2022 8:39 PM > > > >> * There's no capabilities API in IOMMUFD, and in this RFC each vendor tracks > > > > > > there was discussion adding device capability uAPI somewhere. > > > > > ack let me know if there were snippets to the conversation as I seem to have > missed that. > > It was just discussion pending something we actually needed to report. > > Would be a very simple ioctl taking in the device ID and filling a > struct of stuff. > > > > probably this can be reported as a device cap as supporting of dirty bit > > > is > > > an immutable property of the iommu serving that device. > > It is an easier fit to read it out of the iommu_domain after device > attach though - since we don't need to build new kernel infrastructure > to query it from a device. > > > > Userspace can > > > enable dirty tracking on a hwpt if all attached devices claim the support > > > and the kernel will do the same verification. > > > > Sorry to be dense but this is not up to 'devices' given they take no > > part in the tracking? I guess by 'devices' you mean the software > > idea of it i.e. the iommu context created for attaching a said > > physical device, not the physical device itself. > > Indeed, an hwpt represents an iommu_domain and if the iommu_domain > has > dirty tracking ops set then that is an inherent property of the domain > and does not suddenly go away when a new device is attached. > In concept this is an iommu property instead of a domain property. The two are equivalent only if the iommu driver registers dirty tracking ops only when all IOMMUs in the platform support the capability, i.e. sort of managing this iommu property in a global way. But the global way sort of conflicts with the on-going direction making iommu capability truly per-iommu (though I'm not sure whether heterogeneity would exist for dirty tracking). Following that trend a domain property is not inherent as it is meaningless if no device is attached at all. From this angle IMHO it's more reasonable to report this IOMMU property to userspace via a device capability. If all devices attached to a hwpt claim IOMMU dirty tracking capability, the user can call set_dirty_tracking() on the hwpt object. Once dirty tracking is enabled on a hwpt, further attaching a device which doesn't claim this capability is simply rejected. Thanks Kevin
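The "very simple ioctl" mentioned above has no agreed shape yet; purely as a strawman, it could look like this (every name and field here is invented):

	struct iommu_device_caps {
		__u32 size;		/* sizeof(struct iommu_device_caps) */
		__u32 dev_id;		/* iommufd device ID (in) */
		__aligned_u64 flags;	/* capability bits (out), e.g. an
					 * invented IOMMU_DEV_CAP_DIRTY_TRACKING */
	};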
RE: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add set_dirty_tracking_range() support
> -Original Message- > From: Joao Martins [mailto:joao.m.mart...@oracle.com] > Sent: 29 April 2022 12:05 > To: Tian, Kevin > Cc: Joerg Roedel ; Suravee Suthikulpanit > ; Will Deacon ; Robin > Murphy ; Jean-Philippe Brucker > ; zhukeqian ; > Shameerali Kolothum Thodi ; > David Woodhouse ; Lu Baolu > ; Jason Gunthorpe ; Nicolin Chen > ; Yishai Hadas ; Eric Auger > ; Liu, Yi L ; Alex Williamson > ; Cornelia Huck ; > k...@vger.kernel.org; iommu@lists.linux-foundation.org > Subject: Re: [PATCH RFC 15/19] iommu/arm-smmu-v3: Add > set_dirty_tracking_range() support > > On 4/29/22 09:28, Tian, Kevin wrote: > >> From: Joao Martins > >> Sent: Friday, April 29, 2022 5:09 AM > >> > >> Similar to .read_and_clear_dirty() use the page table > >> walker helper functions and set DBM|RDONLY bit, thus > >> switching the IOPTE to writeable-clean. > > > > this should not be one-off if the operation needs to be > > applied to IOPTE. Say a map request comes right after > > set_dirty_tracking() is called. If it's agreed to remove > > the range op then smmu driver should record the tracking > > status internally and then apply the modifier to all the new > > mappings automatically before dirty tracking is disabled. > > Otherwise the same logic needs to be kept in iommufd to > > call set_dirty_tracking_range() explicitly for every new > > iopt_area created within the tracking window. > > Gah, I totally missed that by mistake. New mappings aren't > carrying over the "DBM is set". This needs a new io-pgtable > quirk added post dirty-tracking toggling. > > I can adjust, but I am at odds on including this in a future > iteration given that I can't really test any of this stuff. > Might drop the driver until I have hardware/emulation I can > use (or maybe others can take over this). It was included > for revising the iommu core ops and whether iommufd was > affected by it. [+Kunkun Jiang]. I think he is now looking into this and might have a test setup to verify this. Thanks, Shameer
Re: [PATCH v5 07/12] arm-smmu-v3/sva: Add SVA domain support
On 2022/5/4 02:12, Jean-Philippe Brucker wrote: On Mon, May 02, 2022 at 09:48:37AM +0800, Lu Baolu wrote: Add support for SVA domain allocation and provide an SVA-specific iommu_domain_ops. Signed-off-by: Lu Baolu --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 14 +++ .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 42 +++ drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 21 ++ 3 files changed, 77 insertions(+) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index cd48590ada30..7631c00fdcbd 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -759,6 +759,10 @@ struct iommu_sva *arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, void arm_smmu_sva_unbind(struct iommu_sva *handle); u32 arm_smmu_sva_get_pasid(struct iommu_sva *handle); void arm_smmu_sva_notifier_synchronize(void); +int arm_smmu_sva_attach_dev_pasid(struct iommu_domain *domain, + struct device *dev, ioasid_t id); +void arm_smmu_sva_detach_dev_pasid(struct iommu_domain *domain, + struct device *dev, ioasid_t id); #else /* CONFIG_ARM_SMMU_V3_SVA */ static inline bool arm_smmu_sva_supported(struct arm_smmu_device *smmu) { @@ -804,5 +808,15 @@ static inline u32 arm_smmu_sva_get_pasid(struct iommu_sva *handle) } static inline void arm_smmu_sva_notifier_synchronize(void) {} + +static inline int arm_smmu_sva_attach_dev_pasid(struct iommu_domain *domain, + struct device *dev, ioasid_t id) +{ + return -ENODEV; +} + +static inline void arm_smmu_sva_detach_dev_pasid(struct iommu_domain *domain, +struct device *dev, +ioasid_t id) {} #endif /* CONFIG_ARM_SMMU_V3_SVA */ #endif /* _ARM_SMMU_V3_H */ diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c index c623dae1e115..3b843cd3ed67 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c @@ -541,3 +541,45 @@ void arm_smmu_sva_notifier_synchronize(void) */ mmu_notifier_synchronize(); } + +int arm_smmu_sva_attach_dev_pasid(struct iommu_domain *domain, + struct device *dev, ioasid_t id) +{ + int ret = 0; + struct iommu_sva *handle; + struct mm_struct *mm = iommu_sva_domain_mm(domain); + + if (domain->type != IOMMU_DOMAIN_SVA || !mm) We wouldn't get that far with a non-SVA domain since iommu_sva_domain_mm() would dereference a NULL pointer. Could you move it after the domain->type check, and maybe add a WARN_ON()? It could help catch issues in future API changes. Sure. I will make it like this, int arm_smmu_sva_attach_dev_pasid(struct iommu_domain *domain, struct device *dev, ioasid_t id) { int ret = 0; struct mm_struct *mm; struct iommu_sva *handle; if (domain->type != IOMMU_DOMAIN_SVA) return -EINVAL; mm = iommu_sva_domain_mm(domain); if (WARN_ON(!mm)) return -ENODEV; ... ... 
+ return -EINVAL; + + mutex_lock(&sva_lock); + handle = __arm_smmu_sva_bind(dev, mm); + if (IS_ERR(handle)) + ret = PTR_ERR(handle); + mutex_unlock(&sva_lock); + + return ret; +} + +void arm_smmu_sva_detach_dev_pasid(struct iommu_domain *domain, + struct device *dev, ioasid_t id) +{ + struct arm_smmu_bond *bond = NULL, *t; + struct mm_struct *mm = iommu_sva_domain_mm(domain); + struct arm_smmu_master *master = dev_iommu_priv_get(dev); + + mutex_lock(&sva_lock); + list_for_each_entry(t, &master->bonds, list) { + if (t->mm == mm) { + bond = t; + break; + } + } + + if (!WARN_ON(!bond) && refcount_dec_and_test(&bond->refs)) { + list_del(&bond->list); + arm_smmu_mmu_notifier_put(bond->smmu_mn); + kfree(bond); + } + mutex_unlock(&sva_lock); +} diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index afc63fce6107..bd80de0bad98 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -1995,10 +1995,31 @@ static bool arm_smmu_capable(enum iommu_cap cap) } } +static void arm_smmu_sva_domain_free(struct iommu_domain *domain) +{ + kfree(domain); +} + +static const struct iommu_domain_ops arm_smmu_sva_domain_ops = { + .attach_dev_pasid = arm_smmu_sva_attach_dev_pasid, + .detach_dev_pasid = arm_smmu_sva_detach_dev_pasid, + .free =
Re: [PATCH v5 04/12] iommu/sva: Basic data structures for SVA
On 2022/5/4 02:09, Jean-Philippe Brucker wrote: On Mon, May 02, 2022 at 09:48:34AM +0800, Lu Baolu wrote: Use below data structures for SVA implementation in the IOMMU core: - struct iommu_sva_ioas Represent the I/O address space shared with an application CPU address space. This structure has a 1:1 relationship with an mm_struct. It grabs a "mm->mm_count" refcount during creation and drops it on release. Do we actually need this structure? At the moment it only keeps track of bonds, which we can move to struct dev_iommu. Replacing it by a mm pointer in struct iommu_domain simplifies the driver and seems to work Fair enough. +struct iommu_sva_ioas { + struct mm_struct *mm; + ioasid_t pasid; + + /* Counter of domains attached to this ioas. */ + refcount_t users; + + /* All bindings are linked here. */ + struct list_head bonds; +}; By moving @mm to struct iommu_domain and @bonds to struct dev_iommu, the code looks simpler. The mm, sva domain and per-device dev_iommu are guaranteed to be valid during bind() and unbind(). Will head this direction in the next version. Thanks, Jean Best regards, baolu - struct iommu_domain (IOMMU_DOMAIN_SVA type) Represent a hardware pagetable that the IOMMU hardware could use for SVA translation. Multiple iommu domains could be bound with an SVA ioas and each grabs a refcount from ioas in order to make sure ioas could only be freed after all domains have been unbound. - struct iommu_sva Represent a bond relationship between an SVA ioas and an iommu domain. If a bond already exists, it's reused and a reference is taken. Signed-off-by: Lu Baolu --- include/linux/iommu.h | 14 +- drivers/iommu/iommu-sva-lib.h | 1 + drivers/iommu/iommu-sva-lib.c | 18 ++ 3 files changed, 32 insertions(+), 1 deletion(-) diff --git a/include/linux/iommu.h b/include/linux/iommu.h index ab36244d4e94..f582f434c513 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -42,6 +42,7 @@ struct notifier_block; struct iommu_sva; struct iommu_fault_event; struct iommu_dma_cookie; +struct iommu_sva_ioas; /* iommu fault flags */ #define IOMMU_FAULT_READ 0x0 @@ -64,6 +65,9 @@ struct iommu_domain_geometry { #define __IOMMU_DOMAIN_PT (1U << 2) /* Domain is identity mapped */ #define __IOMMU_DOMAIN_DMA_FQ (1U << 3) /* DMA-API uses flush queue*/ +#define __IOMMU_DOMAIN_SHARED (1U << 4) /* Page table shared from CPU */ +#define __IOMMU_DOMAIN_HOST_VA (1U << 5) /* Host CPU virtual address */ + /* * This are the possible domain-types * @@ -86,6 +90,8 @@ struct iommu_domain_geometry { #define IOMMU_DOMAIN_DMA_FQ (__IOMMU_DOMAIN_PAGING |\ __IOMMU_DOMAIN_DMA_API | \ __IOMMU_DOMAIN_DMA_FQ) +#define IOMMU_DOMAIN_SVA (__IOMMU_DOMAIN_SHARED |\ +__IOMMU_DOMAIN_HOST_VA) struct iommu_domain { unsigned type; @@ -95,6 +101,7 @@ struct iommu_domain { void *handler_token; struct iommu_domain_geometry geometry; struct iommu_dma_cookie *iova_cookie; + struct iommu_sva_ioas *sva_ioas; }; static inline bool iommu_is_dma_domain(struct iommu_domain *domain) @@ -628,7 +635,12 @@ struct iommu_fwspec { * struct iommu_sva - handle to a device-mm bond */ struct iommu_sva { - struct device *dev; + struct device *dev; + struct iommu_sva_ioas *sva_ioas; + struct iommu_domain *domain; + /* Link to sva ioas's bonds list */ + struct list_head node; + refcount_t users; }; int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode, diff --git a/drivers/iommu/iommu-sva-lib.h b/drivers/iommu/iommu-sva-lib.h index 8909ea1094e3..9c5e108e2c8a 100644 --- a/drivers/iommu/iommu-sva-lib.h +++
b/drivers/iommu/iommu-sva-lib.h @@ -10,6 +10,7 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max); struct mm_struct *iommu_sva_find(ioasid_t pasid); +struct mm_struct *iommu_sva_domain_mm(struct iommu_domain *domain); /* I/O Page fault */ struct device; diff --git a/drivers/iommu/iommu-sva-lib.c b/drivers/iommu/iommu-sva-lib.c index 106506143896..d524a402be3b 100644 --- a/drivers/iommu/iommu-sva-lib.c +++ b/drivers/iommu/iommu-sva-lib.c @@ -3,6 +3,8 @@ * Helpers for IOMMU drivers implementing SVA */ #include +#include +#include #include #include "iommu-sva-lib.h" @@ -10,6 +12,22 @@ static DEFINE_MUTEX(iommu_sva_lock); static DECLARE_IOASID_SET(iommu_sva_pasid); +struct iommu_sva_ioas { + struct mm_struct *mm; + ioasid_t pasid; + + /* Counter of domains attached to this ioas. */ + refcount_t users; + + /* All bindings are linked here. */ + struct list_head bonds; +}; +
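In code, the direction agreed above would look roughly like this; the member placement is a sketch, not a final layout:

	/* Sketch: drop struct iommu_sva_ioas entirely. */
	struct iommu_domain {
		unsigned type;
		/* ... existing fields ... */
		struct mm_struct *mm;	/* only valid for IOMMU_DOMAIN_SVA */
	};

	struct dev_iommu {
		/* ... existing fields ... */
		struct list_head bonds;	/* iommu_sva bonds held by this device */
	};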
Re: [PATCH v5 03/12] iommu: Add attach/detach_dev_pasid domain ops
On 2022/5/4 02:07, Jean-Philippe Brucker wrote: On Mon, May 02, 2022 at 09:48:33AM +0800, Lu Baolu wrote: Attaching an IOMMU domain to a PASID of a device is a generic operation for modern IOMMU drivers which support PASID-granular DMA address translation. Currently visible usage scenarios include (but not limited): - SVA (Shared Virtual Address) - kernel DMA with PASID - hardware-assist mediated device This adds a pair of common domain ops for this purpose and adds helpers to attach/detach a domain to/from a {device, PASID}. Some buses, like PCI, route packets without considering the PASID value. Thus a DMA target address with PASID might be treated as P2P if the address falls into the MMIO BAR of other devices in the group. To make things simple, these interfaces only apply to devices belonging to the singleton groups, and the singleton is immutable in fabric i.e. not affected by hotplug. Signed-off-by: Lu Baolu Reviewed-by: Jean-Philippe Brucker just a nit below --- include/linux/iommu.h | 21 drivers/iommu/iommu.c | 76 +++ 2 files changed, 97 insertions(+) diff --git a/include/linux/iommu.h b/include/linux/iommu.h index b8ffaf2cb1d0..ab36244d4e94 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -263,6 +263,8 @@ struct iommu_ops { * struct iommu_domain_ops - domain specific operations * @attach_dev: attach an iommu domain to a device * @detach_dev: detach an iommu domain from a device + * @attach_dev_pasid: attach an iommu domain to a pasid of device + * @detach_dev_pasid: detach an iommu domain from a pasid of device * @map: map a physically contiguous memory region to an iommu domain * @map_pages: map a physically contiguous set of pages of the same size to * an iommu domain. @@ -283,6 +285,10 @@ struct iommu_ops { struct iommu_domain_ops { int (*attach_dev)(struct iommu_domain *domain, struct device *dev); void (*detach_dev)(struct iommu_domain *domain, struct device *dev); + int (*attach_dev_pasid)(struct iommu_domain *domain, + struct device *dev, ioasid_t pasid); + void (*detach_dev_pasid)(struct iommu_domain *domain, +struct device *dev, ioasid_t pasid); int (*map)(struct iommu_domain *domain, unsigned long iova, phys_addr_t paddr, size_t size, int prot, gfp_t gfp); @@ -678,6 +684,10 @@ int iommu_group_claim_dma_owner(struct iommu_group *group, void *owner); void iommu_group_release_dma_owner(struct iommu_group *group); bool iommu_group_dma_owner_claimed(struct iommu_group *group); +int iommu_attach_device_pasid(struct iommu_domain *domain, + struct device *dev, ioasid_t pasid); +void iommu_detach_device_pasid(struct iommu_domain *domain, + struct device *dev, ioasid_t pasid); #else /* CONFIG_IOMMU_API */ struct iommu_ops {}; @@ -1051,6 +1061,17 @@ static inline bool iommu_group_dma_owner_claimed(struct iommu_group *group) { return false; } + +static inline int iommu_attach_device_pasid(struct iommu_domain *domain, + struct device *dev, ioasid_t pasid) +{ + return -ENODEV; +} + +static inline void iommu_detach_device_pasid(struct iommu_domain *domain, +struct device *dev, ioasid_t pasid) +{ +} #endif /* CONFIG_IOMMU_API */ /** diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 29906bc16371..89c9d19ddb28 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -38,6 +38,7 @@ struct iommu_group { struct kobject kobj; struct kobject *devices_kobj; struct list_head devices; + struct xarray pasid_array; struct mutex mutex; void *iommu_data; void (*iommu_data_release)(void *iommu_data); @@ -630,6 +631,7 @@ struct iommu_group 
*iommu_group_alloc(void) mutex_init(&group->mutex); INIT_LIST_HEAD(&group->devices); INIT_LIST_HEAD(&group->entry); + xa_init(&group->pasid_array); ret = ida_simple_get(&iommu_group_ida, 0, 0, GFP_KERNEL); if (ret < 0) { @@ -3190,3 +3192,77 @@ bool iommu_group_dma_owner_claimed(struct iommu_group *group) return user; } EXPORT_SYMBOL_GPL(iommu_group_dma_owner_claimed); + +/* + * Use standard PCI bus topology and isolation features to check immutable + * singleton. Otherwise, assume the bus is static and then singleton can + * know from the device count in the group. + */ The comment doesn't really add anything that can't be directly understood from the code. Yes. It's fine to remove it. Best regards, baolu
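On the attach side, the new xarray pairs naturally with xa_insert(), which fails with -EBUSY if the PASID is already in use. A condensed sketch (the singleton check discussed above is omitted):

	int iommu_attach_device_pasid(struct iommu_domain *domain,
				      struct device *dev, ioasid_t pasid)
	{
		struct iommu_group *group = iommu_group_get(dev);
		int ret;

		if (!group)
			return -ENODEV;

		mutex_lock(&group->mutex);
		ret = xa_insert(&group->pasid_array, pasid, domain, GFP_KERNEL);
		if (!ret) {
			ret = domain->ops->attach_dev_pasid(domain, dev, pasid);
			if (ret)
				xa_erase(&group->pasid_array, pasid);
		}
		mutex_unlock(&group->mutex);
		iommu_group_put(group);
		return ret;
	}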
Re: [PATCH v5 02/12] iommu: Add pasid_bits field in struct dev_iommu
On 2022/5/4 02:02, Jean-Philippe Brucker wrote: On Mon, May 02, 2022 at 09:48:32AM +0800, Lu Baolu wrote: Use this field to save the pasid/ssid bits that a device is able to support with its IOMMU hardware. It is a generic attribute of a device and lifting it into the per-device dev_iommu struct makes it possible to allocate a PASID for a device without calls into the IOMMU drivers. Any iommu driver which supports PASID related features should set this field before the features are enabled on the devices. For initialization of this field in the VT-d driver, the info->pasid_supported is only set for PCI devices. So the status is that non-PCI SVA hasn't been supported yet. Setting this field only for PCI devices has no functional change. Signed-off-by: Lu Baolu Reviewed-by: Jean-Philippe Brucker Thank you! I very much appreciate your review of my patches. Best regards, baolu
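As an illustration of how the field could be filled for a PCI device (a sketch; the exact call site in the VT-d patch may differ):

	/* pci_max_pasids() returns the number of PASIDs (a power of two),
	 * so the supported width in bits is its log2. */
	if (dev_is_pci(dev)) {
		int max = pci_max_pasids(to_pci_dev(dev));

		if (max > 0)
			dev->iommu->pasid_bits = fls(max) - 1;
	}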