Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Tue, May 10, 2022 at 04:00:09PM -0300, Jason Gunthorpe wrote:
> On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > On Mon, May 09, 2022 at 11:00:41AM -0300, Jason Gunthorpe wrote:
> > > On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote:
> > > >
> > > > > The default iommu_domain that the iommu driver creates will be used
> > > > > here, it is up to the iommu driver to choose something reasonable for
> > > > > use by applications like DPDK. ie PPC should probably pick its biggest
> > > > > x86-like aperture.
> > > >
> > > > So, using the big aperture means a very high base IOVA
> > > > (1<<59)... which means that it won't work at all if you want to attach
> > > > any devices that aren't capable of 64-bit DMA.
> > >
> > > I'd expect to include the 32 bit window too..
> >
> > I'm not entirely sure what you mean. Are you working on the
> > assumption that we've extended to allowing multiple apertures, so we'd
> > default to advertising both a small/low aperture and a large/high
> > aperture?
>
> Yes

Ok, that works assuming we can advertise multiple windows.

> > > No, this just makes it fragile in the other direction because now
> > > userspace has to know what platform specific things to ask for *or it
> > > doesn't work at all*. This is not an improvement for the DPDK cases.
> >
> > Um.. no. The idea is that userspace requests *what it needs*, not
> > anything platform specific. In the case of DPDK that would be nothing
> > more than the (minimum) aperture size. Nothing platform specific
> > about that.
>
> Except a 32 bit platform can only maybe do a < 4G aperture, a 64 bit
> platform can do more, but it varies how much more, etc.
>
> There is no constant value DPDK could stuff in this request, unless it
> needs a really small amount of IOVA, like 1G or something.

Well, my assumption was that DPDK always wanted an IOVA window to cover
its hugepage buffer space. So not "constant" exactly, but a value it
will know at start up time.

But I think we cover that more closely below.

> > > It isn't like there is some hard coded value we can put into DPDK that
> > > will work on every platform. So kernel must pick for DPDK, IMHO. I
> > > don't see any feasible alternative.
> >
> > Yes, hence *optionally specified* base address only.
>
> Okay, so imagine we've already done this and DPDK is not optionally
> specifying anything :)
>
> The structs can be extended so we can add this as an input to creation
> when a driver can implement it.
>
> > > The ppc specific driver would be on the generic side of qemu in its
> > > viommu support framework. There is lots of host driver optimization
> > > possible here with knowledge of the underlying host iommu HW. It
> > > should not be connected to the qemu target.
> >
> > Thinking through this...
> >
> > So, I guess we could have basically the same logic I'm suggesting be
> > in the qemu backend iommu driver instead. So the target side (machine
> > type, strictly speaking) would request of the host side the apertures
> > it needs, and the host side driver would see if it can do that, based
> > on both specific knowledge of that driver and the query responses.
>
> Yes, this is what I'm thinking
>
> > ppc on x86 should work with that.. at least if the x86 aperture is
> > large enough to reach up to ppc's high window. I guess we'd have the
> > option here of using either the generic host driver or the
> > x86-specific driver. The latter would mean qemu maintaining an
> > x86-format shadow of the io pagetables; mildly tedious, but doable.
>
> The appeal of having userspace page tables is performance, so it is
> tedious to shadow, but it should run faster.

I doubt the difference is meaningful in the context of an emulated
guest, though.

> > So... is there any way of backing out of this gracefully. We could
> > detach the device, but in the meantime ongoing DMA maps from
> > previous devices might have failed.
>
> This sounds like a good use case for qemu to communicate ranges - but
> as I mentioned before Alex said qemu didn't know the ranges..

Yeah, I'm a bit baffled by that, and I don't know the context. Note
that there are at least two very different users of the host IOMMU
backends: one is for emulation of guest DMA (with or without a
vIOMMU). In that case the details of the guest platform should let
qemu know the ranges. There's also a VFIO based NVMe backend; that
one's much more like a "normal" userspace driver, where it doesn't
care about the address ranges (because they're not guest visible).

> > We could pre-attach the new device to a new IOAS and check the
> > apertures there - but when we move it to the final IOAS is it
> > guaranteed that the apertures will be (at least) the intersection of
> > the old and new apertures, or is that just the probable outcome.
>
> Should be guaranteed

Ok; that would need to be documented.

> > Ok.. you convinced me. As long as we have some way to handle the
RE: [PATCH v6 03/12] iommu: Add attach/detach_dev_pasid domain ops
> From: Baolu Lu
> Sent: Wednesday, May 11, 2022 10:32 AM
>
> On 2022/5/10 22:02, Jason Gunthorpe wrote:
> > On Tue, May 10, 2022 at 02:17:29PM +0800, Lu Baolu wrote:
> >
> >> This adds a pair of common domain ops for this purpose and adds helpers
> >> to attach/detach a domain to/from a {device, PASID}.
> >
> > I wonder if this should not have a detach op - after discussing with
> > Robin we can see that detach_dev is not used in updated
> > drivers. Instead attach_dev acts as 'set_domain'
> >
> > So, it would be more symmetrical if attaching a blocking_domain to the
> > PASID was the way to 'detach'.
> >
> > This could be made straightforward by following the sketch I showed to
> > have a static, global blocking_domain and providing a pointer to it in
> > struct iommu_ops
> >
> > Then 'detach pasid' is:
> >
> > iommu_ops->blocking_domain->ops->attach_dev_pasid(domain, dev, pasid);
> >
> > And we move away from the notion of 'detach' and in the direction that
> > everything continuously has a domain set. PASID would logically
> > default to blocking_domain, though we wouldn't track this anywhere.
>
> I am not sure whether we still need to keep the blocking domain concept
> when we are entering the new PASID world. Please allow me to wait and
> listen to more opinions.
>

I'm with Jason on this direction. In concept, after a PASID is detached
it's essentially blocked. Implementation-wise it doesn't prevent the
iommu driver from marking the PASID entry as non-present, as done in
this series, instead of actually pointing to the empty page table of the
blocking domain. But api-wise it does make the entire semantics more
consistent.

Thanks
Kevin
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [PATCH] iommu/vt-d: Increase DMAR_UNITS_SUPPORTED
> From: Steve Wahl
> Sent: Wednesday, May 11, 2022 3:07 AM
>
> On Tue, May 10, 2022 at 01:16:26AM +, Tian, Kevin wrote:
> > > From: Steve Wahl
> > > Sent: Friday, May 6, 2022 11:26 PM
> > >
> > > On Fri, May 06, 2022 at 08:12:11AM +, Tian, Kevin wrote:
> > > > > From: David Woodhouse
> > > > > Sent: Friday, May 6, 2022 3:17 PM
> > > > >
> > > > > On Fri, 2022-05-06 at 06:49 +, Tian, Kevin wrote:
> > > > > > > From: Baolu Lu
> > > > > > >
> > > > > > > > --- a/include/linux/dmar.h
> > > > > > > > +++ b/include/linux/dmar.h
> > > > > > > > @@ -19,7 +19,7 @@
> > > > > > > >  struct acpi_dmar_header;
> > > > > > > >
> > > > > > > >  #ifdef CONFIG_X86
> > > > > > > > -# define DMAR_UNITS_SUPPORTED MAX_IO_APICS
> > > > > > > > +# define DMAR_UNITS_SUPPORTED 640
> > > > > > > >  #else
> > > > > > > >  # define DMAR_UNITS_SUPPORTED 64
> > > > > > > >  #endif
> > > > > >
> > > > > > ... is it necessary to permanently do 10x increase which wastes
> > > > > > memory on most platforms which won't have such need.
> > > > >
> > > > > I was just looking at that. It mostly adds about 3½ KiB to each
> > > > > struct dmar_domain.
> > > > >
> > > > > I think the only actual static array is the dmar_seq_ids bitmap which
> > > > > grows to 640 *bits* which is fairly negligible, and the main growth is
> > > > > that it adds about 3½ KiB to each struct dmar_domain for the
> > > > > iommu_refcnt[] and iommu_did[] arrays.
> > > >
> > > > Thanks for the quick experiment! though the added material is
> > > > negligible it's cleaner to me if having a way to configure it as
> > > > discussed below.
> > > >
> > > > > > Does it make more sense to have a configurable approach similar to
> > > > > > CONFIG_NR_CPUS? or even better can we just replace those static
> > > > > > arrays with dynamic allocation so removing this restriction
> > > > > > completely?
> > > > >
> > > > > Hotplug makes that fun, but I suppose you only need to grow the array
> > > > > in a given struct dmar_domain if you actually add a device to it that's
> > > > > behind a newly added IOMMU. I don't know if the complexity of making it
> > > > > fully dynamic is worth it though. We could make it a config option,
> > > > > and/or a command line option (perhaps automatically derived from
> > > > > CONFIG_NR_CPUS).
> > > >
> > > > either config option or command line option is OK to me. Probably
> > > > the former is simpler given no need to dynamically expand the
> > > > static array. btw though deriving from CONFIG_NR_CPUS could work
> > > > in this case it is unclear why tying the two together is necessary in
> > > > concept, e.g. is there guarantee that the number of IOMMUs must
> > > > be smaller than the number of CPUs in a platform?
> > > >
> > > > > If it wasn't for hotplug, I think we'd know the right number by the
> > > > > time we actually need it anyway, wouldn't we? Can we have a heuristic
> > > > > for how many DMAR units are likely to be hotplugged? Is it as simple as
> > > > > the ratio of present to not-yet-present CPUs in MADT?
> > > >
> > > > Probably. But I don't have enough knowledge on DMAR hotplug to
> > > > judge (e.g. whether it's strictly tied to CPU hotplug and if yes whether
> > > > there could be multiple IOMMUs hotplugged together with a CPU
> > > > socket)...
> > > >
> > > > Thanks
> > > > Kevin
> > >
> > > Would anyone be more comfortable if we only increase the limit where
> > > MAXSMP is set?
> > >
> > > i.e.
> > >
> > > #if defined(CONFIG_X86) && defined(CONFIG_MAXSMP)
> > > # define DMAR_UNITS_SUPPORTED 640
> > > #elif defined(CONFIG_X86)
> > > # define DMAR_UNITS_SUPPORTED MAX_IO_APICS
> > > #else
> > > # define DMAR_UNITS_SUPPORTED 64
> > > #endif
> > >
> > > Thank you all for your time looking at this.
> >
> > This works for your own configuration but it's unclear whether other
> > MAXSMP platforms have the exact same requirements (different
> > number of sockets, different ratio of #iommus/#sockets, etc.). In any
> > case since we are at it having a generic way to extend it makes more
> > sense to me.
>
> So, to be clear, what you would like to see would be Kconfig entries
> to create a config option, say "NR_DMARS", set up so the default is:
>
> MAXSMP? 640

usually we do 2's power thus 1024 is more reasonable. If people do care
about the exact memory footprint they can always manually change it.

> X86_64? 128
> X86_32? 64
> other 64
>
> And DMAR_UNITS_SUPPORTED gets removed, and everywhere it was used we
> use CONFIG_NR_DMARS in its place?

Let's keep DMAR_UNITS_SUPPORTED and just redefine it to be
CONFIG_NR_DMARS for less changes.

> I can give that a shot but wanted to confirm this is what you'd want
> first.
>
> Thanks,
>
> --> Steve
>
> --
> Steve Wahl, Hewlett Packard Enterprise
RE: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
> From: Jason Gunthorpe
> Sent: Wednesday, May 11, 2022 3:00 AM
>
> On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote:
> > Ok... here's a revised version of my proposal which I think addresses
> > your concerns and simplifies things.
> >
> > - No new operations, but IOAS_MAP gets some new flags (and IOAS_COPY
> >   will probably need matching changes)
> >
> > - By default the IOVA given to IOAS_MAP is a hint only, and the IOVA
> >   is chosen by the kernel within the aperture(s). This is closer to
> >   how mmap() operates, and DPDK and similar shouldn't care about
> >   having specific IOVAs, even at the individual mapping level.
> >
> > - IOAS_MAP gets an IOMAP_FIXED flag, analogous to mmap()'s MAP_FIXED,
> >   for when you really do want to control the IOVA (qemu, maybe some
> >   special userspace driver cases)
>
> We already did both of these, the flag is called
> IOMMU_IOAS_MAP_FIXED_IOVA - if it is not specified then kernel will
> select the IOVA internally.
>
> > - ATTACH will fail if the new device would shrink the aperture to
> >   exclude any already established mappings (I assume this is already
> >   the case)
>
> Yes
>
> > - IOAS_MAP gets an IOMAP_RESERVE flag, which operates a bit like a
> >   PROT_NONE mmap(). It reserves that IOVA space, so other (non-FIXED)
> >   MAPs won't use it, but doesn't actually put anything into the IO
> >   pagetables.
> > - Like a regular mapping, ATTACHes that are incompatible with an
> >   IOMAP_RESERVEed region will fail
> > - An IOMAP_RESERVEed area can be overmapped with an IOMAP_FIXED
> >   mapping
>
> Yeah, this seems OK, I'm thinking a new API might make sense because
> you don't really want mmap replacement semantics but a permanent
> record of what IOVA must always be valid.
>
> IOMMU_IOA_REQUIRE_IOVA perhaps, similar signature to
> IOMMUFD_CMD_IOAS_IOVA_RANGES:
>
> struct iommu_ioas_require_iova {
>         __u32 size;
>         __u32 ioas_id;
>         __u32 num_iovas;
>         __u32 __reserved;
>         struct iommu_required_iovas {
>                 __aligned_u64 start;
>                 __aligned_u64 last;
>         } required_iovas[];
> };

As a permanent record do we want to enforce that once the required
range list is set all FIXED and non-FIXED allocations must be within
the list of ranges?

If yes we can take the end of the last range as the max size of the
iova address space to optimize the page table layout. otherwise we may
need another dedicated hint for that optimization.

> > So, for DPDK the sequence would be:
> >
> > 1. Create IOAS
> > 2. ATTACH devices
> > 3. IOAS_MAP some stuff
> > 4. Do DMA with the IOVAs that IOAS_MAP returned
> >
> > (Note, not even any need for QUERY in simple cases)
>
> Yes, this is done already
>
> > For (unoptimized) qemu it would be:
> >
> > 1. Create IOAS
> > 2. IOAS_MAP(IOMAP_FIXED|IOMAP_RESERVE) the valid IOVA regions of the
> >    guest platform
> > 3. ATTACH devices (this will fail if they're not compatible with the
> >    reserved IOVA regions)
> > 4. Boot the guest

I suppose above is only the sample flow for PPC vIOMMU. For non-PPC
vIOMMUs regular mappings are required before booting the guest, and
reservation might be done but is not mandatory (at least not what
current Qemu vfio can afford, as it simply replays valid ranges in the
CPU address space).

> >   (on guest map/invalidate) -> IOAS_MAP(IOMAP_FIXED) to overmap part of
> >     the reserved regions
> >   (on dev hotplug) -> ATTACH (which might fail, if it conflicts with the
> >     reserved regions)
> >   (on vIOMMU reconfiguration) -> UNMAP/MAP reserved regions as
> >     necessary (which might fail)
>
> OK, I will take care of it
>
> Thanks,
> Jason
RE: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
> From: Jason Gunthorpe
> Sent: Monday, May 9, 2022 10:01 PM
>
> On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote:
>
> > Which is why I'm suggesting that the base address be an optional
> > request. DPDK *will* care about the size of the range, so it just
> > requests that and gets told a base address.
>
> We've talked about a size of IOVA address space before, strictly as a
> hint, to possibly optimize page table layout, or something, and I'm
> fine with that idea. But - we have no driver implementation today, so
> I'm not sure what we can really do with this right now..
>
> Kevin could Intel consume a hint on IOVA space and optimize the number
> of IO page table levels?

It could, but not implemented now.
Re: [PATCH v6 03/12] iommu: Add attach/detach_dev_pasid domain ops
On 2022/5/10 22:02, Jason Gunthorpe wrote:
> On Tue, May 10, 2022 at 02:17:29PM +0800, Lu Baolu wrote:
>> This adds a pair of common domain ops for this purpose and adds helpers
>> to attach/detach a domain to/from a {device, PASID}.
>
> I wonder if this should not have a detach op - after discussing with
> Robin we can see that detach_dev is not used in updated
> drivers. Instead attach_dev acts as 'set_domain'
>
> So, it would be more symmetrical if attaching a blocking_domain to the
> PASID was the way to 'detach'.
>
> This could be made straightforward by following the sketch I showed to
> have a static, global blocking_domain and providing a pointer to it in
> struct iommu_ops
>
> Then 'detach pasid' is:
>
> iommu_ops->blocking_domain->ops->attach_dev_pasid(domain, dev, pasid);
>
> And we move away from the notion of 'detach' and in the direction that
> everything continuously has a domain set. PASID would logically
> default to blocking_domain, though we wouldn't track this anywhere.

I am not sure whether we still need to keep the blocking domain concept
when we are entering the new PASID world. Please allow me to wait and
listen to more opinions.

Best regards,
baolu
Re: [PATCH v6 02/12] iommu: Add pasid_bits field in struct dev_iommu
On 2022/5/10 22:34, Jason Gunthorpe wrote:
> On Tue, May 10, 2022 at 02:17:28PM +0800, Lu Baolu wrote:
>>  int iommu_device_register(struct iommu_device *iommu,
>> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> index 627a3ed5ee8f..afc63fce6107 100644
>> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
>> @@ -2681,6 +2681,8 @@ static struct iommu_device *arm_smmu_probe_device(struct device *dev)
>>          smmu->features & ARM_SMMU_FEAT_STALL_FORCE)
>>              master->stall_enabled = true;
>>
>> +    dev->iommu->pasid_bits = master->ssid_bits;
>>      return &master->iommu;
>>
>>  err_free_master:
>> diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
>> index 2990f80c5e08..99643f897f26 100644
>> +++ b/drivers/iommu/intel/iommu.c
>> @@ -4624,8 +4624,11 @@ static struct iommu_device *intel_iommu_probe_device(struct device *dev)
>>      if (pasid_supported(iommu)) {
>>          int features = pci_pasid_features(pdev);
>>
>> -        if (features >= 0)
>> +        if (features >= 0) {
>>              info->pasid_supported = features | 1;
>> +            dev->iommu->pasid_bits =
>> +                fls(pci_max_pasids(pdev)) - 1;
>> +        }
>
> It is not very nice that both the iommu drivers have to duplicate the
> code to read the pasid capability out of the PCI device.
>
> IMHO it would make more sense for the iommu layer to report the
> capability of its own HW block only, and for the core code to figure
> out the master's limitation using a bus-specific approach.

Fair enough. The iommu hardware capability could be reported in

/**
 * struct iommu_device - IOMMU core representation of one IOMMU hardware
 *                       instance
 * @list: Used by the iommu-core to keep a list of registered iommus
 * @ops: iommu-ops for talking to this iommu
 * @dev: struct device for sysfs handling
 */
struct iommu_device {
        struct list_head list;
        const struct iommu_ops *ops;
        struct fwnode_handle *fwnode;
        struct device *dev;
};

I haven't checked ARM code yet, but it works for x86 as far as I can see.

> It is also unfortunate that the enable/disable pasid is inside the
> iommu driver as well - ideally the PCI driver itself would do this when
> it knows it wants to use PASIDs. The ordering interaction with ATS
> makes this look quite annoying though. :(
>
> I'm also not convinced individual IOMMU drivers should be forcing ATS
> on, there are performance and functional implications here. Using ATS
> or not is possibly best left as an administrator policy controlled by
> the core code. Again we seem to have some mess.

Agreed with you. This has already been in my task list. I will start to
solve it after the iommufd tasks.

Best regards,
baolu
Re: [PATCH v7 5/7] perf tool: Add support for HiSilicon PCIe Tune and Trace device driver
Hi James,

On 2022/5/10 18:14, James Clark wrote:
> On 07/04/2022 13:58, Yicong Yang wrote:
>> From: Qi Liu

[...]

>>  struct auxtrace_record *auxtrace_record__init(struct evlist *evlist,
>>                                                int *err)
>>  {
>> @@ -57,8 +112,12 @@ struct auxtrace_record
>>      struct evsel *evsel;
>>      bool found_etm = false;
>>      struct perf_pmu *found_spe = NULL;
>> +    struct perf_pmu *found_ptt = NULL;
>>      struct perf_pmu **arm_spe_pmus = NULL;
>> +    struct perf_pmu **hisi_ptt_pmus = NULL;
>>      int nr_spes = 0;
>> +    int nr_ptts = 0;
>>      int i = 0;
>>
>>      if (!evlist)
>> @@ -66,13 +125,14 @@ struct auxtrace_record
>>      cs_etm_pmu = perf_pmu__find(CORESIGHT_ETM_PMU_NAME);
>>      arm_spe_pmus = find_all_arm_spe_pmus(&nr_spes, err);
>> +    hisi_ptt_pmus = find_all_hisi_ptt_pmus(&nr_ptts, err);
>>
>>      evlist__for_each_entry(evlist, evsel) {
>>          if (cs_etm_pmu && evsel->core.attr.type == cs_etm_pmu->type)
>>              found_etm = true;
>>
>> -        if (!nr_spes || found_spe)
>> +        if ((!nr_spes || found_spe) && (!nr_ptts || found_ptt))
>>              continue;
>>
>>          for (i = 0; i < nr_spes; i++) {
>> @@ -81,11 +141,18 @@ struct auxtrace_record
>>                  break;
>>              }
>>          }
>> +
>> +        for (i = 0; i < nr_ptts; i++) {
>> +            if (evsel->core.attr.type == hisi_ptt_pmus[i]->type) {
>> +                found_ptt = hisi_ptt_pmus[i];
>> +                break;
>> +            }
>> +        }
>>      }
>>      free(arm_spe_pmus);
>>
>> -    if (found_etm && found_spe) {
>> -        pr_err("Concurrent ARM Coresight ETM and SPE operation not currently supported\n");
>> +    if (found_etm && found_spe && found_ptt) {
>> +        pr_err("Concurrent ARM Coresight ETM ,SPE and HiSilicon PCIe Trace operation not currently supported\n");
>
> Hi Yicong,
>
> Is that actually a limitation? I don't see why they couldn't work
> concurrently.

As Leo said, the logic here should be like this:

    int auxtrace_event_cnt = 0;

    if (found_etm)
        auxtrace_event_cnt++;
    if (found_spe)
        auxtrace_event_cnt++;
    if (found_ptt)
        auxtrace_event_cnt++;

    if (auxtrace_event_cnt > 1) {
        pr_err("Concurrent AUX trace operation isn't supported: found etm %d spe %d ptt %d\n",
               found_etm, found_spe, found_ptt);
        *err = -EOPNOTSUPP;
        return NULL;
    }

which means perf doesn't allow more than one auxtrace event recording
at the same time.

>>          *err = -EOPNOTSUPP;
>>          return NULL;
>>      }
>> @@ -96,6 +163,9 @@ struct auxtrace_record
>>  #if defined(__aarch64__)
>>      if (found_spe)
>>          return arm_spe_recording_init(err, found_spe);
>> +
>> +    if (found_ptt)
>> +        return hisi_ptt_recording_init(err, found_ptt);
>>  #endif
>>
>>  /*

[...]

>> +
>> +static int hisi_ptt_recording_options(struct auxtrace_record *itr,
>> +                                      struct evlist *evlist,
>> +                                      struct record_opts *opts)
>> +{
>> +    struct hisi_ptt_recording *pttr =
>> +        container_of(itr, struct hisi_ptt_recording, itr);
>> +    struct perf_pmu *hisi_ptt_pmu = pttr->hisi_ptt_pmu;
>> +    struct perf_cpu_map *cpus = evlist->core.cpus;
>> +    struct evsel *evsel, *hisi_ptt_evsel = NULL;
>> +    struct evsel *tracking_evsel;
>> +    int err;
>> +
>> +    pttr->evlist = evlist;
>> +    evlist__for_each_entry(evlist, evsel) {
>> +        if (evsel->core.attr.type == hisi_ptt_pmu->type) {
>> +            if (hisi_ptt_evsel) {
>> +                pr_err("There may be only one " HISI_PTT_PMU_NAME "x event\n");
>> +                return -EINVAL;
>> +            }
>> +            evsel->core.attr.freq = 0;
>> +            evsel->core.attr.sample_period = 1;
>> +            hisi_ptt_evsel = evsel;
>> +            opts->full_auxtrace = true;
>> +        }
>> +    }
>> +
>> +    err = hisi_ptt_set_auxtrace_mmap_page(opts);
>> +    if (err)
>> +        return err;
>> +    /*
>> +     * To obtain the auxtrace buffer file descriptor, the auxtrace event
>> +     * must come first.
>> +     */
>> +    evlist__to_front(evlist, hisi_ptt_evsel);
>> +
>> +    if (!perf_cpu_map__empty(cpus)) {
>> +        evsel__set_sample_bit(hisi_ptt_evsel, TIME);
>> +        evsel__set_sample_bit(hisi_ptt_evsel, CPU);
>> +    }
>
> Similar to Leo's comment: CPU isn't required if it's uncore, and if
> TIME is useful then add it regardless of whether the event is opened
> per-cpu or on a task.

got it, will fix this next time.

>> +
>> +    /* Add dummy event to keep tracking */
>> +    err = parse_events(evlist, "dummy:u", NULL);
>> +    if (err)
>> +        return err;
>> +
>> +    tracking_evsel = evlist__last(evlist);
>> +
RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
> From: Joao Martins
> Sent: Tuesday, May 10, 2022 7:51 PM
>
> On 5/10/22 02:38, Tian, Kevin wrote:
> >> From: Jason Gunthorpe
> >> Sent: Friday, May 6, 2022 7:46 PM
> >>
> >> On Fri, May 06, 2022 at 03:51:40AM +, Tian, Kevin wrote:
> >>>> From: Jason Gunthorpe
> >>>> Sent: Thursday, May 5, 2022 10:08 PM
> >>>>
> >>>> On Thu, May 05, 2022 at 07:40:37AM +, Tian, Kevin wrote:
> >>>>
> >>>>> In concept this is an iommu property instead of a domain property.
> >>>>
> >>>> Not really, domains shouldn't be changing behaviors once they are
> >>>> created. If a domain supports dirty tracking and I attach a new device
> >>>> then it still must support dirty tracking.
> >>>
> >>> That sort of suggests that userspace should specify whether a domain
> >>> supports dirty tracking when it's created. But how does userspace
> >>> know that it should create the domain in this way in the first place?
> >>> live migration is triggered on demand and it may not happen in the
> >>> lifetime of a VM.
> >>
> >> The best you could do is to look at the devices being plugged in at VM
> >> startup, and if they all support live migration then request dirty
> >> tracking, otherwise don't.
> >
> > Yes, this is how a device capability can help.
> >
> >> However, it costs nothing to have dirty tracking as long as all iommus
> >> support it in the system - which seems to be the normal case today.
> >>
> >> We should just always turn it on at this point.
> >
> > Then still need a way to report "all iommus support it in the system"
> > to userspace since many old systems don't support it at all. If we all
> > agree that a device capability flag would be helpful on this front (like
> > you also said below), probably can start building the initial skeleton
> > with that in mind?
>
> This would capture device-specific and maybe iommu-instance features,
> but there's some tiny bit odd semantic here. There's nothing that
> depends on the device to support any of this, but rather the IOMMU
> instance that sits below the device, which is independent of device-own
> capabilities. e.g. PRI on the other hand would be a perfect fit for a
> device capability (?), but dirty tracking conveying over a device
> capability would be a convenience rather than an exact hw
> representation.

it is sort of getting certain iommu capability for a given device, as
how the iommu kAPI is moving toward.

> Thinking out loud if we are going as a device/iommu capability [to see
> if this matches what people have or not in mind]: we would add a
> dirty-tracking feature bit via the existent kAPI for iommu device
> features (e.g. IOMMU_DEV_FEAT_AD) and on iommufd we would maybe add an
> IOMMUFD_CMD_DEV_GET_IOMMU_FEATURES ioctl which would have an u64
> dev_id as input (from the returned vfio-pci BIND_IOMMUFD @out_dev_id)
> and u64 features as an output bitmap of synthetic feature bits, having
> IOMMUFD_FEATURE_AD the only one we query (and
> IOMMUFD_FEATURE_{SVA,IOPF} as potentially future candidates). Qemu at
> start of day would then check if /all devices/ support it and it would
> then still do the blind set tracking, but bail out preemptively if any
> of the device-iommus don't support dirty-tracking. I don't think we
> have any case today for having to deal with different IOMMU instances
> that have different features.

This heterogeneity already exists today. On Intel platform not all
IOMMUs support force snooping. I believe ARM has a similar situation,
which is why Robin is refactoring bus-oriented iommu_capable() etc. to
be device-oriented.

I'm not aware of such heterogeneity particularly for dirty tracking
today. But who knows it won't happen in the future? I just feel that
aligning the iommufd uAPI to the iommu kAPI for capability reporting
might be more future proof here.

> Either that or as discussed in the beginning perhaps add an iommufd
> (or iommufd hwpt one) ioctl call (e.g. IOMMUFD_CMD_CAP) via an input
> value (e.g. subop IOMMU_FEATURES) which would give us a structure of
> things (e.g. for the IOMMU_FEATURES subop the common featureset bitmap
> in all iommu instances). This would give the 'all iommus support it in
> the system'. Albeit the device one might have more concrete longevity
> if there's further plans aside from dirty tracking.
>
> >>> and if the user always creates domain to allow dirty tracking by default,
> >>> how does it know a failed attach is due to missing dirty tracking support
> >>> by the IOMMU and then creates another domain which disables dirty
> >>> tracking and retry-attach again?
> >>
> >> The automatic logic is complicated for sure, if you had a device flag
> >> it would have to figure it out that way
> >
> > Yes. That is the model in my mind.
> >
> > Thanks
> > Kevin
RE: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
> From: Jason Gunthorpe
> Sent: Tuesday, May 10, 2022 9:47 PM
>
> On Tue, May 10, 2022 at 01:38:26AM +, Tian, Kevin wrote:
> > > However, it costs nothing to have dirty tracking as long as all iommus
> > > support it in the system - which seems to be the normal case today.
> > >
> > > We should just always turn it on at this point.
> >
> > Then still need a way to report "all iommus support it in the system"
> > to userspace since many old systems don't support it at all.
>
> Userspace can query the iommu_domain directly, or 'try and fail' to
> turn on tracking.
>
> A device capability flag is useless without a control knob to request
> a domain is created with tracking, and we don't have that, or a reason
> to add that.

I'm getting confused on your last comment. A capability flag has to
accompany a control knob, which iiuc is what you advocated in earlier
discussion, i.e. specifying the tracking property when creating the
domain. In this case the flag assists the userspace in deciding whether
to set the property.

Not sure whether we argued past each other, but here is another
attempt. In general I see three options here:

a) 'try and fail' when creating the domain. It succeeds only when all
   iommus support tracking;

b) capability reported on the iommu domain. The capability is reported
   true only when all iommus support tracking. This allows the domain
   property to be set after the domain is created. But there is not
   much gain in doing so compared to a);

c) capability reported on the device. Future compatible for
   heterogeneous platforms. The domain property is specified at domain
   creation, and domains can have different properties based on the
   tracking capability of the attached devices.

I'm inclined to c) as it is more aligned to Robin's cleanup effort on
iommu_capable() and iommu_present() in the iommu layer, which moves
away from a global manner to a per-device style.

Along with that direction I guess we want to discourage adding more
APIs assuming the 'all iommus supporting certain capability' thing?

Thanks
Kevin
Re: [PATCH v3 2/4] iommu: Add PASID support for DMA mapping API users
Hi Jason, On Tue, 10 May 2022 20:28:04 -0300, Jason Gunthorpe wrote: > On Tue, May 10, 2022 at 02:07:02PM -0700, Jacob Pan wrote: > > DMA mapping API is the de facto standard for in-kernel DMA. It operates > > on a per device/RID basis which is not PASID-aware. > > > > Some modern devices such as Intel Data Streaming Accelerator, PASID is > > required for certain work submissions. To allow such devices use DMA > > mapping API, we need the following functionalities: > > 1. Provide device a way to retrieve a PASID for work submission within > > the kernel > > 2. Enable the kernel PASID on the IOMMU for the device > > 3. Attach the kernel PASID to the device's default DMA domain, let it > > be IOVA or physical address in case of pass-through. > > > > This patch introduces a driver facing API that enables DMA API > > PASID usage. Once enabled, device drivers can continue to use DMA APIs > > as is. There is no difference in dma_handle between without PASID and > > with PASID. > > > > Signed-off-by: Jacob Pan > > drivers/iommu/dma-iommu.c | 107 ++ > > include/linux/dma-iommu.h | 3 ++ > > include/linux/iommu.h | 2 + > > 3 files changed, 112 insertions(+) > > > > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c > > index 1ca85d37eeab..5984f3129fa2 100644 > > +++ b/drivers/iommu/dma-iommu.c > > @@ -34,6 +34,8 @@ struct iommu_dma_msi_page { > > phys_addr_t phys; > > }; > > > > +static DECLARE_IOASID_SET(iommu_dma_pasid); > > + > > enum iommu_dma_cookie_type { > > IOMMU_DMA_IOVA_COOKIE, > > IOMMU_DMA_MSI_COOKIE, > > @@ -370,6 +372,111 @@ void iommu_put_dma_cookie(struct iommu_domain > > *domain) domain->iova_cookie = NULL; > > } > > > > +/** > > + * iommu_attach_dma_pasid --Attach a PASID for in-kernel DMA. Use the > > device's > > + * DMA domain. > > + * @dev: Device to be enabled > > + * @pasid: The returned kernel PASID to be used for DMA > > + * > > + * DMA request with PASID will be mapped the same way as the legacy > > DMA. 
> > + * If the device is in pass-through, PASID will also pass-through. If > > the > > + * device is in IOVA, the PASID will point to the same IOVA page table. > > + * > > + * @return err code or 0 on success > > + */ > > +int iommu_attach_dma_pasid(struct device *dev, ioasid_t *pasid) > > +{ > > + struct iommu_domain *dom; > > + ioasid_t id, max; > > + int ret = 0; > > + > > + dom = iommu_get_domain_for_dev(dev); > > + if (!dom || !dom->ops || !dom->ops->attach_dev_pasid) > > + return -ENODEV; > > + > > + /* Only support domain types that DMA API can be used */ > > + if (dom->type == IOMMU_DOMAIN_UNMANAGED || > > + dom->type == IOMMU_DOMAIN_BLOCKED) { > > + dev_warn(dev, "Invalid domain type %d", dom->type); > > This should be a WARN_ON > will do, thanks > > + return -EPERM; > > + } > > + > > + id = dom->pasid; > > + if (!id) { > > + /* > > +* First device to use PASID in its DMA domain, > > allocate > > +* a single PASID per DMA domain is all we need, it is > > also > > +* good for performance when it comes down to IOTLB > > flush. > > +*/ > > + max = 1U << dev->iommu->pasid_bits; > > + if (!max) > > + return -EINVAL; > > + > > + id = ioasid_alloc(&iommu_dma_pasid, 1, max, dev); > > + if (id == INVALID_IOASID) > > + return -ENOMEM; > > + > > + dom->pasid = id; > > + atomic_set(&dom->pasid_users, 1); > > All of this needs proper locking. > good catch, will add a mutex for domain updates, detach as well. 
> > + } > > + > > + ret = dom->ops->attach_dev_pasid(dom, dev, id); > > + if (!ret) { > > + *pasid = id; > > + atomic_inc(&dom->pasid_users); > > + return 0; > > + } > > + > > + if (atomic_dec_and_test(&dom->pasid_users)) { > > + ioasid_free(id); > > + dom->pasid = 0; > > + } > > + > > + return ret; > > +} > > +EXPORT_SYMBOL(iommu_attach_dma_pasid); > > + > > +/** > > + * iommu_detach_dma_pasid --Disable in-kernel DMA request with PASID > > + * @dev: Device's PASID DMA to be disabled > > + * > > + * It is the device driver's responsibility to ensure no more incoming > > DMA > > + * requests with the kernel PASID before calling this function. IOMMU > > driver > > + * ensures PASID cache, IOTLBs related to the kernel PASID are cleared > > and > > + * drained. > > + * > > + */ > > +void iommu_detach_dma_pasid(struct device *dev) > > +{ > > + struct iommu_domain *dom; > > + ioasid_t pasid; > > + > > + dom = iommu_get_domain_for_dev(dev); > > + if (!dom || !dom->ops || !dom->ops->detach_dev_pasid) { > > + dev_warn(dev, "No ops for detaching PASID %u", pasid); > > + return; > > + } > > + /* Only support DMA API managed domain type */ > > + if (dom->type == IOMMU_DOMAIN_UNMANAGED || > > +
Re: [PATCH v3 1/4] iommu/vt-d: Implement domain ops for attach_dev_pasid
Hi Jason, On Tue, 10 May 2022 20:21:21 -0300, Jason Gunthorpe wrote: > On Tue, May 10, 2022 at 02:07:01PM -0700, Jacob Pan wrote: > > +static int intel_iommu_attach_dev_pasid(struct iommu_domain *domain, > > + struct device *dev, > > + ioasid_t pasid) > > +{ > > + struct device_domain_info *info = dev_iommu_priv_get(dev); > > + struct dmar_domain *dmar_domain = to_dmar_domain(domain); > > + struct intel_iommu *iommu = info->iommu; > > + unsigned long flags; > > + int ret = 0; > > + > > + if (!sm_supported(iommu) || !info) > > + return -ENODEV; > > + > > + spin_lock_irqsave(&device_domain_lock, flags); > > + /* > > +* If the same device already has a PASID attached, just > > return. > > +* DMA layer will return the PASID value to the caller. > > +*/ > > + if (pasid != PASID_RID2PASID && info->pasid) { > > Why check for PASID == 0 like this? Shouldn't pasid == 0 be rejected > as an invalid argument? Right, I was planning on reuse the attach function for RIDPASID as clean up, but didn't include here. Will fix. > > > + if (info->pasid == pasid) > > + ret = 0; > > Doesn't this need to check that the current domain is the requested > domain as well? How can this happen anyhow - isn't it an error to > double attach? > > > diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h > > index 5af24befc9f1..55845a8c4f4d 100644 > > +++ b/include/linux/intel-iommu.h > > @@ -627,6 +627,7 @@ struct device_domain_info { > > struct intel_iommu *iommu; /* IOMMU used by this device */ > > struct dmar_domain *domain; /* pointer to domain */ > > struct pasid_table *pasid_table; /* pasid table */ > > + ioasid_t pasid; /* DMA request with PASID */ > > And this seems wrong - the DMA API is not the only user of > attach_dev_pasid, so there should not be any global pasid for the > device. > True but the attach_dev_pasid() op is domain type specific. i.e. DMA API has its own attach_dev_pasid which is different than sva domain attach_dev_pasid(). 
device_domain_info is only used by DMA API. > I suspect this should be a counter of # of pasid domains attached so > that the special flush logic triggers > This field is only used for devTLB, so it is per domain-device. struct device_domain_info is allocated per device-domain as well. Sorry, I might have totally missed your point. > And rely on the core code to worry about assigning only one domain per > pasid - this should really be a 'set' function. > Yes, in this set the core code (in dma-iommu.c) only assign one PASID per DMA domain type. Are you suggesting the dma-iommu API should be called iommu_set_dma_pasid instead of iommu_attach_dma_pasid? Thanks a lot for the quick review! Jacob ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3 2/4] iommu: Add PASID support for DMA mapping API users
On Tue, May 10, 2022 at 02:07:02PM -0700, Jacob Pan wrote: > DMA mapping API is the de facto standard for in-kernel DMA. It operates > on a per device/RID basis which is not PASID-aware. > > Some modern devices such as Intel Data Streaming Accelerator, PASID is > required for certain work submissions. To allow such devices use DMA > mapping API, we need the following functionalities: > 1. Provide device a way to retrieve a PASID for work submission within > the kernel > 2. Enable the kernel PASID on the IOMMU for the device > 3. Attach the kernel PASID to the device's default DMA domain, let it > be IOVA or physical address in case of pass-through. > > This patch introduces a driver facing API that enables DMA API > PASID usage. Once enabled, device drivers can continue to use DMA APIs as > is. There is no difference in dma_handle between without PASID and with > PASID. > > Signed-off-by: Jacob Pan > drivers/iommu/dma-iommu.c | 107 ++ > include/linux/dma-iommu.h | 3 ++ > include/linux/iommu.h | 2 + > 3 files changed, 112 insertions(+) > > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c > index 1ca85d37eeab..5984f3129fa2 100644 > +++ b/drivers/iommu/dma-iommu.c > @@ -34,6 +34,8 @@ struct iommu_dma_msi_page { > phys_addr_t phys; > }; > > +static DECLARE_IOASID_SET(iommu_dma_pasid); > + > enum iommu_dma_cookie_type { > IOMMU_DMA_IOVA_COOKIE, > IOMMU_DMA_MSI_COOKIE, > @@ -370,6 +372,111 @@ void iommu_put_dma_cookie(struct iommu_domain *domain) > domain->iova_cookie = NULL; > } > > +/** > + * iommu_attach_dma_pasid --Attach a PASID for in-kernel DMA. Use the > device's > + * DMA domain. > + * @dev: Device to be enabled > + * @pasid: The returned kernel PASID to be used for DMA > + * > + * DMA request with PASID will be mapped the same way as the legacy DMA. > + * If the device is in pass-through, PASID will also pass-through. If the > + * device is in IOVA, the PASID will point to the same IOVA page table. 
> + * > + * @return err code or 0 on success > + */ > +int iommu_attach_dma_pasid(struct device *dev, ioasid_t *pasid) > +{ > + struct iommu_domain *dom; > + ioasid_t id, max; > + int ret = 0; > + > + dom = iommu_get_domain_for_dev(dev); > + if (!dom || !dom->ops || !dom->ops->attach_dev_pasid) > + return -ENODEV; > + > + /* Only support domain types that DMA API can be used */ > + if (dom->type == IOMMU_DOMAIN_UNMANAGED || > + dom->type == IOMMU_DOMAIN_BLOCKED) { > + dev_warn(dev, "Invalid domain type %d", dom->type); This should be a WARN_ON > + return -EPERM; > + } > + > + id = dom->pasid; > + if (!id) { > + /* > + * First device to use PASID in its DMA domain, allocate > + * a single PASID per DMA domain is all we need, it is also > + * good for performance when it comes down to IOTLB flush. > + */ > + max = 1U << dev->iommu->pasid_bits; > + if (!max) > + return -EINVAL; > + > + id = ioasid_alloc(&iommu_dma_pasid, 1, max, dev); > + if (id == INVALID_IOASID) > + return -ENOMEM; > + > + dom->pasid = id; > + atomic_set(&dom->pasid_users, 1); All of this needs proper locking. > + } > + > + ret = dom->ops->attach_dev_pasid(dom, dev, id); > + if (!ret) { > + *pasid = id; > + atomic_inc(&dom->pasid_users); > + return 0; > + } > + > + if (atomic_dec_and_test(&dom->pasid_users)) { > + ioasid_free(id); > + dom->pasid = 0; > + } > + > + return ret; > +} > +EXPORT_SYMBOL(iommu_attach_dma_pasid); > + > +/** > + * iommu_detach_dma_pasid --Disable in-kernel DMA request with PASID > + * @dev: Device's PASID DMA to be disabled > + * > + * It is the device driver's responsibility to ensure no more incoming DMA > + * requests with the kernel PASID before calling this function. IOMMU driver > + * ensures PASID cache, IOTLBs related to the kernel PASID are cleared and > + * drained. 
> + * > + */ > +void iommu_detach_dma_pasid(struct device *dev) > +{ > + struct iommu_domain *dom; > + ioasid_t pasid; > + > + dom = iommu_get_domain_for_dev(dev); > + if (!dom || !dom->ops || !dom->ops->detach_dev_pasid) { > + dev_warn(dev, "No ops for detaching PASID %u", pasid); > + return; > + } > + /* Only support DMA API managed domain type */ > + if (dom->type == IOMMU_DOMAIN_UNMANAGED || > + dom->type == IOMMU_DOMAIN_BLOCKED) { > + dev_err(dev, "Invalid domain type %d to detach DMA PASID %u\n", > + dom->type, pasid); > + return; > + } > + pasid = dom->pasid; > + if (!pasid) { > + dev_err(dev, "No DMA PASID attached\n"); > + return; > + } All WARN_ON's
Re: [PATCH v3 1/4] iommu/vt-d: Implement domain ops for attach_dev_pasid
On Tue, May 10, 2022 at 02:07:01PM -0700, Jacob Pan wrote: > +static int intel_iommu_attach_dev_pasid(struct iommu_domain *domain, > + struct device *dev, > + ioasid_t pasid) > +{ > + struct device_domain_info *info = dev_iommu_priv_get(dev); > + struct dmar_domain *dmar_domain = to_dmar_domain(domain); > + struct intel_iommu *iommu = info->iommu; > + unsigned long flags; > + int ret = 0; > + > + if (!sm_supported(iommu) || !info) > + return -ENODEV; > + > + spin_lock_irqsave(&device_domain_lock, flags); > + /* > + * If the same device already has a PASID attached, just return. > + * DMA layer will return the PASID value to the caller. > + */ > + if (pasid != PASID_RID2PASID && info->pasid) { Why check for PASID == 0 like this? Shouldn't pasid == 0 be rejected as an invalid argument? > + if (info->pasid == pasid) > + ret = 0; Doesn't this need to check that the current domain is the requested domain as well? How can this happen anyhow - isn't it an error to double attach? > diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h > index 5af24befc9f1..55845a8c4f4d 100644 > +++ b/include/linux/intel-iommu.h > @@ -627,6 +627,7 @@ struct device_domain_info { > struct intel_iommu *iommu; /* IOMMU used by this device */ > struct dmar_domain *domain; /* pointer to domain */ > struct pasid_table *pasid_table; /* pasid table */ > + ioasid_t pasid; /* DMA request with PASID */ And this seems wrong - the DMA API is not the only user of attach_dev_pasid, so there should not be any global pasid for the device. I suspect this should be a counter of # of pasid domains attached so that the special flush logic triggers And rely on the core code to worry about assigning only one domain per pasid - this should really be a 'set' function. Jason
[PATCH v3 3/4] dmaengine: idxd: Use DMA API for in-kernel DMA with PASID
The current in-kernel supervisor PASID support is based on the SVM/SVA machinery in SVA lib. The binding between a kernel PASID and kernel mapping has many flaws. See discussions in the link below. This patch enables in-kernel DMA by switching from SVA lib to the standard DMA mapping APIs. Since both DMA requests with and without PASIDs are mapped identically, there is no change to how DMA APIs are used after the kernel PASID is enabled. Link: https://lore.kernel.org/linux-iommu/20210511194726.gp1002...@nvidia.com/ Signed-off-by: Jacob Pan --- drivers/dma/idxd/idxd.h | 1 - drivers/dma/idxd/init.c | 34 +- drivers/dma/idxd/sysfs.c | 7 --- 3 files changed, 9 insertions(+), 33 deletions(-) diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h index ccbefd0be617..190b08bd7c08 100644 --- a/drivers/dma/idxd/idxd.h +++ b/drivers/dma/idxd/idxd.h @@ -277,7 +277,6 @@ struct idxd_device { struct idxd_wq **wqs; struct idxd_engine **engines; - struct iommu_sva *sva; unsigned int pasid; int num_groups; diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c index e1b5d1e4a949..e2e1c0eae6d6 100644 --- a/drivers/dma/idxd/init.c +++ b/drivers/dma/idxd/init.c @@ -16,6 +16,7 @@ #include #include #include +#include #include #include #include "../dmaengine.h" @@ -466,36 +467,22 @@ static struct idxd_device *idxd_alloc(struct pci_dev *pdev, struct idxd_driver_d static int idxd_enable_system_pasid(struct idxd_device *idxd) { - int flags; - unsigned int pasid; - struct iommu_sva *sva; + u32 pasid; + int ret; - flags = SVM_FLAG_SUPERVISOR_MODE; - - sva = iommu_sva_bind_device(&idxd->pdev->dev, NULL, &flags); - if (IS_ERR(sva)) { - dev_warn(&idxd->pdev->dev, -"iommu sva bind failed: %ld\n", PTR_ERR(sva)); - return PTR_ERR(sva); - } - - pasid = iommu_sva_get_pasid(sva); - if (pasid == IOMMU_PASID_INVALID) { - iommu_sva_unbind_device(sva); - return -ENODEV; + ret = iommu_attach_dma_pasid(&idxd->pdev->dev, &pasid); + if (ret) { + dev_err(&idxd->pdev->dev, "No DMA PASID %d\n", ret); + return ret; } - - idxd->sva = sva; idxd->pasid = pasid; - dev_dbg(&idxd->pdev->dev, "system pasid: %u\n", pasid); + return 0; } static void idxd_disable_system_pasid(struct idxd_device *idxd) { - - iommu_sva_unbind_device(idxd->sva); - idxd->sva = NULL; + iommu_detach_dma_pasid(&idxd->pdev->dev); } static int idxd_probe(struct idxd_device *idxd) @@ -527,10 +514,7 @@ static int idxd_probe(struct idxd_device *idxd) else set_bit(IDXD_FLAG_PASID_ENABLED, &idxd->flags); } - } else if (!sva) { - dev_warn(dev, "User forced SVA off via module param.\n"); } - idxd_read_caps(idxd); idxd_read_table_offsets(idxd); diff --git a/drivers/dma/idxd/sysfs.c b/drivers/dma/idxd/sysfs.c index dfd549685c46..a48928973bd4 100644 --- a/drivers/dma/idxd/sysfs.c +++ b/drivers/dma/idxd/sysfs.c @@ -839,13 +839,6 @@ static ssize_t wq_name_store(struct device *dev, if (strlen(buf) > WQ_NAME_SIZE || strlen(buf) == 0) return -EINVAL; - /* -* This is temporarily placed here until we have SVM support for -* dmaengine. -*/ - if (wq->type == IDXD_WQT_KERNEL && device_pasid_enabled(wq->idxd)) - return -EOPNOTSUPP; - memset(wq->name, 0, WQ_NAME_SIZE + 1); strncpy(wq->name, buf, WQ_NAME_SIZE); strreplace(wq->name, '\n', '\0'); -- 2.25.1
[PATCH v3 2/4] iommu: Add PASID support for DMA mapping API users
DMA mapping API is the de facto standard for in-kernel DMA. It operates on a per device/RID basis which is not PASID-aware. Some modern devices such as Intel Data Streaming Accelerator, PASID is required for certain work submissions. To allow such devices use DMA mapping API, we need the following functionalities: 1. Provide device a way to retrieve a PASID for work submission within the kernel 2. Enable the kernel PASID on the IOMMU for the device 3. Attach the kernel PASID to the device's default DMA domain, let it be IOVA or physical address in case of pass-through. This patch introduces a driver facing API that enables DMA API PASID usage. Once enabled, device drivers can continue to use DMA APIs as is. There is no difference in dma_handle between without PASID and with PASID. Signed-off-by: Jacob Pan --- drivers/iommu/dma-iommu.c | 107 ++ include/linux/dma-iommu.h | 3 ++ include/linux/iommu.h | 2 + 3 files changed, 112 insertions(+) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index 1ca85d37eeab..5984f3129fa2 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -34,6 +34,8 @@ struct iommu_dma_msi_page { phys_addr_t phys; }; +static DECLARE_IOASID_SET(iommu_dma_pasid); + enum iommu_dma_cookie_type { IOMMU_DMA_IOVA_COOKIE, IOMMU_DMA_MSI_COOKIE, @@ -370,6 +372,111 @@ void iommu_put_dma_cookie(struct iommu_domain *domain) domain->iova_cookie = NULL; } +/** + * iommu_attach_dma_pasid --Attach a PASID for in-kernel DMA. Use the device's + * DMA domain. + * @dev: Device to be enabled + * @pasid: The returned kernel PASID to be used for DMA + * + * DMA request with PASID will be mapped the same way as the legacy DMA. + * If the device is in pass-through, PASID will also pass-through. If the + * device is in IOVA, the PASID will point to the same IOVA page table. 
+ * + * @return err code or 0 on success + */ +int iommu_attach_dma_pasid(struct device *dev, ioasid_t *pasid) +{ + struct iommu_domain *dom; + ioasid_t id, max; + int ret = 0; + + dom = iommu_get_domain_for_dev(dev); + if (!dom || !dom->ops || !dom->ops->attach_dev_pasid) + return -ENODEV; + + /* Only support domain types that DMA API can be used */ + if (dom->type == IOMMU_DOMAIN_UNMANAGED || + dom->type == IOMMU_DOMAIN_BLOCKED) { + dev_warn(dev, "Invalid domain type %d", dom->type); + return -EPERM; + } + + id = dom->pasid; + if (!id) { + /* + * First device to use PASID in its DMA domain, allocate + * a single PASID per DMA domain is all we need, it is also + * good for performance when it comes down to IOTLB flush. + */ + max = 1U << dev->iommu->pasid_bits; + if (!max) + return -EINVAL; + + id = ioasid_alloc(&iommu_dma_pasid, 1, max, dev); + if (id == INVALID_IOASID) + return -ENOMEM; + + dom->pasid = id; + atomic_set(&dom->pasid_users, 1); + } + + ret = dom->ops->attach_dev_pasid(dom, dev, id); + if (!ret) { + *pasid = id; + atomic_inc(&dom->pasid_users); + return 0; + } + + if (atomic_dec_and_test(&dom->pasid_users)) { + ioasid_free(id); + dom->pasid = 0; + } + + return ret; +} +EXPORT_SYMBOL(iommu_attach_dma_pasid); + +/** + * iommu_detach_dma_pasid --Disable in-kernel DMA request with PASID + * @dev: Device's PASID DMA to be disabled + * + * It is the device driver's responsibility to ensure no more incoming DMA + * requests with the kernel PASID before calling this function. IOMMU driver + * ensures PASID cache, IOTLBs related to the kernel PASID are cleared and + * drained. 
+ * + */ +void iommu_detach_dma_pasid(struct device *dev) +{ + struct iommu_domain *dom; + ioasid_t pasid; + + dom = iommu_get_domain_for_dev(dev); + if (!dom || !dom->ops || !dom->ops->detach_dev_pasid) { + dev_warn(dev, "No ops for detaching PASID %u", pasid); + return; + } + /* Only support DMA API managed domain type */ + if (dom->type == IOMMU_DOMAIN_UNMANAGED || + dom->type == IOMMU_DOMAIN_BLOCKED) { + dev_err(dev, "Invalid domain type %d to detach DMA PASID %u\n", +dom->type, pasid); + return; + } + + pasid = dom->pasid; + if (!pasid) { + dev_err(dev, "No DMA PASID attached\n"); + return; + } + dom->ops->detach_dev_pasid(dom, dev, pasid); + if (atomic_dec_and_test(&dom->pasid_users)) { + ioasid_free(pasid); + dom->pasid = 0; + } +} +EXPORT_SYMBOL(iommu_detach_dma_pasid); + /** *
[PATCH v3 4/4] iommu/vt-d: Delete unused SVM flag
Supervisor PASID for SVA/SVM is no longer supported, delete the unused flag. Signed-off-by: Jacob Pan --- drivers/iommu/intel/svm.c | 2 +- include/linux/intel-svm.h | 13 - 2 files changed, 1 insertion(+), 14 deletions(-) diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c index 38c33cde177e..98ec77415770 100644 --- a/drivers/iommu/intel/svm.c +++ b/drivers/iommu/intel/svm.c @@ -750,7 +750,7 @@ static irqreturn_t prq_event_thread(int irq, void *d) * to unbind the mm while any page faults are outstanding. */ svm = pasid_private_find(req->pasid); - if (IS_ERR_OR_NULL(svm) || (svm->flags & SVM_FLAG_SUPERVISOR_MODE)) + if (IS_ERR_OR_NULL(svm)) goto bad_req; } diff --git a/include/linux/intel-svm.h b/include/linux/intel-svm.h index b3b125b332aa..6835a665c195 100644 --- a/include/linux/intel-svm.h +++ b/include/linux/intel-svm.h @@ -13,17 +13,4 @@ #define PRQ_RING_MASK ((0x1000 << PRQ_ORDER) - 0x20) #define PRQ_DEPTH ((0x1000 << PRQ_ORDER) >> 5) -/* - * The SVM_FLAG_SUPERVISOR_MODE flag requests a PASID which can be used only - * for access to kernel addresses. No IOTLB flushes are automatically done - * for kernel mappings; it is valid only for access to the kernel's static - * 1:1 mapping of physical memory — not to vmalloc or even module mappings. - * A future API addition may permit the use of such ranges, by means of an - * explicit IOTLB flush call (akin to the DMA API's unmap method). - * - * It is unlikely that we will ever hook into flush_tlb_kernel_range() to - * do such IOTLB flushes automatically. - */ -#define SVM_FLAG_SUPERVISOR_MODE BIT(0) - #endif /* __INTEL_SVM_H__ */ -- 2.25.1
[PATCH v3 0/4] Enable PASID for DMA API users
Some modern accelerators such as Intel's Data Streaming Accelerator (DSA) require PASID in DMA requests to be operational. Specifically, the work submissions with ENQCMD on shared work queues require PASIDs. The use cases include both user DMA with shared virtual addressing (SVA) and in-kernel DMA similar to legacy DMA w/o PASID. Here we address the latter. DMA mapping API is the de facto standard for in-kernel DMA. However, it operates on a per device or Requester ID(RID) basis which is not PASID-aware. To leverage DMA API for devices relies on PASIDs, this patchset introduces the following APIs 1. A driver facing API that enables DMA API PASID usage: iommu_enable_pasid_dma(struct device *dev, ioasid_t ); 2. An IOMMU op that allows attaching device-domain-PASID generically (will be used beyond DMA API PASID support) Once PASID DMA is enabled and attached to the appropriate IOMMU domain, device drivers can continue to use DMA APIs as-is. There is no difference in terms of mapping in dma_handle between without PASID and with PASID. The DMA mapping performed by IOMMU will be identical for both requests, let it be IOVA or PA in case of pass-through. In addition, this set converts DSA driver in-kernel DMA with PASID from SVA lib to DMA API. There have been security and functional issues with the kernel SVA approach: (https://lore.kernel.org/linux-iommu/20210511194726.gp1002...@nvidia.com/) The highlights are as the following: - The lack of IOTLB synchronization upon kernel page table updates. (vmalloc, module/BPF loading, CONFIG_DEBUG_PAGEALLOC etc.) - Other than slight more protection, using kernel virtual address (KVA) has little advantage over physical address. There are also no use cases yet where DMA engines need kernel virtual addresses for in-kernel DMA. Subsequently, cleanup is done around the usage of sva_bind_device() for in-kernel DMA. Removing special casing code in VT-d driver and tightening SVA lib API. 
This work and idea behind it is a collaboration with many people, many thanks to Baolu Lu, Jason Gunthorpe, Dave Jiang, and others. ChangeLog: v3 - Rebased on "Baolu's SVA and IOPF refactoring" series v5. (https://github.com/LuBaolu/intel-iommu/commits/iommu-sva-refactoring-v5) This version is significantly simplified by leveraging IOMMU domain ops, attach_dev_pasid() op is implemented differently on a DMA domain than on a SVA domain. We currently have no need to support multiple PASIDs per DMA domain. (https://lore.kernel.org/lkml/20220315142216.gv11...@nvidia.com/). Removed PASID-device list from V2, a PASID field is introduced to struct iommu_domain instead. It is intended for DMA requests with PASID by all devices attached to the domain. v2 - Do not reserve a special PASID for DMA API usage. Use IOASID allocation instead. - Introduced a generic device-pasid-domain attachment IOMMU op. Replaced the DMA API only IOMMU op. - Removed supervisor SVA support in VT-d - Removed unused sva_bind_device parameters - Use IOMMU specific data instead of struct device to store PASID info Jacob Pan (4): iommu/vt-d: Implement domain ops for attach_dev_pasid iommu: Add PASID support for DMA mapping API users dmaengine: idxd: Use DMA API for in-kernel DMA with PASID iommu/vt-d: Delete unused SVM flag drivers/dma/idxd/idxd.h | 1 - drivers/dma/idxd/init.c | 34 +++- drivers/dma/idxd/sysfs.c| 7 --- drivers/iommu/dma-iommu.c | 107 drivers/iommu/intel/iommu.c | 81 ++- drivers/iommu/intel/svm.c | 2 +- include/linux/dma-iommu.h | 3 + include/linux/intel-iommu.h | 1 + include/linux/intel-svm.h | 13 - include/linux/iommu.h | 2 + 10 files changed, 202 insertions(+), 49 deletions(-) -- 2.25.1
[PATCH v3 1/4] iommu/vt-d: Implement domain ops for attach_dev_pasid
On VT-d platforms with scalable mode enabled, devices that issue DMA requests with PASID need to attach PASIDs to given IOMMU domains. The attach operation involves the following: - Programming the PASID into the device's PASID table - Tracking device domain and the PASID relationship - Managing IOTLB and device TLB invalidations This patch adds attach_dev_pasid functions to the default domain ops which are used by DMA and identity domain types. It could be extended to support other domain types whenever necessary. Signed-off-by: Lu Baolu Signed-off-by: Jacob Pan --- drivers/iommu/intel/iommu.c | 81 - include/linux/intel-iommu.h | 1 + 2 files changed, 80 insertions(+), 2 deletions(-) diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index a51b96fa9b3a..5408418f4f4b 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -1562,6 +1562,10 @@ static void __iommu_flush_dev_iotlb(struct device_domain_info *info, sid = info->bus << 8 | info->devfn; qdep = info->ats_qdep; + if (info->pasid) { + qi_flush_dev_iotlb_pasid(info->iommu, sid, info->pfsid, +info->pasid, qdep, addr, mask); + } qi_flush_dev_iotlb(info->iommu, sid, info->pfsid, qdep, addr, mask); } @@ -1591,6 +1595,7 @@ static void iommu_flush_iotlb_psi(struct intel_iommu *iommu, unsigned int mask = ilog2(aligned_pages); uint64_t addr = (uint64_t)pfn << VTD_PAGE_SHIFT; u16 did = domain->iommu_did[iommu->seq_id]; + struct iommu_domain *iommu_domain = &domain->domain; BUG_ON(pages == 0); @@ -1599,6 +1604,9 @@ static void iommu_flush_iotlb_psi(struct intel_iommu *iommu, if (domain_use_first_level(domain)) { qi_flush_piotlb(iommu, did, PASID_RID2PASID, addr, pages, ih); + /* flush additional kernel DMA PASIDs attached */ + if (iommu_domain->pasid) + qi_flush_piotlb(iommu, did, iommu_domain->pasid, addr, pages, ih); } else { unsigned long bitmask = aligned_pages - 1; @@ -4265,10 +4273,13 @@ static void __dmar_remove_one_dev_info(struct device_domain_info *info) domain = info->domain; if (info->dev && !dev_is_real_dma_subdevice(info->dev)) { - if (dev_is_pci(info->dev) && sm_supported(iommu)) + if (dev_is_pci(info->dev) && sm_supported(iommu)) { intel_pasid_tear_down_entry(iommu, info->dev, PASID_RID2PASID, false); - + if (info->pasid) + intel_pasid_tear_down_entry(iommu, info->dev, + info->pasid, false); + } iommu_disable_dev_iotlb(info); domain_context_clear(info); intel_pasid_free_table(info->dev); @@ -4912,6 +4923,70 @@ static void intel_iommu_iotlb_sync_map(struct iommu_domain *domain, } } +static int intel_iommu_attach_dev_pasid(struct iommu_domain *domain, + struct device *dev, + ioasid_t pasid) +{ + struct device_domain_info *info = dev_iommu_priv_get(dev); + struct dmar_domain *dmar_domain = to_dmar_domain(domain); + struct intel_iommu *iommu = info->iommu; + unsigned long flags; + int ret = 0; + + if (!sm_supported(iommu) || !info) + return -ENODEV; + + spin_lock_irqsave(&device_domain_lock, flags); + /* + * If the same device already has a PASID attached, just return. + * DMA layer will return the PASID value to the caller. + */ + if (pasid != PASID_RID2PASID && info->pasid) { + if (info->pasid == pasid) + ret = 0; + else { + dev_warn(dev, "Cannot attach PASID %u, %u already attached\n", +pasid, info->pasid); + ret = -EBUSY; + } + goto out_unlock_domain; + } + + spin_lock(&iommu->lock); + if (hw_pass_through && domain_type_is_si(dmar_domain)) + ret = intel_pasid_setup_pass_through(iommu, dmar_domain, +dev, pasid); + else if (domain_use_first_level(dmar_domain)) + ret = domain_setup_first_level(iommu, dmar_domain, + dev, pasid); + else + ret = intel_pasid_setup_second_level(iommu, dmar_domain, +dev, pasid); + + spin_unlock(&iommu->lock); +out_unlock_domain: + spin_unlock_irqrestore(&device_domain_lock, flags); + if (!ret) + info->pasid = pasid; + + return ret; +} + +static void
Re: [PATCH] iommu/vt-d: Increase DMAR_UNITS_SUPPORTED
On Tue, May 10, 2022 at 01:16:26AM +, Tian, Kevin wrote: > > From: Steve Wahl > > Sent: Friday, May 6, 2022 11:26 PM > > > > On Fri, May 06, 2022 at 08:12:11AM +, Tian, Kevin wrote: > > > > From: David Woodhouse > > > > Sent: Friday, May 6, 2022 3:17 PM > > > > > > > > On Fri, 2022-05-06 at 06:49 +, Tian, Kevin wrote: > > > > > > From: Baolu Lu > > > > > > > > > > > > > --- a/include/linux/dmar.h > > > > > > > +++ b/include/linux/dmar.h > > > > > > > @@ -19,7 +19,7 @@ > > > > > > > struct acpi_dmar_header; > > > > > > > > > > > > > > #ifdef CONFIG_X86 > > > > > > > -# define DMAR_UNITS_SUPPORTEDMAX_IO_APICS > > > > > > > +# define DMAR_UNITS_SUPPORTED640 > > > > > > > #else > > > > > > > # defineDMAR_UNITS_SUPPORTED64 > > > > > > > #endif > > > > > > > > > > ... is it necessary to permanently do 10x increase which wastes memory > > > > > on most platforms which won't have such need. > > > > > > > > I was just looking at that. It mostly adds about 3½ KiB to each struct > > > > dmar_domain. > > > > > > > > I think the only actual static array is the dmar_seq_ids bitmap which > > > > grows to 640 *bits* which is fairly negligible, and the main growth is > > > > that it adds about 3½ KiB to each struct dmar_domain for the > > > > iommu_refcnt[] and iommu_did[] arrays. > > > > > > Thanks for the quick experiment! though the added material is > > > negligible it's cleaner to me if having a way to configure it as > > > discussed below. > > > > > > > > > > > > Does it make more sense to have a configurable approach similar to > > > > > CONFIG_NR_CPUS? or even better can we just replace those static > > > > > arrays with dynamic allocation so removing this restriction > > > > > completely? > > > > > > > > Hotplug makes that fun, but I suppose you only need to grow the array > > > > in a given struct dmar_domain if you actually add a device to it that's > > > > behind a newly added IOMMU. I don't know if the complexity of making it > > > > fully dynamic is worth it though. 
We could make it a config option, > > > > and/or a command line option (perhaps automatically derived from > > > > CONFIG_NR_CPUS). > > > > > > either config option or command line option is OK to me. Probably > > > the former is simpler given no need to dynamically expand the > > > static array. btw though deriving from CONFIG_NR_CPUS could work > > > in this case it is unclear why tying the two together is necessary in > > > concept, e.g. is there guarantee that the number of IOMMUs must > > > be smaller than the number of CPUs in a platform? > > > > > > > > > > > If it wasn't for hotplug, I think we'd know the right number by the > > > > time we actually need it anyway, wouldn't we? Can we have a heuristic > > > > for how many DMAR units are likely to be hotplugged? Is it as simple as > > > > the ratio of present to not-yet-present CPUs in MADT? > > > > > > Probably. But I don't have enough knowledge on DMAR hotplug to > > > judge (e.g. whether it's strictly tied to CPU hotplug and if yes whether > > > there could be multiple IOMMUs hotplugged together with a CPU > > > socket)... > > > > > > Thanks > > > Kevin > > > > Would anyone be more comfortable if we only increase the limit where > > MAXSMP is set? > > > > i.e. > > > > #if defined(CONFIG_X86) && defined(CONFIG_MAXSMP) > > # defineDMAR_UNITS_SUPPORTED640 > > #elif defined(CONFIG_X86) > > # defineDMAR_UNITS_SUPPORTEDMAX_IO_APICS > > #else > > # defineDMAR_UNITS_SUPPORTED64 > > #endif > > > > Thank you all for your time looking at this. > > > > This works for your own configuration but it's unclear whether other > MAXSMP platforms have the exact same requirements (different > number of sockets, different ratio of #iommus/#sockets, etc.). In any > case since we are at it having a generic way to extend it makes more > sense to me. So, to be clear, what you would like to see would be Kconfig entries to create a config option, say "NR_DMARS", set up so the default is: MAXSMP? 640 X86_64? 128 X86_32? 
64, other? 64. And DMAR_UNITS_SUPPORTED gets removed, and everywhere it was used we use CONFIG_NR_DMARS in its place? I can give that a shot but wanted to confirm this is what you'd want first. Thanks, --> Steve -- Steve Wahl, Hewlett Packard Enterprise ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
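To make the proposal concrete, the Kconfig entry Steve is describing might look something like the sketch below. This is illustrative only: the option name NR_DMARS and the per-configuration defaults are taken from the discussion above, not from any merged patch.

```
# drivers/iommu/intel/Kconfig (hypothetical sketch)
config NR_DMARS
	int "Maximum number of DMA remapping hardware units"
	depends on X86
	default 640 if MAXSMP
	default 128 if X86_64
	default 64
	help
	  Static upper bound on the number of DMAR units. Per-domain
	  arrays such as iommu_refcnt[] and iommu_did[] are sized by
	  this value, so raising it costs memory on every domain.
```

include/linux/dmar.h would then either use CONFIG_NR_DMARS directly in place of DMAR_UNITS_SUPPORTED, or simply define the old macro as CONFIG_NR_DMARS to avoid touching every user at once.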
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Tue, May 10, 2022 at 05:12:04PM +1000, David Gibson wrote: > On Mon, May 09, 2022 at 11:00:41AM -0300, Jason Gunthorpe wrote: > > On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote: > > > > > > The default iommu_domain that the iommu driver creates will be used > > > > here, it is up to the iommu driver to choose something reasonable for > > > > use by applications like DPDK. ie PPC should probably pick its biggest > > > > x86-like aperture. > > > > > > So, using the big aperture means a very high base IOVA > > > (1<<59)... which means that it won't work at all if you want to attach > > > any devices that aren't capable of 64-bit DMA. > > > > I'd expect to include the 32 bit window too.. > > I'm not entirely sure what you mean. Are you working on the > assumption that we've extended to allowing multiple apertures, so we'd > default to advertising both a small/low aperture and a large/high > aperture? Yes > > No, this just makes it fragile in the other direction because now > > userspace has to know what platform specific things to ask for *or it > > doesn't work at all*. This is not a improvement for the DPDK cases. > > Um.. no. The idea is that userspace requests *what it needs*, not > anything platform specific. In the case of DPDK that would be nothing > more than the (minimum) aperture size. Nothing platform specific > about that. Except a 32 bit platform can only maybe do a < 4G aperture, a 64 bit platform can do more, but it varies how much more, etc. There is no constant value DPDK could stuff in this request, unless it needs a really small amount of IOVA, like 1G or something. > > It isn't like there is some hard coded value we can put into DPDK that > > will work on every platform. So kernel must pick for DPDK, IMHO. I > > don't see any feasible alternative. > > Yes, hence *optionally specified* base address only. 
Okay, so imagine we've already done this and DPDK is not optionally specifying anything :) The structs can be extended so we can add this as an input to creation when a driver can implement it. > > The ppc specific driver would be on the generic side of qemu in its > > viommu support framework. There is lots of host driver optimization > > possible here with knowledge of the underlying host iommu HW. It > > should not be connected to the qemu target. > > Thinking through this... > > So, I guess we could have basically the same logic I'm suggesting be > in the qemu backend iommu driver instead. So the target side (machine > type, strictly speaking) would request of the host side the apertures > it needs, and the host side driver would see if it can do that, based > on both specific knowledge of that driver and the query responses. Yes, this is what I'm thinking > ppc on x86 should work with that.. at least if the x86 aperture is > large enough to reach up to ppc's high window. I guess we'd have the > option here of using either the generic host driver or the > x86-specific driver. The latter would mean qemu maintaining an > x86-format shadow of the io pagetables; mildly tedious, but doable. The appeal of having userspace page tables is performance, so it is tedious to shadow, but it should run faster. > So... is there any way of backing out of this gracefully. We could > detach the device, but in the meantime ongoing DMA maps from > previous devices might have failed. This sounds like a good use case for qemu to communicate ranges - but as I mentioned before Alex said qemu didn't know the ranges.. > We could pre-attach the new device to a new IOAS and check the > apertures there - but when we move it to the final IOAS is it > guaranteed that the apertures will be (at least) the intersection of > the old and new apertures, or is that just the probable outcome. Should be guaranteed > Ok.. you convinced me. 
As long as we have some way to handle the > device hotplug case, we can work with this. I like the communicate ranges for hotplug, so long as we can actually implement it in qemu - I'm a bit unclear on that honestly. > Ok, I see. That can certainly be done. I was really hoping we could > have a working, though non-optimized, implementation using just the > generic interface. Oh, sure that should largely work as well too, this is just an additional direction people may find interesting and helps explain why qemu should have an iommu layer inside. > "holes" versus "windows". We can choose either one; I think "windows" > rather than "holes" makes more sense, but it doesn't really matter. Yes, I picked windows aka ranges for the uAPI - we translate the holes from the groups into windows and intersect them with the apertures. > > > Another approach would be to give the required apertures / pagesizes > > > in the initial creation of the domain/IOAS. In that case they would > > > be static for the IOAS, as well as the underlying iommu_domains: any > > > ATTACH which would be incompatible would fail. > > > > This is the device-specific
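The "windows" model Jason describes — translating the holes reserved by each group into allowed windows and intersecting them with the apertures — reduces to interval intersection over sorted range lists. A minimal userspace sketch is below; the struct and function names are illustrative only, not the iommufd uAPI.

```c
#include <stdint.h>
#include <stddef.h>

/* An inclusive range of usable IOVA, ordered and non-overlapping
 * within each list (hypothetical names, not kernel API). */
struct iova_window {
	uint64_t start;
	uint64_t last;
};

/* Intersect two sorted window lists into 'out' (caller sizes it to
 * at least na + nb entries); returns the number of result windows. */
static size_t intersect_windows(const struct iova_window *a, size_t na,
				const struct iova_window *b, size_t nb,
				struct iova_window *out)
{
	size_t i = 0, j = 0, n = 0;

	while (i < na && j < nb) {
		uint64_t start = a[i].start > b[j].start ? a[i].start : b[j].start;
		uint64_t last = a[i].last < b[j].last ? a[i].last : b[j].last;

		if (start <= last)
			out[n++] = (struct iova_window){ start, last };
		/* advance whichever window ends first */
		if (a[i].last < b[j].last)
			i++;
		else
			j++;
	}
	return n;
}
```

For example, intersecting a low 32-bit window plus a high (1<<59) window against a device aperture that only reaches 2 GiB yields just the usable low range, which is exactly the attach-time narrowing being discussed.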
Re: [PATCH] vfio: Remove VFIO_TYPE1_NESTING_IOMMU
On Tue, May 10, 2022 at 01:55:24PM -0300, Jason Gunthorpe wrote: > This control causes the ARM SMMU drivers to choose a stage 2 > implementation for the IO pagetable (vs the stage 1 usual default), > however this choice has no visible impact to the VFIO user. Further qemu > never implemented this and no other userspace user is known. > > The original description in commit f5c9ecebaf2a ("vfio/iommu_type1: add > new VFIO_TYPE1_NESTING_IOMMU IOMMU type") suggested this was to "provide > SMMU translation services to the guest operating system" however the rest > of the API to set the guest table pointer for the stage 1 was never > completed, or at least never upstreamed, rendering this part useless dead > code. > > Since the current patches to enable nested translation, aka userspace page > tables, rely on iommufd and will not use the enable_nesting() > iommu_domain_op, remove this infrastructure. However, don't cut too deep > into the SMMU drivers for now expecting the iommufd work to pick it up - > we still need to create S2 IO page tables. > > Remove VFIO_TYPE1_NESTING_IOMMU and everything under it including the > enable_nesting iommu_domain_op. > > Just in-case there is some userspace using this continue to treat > requesting it as a NOP, but do not advertise support any more. > > Signed-off-by: Jason Gunthorpe Sanity-tested with qemu-system-aarch64 using "iommu=nested-smmuv3" (unmerged feature) and "iommu=none" strings on top of vfio next. Tested-by: Nicolin Chen ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [PATCH] swiotlb: Max mapping size takes min align mask into account
From: Robin Murphy Sent: Tuesday, May 10, 2022 9:34 AM > > On 2022-05-10 15:21, Tianyu Lan wrote: > > From: Tianyu Lan > > > > swiotlb_find_slots() skips slots according to io tlb aligned mask > > calculated from min aligned mask and original physical address > > offset. This affects max mapping size. The mapping size can't > > achieve the IO_TLB_SEGSIZE * IO_TLB_SIZE when original offset is > > non-zero. This will cause system boot up failure in Hyper-V > > Isolation VM where swiotlb force is enabled. Scsi layer use return > > value of dma_max_mapping_size() to set max segment size and it > > finally calls swiotlb_max_mapping_size(). Hyper-V storage driver > > sets min align mask to 4k - 1. Scsi layer may pass 256k length of > > request buffer with 0~4k offset and Hyper-V storage driver can't > > get swiotlb bounce buffer via DMA API. Swiotlb_find_slots() can't > > find 256k length bounce buffer with offset. Make swiotlb_max_mapping > > _size() take min align mask into account. > > Hmm, this seems a bit pessimistic - the offset can vary per mapping, so > it feels to me like it should really be the caller's responsibility to > account for it if they're already involved enough to care about both > constraints. But I'm not sure how practical that would be. Tianyu and I discussed this prior to his submitting the patch. Presumably dma_max_mapping_size() exists so that the higher level blk-mq code can limit the size of I/O requests to something that will "fit" in the swiotlb when bounce buffering is enabled. Unfortunately, the current code is just giving the wrong answer when the offset is non-zero. The offset would be less than PAGE_SIZE, so the impact would be dma_max_mapping_size() returning 252 Kbytes instead of 256 Kbytes, but only for devices where dma min align mask is set. And any I/O sizes less than 252 Kbytes are unaffected even when dma min align mask is set. Net, the impact would be only in a fairly rare edge case. 
Even on ARM64 with a 64K page size, the Hyper-V storage driver is setting the dma min align mask to only 4K (which is correct because the Hyper-V host uses a 4K page size even if the guest is using something larger), so again the limit becomes 252 Kbytes instead of 256 Kbytes, and any impact is rare. As you mentioned, how else would a caller handle this situation? Michael > > Robin. > > > Signed-off-by: Tianyu Lan > > --- > > kernel/dma/swiotlb.c | 13 - > > 1 file changed, 12 insertions(+), 1 deletion(-) > > > > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c > > index 73a41cec9e38..0d6684ca7eab 100644 > > --- a/kernel/dma/swiotlb.c > > +++ b/kernel/dma/swiotlb.c > > @@ -743,7 +743,18 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t > paddr, size_t size, > > > > size_t swiotlb_max_mapping_size(struct device *dev) > > { > > - return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE; > > + int min_align_mask = dma_get_min_align_mask(dev); > > + int min_align = 0; > > + > > + /* > > +* swiotlb_find_slots() skips slots according to > > +* min align mask. This affects max mapping size. > > +* Take it into acount here. > > +*/ > > + if (min_align_mask) > > + min_align = roundup(min_align_mask, IO_TLB_SIZE); > > + > > + return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE - min_align; > > } > > > > bool is_swiotlb_active(struct device *dev) ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
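Michael's numbers are easy to check with a small userspace model of the patched swiotlb_max_mapping_size(). This is a sketch, not kernel code; the constants mirror the kernel's IO_TLB_SIZE (1 << 11 = 2048 bytes per slot) and IO_TLB_SEGSIZE (128 slots per segment).

```c
#include <stddef.h>

#define IO_TLB_SIZE	2048	/* 1 << IO_TLB_SHIFT */
#define IO_TLB_SEGSIZE	128

/* roundup(x, y): round x up to the next multiple of y,
 * as the kernel's roundup() macro does */
static size_t roundup_sz(size_t x, size_t y)
{
	return ((x + y - 1) / y) * y;
}

/* Model of the patched swiotlb_max_mapping_size() */
static size_t max_mapping_size(size_t min_align_mask)
{
	size_t min_align = 0;

	if (min_align_mask)
		min_align = roundup_sz(min_align_mask, IO_TLB_SIZE);
	return (size_t)IO_TLB_SIZE * IO_TLB_SEGSIZE - min_align;
}
```

With no min align mask this gives 262144 bytes (256 KiB); with the Hyper-V storage driver's 4K - 1 mask it gives 258048 bytes (252 KiB), matching the "252 Kbytes instead of 256 Kbytes" edge case above.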
Re: [PATCH] vfio: Remove VFIO_TYPE1_NESTING_IOMMU
On Tue, May 10, 2022 at 06:52:06PM +0100, Robin Murphy wrote: > On 2022-05-10 17:55, Jason Gunthorpe via iommu wrote: > > This control causes the ARM SMMU drivers to choose a stage 2 > > implementation for the IO pagetable (vs the stage 1 usual default), > > however this choice has no visible impact to the VFIO user. Further qemu > > never implemented this and no other userspace user is known. > > > > The original description in commit f5c9ecebaf2a ("vfio/iommu_type1: add > > new VFIO_TYPE1_NESTING_IOMMU IOMMU type") suggested this was to "provide > > SMMU translation services to the guest operating system" however the rest > > of the API to set the guest table pointer for the stage 1 was never > > completed, or at least never upstreamed, rendering this part useless dead > > code. > > > > Since the current patches to enable nested translation, aka userspace page > > tables, rely on iommufd and will not use the enable_nesting() > > iommu_domain_op, remove this infrastructure. However, don't cut too deep > > into the SMMU drivers for now expecting the iommufd work to pick it up - > > we still need to create S2 IO page tables. > > > > Remove VFIO_TYPE1_NESTING_IOMMU and everything under it including the > > enable_nesting iommu_domain_op. > > > > Just in-case there is some userspace using this continue to treat > > requesting it as a NOP, but do not advertise support any more. > > I assume the nested translation/guest SVA patches that Eric and Vivek were > working on pre-IOMMUFD made use of this, and given that they got quite far > along, I wouldn't be too surprised if some eager cloud vendors might have > even deployed something based on the patches off the list. With upstream there is no way to make use of this flag, if someone is using it they have other out of tree kernel, vfio, kvm and qemu patches to make it all work. 
You can see how much is still needed in Eric's tree: https://github.com/eauger/linux/commits/v5.15-rc7-nested-v16 > I can't help feeling a little wary about removing this until IOMMUFD > can actually offer a functional replacement - is it in the way of > anything upcoming? From an upstream perspective if someone has a patched kernel to complete the feature, then they can patch this part in as well, we should not carry dead code like this in the kernel and in the uapi. It is not directly in the way, but this needs to get done at some point, I'd rather just get it out of the way. Thanks, Jason
Re: [PATCH] vfio: Remove VFIO_TYPE1_NESTING_IOMMU
On 2022-05-10 17:55, Jason Gunthorpe via iommu wrote: This control causes the ARM SMMU drivers to choose a stage 2 implementation for the IO pagetable (vs the stage 1 usual default), however this choice has no visible impact to the VFIO user. Further qemu never implemented this and no other userspace user is known. The original description in commit f5c9ecebaf2a ("vfio/iommu_type1: add new VFIO_TYPE1_NESTING_IOMMU IOMMU type") suggested this was to "provide SMMU translation services to the guest operating system" however the rest of the API to set the guest table pointer for the stage 1 was never completed, or at least never upstreamed, rendering this part useless dead code. Since the current patches to enable nested translation, aka userspace page tables, rely on iommufd and will not use the enable_nesting() iommu_domain_op, remove this infrastructure. However, don't cut too deep into the SMMU drivers for now expecting the iommufd work to pick it up - we still need to create S2 IO page tables. Remove VFIO_TYPE1_NESTING_IOMMU and everything under it including the enable_nesting iommu_domain_op. Just in-case there is some userspace using this continue to treat requesting it as a NOP, but do not advertise support any more. I assume the nested translation/guest SVA patches that Eric and Vivek were working on pre-IOMMUFD made use of this, and given that they got quite far along, I wouldn't be too surprised if some eager cloud vendors might have even deployed something based on the patches off the list. I can't help feeling a little wary about removing this until IOMMUFD can actually offer a functional replacement - is it in the way of anything upcoming? Robin. 
Signed-off-by: Jason Gunthorpe --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 drivers/iommu/arm/arm-smmu/arm-smmu.c | 16 drivers/iommu/iommu.c | 10 -- drivers/vfio/vfio_iommu_type1.c | 12 +--- include/linux/iommu.h | 3 --- include/uapi/linux/vfio.h | 2 +- 6 files changed, 2 insertions(+), 57 deletions(-) It would probably make sense for this to go through the VFIO tree with Robin's ack for the SMMU changes. diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 627a3ed5ee8fd1..b901e8973bb4ea 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2724,21 +2724,6 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev) return group; } -static int arm_smmu_enable_nesting(struct iommu_domain *domain) -{ - struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); - int ret = 0; - - mutex_lock(_domain->init_mutex); - if (smmu_domain->smmu) - ret = -EPERM; - else - smmu_domain->stage = ARM_SMMU_DOMAIN_NESTED; - mutex_unlock(_domain->init_mutex); - - return ret; -} - static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args) { return iommu_fwspec_add_ids(dev, args->args, 1); @@ -2865,7 +2850,6 @@ static struct iommu_ops arm_smmu_ops = { .flush_iotlb_all= arm_smmu_flush_iotlb_all, .iotlb_sync = arm_smmu_iotlb_sync, .iova_to_phys = arm_smmu_iova_to_phys, - .enable_nesting = arm_smmu_enable_nesting, .free = arm_smmu_domain_free, } }; diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c index 568cce590ccc13..239e6f6585b48d 100644 --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c @@ -1507,21 +1507,6 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev) return group; } -static int arm_smmu_enable_nesting(struct iommu_domain *domain) -{ - struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); - int ret = 0; - - 
mutex_lock(_domain->init_mutex); - if (smmu_domain->smmu) - ret = -EPERM; - else - smmu_domain->stage = ARM_SMMU_DOMAIN_NESTED; - mutex_unlock(_domain->init_mutex); - - return ret; -} - static int arm_smmu_set_pgtable_quirks(struct iommu_domain *domain, unsigned long quirks) { @@ -1600,7 +1585,6 @@ static struct iommu_ops arm_smmu_ops = { .flush_iotlb_all= arm_smmu_flush_iotlb_all, .iotlb_sync = arm_smmu_iotlb_sync, .iova_to_phys = arm_smmu_iova_to_phys, - .enable_nesting = arm_smmu_enable_nesting, .set_pgtable_quirks = arm_smmu_set_pgtable_quirks, .free = arm_smmu_domain_free, } diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index
[PATCH] dma-debug: change allocation mode from GFP_NOWAIT to GFP_ATOMIC
We observed the error "cacheline tracking ENOMEM, dma-debug disabled" during a light system load (copying some files). The reason for this error is that the dma_active_cacheline radix tree uses GFP_NOWAIT allocation - so it can't access the emergency memory reserves and it fails as soon as anybody reaches the watermark. This patch changes GFP_NOWAIT to GFP_ATOMIC, so that it can access the emergency memory reserves. Signed-off-by: Mikulas Patocka --- kernel/dma/debug.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) Index: linux-2.6/kernel/dma/debug.c === --- linux-2.6.orig/kernel/dma/debug.c 2022-03-30 14:39:00.0 +0200 +++ linux-2.6/kernel/dma/debug.c 2022-05-09 16:32:07.0 +0200 @@ -448,7 +448,7 @@ void debug_dma_dump_mappings(struct devi * other hand, consumes a single dma_debug_entry, but inserts 'nents' * entries into the tree. */ -static RADIX_TREE(dma_active_cacheline, GFP_NOWAIT); +static RADIX_TREE(dma_active_cacheline, GFP_ATOMIC); static DEFINE_SPINLOCK(radix_lock); #define ACTIVE_CACHELINE_MAX_OVERLAP ((1 << RADIX_TREE_MAX_TAGS) - 1) #define CACHELINE_PER_PAGE_SHIFT (PAGE_SHIFT - L1_CACHE_SHIFT)
[PATCH] vfio: Remove VFIO_TYPE1_NESTING_IOMMU
This control causes the ARM SMMU drivers to choose a stage 2 implementation for the IO pagetable (vs the stage 1 usual default), however this choice has no visible impact to the VFIO user. Further qemu never implemented this and no other userspace user is known. The original description in commit f5c9ecebaf2a ("vfio/iommu_type1: add new VFIO_TYPE1_NESTING_IOMMU IOMMU type") suggested this was to "provide SMMU translation services to the guest operating system" however the rest of the API to set the guest table pointer for the stage 1 was never completed, or at least never upstreamed, rendering this part useless dead code. Since the current patches to enable nested translation, aka userspace page tables, rely on iommufd and will not use the enable_nesting() iommu_domain_op, remove this infrastructure. However, don't cut too deep into the SMMU drivers for now expecting the iommufd work to pick it up - we still need to create S2 IO page tables. Remove VFIO_TYPE1_NESTING_IOMMU and everything under it including the enable_nesting iommu_domain_op. Just in-case there is some userspace using this continue to treat requesting it as a NOP, but do not advertise support any more. Signed-off-by: Jason Gunthorpe --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 drivers/iommu/arm/arm-smmu/arm-smmu.c | 16 drivers/iommu/iommu.c | 10 -- drivers/vfio/vfio_iommu_type1.c | 12 +--- include/linux/iommu.h | 3 --- include/uapi/linux/vfio.h | 2 +- 6 files changed, 2 insertions(+), 57 deletions(-) It would probably make sense for this to go through the VFIO tree with Robin's ack for the SMMU changes. 
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 627a3ed5ee8fd1..b901e8973bb4ea 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -2724,21 +2724,6 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev) return group; } -static int arm_smmu_enable_nesting(struct iommu_domain *domain) -{ - struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); - int ret = 0; - - mutex_lock(_domain->init_mutex); - if (smmu_domain->smmu) - ret = -EPERM; - else - smmu_domain->stage = ARM_SMMU_DOMAIN_NESTED; - mutex_unlock(_domain->init_mutex); - - return ret; -} - static int arm_smmu_of_xlate(struct device *dev, struct of_phandle_args *args) { return iommu_fwspec_add_ids(dev, args->args, 1); @@ -2865,7 +2850,6 @@ static struct iommu_ops arm_smmu_ops = { .flush_iotlb_all= arm_smmu_flush_iotlb_all, .iotlb_sync = arm_smmu_iotlb_sync, .iova_to_phys = arm_smmu_iova_to_phys, - .enable_nesting = arm_smmu_enable_nesting, .free = arm_smmu_domain_free, } }; diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c index 568cce590ccc13..239e6f6585b48d 100644 --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c @@ -1507,21 +1507,6 @@ static struct iommu_group *arm_smmu_device_group(struct device *dev) return group; } -static int arm_smmu_enable_nesting(struct iommu_domain *domain) -{ - struct arm_smmu_domain *smmu_domain = to_smmu_domain(domain); - int ret = 0; - - mutex_lock(_domain->init_mutex); - if (smmu_domain->smmu) - ret = -EPERM; - else - smmu_domain->stage = ARM_SMMU_DOMAIN_NESTED; - mutex_unlock(_domain->init_mutex); - - return ret; -} - static int arm_smmu_set_pgtable_quirks(struct iommu_domain *domain, unsigned long quirks) { @@ -1600,7 +1585,6 @@ static struct iommu_ops arm_smmu_ops = { .flush_iotlb_all= arm_smmu_flush_iotlb_all, .iotlb_sync = arm_smmu_iotlb_sync, .iova_to_phys 
= arm_smmu_iova_to_phys, - .enable_nesting = arm_smmu_enable_nesting, .set_pgtable_quirks = arm_smmu_set_pgtable_quirks, .free = arm_smmu_domain_free, } diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 857d4c2fd1a206..f33c0d569a5d03 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -2561,16 +2561,6 @@ static int __init iommu_init(void) } core_initcall(iommu_init); -int iommu_enable_nesting(struct iommu_domain *domain) -{ - if (domain->type != IOMMU_DOMAIN_UNMANAGED) - return -EINVAL; - if (!domain->ops->enable_nesting) - return -EINVAL; - return domain->ops->enable_nesting(domain); -} -EXPORT_SYMBOL_GPL(iommu_enable_nesting); - int
Re: [PATCH] swiotlb: Max mapping size takes min align mask into account
On 2022-05-10 15:21, Tianyu Lan wrote: From: Tianyu Lan swiotlb_find_slots() skips slots according to io tlb aligned mask calculated from min aligned mask and original physical address offset. This affects max mapping size. The mapping size can't achieve the IO_TLB_SEGSIZE * IO_TLB_SIZE when original offset is non-zero. This will cause system boot up failure in Hyper-V Isolation VM where swiotlb force is enabled. Scsi layer use return value of dma_max_mapping_size() to set max segment size and it finally calls swiotlb_max_mapping_size(). Hyper-V storage driver sets min align mask to 4k - 1. Scsi layer may pass 256k length of request buffer with 0~4k offset and Hyper-V storage driver can't get swiotlb bounce buffer via DMA API. Swiotlb_find_slots() can't find 256k length bounce buffer with offset. Make swiotlb_max_mapping _size() take min align mask into account. Hmm, this seems a bit pessimistic - the offset can vary per mapping, so it feels to me like it should really be the caller's responsibility to account for it if they're already involved enough to care about both constraints. But I'm not sure how practical that would be. Robin. Signed-off-by: Tianyu Lan --- kernel/dma/swiotlb.c | 13 - 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 73a41cec9e38..0d6684ca7eab 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -743,7 +743,18 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size, size_t swiotlb_max_mapping_size(struct device *dev) { - return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE; + int min_align_mask = dma_get_min_align_mask(dev); + int min_align = 0; + + /* +* swiotlb_find_slots() skips slots according to +* min align mask. This affects max mapping size. +* Take it into acount here. 
+*/ + if (min_align_mask) + min_align = roundup(min_align_mask, IO_TLB_SIZE); + + return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE - min_align; } bool is_swiotlb_active(struct device *dev)
[GIT PULL] iommu/arm-smmu: Updates for 5.19
Hi Joerg, Please pull these Arm SMMU updates for 5.19. The bulk of it is just adding new device-tree compatible strings for the existing drivers, but there are some non-critical fixes in here as well. Please see the tag for more details. I used the previous fixes pull as a base for this so that you shouldn't run into any conflicts with upstream. Cheers, Will --->8 The following changes since commit 4a25f2ea0e030b2fc852c4059a50181bfc5b2f57: iommu: arm-smmu: disable large page mappings for Nvidia arm-smmu (2022-04-22 11:21:30 +0100) are available in the Git repository at: git://git.kernel.org/pub/scm/linux/kernel/git/will/linux.git tags/arm-smmu-updates for you to fetch changes up to 628bf55b620497a105f4963ee8fb84769f7e6bb4: iommu/arm-smmu: Force identity domains for legacy binding (2022-05-10 12:01:31 +0100) Arm SMMU updates for 5.19 - Add new Qualcomm device-tree compatible strings - Add new Nvidia device-tree compatible string for Tegra234 - Fix UAF in SMMUv3 shared virtual addressing code - Force identity-mapped domains for users of ye olde SMMU legacy binding - Minor cleanups Bjorn Andersson (2): dt-bindings: arm-smmu: Add compatible for Qualcomm SC8280XP iommu/arm-smmu-qcom: Add SC8280XP support Jean-Philippe Brucker (1): iommu/arm-smmu-v3-sva: Fix mm use-after-free Robin Murphy (1): iommu/arm-smmu: Force identity domains for legacy binding Rohit Agarwal (1): dt-bindings: arm-smmu: Add binding for SDX65 SMMU Thierry Reding (3): dt-bindings: arm-smmu: Document nvidia,memory-controller property dt-bindings: arm-smmu: Add compatible for Tegra234 SOC iommu/arm-smmu: Support Tegra234 SMMU Yang Yingliang (2): iommu/arm-smmu: fix possible null-ptr-deref in arm_smmu_device_probe() iommu/arm-smmu-v3: check return value after calling platform_get_resource() .../devicetree/bindings/iommu/arm,smmu.yaml| 25 -- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c| 13 +-- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c| 2 ++ drivers/iommu/arm/arm-smmu/arm-smmu-impl.c | 3 ++- 
drivers/iommu/arm/arm-smmu/arm-smmu-qcom.c | 1 + drivers/iommu/arm/arm-smmu/arm-smmu.c | 8 --- 6 files changed, 44 insertions(+), 8 deletions(-)
[PATCH AUTOSEL 5.15 03/19] iommu: arm-smmu: disable large page mappings for Nvidia arm-smmu
From: Ashish Mhetre [ Upstream commit 4a25f2ea0e030b2fc852c4059a50181bfc5b2f57 ] Tegra194 and Tegra234 SoCs have the erratum that causes walk cache entries to not be invalidated correctly. The problem is that the walk cache index generated for IOVA is not same across translation and invalidation requests. This is leading to page faults when PMD entry is released during unmap and populated with new PTE table during subsequent map request. Disabling large page mappings avoids the release of PMD entry and avoid translations seeing stale PMD entry in walk cache. Fix this by limiting the page mappings to PAGE_SIZE for Tegra194 and Tegra234 devices. This is recommended fix from Tegra hardware design team. Acked-by: Robin Murphy Reviewed-by: Krishna Reddy Co-developed-by: Pritesh Raithatha Signed-off-by: Pritesh Raithatha Signed-off-by: Ashish Mhetre Link: https://lore.kernel.org/r/20220421081504.24678-1-amhe...@nvidia.com Signed-off-by: Will Deacon Signed-off-by: Sasha Levin --- drivers/iommu/arm/arm-smmu/arm-smmu-nvidia.c | 30 1 file changed, 30 insertions(+) diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-nvidia.c b/drivers/iommu/arm/arm-smmu/arm-smmu-nvidia.c index 01e9b50b10a1..87bf522b9d2e 100644 --- a/drivers/iommu/arm/arm-smmu/arm-smmu-nvidia.c +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-nvidia.c @@ -258,6 +258,34 @@ static void nvidia_smmu_probe_finalize(struct arm_smmu_device *smmu, struct devi dev_name(dev), err); } +static int nvidia_smmu_init_context(struct arm_smmu_domain *smmu_domain, + struct io_pgtable_cfg *pgtbl_cfg, + struct device *dev) +{ + struct arm_smmu_device *smmu = smmu_domain->smmu; + const struct device_node *np = smmu->dev->of_node; + + /* +* Tegra194 and Tegra234 SoCs have the erratum that causes walk cache +* entries to not be invalidated correctly. The problem is that the walk +* cache index generated for IOVA is not same across translation and +* invalidation requests. 
This is leading to page faults when PMD entry +* is released during unmap and populated with new PTE table during +* subsequent map request. Disabling large page mappings avoids the +* release of PMD entry and avoid translations seeing stale PMD entry in +* walk cache. +* Fix this by limiting the page mappings to PAGE_SIZE on Tegra194 and +* Tegra234. +*/ + if (of_device_is_compatible(np, "nvidia,tegra234-smmu") || + of_device_is_compatible(np, "nvidia,tegra194-smmu")) { + smmu->pgsize_bitmap = PAGE_SIZE; + pgtbl_cfg->pgsize_bitmap = smmu->pgsize_bitmap; + } + + return 0; +} + static const struct arm_smmu_impl nvidia_smmu_impl = { .read_reg = nvidia_smmu_read_reg, .write_reg = nvidia_smmu_write_reg, @@ -268,10 +296,12 @@ static const struct arm_smmu_impl nvidia_smmu_impl = { .global_fault = nvidia_smmu_global_fault, .context_fault = nvidia_smmu_context_fault, .probe_finalize = nvidia_smmu_probe_finalize, + .init_context = nvidia_smmu_init_context, }; static const struct arm_smmu_impl nvidia_smmu_single_impl = { .probe_finalize = nvidia_smmu_probe_finalize, + .init_context = nvidia_smmu_init_context, }; struct arm_smmu_device *nvidia_smmu_impl_init(struct arm_smmu_device *smmu) -- 2.35.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH AUTOSEL 5.17 04/21] iommu: arm-smmu: disable large page mappings for Nvidia arm-smmu
From: Ashish Mhetre [ Upstream commit 4a25f2ea0e030b2fc852c4059a50181bfc5b2f57 ] Tegra194 and Tegra234 SoCs have the erratum that causes walk cache entries to not be invalidated correctly. The problem is that the walk cache index generated for IOVA is not same across translation and invalidation requests. This is leading to page faults when PMD entry is released during unmap and populated with new PTE table during subsequent map request. Disabling large page mappings avoids the release of PMD entry and avoid translations seeing stale PMD entry in walk cache. Fix this by limiting the page mappings to PAGE_SIZE for Tegra194 and Tegra234 devices. This is recommended fix from Tegra hardware design team. Acked-by: Robin Murphy Reviewed-by: Krishna Reddy Co-developed-by: Pritesh Raithatha Signed-off-by: Pritesh Raithatha Signed-off-by: Ashish Mhetre Link: https://lore.kernel.org/r/20220421081504.24678-1-amhe...@nvidia.com Signed-off-by: Will Deacon Signed-off-by: Sasha Levin --- drivers/iommu/arm/arm-smmu/arm-smmu-nvidia.c | 30 1 file changed, 30 insertions(+) diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu-nvidia.c b/drivers/iommu/arm/arm-smmu/arm-smmu-nvidia.c index 01e9b50b10a1..87bf522b9d2e 100644 --- a/drivers/iommu/arm/arm-smmu/arm-smmu-nvidia.c +++ b/drivers/iommu/arm/arm-smmu/arm-smmu-nvidia.c @@ -258,6 +258,34 @@ static void nvidia_smmu_probe_finalize(struct arm_smmu_device *smmu, struct devi dev_name(dev), err); } +static int nvidia_smmu_init_context(struct arm_smmu_domain *smmu_domain, + struct io_pgtable_cfg *pgtbl_cfg, + struct device *dev) +{ + struct arm_smmu_device *smmu = smmu_domain->smmu; + const struct device_node *np = smmu->dev->of_node; + + /* +* Tegra194 and Tegra234 SoCs have the erratum that causes walk cache +* entries to not be invalidated correctly. The problem is that the walk +* cache index generated for IOVA is not same across translation and +* invalidation requests. 
This is leading to page faults when PMD entry +* is released during unmap and populated with new PTE table during +* subsequent map request. Disabling large page mappings avoids the +* release of PMD entry and avoid translations seeing stale PMD entry in +* walk cache. +* Fix this by limiting the page mappings to PAGE_SIZE on Tegra194 and +* Tegra234. +*/ + if (of_device_is_compatible(np, "nvidia,tegra234-smmu") || + of_device_is_compatible(np, "nvidia,tegra194-smmu")) { + smmu->pgsize_bitmap = PAGE_SIZE; + pgtbl_cfg->pgsize_bitmap = smmu->pgsize_bitmap; + } + + return 0; +} + static const struct arm_smmu_impl nvidia_smmu_impl = { .read_reg = nvidia_smmu_read_reg, .write_reg = nvidia_smmu_write_reg, @@ -268,10 +296,12 @@ static const struct arm_smmu_impl nvidia_smmu_impl = { .global_fault = nvidia_smmu_global_fault, .context_fault = nvidia_smmu_context_fault, .probe_finalize = nvidia_smmu_probe_finalize, + .init_context = nvidia_smmu_init_context, }; static const struct arm_smmu_impl nvidia_smmu_single_impl = { .probe_finalize = nvidia_smmu_probe_finalize, + .init_context = nvidia_smmu_init_context, }; struct arm_smmu_device *nvidia_smmu_impl_init(struct arm_smmu_device *smmu) -- 2.35.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
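The effect of the workaround (clamping smmu->pgsize_bitmap to PAGE_SIZE) can be modelled outside the kernel: mapping code picks the largest supported page size that both fits the request and keeps the IOVA aligned, so once only the 4K bit remains set, no PMD-level (2M) block entries are ever created. The helper below is a simplified sketch of that selection, not the kernel's actual iommu_pgsize():

```c
#include <assert.h>
#include <stddef.h>

#define SZ_4K 0x1000UL
#define SZ_2M 0x200000UL

/* Pick the largest supported page size that fits the request and
 * keeps the IOVA aligned -- a simplified model, not iommu_pgsize(). */
static unsigned long pick_pgsize(unsigned long pgsize_bitmap,
                                 unsigned long iova, size_t size)
{
    unsigned long best = 0, pgsize;

    for (pgsize = 1; pgsize && pgsize <= size; pgsize <<= 1) {
        if ((pgsize_bitmap & pgsize) && !(iova & (pgsize - 1)))
            best = pgsize;
    }
    return best;
}
```

With both bits set, an aligned 2M request gets a single PMD block entry; with the bitmap clamped to 4K, the same request is built from 4K PTEs, so no PMD entry ever has to be released and re-populated.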
Re: [PATCH v6 08/12] iommu/sva: Use attach/detach_pasid_dev in SVA interfaces
On Tue, May 10, 2022 at 02:17:34PM +0800, Lu Baolu wrote: > +/** > + * iommu_sva_bind_device() - Bind a process address space to a device > + * @dev: the device > + * @mm: the mm to bind, caller must hold a reference to mm_users > + * @drvdata: opaque data pointer to pass to bind callback > + * > + * Create a bond between device and address space, allowing the device to > access > + * the mm using the returned PASID. If a bond already exists between @device > and > + * @mm, it is returned and an additional reference is taken. Caller must call > + * iommu_sva_unbind_device() to release each reference. > + * > + * iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_SVA) must be called first, to > + * initialize the required SVA features. > + * > + * On error, returns an ERR_PTR value. > + */ > +struct iommu_sva * > +iommu_sva_bind_device(struct device *dev, struct mm_struct *mm, void > *drvdata) > +{ > + int ret = -EINVAL; > + struct iommu_sva *handle; > + struct iommu_domain *domain; > + > + /* > + * TODO: Remove the drvdata parameter after kernel PASID support is > + * enabled for the idxd driver. > + */ > + if (drvdata) > + return ERR_PTR(-EOPNOTSUPP); Why is this being left behind? Clean up the callers too please. > + /* Allocate mm->pasid if necessary. */ > + ret = iommu_sva_alloc_pasid(mm, 1, (1U << dev->iommu->pasid_bits) - 1); > + if (ret) > + return ERR_PTR(ret); > + > + mutex_lock(&iommu_sva_lock); > + /* Search for an existing bond. */ > + handle = xa_load(&dev->iommu->sva_bonds, mm->pasid); > + if (handle) { > + refcount_inc(&handle->users); > + goto out_success; > + } How can there be an existing bond? dev->iommu is per-device The device_group_immutable_singleton() insists on a single device group Basically 'sva_bonds' is the same thing as the group->pasid_array. 
Assuming we leave room for multi-device groups this logic should just be group = iommu_group_get(dev); if (!group) return -ENODEV; mutex_lock(&group->mutex); domain = xa_load(&group->pasid_array, mm->pasid); if (!domain || domain->type != IOMMU_DOMAIN_SVA || domain->mm != mm) domain = iommu_sva_alloc_domain(dev, mm); ? And stick the refcount in the sva_domain Also, given the current arrangement it might make sense to have a struct iommu_domain_sva given that no driver is wrapping this in something else. Jason
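The lookup-or-allocate flow sketched above can be modelled in user space; here a toy fixed-size table stands in for the group's pasid_array and a plain counter for the refcount kept in the SVA domain. All names are illustrative, not kernel API:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_PASID 8

struct sva_domain {
    int pasid;   /* key in the toy table */
    int mm_id;   /* stands in for domain->mm */
    int users;   /* refcount kept in the SVA domain, as suggested */
};

static struct sva_domain *table[MAX_PASID]; /* stands in for group->pasid_array */
static struct sva_domain pool[MAX_PASID];
static int pool_used;

/* Return the existing SVA domain for (pasid, mm) or allocate a new one. */
static struct sva_domain *sva_get_domain(int pasid, int mm_id)
{
    struct sva_domain *d = table[pasid];

    if (d && d->mm_id == mm_id) {
        d->users++;             /* existing bond: just take a reference */
        return d;
    }
    d = &pool[pool_used++];     /* stands in for iommu_sva_alloc_domain() */
    d->pasid = pasid;
    d->mm_id = mm_id;
    d->users = 1;
    table[pasid] = d;
    return d;
}
```

A second bind of the same (pasid, mm) pair returns the same domain with its reference count bumped, which is exactly the deduplication the per-group pasid_array gives for free.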
Re: [PATCH v6 05/12] iommu/vt-d: Remove SVM_FLAG_SUPERVISOR_MODE support
On Tue, May 10, 2022 at 02:17:31PM +0800, Lu Baolu wrote: > The current kernel DMA with PASID support is based on the SVA with a flag > SVM_FLAG_SUPERVISOR_MODE. The IOMMU driver binds the kernel memory address > space to a PASID of the device. The device driver programs the device with > kernel virtual address (KVA) for DMA access. There have been security and > functional issues with this approach: > > - The lack of IOTLB synchronization upon kernel page table updates. > (vmalloc, module/BPF loading, CONFIG_DEBUG_PAGEALLOC etc.) > - Other than slight more protection, using kernel virtual address (KVA) > has little advantage over physical address. There are also no use > cases yet where DMA engines need kernel virtual addresses for in-kernel > DMA. > > This removes SVM_FLAG_SUPERVISOR_MODE support in the Intel IOMMU driver. > The device driver is suggested to handle kernel DMA with PASID through > the kernel DMA APIs. > > Link: https://lore.kernel.org/linux-iommu/20210511194726.gp1002...@nvidia.com/ > Signed-off-by: Jacob Pan > Signed-off-by: Lu Baolu > --- > drivers/iommu/intel/svm.c | 53 +-- > 1 file changed, 12 insertions(+), 41 deletions(-) Reviewed-by: Jason Gunthorpe Jason
Re: [PATCH v6 02/12] iommu: Add pasid_bits field in struct dev_iommu
On Tue, May 10, 2022 at 02:17:28PM +0800, Lu Baolu wrote: > int iommu_device_register(struct iommu_device *iommu, > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > index 627a3ed5ee8f..afc63fce6107 100644 > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > @@ -2681,6 +2681,8 @@ static struct iommu_device > *arm_smmu_probe_device(struct device *dev) > smmu->features & ARM_SMMU_FEAT_STALL_FORCE) > master->stall_enabled = true; > > + dev->iommu->pasid_bits = master->ssid_bits; > return &smmu->iommu; > > err_free_master: > diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c > index 2990f80c5e08..99643f897f26 100644 > +++ b/drivers/iommu/intel/iommu.c > @@ -4624,8 +4624,11 @@ static struct iommu_device > *intel_iommu_probe_device(struct device *dev) > if (pasid_supported(iommu)) { > int features = pci_pasid_features(pdev); > > - if (features >= 0) > + if (features >= 0) { > info->pasid_supported = features | 1; > + dev->iommu->pasid_bits = > + fls(pci_max_pasids(pdev)) - 1; > + } It is not very nice that both the iommu drivers have to duplicate the code to read the pasid capability out of the PCI device. IMHO it would make more sense for the iommu layer to report the capability of its own HW block only, and for the core code to figure out the master's limitation using a bus-specific approach. It is also unfortunate that the enable/disable pasid is inside the iommu driver as well - ideally the PCI driver itself would do this when it knows it wants to use PASIDs. The ordering interaction with ATS makes this look quite annoying though. :( I'm also not convinced individual IOMMU drivers should be forcing ATS on, there are performance and functional implications here. Using ATS or not is possibly best left as an administrator policy controlled by the core code. Again we seem to have some mess. 
Jason
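The fls() arithmetic in the Intel hunk above can be sanity-checked in isolation: for a device exposing 2^20 PASIDs, pci_max_pasids() would return 1 << 20, and fls(...) - 1 yields a width of 20 bits. A stand-alone model (fls_model here is a local helper mimicking the kernel's fls(), not the real macro):

```c
#include <assert.h>

/* Local model of the kernel's fls(): 1-based index of the highest set bit. */
static int fls_model(unsigned int x)
{
    int i = 0;

    while (x) {
        i++;
        x >>= 1;
    }
    return i;
}

/* pasid_bits as computed in the quoted hunk: fls(pci_max_pasids(pdev)) - 1 */
static int pasid_bits_from_max(unsigned int max_pasids)
{
    return fls_model(max_pasids) - 1;
}
```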
[PATCH] swiotlb: Max mapping size takes min align mask into account
From: Tianyu Lan swiotlb_find_slots() skips slots according to io tlb aligned mask calculated from min aligned mask and original physical address offset. This affects max mapping size. The mapping size can't achieve the IO_TLB_SEGSIZE * IO_TLB_SIZE when original offset is non-zero. This will cause system boot up failure in Hyper-V Isolation VM where swiotlb force is enabled. Scsi layer use return value of dma_max_mapping_size() to set max segment size and it finally calls swiotlb_max_mapping_size(). Hyper-V storage driver sets min align mask to 4k - 1. Scsi layer may pass 256k length of request buffer with 0~4k offset and Hyper-V storage driver can't get swiotlb bounce buffer via DMA API. Swiotlb_find_slots() can't find 256k length bounce buffer with offset. Make swiotlb_max_mapping_size() take min align mask into account. Signed-off-by: Tianyu Lan --- kernel/dma/swiotlb.c | 13 ++++++++++++- 1 file changed, 12 insertions(+), 1 deletion(-) diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 73a41cec9e38..0d6684ca7eab 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -743,7 +743,18 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t paddr, size_t size, size_t swiotlb_max_mapping_size(struct device *dev) { - return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE; + int min_align_mask = dma_get_min_align_mask(dev); + int min_align = 0; + + /* +* swiotlb_find_slots() skips slots according to +* min align mask. This affects max mapping size. +* Take it into acount here. +*/ + if (min_align_mask) + min_align = roundup(min_align_mask, IO_TLB_SIZE); + + return ((size_t)IO_TLB_SIZE) * IO_TLB_SEGSIZE - min_align; } bool is_swiotlb_active(struct device *dev) -- 2.25.1
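The arithmetic of the fix can be checked standalone: with the default 2 KiB IO_TLB_SIZE, 128 slots per segment, and Hyper-V's min align mask of 4K - 1, the advertised maximum drops from 256 KiB to 252 KiB. The constants mirror the kernel defaults, but the helper is a sketch of the patched logic, not the kernel function itself:

```c
#include <assert.h>
#include <stddef.h>

#define IO_TLB_SHIFT   11
#define IO_TLB_SIZE    (1UL << IO_TLB_SHIFT)   /* 2 KiB per slot */
#define IO_TLB_SEGSIZE 128                     /* slots per contiguous segment */

/* Round x up to the next multiple of 'to' (like the kernel's roundup()). */
static size_t round_up_to(size_t x, size_t to)
{
    return ((x + to - 1) / to) * to;
}

/* Sketch of the patched swiotlb_max_mapping_size() calculation. */
static size_t max_mapping_size(size_t min_align_mask)
{
    size_t min_align = 0;

    if (min_align_mask)
        min_align = round_up_to(min_align_mask, IO_TLB_SIZE);

    return IO_TLB_SIZE * IO_TLB_SEGSIZE - min_align;
}
```

So a 256 KiB request at a 0~4K offset, which previously could not be bounced at all, now simply never reaches swiotlb because the SCSI layer caps segments at the reduced 252 KiB.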
Re: [PATCH v6 03/12] iommu: Add attach/detach_dev_pasid domain ops
On Tue, May 10, 2022 at 02:17:29PM +0800, Lu Baolu wrote: > This adds a pair of common domain ops for this purpose and adds helpers > to attach/detach a domain to/from a {device, PASID}. I wonder if this should not have a detach op - after discussing with Robin we can see that detach_dev is not used in updated drivers. Instead attach_dev acts as 'set_domain' So, it would be more symmetrical if attaching a blocking_domain to the PASID was the way to 'detach'. This could be made straightforward by following the sketch I showed to have a static, global blocking_domain and providing a pointer to it in struct iommu_ops Then 'detach pasid' is: iommu_ops->blocking_domain->ops->attach_dev_pasid(domain, dev, pasid); And we move away from the notion of 'detach' and in the direction that everything continuously has a domain set. PASID would logically default to blocking_domain, though we wouldn't track this anywhere. Jason
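The "everything continuously has a domain" model can be sketched in a few lines of C: detaching a PASID is just attaching a shared static blocking domain that refuses DMA, so no detach op is needed. Structure and names below are illustrative only, not the kernel's actual types:

```c
#include <assert.h>
#include <stddef.h>

struct iommu_domain {
    const char *name;
    int (*attach_dev_pasid)(struct iommu_domain *d, int pasid);
};

static struct iommu_domain *current_domain[8]; /* active domain per PASID */

static int generic_attach(struct iommu_domain *d, int pasid)
{
    current_domain[pasid] = d;
    return 0;
}

/* One static blocking domain, shared by every device of this iommu_ops. */
static struct iommu_domain blocking_domain = {
    .name = "blocking",
    .attach_dev_pasid = generic_attach,
};

/* 'detach' is simply attaching the blocking domain -- no detach op. */
static int iommu_detach_pasid(int pasid)
{
    return blocking_domain.attach_dev_pasid(&blocking_domain, pasid);
}
```

After a detach the PASID still has a well-defined domain (the blocking one), which is exactly the symmetry being argued for.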
Re: [PATCH v7 2/7] hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device
On 10/05/2022 12:18, Yicong Yang wrote: > On 2022/5/10 17:46, James Clark wrote: >> >> >> On 07/04/2022 13:58, Yicong Yang wrote: >>> HiSilicon PCIe tune and trace device(PTT) is a PCIe Root Complex integrated >>> Endpoint(RCiEP) device, providing the capability to dynamically monitor and >>> tune the PCIe traffic, and trace the TLP headers. >>> >>> Add the driver for the device to enable the trace function. Register PMU >>> device of PTT trace, then users can use trace through perf command. The >>> driver makes use of perf AUX trace and support following events to >>> configure the trace: >>> >>> - filter: select Root port or Endpoint to trace >>> - type: select the type of traced TLP headers >>> - direction: select the direction of traced TLP headers >>> - format: select the data format of the traced TLP headers >>> >>> This patch adds the driver part of PTT trace. The perf command support of >>> PTT trace is added in the following patch. >>> >>> Signed-off-by: Yicong Yang >>> Reviewed-by: Jonathan Cameron >>> --- >>> drivers/Makefile | 1 + >>> drivers/hwtracing/Kconfig| 2 + >>> drivers/hwtracing/ptt/Kconfig| 12 + >>> drivers/hwtracing/ptt/Makefile | 2 + >>> drivers/hwtracing/ptt/hisi_ptt.c | 874 +++ >>> drivers/hwtracing/ptt/hisi_ptt.h | 166 ++ >>> 6 files changed, 1057 insertions(+) >>> create mode 100644 drivers/hwtracing/ptt/Kconfig >>> create mode 100644 drivers/hwtracing/ptt/Makefile >>> create mode 100644 drivers/hwtracing/ptt/hisi_ptt.c >>> create mode 100644 drivers/hwtracing/ptt/hisi_ptt.h >>> >>> diff --git a/drivers/Makefile b/drivers/Makefile >>> index 020780b6b4d2..662d50599467 100644 >>> --- a/drivers/Makefile >>> +++ b/drivers/Makefile >>> @@ -175,6 +175,7 @@ obj-$(CONFIG_USB4) += thunderbolt/ >>> obj-$(CONFIG_CORESIGHT)+= hwtracing/coresight/ >>> obj-y += hwtracing/intel_th/ >>> obj-$(CONFIG_STM) += hwtracing/stm/ >>> +obj-$(CONFIG_HISI_PTT) += hwtracing/ptt/ >>> obj-$(CONFIG_ANDROID) += android/ >>> obj-$(CONFIG_NVMEM)+= nvmem/ >>> 
obj-$(CONFIG_FPGA) += fpga/ >>> diff --git a/drivers/hwtracing/Kconfig b/drivers/hwtracing/Kconfig >>> index 13085835a636..911ee977103c 100644 >>> --- a/drivers/hwtracing/Kconfig >>> +++ b/drivers/hwtracing/Kconfig >>> @@ -5,4 +5,6 @@ source "drivers/hwtracing/stm/Kconfig" >>> >>> source "drivers/hwtracing/intel_th/Kconfig" >>> >>> +source "drivers/hwtracing/ptt/Kconfig" >>> + >>> endmenu >>> diff --git a/drivers/hwtracing/ptt/Kconfig b/drivers/hwtracing/ptt/Kconfig >>> new file mode 100644 >>> index ..8902a6f27563 >>> --- /dev/null >>> +++ b/drivers/hwtracing/ptt/Kconfig >>> @@ -0,0 +1,12 @@ >>> +# SPDX-License-Identifier: GPL-2.0-only >>> +config HISI_PTT >>> + tristate "HiSilicon PCIe Tune and Trace Device" >>> + depends on ARM64 || (COMPILE_TEST && 64BIT) >>> + depends on PCI && HAS_DMA && HAS_IOMEM && PERF_EVENTS >>> + help >>> + HiSilicon PCIe Tune and Trace Device exists as a PCIe RCiEP >>> + device, and it provides support for PCIe traffic tuning and >>> + tracing TLP headers to the memory. >>> + >>> + This driver can also be built as a module. If so, the module >>> + will be called hisi_ptt. >>> diff --git a/drivers/hwtracing/ptt/Makefile b/drivers/hwtracing/ptt/Makefile >>> new file mode 100644 >>> index ..908c09a98161 >>> --- /dev/null >>> +++ b/drivers/hwtracing/ptt/Makefile >>> @@ -0,0 +1,2 @@ >>> +# SPDX-License-Identifier: GPL-2.0 >>> +obj-$(CONFIG_HISI_PTT) += hisi_ptt.o >>> diff --git a/drivers/hwtracing/ptt/hisi_ptt.c >>> b/drivers/hwtracing/ptt/hisi_ptt.c >>> new file mode 100644 >>> index ..242b41870380 >>> --- /dev/null >>> +++ b/drivers/hwtracing/ptt/hisi_ptt.c >>> @@ -0,0 +1,874 @@ >>> +// SPDX-License-Identifier: GPL-2.0 >>> +/* >>> + * Driver for HiSilicon PCIe tune and trace device >>> + * >>> + * Copyright (c) 2022 HiSilicon Technologies Co., Ltd. 
>>> + * Author: Yicong Yang >>> + */ >>> + >>> +#include >>> +#include >>> +#include >>> +#include >>> +#include >>> +#include >>> +#include >>> +#include >>> +#include >>> +#include >>> +#include >>> +#include >>> + >>> +#include "hisi_ptt.h" >>> + >>> +static u16 hisi_ptt_get_filter_val(struct pci_dev *pdev) >>> +{ >>> + if (pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) >>> + return BIT(HISI_PCIE_CORE_PORT_ID(PCI_SLOT(pdev->devfn))); >>> + >>> + return PCI_DEVID(pdev->bus->number, pdev->devfn); >>> +} >>> + >>> +static bool hisi_ptt_wait_trace_hw_idle(struct hisi_ptt *hisi_ptt) >>> +{ >>> + u32 val; >>> + >>> + return !readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_STS, >>> + val, val & HISI_PTT_TRACE_IDLE, >>> + HISI_PTT_WAIT_POLL_INTERVAL_US, >>> +
Re: [PATCH v7 5/7] perf tool: Add support for HiSilicon PCIe Tune and Trace device driver
On 07/04/2022 13:58, Yicong Yang wrote: > From: Qi Liu > > 'perf record' and 'perf report --dump-raw-trace' supported in this > patch. > > Example usage: > > Output will contain raw PTT data and its textual representation, such > as: > > 0 0 0x5810 [0x30]: PERF_RECORD_AUXTRACE size: 0x40 offset: 0 > ref: 0xa5d50c725 idx: 0 tid: -1 cpu: 0 > . > . ... HISI PTT data: size 4194304 bytes > . : 00 00 00 00 Prefix > . 0004: 08 20 00 60 Header DW0 > . 0008: ff 02 00 01 Header DW1 > . 000c: 20 08 00 00 Header DW2 > . 0010: 10 e7 44 ab Header DW3 > . 0014: 2a a8 1e 01 Time > . 0020: 00 00 00 00 Prefix > . 0024: 01 00 00 60 Header DW0 > . 0028: 0f 1e 00 01 Header DW1 > . 002c: 04 00 00 00 Header DW2 > . 0030: 40 00 81 02 Header DW3 > . 0034: ee 02 00 00 Time > > > Signed-off-by: Qi Liu > Signed-off-by: Yicong Yang > --- > tools/perf/arch/arm/util/auxtrace.c | 76 +- > tools/perf/arch/arm/util/pmu.c| 3 + > tools/perf/arch/arm64/util/Build | 2 +- > tools/perf/arch/arm64/util/hisi_ptt.c | 195 > tools/perf/util/Build | 2 + > tools/perf/util/auxtrace.c| 4 + > tools/perf/util/auxtrace.h| 1 + > tools/perf/util/hisi-ptt-decoder/Build| 1 + > .../hisi-ptt-decoder/hisi-ptt-pkt-decoder.c | 170 ++ > .../hisi-ptt-decoder/hisi-ptt-pkt-decoder.h | 28 +++ > tools/perf/util/hisi_ptt.c| 218 ++ > tools/perf/util/hisi_ptt.h| 28 +++ > 12 files changed, 724 insertions(+), 4 deletions(-) > create mode 100644 tools/perf/arch/arm64/util/hisi_ptt.c > create mode 100644 tools/perf/util/hisi-ptt-decoder/Build > create mode 100644 tools/perf/util/hisi-ptt-decoder/hisi-ptt-pkt-decoder.c > create mode 100644 tools/perf/util/hisi-ptt-decoder/hisi-ptt-pkt-decoder.h > create mode 100644 tools/perf/util/hisi_ptt.c > create mode 100644 tools/perf/util/hisi_ptt.h > > diff --git a/tools/perf/arch/arm/util/auxtrace.c > b/tools/perf/arch/arm/util/auxtrace.c > index 5fc6a2a3dbc5..393f5757c039 100644 > --- a/tools/perf/arch/arm/util/auxtrace.c > +++ b/tools/perf/arch/arm/util/auxtrace.c > @@ -4,9 +4,11 @@ > * Author: 
Mathieu Poirier > */ > > +#include > #include > #include > #include > +#include > > #include "../../../util/auxtrace.h" > #include "../../../util/debug.h" > @@ -14,6 +16,7 @@ > #include "../../../util/pmu.h" > #include "cs-etm.h" > #include "arm-spe.h" > +#include "hisi_ptt.h" > > static struct perf_pmu **find_all_arm_spe_pmus(int *nr_spes, int *err) > { > @@ -50,6 +53,58 @@ static struct perf_pmu **find_all_arm_spe_pmus(int > *nr_spes, int *err) > return arm_spe_pmus; > } > > +static struct perf_pmu **find_all_hisi_ptt_pmus(int *nr_ptts, int *err) > +{ > + const char *sysfs = sysfs__mountpoint(); > + struct perf_pmu **hisi_ptt_pmus = NULL; > + struct dirent *dent; > + char path[PATH_MAX]; > + DIR *dir = NULL; > + int idx = 0; > + > + snprintf(path, PATH_MAX, "%s" EVENT_SOURCE_DEVICE_PATH, sysfs); > + dir = opendir(path); > + if (!dir) { > + pr_err("can't read directory '%s'\n", EVENT_SOURCE_DEVICE_PATH); > + *err = -EINVAL; > + goto out; > + } > + > + while ((dent = readdir(dir))) { > + if (strstr(dent->d_name, HISI_PTT_PMU_NAME)) > + (*nr_ptts)++; > + } > + > + if (!(*nr_ptts)) > + goto out; > + > + hisi_ptt_pmus = zalloc(sizeof(struct perf_pmu *) * (*nr_ptts)); > + if (!hisi_ptt_pmus) { > + pr_err("hisi_ptt alloc failed\n"); > + *err = -ENOMEM; > + goto out; > + } > + > + rewinddir(dir); > + while ((dent = readdir(dir))) { > + if (strstr(dent->d_name, HISI_PTT_PMU_NAME) && idx < > (*nr_ptts)) { > + hisi_ptt_pmus[idx] = perf_pmu__find(dent->d_name); > + if (hisi_ptt_pmus[idx]) { > + pr_debug2("%s %d: hisi_ptt_pmu %d type %d name > %s\n", > + __func__, __LINE__, idx, > + hisi_ptt_pmus[idx]->type, > + hisi_ptt_pmus[idx]->name); > + idx++; > + } > + > + } > + } > + > +out: > + closedir(dir); > + return hisi_ptt_pmus; > +} > + > struct auxtrace_record >
Re: [PATCH v7 2/7] hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device
On 07/04/2022 13:58, Yicong Yang wrote: > HiSilicon PCIe tune and trace device(PTT) is a PCIe Root Complex integrated > Endpoint(RCiEP) device, providing the capability to dynamically monitor and > tune the PCIe traffic, and trace the TLP headers. > > Add the driver for the device to enable the trace function. Register PMU > device of PTT trace, then users can use trace through perf command. The > driver makes use of perf AUX trace and support following events to > configure the trace: > > - filter: select Root port or Endpoint to trace > - type: select the type of traced TLP headers > - direction: select the direction of traced TLP headers > - format: select the data format of the traced TLP headers > > This patch adds the driver part of PTT trace. The perf command support of > PTT trace is added in the following patch. > > Signed-off-by: Yicong Yang > Reviewed-by: Jonathan Cameron > --- > drivers/Makefile | 1 + > drivers/hwtracing/Kconfig| 2 + > drivers/hwtracing/ptt/Kconfig| 12 + > drivers/hwtracing/ptt/Makefile | 2 + > drivers/hwtracing/ptt/hisi_ptt.c | 874 +++ > drivers/hwtracing/ptt/hisi_ptt.h | 166 ++ > 6 files changed, 1057 insertions(+) > create mode 100644 drivers/hwtracing/ptt/Kconfig > create mode 100644 drivers/hwtracing/ptt/Makefile > create mode 100644 drivers/hwtracing/ptt/hisi_ptt.c > create mode 100644 drivers/hwtracing/ptt/hisi_ptt.h > > diff --git a/drivers/Makefile b/drivers/Makefile > index 020780b6b4d2..662d50599467 100644 > --- a/drivers/Makefile > +++ b/drivers/Makefile > @@ -175,6 +175,7 @@ obj-$(CONFIG_USB4)+= thunderbolt/ > obj-$(CONFIG_CORESIGHT) += hwtracing/coresight/ > obj-y+= hwtracing/intel_th/ > obj-$(CONFIG_STM)+= hwtracing/stm/ > +obj-$(CONFIG_HISI_PTT) += hwtracing/ptt/ > obj-$(CONFIG_ANDROID)+= android/ > obj-$(CONFIG_NVMEM) += nvmem/ > obj-$(CONFIG_FPGA) += fpga/ > diff --git a/drivers/hwtracing/Kconfig b/drivers/hwtracing/Kconfig > index 13085835a636..911ee977103c 100644 > --- a/drivers/hwtracing/Kconfig > +++ 
b/drivers/hwtracing/Kconfig > @@ -5,4 +5,6 @@ source "drivers/hwtracing/stm/Kconfig" > > source "drivers/hwtracing/intel_th/Kconfig" > > +source "drivers/hwtracing/ptt/Kconfig" > + > endmenu > diff --git a/drivers/hwtracing/ptt/Kconfig b/drivers/hwtracing/ptt/Kconfig > new file mode 100644 > index ..8902a6f27563 > --- /dev/null > +++ b/drivers/hwtracing/ptt/Kconfig > @@ -0,0 +1,12 @@ > +# SPDX-License-Identifier: GPL-2.0-only > +config HISI_PTT > + tristate "HiSilicon PCIe Tune and Trace Device" > + depends on ARM64 || (COMPILE_TEST && 64BIT) > + depends on PCI && HAS_DMA && HAS_IOMEM && PERF_EVENTS > + help > + HiSilicon PCIe Tune and Trace Device exists as a PCIe RCiEP > + device, and it provides support for PCIe traffic tuning and > + tracing TLP headers to the memory. > + > + This driver can also be built as a module. If so, the module > + will be called hisi_ptt. > diff --git a/drivers/hwtracing/ptt/Makefile b/drivers/hwtracing/ptt/Makefile > new file mode 100644 > index ..908c09a98161 > --- /dev/null > +++ b/drivers/hwtracing/ptt/Makefile > @@ -0,0 +1,2 @@ > +# SPDX-License-Identifier: GPL-2.0 > +obj-$(CONFIG_HISI_PTT) += hisi_ptt.o > diff --git a/drivers/hwtracing/ptt/hisi_ptt.c > b/drivers/hwtracing/ptt/hisi_ptt.c > new file mode 100644 > index ..242b41870380 > --- /dev/null > +++ b/drivers/hwtracing/ptt/hisi_ptt.c > @@ -0,0 +1,874 @@ > +// SPDX-License-Identifier: GPL-2.0 > +/* > + * Driver for HiSilicon PCIe tune and trace device > + * > + * Copyright (c) 2022 HiSilicon Technologies Co., Ltd. 
> + * Author: Yicong Yang > + */ > + > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > +#include > + > +#include "hisi_ptt.h" > + > +static u16 hisi_ptt_get_filter_val(struct pci_dev *pdev) > +{ > + if (pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) > + return BIT(HISI_PCIE_CORE_PORT_ID(PCI_SLOT(pdev->devfn))); > + > + return PCI_DEVID(pdev->bus->number, pdev->devfn); > +} > + > +static bool hisi_ptt_wait_trace_hw_idle(struct hisi_ptt *hisi_ptt) > +{ > + u32 val; > + > + return !readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_STS, > + val, val & HISI_PTT_TRACE_IDLE, > + HISI_PTT_WAIT_POLL_INTERVAL_US, > + HISI_PTT_WAIT_TRACE_TIMEOUT_US); > +} > + > +static bool hisi_ptt_wait_dma_reset_done(struct hisi_ptt *hisi_ptt) > +{ > + u32 val; > + > + return !readl_poll_timeout_atomic(hisi_ptt->iobase + > HISI_PTT_TRACE_WR_STS, > + val, !val, > HISI_PTT_RESET_POLL_INTERVAL_US,
Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
On Tue, May 10, 2022 at 01:38:26AM +, Tian, Kevin wrote: > > However, it costs nothing to have dirty tracking as long as all iommus > > support it in the system - which seems to be the normal case today. > > > > We should just always turn it on at this point. > > Then still need a way to report " all iommus support it in the system" > to userspace since many old systems don't support it at all. Userspace can query the iommu_domain directly, or 'try and fail' to turn on tracking. A device capability flag is useless without a control knob to request a domain is created with tracking, and we don't have that, or a reason to add that. Jason
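The 'try and fail' probing described above is a familiar pattern: userspace simply requests tracking and treats an explicit not-supported error as the capability answer. The function names and the choice of -EOPNOTSUPP below are illustrative assumptions, not a defined uAPI:

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Stand-in for the per-domain answer an iommu driver would give. */
static int set_dirty_tracking(bool hw_supports)
{
    return hw_supports ? 0 : -EOPNOTSUPP;
}

/* Userspace probes support by just trying to enable it. */
static bool dirty_tracking_available(bool hw_supports)
{
    return set_dirty_tracking(hw_supports) != -EOPNOTSUPP;
}
```

The point of the argument is that this knob doubles as the capability report, making a separate device flag redundant.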
Re: [PATCH v7 2/7] hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device
On 2022/5/10 20:54, James Clark wrote: > > > On 10/05/2022 12:18, Yicong Yang wrote: >> On 2022/5/10 17:46, James Clark wrote: >>> >>> >>> On 07/04/2022 13:58, Yicong Yang wrote: HiSilicon PCIe tune and trace device(PTT) is a PCIe Root Complex integrated Endpoint(RCiEP) device, providing the capability to dynamically monitor and tune the PCIe traffic, and trace the TLP headers. Add the driver for the device to enable the trace function. Register PMU device of PTT trace, then users can use trace through perf command. The driver makes use of perf AUX trace and support following events to configure the trace: - filter: select Root port or Endpoint to trace - type: select the type of traced TLP headers - direction: select the direction of traced TLP headers - format: select the data format of the traced TLP headers This patch adds the driver part of PTT trace. The perf command support of PTT trace is added in the following patch. Signed-off-by: Yicong Yang Reviewed-by: Jonathan Cameron --- drivers/Makefile | 1 + drivers/hwtracing/Kconfig| 2 + drivers/hwtracing/ptt/Kconfig| 12 + drivers/hwtracing/ptt/Makefile | 2 + drivers/hwtracing/ptt/hisi_ptt.c | 874 +++ drivers/hwtracing/ptt/hisi_ptt.h | 166 ++ 6 files changed, 1057 insertions(+) create mode 100644 drivers/hwtracing/ptt/Kconfig create mode 100644 drivers/hwtracing/ptt/Makefile create mode 100644 drivers/hwtracing/ptt/hisi_ptt.c create mode 100644 drivers/hwtracing/ptt/hisi_ptt.h diff --git a/drivers/Makefile b/drivers/Makefile index 020780b6b4d2..662d50599467 100644 --- a/drivers/Makefile +++ b/drivers/Makefile @@ -175,6 +175,7 @@ obj-$(CONFIG_USB4) += thunderbolt/ obj-$(CONFIG_CORESIGHT) += hwtracing/coresight/ obj-y += hwtracing/intel_th/ obj-$(CONFIG_STM) += hwtracing/stm/ +obj-$(CONFIG_HISI_PTT)+= hwtracing/ptt/ obj-$(CONFIG_ANDROID) += android/ obj-$(CONFIG_NVMEM) += nvmem/ obj-$(CONFIG_FPGA)+= fpga/ diff --git a/drivers/hwtracing/Kconfig b/drivers/hwtracing/Kconfig index 13085835a636..911ee977103c 100644 --- 
a/drivers/hwtracing/Kconfig +++ b/drivers/hwtracing/Kconfig @@ -5,4 +5,6 @@ source "drivers/hwtracing/stm/Kconfig" source "drivers/hwtracing/intel_th/Kconfig" +source "drivers/hwtracing/ptt/Kconfig" + endmenu diff --git a/drivers/hwtracing/ptt/Kconfig b/drivers/hwtracing/ptt/Kconfig new file mode 100644 index ..8902a6f27563 --- /dev/null +++ b/drivers/hwtracing/ptt/Kconfig @@ -0,0 +1,12 @@ +# SPDX-License-Identifier: GPL-2.0-only +config HISI_PTT + tristate "HiSilicon PCIe Tune and Trace Device" + depends on ARM64 || (COMPILE_TEST && 64BIT) + depends on PCI && HAS_DMA && HAS_IOMEM && PERF_EVENTS + help +HiSilicon PCIe Tune and Trace Device exists as a PCIe RCiEP +device, and it provides support for PCIe traffic tuning and +tracing TLP headers to the memory. + +This driver can also be built as a module. If so, the module +will be called hisi_ptt. diff --git a/drivers/hwtracing/ptt/Makefile b/drivers/hwtracing/ptt/Makefile new file mode 100644 index ..908c09a98161 --- /dev/null +++ b/drivers/hwtracing/ptt/Makefile @@ -0,0 +1,2 @@ +# SPDX-License-Identifier: GPL-2.0 +obj-$(CONFIG_HISI_PTT) += hisi_ptt.o diff --git a/drivers/hwtracing/ptt/hisi_ptt.c b/drivers/hwtracing/ptt/hisi_ptt.c new file mode 100644 index ..242b41870380 --- /dev/null +++ b/drivers/hwtracing/ptt/hisi_ptt.c @@ -0,0 +1,874 @@ +// SPDX-License-Identifier: GPL-2.0 +/* + * Driver for HiSilicon PCIe tune and trace device + * + * Copyright (c) 2022 HiSilicon Technologies Co., Ltd. 
+ * Author: Yicong Yang + */ + +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include +#include + +#include "hisi_ptt.h" + +static u16 hisi_ptt_get_filter_val(struct pci_dev *pdev) +{ + if (pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) + return BIT(HISI_PCIE_CORE_PORT_ID(PCI_SLOT(pdev->devfn))); + + return PCI_DEVID(pdev->bus->number, pdev->devfn); +} + +static bool hisi_ptt_wait_trace_hw_idle(struct hisi_ptt *hisi_ptt) +{ + u32 val; + + return !readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_STS, +
Re: [PATCH v7 1/7] iommu/arm-smmu-v3: Make default domain type of HiSilicon PTT device to identity
On 2022/5/10 19:23, Will Deacon wrote: > On Thu, Apr 07, 2022 at 08:58:35PM +0800, Yicong Yang wrote: >> The DMA operations of HiSilicon PTT device can only work properly with >> identical mappings. So add a quirk for the device to force the domain >> as passthrough. >> >> Signed-off-by: Yicong Yang >> --- >> drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 >> 1 file changed, 16 insertions(+) > > I still don't like this being part of the SMMU driver, but given that > (a) Robin doesn't seem to agree with the objection and (b) you've been > refreshing the patch series: > > Acked-by: Will Deacon > > If you do respin, then: > >> diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >> b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >> index 627a3ed5ee8f..5ec15ae2a9b1 100644 >> --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >> +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c >> @@ -2839,6 +2839,21 @@ static int arm_smmu_dev_disable_feature(struct device >> *dev, >> } >> } > > It might be worth adding a brief comment here to explain what this device is > and why it needs an identity mapping. > >> +#define IS_HISI_PTT_DEVICE(pdev)((pdev)->vendor == PCI_VENDOR_ID_HUAWEI >> && \ >> + (pdev)->device == 0xa12e) > Will add a brief comment here in next version. Thanks, Yicong
Re: [PATCH v6 29/29] x86/tsc: Switch to perf-based hardlockup detector if TSC become unstable
Excerpts from Ricardo Neri's message of May 6, 2022 10:00 am: > The HPET-based hardlockup detector relies on the TSC to determine if an > observed NMI interrupt was originated by HPET timer. Hence, this detector > can no longer be used with an unstable TSC. > > In such case, permanently stop the HPET-based hardlockup detector and > start the perf-based detector. > > Cc: Andi Kleen > Cc: Stephane Eranian > Cc: "Ravi V. Shankar" > Cc: iommu@lists.linux-foundation.org > Cc: linuxppc-...@lists.ozlabs.org > Cc: x...@kernel.org > Suggested-by: Thomas Gleixner > Reviewed-by: Tony Luck > Signed-off-by: Ricardo Neri > --- > Changes since v5: > * Relocated the delcaration of hardlockup_detector_switch_to_perf() to >x86/nmi.h It does not depend on HPET. > * Removed function stub. The shim hardlockup detector is always for x86. > > Changes since v4: > * Added a stub version of hardlockup_detector_switch_to_perf() for >!CONFIG_HPET_TIMER. (lkp) > * Reconfigure the whole lockup detector instead of unconditionally >starting the perf-based hardlockup detector. > > Changes since v3: > * None > > Changes since v2: > * Introduced this patch. 
> > Changes since v1: > * N/A > --- > arch/x86/include/asm/nmi.h | 6 ++ > arch/x86/kernel/tsc.c | 2 ++ > arch/x86/kernel/watchdog_hld.c | 6 ++ > 3 files changed, 14 insertions(+) > > diff --git a/arch/x86/include/asm/nmi.h b/arch/x86/include/asm/nmi.h > index 4a0d5b562c91..47752ff67d8b 100644 > --- a/arch/x86/include/asm/nmi.h > +++ b/arch/x86/include/asm/nmi.h > @@ -63,4 +63,10 @@ void stop_nmi(void); > void restart_nmi(void); > void local_touch_nmi(void); > > +#ifdef CONFIG_X86_HARDLOCKUP_DETECTOR > +void hardlockup_detector_switch_to_perf(void); > +#else > +static inline void hardlockup_detector_switch_to_perf(void) { } > +#endif > + > #endif /* _ASM_X86_NMI_H */ > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c > index cc1843044d88..74772ffc79d1 100644 > --- a/arch/x86/kernel/tsc.c > +++ b/arch/x86/kernel/tsc.c > @@ -1176,6 +1176,8 @@ void mark_tsc_unstable(char *reason) > > clocksource_mark_unstable(_tsc_early); > clocksource_mark_unstable(_tsc); > + > + hardlockup_detector_switch_to_perf(); > } > > EXPORT_SYMBOL_GPL(mark_tsc_unstable); > diff --git a/arch/x86/kernel/watchdog_hld.c b/arch/x86/kernel/watchdog_hld.c > index ef11f0af4ef5..7940977c6312 100644 > --- a/arch/x86/kernel/watchdog_hld.c > +++ b/arch/x86/kernel/watchdog_hld.c > @@ -83,3 +83,9 @@ void watchdog_nmi_start(void) > if (detector_type == X86_HARDLOCKUP_DETECTOR_HPET) > hardlockup_detector_hpet_start(); > } > + > +void hardlockup_detector_switch_to_perf(void) > +{ > + detector_type = X86_HARDLOCKUP_DETECTOR_PERF; Another possible problem along the same lines here, isn't your watchdog still running at this point? And it uses detector_type in the switch. > + lockup_detector_reconfigure(); Actually the detector_type switch is used in some functions called by lockup_detector_reconfigure() e.g., watchdog_nmi_stop, so this seems buggy even without concurrent watchdog. Is this switching a good idea in general? 
The admin has asked for a non-standard option because they want more PMU counters available, and now it eats a counter, potentially causing a problem rather than detecting one. I would rather just disable with a warning if it were up to me. If you *really* wanted to be fancy then allow the admin to re-enable via proc maybe. Thanks, Nick ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH RFC 00/19] IOMMUFD Dirty Tracking
On 5/10/22 02:38, Tian, Kevin wrote: >> From: Jason Gunthorpe >> Sent: Friday, May 6, 2022 7:46 PM >> >> On Fri, May 06, 2022 at 03:51:40AM +, Tian, Kevin wrote: From: Jason Gunthorpe Sent: Thursday, May 5, 2022 10:08 PM On Thu, May 05, 2022 at 07:40:37AM +, Tian, Kevin wrote: > In concept this is an iommu property instead of a domain property. Not really, domains shouldn't be changing behaviors once they are created. If a domain supports dirty tracking and I attach a new device then it still must support dirty tracking. >>> >>> That sort of suggests that userspace should specify whether a domain >>> supports dirty tracking when it's created. But how does userspace >>> know that it should create the domain in this way in the first place? >>> live migration is triggered on demand and it may not happen in the >>> lifetime of a VM. >> >> The best you could do is to look at the devices being plugged in at VM >> startup, and if they all support live migration then request dirty >> tracking, otherwise don't. > > Yes, this is how a device capability can help. > >> >> However, tt costs nothing to have dirty tracking as long as all iommus >> support it in the system - which seems to be the normal case today. >> >> We should just always turn it on at this point. > > Then still need a way to report " all iommus support it in the system" > to userspace since many old systems don't support it at all. If we all > agree that a device capability flag would be helpful on this front (like > you also said below), probably can start building the initial skeleton > with that in mind? > This would capture device-specific and maybe iommu-instance features, but there's some tiny bit odd semantic here. There's nothing that depends on the device to support any of this, but rather the IOMMU instance that sits below the device which is independent of device-own capabilities e.g. 
PRI on the other hand would be a perfect fit for a device capability (?), but dirty tracking conveying over a device capability would be a convenience rather than an exact hw representation. Thinking out loud if we are going as a device/iommu capability [to see if this matches what people have or not in mind]: we would add dirty-tracking feature bit via the existent kAPI for iommu device features (e.g. IOMMU_DEV_FEAT_AD) and on iommufd we would maybe add an IOMMUFD_CMD_DEV_GET_IOMMU_FEATURES ioctl which would have an u64 dev_id as input (from the returned vfio-pci BIND_IOMMUFD @out_dev_id) and u64 features as an output bitmap of synthetic feature bits, having IOMMUFD_FEATURE_AD the only one we query (and IOMMUFD_FEATURE_{SVA,IOPF} as potentially future candidates). Qemu would then at start of day would check if /all devices/ support it and it would then still do the blind set tracking, but bail out preemptively if any of device-iommu don't support dirty-tracking. I don't think we have any case today for having to deal with different IOMMU instances that have different features. Either that or as discussed in the beginning perhaps add an iommufd (or iommufd hwpt one) ioctl call (e.g.IOMMUFD_CMD_CAP) via a input value (e.g. subop IOMMU_FEATURES) which would gives us a structure of things (e.g. for the IOMMU_FEATURES subop the common featureset bitmap in all iommu instances). This would give the 'all iommus support it in the system'. Albeit the device one might have more concrete longevity if there's further plans aside from dirty tracking. >> >>> and if the user always creates domain to allow dirty tracking by default, >>> how does it know a failed attach is due to missing dirty tracking support >>> by the IOMMU and then creates another domain which disables dirty >>> tracking and retry-attach again? >> >> The automatic logic is complicated for sure, if you had a device flag >> it would have to figure it out that way >> > > Yes. That is the model in my mind. 
> > Thanks > Kevin
Re: [PATCH v6 28/29] x86/tsc: Restart NMI watchdog after refining tsc_khz
On Tue, May 10 2022 at 21:16, Nicholas Piggin wrote: > Excerpts from Ricardo Neri's message of May 6, 2022 10:00 am: >> +/* >> + * If in use, the HPET hardlockup detector relies on tsc_khz. >> + * Reconfigure it to make use of the refined tsc_khz. >> + */ >> +lockup_detector_reconfigure(); > > I don't know if the API is conceptually good. > > You change something that the lockup detector is currently using, > *while* the detector is running asynchronously, and then reconfigure > it. What happens in the window? If this code is only used for small > adjustments maybe it does not really matter but in principle it's > a bad API to export. > > lockup_detector_reconfigure as an internal API is okay because it > reconfigures things while the watchdog is stopped [actually that > looks untrue for soft dog which uses watchdog_thresh in > is_softlockup(), but that should be fixed]. > > You're the arch so you're allowed to stop the watchdog and configure > it, e.g., hardlockup_detector_perf_stop() is called in arch/. > > So you want to disable HPET watchdog if it was enabled, then update > wherever you're using tsc_khz, then re-enable. The real question is whether making this refined tsc_khz value immediately effective matters at all. IMO, it does not because up to that point the watchdog was happily using the coarse calibrated value and the whole use TSC to assess whether the HPET fired mechanism is just a guestimate anyway. So what's the point of trying to guess 'more correct'. Thanks, tglx ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] iommu/arm-smmu: Force identity domains for legacy binding
On Tue, 10 May 2022 09:38:58 +0100, Robin Murphy wrote: > When using the legacy "mmu-masters" DT binding, we reject DMA domains > since we have no guarantee of driver probe order and thus can't rely on > client drivers getting the correct DMA ops. However, we can do better > than fall back to the old no-default-domain behaviour now, by forcing an > identity default domain instead. This also means that detaching from a > VFIO domain can actually work - that looks to have been broken for over > 6 years, so clearly isn't something that legacy binding users care > about, but we may as well make the driver code make sense anyway. > > [...] Applied to will (for-joerg/arm-smmu/updates), thanks! [1/1] iommu/arm-smmu: Force identity domains for legacy binding https://git.kernel.org/will/c/628bf55b6204 Cheers, -- Will https://fixes.arm64.dev https://next.arm64.dev https://will.arm64.dev
Re: [PATCH v7 1/7] iommu/arm-smmu-v3: Make default domain type of HiSilicon PTT device to identity
On Thu, Apr 07, 2022 at 08:58:35PM +0800, Yicong Yang wrote: > The DMA operations of HiSilicon PTT device can only work properly with > identical mappings. So add a quirk for the device to force the domain > as passthrough. > > Signed-off-by: Yicong Yang > --- > drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 16 > 1 file changed, 16 insertions(+) I still don't like this being part of the SMMU driver, but given that (a) Robin doesn't seem to agree with the objection and (b) you've been refreshing the patch series: Acked-by: Will Deacon If you do respin, then: > diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > index 627a3ed5ee8f..5ec15ae2a9b1 100644 > --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c > @@ -2839,6 +2839,21 @@ static int arm_smmu_dev_disable_feature(struct device > *dev, > } > } It might be worth adding a brief comment here to explain what this device is and why it needs an identity mapping. > +#define IS_HISI_PTT_DEVICE(pdev) ((pdev)->vendor == PCI_VENDOR_ID_HUAWEI > && \ > + (pdev)->device == 0xa12e) Will
Re: [PATCH v7 2/7] hwtracing: Add trace function support for HiSilicon PCIe Tune and Trace device
On 2022/5/10 17:46, James Clark wrote: > > > On 07/04/2022 13:58, Yicong Yang wrote: >> HiSilicon PCIe tune and trace device(PTT) is a PCIe Root Complex integrated >> Endpoint(RCiEP) device, providing the capability to dynamically monitor and >> tune the PCIe traffic, and trace the TLP headers. >> >> Add the driver for the device to enable the trace function. Register PMU >> device of PTT trace, then users can use trace through perf command. The >> driver makes use of perf AUX trace and support following events to >> configure the trace: >> >> - filter: select Root port or Endpoint to trace >> - type: select the type of traced TLP headers >> - direction: select the direction of traced TLP headers >> - format: select the data format of the traced TLP headers >> >> This patch adds the driver part of PTT trace. The perf command support of >> PTT trace is added in the following patch. >> >> Signed-off-by: Yicong Yang >> Reviewed-by: Jonathan Cameron >> --- >> drivers/Makefile | 1 + >> drivers/hwtracing/Kconfig| 2 + >> drivers/hwtracing/ptt/Kconfig| 12 + >> drivers/hwtracing/ptt/Makefile | 2 + >> drivers/hwtracing/ptt/hisi_ptt.c | 874 +++ >> drivers/hwtracing/ptt/hisi_ptt.h | 166 ++ >> 6 files changed, 1057 insertions(+) >> create mode 100644 drivers/hwtracing/ptt/Kconfig >> create mode 100644 drivers/hwtracing/ptt/Makefile >> create mode 100644 drivers/hwtracing/ptt/hisi_ptt.c >> create mode 100644 drivers/hwtracing/ptt/hisi_ptt.h >> >> diff --git a/drivers/Makefile b/drivers/Makefile >> index 020780b6b4d2..662d50599467 100644 >> --- a/drivers/Makefile >> +++ b/drivers/Makefile >> @@ -175,6 +175,7 @@ obj-$(CONFIG_USB4) += thunderbolt/ >> obj-$(CONFIG_CORESIGHT) += hwtracing/coresight/ >> obj-y += hwtracing/intel_th/ >> obj-$(CONFIG_STM) += hwtracing/stm/ >> +obj-$(CONFIG_HISI_PTT) += hwtracing/ptt/ >> obj-$(CONFIG_ANDROID) += android/ >> obj-$(CONFIG_NVMEM) += nvmem/ >> obj-$(CONFIG_FPGA) += fpga/ >> diff --git a/drivers/hwtracing/Kconfig b/drivers/hwtracing/Kconfig >> 
index 13085835a636..911ee977103c 100644 >> --- a/drivers/hwtracing/Kconfig >> +++ b/drivers/hwtracing/Kconfig >> @@ -5,4 +5,6 @@ source "drivers/hwtracing/stm/Kconfig" >> >> source "drivers/hwtracing/intel_th/Kconfig" >> >> +source "drivers/hwtracing/ptt/Kconfig" >> + >> endmenu >> diff --git a/drivers/hwtracing/ptt/Kconfig b/drivers/hwtracing/ptt/Kconfig >> new file mode 100644 >> index ..8902a6f27563 >> --- /dev/null >> +++ b/drivers/hwtracing/ptt/Kconfig >> @@ -0,0 +1,12 @@ >> +# SPDX-License-Identifier: GPL-2.0-only >> +config HISI_PTT >> +tristate "HiSilicon PCIe Tune and Trace Device" >> +depends on ARM64 || (COMPILE_TEST && 64BIT) >> +depends on PCI && HAS_DMA && HAS_IOMEM && PERF_EVENTS >> +help >> + HiSilicon PCIe Tune and Trace Device exists as a PCIe RCiEP >> + device, and it provides support for PCIe traffic tuning and >> + tracing TLP headers to the memory. >> + >> + This driver can also be built as a module. If so, the module >> + will be called hisi_ptt. >> diff --git a/drivers/hwtracing/ptt/Makefile b/drivers/hwtracing/ptt/Makefile >> new file mode 100644 >> index ..908c09a98161 >> --- /dev/null >> +++ b/drivers/hwtracing/ptt/Makefile >> @@ -0,0 +1,2 @@ >> +# SPDX-License-Identifier: GPL-2.0 >> +obj-$(CONFIG_HISI_PTT) += hisi_ptt.o >> diff --git a/drivers/hwtracing/ptt/hisi_ptt.c >> b/drivers/hwtracing/ptt/hisi_ptt.c >> new file mode 100644 >> index ..242b41870380 >> --- /dev/null >> +++ b/drivers/hwtracing/ptt/hisi_ptt.c >> @@ -0,0 +1,874 @@ >> +// SPDX-License-Identifier: GPL-2.0 >> +/* >> + * Driver for HiSilicon PCIe tune and trace device >> + * >> + * Copyright (c) 2022 HiSilicon Technologies Co., Ltd. 
>> + * Author: Yicong Yang >> + */ >> + >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> +#include >> + >> +#include "hisi_ptt.h" >> + >> +static u16 hisi_ptt_get_filter_val(struct pci_dev *pdev) >> +{ >> +if (pci_pcie_type(pdev) == PCI_EXP_TYPE_ROOT_PORT) >> +return BIT(HISI_PCIE_CORE_PORT_ID(PCI_SLOT(pdev->devfn))); >> + >> +return PCI_DEVID(pdev->bus->number, pdev->devfn); >> +} >> + >> +static bool hisi_ptt_wait_trace_hw_idle(struct hisi_ptt *hisi_ptt) >> +{ >> +u32 val; >> + >> +return !readl_poll_timeout_atomic(hisi_ptt->iobase + HISI_PTT_TRACE_STS, >> + val, val & HISI_PTT_TRACE_IDLE, >> + HISI_PTT_WAIT_POLL_INTERVAL_US, >> + HISI_PTT_WAIT_TRACE_TIMEOUT_US); >> +} >> + >> +static bool hisi_ptt_wait_dma_reset_done(struct hisi_ptt *hisi_ptt) >> +{ >> +u32 val; >> + >> +return
Re: [PATCH v6 28/29] x86/tsc: Restart NMI watchdog after refining tsc_khz
Excerpts from Ricardo Neri's message of May 6, 2022 10:00 am: > The HPET hardlockup detector relies on tsc_khz to estimate the value of > that the TSC will have when its HPET channel fires. A refined tsc_khz > helps to estimate better the expected TSC value. > > Using the early value of tsc_khz may lead to a large error in the expected > TSC value. Restarting the NMI watchdog detector has the effect of kicking > its HPET channel and make use of the refined tsc_khz. > > When the HPET hardlockup is not in use, restarting the NMI watchdog is > a noop. > > Cc: Andi Kleen > Cc: Stephane Eranian > Cc: "Ravi V. Shankar" > Cc: iommu@lists.linux-foundation.org > Cc: linuxppc-...@lists.ozlabs.org > Cc: x...@kernel.org > Signed-off-by: Ricardo Neri > --- > Changes since v5: > * Introduced this patch > > Changes since v4 > * N/A > > Changes since v3 > * N/A > > Changes since v2: > * N/A > > Changes since v1: > * N/A > --- > arch/x86/kernel/tsc.c | 6 ++ > 1 file changed, 6 insertions(+) > > diff --git a/arch/x86/kernel/tsc.c b/arch/x86/kernel/tsc.c > index cafacb2e58cc..cc1843044d88 100644 > --- a/arch/x86/kernel/tsc.c > +++ b/arch/x86/kernel/tsc.c > @@ -1386,6 +1386,12 @@ static void tsc_refine_calibration_work(struct > work_struct *work) > /* Inform the TSC deadline clockevent devices about the recalibration */ > lapic_update_tsc_freq(); > > + /* > + * If in use, the HPET hardlockup detector relies on tsc_khz. > + * Reconfigure it to make use of the refined tsc_khz. > + */ > + lockup_detector_reconfigure(); I don't know if the API is conceptually good. You change something that the lockup detector is currently using, *while* the detector is running asynchronously, and then reconfigure it. What happens in the window? If this code is only used for small adjustments maybe it does not really matter but in principle it's a bad API to export. 
lockup_detector_reconfigure as an internal API is okay because it reconfigures things while the watchdog is stopped [actually that looks untrue for soft dog which uses watchdog_thresh in is_softlockup(), but that should be fixed]. You're the arch so you're allowed to stop the watchdog and configure it, e.g., hardlockup_detector_perf_stop() is called in arch/. So you want to disable HPET watchdog if it was enabled, then update wherever you're using tsc_khz, then re-enable. Thanks, Nick
Re: [PATCH v6 24/29] watchdog/hardlockup: Use parse_option_str() to handle "nmi_watchdog"
Excerpts from Ricardo Neri's message of May 6, 2022 10:00 am: > Prepare hardlockup_panic_setup() to handle a comma-separated list of > options. Thus, it can continue parsing its own command-line options while > ignoring parameters that are relevant only to specific implementations of > the hardlockup detector. Such implementations may use an early_param to > parse their own options. It can't really handle comma separated list though, until the next patch. nmi_watchdog=panic,0 does not make sense, so you lost error handling of that. And is it kosher to double handle options like this? I'm sure it happens but it's ugly. Would you consider just add a new option for x86 and avoid changing this? Less code and patches. Thanks, Nick > > Cc: Andi Kleen > Cc: Nicholas Piggin > Cc: Stephane Eranian > Cc: "Ravi V. Shankar" > Cc: iommu@lists.linux-foundation.org > Cc: linuxppc-...@lists.ozlabs.org > Cc: x...@kernel.org > Reviewed-by: Tony Luck > Signed-off-by: Ricardo Neri > --- > Changes since v5: > * Corrected typo in commit message. (Tony) > > Changes since v4: > * None > > Changes since v3: > * None > > Changes since v2: > * Introduced this patch. 
> > Changes since v1: > * None > --- > kernel/watchdog.c | 8 > 1 file changed, 4 insertions(+), 4 deletions(-) > > diff --git a/kernel/watchdog.c b/kernel/watchdog.c > index 9166220457bc..6443841a755f 100644 > --- a/kernel/watchdog.c > +++ b/kernel/watchdog.c > @@ -73,13 +73,13 @@ void __init hardlockup_detector_disable(void) > > static int __init hardlockup_panic_setup(char *str) > { > - if (!strncmp(str, "panic", 5)) > + if (parse_option_str(str, "panic")) > hardlockup_panic = 1; > - else if (!strncmp(str, "nopanic", 7)) > + else if (parse_option_str(str, "nopanic")) > hardlockup_panic = 0; > - else if (!strncmp(str, "0", 1)) > + else if (parse_option_str(str, "0")) > nmi_watchdog_user_enabled = 0; > - else if (!strncmp(str, "1", 1)) > + else if (parse_option_str(str, "1")) > nmi_watchdog_user_enabled = 1; > return 1; > } > -- > 2.17.1 > > ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v6 20/29] init/main: Delay initialization of the lockup detector after smp_init()
Excerpts from Ricardo Neri's message of May 6, 2022 9:59 am: > Certain implementations of the hardlockup detector require support for > Inter-Processor Interrupt shorthands. On x86, support for these can only > be determined after all the possible CPUs have booted once (in > smp_init()). Other architectures may not need such check. > > lockup_detector_init() only performs the initializations of data > structures of the lockup detector. Hence, there are no dependencies on > smp_init(). I think this is the only real thing which affects other watchdog types? Not sure if it's a big problem, the secondary CPUs coming up won't have their watchdog active until quite late, and the primary could implement its own timeout in __cpu_up for secondary coming up, and IPI it to get traces if necessary which is probably more robust. Acked-by: Nicholas Piggin > > Cc: Andi Kleen > Cc: Nicholas Piggin > Cc: Andrew Morton > Cc: Stephane Eranian > Cc: "Ravi V. Shankar" > Cc: iommu@lists.linux-foundation.org > Cc: linuxppc-...@lists.ozlabs.org > Cc: x...@kernel.org > Reviewed-by: Tony Luck > Signed-off-by: Ricardo Neri > --- > Changes since v5: > * Introduced this patch > > Changes since v4: > * N/A > > Changes since v3: > * N/A > > Changes since v2: > * N/A > > Changes since v1: > * N/A > --- > init/main.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/init/main.c b/init/main.c > index 98182c3c2c4b..62c52c9e4c2b 100644 > --- a/init/main.c > +++ b/init/main.c > @@ -1600,9 +1600,11 @@ static noinline void __init kernel_init_freeable(void) > > rcu_init_tasks_generic(); > do_pre_smp_initcalls(); > - lockup_detector_init(); > > smp_init(); > + > + lockup_detector_init(); > + > sched_init_smp(); > > padata_init(); > -- > 2.17.1 > > ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH] iommu/arm-smmu: Force identity domains for legacy binding
When using the legacy "mmu-masters" DT binding, we reject DMA domains since we have no guarantee of driver probe order and thus can't rely on client drivers getting the correct DMA ops. However, we can do better than fall back to the old no-default-domain behaviour now, by forcing an identity default domain instead. This also means that detaching from a VFIO domain can actually work - that looks to have been broken for over 6 years, so clearly isn't something that legacy binding users care about, but we may as well make the driver code make sense anyway. Suggested-by: Jason Gunthorpe Signed-off-by: Robin Murphy --- drivers/iommu/arm/arm-smmu/arm-smmu.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/drivers/iommu/arm/arm-smmu/arm-smmu.c b/drivers/iommu/arm/arm-smmu/arm-smmu.c index 568cce590ccc..e7ec2b74a28d 100644 --- a/drivers/iommu/arm/arm-smmu/arm-smmu.c +++ b/drivers/iommu/arm/arm-smmu/arm-smmu.c @@ -1574,6 +1574,9 @@ static int arm_smmu_def_domain_type(struct device *dev) struct arm_smmu_master_cfg *cfg = dev_iommu_priv_get(dev); const struct arm_smmu_impl *impl = cfg->smmu->impl; + if (using_legacy_binding) + return IOMMU_DOMAIN_IDENTITY; + if (impl && impl->def_domain_type) return impl->def_domain_type(dev); -- 2.35.3.dirty ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v12 0/9] ACPI/IORT: Support for IORT RMR node
On 2022-05-10 08:23, Shameerali Kolothum Thodi wrote: Hi Joerg/Robin, I think this series is now ready to be merged. Could you please let me know if there is anything missing. Fine by me - these patches have had enough review and testing now that even if anything else did come up, I think it would be better done as follow-up work on the merged code. Cheers, Robin. Thanks, Shameer -Original Message- From: Guohanjun (Hanjun Guo) Sent: 05 May 2022 02:24 To: Shameerali Kolothum Thodi ; linux-arm-ker...@lists.infradead.org; linux-a...@vger.kernel.org; iommu@lists.linux-foundation.org Cc: Linuxarm ; lorenzo.pieral...@arm.com; j...@8bytes.org; robin.mur...@arm.com; w...@kernel.org; wanghuiqiang ; steven.pr...@arm.com; sami.muja...@arm.com; j...@solid-run.com; eric.au...@redhat.com; laurentiu.tu...@nxp.com; h...@infradead.org Subject: Re: [PATCH v12 0/9] ACPI/IORT: Support for IORT RMR node On 2022/5/4 0:33, Shameer Kolothum wrote: Hi v11 --> v12 -Minor fix in patch #4 to address the issue reported by the kernel test robot. -Added R-by tags by Christoph(patch #1) and Lorenzo(patch #4). -Added T-by from Steve to all relevant patches. Many thanks!. Please note, this series has a dependency on the ACPICA header patch here[1]. Tested on a Kunpeng920 server machine with SMMUv3, the 3408iMR RAID controller card works as expected, Tested-by: Hanjun Guo Thanks Hanjun ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [PATCH v12 0/9] ACPI/IORT: Support for IORT RMR node
Hi Joerg/Robin, I think this series is now ready to be merged. Could you please let me know if there is anything missing. Thanks, Shameer > -Original Message- > From: Guohanjun (Hanjun Guo) > Sent: 05 May 2022 02:24 > To: Shameerali Kolothum Thodi ; > linux-arm-ker...@lists.infradead.org; linux-a...@vger.kernel.org; > iommu@lists.linux-foundation.org > Cc: Linuxarm ; lorenzo.pieral...@arm.com; > j...@8bytes.org; robin.mur...@arm.com; w...@kernel.org; wanghuiqiang > ; steven.pr...@arm.com; > sami.muja...@arm.com; j...@solid-run.com; eric.au...@redhat.com; > laurentiu.tu...@nxp.com; h...@infradead.org > Subject: Re: [PATCH v12 0/9] ACPI/IORT: Support for IORT RMR node > > On 2022/5/4 0:33, Shameer Kolothum wrote: > > Hi > > > > v11 --> v12 > >-Minor fix in patch #4 to address the issue reported by the kernel test > robot. > >-Added R-by tags by Christoph(patch #1) and Lorenzo(patch #4). > >-Added T-by from Steve to all relevant patches. Many thanks!. > > > > Please note, this series has a dependency on the ACPICA header patch > > here[1]. > > Tested on a Kunpeng920 server machine with SMMUv3, the 3408iMR RAID > controller card works as expected, > > Tested-by: Hanjun Guo > > Thanks > Hanjun ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH RFC 11/12] iommufd: vfio container FD ioctl compatibility
On Mon, May 09, 2022 at 11:00:41AM -0300, Jason Gunthorpe wrote: > On Mon, May 09, 2022 at 04:01:52PM +1000, David Gibson wrote: > > > > The default iommu_domain that the iommu driver creates will be used > > > here, it is up to the iommu driver to choose something reasonable for > > > use by applications like DPDK. ie PPC should probably pick its biggest > > > x86-like aperture. > > > > So, using the big aperture means a very high base IOVA > > (1<<59)... which means that it won't work at all if you want to attach > > any devices that aren't capable of 64-bit DMA. > > I'd expect to include the 32 bit window too.. I'm not entirely sure what you mean. Are you working on the assumption that we've extended to allowing multiple apertures, so we'd default to advertising both a small/low aperture and a large/high aperture? > > Using the maximum possible window size would mean we either > > potentially waste a lot of kernel memory on pagetables, or we use > > unnecessarily large number of levels to the pagetable. > > All drivers have this issue to one degree or another. We seem to be > ignoring it - in any case this is a micro optimization, not a > functional need? Ok, fair point. > > More generally, the problem with the interface advertising limitations > > and it being up to userspace to work out if those are ok or not is > > that it's fragile. It's pretty plausible that some future IOMMU model > > will have some new kind of limitation that can't be expressed in the > > query structure we invented now. > > The basic API is very simple - the driver needs to provide ranges of > IOVA and map/unmap - I don't think we have a future problem here we > need to try and guess and solve today. Well.. maybe. My experience of encountering hardware doing weird-arse stuff makes me less sanguine. > Even PPC fits this just fine, the open question for DPDK is more > around optimization, not functional. 
> > > But if userspace requests the capabilities it wants, and the kernel > > acks or nacks that, we can support the new host IOMMU with existing > > software just fine. > > No, this just makes it fragile in the other direction because now > userspace has to know what platform specific things to ask for *or it > doesn't work at all*. This is not a improvement for the DPDK cases. Um.. no. The idea is that userspace requests *what it needs*, not anything platform specific. In the case of DPDK that would be nothing more than the (minimum) aperture size. Nothing platform specific about that. > Kernel decides, using all the kernel knowledge it has and tells the > application what it can do - this is the basic simplified interface. > > > > The iommu-driver-specific struct is the "advanced" interface and > > > allows a user-space IOMMU driver to tightly control the HW with full > > > HW specific knowledge. This is where all the weird stuff that is not > > > general should go. > > > > Right, but forcing anything more complicated than "give me some IOVA > > region" to go through the advanced interface means that qemu (or any > > hypervisor where the guest platform need not identically match the > > host) has to have n^2 complexity to match each guest IOMMU model to > > each host IOMMU model. > > I wouldn't say n^2, but yes, qemu needs to have a userspace driver for > the platform IOMMU, and yes it needs this to reach optimal > behavior. We already know this is a hard requirement for using nesting > as acceleration, I don't see why apertures are so different. For one thing, because we only care about optimal behaviour on the host ~= guest KVM case. That means it's not n^2, just (roughly) one host driver for each matching guest driver. I'm considering the general X on Y case - we don't need to optimize it, but it would be nice for it to work without considering every combination separately. > > Errr.. how do you figure? 
On ppc the ranges and pagesizes are > > definitely negotiable. I'm not really familiar with other models, but > > anything which allows *any* variations in the pagetable structure will > > effectively have at least some negotiable properties. > > As above, if you ask for the wrong thing then you don't get > anything. If DPDK asks for something that works on ARM like 0 -> 4G > then PPC and x86 will always fail. How is this improving anything to > require applications to carefully ask for exactly the right platform > specific ranges? Hm, looks like I didn't sufficiently emphasize that the base address would be optional for userspace to supply. So userspace would request a range *size* only, unless it needs a specific IOVA base address. It only requests the latter if it actually needs it - so failing in that case is correct. (Qemu, with or without an vIOMMU is the obvious case for that, though I could also imagine it for a specialized driver for some broken device which has weird limitations on what IOVA addresses it can generate on the bus). > It isn't like there is some hard coded value we
[PATCH v6 12/12] iommu: Rename iommu-sva-lib.{c,h}
Rename iommu-sva-lib.c[h] to iommu-sva.c[h] as it contains all code for SVA implementation in iommu core. Signed-off-by: Lu Baolu Reviewed-by: Jean-Philippe Brucker --- drivers/iommu/{iommu-sva-lib.h => iommu-sva.h} | 0 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 2 +- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 +- drivers/iommu/intel/iommu.c | 2 +- drivers/iommu/intel/svm.c | 2 +- drivers/iommu/io-pgfault.c | 2 +- drivers/iommu/{iommu-sva-lib.c => iommu-sva.c} | 2 +- drivers/iommu/Makefile | 2 +- 8 files changed, 7 insertions(+), 7 deletions(-) rename drivers/iommu/{iommu-sva-lib.h => iommu-sva.h} (100%) rename drivers/iommu/{iommu-sva-lib.c => iommu-sva.c} (99%) diff --git a/drivers/iommu/iommu-sva-lib.h b/drivers/iommu/iommu-sva.h similarity index 100% rename from drivers/iommu/iommu-sva-lib.h rename to drivers/iommu/iommu-sva.h diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c index 56644e553c42..265b125d7dc4 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c @@ -9,7 +9,7 @@ #include #include "arm-smmu-v3.h" -#include "../../iommu-sva-lib.h" +#include "../../iommu-sva.h" #include "../../io-pgtable-arm.h" struct arm_smmu_mmu_notifier { diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index 6a10fa181827..de3b6fbf8766 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -31,7 +31,7 @@ #include #include "arm-smmu-v3.h" -#include "../../iommu-sva-lib.h" +#include "../../iommu-sva.h" static bool disable_bypass = true; module_param(disable_bypass, bool, 0444); diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index a5728f743c6d..1c2c92b657c7 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -27,7 +27,7 @@ #include #include "../irq_remapping.h" -#include "../iommu-sva-lib.h" 
+#include "../iommu-sva.h" #include "pasid.h" #include "cap_audit.h" diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c index ca83ebc708a8..44331db060e4 100644 --- a/drivers/iommu/intel/svm.c +++ b/drivers/iommu/intel/svm.c @@ -25,7 +25,7 @@ #include "pasid.h" #include "perf.h" -#include "../iommu-sva-lib.h" +#include "../iommu-sva.h" static irqreturn_t prq_event_thread(int irq, void *d); static void intel_svm_drain_prq(struct device *dev, u32 pasid); diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c index 9efe5259402b..2a8604013b7e 100644 --- a/drivers/iommu/io-pgfault.c +++ b/drivers/iommu/io-pgfault.c @@ -11,7 +11,7 @@ #include #include -#include "iommu-sva-lib.h" +#include "iommu-sva.h" /** * struct iopf_queue - IO Page Fault queue diff --git a/drivers/iommu/iommu-sva-lib.c b/drivers/iommu/iommu-sva.c similarity index 99% rename from drivers/iommu/iommu-sva-lib.c rename to drivers/iommu/iommu-sva.c index ea12504a9e12..1791ac1e3d34 100644 --- a/drivers/iommu/iommu-sva-lib.c +++ b/drivers/iommu/iommu-sva.c @@ -7,7 +7,7 @@ #include #include -#include "iommu-sva-lib.h" +#include "iommu-sva.h" static DEFINE_MUTEX(iommu_sva_lock); static DECLARE_IOASID_SET(iommu_sva_pasid); diff --git a/drivers/iommu/Makefile b/drivers/iommu/Makefile index 44475a9b3eea..c1763476162b 100644 --- a/drivers/iommu/Makefile +++ b/drivers/iommu/Makefile @@ -27,6 +27,6 @@ obj-$(CONFIG_FSL_PAMU) += fsl_pamu.o fsl_pamu_domain.o obj-$(CONFIG_S390_IOMMU) += s390-iommu.o obj-$(CONFIG_HYPERV_IOMMU) += hyperv-iommu.o obj-$(CONFIG_VIRTIO_IOMMU) += virtio-iommu.o -obj-$(CONFIG_IOMMU_SVA) += iommu-sva-lib.o io-pgfault.o +obj-$(CONFIG_IOMMU_SVA) += iommu-sva.o io-pgfault.o obj-$(CONFIG_SPRD_IOMMU) += sprd-iommu.o obj-$(CONFIG_APPLE_DART) += apple-dart.o -- 2.25.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v6 11/12] iommu: Per-domain I/O page fault handling
Tweak the I/O page fault handling framework to route the page faults to the domain and call the page fault handler retrieved from the domain. This makes the I/O page fault handling framework possible to serve more usage scenarios as long as they have an IOMMU domain and install a page fault handler in it. Some unused functions are also removed to avoid dead code. Signed-off-by: Lu Baolu Reviewed-by: Jean-Philippe Brucker --- drivers/iommu/iommu-sva-lib.h | 2 -- drivers/iommu/io-pgfault.c| 64 --- drivers/iommu/iommu-sva-lib.c | 20 --- 3 files changed, 7 insertions(+), 79 deletions(-) diff --git a/drivers/iommu/iommu-sva-lib.h b/drivers/iommu/iommu-sva-lib.h index 3420654c6e2f..74ce2e76321b 100644 --- a/drivers/iommu/iommu-sva-lib.h +++ b/drivers/iommu/iommu-sva-lib.h @@ -8,8 +8,6 @@ #include #include -struct mm_struct *iommu_sva_find(ioasid_t pasid); - /* I/O Page fault */ struct device; struct iommu_fault; diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c index aee9e033012f..9efe5259402b 100644 --- a/drivers/iommu/io-pgfault.c +++ b/drivers/iommu/io-pgfault.c @@ -69,69 +69,18 @@ static int iopf_complete_group(struct device *dev, struct iopf_fault *iopf, return iommu_page_response(dev, ); } -static enum iommu_page_response_code -iopf_handle_single(struct iopf_fault *iopf) -{ - vm_fault_t ret; - struct mm_struct *mm; - struct vm_area_struct *vma; - unsigned int access_flags = 0; - unsigned int fault_flags = FAULT_FLAG_REMOTE; - struct iommu_fault_page_request *prm = >fault.prm; - enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID; - - if (!(prm->flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID)) - return status; - - mm = iommu_sva_find(prm->pasid); - if (IS_ERR_OR_NULL(mm)) - return status; - - mmap_read_lock(mm); - - vma = find_extend_vma(mm, prm->addr); - if (!vma) - /* Unmapped area */ - goto out_put_mm; - - if (prm->perm & IOMMU_FAULT_PERM_READ) - access_flags |= VM_READ; - - if (prm->perm & IOMMU_FAULT_PERM_WRITE) { - access_flags |= 
VM_WRITE; - fault_flags |= FAULT_FLAG_WRITE; - } - - if (prm->perm & IOMMU_FAULT_PERM_EXEC) { - access_flags |= VM_EXEC; - fault_flags |= FAULT_FLAG_INSTRUCTION; - } - - if (!(prm->perm & IOMMU_FAULT_PERM_PRIV)) - fault_flags |= FAULT_FLAG_USER; - - if (access_flags & ~vma->vm_flags) - /* Access fault */ - goto out_put_mm; - - ret = handle_mm_fault(vma, prm->addr, fault_flags, NULL); - status = ret & VM_FAULT_ERROR ? IOMMU_PAGE_RESP_INVALID : - IOMMU_PAGE_RESP_SUCCESS; - -out_put_mm: - mmap_read_unlock(mm); - mmput(mm); - - return status; -} - static void iopf_handle_group(struct work_struct *work) { struct iopf_group *group; + struct iommu_domain *domain; struct iopf_fault *iopf, *next; enum iommu_page_response_code status = IOMMU_PAGE_RESP_SUCCESS; group = container_of(work, struct iopf_group, work); + domain = iommu_get_domain_for_iopf(group->dev, + group->last_fault.fault.prm.pasid); + if (!domain || !domain->iopf_handler) + status = IOMMU_PAGE_RESP_INVALID; list_for_each_entry_safe(iopf, next, >faults, list) { /* @@ -139,7 +88,8 @@ static void iopf_handle_group(struct work_struct *work) * faults in the group if there is an error. */ if (status == IOMMU_PAGE_RESP_SUCCESS) - status = iopf_handle_single(iopf); + status = domain->iopf_handler(>fault, + domain->fault_data); if (!(iopf->fault.prm.flags & IOMMU_FAULT_PAGE_REQUEST_LAST_PAGE)) diff --git a/drivers/iommu/iommu-sva-lib.c b/drivers/iommu/iommu-sva-lib.c index 32c836e4a60e..ea12504a9e12 100644 --- a/drivers/iommu/iommu-sva-lib.c +++ b/drivers/iommu/iommu-sva-lib.c @@ -52,26 +52,6 @@ static int iommu_sva_alloc_pasid(struct mm_struct *mm, return ret; } -/* ioasid_find getter() requires a void * argument */ -static bool __mmget_not_zero(void *mm) -{ - return mmget_not_zero(mm); -} - -/** - * iommu_sva_find() - Find mm associated to the given PASID - * @pasid: Process Address Space ID assigned to the mm - * - * On success a reference to the mm is taken, and must be released with mmput(). 
- * - * Returns the mm corresponding to this PASID, or an error if not found. - */ -struct mm_struct *iommu_sva_find(ioasid_t pasid) -{ - return ioasid_find(_sva_pasid, pasid, __mmget_not_zero); -} -EXPORT_SYMBOL_GPL(iommu_sva_find); - /* * I/O page fault handler for
[PATCH v6 10/12] iommu: Prepare IOMMU domain for IOPF
This adds some mechanisms around the iommu_domain so that the I/O page fault handling framework could route a page fault to the domain and call the fault handler from it. Add pointers to the page fault handler and its private data in struct iommu_domain. The fault handler will be called with the private data as a parameter once a page fault is routed to the domain. Any kernel component which owns an iommu domain could install handler and its private parameter so that the page fault could be further routed and handled. A new helper iommu_get_domain_for_iopf() which retrieves attached domain for a {device, PASID} pair is added. It will be used by the page fault handling framework which knows {device, PASID} reported from the iommu driver. We have a guarantee that the SVA domain doesn't go away during IOPF handling, because unbind() waits for pending faults with iopf_queue_flush_dev() before freeing the domain. Hence, there's no need to synchronize life cycle of the iommu domains between the unbind() and the interrupt threads. This also prepares the SVA implementation to be the first consumer of the per-domain page fault handling model. 
Suggested-by: Jean-Philippe Brucker Signed-off-by: Lu Baolu --- include/linux/iommu.h | 12 +++ drivers/iommu/io-pgfault.c| 7 drivers/iommu/iommu-sva-lib.c | 65 +++ drivers/iommu/iommu.c | 27 +++ 4 files changed, 111 insertions(+) diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 392b8adc3495..9405034e3013 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -103,6 +103,9 @@ struct iommu_domain { #ifdef CONFIG_IOMMU_SVA struct mm_struct *mm; #endif + enum iommu_page_response_code (*iopf_handler)(struct iommu_fault *fault, + void *data); + void *fault_data; }; static inline bool iommu_is_dma_domain(struct iommu_domain *domain) @@ -687,6 +690,9 @@ int iommu_attach_device_pasid(struct iommu_domain *domain, struct device *dev, ioasid_t pasid); void iommu_detach_device_pasid(struct iommu_domain *domain, struct device *dev, ioasid_t pasid); +struct iommu_domain * +iommu_get_domain_for_iopf(struct device *dev, ioasid_t pasid); + #else /* CONFIG_IOMMU_API */ struct iommu_ops {}; @@ -1056,6 +1062,12 @@ static inline void iommu_detach_device_pasid(struct iommu_domain *domain, struct device *dev, ioasid_t pasid) { } + +static inline struct iommu_domain * +iommu_get_domain_for_iopf(struct device *dev, ioasid_t pasid) +{ + return NULL; +} #endif /* CONFIG_IOMMU_API */ #ifdef CONFIG_IOMMU_SVA diff --git a/drivers/iommu/io-pgfault.c b/drivers/iommu/io-pgfault.c index 1df8c1dcae77..aee9e033012f 100644 --- a/drivers/iommu/io-pgfault.c +++ b/drivers/iommu/io-pgfault.c @@ -181,6 +181,13 @@ static void iopf_handle_group(struct work_struct *work) * request completes, outstanding faults will have been dealt with by the time * the PASID is freed. * + * Any valid page fault will be eventually routed to an iommu domain and the + * page fault handler installed there will get called. 
The users of this + * handling framework should guarantee that the iommu domain could only be + * freed after the device has stopped generating page faults (or the iommu + * hardware has been set to block the page faults) and the pending page faults + * have been flushed. + * * Return: 0 on success and <0 on error. */ int iommu_queue_iopf(struct iommu_fault *fault, void *cookie) diff --git a/drivers/iommu/iommu-sva-lib.c b/drivers/iommu/iommu-sva-lib.c index ef6ed87d04ba..32c836e4a60e 100644 --- a/drivers/iommu/iommu-sva-lib.c +++ b/drivers/iommu/iommu-sva-lib.c @@ -72,6 +72,69 @@ struct mm_struct *iommu_sva_find(ioasid_t pasid) } EXPORT_SYMBOL_GPL(iommu_sva_find); +/* + * I/O page fault handler for SVA + * + * Copied from io-pgfault.c with mmget_not_zero() added before + * mmap_read_lock(). + */ +static enum iommu_page_response_code +iommu_sva_handle_iopf(struct iommu_fault *fault, void *data) +{ + vm_fault_t ret; + struct mm_struct *mm; + struct vm_area_struct *vma; + unsigned int access_flags = 0; + struct iommu_domain *domain = data; + unsigned int fault_flags = FAULT_FLAG_REMOTE; + struct iommu_fault_page_request *prm = >prm; + enum iommu_page_response_code status = IOMMU_PAGE_RESP_INVALID; + + if (!(prm->flags & IOMMU_FAULT_PAGE_REQUEST_PASID_VALID)) + return status; + + mm = domain->mm; + if (IS_ERR_OR_NULL(mm) || !mmget_not_zero(mm)) + return status; + + mmap_read_lock(mm); + + vma = find_extend_vma(mm, prm->addr); + if (!vma) + /* Unmapped area */ + goto out_put_mm; + + if (prm->perm & IOMMU_FAULT_PERM_READ) + access_flags |=
[PATCH v6 09/12] iommu: Remove SVA related callbacks from iommu ops
These ops'es have been replaced with the dev_attach/detach_pasid domain ops'es. There's no need for them anymore. Remove them to avoid dead code. Signed-off-by: Lu Baolu Reviewed-by: Jean-Philippe Brucker --- include/linux/intel-iommu.h | 4 -- include/linux/iommu.h | 8 --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 17 --- drivers/iommu/iommu-sva-lib.h | 1 - .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 41 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 3 -- drivers/iommu/intel/iommu.c | 3 -- drivers/iommu/intel/svm.c | 49 --- drivers/iommu/iommu-sva-lib.c | 4 +- 9 files changed, 2 insertions(+), 128 deletions(-) diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h index 2397c2007cda..4a5cc796f917 100644 --- a/include/linux/intel-iommu.h +++ b/include/linux/intel-iommu.h @@ -738,10 +738,6 @@ struct intel_iommu *device_to_iommu(struct device *dev, u8 *bus, u8 *devfn); extern void intel_svm_check(struct intel_iommu *iommu); extern int intel_svm_enable_prq(struct intel_iommu *iommu); extern int intel_svm_finish_prq(struct intel_iommu *iommu); -struct iommu_sva *intel_svm_bind(struct device *dev, struct mm_struct *mm, -void *drvdata); -void intel_svm_unbind(struct iommu_sva *handle); -u32 intel_svm_get_pasid(struct iommu_sva *handle); int intel_svm_page_response(struct device *dev, struct iommu_fault_event *evt, struct iommu_page_response *msg); struct iommu_domain *intel_svm_domain_alloc(void); diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 5a3ef4d58b1f..392b8adc3495 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -215,9 +215,6 @@ struct iommu_iotlb_gather { * @dev_has/enable/disable_feat: per device entries to check/enable/disable * iommu specific features. 
* @dev_feat_enabled: check enabled feature - * @sva_bind: Bind process address space to device - * @sva_unbind: Unbind process address space from device - * @sva_get_pasid: Get PASID associated to a SVA handle * @page_response: handle page request response * @def_domain_type: device default domain type, return value: * - IOMMU_DOMAIN_IDENTITY: must use an identity domain @@ -251,11 +248,6 @@ struct iommu_ops { int (*dev_enable_feat)(struct device *dev, enum iommu_dev_features f); int (*dev_disable_feat)(struct device *dev, enum iommu_dev_features f); - struct iommu_sva *(*sva_bind)(struct device *dev, struct mm_struct *mm, - void *drvdata); - void (*sva_unbind)(struct iommu_sva *handle); - u32 (*sva_get_pasid)(struct iommu_sva *handle); - int (*page_response)(struct device *dev, struct iommu_fault_event *evt, struct iommu_page_response *msg); diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index e077f21e2528..15dd4c7e6d3a 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -754,10 +754,6 @@ bool arm_smmu_master_sva_enabled(struct arm_smmu_master *master); int arm_smmu_master_enable_sva(struct arm_smmu_master *master); int arm_smmu_master_disable_sva(struct arm_smmu_master *master); bool arm_smmu_master_iopf_supported(struct arm_smmu_master *master); -struct iommu_sva *arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, - void *drvdata); -void arm_smmu_sva_unbind(struct iommu_sva *handle); -u32 arm_smmu_sva_get_pasid(struct iommu_sva *handle); void arm_smmu_sva_notifier_synchronize(void); struct iommu_domain *arm_smmu_sva_domain_alloc(void); #else /* CONFIG_ARM_SMMU_V3_SVA */ @@ -791,19 +787,6 @@ static inline bool arm_smmu_master_iopf_supported(struct arm_smmu_master *master return false; } -static inline struct iommu_sva * -arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, void *drvdata) -{ - return ERR_PTR(-ENODEV); -} - -static 
inline void arm_smmu_sva_unbind(struct iommu_sva *handle) {} - -static inline u32 arm_smmu_sva_get_pasid(struct iommu_sva *handle) -{ - return IOMMU_PASID_INVALID; -} - static inline void arm_smmu_sva_notifier_synchronize(void) {} static inline struct iommu_domain *arm_smmu_sva_domain_alloc(void) diff --git a/drivers/iommu/iommu-sva-lib.h b/drivers/iommu/iommu-sva-lib.h index 8909ea1094e3..3420654c6e2f 100644 --- a/drivers/iommu/iommu-sva-lib.h +++ b/drivers/iommu/iommu-sva-lib.h @@ -8,7 +8,6 @@ #include #include -int iommu_sva_alloc_pasid(struct mm_struct *mm, ioasid_t min, ioasid_t max); struct mm_struct *iommu_sva_find(ioasid_t pasid); /* I/O Page fault */ diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
[PATCH v6 08/12] iommu/sva: Use attach/detach_pasid_dev in SVA interfaces
The existing iommu SVA interfaces are implemented by calling the SVA specific iommu ops provided by the IOMMU drivers. There's no need for any SVA specific ops in iommu_ops vector anymore as we can achieve this through the generic attach/detach_dev_pasid domain ops. This refactors the IOMMU SVA interfaces implementation by using the attach/detach_pasid_dev ops and align them with the concept of the iommu domain. Put the new SVA code in the sva related file in order to make it self-contained. Signed-off-by: Lu Baolu --- include/linux/iommu.h | 44 ++- drivers/iommu/iommu-sva-lib.c | 145 ++ drivers/iommu/iommu.c | 92 - 3 files changed, 168 insertions(+), 113 deletions(-) diff --git a/include/linux/iommu.h b/include/linux/iommu.h index 2921e634491e..5a3ef4d58b1f 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -684,12 +684,6 @@ int iommu_dev_enable_feature(struct device *dev, enum iommu_dev_features f); int iommu_dev_disable_feature(struct device *dev, enum iommu_dev_features f); bool iommu_dev_feature_enabled(struct device *dev, enum iommu_dev_features f); -struct iommu_sva *iommu_sva_bind_device(struct device *dev, - struct mm_struct *mm, - void *drvdata); -void iommu_sva_unbind_device(struct iommu_sva *handle); -u32 iommu_sva_get_pasid(struct iommu_sva *handle); - int iommu_device_use_default_domain(struct device *dev); void iommu_device_unuse_default_domain(struct device *dev); @@ -1031,21 +1025,6 @@ iommu_dev_disable_feature(struct device *dev, enum iommu_dev_features feat) return -ENODEV; } -static inline struct iommu_sva * -iommu_sva_bind_device(struct device *dev, struct mm_struct *mm, void *drvdata) -{ - return NULL; -} - -static inline void iommu_sva_unbind_device(struct iommu_sva *handle) -{ -} - -static inline u32 iommu_sva_get_pasid(struct iommu_sva *handle) -{ - return IOMMU_PASID_INVALID; -} - static inline struct iommu_fwspec *dev_iommu_fwspec_get(struct device *dev) { return NULL; @@ -1087,6 +1066,29 @@ static inline void 
iommu_detach_device_pasid(struct iommu_domain *domain, } #endif /* CONFIG_IOMMU_API */ +#ifdef CONFIG_IOMMU_SVA +struct iommu_sva *iommu_sva_bind_device(struct device *dev, + struct mm_struct *mm, + void *drvdata); +void iommu_sva_unbind_device(struct iommu_sva *handle); +u32 iommu_sva_get_pasid(struct iommu_sva *handle); +#else /* CONFIG_IOMMU_SVA */ +static inline struct iommu_sva * +iommu_sva_bind_device(struct device *dev, struct mm_struct *mm, void *drvdata) +{ + return NULL; +} + +static inline void iommu_sva_unbind_device(struct iommu_sva *handle) +{ +} + +static inline u32 iommu_sva_get_pasid(struct iommu_sva *handle) +{ + return IOMMU_PASID_INVALID; +} +#endif /* CONFIG_IOMMU_SVA */ + /** * iommu_map_sgtable - Map the given buffer to the IOMMU domain * @domain:The IOMMU domain to perform the mapping diff --git a/drivers/iommu/iommu-sva-lib.c b/drivers/iommu/iommu-sva-lib.c index 106506143896..e7301514f286 100644 --- a/drivers/iommu/iommu-sva-lib.c +++ b/drivers/iommu/iommu-sva-lib.c @@ -3,6 +3,8 @@ * Helpers for IOMMU drivers implementing SVA */ #include +#include +#include #include #include "iommu-sva-lib.h" @@ -69,3 +71,146 @@ struct mm_struct *iommu_sva_find(ioasid_t pasid) return ioasid_find(_sva_pasid, pasid, __mmget_not_zero); } EXPORT_SYMBOL_GPL(iommu_sva_find); + +/* + * IOMMU SVA driver-oriented interfaces + */ +static struct iommu_domain * +iommu_sva_alloc_domain(struct device *dev, struct mm_struct *mm) +{ + struct bus_type *bus = dev->bus; + struct iommu_domain *domain; + + if (!bus || !bus->iommu_ops) + return NULL; + + domain = bus->iommu_ops->domain_alloc(IOMMU_DOMAIN_SVA); + if (!domain) + return NULL; + + mmgrab(mm); + domain->mm = mm; + domain->type = IOMMU_DOMAIN_SVA; + + return domain; +} + +static void iommu_sva_free_domain(struct iommu_domain *domain) +{ + mmdrop(domain->mm); + iommu_domain_free(domain); +} + +/** + * iommu_sva_bind_device() - Bind a process address space to a device + * @dev: the device + * @mm: the mm to bind, 
caller must hold a reference to mm_users + * @drvdata: opaque data pointer to pass to bind callback + * + * Create a bond between device and address space, allowing the device to access + * the mm using the returned PASID. If a bond already exists between @device and + * @mm, it is returned and an additional reference is taken. Caller must call + * iommu_sva_unbind_device() to release each reference. + * + * iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_SVA) must be called first, to + * initialize the required SVA features. + * + * On error, returns an ERR_PTR value. + */ +struct
[PATCH v6 07/12] arm-smmu-v3/sva: Add SVA domain support
Add support for SVA domain allocation and provide an SVA-specific iommu_domain_ops. Signed-off-by: Lu Baolu --- drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h | 6 ++ .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 68 +++ drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 3 + 3 files changed, 77 insertions(+) diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h index cd48590ada30..e077f21e2528 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h @@ -759,6 +759,7 @@ struct iommu_sva *arm_smmu_sva_bind(struct device *dev, struct mm_struct *mm, void arm_smmu_sva_unbind(struct iommu_sva *handle); u32 arm_smmu_sva_get_pasid(struct iommu_sva *handle); void arm_smmu_sva_notifier_synchronize(void); +struct iommu_domain *arm_smmu_sva_domain_alloc(void); #else /* CONFIG_ARM_SMMU_V3_SVA */ static inline bool arm_smmu_sva_supported(struct arm_smmu_device *smmu) { @@ -804,5 +805,10 @@ static inline u32 arm_smmu_sva_get_pasid(struct iommu_sva *handle) } static inline void arm_smmu_sva_notifier_synchronize(void) {} + +static inline struct iommu_domain *arm_smmu_sva_domain_alloc(void) +{ + return NULL; +} #endif /* CONFIG_ARM_SMMU_V3_SVA */ #endif /* _ARM_SMMU_V3_H */ diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c index c623dae1e115..9d176e836d6b 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c @@ -541,3 +541,71 @@ void arm_smmu_sva_notifier_synchronize(void) */ mmu_notifier_synchronize(); } + +static int arm_smmu_sva_attach_dev_pasid(struct iommu_domain *domain, +struct device *dev, ioasid_t id) +{ + int ret = 0; + struct mm_struct *mm; + struct iommu_sva *handle; + + if (domain->type != IOMMU_DOMAIN_SVA) + return -EINVAL; + + mm = domain->mm; + if (WARN_ON(!mm)) + return -ENODEV; + + mutex_lock(_lock); + handle = __arm_smmu_sva_bind(dev, mm); + if 
(IS_ERR(handle)) + ret = PTR_ERR(handle); + mutex_unlock(_lock); + + return ret; +} + +static void arm_smmu_sva_detach_dev_pasid(struct iommu_domain *domain, + struct device *dev, ioasid_t id) +{ + struct mm_struct *mm = domain->mm; + struct arm_smmu_bond *bond = NULL, *t; + struct arm_smmu_master *master = dev_iommu_priv_get(dev); + + mutex_lock(_lock); + list_for_each_entry(t, >bonds, list) { + if (t->mm == mm) { + bond = t; + break; + } + } + + if (!WARN_ON(!bond) && refcount_dec_and_test(>refs)) { + list_del(>list); + arm_smmu_mmu_notifier_put(bond->smmu_mn); + kfree(bond); + } + mutex_unlock(_lock); +} + +static void arm_smmu_sva_domain_free(struct iommu_domain *domain) +{ + kfree(domain); +} + +static const struct iommu_domain_ops arm_smmu_sva_domain_ops = { + .attach_dev_pasid = arm_smmu_sva_attach_dev_pasid, + .detach_dev_pasid = arm_smmu_sva_detach_dev_pasid, + .free = arm_smmu_sva_domain_free, +}; + +struct iommu_domain *arm_smmu_sva_domain_alloc(void) +{ + struct iommu_domain *domain; + + domain = kzalloc(sizeof(*domain), GFP_KERNEL); + if (domain) + domain->ops = _smmu_sva_domain_ops; + + return domain; +} diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c index afc63fce6107..9daf3de7e539 100644 --- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c +++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c @@ -1999,6 +1999,9 @@ static struct iommu_domain *arm_smmu_domain_alloc(unsigned type) { struct arm_smmu_domain *smmu_domain; + if (type == IOMMU_DOMAIN_SVA) + return arm_smmu_sva_domain_alloc(); + if (type != IOMMU_DOMAIN_UNMANAGED && type != IOMMU_DOMAIN_DMA && type != IOMMU_DOMAIN_DMA_FQ && -- 2.25.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v6 05/12] iommu/vt-d: Remove SVM_FLAG_SUPERVISOR_MODE support
The current kernel DMA with PASID support is based on the SVA with a flag SVM_FLAG_SUPERVISOR_MODE. The IOMMU driver binds the kernel memory address space to a PASID of the device. The device driver programs the device with kernel virtual address (KVA) for DMA access. There have been security and functional issues with this approach: - The lack of IOTLB synchronization upon kernel page table updates. (vmalloc, module/BPF loading, CONFIG_DEBUG_PAGEALLOC etc.) - Other than slight more protection, using kernel virtual address (KVA) has little advantage over physical address. There are also no use cases yet where DMA engines need kernel virtual addresses for in-kernel DMA. This removes SVM_FLAG_SUPERVISOR_MODE support in the Intel IOMMU driver. The device driver is suggested to handle kernel DMA with PASID through the kernel DMA APIs. Link: https://lore.kernel.org/linux-iommu/20210511194726.gp1002...@nvidia.com/ Signed-off-by: Jacob Pan Signed-off-by: Lu Baolu --- drivers/iommu/intel/svm.c | 53 +-- 1 file changed, 12 insertions(+), 41 deletions(-) diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c index 7ee37d996e15..574aa33a 100644 --- a/drivers/iommu/intel/svm.c +++ b/drivers/iommu/intel/svm.c @@ -313,8 +313,7 @@ static int pasid_to_svm_sdev(struct device *dev, unsigned int pasid, return 0; } -static int intel_svm_alloc_pasid(struct device *dev, struct mm_struct *mm, -unsigned int flags) +static int intel_svm_alloc_pasid(struct device *dev, struct mm_struct *mm) { ioasid_t max_pasid = dev_is_pci(dev) ? 
pci_max_pasids(to_pci_dev(dev)) : intel_pasid_max_id; @@ -324,8 +323,7 @@ static int intel_svm_alloc_pasid(struct device *dev, struct mm_struct *mm, static struct iommu_sva *intel_svm_bind_mm(struct intel_iommu *iommu, struct device *dev, - struct mm_struct *mm, - unsigned int flags) + struct mm_struct *mm) { struct device_domain_info *info = dev_iommu_priv_get(dev); unsigned long iflags, sflags; @@ -341,22 +339,18 @@ static struct iommu_sva *intel_svm_bind_mm(struct intel_iommu *iommu, svm->pasid = mm->pasid; svm->mm = mm; - svm->flags = flags; INIT_LIST_HEAD_RCU(>devs); - if (!(flags & SVM_FLAG_SUPERVISOR_MODE)) { - svm->notifier.ops = _mmuops; - ret = mmu_notifier_register(>notifier, mm); - if (ret) { - kfree(svm); - return ERR_PTR(ret); - } + svm->notifier.ops = _mmuops; + ret = mmu_notifier_register(>notifier, mm); + if (ret) { + kfree(svm); + return ERR_PTR(ret); } ret = pasid_private_add(svm->pasid, svm); if (ret) { - if (svm->notifier.ops) - mmu_notifier_unregister(>notifier, mm); + mmu_notifier_unregister(>notifier, mm); kfree(svm); return ERR_PTR(ret); } @@ -391,9 +385,7 @@ static struct iommu_sva *intel_svm_bind_mm(struct intel_iommu *iommu, } /* Setup the pasid table: */ - sflags = (flags & SVM_FLAG_SUPERVISOR_MODE) ? - PASID_FLAG_SUPERVISOR_MODE : 0; - sflags |= cpu_feature_enabled(X86_FEATURE_LA57) ? PASID_FLAG_FL5LP : 0; + sflags = cpu_feature_enabled(X86_FEATURE_LA57) ? 
PASID_FLAG_FL5LP : 0; spin_lock_irqsave(>lock, iflags); ret = intel_pasid_setup_first_level(iommu, dev, mm->pgd, mm->pasid, FLPT_DEFAULT_DID, sflags); @@ -410,8 +402,7 @@ static struct iommu_sva *intel_svm_bind_mm(struct intel_iommu *iommu, kfree(sdev); free_svm: if (list_empty(>devs)) { - if (svm->notifier.ops) - mmu_notifier_unregister(>notifier, mm); + mmu_notifier_unregister(>notifier, mm); pasid_private_remove(mm->pasid); kfree(svm); } @@ -821,37 +812,17 @@ static irqreturn_t prq_event_thread(int irq, void *d) struct iommu_sva *intel_svm_bind(struct device *dev, struct mm_struct *mm, void *drvdata) { struct intel_iommu *iommu = device_to_iommu(dev, NULL, NULL); - unsigned int flags = 0; struct iommu_sva *sva; int ret; - if (drvdata) - flags = *(unsigned int *)drvdata; - - if (flags & SVM_FLAG_SUPERVISOR_MODE) { - if (!ecap_srs(iommu->ecap)) { - dev_err(dev, "%s: Supervisor PASID not
[PATCH v6 06/12] iommu/vt-d: Add SVA domain support
Add support for SVA domain allocation and provide an SVA-specific iommu_domain_ops. Signed-off-by: Lu Baolu --- include/linux/intel-iommu.h | 5 drivers/iommu/intel/iommu.c | 2 ++ drivers/iommu/intel/svm.c | 48 + 3 files changed, 55 insertions(+) diff --git a/include/linux/intel-iommu.h b/include/linux/intel-iommu.h index 72e5d7900e71..2397c2007cda 100644 --- a/include/linux/intel-iommu.h +++ b/include/linux/intel-iommu.h @@ -744,6 +744,7 @@ void intel_svm_unbind(struct iommu_sva *handle); u32 intel_svm_get_pasid(struct iommu_sva *handle); int intel_svm_page_response(struct device *dev, struct iommu_fault_event *evt, struct iommu_page_response *msg); +struct iommu_domain *intel_svm_domain_alloc(void); struct intel_svm_dev { struct list_head list; @@ -769,6 +770,10 @@ struct intel_svm { }; #else static inline void intel_svm_check(struct intel_iommu *iommu) {} +static inline struct iommu_domain *intel_svm_domain_alloc(void) +{ + return NULL; +} #endif #ifdef CONFIG_INTEL_IOMMU_DEBUGFS diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c index 99643f897f26..10b1e9dcbd98 100644 --- a/drivers/iommu/intel/iommu.c +++ b/drivers/iommu/intel/iommu.c @@ -4343,6 +4343,8 @@ static struct iommu_domain *intel_iommu_domain_alloc(unsigned type) return domain; case IOMMU_DOMAIN_IDENTITY: return _domain->domain; + case IOMMU_DOMAIN_SVA: + return intel_svm_domain_alloc(); default: return NULL; } diff --git a/drivers/iommu/intel/svm.c b/drivers/iommu/intel/svm.c index 574aa33a..641ab0491ada 100644 --- a/drivers/iommu/intel/svm.c +++ b/drivers/iommu/intel/svm.c @@ -931,3 +931,51 @@ int intel_svm_page_response(struct device *dev, mutex_unlock(_mutex); return ret; } + +static int intel_svm_attach_dev_pasid(struct iommu_domain *domain, + struct device *dev, ioasid_t pasid) +{ + struct device_domain_info *info = dev_iommu_priv_get(dev); + struct intel_iommu *iommu = info->iommu; + struct mm_struct *mm = domain->mm; + struct iommu_sva *sva; + int ret = 0; + + 
mutex_lock(_mutex); + sva = intel_svm_bind_mm(iommu, dev, mm); + if (IS_ERR(sva)) + ret = PTR_ERR(sva); + mutex_unlock(_mutex); + + return ret; +} + +static void intel_svm_detach_dev_pasid(struct iommu_domain *domain, + struct device *dev, ioasid_t pasid) +{ + mutex_lock(_mutex); + intel_svm_unbind_mm(dev, pasid); + mutex_unlock(_mutex); +} + +static void intel_svm_domain_free(struct iommu_domain *domain) +{ + kfree(domain); +} + +static const struct iommu_domain_ops intel_svm_domain_ops = { + .attach_dev_pasid = intel_svm_attach_dev_pasid, + .detach_dev_pasid = intel_svm_detach_dev_pasid, + .free = intel_svm_domain_free, +}; + +struct iommu_domain *intel_svm_domain_alloc(void) +{ + struct iommu_domain *domain; + + domain = kzalloc(sizeof(*domain), GFP_KERNEL); + if (domain) + domain->ops = _svm_domain_ops; + + return domain; +} -- 2.25.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v6 04/12] iommu/sva: Basic data structures for SVA
Use below data structures for SVA implementation in the IOMMU core: - struct iommu_domain (IOMMU_DOMAIN_SVA type) Represent a hardware pagetable that the IOMMU hardware could use for SVA translation. Multiple iommu domains could be bound with an SVA mm and each grabs a mm_count of the mm in order to make sure mm could only be freed after all domains have been unbound. A new mm field is added to struct iommu_domain and a helper is added to retrieve mm from a domain pointer. - struct iommu_sva (existing) Represent a bond relationship between an SVA ioas and an iommu domain. If a bond already exists, it's reused and a reference is taken. - struct dev_iommu::sva_bonds A pasid-indexed xarray to track the bonds happened on the device. Suggested-by: Jean-Philippe Brucker Signed-off-by: Lu Baolu --- include/linux/iommu.h | 13 + drivers/iommu/iommu.c | 3 +++ 2 files changed, 16 insertions(+) diff --git a/include/linux/iommu.h b/include/linux/iommu.h index ab36244d4e94..2921e634491e 100644 --- a/include/linux/iommu.h +++ b/include/linux/iommu.h @@ -64,6 +64,9 @@ struct iommu_domain_geometry { #define __IOMMU_DOMAIN_PT (1U << 2) /* Domain is identity mapped */ #define __IOMMU_DOMAIN_DMA_FQ (1U << 3) /* DMA-API uses flush queue*/ +#define __IOMMU_DOMAIN_SHARED (1U << 4) /* Page table shared from CPU */ +#define __IOMMU_DOMAIN_HOST_VA (1U << 5) /* Host CPU virtual address */ + /* * This are the possible domain-types * @@ -86,6 +89,8 @@ struct iommu_domain_geometry { #define IOMMU_DOMAIN_DMA_FQ(__IOMMU_DOMAIN_PAGING |\ __IOMMU_DOMAIN_DMA_API | \ __IOMMU_DOMAIN_DMA_FQ) +#define IOMMU_DOMAIN_SVA (__IOMMU_DOMAIN_SHARED |\ +__IOMMU_DOMAIN_HOST_VA) struct iommu_domain { unsigned type; @@ -95,6 +100,9 @@ struct iommu_domain { void *handler_token; struct iommu_domain_geometry geometry; struct iommu_dma_cookie *iova_cookie; +#ifdef CONFIG_IOMMU_SVA + struct mm_struct *mm; +#endif }; static inline bool iommu_is_dma_domain(struct iommu_domain *domain) @@ -380,6 +388,9 @@ struct dev_iommu 
{ struct iommu_device *iommu_dev; void*priv; unsigned intpasid_bits; +#ifdef CONFIG_IOMMU_SVA + struct xarray sva_bonds; +#endif }; int iommu_device_register(struct iommu_device *iommu, @@ -629,6 +640,8 @@ struct iommu_fwspec { */ struct iommu_sva { struct device *dev; + struct iommu_domain *domain; + refcount_t users; }; int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode, diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c index 16e8db2d86fc..1abff5fc9554 100644 --- a/drivers/iommu/iommu.c +++ b/drivers/iommu/iommu.c @@ -202,6 +202,9 @@ static struct dev_iommu *dev_iommu_get(struct device *dev) return NULL; mutex_init(>lock); +#ifdef CONFIG_IOMMU_SVA + xa_init(>sva_bonds); +#endif dev->iommu = param; return param; } -- 2.25.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH v6 01/12] dmaengine: idxd: Separate user and kernel pasid enabling
From: Dave Jiang

The idxd driver always gated the pasid enabling under a single knob, but
this assumption is incorrect. The pasid used for kernel operation can be
independently toggled and has no dependency on the user pasid (and vice
versa). Split the two so they are independent "enabled" flags.

Cc: Vinod Koul
Signed-off-by: Dave Jiang
Signed-off-by: Jacob Pan
Link: https://lore.kernel.org/linux-iommu/20220315050713.2000518-10-jacob.jun@linux.intel.com/
Signed-off-by: Lu Baolu
---
 drivers/dma/idxd/idxd.h |  6 ++
 drivers/dma/idxd/cdev.c |  4 ++--
 drivers/dma/idxd/init.c | 30 ++
 3 files changed, 26 insertions(+), 14 deletions(-)

diff --git a/drivers/dma/idxd/idxd.h b/drivers/dma/idxd/idxd.h
index da72eb15f610..ccbefd0be617 100644
--- a/drivers/dma/idxd/idxd.h
+++ b/drivers/dma/idxd/idxd.h
@@ -239,6 +239,7 @@ enum idxd_device_flag {
 	IDXD_FLAG_CONFIGURABLE = 0,
 	IDXD_FLAG_CMD_RUNNING,
 	IDXD_FLAG_PASID_ENABLED,
+	IDXD_FLAG_USER_PASID_ENABLED,
 };

 struct idxd_dma_dev {
@@ -469,6 +470,11 @@ static inline bool device_pasid_enabled(struct idxd_device *idxd)
 	return test_bit(IDXD_FLAG_PASID_ENABLED, &idxd->flags);
 }

+static inline bool device_user_pasid_enabled(struct idxd_device *idxd)
+{
+	return test_bit(IDXD_FLAG_USER_PASID_ENABLED, &idxd->flags);
+}
+
 static inline bool device_swq_supported(struct idxd_device *idxd)
 {
 	return (support_enqcmd && device_pasid_enabled(idxd));
diff --git a/drivers/dma/idxd/cdev.c b/drivers/dma/idxd/cdev.c
index b9b2b4a4124e..7df996deffbe 100644
--- a/drivers/dma/idxd/cdev.c
+++ b/drivers/dma/idxd/cdev.c
@@ -99,7 +99,7 @@ static int idxd_cdev_open(struct inode *inode, struct file *filp)
 	ctx->wq = wq;
 	filp->private_data = ctx;

-	if (device_pasid_enabled(idxd)) {
+	if (device_user_pasid_enabled(idxd)) {
 		sva = iommu_sva_bind_device(dev, current->mm, NULL);
 		if (IS_ERR(sva)) {
 			rc = PTR_ERR(sva);
@@ -152,7 +152,7 @@ static int idxd_cdev_release(struct inode *node, struct file *filep)
 	if (wq_shared(wq)) {
 		idxd_device_drain_pasid(idxd, ctx->pasid);
 	} else {
-		if (device_pasid_enabled(idxd)) {
+		if (device_user_pasid_enabled(idxd)) {
 			/* The wq disable in the disable pasid function will drain the wq */
 			rc = idxd_wq_disable_pasid(wq);
 			if (rc < 0)
diff --git a/drivers/dma/idxd/init.c b/drivers/dma/idxd/init.c
index 993a5dcca24f..e1b5d1e4a949 100644
--- a/drivers/dma/idxd/init.c
+++ b/drivers/dma/idxd/init.c
@@ -513,16 +513,19 @@ static int idxd_probe(struct idxd_device *idxd)

 	if (IS_ENABLED(CONFIG_INTEL_IDXD_SVM) && sva) {
 		rc = iommu_dev_enable_feature(dev, IOMMU_DEV_FEAT_SVA);
-		if (rc == 0) {
-			rc = idxd_enable_system_pasid(idxd);
-			if (rc < 0) {
-				iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_SVA);
-				dev_warn(dev, "Failed to enable PASID. No SVA support: %d\n", rc);
-			} else {
-				set_bit(IDXD_FLAG_PASID_ENABLED, &idxd->flags);
-			}
-		} else {
+		if (rc) {
+			/*
+			 * Do not bail here since legacy DMA is still
+			 * supported, both user and in-kernel DMA with
+			 * PASID rely on SVA feature.
+			 */
 			dev_warn(dev, "Unable to turn on SVA feature.\n");
+		} else {
+			set_bit(IDXD_FLAG_USER_PASID_ENABLED, &idxd->flags);
+			if (idxd_enable_system_pasid(idxd))
+				dev_warn(dev, "No in-kernel DMA with PASID.\n");
+			else
+				set_bit(IDXD_FLAG_PASID_ENABLED, &idxd->flags);
 		}
 	} else if (!sva) {
 		dev_warn(dev, "User forced SVA off via module param.\n");
@@ -561,7 +564,8 @@ static int idxd_probe(struct idxd_device *idxd)
 err:
 	if (device_pasid_enabled(idxd))
 		idxd_disable_system_pasid(idxd);
-	iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_SVA);
+	if (device_user_pasid_enabled(idxd) || device_pasid_enabled(idxd))
+		iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_SVA);
 	return rc;
 }

@@ -574,7 +578,8 @@ static void idxd_cleanup(struct idxd_device *idxd)
 	idxd_cleanup_internals(idxd);
 	if (device_pasid_enabled(idxd))
 		idxd_disable_system_pasid(idxd);
-	iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_SVA);
+	if (device_user_pasid_enabled(idxd) || device_pasid_enabled(idxd))
+		iommu_dev_disable_feature(dev, IOMMU_DEV_FEAT_SVA);
 }

 static int idxd_pci_probe(struct pci_dev
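[Editor's note] The effect of the split can be checked with a small user-space model. This is a sketch only — `struct idxd_model`, `probe_model()`, and the plain bit masks are invented stand-ins for the driver's `struct idxd_device` and `test_bit()`/`set_bit()` — but it captures the gating: the SVA feature enables the user flag, while the kernel flag additionally requires a system pasid, and each failure leaves the other path (and legacy DMA) intact.

```c
#include <assert.h>
#include <stdbool.h>

/* User-space model of the two independent "enabled" flags. */
enum {
	IDXD_FLAG_PASID_ENABLED      = 1 << 0,	/* in-kernel (system) pasid */
	IDXD_FLAG_USER_PASID_ENABLED = 1 << 1,	/* user (cdev) pasid */
};

struct idxd_model {
	unsigned int flags;
};

static bool device_pasid_enabled(const struct idxd_model *idxd)
{
	return idxd->flags & IDXD_FLAG_PASID_ENABLED;
}

static bool device_user_pasid_enabled(const struct idxd_model *idxd)
{
	return idxd->flags & IDXD_FLAG_USER_PASID_ENABLED;
}

/*
 * Mirrors the reworked probe flow: if the SVA feature comes up, the
 * user pasid is usable; the kernel flag is set only if a system pasid
 * can also be enabled.  Failure of either leaves legacy DMA working.
 */
static void probe_model(struct idxd_model *idxd,
			bool sva_feature_ok, bool system_pasid_ok)
{
	if (!sva_feature_ok)
		return;
	idxd->flags |= IDXD_FLAG_USER_PASID_ENABLED;
	if (system_pasid_ok)
		idxd->flags |= IDXD_FLAG_PASID_ENABLED;
}
```

Under the old single-knob code, the first case below (SVA up, system pasid unavailable) would have disabled user pasid as well; with the split it does not.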
[PATCH v6 02/12] iommu: Add pasid_bits field in struct dev_iommu
Use this field to save the pasid/ssid bits that a device is able to
support with its IOMMU hardware. It is a generic attribute of a device
and lifting it into the per-device dev_iommu struct makes it possible
to allocate a PASID for a device without calls into the IOMMU drivers.
Any iommu driver which supports PASID related features should set this
field before those features are enabled on the device.

For the initialization of this field in the VT-d driver, note that
info->pasid_supported is only set for PCI devices, and non-PCI SVA is
not yet supported; setting this field only for PCI devices therefore
causes no functional change.

Signed-off-by: Lu Baolu
Reviewed-by: Jean-Philippe Brucker
---
 include/linux/iommu.h                       | 1 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c | 2 ++
 drivers/iommu/intel/iommu.c                 | 5 -
 3 files changed, 7 insertions(+), 1 deletion(-)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 5e1afe169549..b8ffaf2cb1d0 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -373,6 +373,7 @@ struct dev_iommu {
 	struct iommu_fwspec	*fwspec;
 	struct iommu_device	*iommu_dev;
 	void			*priv;
+	unsigned int		pasid_bits;
 };

 int iommu_device_register(struct iommu_device *iommu,
diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
index 627a3ed5ee8f..afc63fce6107 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c
@@ -2681,6 +2681,8 @@ static struct iommu_device *arm_smmu_probe_device(struct device *dev)
 	    smmu->features & ARM_SMMU_FEAT_STALL_FORCE)
 		master->stall_enabled = true;

+	dev->iommu->pasid_bits = master->ssid_bits;
+
 	return &smmu->iommu;

 err_free_master:
diff --git a/drivers/iommu/intel/iommu.c b/drivers/iommu/intel/iommu.c
index 2990f80c5e08..99643f897f26 100644
--- a/drivers/iommu/intel/iommu.c
+++ b/drivers/iommu/intel/iommu.c
@@ -4624,8 +4624,11 @@ static struct iommu_device *intel_iommu_probe_device(struct device *dev)
 		if (pasid_supported(iommu)) {
 			int features = pci_pasid_features(pdev);

-			if (features >= 0)
+			if (features >= 0) {
 				info->pasid_supported = features | 1;
+				dev->iommu->pasid_bits =
+					fls(pci_max_pasids(pdev)) - 1;
+			}
 		}

 		if (info->ats_supported && ecap_prs(iommu->ecap) &&
--
2.25.1
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
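[Editor's note] The arithmetic in the VT-d hunk is easy to verify: `pci_max_pasids()` yields 2^N for an N-bit PASID capability, so `fls(2^N) - 1` recovers N. A stand-alone sketch, with `fls_model()` standing in for the kernel's `fls()`:

```c
#include <assert.h>

/*
 * fls_model(): position of the most significant set bit, 1-based;
 * returns 0 for x == 0.  A portable stand-in for the kernel's fls().
 */
static int fls_model(unsigned int x)
{
	int r = 0;

	while (x) {
		r++;
		x >>= 1;
	}
	return r;
}

/*
 * pasid_bits as computed in intel_iommu_probe_device():
 * pci_max_pasids() returns 2^N for an N-bit capability, so this
 * evaluates to N.
 */
static unsigned int pasid_bits_from_max(unsigned int max_pasids)
{
	return fls_model(max_pasids) - 1;
}
```

For a device advertising 20 PASID bits, `pci_max_pasids()` returns 1 << 20 and the expression yields 20, as expected.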
[PATCH v6 03/12] iommu: Add attach/detach_dev_pasid domain ops
Attaching an IOMMU domain to a PASID of a device is a generic operation
for modern IOMMU drivers which support PASID-granular DMA address
translation. Currently visible usage scenarios include (but are not
limited to):

 - SVA (Shared Virtual Address)
 - kernel DMA with PASID
 - hardware-assisted mediated device

This adds a pair of common domain ops for this purpose and adds helpers
to attach/detach a domain to/from a {device, PASID}. Some buses, like
PCI, route packets without considering the PASID value. Thus a DMA
target address with PASID might be treated as P2P if the address falls
into the MMIO BAR of other devices in the group. To keep things simple,
these interfaces only apply to devices belonging to singleton groups,
and the singleton is immutable in fabric, i.e. not affected by hotplug.

Signed-off-by: Lu Baolu
Reviewed-by: Jean-Philippe Brucker
---
 include/linux/iommu.h | 21 +
 drivers/iommu/iommu.c | 71 +++
 2 files changed, 92 insertions(+)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index b8ffaf2cb1d0..ab36244d4e94 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -263,6 +263,8 @@ struct iommu_ops {
  * struct iommu_domain_ops - domain specific operations
  * @attach_dev: attach an iommu domain to a device
  * @detach_dev: detach an iommu domain from a device
+ * @attach_dev_pasid: attach an iommu domain to a pasid of device
+ * @detach_dev_pasid: detach an iommu domain from a pasid of device
  * @map: map a physically contiguous memory region to an iommu domain
  * @map_pages: map a physically contiguous set of pages of the same size to
  *             an iommu domain.
@@ -283,6 +285,10 @@ struct iommu_ops {
 struct iommu_domain_ops {
 	int (*attach_dev)(struct iommu_domain *domain, struct device *dev);
 	void (*detach_dev)(struct iommu_domain *domain, struct device *dev);
+	int (*attach_dev_pasid)(struct iommu_domain *domain,
+				struct device *dev, ioasid_t pasid);
+	void (*detach_dev_pasid)(struct iommu_domain *domain,
+				 struct device *dev, ioasid_t pasid);
 	int (*map)(struct iommu_domain *domain, unsigned long iova,
 		   phys_addr_t paddr, size_t size, int prot, gfp_t gfp);
@@ -678,6 +684,10 @@ int iommu_group_claim_dma_owner(struct iommu_group *group, void *owner);
 void iommu_group_release_dma_owner(struct iommu_group *group);
 bool iommu_group_dma_owner_claimed(struct iommu_group *group);

+int iommu_attach_device_pasid(struct iommu_domain *domain,
+			      struct device *dev, ioasid_t pasid);
+void iommu_detach_device_pasid(struct iommu_domain *domain,
+			       struct device *dev, ioasid_t pasid);
 #else /* CONFIG_IOMMU_API */

 struct iommu_ops {};
@@ -1051,6 +1061,17 @@ static inline bool iommu_group_dma_owner_claimed(struct iommu_group *group)
 {
 	return false;
 }
+
+static inline int iommu_attach_device_pasid(struct iommu_domain *domain,
+					    struct device *dev, ioasid_t pasid)
+{
+	return -ENODEV;
+}
+
+static inline void iommu_detach_device_pasid(struct iommu_domain *domain,
+					     struct device *dev, ioasid_t pasid)
+{
+}
 #endif /* CONFIG_IOMMU_API */

 /**
diff --git a/drivers/iommu/iommu.c b/drivers/iommu/iommu.c
index 29906bc16371..16e8db2d86fc 100644
--- a/drivers/iommu/iommu.c
+++ b/drivers/iommu/iommu.c
@@ -38,6 +38,7 @@ struct iommu_group {
 	struct kobject kobj;
 	struct kobject *devices_kobj;
 	struct list_head devices;
+	struct xarray pasid_array;
 	struct mutex mutex;
 	void *iommu_data;
 	void (*iommu_data_release)(void *iommu_data);
@@ -630,6 +631,7 @@ struct iommu_group *iommu_group_alloc(void)
 	mutex_init(&group->mutex);
 	INIT_LIST_HEAD(&group->devices);
 	INIT_LIST_HEAD(&group->entry);
+	xa_init(&group->pasid_array);

 	ret = ida_simple_get(&iommu_group_ida, 0, 0, GFP_KERNEL);
 	if (ret < 0) {
@@ -3190,3 +3192,72 @@ bool iommu_group_dma_owner_claimed(struct iommu_group *group)
 	return user;
 }
 EXPORT_SYMBOL_GPL(iommu_group_dma_owner_claimed);
+
+static bool device_group_immutable_singleton(struct device *dev)
+{
+	struct iommu_group *group = iommu_group_get(dev);
+	int count;
+
+	if (!group)
+		return false;
+
+	mutex_lock(&group->mutex);
+	count = iommu_group_device_count(group);
+	mutex_unlock(&group->mutex);
+	iommu_group_put(group);
+
+	if (count != 1)
+		return false;
+
+	/*
+	 * The PCI device could be considered to be fully isolated if all
+	 * devices on the path from the device to the host-PCI bridge are
+	 * protected from peer-to-peer DMA by ACS.
+	 */
+	if (dev_is_pci(dev))
+		return pci_acs_path_enabled(to_pci_dev(dev), NULL,
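[Editor's note] The per-group `pasid_array` gives each {group, pasid} at most one attached domain. A minimal user-space model of that bookkeeping is below — a fixed array stands in for the xarray, `attach_dev_pasid_model()`/`detach_dev_pasid_model()` are invented names, and returning -EBUSY on a conflicting attach is this sketch's assumption about the helper semantics, not quoted from the patch.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_MAX_PASID 64

struct domain_model {
	int id;
};

/* One slot per pasid; stands in for the group's xarray. */
static struct domain_model *pasid_array_model[MODEL_MAX_PASID];

/*
 * Attach a domain to a pasid.  At most one domain may be attached to a
 * given pasid at a time; a second attach fails until the first domain
 * is detached (error codes here are illustrative).
 */
static int attach_dev_pasid_model(struct domain_model *domain,
				  unsigned int pasid)
{
	if (pasid >= MODEL_MAX_PASID)
		return -EINVAL;
	if (pasid_array_model[pasid])
		return -EBUSY;
	pasid_array_model[pasid] = domain;
	return 0;
}

static void detach_dev_pasid_model(unsigned int pasid)
{
	if (pasid < MODEL_MAX_PASID)
		pasid_array_model[pasid] = NULL;
}
```

The singleton-group restriction in the patch exists precisely so this per-pasid mapping is unambiguous: with one device per group, a pasid identifies exactly one translation context.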
[PATCH v6 00/12] iommu: SVA and IOPF refactoring
Hi folks,

The former part of this series refactors the IOMMU SVA code by assigning
an SVA type of iommu_domain to a shared virtual address and replacing
the sva_bind/unbind iommu ops with attach/detach_dev_pasid domain ops.

The latter part changes the existing I/O page fault handling framework
from only serving SVA to a generic one. Any driver or component could
handle the I/O page faults for its domain in its own way by installing
an I/O page fault handler.

This series has been functionally tested on an x86 machine and compile
tested for other architectures. It is also available on github:
[2] https://github.com/LuBaolu/intel-iommu/commits/iommu-sva-refactoring-v6

Please review and suggest.

Best regards,
baolu

Change log:

v6:
 - Refine the SVA basic data structures.
   Link: https://lore.kernel.org/linux-iommu/YnFv0ps0Ad8v+7uH@myrica/
 - Refine arm smmuv3 sva domain allocation.
 - Fix a possible lock issue.
   Link: https://lore.kernel.org/linux-iommu/YnFydE8j8l7Q4m+b@myrica/

v5:
 - https://lore.kernel.org/linux-iommu/20220502014842.991097-1-baolu...@linux.intel.com/
 - Address review comments from Jean-Philippe Brucker. Much appreciated!
 - Remove the redundant pci aliases check in
   device_group_immutable_singleton().
 - Treat all buses except PCI as static in the immutable singleton
   check.
 - As sva_bind/unbind() already guarantee that the sva domain is freed
   only after iopf_queue_flush_dev(), remove the unnecessary domain
   refcount.
 - Move domain get() out of the list iteration in iopf_handle_group().

v4:
 - https://lore.kernel.org/linux-iommu/20220421052121.3464100-1-baolu...@linux.intel.com/
 - Solve the overlap with another series and make this series
   self-contained.
 - No objection to the abstraction of the data structures during the v3
   review. Hence remove the RFC subject prefix.
 - Refine the immutable singleton group code according to Kevin's
   comments.

v3:
 - https://lore.kernel.org/linux-iommu/20220410102443.294128-1-baolu...@linux.intel.com/
 - Rework iommu_group_singleton_lockdown() by adding a flag to the group
   that positively indicates the group can never have more than one
   member, even after hot plug.
 - Abstract the data structs used for iommu sva in separate patches to
   make them easier to review.
 - I still keep the RFC prefix on this series as the above two
   significant changes need at least another round of review to be
   finalized.
 - Several misc refinements.

v2:
 - https://lore.kernel.org/linux-iommu/20220329053800.3049561-1-baolu...@linux.intel.com/
 - Add sva domain life cycle management to avoid races between unbind
   and page fault handling.
 - Use a single domain for each mm.
 - Return a single sva handle for the same binding.
 - Add a new helper to meet the singleton group requirement.
 - Rework the SVA domain allocation for the arm smmu v3 driver and move
   the pasid_bits initialization to device probe.
 - Drop the patch "iommu: Handle IO page faults directly".
 - Add mmget_not_zero(mm) in the SVA page fault handler.

v1:
 - https://lore.kernel.org/linux-iommu/20220320064030.2936936-1-baolu...@linux.intel.com/
 - Initial post.
Dave Jiang (1):
  dmaengine: idxd: Separate user and kernel pasid enabling

Lu Baolu (11):
  iommu: Add pasid_bits field in struct dev_iommu
  iommu: Add attach/detach_dev_pasid domain ops
  iommu/sva: Basic data structures for SVA
  iommu/vt-d: Remove SVM_FLAG_SUPERVISOR_MODE support
  iommu/vt-d: Add SVA domain support
  arm-smmu-v3/sva: Add SVA domain support
  iommu/sva: Use attach/detach_pasid_dev in SVA interfaces
  iommu: Remove SVA related callbacks from iommu ops
  iommu: Prepare IOMMU domain for IOPF
  iommu: Per-domain I/O page fault handling
  iommu: Rename iommu-sva-lib.{c,h}

 include/linux/intel-iommu.h                   |   9 +-
 include/linux/iommu.h                         |  99 +--
 drivers/dma/idxd/idxd.h                       |   6 +
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.h   |  19 +-
 .../iommu/{iommu-sva-lib.h => iommu-sva.h}    |   3 -
 drivers/dma/idxd/cdev.c                       |   4 +-
 drivers/dma/idxd/init.c                       |  30 +-
 .../iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c   | 111 +---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3.c   |  10 +-
 drivers/iommu/intel/iommu.c                   |  12 +-
 drivers/iommu/intel/svm.c                     | 146 --
 drivers/iommu/io-pgfault.c                    |  73 +
 drivers/iommu/iommu-sva-lib.c                 |  71 -
 drivers/iommu/iommu-sva.c                     | 261 ++
 drivers/iommu/iommu.c                         | 193 +++--
 drivers/iommu/Makefile                        |   2 +-
 16 files changed, 623 insertions(+), 426 deletions(-)
 rename drivers/iommu/{iommu-sva-lib.h => iommu-sva.h} (91%)
 delete mode 100644 drivers/iommu/iommu-sva-lib.c
 create mode 100644 drivers/iommu/iommu-sva.c

--
2.25.1