Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
On 2022/6/1 02:51, Jason Gunthorpe wrote: Oh, I've spent the last couple of weeks hacking up horrible things manipulating entries in init_mm, and never realised that that was actually the special case. Oh well, live and learn. The init_mm is sort of different, it doesn't have zap in quite the same way, for example. I was talking about the typical process mm. Anyhow, the right solution is to use RCU as I described before, Baolu do you want to try? Yes, of course. Your discussion with Robin gave me a lot of inspiration. Very appreciated! I want to use a separate patch to solve this debugfs problem, because it has exceeded the original intention of this series. Best regards, baolu ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
On 2022/5/31 23:59, Jason Gunthorpe wrote: On Tue, May 31, 2022 at 02:52:28PM +0100, Robin Murphy wrote: + break; + pgtable_walk_level(m, phys_to_virt(phys_addr), Also, obligatory reminder that pfn_valid() only means that pfn_to_page() gets you a valid struct page. Whether that page is direct-mapped kernel memory or not is a different matter. Even though this is debugfs, if the operation is sketchy like that and can theoretically crash the kernel the driver should test capabilities, CAP_SYS_RAWIO or something may be appropriate. I don't think we have a better cap for 'userspace may crash the kernel' Yes. We should test both CAP_SYS_ADMIN and CAP_SYS_RAWIO. Best regards, baolu
Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
Hi Robin, Thank you for the comments. On 2022/5/31 21:52, Robin Murphy wrote: On 2022-05-31 04:02, Baolu Lu wrote: On 2022/5/30 20:14, Jason Gunthorpe wrote: On Sun, May 29, 2022 at 01:14:46PM +0800, Baolu Lu wrote: [--snip--] diff --git a/drivers/iommu/intel/debugfs.c b/drivers/iommu/intel/debugfs.c index d927ef10641b..e6f4835b8d9f 100644 --- a/drivers/iommu/intel/debugfs.c +++ b/drivers/iommu/intel/debugfs.c @@ -333,25 +333,28 @@ static void pgtable_walk_level(struct seq_file *m, struct dma_pte *pde, continue; path[level] = pde->val; - if (dma_pte_superpage(pde) || level == 1) + if (dma_pte_superpage(pde) || level == 1) { dump_page_info(m, start, path); - else - pgtable_walk_level(m, phys_to_virt(dma_pte_addr(pde)), + } else { + unsigned long phys_addr; + + phys_addr = (unsigned long)dma_pte_addr(pde); + if (!pfn_valid(__phys_to_pfn(phys_addr))) Given that pte_present(pde) passed just above, it was almost certainly a valid entry, so it seems unlikely that the physical address it pointed to could have disappeared in the meantime. If you're worried about the potential case where we've been preempted during this walk for long enough that the page has already been freed by an unmap, reallocated, Yes. This is exactly what I am worried about and what this patch wants to solve. and filled with someone else's data that happens to look like valid PTEs, this still isn't enough, since that data could just as well happen to look like valid physical addresses too. I imagine that if you want to safely walk pagetables concurrently with them potentially being freed, you'd probably need to get RCU involved. I don't want to make the map/unmap interface more complex or inefficient because of a debugfs feature. I hope that the debugfs and map/unmap interfaces are orthogonal, just like the IOMMU hardware traversing the page tables, as long as the accessed physical address is valid and accessible. Otherwise, stop the traversal immediately. 
If we can't achieve this, I'd rather stop supporting this debugfs node. + break; + pgtable_walk_level(m, phys_to_virt(phys_addr), Also, obligatory reminder that pfn_valid() only means that pfn_to_page() gets you a valid struct page. Whether that page is direct-mapped kernel memory or not is a different matter. Perhaps I can check this from the page flags? level - 1, start, path); + } path[level] = 0; } } -static int show_device_domain_translation(struct device *dev, void *data) +static int __show_device_domain_translation(struct device *dev, void *data) { struct device_domain_info *info = dev_iommu_priv_get(dev); struct dmar_domain *domain = info->domain; struct seq_file *m = data; u64 path[6] = { 0 }; - if (!domain) - return 0; - seq_printf(m, "Device %s @0x%llx\n", dev_name(dev), (u64)virt_to_phys(domain->pgd)); seq_puts(m, "IOVA_PFN\t\tPML5E\t\t\tPML4E\t\t\tPDPE\t\t\tPDE\t\t\tPTE\n"); @@ -359,20 +362,27 @@ static int show_device_domain_translation(struct device *dev, void *data) pgtable_walk_level(m, domain->pgd, domain->agaw + 2, 0, path); seq_putc(m, '\n'); - return 0; + return 1; } -static int domain_translation_struct_show(struct seq_file *m, void *unused) +static int show_device_domain_translation(struct device *dev, void *data) { - unsigned long flags; - int ret; + struct iommu_group *group; - spin_lock_irqsave(&device_domain_lock, flags); - ret = bus_for_each_dev(&pci_bus_type, NULL, m, - show_device_domain_translation); - spin_unlock_irqrestore(&device_domain_lock, flags); + group = iommu_group_get(dev); + if (group) { + iommu_group_for_each_dev(group, data, + __show_device_domain_translation); Why group_for_each_dev? This will hold the group mutex when the callback is invoked. With the group mutex held, the domain could not get changed. If there *are* multiple devices in the group then by definition they should be attached to the same domain, so dumping that domain's mappings more than once seems pointless. 
Especially given that the outer bus_for_each_dev iteration will already visit each individual device anyway, so this would only make the redundancy even worse than it already is. __show_device_domain_translation() only dumps mappings once as it always returns 1. Best regards, baolu
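The bail-out control flow Baolu is aiming for can be sketched in plain userspace C. This is a toy model only (invented structures, fixed fan-out), and per Robin's caveat a validity check like pfn_valid() still cannot prove the memory is an intact page table — it only keeps the walker from recursing into obviously-gone pages:

```c
#include <assert.h>
#include <stddef.h>

/* Toy model of the debugfs walk: each table has 8 slots; a slot either
 * points at a child table or is empty.  is_valid() stands in for the
 * pfn_valid() guard in the patch: if a child pointer no longer refers
 * to a table we know about, the walk bails out of that subtree instead
 * of dereferencing freed memory.  All names here are invented. */
#define SLOTS 8

struct toy_table {
	struct toy_table *child[SLOTS]; /* NULL = not present */
};

static int is_valid(const struct toy_table *valid[], size_t nvalid,
		    const struct toy_table *t)
{
	for (size_t i = 0; i < nvalid; i++)
		if (valid[i] == t)
			return 1;
	return 0;
}

/* Returns the number of present entries visited; stops descending as
 * soon as a child pointer fails the validity check. */
static int walk(const struct toy_table *t, int level,
		const struct toy_table *valid[], size_t nvalid)
{
	int seen = 0;

	for (int i = 0; i < SLOTS; i++) {
		const struct toy_table *c = t->child[i];

		if (!c)
			continue;
		seen++;
		if (level > 1) {
			if (!is_valid(valid, nvalid, c))
				break;	/* mirrors "if (!pfn_valid(...)) break" */
			seen += walk(c, level - 1, valid, nvalid);
		}
	}
	return seen;
}

static int demo(void)
{
	static struct toy_table root, a, freed, dummy;
	const struct toy_table *valid[] = { &root, &a };

	a.child[0] = &dummy;	/* three leaf entries under the intact table */
	a.child[1] = &dummy;
	a.child[2] = &dummy;
	root.child[0] = &a;	/* intact subtree: recursed into */
	root.child[1] = &freed;	/* "freed" table: walk stops here */
	return walk(&root, 2, valid, 2);
}
```

Here walk() visits the intact subtree (1 entry plus its 3 leaves) and counts the stale entry before bailing, so demo() sees 5 entries total rather than crashing into the freed table.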
Re: [PATCH] dma-debug: Make things less spammy under memory pressure
On Tue, May 31, 2022 at 03:19:45PM -0700, Rob Clark wrote: > um, quite.. tbf that was in the context of a WIP igt test for > shrinker which was trying to cycle thru ~2x RAM size worth of GEM > buffers on something like 72 threads. So it could just be threads > that had gotten past the dma_debug_disabled() check already before > global_disable was set to true? > > I guess this could be pr_err_once() instead, then? Yes, we could use pr_err_once to reduce the chattiness while still keeping global_disable to disable all the actual tracking.
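The "once" semantics being agreed on here can be modeled in userspace C. The names below are stand-ins, and the kernel's real pr_err_once() uses an atomic test-and-set (DO_ONCE_LITE) rather than this single-threaded flag, but the effect is the same: however many callers race past the dma_debug_disabled() check, the message fires once:

```c
#include <assert.h>
#include <stdio.h>

static int messages_emitted;

/* Sketch of the pr_err_once() idea: a static guard keeps the message
 * to a single emission no matter how many callers arrive after racing
 * past the dma_debug_disabled() check. */
static void report_enomem_once(void)
{
	static int printed;	/* kernel variant uses an atomic flag */

	if (!printed) {
		printed = 1;
		messages_emitted++;
		fprintf(stderr,
			"cacheline tracking ENOMEM, dma-debug disabled\n");
	}
}

/* Returns how many messages 'callers' invocations actually emitted. */
static int demo(int callers)
{
	messages_emitted = 0;
	for (int i = 0; i < callers; i++)
		report_enomem_once();
	return messages_emitted;
}
```

With 72 callers (the thread-count from Rob's igt test) only the first emission gets through, and a later burst emits nothing at all — unlike pr_err_ratelimited(), which would still let occasional repeats through.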
RE: [PATCH v4 1/6] iommu: Add a per domain PASID for DMA API
> From: Jacob Pan > Sent: Wednesday, June 1, 2022 4:44 AM > > Hi Jason, > > On Tue, 31 May 2022 16:05:50 -0300, Jason Gunthorpe > wrote: > > > On Tue, May 31, 2022 at 10:29:55AM -0700, Jacob Pan wrote: > > > > > The reason why I store PASID at IOMMU domain is for IOTLB flush within > > > the domain. Device driver is not aware of domain level IOTLB flush. We > > > also have iova_cookie for each domain which essentially is for > > > RIDPASID. > > > > You need to make the PASID stuff work generically. > > > > The domain needs to hold a list of all the places it needs to flush > > and that list needs to be maintained during attach/detach. > > > > A single PASID on the domain is obviously never going to work > > generically. > > > I agree, I did it this way really meant to be part of iommu_domain's > iommu_dma_cookie, not meant to be global. But for the lack of common > storage between identity domain and dma domain, I put it here as global. > > Then should we also extract RIDPASID to become part of the generic API? > i.e. set pasid, flush IOTLB etc. Right? RIDPASID is not in group's > pasid_array today. > RIDPASID is just an alias to RID in the PASID table (similar to pasid#0 on other platforms). It's reserved and not exposed outside the intel-iommu driver. So I don't think it should be managed via the generic iommu API. But probably you can generalize it with other pasids within intel-iommu driver if doing so can simplify the logic there.
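A minimal model of Jason's "list of places to flush" suggestion, with all structure and function names invented for illustration: attach (by RID or RID+PASID) appends an entry to a per-domain list, detach would remove it, and a domain-wide IOTLB flush walks the list — including the case of one device attached twice, which the thread calls out for the DMA API:

```c
#include <assert.h>
#include <stddef.h>

#define MAX_ATTACH 8
#define NO_PASID   (-1)

struct attach_data {
	int dev_id;
	int pasid;	/* NO_PASID for a plain RID attachment */
};

/* Hypothetical domain: holds every attachment it must invalidate,
 * maintained at attach/detach time, instead of a single PASID field. */
struct toy_domain {
	struct attach_data attachments[MAX_ATTACH];
	int nr_attached;
};

static void toy_attach(struct toy_domain *d, int dev_id, int pasid)
{
	d->attachments[d->nr_attached++] =
		(struct attach_data){ dev_id, pasid };
}

/* Returns how many per-attachment invalidations a full-domain flush
 * would issue: one RID(+PASID)-scoped invalidation per list entry. */
static int toy_flush_domain(const struct toy_domain *d)
{
	int flushes = 0;

	for (int i = 0; i < d->nr_attached; i++)
		flushes++;
	return flushes;
}

static int demo(void)
{
	struct toy_domain d = { 0 };

	/* same device attached twice: once by RID, once by RID+PASID,
	 * the DMA API case discussed in the thread */
	toy_attach(&d, /*dev_id=*/1, NO_PASID);
	toy_attach(&d, /*dev_id=*/1, /*pasid=*/42);
	return toy_flush_domain(&d);
}
```

The point of the shape is that a domain flush needs no knowledge of which single PASID is "the" domain PASID — it just iterates whatever was recorded at attach time.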
RE: [PATCH v4 1/6] iommu: Add a per domain PASID for DMA API
> From: Jacob Pan > Sent: Wednesday, June 1, 2022 1:30 AM > > > > > > In both cases the pasid is stored in the attach data instead of the > > > domain. > > > > So during IOTLB flush for the domain, do we loop through the attach data? Yes and it's required. > > > > DMA API pasid is no special from above except it needs to allow > > > one device attached to the same domain twice (one with RID > > > and the other with RID+PASID). > > > > > > for iommufd those operations are initiated by userspace via > > > iommufd uAPI. > > > > My understanding is that device driver owns its PASID policy. If ENQCMD > > is supported on the device, the PASIDs should be allocated through > > ioasid_alloc(). Otherwise, the whole PASID pool is managed by the device > > driver. > > > It seems the changes we want for this patchset are: > 1. move ioasid_alloc() from the core to device (allocation scope will be > based on whether ENQCMD is intended or not) yes, and the driver can specify whether the allocation is system-wide or per-device. > 2. store pasid in the attach data > 3. use the same iommufd api to attach/set pasid on its default domain s/iommufd/iommu/
Re: [PATCH V3 5/8] dt-bindings: Add xen,grant-dma IOMMU description for xen-grant DMA ops
On Tue, 31 May 2022, Oleksandr Tyshchenko wrote: > From: Oleksandr Tyshchenko > > The main purpose of this binding is to communicate Xen specific > information using generic IOMMU device tree bindings (which is > a good fit here) rather than introducing a custom property. > > Introduce Xen specific IOMMU for the virtualized device (e.g. virtio) > to be used by Xen grant DMA-mapping layer in the subsequent commit. > > The reference to Xen specific IOMMU node using "iommus" property > indicates that Xen grant mappings need to be enabled for the device, > and it specifies the ID of the domain where the corresponding backend > resides. The domid (domain ID) is used as an argument to the Xen grant > mapping APIs. > > This is needed for the option to restrict memory access using Xen grant > mappings to work which primary goal is to enable using virtio devices > in Xen guests. > > Signed-off-by: Oleksandr Tyshchenko > --- > Changes RFC -> V1: >- update commit subject/description and text in description >- move to devicetree/bindings/arm/ > > Changes V1 -> V2: >- update text in description >- change the maintainer of the binding >- fix validation issue >- reference xen,dev-domid.yaml schema from virtio/mmio.yaml > > Change V2 -> V3: >- Stefano already gave his Reviewed-by, I dropped it due to the changes > (significant) >- use generic IOMMU device tree bindings instead of custom property > "xen,dev-domid" >- change commit subject and description, was > "dt-bindings: Add xen,dev-domid property description for xen-grant DMA > ops" > --- > .../devicetree/bindings/iommu/xen,grant-dma.yaml | 49 > ++ > 1 file changed, 49 insertions(+) > create mode 100644 Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml > > diff --git a/Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml > b/Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml > new file mode 100644 > index ..ab5765c > --- /dev/null > +++ b/Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml > @@ 
-0,0 +1,49 @@ > +# SPDX-License-Identifier: (GPL-2.0-only or BSD-2-Clause) > +%YAML 1.2 > +--- > +$id: http://devicetree.org/schemas/iommu/xen,grant-dma.yaml# > +$schema: http://devicetree.org/meta-schemas/core.yaml# > + > +title: Xen specific IOMMU for virtualized devices (e.g. virtio) > + > +maintainers: > + - Stefano Stabellini > + > +description: > + The reference to Xen specific IOMMU node using "iommus" property indicates > + that Xen grant mappings need to be enabled for the device, and it specifies > + the ID of the domain where the corresponding backend resides. > + The binding is required to restrict memory access using Xen grant mappings. I think this is OK and in line with the discussion we had on the list. I propose the following wording instead: """ The Xen IOMMU represents the Xen grant table interface. Grant mappings are to be used with devices connected to the Xen IOMMU using the "iommus" property, which also specifies the ID of the backend domain. The binding is required to restrict memory access using Xen grant mappings. """ > +properties: > + compatible: > +const: xen,grant-dma > + > + '#iommu-cells': > +const: 1 > +description: > + Xen specific IOMMU is multiple-master IOMMU device. > + The single cell describes the domid (domain ID) of the domain where > + the backend is running. Here I would say: """ The single cell is the domid (domain ID) of the domain where the backend is running. 
""" With the two wording improvements: Reviewed-by: Stefano Stabellini > +required: > + - compatible > + - "#iommu-cells" > + > +additionalProperties: false > + > +examples: > + - | > +xen_iommu { > +compatible = "xen,grant-dma"; > +#iommu-cells = <1>; > +}; > + > +virtio@3000 { > +compatible = "virtio,mmio"; > +reg = <0x3000 0x100>; > +interrupts = <41>; > + > +/* The backend is located in Xen domain with ID 1 */ > +iommus = <&xen_iommu 1>; > +}; > -- > 2.7.4 > > > ___ > linux-arm-kernel mailing list > linux-arm-ker...@lists.infradead.org > http://lists.infradead.org/mailman/listinfo/linux-arm-kernel > ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
On Tue, May 31, 2022 at 10:22:32PM +0100, Robin Murphy wrote: > There are only 3 instances where we'll free a table while the domain is > live. The first is the one legitimate race condition, where two map requests > targeting relatively nearby PTEs both go to fill in an intermediate level of > table; whoever loses that race frees the table they allocated, but it was > never visible to anyone else so that's definitely fine. The second is if > we're mapping a block entry, and find that there's already a table entry > there, wherein we assume the table must be empty, clear the entry, > invalidate any walk caches, install the block entry, then free the orphaned > table; since we're mapping the entire IOVA range covered by that table, > there should be no other operations on that IOVA range attempting to walk > the table at the same time, so it's fine. I saw these two in the Intel driver > The third is effectively the inverse, if we get a block-sized unmap > but find a table entry rather than a block at that point (on the > assumption that it's de-facto allowed for a single unmap to cover > multiple adjacent mappings as long as it does so exactly); similarly > we assume that the table must be full, and no other operations > should be racing because we're unmapping its whole IOVA range, so we > remove the table entry, invalidate, and free as before. Not sure I noticed this one though. This all makes sense though. > Although we don't have debug dumping for io-pgtable-arm, it's good to be > thinking about this, since it's made me realise that dirty-tracking sweeps > per that proposal might pose a similar kind of concern, so we might still > need to harden these corners for the sake of that. Let's make sure Joao sees this.. It is interesting because we probably don't want the big latency spikes that would come from using locking to block map/unmap while dirty reading is happening - eg at the iommufd level. 
From a consistency POV the only case that matters is unmap and unmap should already be doing a dedicated dirty read directly prior to the unmap (as per that other long thread) So having safe racy reading in the kernel is probably best, and so RCU would be good here too. > that somewhere I have some half-finished patches making io-pgtable-arm use > the iommu_iotlb_gather freelist, so maybe I'll tackle both concerns at once > (perhaps we might even be able to RCU-ify the freelist generically? I'll see > how it goes when I get there). This is all very similar to how the mm's tlb gather stuff works, especially on PPC which requires RCU protection of page tables instead of the IPI trick. Currently the rcu_head and the lru overlap in the struct page. To do this I'd suggest following what was done for slab - see ca1a46d6f506 ("Merge tag 'slab-for-5.17'..) and create a 'struct page' sub-class like 'struct slab' for use with iommu page tables and put the list_head and rcu_head sequentially in the struct iommu_pt_page. Continue to use put_pages_list() just with the list_head in the new struct not the lru. If we need it for dirty tracking then it makes sense to start progressing parts at least.. Jason
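Jason's layout point can be illustrated with userspace stand-ins for the kernel types: in struct page the lru list_head and the rcu_head share storage (they overlap in a union), so a page cannot sit on a freelist and be pending RCU reclaim at once, whereas a dedicated sub-class can carry both side by side. The struct definitions below are simplified models, and iommu_pt_page is the hypothetical name from the mail:

```c
#include <assert.h>
#include <stddef.h>

/* Userspace stand-ins for the kernel's two-pointer list_head and
 * rcu_head. */
struct list_head { struct list_head *next, *prev; };
struct rcu_head  { struct rcu_head *next; void (*func)(struct rcu_head *); };

/* today: overlapping storage, as in struct page's union, so the two
 * uses are mutually exclusive */
union page_overlap {
	struct list_head lru;
	struct rcu_head  rcu;
};

/* suggested: a struct-slab-like sub-class where both live side by
 * side, so a table page can be on a put_pages_list() freelist while
 * also awaiting an RCU grace period */
struct iommu_pt_page {			/* hypothetical name from the mail */
	struct list_head free_list;
	struct rcu_head  rcu;
};

static size_t overlap_size(void)    { return sizeof(union page_overlap); }
static size_t sidebyside_size(void) { return sizeof(struct iommu_pt_page); }
```

The cost of the suggestion is visible in the sizes: the union packs both roles into one slot, while the sequential layout pays for both fields — which is exactly why it wants a dedicated sub-class rather than growing struct page itself.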
Re: [PATCH] dma-debug: Make things less spammy under memory pressure
On Tue, May 31, 2022 at 3:00 PM Robin Murphy wrote: > > On 2022-05-31 22:51, Rob Clark wrote: > > From: Rob Clark > > > > Ratelimit the error msg to avoid flooding the console. > > > > Signed-off-by: Rob Clark > > --- > > kernel/dma/debug.c | 2 +- > > 1 file changed, 1 insertion(+), 1 deletion(-) > > > > diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c > > index f8ff598596b8..683966f0247b 100644 > > --- a/kernel/dma/debug.c > > +++ b/kernel/dma/debug.c > > @@ -564,7 +564,7 @@ static void add_dma_entry(struct dma_debug_entry > > *entry, unsigned long attrs) > > > > rc = active_cacheline_insert(entry); > > if (rc == -ENOMEM) { > > - pr_err("cacheline tracking ENOMEM, dma-debug disabled\n"); > > + pr_err_ratelimited("cacheline tracking ENOMEM, dma-debug > > disabled\n"); > > global_disable = true; > > Given that it's supposed to disable itself entirely if it ever gets > here, just how spammy is it exactly? um, quite.. tbf that was in the context of a WIP igt test for shrinker which was trying to cycle thru ~2x RAM size worth of GEM buffers on something like 72 threads. So it could just be threads that had gotten past the dma_debug_disabled() check already before global_disable was set to true? I guess this could be pr_err_once() instead, then? BR, -R > Thanks, > Robin. > > > } else if (rc == -EEXIST && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) { > > err_printk(entry->dev, entry,
Re: [PATCH 10/10] dmapool: improve scalability of dma_pool_free
On 5/31/22 17:54, Keith Busch wrote: > On Tue, May 31, 2022 at 02:23:44PM -0400, Tony Battersby wrote: >> dma_pool_free() scales poorly when the pool contains many pages because >> pool_find_page() does a linear scan of all allocated pages. Improve its >> scalability by replacing the linear scan with a red-black tree lookup. >> In big O notation, this improves the algorithm from O(n^2) to O(n * log n). > The improvement should say O(n) to O(log n), right? That would be the improvement for a single call to dma_pool_alloc or dma_pool_free, but I was going with the improvement for "n" calls instead, which is consistent with the improvement for the example in the cover letter for mpt3sas. I would have to look up the convention to be sure of the proper notation in a situation like this, but I usually think "inserting N items takes N^2 time"; to me it makes less sense to say "inserting 1 item takes N time", because the N seems to come out of nowhere. Tony
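Both ways of counting can be made concrete with a toy cost model — comparison counts only, not the real dmapool code: per call the pool_find_page() lookup drops from O(n) to O(log n); summed over n calls, as the commit message counts it, the totals are O(n^2) versus O(n log n):

```c
#include <assert.h>

/* Total comparisons for n frees when lookup is a linear scan: the i-th
 * call scans up to i pages, so the sum is n(n+1)/2, i.e. O(n^2). */
static long linear_total(int n)
{
	long cost = 0;

	for (int i = 1; i <= n; i++)
		cost += i;
	return cost;
}

/* Total comparisons for n frees when lookup is a balanced-tree search:
 * the i-th call does ~log2(i) comparisons, so the sum is O(n log n). */
static long tree_total(int n)
{
	long cost = 0;

	for (int i = 1; i <= n; i++) {
		int depth = 0;

		for (int x = i; x > 0; x >>= 1)
			depth++;	/* floor(log2(i)) + 1 */
		cost += depth;
	}
	return cost;
}
```

For n = 1024 pages the linear scan totals 524,800 comparisons while the tree walk stays under 11 per call — the same gap either notation is describing, just measured per call or per workload.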
Re: [PATCH] dma-debug: Make things less spammy under memory pressure
On 2022-05-31 22:51, Rob Clark wrote: From: Rob Clark Ratelimit the error msg to avoid flooding the console. Signed-off-by: Rob Clark --- kernel/dma/debug.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c index f8ff598596b8..683966f0247b 100644 --- a/kernel/dma/debug.c +++ b/kernel/dma/debug.c @@ -564,7 +564,7 @@ static void add_dma_entry(struct dma_debug_entry *entry, unsigned long attrs) rc = active_cacheline_insert(entry); if (rc == -ENOMEM) { - pr_err("cacheline tracking ENOMEM, dma-debug disabled\n"); + pr_err_ratelimited("cacheline tracking ENOMEM, dma-debug disabled\n"); global_disable = true; Given that it's supposed to disable itself entirely if it ever gets here, just how spammy is it exactly? Thanks, Robin. } else if (rc == -EEXIST && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) { err_printk(entry->dev, entry,
Re: [PATCH 04/10] dmapool: improve accuracy of debug statistics
On 2022-05-31 20:52, Tony Battersby wrote: On 5/31/22 15:48, Robin Murphy wrote: On 2022-05-31 19:17, Tony Battersby wrote: pool->name, blocks, -(size_t) pages * -(pool->allocation / pool->size), +(size_t) pages * pool->blks_per_alloc, pool->size, pages); size -= temp; next += temp; @@ -168,6 +168,9 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev, retval->size = size; retval->boundary = boundary; retval->allocation = allocation; + retval->blks_per_alloc = + (allocation / boundary) * (boundary / size) + + (allocation % boundary) / size; Do we really need to store this? Sure, 4 divisions (which could possibly be fewer given the constraints on boundary) isn't the absolute cheapest calculation, but I still can't imagine anyone would be polling sysfs stats hard enough to even notice. The stored value is also used in patch #5, in more performance-critical code, although only when debug is enabled. Ah, fair enough. On second look I think 64-bit systems could effectively store this for free anyway, if patch #2 moved "size" down past "dev" in struct dma_pool, such that blks_per_alloc then ends up padding out the hole again. FWIW the mathematician in me also now can't help seeing the algebraic reduction to at least "(allocation + (allocation % boundary)) / size", but is now too tired to reason about the power-of-two constraints and whether the intermediate integer truncations matter... Cheers, Robin.
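Robin's closing musing is easy to sanity-check numerically. The functions below are standalone copies of the two expressions (not the dmapool code): they agree in a typical power-of-two case, but the intermediate integer truncation does matter as soon as size stops dividing boundary, so the reduction is not valid in general:

```c
#include <assert.h>

/* The expression stored in the patch. */
static unsigned blks_per_alloc(unsigned allocation, unsigned boundary,
			       unsigned size)
{
	return (allocation / boundary) * (boundary / size) +
	       (allocation % boundary) / size;
}

/* Robin's proposed algebraic reduction. */
static unsigned reduced(unsigned allocation, unsigned boundary,
			unsigned size)
{
	return (allocation + (allocation % boundary)) / size;
}
```

With allocation = 4096, boundary = 1024, size = 256 both give 16. But keep the same power-of-two allocation/boundary and pick size = 300 (still <= boundary): the stored form gives 4 * (1024/300) = 12 usable blocks, while the reduction gives 4096/300 = 13 — it loses the per-boundary-segment truncation, which is exactly the effect the boundary exists to model.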
Re: [PATCH 10/10] dmapool: improve scalability of dma_pool_free
On Tue, May 31, 2022 at 02:23:44PM -0400, Tony Battersby wrote: > dma_pool_free() scales poorly when the pool contains many pages because > pool_find_page() does a linear scan of all allocated pages. Improve its > scalability by replacing the linear scan with a red-black tree lookup. > In big O notation, this improves the algorithm from O(n^2) to O(n * log n). The improvement should say O(n) to O(log n), right?
[PATCH] dma-debug: Make things less spammy under memory pressure
From: Rob Clark Ratelimit the error msg to avoid flooding the console. Signed-off-by: Rob Clark --- kernel/dma/debug.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/kernel/dma/debug.c b/kernel/dma/debug.c index f8ff598596b8..683966f0247b 100644 --- a/kernel/dma/debug.c +++ b/kernel/dma/debug.c @@ -564,7 +564,7 @@ static void add_dma_entry(struct dma_debug_entry *entry, unsigned long attrs) rc = active_cacheline_insert(entry); if (rc == -ENOMEM) { - pr_err("cacheline tracking ENOMEM, dma-debug disabled\n"); + pr_err_ratelimited("cacheline tracking ENOMEM, dma-debug disabled\n"); global_disable = true; } else if (rc == -EEXIST && !(attrs & DMA_ATTR_SKIP_CPU_SYNC)) { err_printk(entry->dev, entry, -- 2.36.1
Re: [PATCH 08/10] dmapool: cleanup dma_pool_destroy
On Tue, May 31, 2022 at 02:22:21PM -0400, Tony Battersby wrote: > +static void pool_free_page(struct dma_pool *pool, > +struct dma_page *page, > +bool destroying_pool) 'destroying_pool' is always true, so I don't think you need it.
Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
On 2022-05-31 19:51, Jason Gunthorpe wrote: On Tue, May 31, 2022 at 07:07:32PM +0100, Robin Murphy wrote: And we expect the iommu driver to be unable to free page table levels that have IOVA boundaries in them? I'm not entirely sure what you mean there, but in general an unmap request is expected to match some previous map request atomic cmpxchg is OK for inserting new page table levels but it can't protect you against concurrent freeing of page table levels. So without locks it means that page tables can't usually be freed. Which seems to match what the Intel driver does - at least from a cursory look. This is one of the reasons the mm has the mmap/etc lock and spinlocks because we do expect page table levels to get wiped out when VMA's are zap'd - all the different locks provide the protection against page tables disappearing under from something manipulating them. Basically every "lockless" walk in (process) MM land is actually protected by some kind of lock that blocks zap_page_range() from removing the page table levels themselves. I'm not an expert in the Intel or AMD code, so I can only speak with confidence about what we do in io-pgtable-arm, but the main reason for not freeing pagetables is that it's simply not worth the bother of trying to work out whether a whole sub-tree is empty. Not to mention whether it's *still* empty by the time that we may have figured out that it was. There are only 3 instances where we'll free a table while the domain is live. The first is the one legitimate race condition, where two map requests targeting relatively nearby PTEs both go to fill in an intermediate level of table; whoever loses that race frees the table they allocated, but it was never visible to anyone else so that's definitely fine. 
The second is if we're mapping a block entry, and find that there's already a table entry there, wherein we assume the table must be empty, clear the entry, invalidate any walk caches, install the block entry, then free the orphaned table; since we're mapping the entire IOVA range covered by that table, there should be no other operations on that IOVA range attempting to walk the table at the same time, so it's fine. The third is effectively the inverse, if we get a block-sized unmap but find a table entry rather than a block at that point (on the assumption that it's de-facto allowed for a single unmap to cover multiple adjacent mappings as long as it does so exactly); similarly we assume that the table must be full, and no other operations should be racing because we're unmapping its whole IOVA range, so we remove the table entry, invalidate, and free as before. Again for efficiency reasons we don't attempt to validate those assumptions by inspecting the freed tables, so odd behaviour can fall out if the caller *does* do something bogus. For example if two calls race to map a block and a page in the same (unmapped) region, the block mapping will always succeed (and be what ends up in the final pagetable state), but the page mapping may or may not report failure depending on the exact timing. Although we don't have debug dumping for io-pgtable-arm, it's good to be thinking about this, since it's made me realise that dirty-tracking sweeps per that proposal might pose a similar kind of concern, so we might still need to harden these corners for the sake of that. Which also reminds me that somewhere I have some half-finished patches making io-pgtable-arm use the iommu_iotlb_gather freelist, so maybe I'll tackle both concerns at once (perhaps we might even be able to RCU-ify the freelist generically? I'll see how it goes when I get there). Cheers, Robin.
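The first, legitimate-race case — whoever loses the cmpxchg frees a table nobody else ever saw — can be sketched with C11 atomics. This is a schematic of the pattern, not io-pgtable-arm code, and the "race" here is sequential for determinism:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdlib.h>

struct toy_table { int dummy; };

static _Atomic(struct toy_table *) slot;	/* one intermediate PTE */
static int tables_freed;

/* Both racing map paths run this: allocate a candidate table, try to
 * install it with compare-exchange against an empty slot.  The loser's
 * table was never visible to any walker, so it can be freed on the
 * spot with no RCU or locking. */
static struct toy_table *install_table(void)
{
	struct toy_table *newt = calloc(1, sizeof(*newt));
	struct toy_table *expected = NULL;

	if (atomic_compare_exchange_strong(&slot, &expected, newt))
		return newt;	/* we won: our table is now live */

	free(newt);		/* we lost: nobody ever saw newt */
	tables_freed++;
	return expected;	/* use the winner's table instead */
}

static int demo(void)
{
	struct toy_table *a = install_table();
	struct toy_table *b = install_table();	/* models the racing loser */

	return a == b && tables_freed == 1;
}
```

Both callers end up using the same installed table and exactly one allocation is freed — which is why this case needs no deferred reclaim, unlike tables that were ever reachable by a concurrent walker.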
Re: [PATCH 1/6] iommu/qcom: Use the asid read from device-tree if specified
On Tue, May 31, 2022 at 9:19 AM Will Deacon wrote: > > On Tue, May 31, 2022 at 09:15:22AM -0700, Rob Clark wrote: > > On Tue, May 31, 2022 at 8:46 AM Will Deacon wrote: > > > > > > On Fri, May 27, 2022 at 11:28:56PM +0200, Konrad Dybcio wrote: > > > > From: AngeloGioacchino Del Regno > > > > > > > > > > > > As specified in this driver, the context banks are 0x1000 apart. > > > > Problem is that sometimes the context number (our asid) does not > > > > match this logic and we end up using the wrong one: this starts > > > > being a problem in the case that we need to send TZ commands > > > > to do anything on a specific context. > > > > > > I don't understand this. The ASID is a software construct, so it shouldn't > > > matter what we use. If it does matter, then please can you explain why? > > > The > > > fact that the context banks are 0x1000 apart seems unrelated. > > > > I think the connection is that mapping from ctx bank to ASID is 1:1 > > But in what sense? How is the ASID used beyond a tag in the TLB? The commit > message hints at "TZ commands" being a problem. > > I'm not doubting that this is needed to make the thing work, I just don't > understand why. (disclaimer, it has been quite a while since I've looked at the smmu setup with earlier tz, ie. things that use qcom_iommu, but from memory...) We cannot actually assign the context banks ourselves, so in the dt bindings the "ASID" is actually the context bank index. I don't remember exactly if this was a limitation of the tz interface, or result of not being able to program the smmu's global registers ourselves. BR, -R
Re: [PATCH v4 1/6] iommu: Add a per domain PASID for DMA API
Hi Jason, On Tue, 31 May 2022 16:05:50 -0300, Jason Gunthorpe wrote: > On Tue, May 31, 2022 at 10:29:55AM -0700, Jacob Pan wrote: > > > The reason why I store PASID at IOMMU domain is for IOTLB flush within > > the domain. Device driver is not aware of domain level IOTLB flush. We > > also have iova_cookie for each domain which essentially is for > > RIDPASID. > > You need to make the PASID stuff work generically. > > The domain needs to hold a list of all the places it needs to flush > and that list needs to be maintained during attach/detach. > > A single PASID on the domain is obviously never going to work > generically. > I agree, I did it this way really meant to be part of iommu_domain's iommu_dma_cookie, not meant to be global. But for the lack of common storage between identity domain and dma domain, I put it here as global. Then should we also extract RIDPASID to become part of the generic API? i.e. set pasid, flush IOTLB etc. Right? RIDPASID is not in group's pasid_array today. Thanks, Jacob
Re: [PATCH 04/10] dmapool: improve accuracy of debug statistics
On 5/31/22 15:48, Robin Murphy wrote: > On 2022-05-31 19:17, Tony Battersby wrote: > >> pool->name, blocks, >> - (size_t) pages * >> - (pool->allocation / pool->size), >> + (size_t) pages * pool->blks_per_alloc, >> pool->size, pages); >> size -= temp; >> next += temp; >> @@ -168,6 +168,9 @@ struct dma_pool *dma_pool_create(const char *name, >> struct device *dev, >> retval->size = size; >> retval->boundary = boundary; >> retval->allocation = allocation; >> +retval->blks_per_alloc = >> +(allocation / boundary) * (boundary / size) + >> +(allocation % boundary) / size; > Do we really need to store this? Sure, 4 divisions (which could possibly > be fewer given the constraints on boundary) isn't the absolute cheapest > calculation, but I still can't imagine anyone would be polling sysfs > stats hard enough to even notice. > The stored value is also used in patch #5, in more performance-critical code, although only when debug is enabled. Tony
Re: [PATCH 04/10] dmapool: improve accuracy of debug statistics
On 2022-05-31 19:17, Tony Battersby wrote: The "total number of blocks in pool" debug statistic currently does not take the boundary value into account, so it diverges from the "total number of blocks in use" statistic when a boundary is in effect. Add a calculation for the number of blocks per allocation that takes the boundary into account, and use it to replace the inaccurate calculation. This depends on the patch "dmapool: fix boundary comparison" for the calculated blks_per_alloc value to be correct. Signed-off-by: Tony Battersby --- mm/dmapool.c | 7 +-- 1 file changed, 5 insertions(+), 2 deletions(-) diff --git a/mm/dmapool.c b/mm/dmapool.c index 782143144a32..9e30f4425dea 100644 --- a/mm/dmapool.c +++ b/mm/dmapool.c @@ -47,6 +47,7 @@ struct dma_pool { /* the pool */ struct device *dev; unsigned int allocation; unsigned int boundary; + unsigned int blks_per_alloc; char name[32]; struct list_head pools; }; @@ -92,8 +93,7 @@ static ssize_t pools_show(struct device *dev, struct device_attribute *attr, cha /* per-pool info, no real statistics yet */ temp = scnprintf(next, size, "%-16s %4zu %4zu %4u %2u\n", Nit: if we're tinkering with this, it's probably worth updating the whole function to use sysfs_emit{_at}(). pool->name, blocks, -(size_t) pages * -(pool->allocation / pool->size), +(size_t) pages * pool->blks_per_alloc, pool->size, pages); size -= temp; next += temp; @@ -168,6 +168,9 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev, retval->size = size; retval->boundary = boundary; retval->allocation = allocation; + retval->blks_per_alloc = + (allocation / boundary) * (boundary / size) + + (allocation % boundary) / size; Do we really need to store this? Sure, 4 divisions (which could possibly be fewer given the constraints on boundary) isn't the absolute cheapest calculation, but I still can't imagine anyone would be polling sysfs stats hard enough to even notice. Thanks, Robin. 
INIT_LIST_HEAD(&retval->pools);
Re: [PATCH 08/10] dmapool: cleanup dma_pool_destroy
On 2022-05-31 19:22, Tony Battersby wrote: Remove a small amount of code duplication between dma_pool_destroy() and pool_free_page() in preparation for adding more code without having to duplicate it. No functional changes. Signed-off-by: Tony Battersby --- mm/dmapool.c | 34 -- 1 file changed, 20 insertions(+), 14 deletions(-) diff --git a/mm/dmapool.c b/mm/dmapool.c index 8749a9d7927e..58c11dcaa4e4 100644 --- a/mm/dmapool.c +++ b/mm/dmapool.c @@ -250,14 +250,25 @@ static inline bool is_page_busy(struct dma_page *page) return page->in_use != 0; } -static void pool_free_page(struct dma_pool *pool, struct dma_page *page) +static void pool_free_page(struct dma_pool *pool, + struct dma_page *page, + bool destroying_pool) { + void *vaddr = page->vaddr; dma_addr_t dma = page->dma; + if (destroying_pool && is_page_busy(page)) { + dev_err(pool->dev, + "dma_pool_destroy %s, %p busy\n", + pool->name, vaddr); + /* leak the still-in-use consistent memory */ + } else { #ifdefDMAPOOL_DEBUG - memset(page->vaddr, POOL_POISON_FREED, pool->allocation); + memset(vaddr, POOL_POISON_FREED, pool->allocation); #endif - dma_free_coherent(pool->dev, pool->allocation, page->vaddr, dma); + dma_free_coherent(pool->dev, pool->allocation, vaddr, dma); + } + list_del(&page->page_list); If we're tearing down the whole pool, surely we can skip this as well? 
(Same for the second list in patch #9) In fact I think it might make more sense to refactor in the opposite direction and just streamline this directly into dma_pool_destroy(), more like: list_for_each_entry_safe() { if (is_page_busy()) { dev_err(); } else { dma_free_coherent(); } kfree(page); } kfree(page); } @@ -272,7 +283,7 @@ static void pool_free_page(struct dma_pool *pool, struct dma_page *page) */ void dma_pool_destroy(struct dma_pool *pool) { - struct dma_page *page, *tmp; + struct dma_page *page; Nit: you bring this back again in patch #10, so we may as well leave the list_for_each_entry_safe() iterator in place until then as well, and save a bit of churn in this patch. bool empty = false; if (unlikely(!pool)) @@ -288,15 +299,10 @@ void dma_pool_destroy(struct dma_pool *pool) device_remove_file(pool->dev, &dev_attr_pools); mutex_unlock(&pools_reg_lock); - list_for_each_entry_safe(page, tmp, &pool->page_list, page_list) { - if (is_page_busy(page)) { - dev_err(pool->dev, "%s %s, %p busy\n", __func__, - pool->name, page->vaddr); - /* leak the still-in-use consistent memory */ - list_del(&page->page_list); - kfree(page); - } else - pool_free_page(pool, page); + while ((page = list_first_entry_or_null(&pool->page_list, + struct dma_page, + page_list))) { + pool_free_page(pool, page, true); } kfree(pool); @@ -469,7 +475,7 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma) page->offset = offset; /* * Resist a temptation to do -*if (!is_page_busy(page)) pool_free_page(pool, page); +*if (!is_page_busy(page)) pool_free_page(pool, page, false); Further to the above, even if we did retain a separate function, if an argument is hard-coded at the one single callsite, and the only reference to passing any other value is in a comment effectively saying "don't do this", do we really need to pretend it's an argument at all? ;) FWIW I'd just reword the comment in more general terms, e.g. "Resist the temptation to free unused pages immediately..." 
* Better have a few empty pages hang around.
*/
spin_unlock_irqrestore(&pool->lock, flags);

Thanks,
Robin.
Re: [PATCH v4 1/6] iommu: Add a per domain PASID for DMA API
On Tue, May 31, 2022 at 10:29:55AM -0700, Jacob Pan wrote: > The reason why I store PASID at IOMMU domain is for IOTLB flush within the > domain. Device driver is not aware of domain level IOTLB flush. We also > have iova_cookie for each domain which essentially is for RIDPASID. You need to make the PASID stuff work generically. The domain needs to hold a list of all the places it needs to flush and that list needs to be maintained during attach/detach. A single PASID on the domain is obviously never going to work generically. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
On Tue, May 31, 2022 at 07:07:32PM +0100, Robin Murphy wrote: > > And we expect the iommu driver to be unable to free page table levels > > that have IOVA boundaries in them? > > I'm not entirely sure what you mean there, but in general an unmap request > is expected to match some previous map request atomic cmpxchg is OK for inserting new page table levels but it can't protect you against concurrent freeing of page table levels. So without locks it means that page tables can't usually be freed. Which seems to match what the Intel driver does - at least from a cursory look. This is one of the reasons the mm has the mmap/etc lock and spinlocks because we do expect page table levels to get wiped out when VMA's are zap'd - all the different locks provide the protection against page tables disappearing under from something manipulating them. Basically every "lockless" walk in (process) MM land is actually protected by some kind of lock that blocks zap_page_range() from removing the page table levels themselves. > They might either unmap the entire region originally mapped, or just > the requested part, or might fail entirely (IIRC there was some > nasty code in VFIO for detecting a particular behaviour). This is something I did differently in iommufd. It always generates unmaps that are strict supersets of the maps it issued. So it would be a kernel bug if the driver unmaps more or less than requested. > Oh, I've spent the last couple of weeks hacking up horrible things > manipulating entries in init_mm, and never realised that that was actually > the special case. Oh well, live and learn. The init_mm is sort of different, it doesn't have zap in quite the same way, for example. I was talking about the typical process mm. Anyhow, the right solution is to use RCU as I described before, Baolu do you want to try? Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH 02/10] dmapool: cleanup integer types
To represent the size of a single allocation, dmapool currently uses 'unsigned int' in some places and 'size_t' in other places. Standardize on 'unsigned int' to reduce overhead, but use 'size_t' when counting all the blocks in the entire pool. Signed-off-by: Tony Battersby --- This puts an upper bound on 'size' of INT_MAX to avoid overflowing the following comparison in pool_initialise_page(): unsigned int offset = 0; unsigned int next = offset + pool->size; if (unlikely((next + pool->size) > ... 'boundary' is passed in as a size_t but gets stored as an unsigned int. 'boundary' values >= 'allocation' do not have any effect, so clipping 'boundary' to 'allocation' keeps it within the range of unsigned int without affecting anything else. A few lines above (not in the diff) you can see that if 'boundary' is passed in as 0 then it is set to 'allocation', so it is nothing new. For reference, here is the relevant code after being patched: if (!boundary) boundary = allocation; else if ((boundary < size) || (boundary & (boundary - 1))) return NULL; boundary = min(boundary, allocation); mm/dmapool.c | 19 +++ 1 file changed, 11 insertions(+), 8 deletions(-) diff --git a/mm/dmapool.c b/mm/dmapool.c index 0f89de408cbe..d7b372248111 100644 --- a/mm/dmapool.c +++ b/mm/dmapool.c @@ -43,10 +43,10 @@ struct dma_pool { /* the pool */ struct list_head page_list; spinlock_t lock; - size_t size; + unsigned int size; struct device *dev; - size_t allocation; - size_t boundary; + unsigned int allocation; + unsigned int boundary; char name[32]; struct list_head pools; }; @@ -80,7 +80,7 @@ static ssize_t pools_show(struct device *dev, struct device_attribute *attr, cha mutex_lock(&pools_lock); list_for_each_entry(pool, &dev->dma_pools, pools) { unsigned pages = 0; - unsigned blocks = 0; + size_t blocks = 0; spin_lock_irq(&pool->lock); list_for_each_entry(page, &pool->page_list, page_list) { @@ -90,9 +90,10 @@ static ssize_t pools_show(struct device *dev, struct device_attribute *attr, cha 
spin_unlock_irq(&pool->lock); /* per-pool info, no real statistics yet */ - temp = scnprintf(next, size, "%-16s %4u %4zu %4zu %2u\n", + temp = scnprintf(next, size, "%-16s %4zu %4zu %4u %2u\n", pool->name, blocks, -pages * (pool->allocation / pool->size), +(size_t) pages * +(pool->allocation / pool->size), pool->size, pages); size -= temp; next += temp; @@ -139,7 +140,7 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev, else if (align & (align - 1)) return NULL; - if (size == 0) + if (size == 0 || size > INT_MAX) return NULL; else if (size < 4) size = 4; @@ -152,6 +153,8 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev, else if ((boundary < size) || (boundary & (boundary - 1))) return NULL; + boundary = min(boundary, allocation); + retval = kmalloc(sizeof(*retval), GFP_KERNEL); if (!retval) return retval; @@ -312,7 +315,7 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags, { unsigned long flags; struct dma_page *page; - size_t offset; + unsigned int offset; void *retval; might_alloc(mem_flags); -- 2.25.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH 00/10] mpt3sas and dmapool scalability
This patch series improves dmapool scalability by replacing linear scans with red-black trees. History: In 2018 this patch series made it through 4 versions. v1 used red-black trees; v2 - v4 put the dma pool info directly into struct page and used virt_to_page() to get at it. v4 made a brief appearance in linux-next, but it caused problems on non-x86 archs where virt_to_page() doesn't work with dma_alloc_coherent, so it was reverted. I was too busy at the time to repost the red-black tree version, and I forgot about it until now. This version is based on the red-black trees of v1, but addressing all the review comments I got at the time and with additional cleanup patches. Note that Keith Busch is also working on improving dmapool scalability, so for now I would recommend not merging my scalability patches until Keith's approach can be evaluated. In the meantime, my patches can serve as a benchmark comparison. I also have a number of cleanup patches in my series that could be useful on their own. References: v1 https://lore.kernel.org/linux-mm/73ec1f52-d758-05df-fb6a-41d269e91...@cybernetics.com/ v2 https://lore.kernel.org/linux-mm/ec701153-fdc9-37f3-c267-f056159b4...@cybernetics.com/ v3 https://lore.kernel.org/linux-mm/d48854ff-995d-228e-8356-54c141c32...@cybernetics.com/ v4 https://lore.kernel.org/linux-mm/88395080-efc1-4e7b-f813-bb90c86d0...@cybernetics.com/ problem caused by virt_to_page() https://lore.kernel.org/linux-kernel/20181206013054.gi6...@atomide.com/ Keith Busch's dmapool performance enhancements https://lore.kernel.org/linux-mm/20220428202714.17630-1-kbu...@kernel.org/ Below is my original description of the motivation for these patches. drivers/scsi/mpt3sas is running into a scalability problem with the kernel's DMA pool implementation. 
With a LSI/Broadcom SAS 9300-8i 12Gb/s HBA and max_sgl_entries=256, during modprobe, mpt3sas does the equivalent of:

chain_dma_pool = dma_pool_create(size = 128);
for (i = 0; i < 373959; i++) {
	dma_addr[i] = dma_pool_alloc(chain_dma_pool);
}

And at rmmod, system shutdown, or system reboot, mpt3sas does the equivalent of:

for (i = 0; i < 373959; i++) {
	dma_pool_free(chain_dma_pool, dma_addr[i]);
}
dma_pool_destroy(chain_dma_pool);

With this usage, both dma_pool_alloc() and dma_pool_free() exhibit O(n^2) complexity, although dma_pool_free() is much worse due to implementation details. On my system, the dma_pool_free() loop above takes about 9 seconds to run. Note that the problem was even worse before commit 74522a92bbf0 ("scsi: mpt3sas: Optimize I/O memory consumption in driver."), where the dma_pool_free() loop could take ~30 seconds.

mpt3sas also has some other DMA pools, but chain_dma_pool is the only one with so many allocations:

cat /sys/devices/pci:80/:80:07.0/:85:00.0/pools
(manually cleaned up column alignment)

poolinfo - 0.1
reply_post_free_array pool       1      21     192      1
reply_free pool                  1       1   41728      1
reply pool                       1       1 1335296      1
sense pool                       1       1  970272      1
chain pool                  373959  386048     128  12064
reply_post_free pool            12      12  166528     12

The patches in this series improve the scalability of the DMA pool implementation, which significantly reduces the running time of the DMA alloc/free loops. With the patches applied, "modprobe mpt3sas", "rmmod mpt3sas", and system shutdown/reboot with mpt3sas loaded are significantly faster. Here are some benchmarks (of DMA alloc/free only, not the entire modprobe/rmmod):

dma_pool_create() + dma_pool_alloc() loop, size = 128, count = 373959
  original:          350 ms (   1x)
  dmapool patches:    18 ms (  19x)

dma_pool_free() loop + dma_pool_destroy(), size = 128, count = 373959
  original:         8901 ms (   1x)
  dmapool patches:    19 ms ( 477x)
[PATCH 01/10] dmapool: remove checks for dev == NULL
dmapool originally tried to support pools without a device because dma_alloc_coherent() supports allocations without a device. But nobody ended up using dma pools without a device, so the current checks in dmapool.c for pool->dev == NULL are both insufficient and causing bloat. Remove them. Signed-off-by: Tony Battersby --- mm/dmapool.c | 42 +++--- 1 file changed, 11 insertions(+), 31 deletions(-) diff --git a/mm/dmapool.c b/mm/dmapool.c index a7eb5d0eb2da..0f89de408cbe 100644 --- a/mm/dmapool.c +++ b/mm/dmapool.c @@ -275,7 +275,7 @@ void dma_pool_destroy(struct dma_pool *pool) mutex_lock(&pools_reg_lock); mutex_lock(&pools_lock); list_del(&pool->pools); - if (pool->dev && list_empty(&pool->dev->dma_pools)) + if (list_empty(&pool->dev->dma_pools)) empty = true; mutex_unlock(&pools_lock); if (empty) @@ -284,12 +284,8 @@ void dma_pool_destroy(struct dma_pool *pool) list_for_each_entry_safe(page, tmp, &pool->page_list, page_list) { if (is_page_busy(page)) { - if (pool->dev) - dev_err(pool->dev, "%s %s, %p busy\n", __func__, - pool->name, page->vaddr); - else - pr_err("%s %s, %p busy\n", __func__, - pool->name, page->vaddr); + dev_err(pool->dev, "%s %s, %p busy\n", __func__, + pool->name, page->vaddr); /* leak the still-in-use consistent memory */ list_del(&page->page_list); kfree(page); @@ -351,12 +347,8 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags, for (i = sizeof(page->offset); i < pool->size; i++) { if (data[i] == POOL_POISON_FREED) continue; - if (pool->dev) - dev_err(pool->dev, "%s %s, %p (corrupted)\n", - __func__, pool->name, retval); - else - pr_err("%s %s, %p (corrupted)\n", - __func__, pool->name, retval); + dev_err(pool->dev, "%s %s, %p (corrupted)\n", + __func__, pool->name, retval); /* * Dump the first 4 bytes even if they are not @@ -411,12 +403,8 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma) page = pool_find_page(pool, dma); if (!page) { spin_unlock_irqrestore(&pool->lock, flags); - if (pool->dev) - 
dev_err(pool->dev, "%s %s, %p/%pad (bad dma)\n", - __func__, pool->name, vaddr, &dma); - else - pr_err("%s %s, %p/%pad (bad dma)\n", - __func__, pool->name, vaddr, &dma); + dev_err(pool->dev, "%s %s, %p/%pad (bad dma)\n", + __func__, pool->name, vaddr, &dma); return; } @@ -426,12 +414,8 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma) #ifdef DMAPOOL_DEBUG if ((dma - page->dma) != offset) { spin_unlock_irqrestore(&pool->lock, flags); - if (pool->dev) - dev_err(pool->dev, "%s %s, %p (bad vaddr)/%pad\n", - __func__, pool->name, vaddr, &dma); - else - pr_err("%s %s, %p (bad vaddr)/%pad\n", - __func__, pool->name, vaddr, &dma); + dev_err(pool->dev, "%s %s, %p (bad vaddr)/%pad\n", + __func__, pool->name, vaddr, &dma); return; } { @@ -442,12 +426,8 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma) continue; } spin_unlock_irqrestore(&pool->lock, flags); - if (pool->dev) - dev_err(pool->dev, "%s %s, dma %pad already free\n", - __func__, pool->name, &dma); - else - pr_err("%s %s, dma %pad already free\n", - __func__, pool->name, &dma); + dev_err(pool->dev, "%s %s, dma %pad already free\n", + __func__, pool->name, &dma); return; } } -- 2.25.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH 03/10] dmapool: fix boundary comparison
Fix the boundary comparison when constructing the list of free blocks for the case that 'size' is a power of two. Since 'boundary' is also a power of two, that would make 'boundary' a multiple of 'size', in which case a single block would never cross the boundary. This bug would cause some of the allocated memory to be wasted (but not leaked).

Example:

size = 512
boundary = 2048
allocation = 4096

Address range:
   0 -  511
 512 - 1023
1024 - 1535
1536 - 2047 *
2048 - 2559
2560 - 3071
3072 - 3583
3584 - 4095 *

Prior to this fix, the address ranges marked with "*" would not have been used even though they didn't cross the given boundary.

Fixes: e34f44b3517f ("pool: Improve memory usage for devices which can't cross boundaries")
Signed-off-by: Tony Battersby
---
 mm/dmapool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index d7b372248111..782143144a32 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -210,7 +210,7 @@ static void pool_initialise_page(struct dma_pool *pool, struct dma_page *page)
 	do {
 		unsigned int next = offset + pool->size;
 
-		if (unlikely((next + pool->size) >= next_boundary)) {
+		if (unlikely((next + pool->size) > next_boundary)) {
 			next = next_boundary;
 			next_boundary += pool->boundary;
 		}
-- 
2.25.1
Re: [PATCH 01/10] dmapool: remove checks for dev == NULL
On 2022-05-31 19:12, Tony Battersby wrote: dmapool originally tried to support pools without a device because dma_alloc_coherent() supports allocations without a device. But nobody ended up using dma pools without a device, so the current checks in dmapool.c for pool->dev == NULL are both insufficient and causing bloat. Remove them. Furthermore, dma_pool_create() already dereferences the dev argument unconditionally, so there's no way we could possibly get as far as these checks even if a caller did ever pass NULL. Reviewed-by: Robin Murphy Signed-off-by: Tony Battersby --- mm/dmapool.c | 42 +++--- 1 file changed, 11 insertions(+), 31 deletions(-) diff --git a/mm/dmapool.c b/mm/dmapool.c index a7eb5d0eb2da..0f89de408cbe 100644 --- a/mm/dmapool.c +++ b/mm/dmapool.c @@ -275,7 +275,7 @@ void dma_pool_destroy(struct dma_pool *pool) mutex_lock(&pools_reg_lock); mutex_lock(&pools_lock); list_del(&pool->pools); - if (pool->dev && list_empty(&pool->dev->dma_pools)) + if (list_empty(&pool->dev->dma_pools)) empty = true; mutex_unlock(&pools_lock); if (empty) @@ -284,12 +284,8 @@ void dma_pool_destroy(struct dma_pool *pool) list_for_each_entry_safe(page, tmp, &pool->page_list, page_list) { if (is_page_busy(page)) { - if (pool->dev) - dev_err(pool->dev, "%s %s, %p busy\n", __func__, - pool->name, page->vaddr); - else - pr_err("%s %s, %p busy\n", __func__, - pool->name, page->vaddr); + dev_err(pool->dev, "%s %s, %p busy\n", __func__, + pool->name, page->vaddr); /* leak the still-in-use consistent memory */ list_del(&page->page_list); kfree(page); @@ -351,12 +347,8 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags, for (i = sizeof(page->offset); i < pool->size; i++) { if (data[i] == POOL_POISON_FREED) continue; - if (pool->dev) - dev_err(pool->dev, "%s %s, %p (corrupted)\n", - __func__, pool->name, retval); - else - pr_err("%s %s, %p (corrupted)\n", - __func__, pool->name, retval); + dev_err(pool->dev, "%s %s, %p (corrupted)\n", + __func__, pool->name, retval); /* * 
Dump the first 4 bytes even if they are not @@ -411,12 +403,8 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma) page = pool_find_page(pool, dma); if (!page) { spin_unlock_irqrestore(&pool->lock, flags); - if (pool->dev) - dev_err(pool->dev, "%s %s, %p/%pad (bad dma)\n", - __func__, pool->name, vaddr, &dma); - else - pr_err("%s %s, %p/%pad (bad dma)\n", - __func__, pool->name, vaddr, &dma); + dev_err(pool->dev, "%s %s, %p/%pad (bad dma)\n", + __func__, pool->name, vaddr, &dma); return; } @@ -426,12 +414,8 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma) #ifdefDMAPOOL_DEBUG if ((dma - page->dma) != offset) { spin_unlock_irqrestore(&pool->lock, flags); - if (pool->dev) - dev_err(pool->dev, "%s %s, %p (bad vaddr)/%pad\n", - __func__, pool->name, vaddr, &dma); - else - pr_err("%s %s, %p (bad vaddr)/%pad\n", - __func__, pool->name, vaddr, &dma); + dev_err(pool->dev, "%s %s, %p (bad vaddr)/%pad\n", + __func__, pool->name, vaddr, &dma); return; } { @@ -442,12 +426,8 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma) continue; } spin_unlock_irqrestore(&pool->lock, flags); - if (pool->dev) - dev_err(pool->dev, "%s %s, dma %pad already free\n", - __func__, pool->name, &dma); - else - pr_err("%s %s, dma %pad already free\n", - __func__, pool->name, &dma); + dev_err(pool->dev, "%s %s, dma %pad already free\n", + __func__, pool->name, &dma);
[PATCH 10/10] dmapool: improve scalability of dma_pool_free
dma_pool_free() scales poorly when the pool contains many pages because pool_find_page() does a linear scan of all allocated pages. Improve its scalability by replacing the linear scan with a red-black tree lookup. In big O notation, this improves the algorithm from O(n^2) to O(n * log n). Signed-off-by: Tony Battersby --- mm/dmapool.c | 128 --- 1 file changed, 100 insertions(+), 28 deletions(-) diff --git a/mm/dmapool.c b/mm/dmapool.c index b3dd2ace0d2a..24535483f781 100644 --- a/mm/dmapool.c +++ b/mm/dmapool.c @@ -12,11 +12,12 @@ * Many older drivers still have their own code to do this. * * The current design of this allocator is fairly simple. The pool is - * represented by the 'struct dma_pool' which keeps a doubly-linked list of - * allocated pages. Each page in the page_list is split into blocks of at - * least 'size' bytes. Free blocks are tracked in an unsorted singly-linked - * list of free blocks within the page. Used blocks aren't tracked, but we - * keep a count of how many are currently allocated from each page. + * represented by the 'struct dma_pool' which keeps a red-black tree of all + * allocated pages, keyed by DMA address for fast lookup when freeing. + * Each page in the page_tree is split into blocks of at least 'size' bytes. + * Free blocks are tracked in an unsorted singly-linked list of free blocks + * within the page. Used blocks aren't tracked, but we keep a count of how + * many are currently allocated from each page. * * The avail_page_list keeps track of pages that have one or more free blocks * available to (re)allocate. 
Pages are moved in and out of avail_page_list @@ -36,6 +37,7 @@ #include #include #include +#include #include #include #include @@ -45,7 +47,7 @@ #endif struct dma_pool { /* the pool */ - struct list_head page_list; + struct rb_root page_tree; struct list_head avail_page_list; spinlock_t lock; unsigned int size; @@ -58,7 +60,7 @@ struct dma_pool { /* the pool */ }; struct dma_page { /* cacheable header for 'allocation' bytes */ - struct list_head page_list; + struct rb_node page_node; struct list_head avail_page_link; void *vaddr; dma_addr_t dma; @@ -69,6 +71,11 @@ struct dma_page {/* cacheable header for 'allocation' bytes */ static DEFINE_MUTEX(pools_lock); static DEFINE_MUTEX(pools_reg_lock); +static inline struct dma_page *rb_to_dma_page(struct rb_node *node) +{ + return rb_entry(node, struct dma_page, page_node); +} + static ssize_t pools_show(struct device *dev, struct device_attribute *attr, char *buf) { unsigned temp; @@ -76,6 +83,7 @@ static ssize_t pools_show(struct device *dev, struct device_attribute *attr, cha char *next; struct dma_page *page; struct dma_pool *pool; + struct rb_node *node; next = buf; size = PAGE_SIZE; @@ -90,7 +98,10 @@ static ssize_t pools_show(struct device *dev, struct device_attribute *attr, cha size_t blocks = 0; spin_lock_irq(&pool->lock); - list_for_each_entry(page, &pool->page_list, page_list) { + for (node = rb_first(&pool->page_tree); +node; +node = rb_next(node)) { + page = rb_to_dma_page(node); pages++; blocks += page->in_use; } @@ -169,7 +180,7 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev, retval->dev = dev; - INIT_LIST_HEAD(&retval->page_list); + retval->page_tree = RB_ROOT; INIT_LIST_HEAD(&retval->avail_page_list); spin_lock_init(&retval->lock); retval->size = size; @@ -213,6 +224,63 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev, } EXPORT_SYMBOL(dma_pool_create); +/* + * Find the dma_page that manages the given DMA address. 
+ */ +static struct dma_page *pool_find_page(struct dma_pool *pool, dma_addr_t dma) +{ + struct rb_node *node = pool->page_tree.rb_node; + + while (node) { + struct dma_page *page = rb_to_dma_page(node); + + if (dma < page->dma) + node = node->rb_left; + else if ((dma - page->dma) >= pool->allocation) + node = node->rb_right; + else + return page; + } + return NULL; +} + +/* + * Insert a dma_page into the page_tree. + */ +static int pool_insert_page(struct dma_pool *pool, struct dma_page *new_page) +{ + dma_addr_t dma = new_page->dma; + struct rb_node **node = &(pool->page_tree.rb_node), *parent = NULL; + + while (*node) { + struct dma_page *this_page = rb_to_dma_page(*node); + + parent = *node; + if (dma < this_page-
[PATCH 09/10] dmapool: improve scalability of dma_pool_alloc
dma_pool_alloc() scales poorly when allocating a large number of pages because it does a linear scan of all previously-allocated pages before allocating a new one. Improve its scalability by maintaining a separate list of pages that have free blocks ready to (re)allocate. In big O notation, this improves the algorithm from O(n^2) to O(n). Signed-off-by: Tony Battersby --- mm/dmapool.c | 27 +++ 1 file changed, 23 insertions(+), 4 deletions(-) diff --git a/mm/dmapool.c b/mm/dmapool.c index 58c11dcaa4e4..b3dd2ace0d2a 100644 --- a/mm/dmapool.c +++ b/mm/dmapool.c @@ -17,6 +17,10 @@ * least 'size' bytes. Free blocks are tracked in an unsorted singly-linked * list of free blocks within the page. Used blocks aren't tracked, but we * keep a count of how many are currently allocated from each page. + * + * The avail_page_list keeps track of pages that have one or more free blocks + * available to (re)allocate. Pages are moved in and out of avail_page_list + * as their blocks are allocated and freed. 
*/ #include @@ -42,6 +46,7 @@ struct dma_pool { /* the pool */ struct list_head page_list; + struct list_head avail_page_list; spinlock_t lock; unsigned int size; struct device *dev; @@ -54,6 +59,7 @@ struct dma_pool { /* the pool */ struct dma_page { /* cacheable header for 'allocation' bytes */ struct list_head page_list; + struct list_head avail_page_link; void *vaddr; dma_addr_t dma; unsigned int in_use; @@ -164,6 +170,7 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev, retval->dev = dev; INIT_LIST_HEAD(&retval->page_list); + INIT_LIST_HEAD(&retval->avail_page_list); spin_lock_init(&retval->lock); retval->size = size; retval->boundary = boundary; @@ -270,6 +277,7 @@ static void pool_free_page(struct dma_pool *pool, } list_del(&page->page_list); + list_del(&page->avail_page_link); kfree(page); } @@ -330,10 +338,11 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags, might_alloc(mem_flags); spin_lock_irqsave(&pool->lock, flags); - list_for_each_entry(page, &pool->page_list, page_list) { - if (page->offset < pool->allocation) - goto ready; - } + page = list_first_entry_or_null(&pool->avail_page_list, + struct dma_page, + avail_page_link); + if (page) + goto ready; /* pool_alloc_page() might sleep, so temporarily drop &pool->lock */ spin_unlock_irqrestore(&pool->lock, flags); @@ -345,10 +354,13 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags, spin_lock_irqsave(&pool->lock, flags); list_add(&page->page_list, &pool->page_list); + list_add(&page->avail_page_link, &pool->avail_page_list); ready: page->in_use++; offset = page->offset; page->offset = *(int *)(page->vaddr + offset); + if (page->offset >= pool->allocation) + list_del_init(&page->avail_page_link); retval = offset + page->vaddr; *handle = offset + page->dma; #ifdef DMAPOOL_DEBUG @@ -470,6 +482,13 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma) memset(vaddr, 0, pool->size); #endif + /* +* list_empty() on the page tests if the page is 
already linked into +* avail_page_list to avoid adding it more than once. +*/ + if (list_empty(&page->avail_page_link)) + list_add(&page->avail_page_link, &pool->avail_page_list); + page->in_use--; *(int *)vaddr = page->offset; page->offset = offset; -- 2.25.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH 08/10] dmapool: cleanup dma_pool_destroy
Remove a small amount of code duplication between dma_pool_destroy() and pool_free_page() in preparation for adding more code without having to duplicate it. No functional changes.

Signed-off-by: Tony Battersby
---
 mm/dmapool.c | 34 ++++++++++++++++++--------------
 1 file changed, 20 insertions(+), 14 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 8749a9d7927e..58c11dcaa4e4 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -250,14 +250,25 @@ static inline bool is_page_busy(struct dma_page *page)
 	return page->in_use != 0;
 }
 
-static void pool_free_page(struct dma_pool *pool, struct dma_page *page)
+static void pool_free_page(struct dma_pool *pool,
+			   struct dma_page *page,
+			   bool destroying_pool)
 {
+	void *vaddr = page->vaddr;
 	dma_addr_t dma = page->dma;
 
+	if (destroying_pool && is_page_busy(page)) {
+		dev_err(pool->dev,
+			"dma_pool_destroy %s, %p busy\n",
+			pool->name, vaddr);
+		/* leak the still-in-use consistent memory */
+	} else {
 #ifdef	DMAPOOL_DEBUG
-	memset(page->vaddr, POOL_POISON_FREED, pool->allocation);
+		memset(vaddr, POOL_POISON_FREED, pool->allocation);
 #endif
-	dma_free_coherent(pool->dev, pool->allocation, page->vaddr, dma);
+		dma_free_coherent(pool->dev, pool->allocation, vaddr, dma);
+	}
+
 	list_del(&page->page_list);
 	kfree(page);
 }
@@ -272,7 +283,7 @@ static void pool_free_page(struct dma_pool *pool, struct dma_page *page)
  */
 void dma_pool_destroy(struct dma_pool *pool)
 {
-	struct dma_page *page, *tmp;
+	struct dma_page *page;
 	bool empty = false;
 
 	if (unlikely(!pool))
@@ -288,15 +299,10 @@ void dma_pool_destroy(struct dma_pool *pool)
 		device_remove_file(pool->dev, &dev_attr_pools);
 	mutex_unlock(&pools_reg_lock);
 
-	list_for_each_entry_safe(page, tmp, &pool->page_list, page_list) {
-		if (is_page_busy(page)) {
-			dev_err(pool->dev, "%s %s, %p busy\n", __func__,
-				pool->name, page->vaddr);
-			/* leak the still-in-use consistent memory */
-			list_del(&page->page_list);
-			kfree(page);
-		} else
-			pool_free_page(pool, page);
+	while ((page = list_first_entry_or_null(&pool->page_list,
+						struct dma_page,
+						page_list))) {
+		pool_free_page(pool, page, true);
 	}
 
 	kfree(pool);
@@ -469,7 +475,7 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
 	page->offset = offset;
 	/*
 	 * Resist a temptation to do
-	 *    if (!is_page_busy(page)) pool_free_page(pool, page);
+	 *    if (!is_page_busy(page)) pool_free_page(pool, page, false);
 	 * Better have a few empty pages hang around.
 	 */
 	spin_unlock_irqrestore(&pool->lock, flags);
--
2.25.1
___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH 07/10] dmapool: speedup DMAPOOL_DEBUG with init_on_alloc
Avoid double-memset of the same allocated memory in dma_pool_alloc() when both DMAPOOL_DEBUG is enabled and init_on_alloc=1.

Signed-off-by: Tony Battersby
---
 mm/dmapool.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 49019ef6dd83..8749a9d7927e 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -365,7 +365,7 @@ void *dma_pool_alloc(struct dma_pool *pool, gfp_t mem_flags,
 			break;
 		}
 	}
-	if (!(mem_flags & __GFP_ZERO))
+	if (!want_init_on_alloc(mem_flags))
 		memset(retval, POOL_POISON_ALLOCATED, pool->size);
 #endif
 	spin_unlock_irqrestore(&pool->lock, flags);
--
2.25.1
[PATCH 06/10] dmapool: ignore init_on_free when DMAPOOL_DEBUG enabled
There are two cases:

1) In the normal case that the memory is being freed correctly, then DMAPOOL_DEBUG will memset the memory anyway, so speed things up by avoiding a double-memset of the same memory.

2) In the abnormal case that DMAPOOL_DEBUG detects that a driver passes incorrect parameters to dma_pool_free() (e.g. double-free, invalid free, mismatched vaddr/dma, etc.), then that is a kernel bug, and we don't want to clear the passed-in possibly-invalid memory pointer because we can't be sure that the memory is really free. So don't clear it just because init_on_free=1.

Signed-off-by: Tony Battersby
---
 mm/dmapool.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 7a9161d4f7a6..49019ef6dd83 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -415,8 +415,6 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
 	}
 
 	offset = vaddr - page->vaddr;
-	if (want_init_on_free())
-		memset(vaddr, 0, pool->size);
 #ifdef	DMAPOOL_DEBUG
 	if ((dma - page->dma) != offset) {
 		spin_unlock_irqrestore(&pool->lock, flags);
@@ -461,6 +459,9 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
 			goto freelist_corrupt;
 	}
 	memset(vaddr, POOL_POISON_FREED, pool->size);
+#else
+	if (want_init_on_free())
+		memset(vaddr, 0, pool->size);
 #endif
 
 	page->in_use--;
--
2.25.1
[PATCH 05/10] dmapool: debug: prevent endless loop in case of corruption
Prevent a possible endless loop with DMAPOOL_DEBUG enabled if a buggy driver corrupts DMA pool memory.

Signed-off-by: Tony Battersby
---
 mm/dmapool.c | 37 +++++++++++++++++++++++++++++++-------
 1 file changed, 30 insertions(+), 7 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 9e30f4425dea..7a9161d4f7a6 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -426,16 +426,39 @@ void dma_pool_free(struct dma_pool *pool, void *vaddr, dma_addr_t dma)
 	}
 	{
 		unsigned int chain = page->offset;
+		unsigned int free_blks = 0;
+
 		while (chain < pool->allocation) {
-			if (chain != offset) {
-				chain = *(int *)(page->vaddr + chain);
-				continue;
+			if (unlikely(chain == offset)) {
+				spin_unlock_irqrestore(&pool->lock, flags);
+				dev_err(pool->dev,
+					"%s %s, dma %pad already free\n",
+					__func__, pool->name, &dma);
+				return;
 			}
-			spin_unlock_irqrestore(&pool->lock, flags);
-			dev_err(pool->dev, "%s %s, dma %pad already free\n",
-				__func__, pool->name, &dma);
-			return;
+
+			/*
+			 * A buggy driver could corrupt the freelist by
+			 * use-after-free, buffer overflow, etc.  Besides
+			 * checking for corruption, this also prevents an
+			 * endless loop in case corruption causes a circular
+			 * loop in the freelist.
+			 */
+			if (unlikely(++free_blks + page->in_use >
+				     pool->blks_per_alloc)) {
+ freelist_corrupt:
+				spin_unlock_irqrestore(&pool->lock, flags);
+				dev_err(pool->dev,
+					"%s %s, freelist corrupted\n",
+					__func__, pool->name);
+				return;
+			}
+
+			chain = *(int *)(page->vaddr + chain);
 		}
+		if (unlikely(free_blks + page->in_use !=
+			     pool->blks_per_alloc))
+			goto freelist_corrupt;
 	}
 	memset(vaddr, POOL_POISON_FREED, pool->size);
 #endif
--
2.25.1
[PATCH 04/10] dmapool: improve accuracy of debug statistics
The "total number of blocks in pool" debug statistic currently does not take the boundary value into account, so it diverges from the "total number of blocks in use" statistic when a boundary is in effect. Add a calculation for the number of blocks per allocation that takes the boundary into account, and use it to replace the inaccurate calculation.

This depends on the patch "dmapool: fix boundary comparison" for the calculated blks_per_alloc value to be correct.

Signed-off-by: Tony Battersby
---
 mm/dmapool.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/mm/dmapool.c b/mm/dmapool.c
index 782143144a32..9e30f4425dea 100644
--- a/mm/dmapool.c
+++ b/mm/dmapool.c
@@ -47,6 +47,7 @@ struct dma_pool {	/* the pool */
 	struct device *dev;
 	unsigned int allocation;
 	unsigned int boundary;
+	unsigned int blks_per_alloc;
 	char name[32];
 	struct list_head pools;
 };
@@ -92,8 +93,7 @@ static ssize_t pools_show(struct device *dev, struct device_attribute *attr, cha
 		/* per-pool info, no real statistics yet */
 		temp = scnprintf(next, size, "%-16s %4zu %4zu %4u %2u\n",
 				 pool->name, blocks,
-				 (size_t) pages *
-				 (pool->allocation / pool->size),
+				 (size_t) pages * pool->blks_per_alloc,
 				 pool->size, pages);
 		size -= temp;
 		next += temp;
@@ -168,6 +168,9 @@ struct dma_pool *dma_pool_create(const char *name, struct device *dev,
 	retval->size = size;
 	retval->boundary = boundary;
 	retval->allocation = allocation;
+	retval->blks_per_alloc =
+		(allocation / boundary) * (boundary / size) +
+		(allocation % boundary) / size;
 
 	INIT_LIST_HEAD(&retval->pools);
--
2.25.1
Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
On 2022-05-31 17:21, Jason Gunthorpe wrote: On Tue, May 31, 2022 at 05:01:46PM +0100, Robin Murphy wrote: The DMA API doesn't need locking, partly since it can trust itself not to do stupid things, and mostly because it's DMA API performance that's fundamentally incompatible with serialisation anyway. Why do you think we have a complicated per-CPU IOVA caching mechanism, if not to support big multi-queue devices with multiple CPU threads mapping/unmapping in different parts of the same DMA domain concurrently? Well, per-CPU is a form of locking. Right, but that only applies for alloc_iova_fast() itself - once the CPUs have each got their distinct IOVA region, they can then pile into iommu_map() completely unhindered (and the inverse for the unmap path). So what are the actual locking rules here? We can call map/unmap concurrently but not if ... ? IOVA overlaps? Well, I think the de-facto rule is that you technically *can* make overlapping requests, but one or both may fail, and the final outcome in terms of what ends up mapped in the domain is undefined (especially if they both succeed). The only real benefit of enforcing serialisation would be better failure behaviour in such cases, but it remains fundamentally nonsensical for callers to make contradictory requests anyway, whether concurrently or sequentially, so there doesn't seem much point in spending effort on improving support for nonsense. And we expect the iommu driver to be unable to free page table levels that have IOVA boundaries in them? I'm not entirely sure what you mean there, but in general an unmap request is expected to match some previous map request - there isn't a defined API-level behaviour for partial unmaps. They might either unmap the entire region originally mapped, or just the requested part, or might fail entirely (IIRC there was some nasty code in VFIO for detecting a particular behaviour). 
Similarly for unmapping anything that's already not mapped, some drivers treat that as a no-op, others as an error. But again, this is even further unrelated to concurrency. The simpler drivers already serialise on a per-domain lock internally, while the more performance-focused ones implement lock-free atomic pagetable management in a similar style to CPU arch code; either way it should work fine as-is. The mm has page table locks at every level and generally expects them to be held for a lot of manipulations. There are some lockless cases, but it is not as aggressive as this sounds. Oh, I've spent the last couple of weeks hacking up horrible things manipulating entries in init_mm, and never realised that that was actually the special case. Oh well, live and learn. Robin. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4 1/6] iommu: Add a per domain PASID for DMA API
Hi Baolu, On Tue, 31 May 2022 20:45:28 +0800, Baolu Lu wrote: > On 2022/5/31 18:12, Tian, Kevin wrote: > +++ b/include/linux/iommu.h > @@ -105,6 +105,8 @@ struct iommu_domain { > enum iommu_page_response_code (*iopf_handler)(struct > >> iommu_fault *fault, > void *data); > void *fault_data; > +ioasid_t pasid; /* Used for DMA requests > with PASID */ > +atomic_t pasid_users; > >>> These are poorly named, this is really the DMA API global PASID and > >>> shouldn't be used for other things. > >>> > >>> > >>> > >>> Perhaps I misunderstood, do you mind explaining more? > >> You still haven't really explained what this is for in this patch, > >> maybe it just needs a better commit message, or maybe something is > >> wrong. > >> > >> I keep saying the DMA API usage is not special, so why do we need to > >> create a new global pasid and refcount? Realistically this is only > >> going to be used by IDXD, why can't we just allocate a PASID and > >> return it to the driver every time a driver asks for DMA API on PASI > >> mode? Why does the core need to do anything special? > >> The reason why I store PASID at IOMMU domain is for IOTLB flush within the domain. Device driver is not aware of domain level IOTLB flush. We also have iova_cookie for each domain which essentially is for RIDPASID. > > Agree. I guess it was a mistake caused by treating ENQCMD as the > > only user although the actual semantics of the invented interfaces > > have already evolved to be quite general. > > > > This is very similar to what we have been discussing for iommufd. > > a PASID is just an additional routing info when attaching a device > > to an I/O address space (DMA API in this context) and by default > > it should be a per-device resource except when ENQCMD is > > explicitly opt in. 
> > > > Hence it's right time for us to develop common facility working > > for both this DMA API usage and iommufd, i.e.: > > > > for normal PASID attach to a domain, driver: > > > > allocates a local pasid from device local space; > > attaches the local pasid to a domain; > > > > for PASID attach in particular for ENQCMD, driver: > > > > allocates a global pasid in system-wide; > > attaches the global pasid to a domain; > > set the global pasid in PASID_MSR; > > > > In both cases the pasid is stored in the attach data instead of the > > domain. > > So during IOTLB flush for the domain, do we loop through the attach data? > > DMA API pasid is no special from above except it needs to allow > > one device attached to the same domain twice (one with RID > > and the other with RID+PASID). > > > > for iommufd those operations are initiated by userspace via > > iommufd uAPI. > > My understanding is that device driver owns its PASID policy. If ENQCMD > is supported on the device, the PASIDs should be allocated through > ioasid_alloc(). Otherwise, the whole PASID pool is managed by the device > driver. > It seems the changes we want for this patchset are: 1. move ioasid_alloc() from the core to device (allocation scope will be based on whether ENQCMD is intended or not) 2. store pasid in the attach data 3. use the same iommufd api to attach/set pasid on its default domain Am I summarizing correctly? > For kernel DMA w/ PASID, after the driver has a PASID for this purpose, > it can just set the default domain to the PASID on device. There's no > need for enable/disable() interfaces. > > Best regards, > baolu Thanks, Jacob ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V3 5/8] dt-bindings: Add xen,grant-dma IOMMU description for xen-grant DMA ops
On 31.05.22 14:52, Krzysztof Kozlowski wrote: Hello Krzysztof On 30/05/2022 23:00, Oleksandr Tyshchenko wrote: From: Oleksandr Tyshchenko Thank you for your patch. There is something to discuss/improve. diff --git a/Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml b/Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml new file mode 100644 index ..ab5765c --- /dev/null +++ b/Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml @@ -0,0 +1,49 @@ +# SPDX-License-Identifier: (GPL-2.0-only or BSD-2-Clause) +%YAML 1.2 +--- +$id: http://devicetree.org/schemas/iommu/xen,grant-dma.yaml# +$schema: http://devicetree.org/meta-schemas/core.yaml# + +title: Xen specific IOMMU for virtualized devices (e.g. virtio) + +maintainers: + - Stefano Stabellini + +description: + The reference to Xen specific IOMMU node using "iommus" property indicates + that Xen grant mappings need to be enabled for the device, and it specifies + the ID of the domain where the corresponding backend resides. + The binding is required to restrict memory access using Xen grant mappings. + +properties: + compatible: +const: xen,grant-dma + + '#iommu-cells': +const: 1 +description: + Xen specific IOMMU is multiple-master IOMMU device. + The single cell describes the domid (domain ID) of the domain where + the backend is running. + +required: + - compatible + - "#iommu-cells" + +additionalProperties: false + +examples: + - | +xen_iommu { No underscores in node names, generic node names, so this looks like "iommu". ok, will change +compatible = "xen,grant-dma"; +#iommu-cells = <1>; +}; + +virtio@3000 { +compatible = "virtio,mmio"; +reg = <0x3000 0x100>; +interrupts = <41>; + +/* The backend is located in Xen domain with ID 1 */ +iommus = <&xen_iommu 1>; There is no need usually to give consumer examples in provider binding. If there is nothing specific here (looks exactly like every IOMMU consumer in Linux kernel), drop the consumer. I got it. 
There is nothing specific from the device tree's perspective, I was thinking to draw attention to the IOMMU specifier (in our case, the master device's ID == backend's domain ID). But '#iommu-cells' description above already clarifies that. Will drop. +}; Best regards, Krzysztof -- Regards, Oleksandr Tyshchenko ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [git pull] IOMMU Updates for Linux v5.19
The pull request you sent on Tue, 31 May 2022 14:37:42 +0200: > git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git > tags/iommu-updates-v5.19 has been merged into torvalds/linux.git: https://git.kernel.org/torvalds/c/e1cbc3b96a9974746b2a80c3a6c8a0f7eff7b1b5 Thank you! -- Deet-doot-dot, I am a bot. https://korg.docs.kernel.org/prtracker.html
Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
On 2022-05-31 16:59, Jason Gunthorpe wrote:
> On Tue, May 31, 2022 at 02:52:28PM +0100, Robin Murphy wrote:
> > > +			break;
> > > +		pgtable_walk_level(m, phys_to_virt(phys_addr),
> >
> > Also, obligatory reminder that pfn_valid() only means that pfn_to_page()
> > gets you a valid struct page. Whether that page is direct-mapped kernel
> > memory or not is a different matter.
>
> Even though this is debugfs, if the operation is sketchy like that and
> can theoretically crash the kernel the driver should test capabilities,
> CAP_SYS_RAWIO or something may be appropriate. I don't think we have a
> better cap for 'userspace may crash the kernel'

It shouldn't be insurmountable to make this safe, it just needs a bit more than pfn_valid(), which can still return true off the ends of the memory map if they're not perfectly section-aligned, and for random reserved holes in the middle.

Robin.
Re: [PATCH 2/6] iommu/qcom: Write TCR before TTBRs to fix ASID access behavior
On 2022-05-31 16:55, Will Deacon wrote: On Fri, May 27, 2022 at 11:28:57PM +0200, Konrad Dybcio wrote: From: AngeloGioacchino Del Regno As also stated in the arm-smmu driver, we must write the TCR before writing the TTBRs, since the TCR determines the access behavior of some fields. Where is this stated in the arm-smmu driver? In arm_smmu_write_context_bank() - IIRC it's mostly about the case where if you write a 16-bit ASID to TTBR before setting TCR2.AS you might end up losing the top 8 bits of it. However, in the context of a pantomime where we just have to pretend to program the "hardware" the way the firmware has already programmed it (on pain of getting randomly reset if we look at it wrong), I can't imagine it really matters. Robin. Signed-off-by: AngeloGioacchino Del Regno Signed-off-by: Marijn Suijten Signed-off-by: Konrad Dybcio --- drivers/iommu/arm/arm-smmu/qcom_iommu.c | 12 ++-- 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/drivers/iommu/arm/arm-smmu/qcom_iommu.c b/drivers/iommu/arm/arm-smmu/qcom_iommu.c index 1728d4d7fe25..75f353866c40 100644 --- a/drivers/iommu/arm/arm-smmu/qcom_iommu.c +++ b/drivers/iommu/arm/arm-smmu/qcom_iommu.c @@ -273,18 +273,18 @@ static int qcom_iommu_init_domain(struct iommu_domain *domain, ctx->secure_init = true; } - /* TTBRs */ - iommu_writeq(ctx, ARM_SMMU_CB_TTBR0, - pgtbl_cfg.arm_lpae_s1_cfg.ttbr | - FIELD_PREP(ARM_SMMU_TTBRn_ASID, ctx->asid)); - iommu_writeq(ctx, ARM_SMMU_CB_TTBR1, 0); - /* TCR */ iommu_writel(ctx, ARM_SMMU_CB_TCR2, arm_smmu_lpae_tcr2(&pgtbl_cfg)); iommu_writel(ctx, ARM_SMMU_CB_TCR, arm_smmu_lpae_tcr(&pgtbl_cfg) | ARM_SMMU_TCR_EAE); + /* TTBRs */ + iommu_writeq(ctx, ARM_SMMU_CB_TTBR0, + pgtbl_cfg.arm_lpae_s1_cfg.ttbr | + FIELD_PREP(ARM_SMMU_TTBRn_ASID, ctx->asid)); + iommu_writeq(ctx, ARM_SMMU_CB_TTBR1, 0); I'd have thought that SCTLR.M would be clear here, so it shouldn't matter what order we write these in. 
Will ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
On Tue, May 31, 2022 at 05:01:46PM +0100, Robin Murphy wrote: > The DMA API doesn't need locking, partly since it can trust itself not to do > stupid things, and mostly because it's DMA API performance that's > fundamentally incompatible with serialisation anyway. Why do you think we > have a complicated per-CPU IOVA caching mechanism, if not to support big > multi-queue devices with multiple CPU threads mapping/unmapping in different > parts of the same DMA domain concurrently? Well, per-CPU is a form of locking. So what are the actual locking rules here? We can call map/unmap concurrently but not if ... ? IOVA overlaps? And we expect the iommu driver to be unable to free page table levels that have IOVA boundaries in them? > The simpler drivers already serialise on a per-domain lock internally, while > the more performance-focused ones implement lock-free atomic pagetable > management in a similar style to CPU arch code; either way it should work > fine as-is. The mm has page table locks at every level and generally expects them to be held for a lot of manipulations. There are some lockless cases, but it is not as aggressive as this sounds. > The difference with debugfs is that it's a completely orthogonal > side-channel - an iommu_domain user like VFIO or iommu-dma can make sure its > *own* API usage is sane, but can't be aware of the user triggering some > driver-internal introspection of that domain in a manner that could race > more harmfully. The mm solution to this problem is to RCU free the page table levels. This way something like debugfs can read a page table under RCU completely safely, though incoherently, and there is no performance cost on the map/unmap fast path side. Today struct page has a rcu_head that can be used to rcu free it, so it costs nothing. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 1/6] iommu/qcom: Use the asid read from device-tree if specified
On Tue, May 31, 2022 at 09:15:22AM -0700, Rob Clark wrote: > On Tue, May 31, 2022 at 8:46 AM Will Deacon wrote: > > > > On Fri, May 27, 2022 at 11:28:56PM +0200, Konrad Dybcio wrote: > > > From: AngeloGioacchino Del Regno > > > > > > > > > As specified in this driver, the context banks are 0x1000 apart. > > > Problem is that sometimes the context number (our asid) does not > > > match this logic and we end up using the wrong one: this starts > > > being a problem in the case that we need to send TZ commands > > > to do anything on a specific context. > > > > I don't understand this. The ASID is a software construct, so it shouldn't > > matter what we use. If it does matter, then please can you explain why? The > > fact that the context banks are 0x1000 apart seems unrelated. > > I think the connection is that mapping from ctx bank to ASID is 1:1 But in what sense? How is the ASID used beyond a tag in the TLB? The commit message hints at "TZ commands" being a problem. I'm not doubting that this is needed to make the thing work, I just don't understand why. Will ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 1/6] iommu/qcom: Use the asid read from device-tree if specified
On Tue, May 31, 2022 at 8:46 AM Will Deacon wrote: > > On Fri, May 27, 2022 at 11:28:56PM +0200, Konrad Dybcio wrote: > > From: AngeloGioacchino Del Regno > > > > As specified in this driver, the context banks are 0x1000 apart. > > Problem is that sometimes the context number (our asid) does not > > match this logic and we end up using the wrong one: this starts > > being a problem in the case that we need to send TZ commands > > to do anything on a specific context. > > I don't understand this. The ASID is a software construct, so it shouldn't > matter what we use. If it does matter, then please can you explain why? The > fact that the context banks are 0x1000 apart seems unrelated. I think the connection is that mapping from ctx bank to ASID is 1:1 BR, -R ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4 1/6] iommu: Add a per domain PASID for DMA API
On Tue, May 31, 2022 at 08:45:28PM +0800, Baolu Lu wrote: > My understanding is that device driver owns its PASID policy. I'm not sure this is actually useful, the core code should provide the allocator as I can't think of any device that actually needs a special allocator. > If ENQCMD is supported on the device, the PASIDs should be allocated > through ioasid_alloc(). Otherwise, the whole PASID pool is managed > by the device driver. This is OK only if we know in-advance that the device needs a global PASID. ENQCMD is one reason, maybe there are others down the road. Better to make this core code policy. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
On 2022-05-31 16:13, Jason Gunthorpe wrote: On Tue, May 31, 2022 at 04:01:04PM +0100, Robin Murphy wrote: On 2022-05-31 15:53, Jason Gunthorpe wrote: On Tue, May 31, 2022 at 10:11:18PM +0800, Baolu Lu wrote: On 2022/5/31 21:10, Jason Gunthorpe wrote: On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote: For case 2, it is a bit weird. I tried to add a rwsem lock to make the iommu_unmap() and dumping tables in debugfs exclusive. This does not work because debugfs may depend on the DMA of the devices to work. It seems that what we can do is to allow this race, but when we traverse the page table in debugfs, we will check the validity of the physical address retrieved from the page table entry. Then, the worst case is to print some useless information. Sounds horrible, don't you have locking around the IOPTEs of some kind? How does updating them work reliably? There's no locking around updating the IOPTEs. The basic assumption is that at any time, there's only a single thread manipulating the mappings of the range specified in iommu_map/unmap() APIs. Therefore, the race only exists when multiple ranges share some high-level IOPTEs. The IOMMU driver updates those IOPTEs using the compare-and-exchange atomic operation. Oh? Did I miss where that was documented as part of the iommu API? Daniel posted patches for VFIO to multi-thread iommu_domin mapping. iommufd goes out of its way to avoid this kind of serialization so that userspace can parallel map IOVA. I think if this is the requirement then the iommu API needs to provide a lock around the domain for the driver.. 
Eww, no, we can't kill performance by forcing serialisation on the entire API just for one silly driver-internal debugfs corner :( I'm not worried about debugfs, I'm worried about these efforts to speed up VFIO VM booting by parallel domain loading: https://lore.kernel.org/kvm/20220106004656.126790-1-daniel.m.jor...@oracle.com/ The DMA API should maintain its own external lock, but the general domain interface to the rest of the kernel should be safe, IMHO. The DMA API doesn't need locking, partly since it can trust itself not to do stupid things, and mostly because it's DMA API performance that's fundamentally incompatible with serialisation anyway. Why do you think we have a complicated per-CPU IOVA caching mechanism, if not to support big multi-queue devices with multiple CPU threads mapping/unmapping in different parts of the same DMA domain concurrently? Or at least it should be documented.. As far as I'm aware there's never been any restriction at the IOMMU API level. It should be self-evident that racing concurrent iommu_{map,unmap} requests against iommu_domain_free(), or against each other for overlapping IOVAs, is a bad idea and don't do that if you want a well-defined outcome. The simpler drivers already serialise on a per-domain lock internally, while the more performance-focused ones implement lock-free atomic pagetable management in a similar style to CPU arch code; either way it should work fine as-is. The difference with debugfs is that it's a completely orthogonal side-channel - an iommu_domain user like VFIO or iommu-dma can make sure its *own* API usage is sane, but can't be aware of the user triggering some driver-internal introspection of that domain in a manner that could race more harmfully. It has to be down to individual drivers to make that safe if they choose to expose such an interface. Robin. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
On Tue, May 31, 2022 at 02:52:28PM +0100, Robin Murphy wrote:
> > +			break;
> > +		pgtable_walk_level(m, phys_to_virt(phys_addr),
>
> Also, obligatory reminder that pfn_valid() only means that pfn_to_page()
> gets you a valid struct page. Whether that page is direct-mapped kernel
> memory or not is a different matter.

Even though this is debugfs, if the operation is sketchy like that and can theoretically crash the kernel the driver should test capabilities, CAP_SYS_RAWIO or something may be appropriate.

I don't think we have a better cap for 'userspace may crash the kernel'

Jason
Re: [PATCH 2/6] iommu/qcom: Write TCR before TTBRs to fix ASID access behavior
On Fri, May 27, 2022 at 11:28:57PM +0200, Konrad Dybcio wrote: > From: AngeloGioacchino Del Regno > > As also stated in the arm-smmu driver, we must write the TCR before > writing the TTBRs, since the TCR determines the access behavior of > some fields. Where is this stated in the arm-smmu driver? > > Signed-off-by: AngeloGioacchino Del Regno > > Signed-off-by: Marijn Suijten > Signed-off-by: Konrad Dybcio > --- > drivers/iommu/arm/arm-smmu/qcom_iommu.c | 12 ++-- > 1 file changed, 6 insertions(+), 6 deletions(-) > > diff --git a/drivers/iommu/arm/arm-smmu/qcom_iommu.c > b/drivers/iommu/arm/arm-smmu/qcom_iommu.c > index 1728d4d7fe25..75f353866c40 100644 > --- a/drivers/iommu/arm/arm-smmu/qcom_iommu.c > +++ b/drivers/iommu/arm/arm-smmu/qcom_iommu.c > @@ -273,18 +273,18 @@ static int qcom_iommu_init_domain(struct iommu_domain > *domain, > ctx->secure_init = true; > } > > - /* TTBRs */ > - iommu_writeq(ctx, ARM_SMMU_CB_TTBR0, > - pgtbl_cfg.arm_lpae_s1_cfg.ttbr | > - FIELD_PREP(ARM_SMMU_TTBRn_ASID, ctx->asid)); > - iommu_writeq(ctx, ARM_SMMU_CB_TTBR1, 0); > - > /* TCR */ > iommu_writel(ctx, ARM_SMMU_CB_TCR2, > arm_smmu_lpae_tcr2(&pgtbl_cfg)); > iommu_writel(ctx, ARM_SMMU_CB_TCR, >arm_smmu_lpae_tcr(&pgtbl_cfg) | ARM_SMMU_TCR_EAE); > > + /* TTBRs */ > + iommu_writeq(ctx, ARM_SMMU_CB_TTBR0, > + pgtbl_cfg.arm_lpae_s1_cfg.ttbr | > + FIELD_PREP(ARM_SMMU_TTBRn_ASID, ctx->asid)); > + iommu_writeq(ctx, ARM_SMMU_CB_TTBR1, 0); I'd have thought that SCTLR.M would be clear here, so it shouldn't matter what order we write these in. Will ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH RFC 10/19] iommu/amd: Add unmap_read_dirty() support
On 5/31/22 13:39, Suravee Suthikulpanit wrote: > On 4/29/22 4:09 AM, Joao Martins wrote: >> AMD implementation of unmap_read_dirty() is pretty simple as >> mostly reuses unmap code with the extra addition of marshalling >> the dirty bit into the bitmap as it walks the to-be-unmapped >> IOPTE. >> >> Extra care is taken though, to switch over to cmpxchg as opposed >> to a non-serialized store to the PTE and testing the dirty bit >> only set until cmpxchg succeeds to set to 0. >> >> Signed-off-by: Joao Martins >> --- >> drivers/iommu/amd/io_pgtable.c | 44 +- >> drivers/iommu/amd/iommu.c | 22 + >> 2 files changed, 60 insertions(+), 6 deletions(-) >> >> diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c >> index 8325ef193093..1868c3b58e6d 100644 >> --- a/drivers/iommu/amd/io_pgtable.c >> +++ b/drivers/iommu/amd/io_pgtable.c >> @@ -355,6 +355,16 @@ static void free_clear_pte(u64 *pte, u64 pteval, struct >> list_head *freelist) >> free_sub_pt(pt, mode, freelist); >> } >> >> +static bool free_pte_dirty(u64 *pte, u64 pteval) > > Nitpick: Since we free and clearing the dirty bit, should we change > the function name to free_clear_pte_dirty()? > We free and *read* the dirty bit. It just so happens that we clear dirty bit and every other one in the process. Just to make sure that I am not clear the dirty bit explicitly (like the read_and_clear_dirty()) >> +{ >> +bool dirty = false; >> + >> +while (IOMMU_PTE_DIRTY(cmpxchg64(pte, pteval, 0))) > > We should use 0ULL instead of 0. > ack. >> +dirty = true; >> + >> +return dirty; >> +} >> + > > Actually, what do you think if we enhance the current free_clear_pte() > to also handle the check dirty as well? > See further below, about dropping this patch. >> /* >>* Generic mapping functions. It maps a physical address into a DMA >>* address space. It allocates the page table pages if necessary. 
>> @@ -428,10 +438,11 @@ static int iommu_v1_map_page(struct io_pgtable_ops >> *ops, unsigned long iova, >> return ret; >> } >> >> -static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops, >> - unsigned long iova, >> - size_t size, >> - struct iommu_iotlb_gather *gather) >> +static unsigned long __iommu_v1_unmap_page(struct io_pgtable_ops *ops, >> + unsigned long iova, >> + size_t size, >> + struct iommu_iotlb_gather *gather, >> + struct iommu_dirty_bitmap *dirty) >> { >> struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops); >> unsigned long long unmapped; >> @@ -445,11 +456,15 @@ static unsigned long iommu_v1_unmap_page(struct >> io_pgtable_ops *ops, >> while (unmapped < size) { >> pte = fetch_pte(pgtable, iova, &unmap_size); >> if (pte) { >> -int i, count; >> +unsigned long i, count; >> +bool pte_dirty = false; >> >> count = PAGE_SIZE_PTE_COUNT(unmap_size); >> for (i = 0; i < count; i++) >> -pte[i] = 0ULL; >> +pte_dirty |= free_pte_dirty(&pte[i], pte[i]); >> + > > Actually, what if we change the existing free_clear_pte() to > free_and_clear_dirty_pte(), > and incorporate the logic for > Likewise, but otherwise it would be a good idea. >> ... >> diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c >> index 0a86392b2367..a8fcb6e9a684 100644 >> --- a/drivers/iommu/amd/iommu.c >> +++ b/drivers/iommu/amd/iommu.c >> @@ -2144,6 +2144,27 @@ static size_t amd_iommu_unmap(struct iommu_domain >> *dom, unsigned long iova, >> return r; >> } >> >> +static size_t amd_iommu_unmap_read_dirty(struct iommu_domain *dom, >> + unsigned long iova, size_t page_size, >> + struct iommu_iotlb_gather *gather, >> + struct iommu_dirty_bitmap *dirty) >> +{ >> +struct protection_domain *domain = to_pdomain(dom); >> +struct io_pgtable_ops *ops = &domain->iop.iop.ops; >> +size_t r; >> + >> +if ((amd_iommu_pgtable == AMD_IOMMU_V1) && >> +(domain->iop.mode == PAGE_MODE_NONE)) >> +return 0; >> + >> +r = (ops->unmap_read_dirty) ? 
>> +ops->unmap_read_dirty(ops, iova, page_size, gather, dirty) : 0; >> + >> +amd_iommu_iotlb_gather_add_page(dom, gather, iova, page_size); >> + >> +return r; >> +} >> + > > Instead of creating a new function, what if we enhance the current > amd_iommu_unmap() > to also handle read dirty part as well (e.g. __amd_iommu_unmap_read_dirty()), > and > then both amd_iommu_unmap() and amd_io
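The retry loop under discussion can be modelled in user space with the GCC/Clang atomic builtins. This is only a sketch of the idea — the dirty-bit position and the helper name mirror the patch but are not the real AMD IOPTE layout, and `__atomic_compare_exchange_n` stands in for the kernel's `cmpxchg64()`:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define IOMMU_PTE_DIRTY_BIT (1ULL << 6)   /* assumed bit position, illustrative only */

/* Atomically tear down a PTE and report whether its dirty bit was set.
 * Retrying the compare-and-exchange (rather than doing a plain store)
 * means a dirty bit set by hardware between our read and our clear
 * cannot be lost. */
static bool free_pte_dirty(uint64_t *pte, uint64_t pteval)
{
    uint64_t old = pteval;

    while (!__atomic_compare_exchange_n(pte, &old, 0ULL, false,
                                        __ATOMIC_SEQ_CST, __ATOMIC_SEQ_CST))
        ;   /* on failure, 'old' is refreshed with the current PTE value */

    return (old & IOMMU_PTE_DIRTY_BIT) != 0;
}
```

One subtlety worth double-checking in the real patch: a failed exchange must retry against the value it just read back, otherwise the loop can spin forever on a stale expected value.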
Re: [PATCH 1/6] iommu/qcom: Use the asid read from device-tree if specified
On Fri, May 27, 2022 at 11:28:56PM +0200, Konrad Dybcio wrote:
> From: AngeloGioacchino Del Regno
>
> As specified in this driver, the context banks are 0x1000 apart.
> Problem is that sometimes the context number (our asid) does not
> match this logic and we end up using the wrong one: this starts
> being a problem in the case that we need to send TZ commands
> to do anything on a specific context.

I don't understand this. The ASID is a software construct, so it
shouldn't matter what we use. If it does matter, then please can you
explain why? The fact that the context banks are 0x1000 apart seems
unrelated.

Will
Re: [PATCH RFC 09/19] iommu/amd: Access/Dirty bit support in IOPTEs
On 5/31/22 12:34, Suravee Suthikulpanit wrote: > Joao, > > On 4/29/22 4:09 AM, Joao Martins wrote: >> . >> +static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain, >> +bool enable) >> +{ >> +struct protection_domain *pdomain = to_pdomain(domain); >> +struct iommu_dev_data *dev_data; >> +bool dom_flush = false; >> + >> +if (!amd_iommu_had_support) >> +return -EOPNOTSUPP; >> + >> +list_for_each_entry(dev_data, &pdomain->dev_list, list) { > > Since we iterate through device list for the domain, we would need to > call spin_lock_irqsave(&pdomain->lock, flags) here. > Ugh, yes. Will fix. >> +struct amd_iommu *iommu; >> +u64 pte_root; >> + >> +iommu = amd_iommu_rlookup_table[dev_data->devid]; >> +pte_root = amd_iommu_dev_table[dev_data->devid].data[0]; >> + >> +/* No change? */ >> +if (!(enable ^ !!(pte_root & DTE_FLAG_HAD))) >> +continue; >> + >> +pte_root = (enable ? >> +pte_root | DTE_FLAG_HAD : pte_root & ~DTE_FLAG_HAD); >> + >> +/* Flush device DTE */ >> +amd_iommu_dev_table[dev_data->devid].data[0] = pte_root; >> +device_flush_dte(dev_data); >> +dom_flush = true; >> +} >> + >> +/* Flush IOTLB to mark IOPTE dirty on the next translation(s) */ >> +if (dom_flush) { >> +unsigned long flags; >> + >> +spin_lock_irqsave(&pdomain->lock, flags); >> +amd_iommu_domain_flush_tlb_pde(pdomain); >> +amd_iommu_domain_flush_complete(pdomain); >> +spin_unlock_irqrestore(&pdomain->lock, flags); >> +} > > And call spin_unlock_irqrestore(&pdomain->lock, flags); here. ack Additionally, something that I am thinking for v2 was going to have @had bool field in iommu_dev_data. That would align better with the rest of amd iommu code rather than me introducing this pattern of using hardware location of PTE roots. Let me know if you disagree. 
>> + >> +return 0; >> +} >> + >> +static bool amd_iommu_get_dirty_tracking(struct iommu_domain *domain) >> +{ >> +struct protection_domain *pdomain = to_pdomain(domain); >> +struct iommu_dev_data *dev_data; >> +u64 dte; >> + > > Also call spin_lock_irqsave(&pdomain->lock, flags) here > ack >> +list_for_each_entry(dev_data, &pdomain->dev_list, list) { >> +dte = amd_iommu_dev_table[dev_data->devid].data[0]; >> +if (!(dte & DTE_FLAG_HAD)) >> +return false; >> +} >> + > > And call spin_unlock_irqsave(&pdomain->lock, flags) here > ack Same comment as I was saying above, and replace the @dte checking to just instead check this new variable. >> +return true; >> +} >> + >> +static int amd_iommu_read_and_clear_dirty(struct iommu_domain *domain, >> + unsigned long iova, size_t size, >> + struct iommu_dirty_bitmap *dirty) >> +{ >> +struct protection_domain *pdomain = to_pdomain(domain); >> +struct io_pgtable_ops *ops = &pdomain->iop.iop.ops; >> + >> +if (!amd_iommu_get_dirty_tracking(domain)) >> +return -EOPNOTSUPP; >> + >> +if (!ops || !ops->read_and_clear_dirty) >> +return -ENODEV; > > We move this check before the amd_iommu_get_dirty_tracking(). > Yeap, better fail earlier. > Best Regards, > Suravee > >> + >> +return ops->read_and_clear_dirty(ops, iova, size, dirty); >> +} >> + >> + >> static void amd_iommu_get_resv_regions(struct device *dev, >> struct list_head *head) >> { >> @@ -2293,6 +2368,8 @@ const struct iommu_ops amd_iommu_ops = { >> .flush_iotlb_all = amd_iommu_flush_iotlb_all, >> .iotlb_sync = amd_iommu_iotlb_sync, >> .free = amd_iommu_domain_free, >> +.set_dirty_tracking = amd_iommu_set_dirty_tracking, >> +.read_and_clear_dirty = amd_iommu_read_and_clear_dirty, >> } >> }; >> ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
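The "No change?" test discussed above — flip the HAD flag only when it actually changes, and flush only then — can be sketched as a tiny pure function. The bit position here is an assumption for illustration, not the real AMD device-table layout:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define DTE_FLAG_HAD (1ULL << 7)   /* illustrative bit, not the real DTE layout */

/* Update the hardware access/dirty (HAD) flag in a device table entry
 * root and report whether a DTE flush is needed. Entries whose flag did
 * not actually flip are skipped, which is what the patch's
 * "if (!(enable ^ !!(pte_root & DTE_FLAG_HAD))) continue;" achieves. */
static uint64_t dte_set_had(uint64_t pte_root, bool enable, bool *needs_flush)
{
    *needs_flush = (!!(pte_root & DTE_FLAG_HAD) != enable);
    return enable ? (pte_root | DTE_FLAG_HAD)
                  : (pte_root & ~DTE_FLAG_HAD);
}
```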
Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
On Tue, May 31, 2022 at 04:01:04PM +0100, Robin Murphy wrote: > On 2022-05-31 15:53, Jason Gunthorpe wrote: > > On Tue, May 31, 2022 at 10:11:18PM +0800, Baolu Lu wrote: > > > On 2022/5/31 21:10, Jason Gunthorpe wrote: > > > > On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote: > > > > > > > > > For case 2, it is a bit weird. I tried to add a rwsem lock to make the > > > > > iommu_unmap() and dumping tables in debugfs exclusive. This does not > > > > > work because debugfs may depend on the DMA of the devices to work. It > > > > > seems that what we can do is to allow this race, but when we traverse > > > > > the page table in debugfs, we will check the validity of the physical > > > > > address retrieved from the page table entry. Then, the worst case is > > > > > to > > > > > print some useless information. > > > > > > > > Sounds horrible, don't you have locking around the IOPTEs of some > > > > kind? How does updating them work reliably? > > > > > > There's no locking around updating the IOPTEs. The basic assumption is > > > that at any time, there's only a single thread manipulating the mappings > > > of the range specified in iommu_map/unmap() APIs. Therefore, the race > > > only exists when multiple ranges share some high-level IOPTEs. The IOMMU > > > driver updates those IOPTEs using the compare-and-exchange atomic > > > operation. > > > > Oh? Did I miss where that was documented as part of the iommu API? > > > > Daniel posted patches for VFIO to multi-thread iommu_domin mapping. > > > > iommufd goes out of its way to avoid this kind of serialization so > > that userspace can parallel map IOVA. > > > > I think if this is the requirement then the iommu API needs to > > provide a lock around the domain for the driver.. 
> Eww, no, we can't kill performance by forcing serialisation on the
> entire API just for one silly driver-internal debugfs corner :(

I'm not worried about debugfs, I'm worried about these efforts to speed
up VFIO VM booting by parallel domain loading:

https://lore.kernel.org/kvm/20220106004656.126790-1-daniel.m.jor...@oracle.com/

The DMA API should maintain its own external lock, but the general
domain interface to the rest of the kernel should be safe, IMHO. Or at
least it should be documented..

Jason
Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
On 2022-05-31 15:53, Jason Gunthorpe wrote: On Tue, May 31, 2022 at 10:11:18PM +0800, Baolu Lu wrote: On 2022/5/31 21:10, Jason Gunthorpe wrote: On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote: For case 2, it is a bit weird. I tried to add a rwsem lock to make the iommu_unmap() and dumping tables in debugfs exclusive. This does not work because debugfs may depend on the DMA of the devices to work. It seems that what we can do is to allow this race, but when we traverse the page table in debugfs, we will check the validity of the physical address retrieved from the page table entry. Then, the worst case is to print some useless information. Sounds horrible, don't you have locking around the IOPTEs of some kind? How does updating them work reliably? There's no locking around updating the IOPTEs. The basic assumption is that at any time, there's only a single thread manipulating the mappings of the range specified in iommu_map/unmap() APIs. Therefore, the race only exists when multiple ranges share some high-level IOPTEs. The IOMMU driver updates those IOPTEs using the compare-and-exchange atomic operation. Oh? Did I miss where that was documented as part of the iommu API? Daniel posted patches for VFIO to multi-thread iommu_domin mapping. iommufd goes out of its way to avoid this kind of serialization so that userspace can parallel map IOVA. I think if this is the requirement then the iommu API needs to provide a lock around the domain for the driver.. Eww, no, we can't kill performance by forcing serialisation on the entire API just for one silly driver-internal debugfs corner :( Robin. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
On Tue, May 31, 2022 at 10:11:18PM +0800, Baolu Lu wrote: > On 2022/5/31 21:10, Jason Gunthorpe wrote: > > On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote: > > > > > For case 2, it is a bit weird. I tried to add a rwsem lock to make the > > > iommu_unmap() and dumping tables in debugfs exclusive. This does not > > > work because debugfs may depend on the DMA of the devices to work. It > > > seems that what we can do is to allow this race, but when we traverse > > > the page table in debugfs, we will check the validity of the physical > > > address retrieved from the page table entry. Then, the worst case is to > > > print some useless information. > > > > Sounds horrible, don't you have locking around the IOPTEs of some > > kind? How does updating them work reliably? > > There's no locking around updating the IOPTEs. The basic assumption is > that at any time, there's only a single thread manipulating the mappings > of the range specified in iommu_map/unmap() APIs. Therefore, the race > only exists when multiple ranges share some high-level IOPTEs. The IOMMU > driver updates those IOPTEs using the compare-and-exchange atomic > operation. Oh? Did I miss where that was documented as part of the iommu API? Daniel posted patches for VFIO to multi-thread iommu_domin mapping. iommufd goes out of its way to avoid this kind of serialization so that userspace can parallel map IOVA. I think if this is the requirement then the iommu API needs to provide a lock around the domain for the driver.. Jason ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
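The compare-and-exchange scheme Baolu describes — concurrent map() calls racing to populate a shared higher-level IOPTE — commonly takes this shape: each contender allocates a candidate table and tries to publish it with a single atomic exchange, and the loser frees its copy. A user-space C11 sketch (names and table size are illustrative, not the actual VT-d driver code):

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdlib.h>

/* Publish a next-level table into an empty slot with compare-and-exchange.
 * Exactly one caller wins; a loser frees its candidate and continues with
 * the winner's table, so no lock is needed around the IOPTE itself. */
static void *install_subtable(_Atomic uintptr_t *slot)
{
    void *candidate = calloc(512, sizeof(uint64_t));
    uintptr_t expected = 0;

    if (!atomic_compare_exchange_strong(slot, &expected,
                                        (uintptr_t)candidate)) {
        free(candidate);            /* lost the race */
        return (void *)expected;    /* 'expected' now holds the winner */
    }
    return candidate;
}
```

This is why the race Baolu mentions only exists when ranges share a high-level entry: disjoint subtrees never touch the same slot.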
RE: [RFC PATCH V2 1/2] swiotlb: Add Child IO TLB mem support
From: h...@lst.de Sent: Tuesday, May 31, 2022 12:17 AM > > On Mon, May 30, 2022 at 01:52:37AM +, Michael Kelley (LINUX) wrote: > > B) The contents of the memory buffer must transition between > > encrypted and not encrypted. The hardware doesn't provide > > any mechanism to do such a transition "in place". The only > > way to transition is for the CPU to copy the contents between > > an encrypted area and an unencrypted area of memory. > > > > Because of (B), we're stuck needing bounce buffers. There's no > > way to avoid them with the current hardware. Tianyu also pointed > > out not wanting to expose uninitialized guest memory to the host, > > so, for example, sharing a read buffer with the host requires that > > it first be initialized to zero. > > Ok, B is a deal breaker. I just brought this in because I've received > review comments that state bouncing is just the easiest option for > now and we could map it into the hypervisor in the future. But at > least for SEV that does not seem like an option without hardware > changes. > > > We should reset and make sure we agree on the top-level approach. > > 1) We want general scalability improvements to swiotlb. These > > improvements should scale to high CPUs counts (> 100) and for > > multiple NUMA nodes. > > 2) Drivers should not require any special knowledge of swiotlb to > > benefit from the improvements. No new swiotlb APIs should be > > need to be used by drivers -- the swiotlb scalability improvements > > should be transparent. > > 3) The scalability improvements should not be based on device > > boundaries since a single device may have multiple channels > > doing bounce buffering on multiple CPUs in parallel. > > Agreed to all counts. > > > The patch from Andi Kleen [3] (not submitted upstream, but referenced > > by Tianyu as the basis for his patches) seems like a good starting point > > for meeting the top-level approach. 
> Yes, I think doing per-cpu and/or per-node scaling sounds like the
> right general approach. Why was this never sent out?

I'll defer to Andi on what his thinking/plan is for this patch.

>> Andi and Robin had some
>> back-and-forth about Andi's patch that I haven't delved into yet, but
>> getting that worked out seems like a better overall approach. I had
>> an offline chat with Tianyu, and he would agree as well.
>
> Where was this discussion?

I was just referring to this thread that you are already on:

https://lore.kernel.org/lkml/e7b644f0-6c90-fe99-792d-75c38505d...@arm.com/

Michael
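For readers following along, the per-CPU scaling idea amounts to replacing swiotlb's single contended pool with an array of pools indexed by CPU. The sketch below is a guess at the general shape, not Andi Kleen's actual (unposted) patch; the pool count and selection policy are assumptions:

```c
#include <assert.h>
#include <stddef.h>

#define NR_POOLS 8   /* assumed; a real series might size this per NUMA node */

struct tlb_pool {
    size_t used;     /* per-pool allocation state; the per-pool lock is elided */
};

static struct tlb_pool pools[NR_POOLS];

/* Pick a bounce-buffer pool by CPU id so that concurrent dma_map calls
 * on different CPUs contend on different pools (and different locks). */
static struct tlb_pool *pool_for_cpu(unsigned int cpu)
{
    return &pools[cpu % NR_POOLS];
}
```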
Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
On 2022/5/31 21:10, Jason Gunthorpe wrote:
> On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote:
>> For case 2, it is a bit weird. I tried to add a rwsem lock to make the
>> iommu_unmap() and dumping tables in debugfs exclusive. This does not
>> work because debugfs may depend on the DMA of the devices to work. It
>> seems that what we can do is to allow this race, but when we traverse
>> the page table in debugfs, we will check the validity of the physical
>> address retrieved from the page table entry. Then, the worst case is
>> to print some useless information.
>
> Sounds horrible, don't you have locking around the IOPTEs of some
> kind? How does updating them work reliably?

There's no locking around updating the IOPTEs. The basic assumption is
that at any time, there's only a single thread manipulating the mappings
of the range specified in the iommu_map/unmap() APIs. Therefore, the
race only exists when multiple ranges share some high-level IOPTEs. The
IOMMU driver updates those IOPTEs using a compare-and-exchange atomic
operation.

> It is just debugfs, so maybe it is not the end of the world, but
> still..

Fair enough. This is somewhat similar to how the IOMMU hardware can
traverse the page table at any time, regardless of when the CPUs update
it: the hardware generates faults when it encounters a failure during
the page table walk. Similarly, perhaps debugfs could dump all-ones for
an invalid IOPTE?

Best regards,
baolu
Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
On 2022-05-31 04:02, Baolu Lu wrote: On 2022/5/30 20:14, Jason Gunthorpe wrote: On Sun, May 29, 2022 at 01:14:46PM +0800, Baolu Lu wrote: From 1e87b5df40c6ce9414cdd03988c3b52bfb17af5f Mon Sep 17 00:00:00 2001 From: Lu Baolu Date: Sun, 29 May 2022 10:18:56 +0800 Subject: [PATCH 1/1] iommu/vt-d: debugfs: Remove device_domain_lock usage The domain_translation_struct debugfs node is used to dump static mappings of PCI devices. It potentially races with setting new domains to devices and the iommu_map/unmap() interfaces. The existing code tries to use the global spinlock device_domain_lock to avoid the races, but this is problematical as this lock is only used to protect the device tracking lists of the domains. Instead of using an immature lock to cover up the problem, it's better to explicitly restrict the use of this debugfs node. This also makes device_domain_lock static. What does "explicitly restrict" mean? I originally thought about adding restrictions on this interface to a document. But obviously, this is a naive idea. :-) I went over the code again. The races exist in two paths: 1. Dump the page table in use while setting a new page table to the device. 2. A high-level page table entry has been marked as non-present, but the dumping code has walked down to the low-level tables. For case 1, we can try to solve it by dumping tables while holding the group->mutex. For case 2, it is a bit weird. I tried to add a rwsem lock to make the iommu_unmap() and dumping tables in debugfs exclusive. This does not work because debugfs may depend on the DMA of the devices to work. It seems that what we can do is to allow this race, but when we traverse the page table in debugfs, we will check the validity of the physical address retrieved from the page table entry. Then, the worst case is to print some useless information. 
The real code looks like this: From 3feb0727f9d7095729ef75ab1967270045b3a38c Mon Sep 17 00:00:00 2001 From: Lu Baolu Date: Sun, 29 May 2022 10:18:56 +0800 Subject: [PATCH 1/1] iommu/vt-d: debugfs: Remove device_domain_lock usage The domain_translation_struct debugfs node is used to dump the DMAR page tables for the PCI devices. It potentially races with setting domains to devices and the iommu_unmap() interface. The existing code uses a global spinlock device_domain_lock to avoid the races, but this is problematical as this lock is only used to protect the device tracking lists of each domain. This replaces device_domain_lock with group->mutex to protect the traverse of the page tables from setting a new domain and always check the physical address retrieved from the page table entry before traversing to the next- level page table. As a cleanup, this also makes device_domain_lock static. Signed-off-by: Lu Baolu --- drivers/iommu/intel/debugfs.c | 42 ++- drivers/iommu/intel/iommu.c | 2 +- drivers/iommu/intel/iommu.h | 1 - 3 files changed, 27 insertions(+), 18 deletions(-) diff --git a/drivers/iommu/intel/debugfs.c b/drivers/iommu/intel/debugfs.c index d927ef10641b..e6f4835b8d9f 100644 --- a/drivers/iommu/intel/debugfs.c +++ b/drivers/iommu/intel/debugfs.c @@ -333,25 +333,28 @@ static void pgtable_walk_level(struct seq_file *m, struct dma_pte *pde, continue; path[level] = pde->val; - if (dma_pte_superpage(pde) || level == 1) + if (dma_pte_superpage(pde) || level == 1) { dump_page_info(m, start, path); - else - pgtable_walk_level(m, phys_to_virt(dma_pte_addr(pde)), + } else { + unsigned long phys_addr; + + phys_addr = (unsigned long)dma_pte_addr(pde); + if (!pfn_valid(__phys_to_pfn(phys_addr))) Given that pte_present(pde) passed just above, it was almost certainly a valid entry, so it seems unlikely that the physical address it pointed to could have disappeared in the meantime. 
If you're worried about the potential case where we've been preempted during this walk for long enough that the page has already been freed by an unmap, reallocated, and filled with someone else's data that happens to look like valid PTEs, this still isn't enough, since that data could just as well happen to look like valid physical addresses too. I imagine that if you want to safely walk pagetables concurrently with them potentially being freed, you'd probably need to get RCU involved. + break; + pgtable_walk_level(m, phys_to_virt(phys_addr), Also, obligatory reminder that pfn_valid() only means that pfn_to_page() gets you a valid struct page. Whether that page is direct-mapped kernel memory or not is a different matter. level - 1, start, path); + } path[level] = 0; } } -static int show_device_domain_translation(struct device *dev, void *data) +static int __show_device_domain_translation(struct device *dev, void *data) { struct d
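The strategy in the patch under review — check the address pulled from a PTE before descending, and stop the walk the moment the check fails — can be modelled in user space. Here an arena bounds-check stands in for pfn_valid(); every name is illustrative rather than driver code, and, as Robin notes, passing such a check does not prove the memory still holds a page table:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define ENTRIES 4

struct pt_level { uint64_t entry[ENTRIES]; };

/* Stand-in for pfn_valid(): the "physical address" is acceptable only
 * if it points inside the arena the tables are known to live in. */
static bool addr_valid(const void *p, const void *arena, size_t len)
{
    const char *c = p, *a = arena;
    return c >= a && c < a + len;
}

/* Count leaf entries, refusing to dereference a next-level pointer that
 * fails validation -- the walk stops instead of chasing stale data left
 * behind by a concurrent unmap. */
static int walk(const struct pt_level *tbl, int level,
                const void *arena, size_t len)
{
    int leaves = 0;

    for (int i = 0; i < ENTRIES; i++) {
        uint64_t e = tbl->entry[i];

        if (!e)
            continue;
        if (level == 1) {
            leaves++;
            continue;
        }
        const struct pt_level *next = (const struct pt_level *)(uintptr_t)e;
        if (!addr_valid(next, arena, len))
            break;      /* stop the traversal immediately */
        leaves += walk(next, level - 1, arena, len);
    }
    return leaves;
}
```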
Re: [PATCH 01/12] iommu/vt-d: Use iommu_get_domain_for_dev() in debugfs
On Tue, May 31, 2022 at 11:02:06AM +0800, Baolu Lu wrote:
> For case 2, it is a bit weird. I tried to add a rwsem lock to make the
> iommu_unmap() and dumping tables in debugfs exclusive. This does not
> work because debugfs may depend on the DMA of the devices to work. It
> seems that what we can do is to allow this race, but when we traverse
> the page table in debugfs, we will check the validity of the physical
> address retrieved from the page table entry. Then, the worst case is
> to print some useless information.

Sounds horrible, don't you have locking around the IOPTEs of some
kind? How does updating them work reliably?

It is just debugfs, so maybe it is not the end of the world, but
still..

Jason
Re: [PATCH v4 1/6] iommu: Add a per domain PASID for DMA API
On 2022/5/31 18:12, Tian, Kevin wrote: +++ b/include/linux/iommu.h @@ -105,6 +105,8 @@ struct iommu_domain { enum iommu_page_response_code (*iopf_handler)(struct iommu_fault *fault, void *data); void *fault_data; + ioasid_t pasid; /* Used for DMA requests with PASID */ + atomic_t pasid_users; These are poorly named, this is really the DMA API global PASID and shouldn't be used for other things. Perhaps I misunderstood, do you mind explaining more? You still haven't really explained what this is for in this patch, maybe it just needs a better commit message, or maybe something is wrong. I keep saying the DMA API usage is not special, so why do we need to create a new global pasid and refcount? Realistically this is only going to be used by IDXD, why can't we just allocate a PASID and return it to the driver every time a driver asks for DMA API on PASI mode? Why does the core need to do anything special? Agree. I guess it was a mistake caused by treating ENQCMD as the only user although the actual semantics of the invented interfaces have already evolved to be quite general. This is very similar to what we have been discussing for iommufd. a PASID is just an additional routing info when attaching a device to an I/O address space (DMA API in this context) and by default it should be a per-device resource except when ENQCMD is explicitly opt in. Hence it's right time for us to develop common facility working for both this DMA API usage and iommufd, i.e.: for normal PASID attach to a domain, driver: allocates a local pasid from device local space; attaches the local pasid to a domain; for PASID attach in particular for ENQCMD, driver: allocates a global pasid in system-wide; attaches the global pasid to a domain; set the global pasid in PASID_MSR; In both cases the pasid is stored in the attach data instead of the domain. 
> DMA API pasid is no special from above except it needs to allow one
> device attached to the same domain twice (one with RID and the other
> with RID+PASID).
>
> for iommufd those operations are initiated by userspace via iommufd
> uAPI.

My understanding is that the device driver owns its PASID policy. If
ENQCMD is supported on the device, the PASIDs should be allocated
through ioasid_alloc(). Otherwise, the whole PASID pool is managed by
the device driver.

For kernel DMA with PASID, after the driver has a PASID for this
purpose, it can just set the default domain to the PASID on the device.
There's no need for enable/disable() interfaces.

Best regards,
baolu
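The policy split described here can be sketched with a toy allocator: one shared namespace for ENQCMD-capable devices (in the kernel the global case is served by ioasid_alloc(), since the PASID lands in the per-process PASID MSR and must mean the same thing to every device), and an independent per-device namespace otherwise. The counters below are purely illustrative:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t pasid_t;

static pasid_t global_next = 1;          /* system-wide space (ENQCMD) */

struct device_pasids {
    pasid_t next;                        /* device-local space */
};

/* ENQCMD-capable devices draw from the shared space so one PASID value
 * is unique across all devices; other devices may manage their own pool,
 * and values from different local pools may legitimately collide. */
static pasid_t pasid_alloc(struct device_pasids *dev, bool enqcmd_capable)
{
    return enqcmd_capable ? global_next++ : dev->next++;
}
```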
Re: [PATCH RFC 10/19] iommu/amd: Add unmap_read_dirty() support
On 4/29/22 4:09 AM, Joao Martins wrote: AMD implementation of unmap_read_dirty() is pretty simple as mostly reuses unmap code with the extra addition of marshalling the dirty bit into the bitmap as it walks the to-be-unmapped IOPTE. Extra care is taken though, to switch over to cmpxchg as opposed to a non-serialized store to the PTE and testing the dirty bit only set until cmpxchg succeeds to set to 0. Signed-off-by: Joao Martins --- drivers/iommu/amd/io_pgtable.c | 44 +- drivers/iommu/amd/iommu.c | 22 + 2 files changed, 60 insertions(+), 6 deletions(-) diff --git a/drivers/iommu/amd/io_pgtable.c b/drivers/iommu/amd/io_pgtable.c index 8325ef193093..1868c3b58e6d 100644 --- a/drivers/iommu/amd/io_pgtable.c +++ b/drivers/iommu/amd/io_pgtable.c @@ -355,6 +355,16 @@ static void free_clear_pte(u64 *pte, u64 pteval, struct list_head *freelist) free_sub_pt(pt, mode, freelist); } +static bool free_pte_dirty(u64 *pte, u64 pteval) Nitpick: Since we free and clearing the dirty bit, should we change the function name to free_clear_pte_dirty()? +{ + bool dirty = false; + + while (IOMMU_PTE_DIRTY(cmpxchg64(pte, pteval, 0))) We should use 0ULL instead of 0. + dirty = true; + + return dirty; +} + Actually, what do you think if we enhance the current free_clear_pte() to also handle the check dirty as well? /* * Generic mapping functions. It maps a physical address into a DMA * address space. It allocates the page table pages if necessary. 
@@ -428,10 +438,11 @@ static int iommu_v1_map_page(struct io_pgtable_ops *ops, unsigned long iova, return ret; } -static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops, - unsigned long iova, - size_t size, - struct iommu_iotlb_gather *gather) +static unsigned long __iommu_v1_unmap_page(struct io_pgtable_ops *ops, + unsigned long iova, + size_t size, + struct iommu_iotlb_gather *gather, + struct iommu_dirty_bitmap *dirty) { struct amd_io_pgtable *pgtable = io_pgtable_ops_to_data(ops); unsigned long long unmapped; @@ -445,11 +456,15 @@ static unsigned long iommu_v1_unmap_page(struct io_pgtable_ops *ops, while (unmapped < size) { pte = fetch_pte(pgtable, iova, &unmap_size); if (pte) { - int i, count; + unsigned long i, count; + bool pte_dirty = false; count = PAGE_SIZE_PTE_COUNT(unmap_size); for (i = 0; i < count; i++) - pte[i] = 0ULL; + pte_dirty |= free_pte_dirty(&pte[i], pte[i]); + Actually, what if we change the existing free_clear_pte() to free_and_clear_dirty_pte(), and incorporate the logic for ... diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c index 0a86392b2367..a8fcb6e9a684 100644 --- a/drivers/iommu/amd/iommu.c +++ b/drivers/iommu/amd/iommu.c @@ -2144,6 +2144,27 @@ static size_t amd_iommu_unmap(struct iommu_domain *dom, unsigned long iova, return r; } +static size_t amd_iommu_unmap_read_dirty(struct iommu_domain *dom, +unsigned long iova, size_t page_size, +struct iommu_iotlb_gather *gather, +struct iommu_dirty_bitmap *dirty) +{ + struct protection_domain *domain = to_pdomain(dom); + struct io_pgtable_ops *ops = &domain->iop.iop.ops; + size_t r; + + if ((amd_iommu_pgtable == AMD_IOMMU_V1) && + (domain->iop.mode == PAGE_MODE_NONE)) + return 0; + + r = (ops->unmap_read_dirty) ? 
+ ops->unmap_read_dirty(ops, iova, page_size, gather, dirty) : 0; + + amd_iommu_iotlb_gather_add_page(dom, gather, iova, page_size); + + return r; +} + Instead of creating a new function, what if we enhance the current amd_iommu_unmap() to also handle read dirty part as well (e.g. __amd_iommu_unmap_read_dirty()), and then both amd_iommu_unmap() and amd_iommu_unmap_read_dirty() can call the __amd_iommu_unmap_read_dirty()? Best Regards, Suravee ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[git pull] IOMMU Updates for Linux v5.19
Hi Linus,

Apologies for the late pull request, I know you prefer the main stuff in
the first week. Some vacation and a public holiday came in between here.

So here are the IOMMU updates for 5.19. Some patches are probably
already merged via the VFIO tree, namely everything from the
vfio-notifier-fix topic branch.

Also, there will be a merge conflict in MAINTAINERS and
drivers/iommu/amd/iommu.c. The latter one is resolved by removing the
function in question, for the former I attached my resolution.

With that in mind:

The following changes since commit 42226c989789d8da4af1de0c31070c96726d990c:

  Linux 5.18-rc7 (2022-05-15 18:08:58 -0700)

are available in the Git repository at:

  git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu.git tags/iommu-updates-v5.19

for you to fetch changes up to b0dacee202efbf1a5d9f5cdfd82049e8b5b085d2:

  Merge branches 'apple/dart', 'arm/mediatek', 'arm/msm', 'arm/smmu', 'ppc/pamu', 'x86/vt-d', 'x86/amd' and 'vfio-notifier-fix' into next (2022-05-20 12:27:17 +0200)

IOMMU Updates for Linux v5.19

Including:

	- Intel VT-d driver updates
	  - Domain force snooping improvement.
	  - Cleanups, no intentional functional changes.
- ARM SMMU driver updates - Add new Qualcomm device-tree compatible strings - Add new Nvidia device-tree compatible string for Tegra234 - Fix UAF in SMMUv3 shared virtual addressing code - Force identity-mapped domains for users of ye olde SMMU legacy binding - Minor cleanups - Patches to fix a BUG_ON in the vfio_iommu_group_notifier - Groundwork for upcoming iommufd framework - Introduction of DMA ownership so that an entire IOMMU group is either controlled by the kernel or by user-space - MT8195 and MT8186 support in the Mediatek IOMMU driver - Patches to make forcing of cache-coherent DMA more coherent between IOMMU drivers - Fixes for thunderbolt device DMA protection - Various smaller fixes and cleanups Bjorn Andersson (2): dt-bindings: arm-smmu: Add compatible for Qualcomm SC8280XP iommu/arm-smmu-qcom: Add SC8280XP support Christophe Leroy (1): iommu/fsl_pamu: Prepare cleanup of powerpc's asm/prom.h Jason Gunthorpe (5): vfio: Delete the unbound_list iommu: Introduce the domain op enforce_cache_coherency() vfio: Move the Intel no-snoop control off of IOMMU_CACHE iommu: Redefine IOMMU_CAP_CACHE_COHERENCY as the cap flag for IOMMU_CACHE vfio: Require that devices support DMA cache coherence Jason Gunthorpe via iommu (1): iommu: iommu_group_claim_dma_owner() must always assign a domain Jean-Philippe Brucker (1): iommu/arm-smmu-v3-sva: Fix mm use-after-free Joerg Roedel (4): Merge tag 'arm-smmu-updates' of git://git.kernel.org/pub/scm/linux/kernel/git/will/linux into arm/smmu iommu/amd: Increase timeout waiting for GA log enablement Merge tag 'v5.18-rc7' into arm/smmu Merge branches 'apple/dart', 'arm/mediatek', 'arm/msm', 'arm/smmu', 'ppc/pamu', 'x86/vt-d', 'x86/amd' and 'vfio-notifier-fix' into next Lu Baolu (17): iommu: Add DMA ownership management interfaces driver core: Add dma_cleanup callback in bus_type amba: Stop sharing platform_dma_configure() bus: platform,amba,fsl-mc,PCI: Add device DMA ownership management PCI: pci_stub: Set driver_managed_dma PCI: 
portdrv: Set driver_managed_dma vfio: Set DMA ownership for VFIO devices vfio: Remove use of vfio_group_viable() vfio: Remove iommu group notifier iommu: Remove iommu group changes notifier iommu/vt-d: Change return type of dmar_insert_one_dev_info() iommu/vt-d: Fold dmar_insert_one_dev_info() into its caller iommu/vt-d: Size Page Request Queue to avoid overflow condition iommu/vt-d: Block force-snoop domain attaching if no SC support iommu/vt-d: Check domain force_snooping against attached devices iommu/vt-d: Remove domain_update_iommu_snooping() iommu/vt-d: Remove hard coding PGSNP bit in PASID entries Mario Limonciello (3): iommu/amd: Enable swiotlb in all cases dma-iommu: Check that swiotlb is active before trying to use it iommu/amd: Indicate whether DMA remap support is enabled Matthew Rosato (1): iommu/s390: Tolerate repeat attach_dev calls Miles Chen (1): iommu/mediatek: Fix NULL pointer dereference when printing dev_name Muhammad Usama Anjum (1): iommu/vt-d: Remove unneeded validity check on dev Rob Herring (1): dt-bindings: iommu: Drop client node in examples Robin Murphy (5): iommu: Introduce device_iommu_capable() iommu: Add capability for pre-boot DMA protection thunderbolt: Make iommu_dma_protection more accura
Re: [PATCH RFC 09/19] iommu/amd: Access/Dirty bit support in IOPTEs
Hi Suravee, On 2022/5/31 19:34, Suravee Suthikulpanit wrote: On 4/29/22 4:09 AM, Joao Martins wrote: . +static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain, + bool enable) +{ + struct protection_domain *pdomain = to_pdomain(domain); + struct iommu_dev_data *dev_data; + bool dom_flush = false; + + if (!amd_iommu_had_support) + return -EOPNOTSUPP; + + list_for_each_entry(dev_data, &pdomain->dev_list, list) { Since we iterate through device list for the domain, we would need to call spin_lock_irqsave(&pdomain->lock, flags) here. Not related, just out of curiosity: does it really need to disable interrupts while holding this lock? Is there any case where this list would be traversed in interrupt context? Perhaps I missed something? Best regards, baolu ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V3 5/8] dt-bindings: Add xen,grant-dma IOMMU description for xen-grant DMA ops
On 30/05/2022 23:00, Oleksandr Tyshchenko wrote: > From: Oleksandr Tyshchenko Thank you for your patch. There is something to discuss/improve. > diff --git a/Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml > b/Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml > new file mode 100644 > index ..ab5765c > --- /dev/null > +++ b/Documentation/devicetree/bindings/iommu/xen,grant-dma.yaml > @@ -0,0 +1,49 @@ > +# SPDX-License-Identifier: (GPL-2.0-only or BSD-2-Clause) > +%YAML 1.2 > +--- > +$id: http://devicetree.org/schemas/iommu/xen,grant-dma.yaml# > +$schema: http://devicetree.org/meta-schemas/core.yaml# > + > +title: Xen specific IOMMU for virtualized devices (e.g. virtio) > + > +maintainers: > + - Stefano Stabellini > + > +description: > + The reference to Xen specific IOMMU node using "iommus" property indicates > + that Xen grant mappings need to be enabled for the device, and it specifies > + the ID of the domain where the corresponding backend resides. > + The binding is required to restrict memory access using Xen grant mappings. > + > +properties: > + compatible: > +const: xen,grant-dma > + > + '#iommu-cells': > +const: 1 > +description: > + Xen specific IOMMU is multiple-master IOMMU device. > + The single cell describes the domid (domain ID) of the domain where > + the backend is running. > + > +required: > + - compatible > + - "#iommu-cells" > + > +additionalProperties: false > + > +examples: > + - | > +xen_iommu { No underscores in node names, and node names should be generic, so this should be "iommu". > +compatible = "xen,grant-dma"; > +#iommu-cells = <1>; > +}; > + > +virtio@3000 { > +compatible = "virtio,mmio"; > +reg = <0x3000 0x100>; > +interrupts = <41>; > + > +/* The backend is located in Xen domain with ID 1 */ > +iommus = <&xen_iommu 1>; There is usually no need to give consumer examples in a provider binding. If there is nothing specific here (this looks exactly like every IOMMU consumer in the Linux kernel), drop the consumer. 
> +}; Best regards, Krzysztof
Re: [PATCH RFC 09/19] iommu/amd: Access/Dirty bit support in IOPTEs
Joao, On 4/29/22 4:09 AM, Joao Martins wrote: . +static int amd_iommu_set_dirty_tracking(struct iommu_domain *domain, + bool enable) +{ + struct protection_domain *pdomain = to_pdomain(domain); + struct iommu_dev_data *dev_data; + bool dom_flush = false; + + if (!amd_iommu_had_support) + return -EOPNOTSUPP; + + list_for_each_entry(dev_data, &pdomain->dev_list, list) { Since we iterate through device list for the domain, we would need to call spin_lock_irqsave(&pdomain->lock, flags) here. + struct amd_iommu *iommu; + u64 pte_root; + + iommu = amd_iommu_rlookup_table[dev_data->devid]; + pte_root = amd_iommu_dev_table[dev_data->devid].data[0]; + + /* No change? */ + if (!(enable ^ !!(pte_root & DTE_FLAG_HAD))) + continue; + + pte_root = (enable ? + pte_root | DTE_FLAG_HAD : pte_root & ~DTE_FLAG_HAD); + + /* Flush device DTE */ + amd_iommu_dev_table[dev_data->devid].data[0] = pte_root; + device_flush_dte(dev_data); + dom_flush = true; + } + + /* Flush IOTLB to mark IOPTE dirty on the next translation(s) */ + if (dom_flush) { + unsigned long flags; + + spin_lock_irqsave(&pdomain->lock, flags); + amd_iommu_domain_flush_tlb_pde(pdomain); + amd_iommu_domain_flush_complete(pdomain); + spin_unlock_irqrestore(&pdomain->lock, flags); + } And call spin_unlock_irqrestore(&pdomain->lock, flags); here. 
+ + return 0; +} + +static bool amd_iommu_get_dirty_tracking(struct iommu_domain *domain) +{ + struct protection_domain *pdomain = to_pdomain(domain); + struct iommu_dev_data *dev_data; + u64 dte; + Also call spin_lock_irqsave(&pdomain->lock, flags) here + list_for_each_entry(dev_data, &pdomain->dev_list, list) { + dte = amd_iommu_dev_table[dev_data->devid].data[0]; + if (!(dte & DTE_FLAG_HAD)) + return false; + } + And call spin_unlock_irqrestore(&pdomain->lock, flags) here + return true; +} + +static int amd_iommu_read_and_clear_dirty(struct iommu_domain *domain, + unsigned long iova, size_t size, + struct iommu_dirty_bitmap *dirty) +{ + struct protection_domain *pdomain = to_pdomain(domain); + struct io_pgtable_ops *ops = &pdomain->iop.iop.ops; + + if (!amd_iommu_get_dirty_tracking(domain)) + return -EOPNOTSUPP; + + if (!ops || !ops->read_and_clear_dirty) + return -ENODEV; We should move this check before the amd_iommu_get_dirty_tracking() call. Best Regards, Suravee + + return ops->read_and_clear_dirty(ops, iova, size, dirty); +} + + static void amd_iommu_get_resv_regions(struct device *dev, struct list_head *head) { @@ -2293,6 +2368,8 @@ const struct iommu_ops amd_iommu_ops = { .flush_iotlb_all = amd_iommu_flush_iotlb_all, .iotlb_sync = amd_iommu_iotlb_sync, .free = amd_iommu_domain_free, + .set_dirty_tracking = amd_iommu_set_dirty_tracking, + .read_and_clear_dirty = amd_iommu_read_and_clear_dirty, } };
RE: [PATCH v4 1/6] iommu: Add a per domain PASID for DMA API
> From: Jason Gunthorpe > Sent: Monday, May 30, 2022 8:23 PM > > On Tue, May 24, 2022 at 08:17:27AM -0700, Jacob Pan wrote: > > Hi Jason, > > > > On Tue, 24 May 2022 10:50:34 -0300, Jason Gunthorpe > wrote: > > > > > On Wed, May 18, 2022 at 11:21:15AM -0700, Jacob Pan wrote: > > > > DMA requests tagged with PASID can target individual IOMMU domains. > > > > Introduce a domain-wide PASID for DMA API, it will be used on the > same > > > > mapping as legacy DMA without PASID. Let it be IOVA or PA in case of > > > > identity domain. > > > > > > Huh? I can't understand what this is trying to say or why this patch > > > makes sense. > > > > > > We really should not have pasid's like this attached to the domains.. > > > > > This is the same "DMA API global PASID" you reviewed in v3, I just > > singled it out as a standalone patch and renamed it. Here is your previous > > review comment. > > > > > +++ b/include/linux/iommu.h > > > @@ -105,6 +105,8 @@ struct iommu_domain { > > > enum iommu_page_response_code (*iopf_handler)(struct > iommu_fault *fault, > > > void *data); > > > void *fault_data; > > > + ioasid_t pasid; /* Used for DMA requests with PASID */ > > > + atomic_t pasid_users; > > > > These are poorly named, this is really the DMA API global PASID and > > shouldn't be used for other things. > > > > > > > > Perhaps I misunderstood, do you mind explaining more? > > You still haven't really explained what this is for in this patch, > maybe it just needs a better commit message, or maybe something is > wrong. > > I keep saying the DMA API usage is not special, so why do we need to > create a new global pasid and refcount? Realistically this is only > going to be used by IDXD, why can't we just allocate a PASID and > return it to the driver every time a driver asks for DMA API on PASID > mode? Why does the core need to do anything special? > Agree. 
I guess it was a mistake caused by treating ENQCMD as the only user, although the actual semantics of the invented interfaces have already evolved to be quite general. This is very similar to what we have been discussing for iommufd. A PASID is just additional routing info when attaching a device to an I/O address space (the DMA API in this context), and by default it should be a per-device resource, except when ENQCMD is explicitly opted in. Hence it's the right time for us to develop a common facility working for both this DMA API usage and iommufd, i.e.: for a normal PASID attach to a domain, the driver: allocates a local pasid from the device-local space; attaches the local pasid to a domain; for a PASID attach specifically for ENQCMD, the driver: allocates a global pasid from the system-wide space; attaches the global pasid to a domain; sets the global pasid in the PASID_MSR; In both cases the pasid is stored in the attach data instead of the domain. The DMA API pasid is not special compared to the above, except that it needs to allow one device to be attached to the same domain twice (once with RID and once with RID+PASID). For iommufd those operations are initiated by userspace via the iommufd uAPI. Thanks Kevin
Re: [RFC PATCH V2 1/2] swiotlb: Add Child IO TLB mem support
On Mon, May 30, 2022 at 01:52:37AM +, Michael Kelley (LINUX) wrote: > B) The contents of the memory buffer must transition between > encrypted and not encrypted. The hardware doesn't provide > any mechanism to do such a transition "in place". The only > way to transition is for the CPU to copy the contents between > an encrypted area and an unencrypted area of memory. > > Because of (B), we're stuck needing bounce buffers. There's no > way to avoid them with the current hardware. Tianyu also pointed > out not wanting to expose uninitialized guest memory to the host, > so, for example, sharing a read buffer with the host requires that > it first be initialized to zero. Ok, B is a deal breaker. I just brought this in because I've received review comments that state bouncing is just the easiest option for now and we could map it into the hypervisor in the future. But at least for SEV that does not seem like an option without hardware changes. > We should reset and make sure we agree on the top-level approach. > 1) We want general scalability improvements to swiotlb. These > improvements should scale to high CPU counts (> 100) and for > multiple NUMA nodes. > 2) Drivers should not require any special knowledge of swiotlb to > benefit from the improvements. No new swiotlb APIs should need > to be used by drivers -- the swiotlb scalability improvements > should be transparent. > 3) The scalability improvements should not be based on device > boundaries since a single device may have multiple channels > doing bounce buffering on multiple CPUs in parallel. Agreed on all counts. > The patch from Andi Kleen [3] (not submitted upstream, but referenced > by Tianyu as the basis for his patches) seems like a good starting point > for meeting the top-level approach. Yes, I think doing per-cpu and/or per-node scaling sounds like the right general approach. Why was this never sent out? 
> Andi and Robin had some > back-and-forth about Andi's patch that I haven't delved into yet, but > getting that worked out seems like a better overall approach. I had > an offline chat with Tianyu, and he would agree as well. Where was this discussion?