Re: A problem of Intel IOMMU hardware ?
> On Mar 17, 2021, at 9:46 PM, Longpeng (Mike, Cloud Infrastructure Service > Product Dept.) wrote: > [Snip] > > NOTE, the magical thing happens in (*Operation-4*): we write the PTE > of Operation-1 from 0 to 0x3, which means it can now Read/Write, and then > we trigger the DMA read again; it succeeds and returns the data of HPA 0 !! > > Why would modifying the older page table make any difference? As we > have discussed previously, the cache-flush part of the driver is correct: > it calls flush_iotlb after (b) and there is no need to flush after (c). But the result > of the experiment shows the older page table or older caches are still in > effect. > > Any ideas? Interesting. Sounds as if there is some page-walk cache that was not invalidated properly. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: A problem of Intel IOMMU hardware ?
Hi Nadav, > -Original Message- > From: Nadav Amit [mailto:nadav.a...@gmail.com] > Sent: Thursday, March 18, 2021 2:13 AM > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) > > Cc: David Woodhouse ; Lu Baolu > ; Joerg Roedel ; w...@kernel.org; > alex.william...@redhat.com; chenjiashang ; > iommu@lists.linux-foundation.org; Gonglei (Arei) ; > LKML > Subject: Re: A problem of Intel IOMMU hardware ? > > > > > On Mar 17, 2021, at 2:35 AM, Longpeng (Mike, Cloud Infrastructure Service > Product Dept.) wrote: > > > > Hi Nadav, > > > >> -Original Message- > >> From: Nadav Amit [mailto:nadav.a...@gmail.com] > >>> reproduce the problem with high probability (~50%). > >> > >> I saw Lu replied, and he is much more knowledgeable than I am (I was > >> just intrigued by your email). > >> > >> However, if I were you I would also try to remove some > >> “optimizations” to look for the root cause (e.g., use domain-specific > invalidations instead of page-specific). > >> > > > > Good suggestion! We already tried that these days; we used global > > invalidations as follows: > > iommu->flush.flush_iotlb(iommu, did, 0, 0, > > DMA_TLB_DSI_FLUSH); > > But it does not resolve the problem. > > > >> The first thing that comes to my mind is the invalidation hint (ih) > >> in iommu_flush_iotlb_psi(). I would remove it to see whether you get > >> the failure without it. > > > > We also noticed the IH, but the IH is always ZERO in our case, as the spec > > says: > > ''' > > Paging-structure-cache entries caching second-level mappings > > associated with the specified domain-id and the > > second-level-input-address range are invalidated, if the Invalidation > > Hint > > (IH) field is Clear. > > ''' > > > > It seems everything is fine on the software side, so we have no choice but to suspect the > > hardware. > > Ok, I am pretty much out of ideas. I have two more suggestions, but they are > much > less likely to help. Yet, they can further help to rule out software bugs: > > 1. dma_clear_pte() seems to be wrong IMHO. It should have used WRITE_ONCE() > to prevent a split write, which might potentially cause an “invalid” (partially > cleared) PTE to be stored in the TLB. Having said that, the subsequent IOTLB > flush > should have prevented the problem. > Yes, using WRITE_ONCE() is much safer; however, I just tested the following code and it didn't resolve my problem. static inline void dma_clear_pte(struct dma_pte *pte) { WRITE_ONCE(pte->val, 0ULL); } > 2. Consider ensuring that the problem is not somehow related to queued > invalidations. Try to use __iommu_flush_iotlb() instead of qi_flush_iotlb(). > I tried to force the use of __iommu_flush_iotlb(), but something went wrong and the system crashed, so I'd prefer to lower the priority of this operation. > Regards, > Nadav ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
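For illustration, here is a minimal sketch of the before/after that Nadav's first suggestion implies. It is hedged: the struct dma_pte layout is taken from the snippet quoted above, and the "plain" variant only stands in for the pre-WRITE_ONCE code, not the exact upstream source.

/* Sketch only -- not the exact upstream diff. */
struct dma_pte {
	u64 val;
};

/*
 * Old-style clear: a plain 64-bit store that the compiler is allowed to
 * split into two 32-bit writes (store tearing), so a concurrent hardware
 * page walk could observe a half-cleared, bogus PTE.
 */
static inline void dma_clear_pte_plain(struct dma_pte *pte)
{
	pte->val = 0;
}

/*
 * WRITE_ONCE() forces a single, non-torn store of the whole 64-bit value,
 * so the IOMMU can only ever see the old PTE or all zeroes.
 */
static inline void dma_clear_pte(struct dma_pte *pte)
{
	WRITE_ONCE(pte->val, 0ULL);
}

Either way, as Nadav notes, the subsequent IOTLB flush should mask such a window, which matches the observation above that this change alone did not make the problem go away.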
RE: A problem of Intel IOMMU hardware ?
> From: iommu On Behalf Of > Longpeng (Mike, Cloud Infrastructure Service Product Dept.) > > > 2. Consider ensuring that the problem is not somehow related to queued > > invalidations. Try to use __iommu_flush_iotlb() instead of qi_flush_iotlb(). > > > > I tried to force to use __iommu_flush_iotlb(), but maybe something wrong, > the system crashed, so I prefer to lower the priority of this operation. > The VT-d spec clearly says that register-based invalidation can be used only when queued-invalidations are not enabled. Intel-IOMMU driver doesn't provide an option to disable queued-invalidation though, when the hardware is capable. If you really want to try, tweak the code in intel_iommu_init_qi. Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
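To make Kevin's suggestion a bit more concrete, a rough, hypothetical sketch of the kind of tweak he means is shown below. The callback and function names follow what is already visible in this thread (iommu->flush.flush_iotlb, __iommu_flush_iotlb, qi_flush_iotlb) and the 4.18-era driver; the force_reg_inval knob is invented purely for the experiment, and the real intel_iommu_init_qi() also sets up the context-cache callback and other state.

/*
 * Hypothetical experiment only: never enable queued invalidation, so the
 * register-based (MMIO) invalidation routines remain legal per the spec.
 */
static bool force_reg_inval = true;	/* invented knob for the experiment */

static void intel_iommu_init_qi(struct intel_iommu *iommu)
{
	if (force_reg_inval || dmar_enable_qi(iommu)) {
		/* QI left disabled (or failed): fall back to MMIO registers */
		iommu->flush.flush_context = __iommu_flush_context;
		iommu->flush.flush_iotlb   = __iommu_flush_iotlb;
		pr_info("IOMMU: using register-based invalidation\n");
	} else {
		/* QI enabled: submit invalidation descriptors to the queue */
		iommu->flush.flush_context = qi_flush_context;
		iommu->flush.flush_iotlb   = qi_flush_iotlb;
	}
}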
RE: A problem of Intel IOMMU hardware ?
> -Original Message- > From: Tian, Kevin [mailto:kevin.t...@intel.com] > Sent: Thursday, March 18, 2021 4:27 PM > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) > ; Nadav Amit > Cc: chenjiashang ; David Woodhouse > ; iommu@lists.linux-foundation.org; LKML > ; alex.william...@redhat.com; Gonglei (Arei) > ; w...@kernel.org > Subject: RE: A problem of Intel IOMMU hardware ? > > > From: iommu On Behalf Of > > Longpeng (Mike, Cloud Infrastructure Service Product Dept.) > > > > > 2. Consider ensuring that the problem is not somehow related to > > > queued invalidations. Try to use __iommu_flush_iotlb() instead of > qi_flush_iotlb(). > > > > > > > I tried to force the use of __iommu_flush_iotlb(), but something went > > wrong and the system crashed, so I'd prefer to lower the priority of this > > operation. > > > > The VT-d spec clearly says that register-based invalidation can be used only > when > queued-invalidations are not enabled. Intel-IOMMU driver doesn't provide an > option to disable queued-invalidation though, when the hardware is capable. > If you > really want to try, tweak the code in intel_iommu_init_qi. > Hi Kevin, Thanks for pointing this out. Do you have any ideas about this problem? I tried to describe the problem more clearly in my reply to Alex; I hope you could have a look if you're interested. > Thanks > Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: A problem of Intel IOMMU hardware ?
> From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) > > > > > -Original Message- > > From: Tian, Kevin [mailto:kevin.t...@intel.com] > > Sent: Thursday, March 18, 2021 4:27 PM > > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) > > ; Nadav Amit > > Cc: chenjiashang ; David Woodhouse > > ; iommu@lists.linux-foundation.org; LKML > > ; alex.william...@redhat.com; Gonglei > (Arei) > > ; w...@kernel.org > > Subject: RE: A problem of Intel IOMMU hardware ? > > > > > From: iommu On Behalf Of > > > Longpeng (Mike, Cloud Infrastructure Service Product Dept.) > > > > > > > 2. Consider ensuring that the problem is not somehow related to > > > > queued invalidations. Try to use __iommu_flush_iotlb() instead of > > qi_flush_iotlb(). > > > > > > > > > > I tried to force to use __iommu_flush_iotlb(), but maybe something > > > wrong, the system crashed, so I prefer to lower the priority of this > operation. > > > > > > > The VT-d spec clearly says that register-based invalidation can be used only > when > > queued-invalidations are not enabled. Intel-IOMMU driver doesn't provide > an > > option to disable queued-invalidation though, when the hardware is > capable. If you > > really want to try, tweak the code in intel_iommu_init_qi. > > > > Hi Kevin, > > Thanks to point out this. Do you have any ideas about this problem ? I tried > to descript the problem much clear in my reply to Alex, hope you could have > a look if you're interested. > I agree with Nadav. Looks this implies some stale paging structure cache entry (e.g. PMD) is not invalidated properly. It's better if Baolu can reproduce this problem in his local environment and then do more debug to identify whether it's a software or hardware defect. btw what is the device under test? Does it support ATS? Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: A problem of Intel IOMMU hardware ?
> -Original Message- > From: Tian, Kevin [mailto:kevin.t...@intel.com] > Sent: Thursday, March 18, 2021 4:43 PM > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) > ; Nadav Amit > Cc: chenjiashang ; David Woodhouse > ; iommu@lists.linux-foundation.org; LKML > ; alex.william...@redhat.com; Gonglei (Arei) > ; w...@kernel.org > Subject: RE: A problem of Intel IOMMU hardware ? > > > From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) > > > > > > > > > -Original Message- > > > From: Tian, Kevin [mailto:kevin.t...@intel.com] > > > Sent: Thursday, March 18, 2021 4:27 PM > > > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) > > > ; Nadav Amit > > > Cc: chenjiashang ; David Woodhouse > > > ; iommu@lists.linux-foundation.org; LKML > > > ; alex.william...@redhat.com; Gonglei > > (Arei) > > > ; w...@kernel.org > > > Subject: RE: A problem of Intel IOMMU hardware ? > > > > > > > From: iommu On Behalf > > > > Of Longpeng (Mike, Cloud Infrastructure Service Product Dept.) > > > > > > > > > 2. Consider ensuring that the problem is not somehow related to > > > > > queued invalidations. Try to use __iommu_flush_iotlb() instead > > > > > of > > > qi_flush_iotlb(). > > > > > > > > > > > > > I tried to force to use __iommu_flush_iotlb(), but maybe something > > > > wrong, the system crashed, so I prefer to lower the priority of > > > > this > > operation. > > > > > > > > > > The VT-d spec clearly says that register-based invalidation can be > > > used only > > when > > > queued-invalidations are not enabled. Intel-IOMMU driver doesn't > > > provide > > an > > > option to disable queued-invalidation though, when the hardware is > > capable. If you > > > really want to try, tweak the code in intel_iommu_init_qi. > > > > > > > Hi Kevin, > > > > Thanks to point out this. Do you have any ideas about this problem ? I > > tried to descript the problem much clear in my reply to Alex, hope you > > could have a look if you're interested. > > > > I agree with Nadav. Looks this implies some stale paging structure cache > entry (e.g. > PMD) is not invalidated properly. It's better if Baolu can reproduce this > problem in > his local environment and then do more debug to identify whether it's a > software or > hardware defect. > > btw what is the device under test? Does it support ATS? > The device is our offload card, it does not support ATS cap. > Thanks > Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: A problem of Intel IOMMU hardware ?
> From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) > > > > -Original Message- > > From: Tian, Kevin [mailto:kevin.t...@intel.com] > > Sent: Thursday, March 18, 2021 4:27 PM > > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) > > ; Nadav Amit > > Cc: chenjiashang ; David Woodhouse > > ; iommu@lists.linux-foundation.org; LKML > > ; alex.william...@redhat.com; Gonglei > (Arei) > > ; w...@kernel.org > > Subject: RE: A problem of Intel IOMMU hardware ? > > > > > From: iommu On Behalf Of > > > Longpeng (Mike, Cloud Infrastructure Service Product Dept.) > > > > > > > 2. Consider ensuring that the problem is not somehow related to > > > > queued invalidations. Try to use __iommu_flush_iotlb() instead of > > qi_flush_iotlb(). > > > > > > > > > > I tried to force to use __iommu_flush_iotlb(), but maybe something > > > wrong, the system crashed, so I prefer to lower the priority of this > operation. > > > > > > > The VT-d spec clearly says that register-based invalidation can be used only > when > > queued-invalidations are not enabled. Intel-IOMMU driver doesn't provide > an > > option to disable queued-invalidation though, when the hardware is > capable. If you > > really want to try, tweak the code in intel_iommu_init_qi. > > > > Hi Kevin, > > Thanks to point out this. Do you have any ideas about this problem ? I tried > to descript the problem much clear in my reply to Alex, hope you could have > a look if you're interested. > btw I saw you used 4.18 kernel in this test. What about latest kernel? Also one way to separate sw/hw bug is to trace the low level interface (e.g., qi_flush_iotlb) which actually sends invalidation descriptors to the IOMMU hardware. Check the window between b) and c) and see whether the software does the right thing as expected there. Thanks Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
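One lightweight way to do the tracing Kevin suggests is to log each invalidation as it is submitted and compare the timestamps against the unmap (b) and remap (c) steps of the test. A sketch, assuming the qi_flush_iotlb() signature of the 4.18-era driver:

void qi_flush_iotlb(struct intel_iommu *iommu, u16 did, u64 addr,
		    unsigned int size_order, u64 type)
{
	/*
	 * Dump every IOTLB invalidation into the ftrace ring buffer so the
	 * window between (b) and (c) can be inspected afterwards via
	 * /sys/kernel/debug/tracing/trace.
	 */
	trace_printk("qi_flush_iotlb: did=%u addr=0x%llx order=%u type=0x%llx\n",
		     did, addr, size_order, type);

	/* ... existing code: build the qi_desc and call qi_submit_sync() ... */
}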
Re: [PATCH v4 0/2] VMD MSI Remapping Bypass
On Wed, Mar 17, 2021 at 07:14:17PM +, Derrick, Jonathan wrote: > Gentle reminder, for v5.13 ? This should go through the PCI tree, Bjorn? ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 2/5] iommu/vt-d: Remove WO permissions on second-level paging entries
Hi, On Mon, Mar 08, 2021 at 11:47:46AM -0800, Raj, Ashok wrote: > That is the primary motivation, given that we have moved to 1st level for > general IOVA, first level doesn't have a WO mapping. I didn't know enough > about the history to determine if a WO without a READ is very useful. I > guess the ZLR was invented to support those cases without a READ in PCIe. I Okay, please update the commit message and re-send. I guess these patches are 5.13 stuff. In that case, Baolu can include them into his pull request later this cycle. Regards, Joerg ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] Revert "iommu/amd: Fix performance counter initialization"
Dear Jörg, dear Suravee, On 03.03.21 at 15:10, Alexander Monakov wrote: On Wed, 3 Mar 2021, Suravee Suthikulpanit wrote: Additionally, alternative proposed solutions [1] were not considered or discussed. [1]: https://lore.kernel.org/linux-iommu/alpine.lnx.2.20.13.2006030935570.3...@monopod.intra.ispras.ru/ This check has been introduced early on to detect a HW issue on certain platforms in the past, where the performance counters are not accessible and would result in silent failure when trying to use the counters. This is considered legacy code, and can be removed if we decide to no longer provide a sanity check for such cases. Which platforms? There is no such information in the code or the commit messages that introduced this. According to AMD's documentation, the presence of performance counters is indicated by the "PCSup" bit in the "EFR" register. I don't think the driver should second-guess that. If there were platforms where the CPU or the firmware lied to the OS (EFR[PCSup] was 1, but counters were not present), I think that should have been handled in a more explicit manner, e.g. via matching broken CPUs by cpuid. Suravee, could you please answer the questions? Jörg, I know you are probably busy, but the patch was applied to the stable series (v5.11.7). There are still too many questions open regarding the patch, and Suravee has not yet addressed the comments. It'd be great if you could revert it. Kind regards, Paul ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
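For reference, the behaviour Alexander is arguing for would reduce to something like the sketch below. FEATURE_PC is assumed to be the driver's name for the EFR[PCSup] bit; treat both names as assumptions to be checked against the tree.

/* Hedged sketch: trust EFR[PCSup] instead of probing the counter MMIO. */
static bool amd_iommu_pc_present(struct amd_iommu *iommu)
{
	/*
	 * Platforms whose firmware lies (PCSup set but counters absent)
	 * would then be handled explicitly, e.g. by a quirk keyed on
	 * CPUID/PCI IDs, rather than by a generic write/read-back test.
	 */
	return iommu_feature(iommu, FEATURE_PC);
}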
RE: A problem of Intel IOMMU hardware ?
> -Original Message- > From: Tian, Kevin [mailto:kevin.t...@intel.com] > Sent: Thursday, March 18, 2021 4:56 PM > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) > ; Nadav Amit > Cc: chenjiashang ; David Woodhouse > ; iommu@lists.linux-foundation.org; LKML > ; alex.william...@redhat.com; Gonglei (Arei) > ; w...@kernel.org > Subject: RE: A problem of Intel IOMMU hardware ? > > > From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) > > > > > > > -Original Message- > > > From: Tian, Kevin [mailto:kevin.t...@intel.com] > > > Sent: Thursday, March 18, 2021 4:27 PM > > > To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) > > > ; Nadav Amit > > > Cc: chenjiashang ; David Woodhouse > > > ; iommu@lists.linux-foundation.org; LKML > > > ; alex.william...@redhat.com; Gonglei > > (Arei) > > > ; w...@kernel.org > > > Subject: RE: A problem of Intel IOMMU hardware ? > > > > > > > From: iommu On Behalf > > > > Of Longpeng (Mike, Cloud Infrastructure Service Product Dept.) > > > > > > > > > 2. Consider ensuring that the problem is not somehow related to > > > > > queued invalidations. Try to use __iommu_flush_iotlb() instead > > > > > of > > > qi_flush_iotlb(). > > > > > > > > > > > > > I tried to force to use __iommu_flush_iotlb(), but maybe something > > > > wrong, the system crashed, so I prefer to lower the priority of > > > > this > > operation. > > > > > > > > > > The VT-d spec clearly says that register-based invalidation can be > > > used only > > when > > > queued-invalidations are not enabled. Intel-IOMMU driver doesn't > > > provide > > an > > > option to disable queued-invalidation though, when the hardware is > > capable. If you > > > really want to try, tweak the code in intel_iommu_init_qi. > > > > > > > Hi Kevin, > > > > Thanks to point out this. Do you have any ideas about this problem ? I > > tried to descript the problem much clear in my reply to Alex, hope you > > could have a look if you're interested. > > > > btw I saw you used 4.18 kernel in this test. What about latest kernel? > Not test yet. It's hard to upgrade kernel in our environment. > Also one way to separate sw/hw bug is to trace the low level interface (e.g., > qi_flush_iotlb) which actually sends invalidation descriptors to the IOMMU > hardware. Check the window between b) and c) and see whether the software does > the right thing as expected there. > We add some log in iommu driver these days, the software seems fine. But we didn't look inside the qi_submit_sync yet, I'll try it tonight. > Thanks > Kevin ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH] dma-mapping: benchmark: Add support for multi-pages map/unmap
From: Xiang Chen Currently it only support one page map/unmap once a time for dma-map benchmark, but there are some other scenaries which need to support for multi-page map/unmap: for those multi-pages interfaces such as dma_alloc_coherent() and dma_map_sg(), the time spent on multi-pages map/unmap is not the time of a single page * npages (not linear) as it may use block description instead of page description when it is satified with the size such as 2M/1G, and also it can send a single TLB invalidation command to invalidate multi-pages instead of multi-times when RIL is enabled (which will short the time of unmap). So it is necessary to add support for multi-pages map/unmap. Add a parameter "-g" to support multi-pages map/unmap. Signed-off-by: Xiang Chen --- kernel/dma/map_benchmark.c | 21 ++--- tools/testing/selftests/dma/dma_map_benchmark.c | 20 2 files changed, 30 insertions(+), 11 deletions(-) diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c index e0e64f8..a5c1b01 100644 --- a/kernel/dma/map_benchmark.c +++ b/kernel/dma/map_benchmark.c @@ -38,7 +38,8 @@ struct map_benchmark { __u32 dma_bits; /* DMA addressing capability */ __u32 dma_dir; /* DMA data direction */ __u32 dma_trans_ns; /* time for DMA transmission in ns */ - __u8 expansion[80]; /* For future use */ + __u32 granule; /* how many PAGE_SIZE will do map/unmap once a time */ + __u8 expansion[76]; /* For future use */ }; struct map_benchmark_data { @@ -58,9 +59,11 @@ static int map_benchmark_thread(void *data) void *buf; dma_addr_t dma_addr; struct map_benchmark_data *map = data; + int npages = map->bparam.granule; + u64 size = npages * PAGE_SIZE; int ret = 0; - buf = (void *)__get_free_page(GFP_KERNEL); + buf = alloc_pages_exact(size, GFP_KERNEL); if (!buf) return -ENOMEM; @@ -76,10 +79,10 @@ static int map_benchmark_thread(void *data) * 66 means evertything goes well! 66 is lucky. 
*/ if (map->dir != DMA_FROM_DEVICE) - memset(buf, 0x66, PAGE_SIZE); + memset(buf, 0x66, size); map_stime = ktime_get(); - dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, map->dir); + dma_addr = dma_map_single(map->dev, buf, size, map->dir); if (unlikely(dma_mapping_error(map->dev, dma_addr))) { pr_err("dma_map_single failed on %s\n", dev_name(map->dev)); @@ -93,7 +96,7 @@ static int map_benchmark_thread(void *data) ndelay(map->bparam.dma_trans_ns); unmap_stime = ktime_get(); - dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, map->dir); + dma_unmap_single(map->dev, dma_addr, size, map->dir); unmap_etime = ktime_get(); unmap_delta = ktime_sub(unmap_etime, unmap_stime); @@ -112,7 +115,7 @@ static int map_benchmark_thread(void *data) } out: - free_page((unsigned long)buf); + free_pages_exact(buf, size); return ret; } @@ -203,7 +206,6 @@ static long map_benchmark_ioctl(struct file *file, unsigned int cmd, struct map_benchmark_data *map = file->private_data; void __user *argp = (void __user *)arg; u64 old_dma_mask; - int ret; if (copy_from_user(&map->bparam, argp, sizeof(map->bparam))) @@ -234,6 +236,11 @@ static long map_benchmark_ioctl(struct file *file, unsigned int cmd, return -EINVAL; } + if (map->bparam.granule < 1 || map->bparam.granule > 1024) { + pr_err("invalid granule size\n"); + return -EINVAL; + } + switch (map->bparam.dma_dir) { case DMA_MAP_BIDIRECTIONAL: map->dir = DMA_BIDIRECTIONAL; diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c b/tools/testing/selftests/dma/dma_map_benchmark.c index fb23ce9..6f2caa6 100644 --- a/tools/testing/selftests/dma/dma_map_benchmark.c +++ b/tools/testing/selftests/dma/dma_map_benchmark.c @@ -40,7 +40,8 @@ struct map_benchmark { __u32 dma_bits; /* DMA addressing capability */ __u32 dma_dir; /* DMA data direction */ __u32 dma_trans_ns; /* time for DMA transmission in ns */ - __u8 expansion[80]; /* For future use */ + __u32 granule; /* how many PAGE_SIZE will do map/unmap once a time */ + __u8 expansion[76]; /* For future use */ }; int main(int argc, char **argv) @@ -51,11 +52,13 @@ int main(int argc, char **argv) int threads = 1, seconds = 20, node = -1; /* default dma mask 32bit, bidirectional DMA */ int bits = 32, xdelay = 0, dir = DMA_MAP_BIDIRECTIONAL; + /* de
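For anyone who wants to exercise the new field directly rather than through the updated selftest binary, a hedged user-space sketch follows. The debugfs path, the DMA_MAP_BENCHMARK ioctl macro and the fields above dma_bits are assumptions based on the existing dma_map_benchmark interface (they are not introduced by this patch); only .granule is new here.

#include <stdio.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/ioctl.h>
#include <linux/types.h>

/* Assumed to match the existing selftest definition, plus the new granule. */
struct map_benchmark {
	__u64 avg_map_100ns;	/* average map latency in 0.1us */
	__u64 map_stddev;	/* standard deviation of map latency */
	__u64 avg_unmap_100ns;	/* as above, for unmap */
	__u64 unmap_stddev;
	__u32 threads;		/* how many threads will do map/unmap in parallel */
	__u32 seconds;		/* how long the test runs */
	__s32 node;		/* which NUMA node this benchmark will run on */
	__u32 dma_bits;		/* DMA addressing capability */
	__u32 dma_dir;		/* DMA data direction */
	__u32 dma_trans_ns;	/* time for DMA transmission in ns */
	__u32 granule;		/* how many PAGE_SIZE will do map/unmap once a time */
	__u8  expansion[76];	/* For future use */
};

/* Assumed ioctl number, following the existing selftest. */
#define DMA_MAP_BENCHMARK	_IOWR('d', 1, struct map_benchmark)

int main(void)
{
	struct map_benchmark map;
	int fd = open("/sys/kernel/debug/dma_map_benchmark", O_RDWR);

	if (fd < 0) {
		perror("open");
		return 1;
	}

	memset(&map, 0, sizeof(map));	/* dma_dir 0 == DMA_MAP_BIDIRECTIONAL */
	map.threads = 1;		/* one mapping thread */
	map.seconds = 20;		/* run for 20 seconds */
	map.dma_bits = 32;		/* 32-bit DMA mask for the test device */
	map.granule = 8;		/* new: 8 * PAGE_SIZE per dma_map_single() */

	if (ioctl(fd, DMA_MAP_BENCHMARK, &map) < 0) {
		perror("ioctl");
		close(fd);
		return 1;
	}

	printf("avg map latency: %.1f us, avg unmap latency: %.1f us\n",
	       map.avg_map_100ns / 10.0, map.avg_unmap_100ns / 10.0);
	close(fd);
	return 0;
}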
Re: [PATCH] iommu/dma: Fix a typo in a comment
On Thu, Mar 18, 2021 at 11:21:24AM +0800, chenxiang wrote: > From: Xiang Chen > > Fix a type "SAC" to "DAC" in the comment of function > iommu_dma_alloc_iova(). > > Signed-off-by: Xiang Chen > --- > drivers/iommu/dma-iommu.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c > index af765c8..3bc17ee 100644 > --- a/drivers/iommu/dma-iommu.c > +++ b/drivers/iommu/dma-iommu.c > @@ -443,7 +443,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct > iommu_domain *domain, > if (domain->geometry.force_aperture) > dma_limit = min(dma_limit, (u64)domain->geometry.aperture_end); > > - /* Try to get PCI devices a SAC address */ > + /* Try to get PCI devices a DAC address */ > if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev)) > iova = alloc_iova_fast(iovad, iova_len, > DMA_BIT_MASK(32) >> shift, false); This doesn't look like a typo to me... Please explain. Will ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
RE: [PATCH] dma-mapping: benchmark: Add support for multi-pages map/unmap
> -Original Message- > From: chenxiang (M) > Sent: Thursday, March 18, 2021 10:30 PM > To: h...@lst.de; m.szyprow...@samsung.com; robin.mur...@arm.com; Song Bao Hua > (Barry Song) > Cc: iommu@lists.linux-foundation.org; linux...@openeuler.org; > linux-kselft...@vger.kernel.org; chenxiang (M) > Subject: [PATCH] dma-mapping: benchmark: Add support for multi-pages map/unmap > > From: Xiang Chen > > Currently it only support one page map/unmap once a time for dma-map > benchmark, but there are some other scenaries which need to support for > multi-page map/unmap: for those multi-pages interfaces such as > dma_alloc_coherent() and dma_map_sg(), the time spent on multi-pages > map/unmap is not the time of a single page * npages (not linear) as it > may use block description instead of page description when it is satified > with the size such as 2M/1G, and also it can send a single TLB invalidation > command to invalidate multi-pages instead of multi-times when RIL is > enabled (which will short the time of unmap). So it is necessary to add > support for multi-pages map/unmap. > > Add a parameter "-g" to support multi-pages map/unmap. > > Signed-off-by: Xiang Chen > --- Acked-by: Barry Song > kernel/dma/map_benchmark.c | 21 ++--- > tools/testing/selftests/dma/dma_map_benchmark.c | 20 > 2 files changed, 30 insertions(+), 11 deletions(-) > > diff --git a/kernel/dma/map_benchmark.c b/kernel/dma/map_benchmark.c > index e0e64f8..a5c1b01 100644 > --- a/kernel/dma/map_benchmark.c > +++ b/kernel/dma/map_benchmark.c > @@ -38,7 +38,8 @@ struct map_benchmark { > __u32 dma_bits; /* DMA addressing capability */ > __u32 dma_dir; /* DMA data direction */ > __u32 dma_trans_ns; /* time for DMA transmission in ns */ > - __u8 expansion[80]; /* For future use */ > + __u32 granule; /* how many PAGE_SIZE will do map/unmap once a time */ > + __u8 expansion[76]; /* For future use */ > }; > > struct map_benchmark_data { > @@ -58,9 +59,11 @@ static int map_benchmark_thread(void *data) > void *buf; > dma_addr_t dma_addr; > struct map_benchmark_data *map = data; > + int npages = map->bparam.granule; > + u64 size = npages * PAGE_SIZE; > int ret = 0; > > - buf = (void *)__get_free_page(GFP_KERNEL); > + buf = alloc_pages_exact(size, GFP_KERNEL); > if (!buf) > return -ENOMEM; > > @@ -76,10 +79,10 @@ static int map_benchmark_thread(void *data) >* 66 means evertything goes well! 66 is lucky. 
>*/ > if (map->dir != DMA_FROM_DEVICE) > - memset(buf, 0x66, PAGE_SIZE); > + memset(buf, 0x66, size); > > map_stime = ktime_get(); > - dma_addr = dma_map_single(map->dev, buf, PAGE_SIZE, map->dir); > + dma_addr = dma_map_single(map->dev, buf, size, map->dir); > if (unlikely(dma_mapping_error(map->dev, dma_addr))) { > pr_err("dma_map_single failed on %s\n", > dev_name(map->dev)); > @@ -93,7 +96,7 @@ static int map_benchmark_thread(void *data) > ndelay(map->bparam.dma_trans_ns); > > unmap_stime = ktime_get(); > - dma_unmap_single(map->dev, dma_addr, PAGE_SIZE, map->dir); > + dma_unmap_single(map->dev, dma_addr, size, map->dir); > unmap_etime = ktime_get(); > unmap_delta = ktime_sub(unmap_etime, unmap_stime); > > @@ -112,7 +115,7 @@ static int map_benchmark_thread(void *data) > } > > out: > - free_page((unsigned long)buf); > + free_pages_exact(buf, size); > return ret; > } > > @@ -203,7 +206,6 @@ static long map_benchmark_ioctl(struct file *file, > unsigned > int cmd, > struct map_benchmark_data *map = file->private_data; > void __user *argp = (void __user *)arg; > u64 old_dma_mask; > - > int ret; > > if (copy_from_user(&map->bparam, argp, sizeof(map->bparam))) > @@ -234,6 +236,11 @@ static long map_benchmark_ioctl(struct file *file, > unsigned > int cmd, > return -EINVAL; > } > > + if (map->bparam.granule < 1 || map->bparam.granule > 1024) { > + pr_err("invalid granule size\n"); > + return -EINVAL; > + } > + > switch (map->bparam.dma_dir) { > case DMA_MAP_BIDIRECTIONAL: > map->dir = DMA_BIDIRECTIONAL; > diff --git a/tools/testing/selftests/dma/dma_map_benchmark.c > b/tools/testing/selftests/dma/dma_map_benchmark.c > index fb23ce9..6f2caa6 100644 > --- a/tools/testing/selftests/dma/dma_map_benchmark.c > +++ b/tools/testing/selftests/dma/dma_map_benchmark.c > @@ -40,7 +40,8 @@ struct map_benchmark { > __u32 dma_bits; /* DMA addressing capability */ > __u32 dma_dir; /* DMA data direction */ >
Re: [PATCH 0/3] iommu/amd: Fix booting with amd_iommu=off
On Wed, Mar 17, 2021 at 06:48:50PM +0800, Huang Rui wrote: > Series are Acked-by: Huang Rui Thanks, series is applied for v5.12 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v3] iommu: Check dev->iommu in iommu_dev_xxx functions
On Wed, Mar 03, 2021 at 05:36:11PM +, Shameer Kolothum wrote: > The device iommu probe/attach might have failed leaving dev->iommu > to NULL and device drivers may still invoke these functions resulting > in a crash in iommu vendor driver code. > > Hence make sure we check that. > > Fixes: a3a195929d40 ("iommu: Add APIs for multiple domains per device") > Signed-off-by: Shameer Kolothum > --- > v2 --> v3 > -Removed iommu_ops from struct dev_iommu. > v1 --> v2: > -Added iommu_ops to struct dev_iommu based on the discussion with Robin. > -Rebased against iommu-tree core branch. > --- > drivers/iommu/iommu.c | 24 +++- > 1 file changed, 15 insertions(+), 9 deletions(-) Applied, thanks. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v5 0/2] Add Unisoc IOMMU basic driver
On Fri, Mar 05, 2021 at 05:32:14PM +0800, Chunyan Zhang wrote: > .../devicetree/bindings/iommu/sprd,iommu.yaml | 57 ++ > drivers/iommu/Kconfig | 12 + > drivers/iommu/Makefile| 1 + > drivers/iommu/sprd-iommu.c| 577 ++ > 4 files changed, 647 insertions(+) > create mode 100644 Documentation/devicetree/bindings/iommu/sprd,iommu.yaml > create mode 100644 drivers/iommu/sprd-iommu.c Applied, thanks. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] iommu/dma: Resurrect the "forcedac" option
On Fri, Mar 05, 2021 at 04:32:34PM +, Robin Murphy wrote: > In converting intel-iommu over to the common IOMMU DMA ops, it quietly > lost the functionality of its "forcedac" option. Since this is a handy > thing both for testing and for performance optimisation on certain > platforms, reimplement it under the common IOMMU parameter namespace. > > For the sake of fixing the inadvertent breakage of the Intel-specific > parameter, remove the dmar_forcedac remnants and hook it up as an alias > while documenting the transition to the new common parameter. > > Fixes: c588072bba6b ("iommu/vt-d: Convert intel iommu driver to the iommu > ops") > Signed-off-by: Robin Murphy > --- > Documentation/admin-guide/kernel-parameters.txt | 15 --- > drivers/iommu/dma-iommu.c | 13 - > drivers/iommu/intel/iommu.c | 5 ++--- > include/linux/dma-iommu.h | 2 ++ > 4 files changed, 24 insertions(+), 11 deletions(-) Applied for v5.13, thanks. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] iommu/dma: Fix a typo in a comment
Hi will, 在 2021/3/18 17:34, Will Deacon 写道: On Thu, Mar 18, 2021 at 11:21:24AM +0800, chenxiang wrote: From: Xiang Chen Fix a type "SAC" to "DAC" in the comment of function iommu_dma_alloc_iova(). Signed-off-by: Xiang Chen --- drivers/iommu/dma-iommu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index af765c8..3bc17ee 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -443,7 +443,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain, if (domain->geometry.force_aperture) dma_limit = min(dma_limit, (u64)domain->geometry.aperture_end); - /* Try to get PCI devices a SAC address */ + /* Try to get PCI devices a DAC address */ if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev)) iova = alloc_iova_fast(iovad, iova_len, DMA_BIT_MASK(32) >> shift, false); This doesn't look like a typo to me... Please explain. I think it means double-address cycle(DAC), and in LLD3 452 page, there is a description about it "PCI double-address cycle mappings Normally, the DMA support layer works with 32-bit bus addresses, possibly restricted by a specific device’s DMA mask. The PCI bus, however, also supports a 64-bit addressing mode, the double-address cycle (DAC)." Please point it out if i am wrong :) Best Regards, chenxiang Will . ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 1/2] iommu/iova: Add rbtree entry helper
On Fri, Mar 05, 2021 at 04:35:22PM +, Robin Murphy wrote: > Repeating the rb_entry() boilerplate all over the place gets old fast. > Before adding yet more instances, add a little helper to tidy it up. > > Signed-off-by: Robin Murphy > --- > drivers/iommu/iova.c | 23 ++- > 1 file changed, 14 insertions(+), 9 deletions(-) Applied both, thanks Robin. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 1/1] iommu/vt-d: Don't set then immediately clear in prq_event_thread()
Hi Baolu, On Tue, Mar 09, 2021 at 08:46:41AM +0800, Lu Baolu wrote: > The private data field of a page group response descriptor is set then > immediately cleared in prq_event_thread(). Fix this by moving clearing > code up. > > Fixes: 5b438f4ba315d ("iommu/vt-d: Support page request in scalable mode") > Cc: Jacob Pan > Reviewed-by: Liu Yi L > Signed-off-by: Lu Baolu Does this fix an actual bug? If so, please state it in the commit message and also fix the subject line to state what is set/cleared. Thanks, Joerg ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 1/1] iommu/vt-d: Fix lockdep splat in intel_pasid_get_entry()
On Wed, Mar 17, 2021 at 08:58:34AM +0800, Lu Baolu wrote: > The pasid_lock is used to synchronize different threads from modifying a > same pasid directory entry at the same time. It causes below lockdep splat. > > [ 83.296538] > [ 83.296538] WARNING: possible irq lock inversion dependency detected > [ 83.296539] 5.12.0-rc3+ #25 Tainted: GW > [ 83.296539] > [ 83.296540] bash/780 just changed the state of lock: > [ 83.296540] 82b29c98 (device_domain_lock){..-.}-{2:2}, at: >iommu_flush_dev_iotlb.part.0+0x32/0x110 > [ 83.296547] but this lock took another, SOFTIRQ-unsafe lock in the past: > [ 83.296547] (pasid_lock){+.+.}-{2:2} > [ 83.296548] > >and interrupts could create inverse lock ordering between them. > > [ 83.296549] other info that might help us debug this: > [ 83.296549] Chain exists of: > device_domain_lock --> &iommu->lock --> pasid_lock > [ 83.296551] Possible interrupt unsafe locking scenario: > > [ 83.296551]CPU0CPU1 > [ 83.296552] > [ 83.296552] lock(pasid_lock); > [ 83.296553]local_irq_disable(); > [ 83.296553]lock(device_domain_lock); > [ 83.296554]lock(&iommu->lock); > [ 83.296554] > [ 83.296554] lock(device_domain_lock); > [ 83.296555] > *** DEADLOCK *** > > Fix it by replacing the pasid_lock with an atomic exchange operation. > > Reported-and-tested-by: Dave Jiang > Signed-off-by: Lu Baolu > --- > drivers/iommu/intel/pasid.c | 14 ++ > 1 file changed, 6 insertions(+), 8 deletions(-) > > diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c > index 9fb3d3e80408..1ddcb8295f72 100644 > --- a/drivers/iommu/intel/pasid.c > +++ b/drivers/iommu/intel/pasid.c > @@ -24,7 +24,6 @@ > /* > * Intel IOMMU system wide PASID name space: > */ > -static DEFINE_SPINLOCK(pasid_lock); > u32 intel_pasid_max_id = PASID_MAX; > > int vcmd_alloc_pasid(struct intel_iommu *iommu, u32 *pasid) > @@ -259,19 +258,18 @@ struct pasid_entry *intel_pasid_get_entry(struct device > *dev, u32 pasid) > dir_index = pasid >> PASID_PDE_SHIFT; > index = pasid & PASID_PTE_MASK; > > - spin_lock(&pasid_lock); > entries = get_pasid_table_from_pde(&dir[dir_index]); > if (!entries) { > entries = alloc_pgtable_page(info->iommu->node); > - if (!entries) { > - spin_unlock(&pasid_lock); > + if (!entries) > return NULL; > - } > > - WRITE_ONCE(dir[dir_index].val, > -(u64)virt_to_phys(entries) | PASID_PTE_PRESENT); > + if (cmpxchg64(&dir[dir_index].val, 0ULL, > + (u64)virt_to_phys(entries) | PASID_PTE_PRESENT)) { > + free_pgtable_page(entries); > + entries = get_pasid_table_from_pde(&dir[dir_index]); This is racy, someone could have already cleared the pasid-entry again. What you need to do here is to retry the whole path by adding a goto to before the first get_pasid_table_from_pde() call. Btw, what makes sure that the pasid_entry does not go away when it is returned here? Regards, Joerg ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
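Spelled out, the retry Joerg is describing could look like the sketch below, using only the helpers already visible in the patch; this is an illustration of the review comment, not the fix that was eventually merged.

	/* in intel_pasid_get_entry(), replacing the block quoted above */
retry:
	entries = get_pasid_table_from_pde(&dir[dir_index]);
	if (!entries) {
		entries = alloc_pgtable_page(info->iommu->node);
		if (!entries)
			return NULL;

		/*
		 * Publish the new table only if the directory entry is still
		 * empty.  If we lose the race, free our page and start over:
		 * the winner's entry could itself have been cleared again by
		 * the time we re-read it, so a single re-read is not enough.
		 */
		if (cmpxchg64(&dir[dir_index].val, 0ULL,
			      (u64)virt_to_phys(entries) | PASID_PTE_PRESENT)) {
			free_pgtable_page(entries);
			goto retry;
		}
	}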
Re: [PATCH] iommu/vt-d: Disable SVM when ATS/PRI/PASID are not enabled in the device
On Sun, Mar 14, 2021 at 01:15:34PM -0700, Kyung Min Park wrote: > Currently, the Intel VT-d supports Shared Virtual Memory (SVM) only when > IO page fault is supported. Otherwise, shared memory pages can not be > swapped out and need to be pinned. The device needs the Address Translation > Service (ATS), Page Request Interface (PRI) and Process Address Space > Identifier (PASID) capabilities to be enabled to support IO page fault. > > Disable SVM when ATS, PRI and PASID are not enabled in the device. > > Signed-off-by: Kyung Min Park Applied, thanks. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 1/1] iommu/vt-d: Report more information about invalidation errors
On Thu, Mar 18, 2021 at 08:53:40AM +0800, Lu Baolu wrote: > When the invalidation queue errors are encountered, dump the information > logged by the VT-d hardware together with the pending queue invalidation > descriptors. > > Signed-off-by: Ashok Raj > Tested-by: Guo Kaijie > Signed-off-by: Lu Baolu > --- > drivers/iommu/intel/dmar.c | 68 ++--- > include/linux/intel-iommu.h | 6 > 2 files changed, 70 insertions(+), 4 deletions(-) Applied, thanks. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v1] iommu/tegra-smmu: Make tegra_smmu_probe_device() to handle all IOMMU phandles
On Fri, Mar 12, 2021 at 06:54:39PM +0300, Dmitry Osipenko wrote: > The tegra_smmu_probe_device() handles only the first IOMMU device-tree > phandle, skipping the rest. Devices like 3D module on Tegra30 have > multiple IOMMU phandles, one for each h/w block, and thus, only one > IOMMU phandle is added to fwspec for the 3D module, breaking GPU. > Previously this problem was masked by tegra_smmu_attach_dev() which > didn't use the fwspec, but parsed the DT by itself. The previous commit > to tegra-smmu driver partially reverted changes that caused problems for > T124 and now we have tegra_smmu_attach_dev() that uses the fwspec and > the old-buggy variant of tegra_smmu_probe_device() which skips secondary > IOMMUs. > > Make tegra_smmu_probe_device() not to skip the secondary IOMMUs. This > fixes a partially attached IOMMU of the 3D module on Tegra30 and now GPU > works properly once again. > > Fixes: 765a9d1d02b2 ("iommu/tegra-smmu: Fix mc errors on tegra124-nyan") > Signed-off-by: Dmitry Osipenko > --- > drivers/iommu/tegra-smmu.c | 7 +++ > 1 file changed, 3 insertions(+), 4 deletions(-) Applied for v5.12, thanks. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v2 0/4] Misc vSVA fixes for VT-d
On Tue, Mar 02, 2021 at 02:13:56AM -0800, Jacob Pan wrote: > Hi Baolu et al, > > This is a collection of SVA-related fixes. > > ChangeLog: > > v2: > - For guest SVA, call pasid_set_wpe directly w/o checking host CR0.wp > (Review comments by Kevin T.) > - Added fixes tag > > Thanks, > > Jacob > > Jacob Pan (4): > iommu/vt-d: Enable write protect for supervisor SVM > iommu/vt-d: Enable write protect propagation from guest > iommu/vt-d: Reject unsupported page request modes > iommu/vt-d: Calculate and set flags for handle_mm_fault > > drivers/iommu/intel/pasid.c | 29 + > drivers/iommu/intel/svm.c | 21 + > include/uapi/linux/iommu.h | 3 ++- > 3 files changed, 48 insertions(+), 5 deletions(-) Applied, thanks. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 3/3] iommu/virtio: Enable x86 support
On Tue, Mar 16, 2021 at 08:16:54PM +0100, Jean-Philippe Brucker wrote: > With the VIOT support in place, x86 platforms can now use the > virtio-iommu. > > The arm64 Kconfig selects IOMMU_DMA, while x86 IOMMU drivers select it > themselves. > > Signed-off-by: Jean-Philippe Brucker > --- > drivers/iommu/Kconfig | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig > index 2819b5c8ec30..ccca83ef2f06 100644 > --- a/drivers/iommu/Kconfig > +++ b/drivers/iommu/Kconfig > @@ -400,8 +400,9 @@ config HYPERV_IOMMU > config VIRTIO_IOMMU > tristate "Virtio IOMMU driver" > depends on VIRTIO > - depends on ARM64 > + depends on (ARM64 || X86) > select IOMMU_API > + select IOMMU_DMA if X86 > select INTERVAL_TREE > select ACPI_VIOT if ACPI > help Acked-by: Joerg Roedel ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] iommu/dma: Fix a typo in a comment
On 2021-03-18 09:55, chenxiang (M) wrote: Hi will, 在 2021/3/18 17:34, Will Deacon 写道: On Thu, Mar 18, 2021 at 11:21:24AM +0800, chenxiang wrote: From: Xiang Chen Fix a type "SAC" to "DAC" in the comment of function iommu_dma_alloc_iova(). Signed-off-by: Xiang Chen --- drivers/iommu/dma-iommu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index af765c8..3bc17ee 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -443,7 +443,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain, if (domain->geometry.force_aperture) dma_limit = min(dma_limit, (u64)domain->geometry.aperture_end); - /* Try to get PCI devices a SAC address */ + /* Try to get PCI devices a DAC address */ if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev)) iova = alloc_iova_fast(iovad, iova_len, DMA_BIT_MASK(32) >> shift, false); This doesn't look like a typo to me... Please explain. As the author of said comment, it is definitely not a typo. I think it means double-address cycle(DAC), and in LLD3 452 page, there is a description about it "PCI double-address cycle mappings Normally, the DMA support layer works with 32-bit bus addresses, possibly restricted by a specific device’s DMA mask. The PCI bus, however, also supports a 64-bit addressing mode, the double-address cycle (DAC)." Please point it out if i am wrong :) Yes, now look at what the code above does: *if* a PCI device has a DMA mask greater than 32 bits (i.e. can support dual address cycles), we first try an IOVA allocation with an explicit 32-bit limit (i.e. that can be expressed in a single address cycle), in the hope that we don't have to fall back to allocating from the upper range and forcing the device to actually use DAC (and suffer the potential performance impact). Robin. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 2/2] iommu/iova: Improve restart logic
On 10/03/2021 17:47, John Garry wrote: On 09/03/2021 15:55, John Garry wrote: On 05/03/2021 16:35, Robin Murphy wrote: Hi Robin, When restarting after searching below the cached node fails, resetting the start point to the anchor node is often overly pessimistic. If allocations are made with mixed limits - particularly in the case of the opportunistic 32-bit allocation for PCI devices - this could mean significant time wasted walking through the whole populated upper range just to reach the initial limit. Right We can improve on that by implementing a proper tree traversal to find the first node above the relevant limit, and set the exact start point. Thanks for this. However, as mentioned in the other thread, this does not help our performance regression Re: commit 4e89dce72521. And mentioning this "retry" approach again, in case it was missed, from my experiment on the affected HW, it also has generally a dreadfully low success rate - less than 1% for the 32b range retry. Retry rate is about 20%. So I am saying from this 20%, less than 1% of those succeed. So since the retry means that we search through the complete pfn range most of the time (due to poor success rate), we should be able to do a better job at maintaining an accurate max alloc size, by calculating it from the range search, and not relying on max alloc failed or resetting it frequently. Hopefully that would mean that we're smarter about not trying the allocation. So I tried that out and we seem to be able to scrap back an appreciable amount of performance. Maybe 80% of original, with with another change, below. Thanks, John Signed-off-by: Robin Murphy --- drivers/iommu/iova.c | 39 ++- 1 file changed, 38 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/iova.c b/drivers/iommu/iova.c index c28003e1d2ee..471c48dd71e7 100644 --- a/drivers/iommu/iova.c +++ b/drivers/iommu/iova.c @@ -154,6 +154,43 @@ __cached_rbnode_delete_update(struct iova_domain *iovad, struct iova *free) iovad->cached_node = rb_next(&free->node); } +static struct rb_node *iova_find_limit(struct iova_domain *iovad, unsigned long limit_pfn) +{ + struct rb_node *node, *next; + /* + * Ideally what we'd like to judge here is whether limit_pfn is close + * enough to the highest-allocated IOVA that starting the allocation + * walk from the anchor node will be quicker than this initial work to + * find an exact starting point (especially if that ends up being the + * anchor node anyway). This is an incredibly crude approximation which + * only really helps the most likely case, but is at least trivially easy. + */ + if (limit_pfn > iovad->dma_32bit_pfn) + return &iovad->anchor.node; + + node = iovad->rbroot.rb_node; + while (to_iova(node)->pfn_hi < limit_pfn) + node = node->rb_right; + +search_left: + while (node->rb_left && to_iova(node->rb_left)->pfn_lo >= limit_pfn) + node = node->rb_left; + + if (!node->rb_left) + return node; + + next = node->rb_left; + while (next->rb_right) { + next = next->rb_right; + if (to_iova(next)->pfn_lo >= limit_pfn) { + node = next; + goto search_left; + } + } + + return node; +} + /* Insert the iova into domain rbtree by holding writer lock */ static void iova_insert_rbtree(struct rb_root *root, struct iova *iova, @@ -219,7 +256,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad, if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) { high_pfn = limit_pfn; low_pfn = retry_pfn; - curr = &iovad->anchor.node; + curr = iova_find_limit(iovad, limit_pfn); I see that it is now applied. 
However, alternatively could we just add a zero-length 32b boundary marker node for the 32b pfn restart point? curr_iova = to_iova(curr); goto retry; } ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu . ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 3/3] iommu/virtio: Enable x86 support
On 2021-03-16 19:16, Jean-Philippe Brucker wrote: With the VIOT support in place, x86 platforms can now use the virtio-iommu. The arm64 Kconfig selects IOMMU_DMA, while x86 IOMMU drivers select it themselves. Actually, now that both AMD and Intel are converted over, maybe it's finally time to punt that to x86 arch code to match arm64? Robin. Signed-off-by: Jean-Philippe Brucker --- drivers/iommu/Kconfig | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig index 2819b5c8ec30..ccca83ef2f06 100644 --- a/drivers/iommu/Kconfig +++ b/drivers/iommu/Kconfig @@ -400,8 +400,9 @@ config HYPERV_IOMMU config VIRTIO_IOMMU tristate "Virtio IOMMU driver" depends on VIRTIO - depends on ARM64 + depends on (ARM64 || X86) select IOMMU_API + select IOMMU_DMA if X86 select INTERVAL_TREE select ACPI_VIOT if ACPI help ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH v4 0/2] VMD MSI Remapping Bypass
On Thu, Mar 18, 2021 at 10:07:38AM +0100, j...@8bytes.org wrote: > On Wed, Mar 17, 2021 at 07:14:17PM +, Derrick, Jonathan wrote: > > Gentle reminder, for v5.13 ? > > This should go through the PCI tree, Bjorn? I will start queuing code next week, noted. Thanks, Lorenzo ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 2/2] iommu/iova: Improve restart logic
On 2021-03-18 11:38, John Garry wrote: On 10/03/2021 17:47, John Garry wrote: On 09/03/2021 15:55, John Garry wrote: On 05/03/2021 16:35, Robin Murphy wrote: Hi Robin, When restarting after searching below the cached node fails, resetting the start point to the anchor node is often overly pessimistic. If allocations are made with mixed limits - particularly in the case of the opportunistic 32-bit allocation for PCI devices - this could mean significant time wasted walking through the whole populated upper range just to reach the initial limit. Right We can improve on that by implementing a proper tree traversal to find the first node above the relevant limit, and set the exact start point. Thanks for this. However, as mentioned in the other thread, this does not help our performance regression Re: commit 4e89dce72521. And mentioning this "retry" approach again, in case it was missed, from my experiment on the affected HW, it also has generally a dreadfully low success rate - less than 1% for the 32b range retry. Retry rate is about 20%. So I am saying from this 20%, less than 1% of those succeed. Well yeah, in your particular case you're allocating from a heavily over-contended address space, so much of the time it is genuinely full. Plus you're primarily churning one or two sizes of IOVA, so there's a high chance that you will either allocate immediately from the cached node (after a previous free), or search the whole space and fail. In case it was missed, searching only some arbitrary subset of the space before giving up is not a good behaviour for an allocator to have in general. So since the retry means that we search through the complete pfn range most of the time (due to poor success rate), we should be able to do a better job at maintaining an accurate max alloc size, by calculating it from the range search, and not relying on max alloc failed or resetting it frequently. Hopefully that would mean that we're smarter about not trying the allocation. So I tried that out and we seem to be able to scrap back an appreciable amount of performance. Maybe 80% of original, with with another change, below. TBH if you really want to make allocation more efficient I think there are more radical changes that would be worth experimenting with, like using some form of augmented rbtree to also encode the amount of free space under each branch, or representing the free space in its own parallel tree, or whether some other structure entirely might be a better bet these days. And if you just want to make your thing acceptably fast, now I'm going to say stick a quirk somewhere to force the "forcedac" option on your platform ;) [...] @@ -219,7 +256,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad, if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) { high_pfn = limit_pfn; low_pfn = retry_pfn; - curr = &iovad->anchor.node; + curr = iova_find_limit(iovad, limit_pfn); I see that it is now applied. However, alternatively could we just add a zero-length 32b boundary marker node for the 32b pfn restart point? That would need special cases all over the place to prevent the marker getting merged into reservations or hit by lookups, and at worst break the ordering of the tree if a legitimate node straddles the boundary. I did consider having the insert/delete routines keep track of yet another cached node for whatever's currently the first thing above the 32-bit boundary, but I was worried that might be a bit too invasive. 
FWIW I'm currently planning to come back to this again when I have a bit more time, since the optimum thing to do (modulo replacing the entire algorithm...) is actually to make the second part of the search *upwards* from the cached node to the limit. Furthermore, to revive my arch/arm conversion I think we're realistically going to need a compatibility option for bottom-up allocation to avoid too many nasty surprises, so I'd like to generalise things to tackle both concerns at once. Thanks, Robin. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
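As a rough illustration of the augmented-rbtree idea mentioned above, each node could carry the largest free gap in its subtree so an allocation can skip whole branches that cannot possibly satisfy the request. Everything below (names, fields, helpers) is invented for illustration and is not a proposal for the actual iova code:

struct iova_aug {
	struct rb_node	node;
	unsigned long	pfn_lo, pfn_hi;
	unsigned long	gap_below;	/* free PFNs between this node and the
					 * next lower allocation */
	unsigned long	max_gap;	/* max gap_below over the whole subtree,
					 * maintained by rbtree_augmented.h
					 * callbacks on insert/erase/rotation */
};

/* Find the lowest node whose gap below it can hold 'size' PFNs. */
static struct iova_aug *find_gap(struct rb_node *node, unsigned long size)
{
	while (node) {
		struct iova_aug *it = rb_entry(node, struct iova_aug, node);
		struct iova_aug *left = node->rb_left ?
			rb_entry(node->rb_left, struct iova_aug, node) : NULL;

		if (left && left->max_gap >= size)
			node = node->rb_left;	/* a fitting gap exists at lower PFNs */
		else if (it->gap_below >= size)
			return it;		/* gap directly below this node fits */
		else
			node = node->rb_right;	/* only higher PFNs can fit it now */
	}
	return NULL;
}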
Re: [RFC PATCH 4/7] iommu/amd: Initial support for AMD IOMMU v2 page table
Hi Suravee, On Fri, Mar 12, 2021 at 03:04:08AM -0600, Suravee Suthikulpanit wrote: > @@ -503,6 +504,7 @@ struct amd_io_pgtable { > int mode; > u64 *root; > atomic64_t pt_root;/* pgtable root and pgtable mode */ > + struct mm_struct v2_mm; > }; A whole mm_struct is a bit too much when all we really need is an 8-byte page-table root pointer. > +static pte_t *fetch_pte(struct amd_io_pgtable *pgtable, > + unsigned long iova, > + unsigned long *page_size) > +{ > + int level; > + pte_t *ptep; > + > + ptep = lookup_address_in_mm(&pgtable->v2_mm, iova, &level); > + if (!ptep || pte_none(*ptep) || (level == PG_LEVEL_NONE)) > + return NULL; So you are re-using the in-kernel page-table building code. That saves some lines of code, but has several problems: 1) When you boot a kernel with this code on a machine with 5-level paging, the IOMMU code will build a 5-level page-table too, breaking IOMMU mappings. 2) You need a whole mm_struct per domain, which is big. 3) The existing macros for CPU page-tables require locking. For IOMMU page-tables this is not really necessary and might cause scalability issues. Overall I think you should write your own code to build a 4-level page-table and use cmpxchg64 to avoid the need for locking. Then things will not break when such a kernel is suddenly booted on a machine which has 5-level paging enabled. Regards, Joerg ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
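A hedged sketch of the kind of standalone walker being suggested: a fixed 4-level (48-bit) walk where missing table levels are installed with cmpxchg64() so no lock is needed. All names here (V2_IDX, V2_PRESENT, v2_alloc_pte) are invented for illustration, and real code would also have to handle large pages, attribute bits and freeing on teardown.

#define V2_PRESENT	BIT_ULL(0)
/* 9 bits of index per level, 4 KiB leaf pages, levels 3..0 */
#define V2_IDX(level, iova)	(((iova) >> (12 + 9 * (level))) & 0x1ffUL)

static u64 *v2_alloc_pte(u64 *root, unsigned long iova, gfp_t gfp)
{
	u64 *table = root;
	int level;

	/* Levels 3..1 are table levels; level 0 holds the leaf PTE. */
	for (level = 3; level > 0; level--) {
		u64 *pte = &table[V2_IDX(level, iova)];
		u64 old = READ_ONCE(*pte);

		if (!(old & V2_PRESENT)) {
			u64 *new_table = (u64 *)get_zeroed_page(gfp);

			if (!new_table)
				return NULL;
			/*
			 * Install the new level; on a lost race just free it
			 * and descend into whatever the winner installed.
			 */
			if (cmpxchg64(pte, old, virt_to_phys(new_table) |
				      V2_PRESENT) != old)
				free_page((unsigned long)new_table);
		}
		table = phys_to_virt(READ_ONCE(*pte) & PAGE_MASK);
	}
	return &table[V2_IDX(0, iova)];
}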
Re: [RFC PATCH 5/7] iommu/amd: Add support for Guest IO protection
On Fri, Mar 12, 2021 at 03:04:09AM -0600, Suravee Suthikulpanit wrote: > @@ -519,6 +521,7 @@ struct protection_domain { > spinlock_t lock;/* mostly used to lock the page table*/ > u16 id; /* the domain id written to the device table */ > int glx;/* Number of levels for GCR3 table */ > + bool giov; /* guest IO protection domain */ Could this be turned into a flag? Regards, Joerg ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH 6/7] iommu/amd: Introduce amd_iommu_pgtable command-line option
On Fri, Mar 12, 2021 at 03:04:10AM -0600, Suravee Suthikulpanit wrote: > To allow specification whether to use v1 or v2 IOMMU pagetable for > DMA remapping when calling kernel DMA-API. > > Signed-off-by: Suravee Suthikulpanit > --- > Documentation/admin-guide/kernel-parameters.txt | 6 ++ > drivers/iommu/amd/init.c| 15 +++ > 2 files changed, 21 insertions(+) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > b/Documentation/admin-guide/kernel-parameters.txt > index 04545725f187..466e807369ea 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -319,6 +319,12 @@ >This mode requires kvm-amd.avic=1. >(Default when IOMMU HW support is present.) > > + amd_iommu_pgtable= [HW,X86-64] > + Specifies one of the following AMD IOMMU page table to > + be used for DMA remapping for DMA-API: > + v1 - Use v1 page table (Default) > + v2 - Use v2 page table Any reason v2 can not be the default when it is supported by the IOMMU? ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [RFC PATCH 7/7] iommu/amd: Add support for using AMD IOMMU v2 page table for DMA-API
On Fri, Mar 12, 2021 at 03:04:11AM -0600, Suravee Suthikulpanit wrote: > Introduce init function for setting up DMA domain for DMA-API with > the IOMMU v2 page table. > > Signed-off-by: Suravee Suthikulpanit > --- > drivers/iommu/amd/iommu.c | 21 + > 1 file changed, 21 insertions(+) > > diff --git a/drivers/iommu/amd/iommu.c b/drivers/iommu/amd/iommu.c > index e29ece6e1e68..bd26de8764bd 100644 > --- a/drivers/iommu/amd/iommu.c > +++ b/drivers/iommu/amd/iommu.c > @@ -1937,6 +1937,24 @@ static int protection_domain_init_v1(struct > protection_domain *domain, int mode) > return 0; > } > > +static int protection_domain_init_v2(struct protection_domain *domain) > +{ > + spin_lock_init(&domain->lock); > + domain->id = domain_id_alloc(); > + if (!domain->id) > + return -ENOMEM; > + INIT_LIST_HEAD(&domain->dev_list); > + > + domain->giov = true; > + > + if (amd_iommu_pgtable == AMD_IOMMU_V2 && > + domain_enable_v2(domain, 1, false)) { > + return -ENOMEM; > + } > + > + return 0; > +} > + You also need to advertise a different aperture size and a different pgsize-bitmap. The host page-table can only map a 48 bit address space, not a 64 bit one like with v1 page-tables. Regards, Joerg ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
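For what that review comment would mean in code, a hedged sketch of the extra initialisation in the v2 path is below. The geometry and pgsize_bitmap fields are the generic struct iommu_domain ones, and the exact page-size set (4K/2M/1G is assumed here) would need to be confirmed against what the v2 format and the hardware actually support.

	/* somewhere in the v2 domain set-up path, e.g. after
	 * protection_domain_init_v2() succeeds */
	domain->domain.geometry.aperture_start = 0;
	domain->domain.geometry.aperture_end   = DMA_BIT_MASK(48);	/* 48-bit input space */
	domain->domain.geometry.force_aperture = true;

	/* CPU-style page sizes rather than the v1 "any power of two" set */
	domain->domain.pgsize_bitmap = SZ_4K | SZ_2M | SZ_1G;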
Re: [PATCH 2/2] iommu/iova: Improve restart logic
Well yeah, in your particular case you're allocating from a heavily over-contended address space, so much of the time it is genuinely full. Plus you're primarily churning one or two sizes of IOVA, so there's a high chance that you will either allocate immediately from the cached node (after a previous free), or search the whole space and fail. In case it was missed, searching only some arbitrary subset of the space before giving up is not a good behaviour for an allocator to have in general. So since the retry means that we search through the complete pfn range most of the time (due to poor success rate), we should be able to do a better job at maintaining an accurate max alloc size, by calculating it from the range search, and not relying on max alloc failed or resetting it frequently. Hopefully that would mean that we're smarter about not trying the allocation. So I tried that out and we seem to be able to scrape back an appreciable amount of performance. Maybe 80% of the original, with another change, below. TBH if you really want to make allocation more efficient I think there are more radical changes that would be worth experimenting with, like using some form of augmented rbtree to also encode the amount of free space under each branch, or representing the free space in its own parallel tree, or whether some other structure entirely might be a better bet these days. And if you just want to make your thing acceptably fast, now I'm going to say stick a quirk somewhere to force the "forcedac" option on your platform ;) Easier said than done :) But still, I'd like to just be able to cache all IOVA sizes for my DMA engine, so we should not have to go near the RB tree often. I have put together a series to allow the upper limit of the rcache range to be increased per domain. So naturally that gives better performance than we originally had. I don't want to prejudice the solution by saying what I think of it now, so will send it out... [...] @@ -219,7 +256,7 @@ static int __alloc_and_insert_iova_range(struct iova_domain *iovad, if (low_pfn == iovad->start_pfn && retry_pfn < limit_pfn) { high_pfn = limit_pfn; low_pfn = retry_pfn; - curr = &iovad->anchor.node; + curr = iova_find_limit(iovad, limit_pfn); I see that it is now applied. However, alternatively could we just add a zero-length 32b boundary marker node for the 32b pfn restart point? That would need special cases all over the place to prevent the marker getting merged into reservations or hit by lookups, and at worst break the ordering of the tree if a legitimate node straddles the boundary. I did consider having the insert/delete routines keep track of yet another cached node for whatever's currently the first thing above the 32-bit boundary, but I was worried that might be a bit too invasive. Yeah, I did think of that. I don't think that it would have too much overhead. FWIW I'm currently planning to come back to this again when I have a bit more time, since the optimum thing to do (modulo replacing the entire algorithm...) is actually to make the second part of the search *upwards* from the cached node to the limit. Furthermore, to revive my arch/arm conversion I think we're realistically going to need a compatibility option for bottom-up allocation to avoid too many nasty surprises, so I'd like to generalise things to tackle both concerns at once. Thanks, John ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
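For illustration only (this is not the applied patch), the "calculate max alloc size from the range search" idea amounts to remembering the largest free gap seen while walking the allocated ranges, so that later, larger requests can fail fast; in real code this bookkeeping would ride along with the search that is already failing rather than being a separate walk:

static unsigned long iova_largest_free_gap(struct iova_domain *iovad)
{
	struct rb_node *node = rb_last(&iovad->rbroot);
	unsigned long max_gap = 0;

	while (node) {
		struct rb_node *prev = rb_prev(node);
		struct iova *cur = rb_entry(node, struct iova, node);

		if (prev) {
			struct iova *below = rb_entry(prev, struct iova, node);

			max_gap = max(max_gap, cur->pfn_lo - below->pfn_hi - 1);
		} else {
			max_gap = max(max_gap, cur->pfn_lo - iovad->start_pfn);
		}
		node = prev;
	}
	/* cache this; future requests larger than it can bail out early */
	return max_gap;
}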
swiotlb cleanups v3
Hi Konrad, this series contains a bunch of swiotlb cleanups, mostly to reduce the amount of internals exposed to code outside of swiotlb.c, which should help to prepare for supporting multiple different bounce buffer pools. Changes since v2: - fix a bisection hazard that did not allocate the alloc_size array - dropped all patches already merged Changes since v1: - rebased to v5.12-rc1 - a few more cleanups - merge and forward port the patch from Claire to move all the global variables into a struct to prepare for multiple instances ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH 1/3] swiotlb: move global variables into a new io_tlb_mem structure
From: Claire Chang Added a new struct, io_tlb_mem, as the IO TLB memory pool descriptor and moved relevant global variables into that struct. This will be useful later to allow for restricted DMA pool. Signed-off-by: Claire Chang [hch: rebased] Signed-off-by: Christoph Hellwig --- drivers/xen/swiotlb-xen.c | 2 +- include/linux/swiotlb.h | 43 - kernel/dma/swiotlb.c | 354 ++ 3 files changed, 206 insertions(+), 193 deletions(-) diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c index 4ecfce2c6f7263..5329ad54a5f34e 100644 --- a/drivers/xen/swiotlb-xen.c +++ b/drivers/xen/swiotlb-xen.c @@ -548,7 +548,7 @@ xen_swiotlb_sync_sg_for_device(struct device *dev, struct scatterlist *sgl, static int xen_swiotlb_dma_supported(struct device *hwdev, u64 mask) { - return xen_phys_to_dma(hwdev, io_tlb_end - 1) <= mask; + return xen_phys_to_dma(hwdev, io_tlb_default_mem.end - 1) <= mask; } const struct dma_map_ops xen_swiotlb_dma_ops = { diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index 0696bdc8072e97..5ec5378b17c333 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -6,6 +6,7 @@ #include #include #include +#include struct device; struct page; @@ -61,11 +62,49 @@ dma_addr_t swiotlb_map(struct device *dev, phys_addr_t phys, #ifdef CONFIG_SWIOTLB extern enum swiotlb_force swiotlb_force; -extern phys_addr_t io_tlb_start, io_tlb_end; + +/** + * struct io_tlb_mem - IO TLB Memory Pool Descriptor + * + * @start: The start address of the swiotlb memory pool. Used to do a quick + * range check to see if the memory was in fact allocated by this + * API. + * @end: The end address of the swiotlb memory pool. Used to do a quick + * range check to see if the memory was in fact allocated by this + * API. + * @nslabs:The number of IO TLB blocks (in groups of 64) between @start and + * @end. This is command line adjustable via setup_io_tlb_npages. + * @used: The number of used IO TLB block. + * @list: The free list describing the number of free entries available + * from each index. + * @index: The index to start searching in the next round. + * @orig_addr: The original address corresponding to a mapped entry. + * @alloc_size:Size of the allocated buffer. + * @lock: The lock to protect the above data structures in the map and + * unmap calls. + * @debugfs: The dentry to debugfs. + * @late_alloc:%true if allocated using the page allocator + */ +struct io_tlb_mem { + phys_addr_t start; + phys_addr_t end; + unsigned long nslabs; + unsigned long used; + unsigned int *list; + unsigned int index; + phys_addr_t *orig_addr; + size_t *alloc_size; + spinlock_t lock; + struct dentry *debugfs; + bool late_alloc; +}; +extern struct io_tlb_mem io_tlb_default_mem; static inline bool is_swiotlb_buffer(phys_addr_t paddr) { - return paddr >= io_tlb_start && paddr < io_tlb_end; + struct io_tlb_mem *mem = &io_tlb_default_mem; + + return paddr >= mem->start && paddr < mem->end; } void __init swiotlb_exit(void); diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 35e24f0ff8b207..d9c097f0f78cec 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -59,32 +59,11 @@ */ #define IO_TLB_MIN_SLABS ((1<<20) >> IO_TLB_SHIFT) -enum swiotlb_force swiotlb_force; - -/* - * Used to do a quick range check in swiotlb_tbl_unmap_single and - * swiotlb_tbl_sync_single_*, to see if the memory was in fact allocated by this - * API. - */ -phys_addr_t io_tlb_start, io_tlb_end; - -/* - * The number of IO TLB blocks (in groups of 64) between io_tlb_start and - * io_tlb_end. 
This is command line adjustable via setup_io_tlb_npages. - */ -static unsigned long io_tlb_nslabs; +#define INVALID_PHYS_ADDR (~(phys_addr_t)0) -/* - * The number of used IO TLB block - */ -static unsigned long io_tlb_used; +enum swiotlb_force swiotlb_force; -/* - * This is a free list describing the number of free entries available from - * each index - */ -static unsigned int *io_tlb_list; -static unsigned int io_tlb_index; +struct io_tlb_mem io_tlb_default_mem; /* * Max segment that we can provide which (if pages are contingous) will @@ -92,32 +71,15 @@ static unsigned int io_tlb_index; */ static unsigned int max_segment; -/* - * We need to save away the original address corresponding to a mapped entry - * for the sync operations. - */ -#define INVALID_PHYS_ADDR (~(phys_addr_t)0) -static phys_addr_t *io_tlb_orig_addr; - -/* - * The mapped buffer's size should be validated during a sync operation. - */ -static size_t *io_tlb_alloc_size; - -/* - * Protect the above data structures in the map and unmap calls - */ -static DEFINE_SPINLOCK(io_tlb_lock); - -static int late_alloc; - static int __init setup_io_tl
[PATCH 2/3] swiotlb: dynamically allocate io_tlb_default_mem
Instead of allocating ->list and ->orig_addr separately just do one dynamic allocation for the actual io_tlb_mem structure. This simplifies a lot of the initialization code, and also allows to just check io_tlb_default_mem to see if swiotlb is in use. Signed-off-by: Christoph Hellwig --- drivers/xen/swiotlb-xen.c | 22 +-- include/linux/swiotlb.h | 18 ++- kernel/dma/swiotlb.c | 306 -- 3 files changed, 117 insertions(+), 229 deletions(-) diff --git a/drivers/xen/swiotlb-xen.c b/drivers/xen/swiotlb-xen.c index 5329ad54a5f34e..4c89afc0df6289 100644 --- a/drivers/xen/swiotlb-xen.c +++ b/drivers/xen/swiotlb-xen.c @@ -158,17 +158,14 @@ static const char *xen_swiotlb_error(enum xen_swiotlb_err err) int __ref xen_swiotlb_init(void) { enum xen_swiotlb_err m_ret = XEN_SWIOTLB_UNKNOWN; - unsigned long nslabs, bytes, order; - unsigned int repeat = 3; + unsigned long bytes = swiotlb_size_or_default(); + unsigned long nslabs = bytes >> IO_TLB_SHIFT; + unsigned int order, repeat = 3; int rc = -ENOMEM; char *start; - nslabs = swiotlb_nr_tbl(); - if (!nslabs) - nslabs = DEFAULT_NSLABS; retry: m_ret = XEN_SWIOTLB_ENOMEM; - bytes = nslabs << IO_TLB_SHIFT; order = get_order(bytes); /* @@ -221,19 +218,16 @@ int __ref xen_swiotlb_init(void) #ifdef CONFIG_X86 void __init xen_swiotlb_init_early(void) { - unsigned long nslabs, bytes; + unsigned long bytes = swiotlb_size_or_default(); + unsigned long nslabs = bytes >> IO_TLB_SHIFT; unsigned int repeat = 3; char *start; int rc; - nslabs = swiotlb_nr_tbl(); - if (!nslabs) - nslabs = DEFAULT_NSLABS; retry: /* * Get IO TLB memory from any location. */ - bytes = nslabs << IO_TLB_SHIFT; start = memblock_alloc(PAGE_ALIGN(bytes), PAGE_SIZE); if (!start) panic("%s: Failed to allocate %lu bytes align=0x%lx\n", @@ -248,8 +242,8 @@ void __init xen_swiotlb_init_early(void) if (repeat--) { /* Min is 2MB */ nslabs = max(1024UL, (nslabs >> 1)); - pr_info("Lowering to %luMB\n", - (nslabs << IO_TLB_SHIFT) >> 20); + bytes = nslabs << IO_TLB_SHIFT; + pr_info("Lowering to %luMB\n", bytes >> 20); goto retry; } panic("%s (rc:%d)", xen_swiotlb_error(XEN_SWIOTLB_EFIXUP), rc); @@ -548,7 +542,7 @@ xen_swiotlb_sync_sg_for_device(struct device *dev, struct scatterlist *sgl, static int xen_swiotlb_dma_supported(struct device *hwdev, u64 mask) { - return xen_phys_to_dma(hwdev, io_tlb_default_mem.end - 1) <= mask; + return xen_phys_to_dma(hwdev, io_tlb_default_mem->end - 1) <= mask; } const struct dma_map_ops xen_swiotlb_dma_ops = { diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index 5ec5378b17c333..63f7a63f61d098 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -90,28 +90,30 @@ struct io_tlb_mem { phys_addr_t end; unsigned long nslabs; unsigned long used; - unsigned int *list; unsigned int index; - phys_addr_t *orig_addr; - size_t *alloc_size; spinlock_t lock; struct dentry *debugfs; bool late_alloc; + struct io_tlb_slot { + phys_addr_t orig_addr; + size_t alloc_size; + unsigned int list; + } slots[]; }; -extern struct io_tlb_mem io_tlb_default_mem; +extern struct io_tlb_mem *io_tlb_default_mem; static inline bool is_swiotlb_buffer(phys_addr_t paddr) { - struct io_tlb_mem *mem = &io_tlb_default_mem; + struct io_tlb_mem *mem = io_tlb_default_mem; - return paddr >= mem->start && paddr < mem->end; + return mem && paddr >= mem->start && paddr < mem->end; } void __init swiotlb_exit(void); unsigned int swiotlb_max_segment(void); size_t swiotlb_max_mapping_size(struct device *dev); bool is_swiotlb_active(void); -void __init swiotlb_adjust_size(unsigned long 
new_size); +void __init swiotlb_adjust_size(unsigned long size); #else #define swiotlb_force SWIOTLB_NO_FORCE static inline bool is_swiotlb_buffer(phys_addr_t paddr) @@ -135,7 +137,7 @@ static inline bool is_swiotlb_active(void) return false; } -static inline void swiotlb_adjust_size(unsigned long new_size) +static inline void swiotlb_adjust_size(unsigned long size) { } #endif /* CONFIG_SWIOTLB */ diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index d9c097f0f78cec..13de669a9b4681 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -63,7 +63,7 @@ enum swiotlb_force swiotlb_force; -struct io_tlb_mem io_tlb_default_mem; +struct io_tlb_mem *io_tlb_default_mem; /* * Max segment that we can
[PATCH 3/3] swiotlb: remove swiotlb_nr_tbl
All callers just use it to check if swiotlb is active at all, for which they can just use is_swiotlb_active. In the longer run drivers need to stop using is_swiotlb_active as well, but let's do the simple step first. Signed-off-by: Christoph Hellwig --- drivers/gpu/drm/i915/gem/i915_gem_internal.c | 2 +- drivers/gpu/drm/nouveau/nouveau_ttm.c| 2 +- drivers/pci/xen-pcifront.c | 2 +- include/linux/swiotlb.h | 1 - kernel/dma/swiotlb.c | 7 +-- 5 files changed, 4 insertions(+), 10 deletions(-) diff --git a/drivers/gpu/drm/i915/gem/i915_gem_internal.c b/drivers/gpu/drm/i915/gem/i915_gem_internal.c index ad22f42541bda6..a9d65fc8aa0eab 100644 --- a/drivers/gpu/drm/i915/gem/i915_gem_internal.c +++ b/drivers/gpu/drm/i915/gem/i915_gem_internal.c @@ -42,7 +42,7 @@ static int i915_gem_object_get_pages_internal(struct drm_i915_gem_object *obj) max_order = MAX_ORDER; #ifdef CONFIG_SWIOTLB - if (swiotlb_nr_tbl()) { + if (is_swiotlb_active()) { unsigned int max_segment; max_segment = swiotlb_max_segment(); diff --git a/drivers/gpu/drm/nouveau/nouveau_ttm.c b/drivers/gpu/drm/nouveau/nouveau_ttm.c index a37bc3d7b38b3b..9662522aa0664a 100644 --- a/drivers/gpu/drm/nouveau/nouveau_ttm.c +++ b/drivers/gpu/drm/nouveau/nouveau_ttm.c @@ -321,7 +321,7 @@ nouveau_ttm_init(struct nouveau_drm *drm) } #if IS_ENABLED(CONFIG_SWIOTLB) && IS_ENABLED(CONFIG_X86) - need_swiotlb = !!swiotlb_nr_tbl(); + need_swiotlb = is_swiotlb_active(); #endif ret = ttm_bo_device_init(&drm->ttm.bdev, &nouveau_bo_driver, diff --git a/drivers/pci/xen-pcifront.c b/drivers/pci/xen-pcifront.c index 2d75026482197d..b7a8f3a1921f83 100644 --- a/drivers/pci/xen-pcifront.c +++ b/drivers/pci/xen-pcifront.c @@ -693,7 +693,7 @@ static int pcifront_connect_and_init_dma(struct pcifront_device *pdev) spin_unlock(&pcifront_dev_lock); - if (!err && !swiotlb_nr_tbl()) { + if (!err && !is_swiotlb_active()) { err = pci_xen_swiotlb_init_late(); if (err) dev_err(&pdev->xdev->dev, "Could not setup SWIOTLB!\n"); diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index 63f7a63f61d098..216854a5e5134b 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -37,7 +37,6 @@ enum swiotlb_force { extern void swiotlb_init(int verbose); int swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose); -extern unsigned long swiotlb_nr_tbl(void); unsigned long swiotlb_size_or_default(void); extern int swiotlb_late_init_with_tbl(char *tlb, unsigned long nslabs); extern int swiotlb_late_init_with_default_size(size_t default_size); diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index 13de669a9b4681..539c76beb52e07 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -94,12 +94,6 @@ setup_io_tlb_npages(char *str) } early_param("swiotlb", setup_io_tlb_npages); -unsigned long swiotlb_nr_tbl(void) -{ - return io_tlb_default_mem ? io_tlb_default_mem->nslabs : 0; -} -EXPORT_SYMBOL_GPL(swiotlb_nr_tbl); - unsigned int swiotlb_max_segment(void) { return io_tlb_default_mem ? max_segment : 0; @@ -652,6 +646,7 @@ bool is_swiotlb_active(void) { return io_tlb_default_mem != NULL; } +EXPORT_SYMBOL_GPL(is_swiotlb_active); #ifdef CONFIG_DEBUG_FS -- 2.30.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: A problem of Intel IOMMU hardware ?
> On Mar 18, 2021, at 2:25 AM, Longpeng (Mike, Cloud Infrastructure Service > Product Dept.) wrote: > > > >> -Original Message- >> From: Tian, Kevin [mailto:kevin.t...@intel.com] >> Sent: Thursday, March 18, 2021 4:56 PM >> To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) >> ; Nadav Amit >> Cc: chenjiashang ; David Woodhouse >> ; iommu@lists.linux-foundation.org; LKML >> ; alex.william...@redhat.com; Gonglei (Arei) >> ; w...@kernel.org >> Subject: RE: A problem of Intel IOMMU hardware ? >> >>> From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) >>> >>> -Original Message- From: Tian, Kevin [mailto:kevin.t...@intel.com] Sent: Thursday, March 18, 2021 4:27 PM To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) ; Nadav Amit Cc: chenjiashang ; David Woodhouse ; iommu@lists.linux-foundation.org; LKML ; alex.william...@redhat.com; Gonglei >>> (Arei) ; w...@kernel.org Subject: RE: A problem of Intel IOMMU hardware ? > From: iommu On Behalf > Of Longpeng (Mike, Cloud Infrastructure Service Product Dept.) > >> 2. Consider ensuring that the problem is not somehow related to >> queued invalidations. Try to use __iommu_flush_iotlb() instead >> of qi_flush_iotlb(). >> > > I tried to force to use __iommu_flush_iotlb(), but maybe something > wrong, the system crashed, so I prefer to lower the priority of > this >>> operation. > The VT-d spec clearly says that register-based invalidation can be used only >>> when queued-invalidations are not enabled. Intel-IOMMU driver doesn't provide >>> an option to disable queued-invalidation though, when the hardware is >>> capable. If you really want to try, tweak the code in intel_iommu_init_qi. >>> >>> Hi Kevin, >>> >>> Thanks to point out this. Do you have any ideas about this problem ? I >>> tried to descript the problem much clear in my reply to Alex, hope you >>> could have a look if you're interested. >>> >> >> btw I saw you used 4.18 kernel in this test. What about latest kernel? >> > > Not test yet. It's hard to upgrade kernel in our environment. > >> Also one way to separate sw/hw bug is to trace the low level interface (e.g., >> qi_flush_iotlb) which actually sends invalidation descriptors to the IOMMU >> hardware. Check the window between b) and c) and see whether the software >> does >> the right thing as expected there. >> > > We add some log in iommu driver these days, the software seems fine. But we > didn't look inside the qi_submit_sync yet, I'll try it tonight. So here is my guess: Intel probably used as a basis for the IOTLB an implementation of some other (regular) TLB design. Intel SDM says regarding TLBs (4.10.4.2 “Recommended Invalidation”): "Software wishing to prevent this uncertainty should not write to a paging-structure entry in a way that would change, for any linear address, both the page size and either the page frame, access rights, or other attributes.” Now the aforementioned uncertainty is a bit different (multiple *valid* translations of a single address). Yet, perhaps this is yet another thing that might happen. From a brief look on the handling of MMU (not IOMMU) hugepages in Linux, indeed the PMD is first cleared and flushed before a new valid PMD is set. This is possible for MMUs since they allow the software to handle spurious page-faults gracefully. This is not the case for the IOMMU though (without PRI). Not sure this explains everything though. If that is the problem, then during a mapping that changes page-sizes, a TLB flush is needed, similarly to the one Longpeng did manually. 
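If that theory is right, the ordering that would avoid the problem when a mapping changes page size looks roughly like this; the helper names are placeholders for whatever the real Intel mapping path uses, and the sketch is only meant to illustrate where the extra flush has to go:

static void remap_with_new_page_size(struct dmar_domain *domain,
				     unsigned long iov_pfn,
				     unsigned long nr_pages)
{
	/* 1) tear down the old (e.g. 4K) entries */
	clear_old_ptes(domain, iov_pfn, nr_pages);

	/*
	 * 2) flush the IOTLB and paging-structure caches *before* the
	 *    replacement entry becomes visible, so the hardware can never
	 *    hold two valid translations of the same IOVA at different
	 *    page sizes
	 */
	flush_iotlb_range(domain, iov_pfn, nr_pages);

	/* 3) only now install the new (e.g. 2M) entry */
	install_new_pte(domain, iov_pfn, nr_pages);
}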
Re: [PATCH 1/3] ACPICA: iASL: Add definitions for the VIOT table
Hi Jean, On 3/16/21 8:16 PM, Jean-Philippe Brucker wrote: > Just here for reference, don't merge! > > The actual commits will be pulled from the next ACPICA release. > I squashed the three relevant commits: > > ACPICA commit fc4e33319c1ee08f20f5c44853dd8426643f6dfd > ACPICA commit 2197e354fb5dcafaddd2016ffeb0620e5bc3d5e2 > ACPICA commit 856a96fdf4b51b2b8da17529df0255e6f51f1b5b > > Link: https://github.com/acpica/acpica/commit/fc4e3331 > Link: https://github.com/acpica/acpica/commit/2197e354 > Link: https://github.com/acpica/acpica/commit/856a96fd > Signed-off-by: Bob Moore > Signed-off-by: Jean-Philippe Brucker > --- > include/acpi/actbl3.h | 67 +++ > 1 file changed, 67 insertions(+) > > diff --git a/include/acpi/actbl3.h b/include/acpi/actbl3.h > index df5f4b27f3aa..09d15898e9a8 100644 > --- a/include/acpi/actbl3.h > +++ b/include/acpi/actbl3.h > @@ -33,6 +33,7 @@ > #define ACPI_SIG_TCPA "TCPA" /* Trusted Computing Platform > Alliance table */ > #define ACPI_SIG_TPM2 "TPM2" /* Trusted Platform Module 2.0 > H/W interface table */ > #define ACPI_SIG_UEFI "UEFI" /* Uefi Boot Optimization Table > */ > +#define ACPI_SIG_VIOT "VIOT" /* Virtual I/O Translation > Table */ > #define ACPI_SIG_WAET "WAET" /* Windows ACPI Emulated > devices Table */ > #define ACPI_SIG_WDAT "WDAT" /* Watchdog Action Table */ > #define ACPI_SIG_WDDT "WDDT" /* Watchdog Timer Description > Table */ > @@ -483,6 +484,72 @@ struct acpi_table_uefi { > u16 data_offset;/* Offset of remaining data in table */ > }; > > +/*** > + * > + * VIOT - Virtual I/O Translation Table > + *Version 1 For other tables I see Conforms to ../.. Shouldn't we have such section too > + * > + > **/ > + > +struct acpi_table_viot { > + struct acpi_table_header header;/* Common ACPI table header */ > + u16 node_count; > + u16 node_offset; > + u8 reserved[8]; > +}; > + > +/* VIOT subtable header */ > + > +struct acpi_viot_header { > + u8 type; > + u8 reserved; > + u16 length; > +}; > + > +/* Values for Type field above */ > + > +enum acpi_viot_node_type { > + ACPI_VIOT_NODE_PCI_RANGE = 0x01, > + ACPI_VIOT_NODE_MMIO = 0x02, > + ACPI_VIOT_NODE_VIRTIO_IOMMU_PCI = 0x03, > + ACPI_VIOT_NODE_VIRTIO_IOMMU_MMIO = 0x04, > + ACPI_VIOT_RESERVED = 0x05 > +}; > + > +/* VIOT subtables */ > + > +struct acpi_viot_pci_range { > + struct acpi_viot_header header; > + u32 endpoint_start; > + u16 segment_start; > + u16 segment_end; > + u16 bdf_start; > + u16 bdf_end; > + u16 output_node; > + u8 reserved[6]; > +}; > + > +struct acpi_viot_mmio { > + struct acpi_viot_header header; > + u32 endpoint; > + u64 base_address; > + u16 output_node; > + u8 reserved[6]; > +}; > + > +struct acpi_viot_virtio_iommu_pci { > + struct acpi_viot_header header; > + u16 segment; > + u16 bdf; > + u8 reserved[8]; > +}; > + > +struct acpi_viot_virtio_iommu_mmio { > + struct acpi_viot_header header; > + u8 reserved[4]; > + u64 base_address; > +}; > + > > /*** > * > * WAET - Windows ACPI Emulated devices Table > Besides Reviewed-by: Eric Auger Thanks Eric ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 3/3] iommu/virtio: Enable x86 support
On Tue, Mar 16, 2021 at 08:16:54PM +0100, Jean-Philippe Brucker wrote: > With the VIOT support in place, x86 platforms can now use the > virtio-iommu. > > The arm64 Kconfig selects IOMMU_DMA, while x86 IOMMU drivers select it > themselves. > > Signed-off-by: Jean-Philippe Brucker Acked-by: Michael S. Tsirkin > --- > drivers/iommu/Kconfig | 3 ++- > 1 file changed, 2 insertions(+), 1 deletion(-) > > diff --git a/drivers/iommu/Kconfig b/drivers/iommu/Kconfig > index 2819b5c8ec30..ccca83ef2f06 100644 > --- a/drivers/iommu/Kconfig > +++ b/drivers/iommu/Kconfig > @@ -400,8 +400,9 @@ config HYPERV_IOMMU > config VIRTIO_IOMMU > tristate "Virtio IOMMU driver" > depends on VIRTIO > - depends on ARM64 > + depends on (ARM64 || X86) > select IOMMU_API > + select IOMMU_DMA if X86 Would it hurt to just select unconditionally? Seems a bit cleaner ... > select INTERVAL_TREE > select ACPI_VIOT if ACPI > help > -- > 2.30.2 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
[PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB
It may be useful to disable the SWIOTLB completely for testing or when a platform is known not to have any DRAM addressing limitations what so ever. Signed-off-by: Florian Fainelli --- Documentation/admin-guide/kernel-parameters.txt | 1 + include/linux/swiotlb.h | 1 + kernel/dma/swiotlb.c| 9 + 3 files changed, 11 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 04545725f187..b0223e48921e 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5278,6 +5278,7 @@ force -- force using of bounce buffers even if they wouldn't be automatically used by the kernel noforce -- Never use bounce buffers (for debugging) + off -- Completely disable SWIOTLB switches= [HW,M68k] diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index 5857a937c637..23f86243defe 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -15,6 +15,7 @@ enum swiotlb_force { SWIOTLB_NORMAL, /* Default - depending on HW DMA mask etc. */ SWIOTLB_FORCE, /* swiotlb=force */ SWIOTLB_NO_FORCE, /* swiotlb=noforce */ + SWIOTLB_OFF,/* swiotlb=off */ }; /* diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index c10e855a03bc..d7a4a789c7d3 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -126,6 +126,8 @@ setup_io_tlb_npages(char *str) } else if (!strcmp(str, "noforce")) { swiotlb_force = SWIOTLB_NO_FORCE; io_tlb_nslabs = 1; + } else if (!strcmp(str, "off")) { + swiotlb_force = SWIOTLB_OFF; } return 0; @@ -229,6 +231,9 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose) unsigned long i, bytes; size_t alloc_size; + if (swiotlb_force == SWIOTLB_OFF) + return 0; + bytes = nslabs << IO_TLB_SHIFT; io_tlb_nslabs = nslabs; @@ -284,6 +289,9 @@ swiotlb_init(int verbose) unsigned char *vstart; unsigned long bytes; + if (swiotlb_force == SWIOTLB_OFF) + goto out; + if (!io_tlb_nslabs) { io_tlb_nslabs = (default_size >> IO_TLB_SHIFT); io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE); @@ -302,6 +310,7 @@ swiotlb_init(int verbose) io_tlb_start = 0; } pr_warn("Cannot allocate buffer"); +out: no_iotlb_memory = true; } -- 2.25.1 ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB
On 3/18/2021 12:18 PM, Florian Fainelli wrote: > It may be useful to disable the SWIOTLB completely for testing or when a > platform is known not to have any DRAM addressing limitations what so > ever. > > Signed-off-by: Florian Fainelli Christoph, in addition to this change, how would you feel if we qualified the swiotlb_init() in arch/arm/mm/init.c with a: if (memblock_end_of_DRAM() >= SZ_4G) swiotlb_init(1) right now this is made unconditional whenever ARM_LPAE is enabled which is the case for the platforms I maintain (ARCH_BRCMSTB) however we do not really need a SWIOTLB so long as the largest DRAM physical address does not exceed 4GB AFAICT. Thanks! > --- > Documentation/admin-guide/kernel-parameters.txt | 1 + > include/linux/swiotlb.h | 1 + > kernel/dma/swiotlb.c| 9 + > 3 files changed, 11 insertions(+) > > diff --git a/Documentation/admin-guide/kernel-parameters.txt > b/Documentation/admin-guide/kernel-parameters.txt > index 04545725f187..b0223e48921e 100644 > --- a/Documentation/admin-guide/kernel-parameters.txt > +++ b/Documentation/admin-guide/kernel-parameters.txt > @@ -5278,6 +5278,7 @@ > force -- force using of bounce buffers even if they >wouldn't be automatically used by the kernel > noforce -- Never use bounce buffers (for debugging) > + off -- Completely disable SWIOTLB > > switches= [HW,M68k] > > diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h > index 5857a937c637..23f86243defe 100644 > --- a/include/linux/swiotlb.h > +++ b/include/linux/swiotlb.h > @@ -15,6 +15,7 @@ enum swiotlb_force { > SWIOTLB_NORMAL, /* Default - depending on HW DMA mask etc. */ > SWIOTLB_FORCE, /* swiotlb=force */ > SWIOTLB_NO_FORCE, /* swiotlb=noforce */ > + SWIOTLB_OFF,/* swiotlb=off */ > }; > > /* > diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c > index c10e855a03bc..d7a4a789c7d3 100644 > --- a/kernel/dma/swiotlb.c > +++ b/kernel/dma/swiotlb.c > @@ -126,6 +126,8 @@ setup_io_tlb_npages(char *str) > } else if (!strcmp(str, "noforce")) { > swiotlb_force = SWIOTLB_NO_FORCE; > io_tlb_nslabs = 1; > + } else if (!strcmp(str, "off")) { > + swiotlb_force = SWIOTLB_OFF; > } > > return 0; > @@ -229,6 +231,9 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long > nslabs, int verbose) > unsigned long i, bytes; > size_t alloc_size; > > + if (swiotlb_force == SWIOTLB_OFF) > + return 0; > + > bytes = nslabs << IO_TLB_SHIFT; > > io_tlb_nslabs = nslabs; > @@ -284,6 +289,9 @@ swiotlb_init(int verbose) > unsigned char *vstart; > unsigned long bytes; > > + if (swiotlb_force == SWIOTLB_OFF) > + goto out; > + > if (!io_tlb_nslabs) { > io_tlb_nslabs = (default_size >> IO_TLB_SHIFT); > io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE); > @@ -302,6 +310,7 @@ swiotlb_init(int verbose) > io_tlb_start = 0; > } > pr_warn("Cannot allocate buffer"); > +out: > no_iotlb_memory = true; > } > > -- Florian ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB
On 2021-03-18 19:22, Florian Fainelli wrote: On 3/18/2021 12:18 PM, Florian Fainelli wrote: It may be useful to disable the SWIOTLB completely for testing or when a platform is known not to have any DRAM addressing limitations what so ever. Isn't that what "swiotlb=noforce" is for? If you're confident that we've really ironed out *all* the awkward corners that used to blow up if various internal bits were left uninitialised, then it would make sense to just tweak the implementation of what we already have. I wouldn't necessarily disagree with adding "off" as an additional alias for "noforce", though, since it does come across as a bit wacky for general use. Signed-off-by: Florian Fainelli Christoph, in addition to this change, how would you feel if we qualified the swiotlb_init() in arch/arm/mm/init.c with a: if (memblock_end_of_DRAM() >= SZ_4G) swiotlb_init(1) Modulo "swiotlb=force", of course ;) Robin. right now this is made unconditional whenever ARM_LPAE is enabled which is the case for the platforms I maintain (ARCH_BRCMSTB) however we do not really need a SWIOTLB so long as the largest DRAM physical address does not exceed 4GB AFAICT. Thanks! --- Documentation/admin-guide/kernel-parameters.txt | 1 + include/linux/swiotlb.h | 1 + kernel/dma/swiotlb.c| 9 + 3 files changed, 11 insertions(+) diff --git a/Documentation/admin-guide/kernel-parameters.txt b/Documentation/admin-guide/kernel-parameters.txt index 04545725f187..b0223e48921e 100644 --- a/Documentation/admin-guide/kernel-parameters.txt +++ b/Documentation/admin-guide/kernel-parameters.txt @@ -5278,6 +5278,7 @@ force -- force using of bounce buffers even if they wouldn't be automatically used by the kernel noforce -- Never use bounce buffers (for debugging) + off -- Completely disable SWIOTLB switches= [HW,M68k] diff --git a/include/linux/swiotlb.h b/include/linux/swiotlb.h index 5857a937c637..23f86243defe 100644 --- a/include/linux/swiotlb.h +++ b/include/linux/swiotlb.h @@ -15,6 +15,7 @@ enum swiotlb_force { SWIOTLB_NORMAL, /* Default - depending on HW DMA mask etc. */ SWIOTLB_FORCE, /* swiotlb=force */ SWIOTLB_NO_FORCE, /* swiotlb=noforce */ + SWIOTLB_OFF,/* swiotlb=off */ }; /* diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c index c10e855a03bc..d7a4a789c7d3 100644 --- a/kernel/dma/swiotlb.c +++ b/kernel/dma/swiotlb.c @@ -126,6 +126,8 @@ setup_io_tlb_npages(char *str) } else if (!strcmp(str, "noforce")) { swiotlb_force = SWIOTLB_NO_FORCE; io_tlb_nslabs = 1; + } else if (!strcmp(str, "off")) { + swiotlb_force = SWIOTLB_OFF; } return 0; @@ -229,6 +231,9 @@ int __init swiotlb_init_with_tbl(char *tlb, unsigned long nslabs, int verbose) unsigned long i, bytes; size_t alloc_size; + if (swiotlb_force == SWIOTLB_OFF) + return 0; + bytes = nslabs << IO_TLB_SHIFT; io_tlb_nslabs = nslabs; @@ -284,6 +289,9 @@ swiotlb_init(int verbose) unsigned char *vstart; unsigned long bytes; + if (swiotlb_force == SWIOTLB_OFF) + goto out; + if (!io_tlb_nslabs) { io_tlb_nslabs = (default_size >> IO_TLB_SHIFT); io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE); @@ -302,6 +310,7 @@ swiotlb_init(int verbose) io_tlb_start = 0; } pr_warn("Cannot allocate buffer"); +out: no_iotlb_memory = true; } ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 2/3] ACPI: Add driver for the VIOT table
On 2021-03-16 19:16, Jean-Philippe Brucker wrote: The ACPI Virtual I/O Translation Table describes topology of para-virtual platforms. For now it describes the relation between virtio-iommu and the endpoints it manages. Supporting that requires three steps: (1) acpi_viot_init(): parse the VIOT table, build a list of endpoints and vIOMMUs. (2) acpi_viot_set_iommu_ops(): when the vIOMMU driver is loaded and the device probed, register it to the VIOT driver. This step is required because unlike similar drivers, VIOT doesn't create the vIOMMU device. Note that you're basically the same as the DT case in this regard, so I'd expect things to be closer to that pattern than to that of IORT. [...] @@ -1506,12 +1507,17 @@ int acpi_dma_configure_id(struct device *dev, enum dev_dma_attr attr, { const struct iommu_ops *iommu; u64 dma_addr = 0, size = 0; + int ret; if (attr == DEV_DMA_NOT_SUPPORTED) { set_dma_ops(dev, &dma_dummy_ops); return 0; } + ret = acpi_viot_dma_setup(dev, attr); + if (ret) + return ret > 0 ? 0 : ret; I think things could do with a fair bit of refactoring here. Ideally we want to process a possible _DMA method (acpi_dma_get_range()) regardless of which flavour of IOMMU table might be present, and the amount of duplication we fork into at this point is unfortunate. + iort_dma_setup(dev, &dma_addr, &size); For starters I think most of that should be dragged out to this level here - it's really only the {rc,nc}_dma_get_range() bit that deserves to be the IORT-specific call. iommu = iort_iommu_configure_id(dev, input_id); Similarly, it feels like it's only the table scan part in the middle of that that needs dispatching between IORT/VIOT, and its head and tail pulled out into a common path. [...] +static const struct iommu_ops *viot_iommu_setup(struct device *dev) +{ + struct iommu_fwspec *fwspec = dev_iommu_fwspec_get(dev); + struct viot_iommu *viommu = NULL; + struct viot_endpoint *ep; + u32 epid; + int ret; + + /* Already translated? */ + if (fwspec && fwspec->ops) + return NULL; + + mutex_lock(&viommus_lock); + list_for_each_entry(ep, &viot_endpoints, list) { + if (viot_device_match(dev, &ep->dev_id, &epid)) { + epid += ep->endpoint_id; + viommu = ep->viommu; + break; + } + } + mutex_unlock(&viommus_lock); + if (!viommu) + return NULL; + + /* We're not translating ourself */ + if (viot_device_match(dev, &viommu->dev_id, &epid)) + return NULL; + + /* +* If we found a PCI range managed by the viommu, we're the one that has +* to request ACS. +*/ + if (dev_is_pci(dev)) + pci_request_acs(); + + if (!viommu->ops || WARN_ON(!viommu->dev)) + return ERR_PTR(-EPROBE_DEFER); Can you create (or look up) a viommu->fwnode when initially parsing the VIOT to represent the IOMMU devices to wait for, such that the viot_device_match() lookup can resolve to that and let you fall into the standard iommu_ops_from_fwnode() path? That's what I mean about following the DT pattern - I guess it might need a bit of trickery to rewrite things if iommu_device_register() eventually turns up with a new fwnode, so I doubt we can get away without *some* kind of private interface between virtio-iommu and VIOT, but it would be nice for the common(ish) DMA paths to stay as unaware of the specifics as possible. + + ret = iommu_fwspec_init(dev, viommu->dev->fwnode, viommu->ops); + if (ret) + return ERR_PTR(ret); + + iommu_fwspec_add_ids(dev, &epid, 1); + + /* +* If we have reason to believe the IOMMU driver missed the initial +* add_device callback for dev, replay it to get things in order. 
+*/ + if (dev->bus && !device_iommu_mapped(dev)) + iommu_probe_device(dev); + + return viommu->ops; +} + +/** + * acpi_viot_dma_setup - Configure DMA for an endpoint described in VIOT + * @dev: the endpoint + * @attr: coherency property of the endpoint + * + * Setup the DMA and IOMMU ops for an endpoint described by the VIOT table. + * + * Return: + * * 0 - @dev doesn't match any VIOT node + * * 1 - ops for @dev were successfully installed + * * -EPROBE_DEFER - ops for @dev aren't yet available + */ +int acpi_viot_dma_setup(struct device *dev, enum dev_dma_attr attr) +{ + const struct iommu_ops *iommu_ops = viot_iommu_setup(dev); + + if (IS_ERR_OR_NULL(iommu_ops)) { + int ret = PTR_ERR(iommu_ops); + + if (ret == -EPROBE_DEFER || ret == 0) + return ret; + dev_err(dev, "error %d while setting up virt IOMMU\n", ret); +
Re: [PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB
On 3/18/2021 12:34 PM, Robin Murphy wrote: > On 2021-03-18 19:22, Florian Fainelli wrote: >> >> >> On 3/18/2021 12:18 PM, Florian Fainelli wrote: >>> It may be useful to disable the SWIOTLB completely for testing or when a >>> platform is known not to have any DRAM addressing limitations what so >>> ever. > > Isn't that what "swiotlb=noforce" is for? If you're confident that we've > really ironed out *all* the awkward corners that used to blow up if > various internal bits were left uninitialised, then it would make sense > to just tweak the implementation of what we already have. swiotlb=noforce does prevent dma_direct_map_page() from resorting to the swiotlb, however what I am also after is reclaiming these 64MB of default SWIOTLB bounce buffering memory because my systems run with large amounts of reserved memory into ZONE_MOVABLE and everything in ZONE_NORMAL is precious at that point. > > I wouldn't necessarily disagree with adding "off" as an additional alias > for "noforce", though, since it does come across as a bit wacky for > general use. > >>> Signed-off-by: Florian Fainelli >> >> Christoph, in addition to this change, how would you feel if we >> qualified the swiotlb_init() in arch/arm/mm/init.c with a: >> >> >> if (memblock_end_of_DRAM() >= SZ_4G) >> swiotlb_init(1) > > Modulo "swiotlb=force", of course ;) Indeed, we would need to handle that case as well. Does it sound reasonable to do that to you as well? -- Florian ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB
On 2021-03-18 19:43, Florian Fainelli wrote: On 3/18/2021 12:34 PM, Robin Murphy wrote: On 2021-03-18 19:22, Florian Fainelli wrote: On 3/18/2021 12:18 PM, Florian Fainelli wrote: It may be useful to disable the SWIOTLB completely for testing or when a platform is known not to have any DRAM addressing limitations what so ever. Isn't that what "swiotlb=noforce" is for? If you're confident that we've really ironed out *all* the awkward corners that used to blow up if various internal bits were left uninitialised, then it would make sense to just tweak the implementation of what we already have. swiotlb=noforce does prevent dma_direct_map_page() from resorting to the swiotlb, however what I am also after is reclaiming these 64MB of default SWIOTLB bounce buffering memory because my systems run with large amounts of reserved memory into ZONE_MOVABLE and everything in ZONE_NORMAL is precious at that point. It also forces io_tlb_nslabs to the minimum, so it should be claiming considerably less than 64MB. IIRC the original proposal *did* skip initialisation completely, but that turned up the aforementioned issues. I wouldn't necessarily disagree with adding "off" as an additional alias for "noforce", though, since it does come across as a bit wacky for general use. Signed-off-by: Florian Fainelli Christoph, in addition to this change, how would you feel if we qualified the swiotlb_init() in arch/arm/mm/init.c with a: if (memblock_end_of_DRAM() >= SZ_4G) swiotlb_init(1) Modulo "swiotlb=force", of course ;) Indeed, we would need to handle that case as well. Does it sound reasonable to do that to you as well? I wouldn't like it done to me personally, but for arm64, observe what mem_init() in arch/arm64/mm/init.c already does. Robin. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
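For reference, the arm64 logic Robin is pointing at is roughly the following (paraphrased from memory, so check the tree), and the 32-bit equivalent Florian is proposing would presumably follow the same shape:

/* arch/arm64/mm/init.c mem_init(), approximately: */
if (swiotlb_force == SWIOTLB_FORCE ||
    max_pfn > PFN_DOWN(arm64_dma_phys_limit))
	swiotlb_init(1);
else
	swiotlb_force = SWIOTLB_NO_FORCE;

/* hypothetical arch/arm/mm/init.c equivalent for the LPAE case: */
if (swiotlb_force == SWIOTLB_FORCE ||
    memblock_end_of_DRAM() >= SZ_4G)
	swiotlb_init(1);
else
	swiotlb_force = SWIOTLB_NO_FORCE;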
Re: [PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB
On 3/18/2021 12:53 PM, Robin Murphy wrote: > On 2021-03-18 19:43, Florian Fainelli wrote: >> >> >> On 3/18/2021 12:34 PM, Robin Murphy wrote: >>> On 2021-03-18 19:22, Florian Fainelli wrote: On 3/18/2021 12:18 PM, Florian Fainelli wrote: > It may be useful to disable the SWIOTLB completely for testing or > when a > platform is known not to have any DRAM addressing limitations what so > ever. >>> >>> Isn't that what "swiotlb=noforce" is for? If you're confident that we've >>> really ironed out *all* the awkward corners that used to blow up if >>> various internal bits were left uninitialised, then it would make sense >>> to just tweak the implementation of what we already have. >> >> swiotlb=noforce does prevent dma_direct_map_page() from resorting to the >> swiotlb, however what I am also after is reclaiming these 64MB of >> default SWIOTLB bounce buffering memory because my systems run with >> large amounts of reserved memory into ZONE_MOVABLE and everything in >> ZONE_NORMAL is precious at that point. > > It also forces io_tlb_nslabs to the minimum, so it should be claiming > considerably less than 64MB. IIRC the original proposal *did* skip > initialisation completely, but that turned up the aforementioned issues. AFAICT in that case we will have iotlb_n_slabs will set to 1, which will still make us allocate io_tlb_n_slabs << IO_TLB_SHIFT bytes in swiotlb_init(), which still gives us 64MB. > >>> I wouldn't necessarily disagree with adding "off" as an additional alias >>> for "noforce", though, since it does come across as a bit wacky for >>> general use. >>> > Signed-off-by: Florian Fainelli Christoph, in addition to this change, how would you feel if we qualified the swiotlb_init() in arch/arm/mm/init.c with a: if (memblock_end_of_DRAM() >= SZ_4G) swiotlb_init(1) >>> >>> Modulo "swiotlb=force", of course ;) >> >> Indeed, we would need to handle that case as well. Does it sound >> reasonable to do that to you as well? > > I wouldn't like it done to me personally, but for arm64, observe what > mem_init() in arch/arm64/mm/init.c already does. > > Robin. -- Florian ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB
On 2021-03-18 21:31, Florian Fainelli wrote: On 3/18/2021 12:53 PM, Robin Murphy wrote: On 2021-03-18 19:43, Florian Fainelli wrote: On 3/18/2021 12:34 PM, Robin Murphy wrote: On 2021-03-18 19:22, Florian Fainelli wrote: On 3/18/2021 12:18 PM, Florian Fainelli wrote: It may be useful to disable the SWIOTLB completely for testing or when a platform is known not to have any DRAM addressing limitations what so ever. Isn't that what "swiotlb=noforce" is for? If you're confident that we've really ironed out *all* the awkward corners that used to blow up if various internal bits were left uninitialised, then it would make sense to just tweak the implementation of what we already have. swiotlb=noforce does prevent dma_direct_map_page() from resorting to the swiotlb, however what I am also after is reclaiming these 64MB of default SWIOTLB bounce buffering memory because my systems run with large amounts of reserved memory into ZONE_MOVABLE and everything in ZONE_NORMAL is precious at that point. It also forces io_tlb_nslabs to the minimum, so it should be claiming considerably less than 64MB. IIRC the original proposal *did* skip initialisation completely, but that turned up the aforementioned issues. AFAICT in that case we will have iotlb_n_slabs will set to 1, which will still make us allocate io_tlb_n_slabs << IO_TLB_SHIFT bytes in swiotlb_init(), which still gives us 64MB. Eh? When did 2KB become 64MB? IO_TLB_SHIFT is 11, so that's at most one page in anyone's money... I wouldn't necessarily disagree with adding "off" as an additional alias for "noforce", though, since it does come across as a bit wacky for general use. Signed-off-by: Florian Fainelli Christoph, in addition to this change, how would you feel if we qualified the swiotlb_init() in arch/arm/mm/init.c with a: if (memblock_end_of_DRAM() >= SZ_4G) swiotlb_init(1) Modulo "swiotlb=force", of course ;) Indeed, we would need to handle that case as well. Does it sound reasonable to do that to you as well? I wouldn't like it done to me personally, but for arm64, observe what mem_init() in arch/arm64/mm/init.c already does. In fact I should have looked more closely at that myself - checking debugfs on my 4GB arm64 board actually shows io_tlb_nslabs = 0, and indeed we are bypassing initialisation completely and (ab)using SWIOTLB_NO_FORCE to cover it up, so I guess it probably *is* safe now for the noforce option to do the same for itself and save even that one page. Robin. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH V4 05/18] iommu/ioasid: Redefine IOASID set and allocation APIs
Hi Jean, Slightly off the title. As we are moving to use cgroup to limit PASID allocations, it would be much simpler if we enforce the limit on the current task. However, iommu_sva_alloc_pasid() takes an mm_struct pointer as argument, which implies it can be something other than the current task's mm. So far all kernel callers use the current task's mm. Is there a use case for doing PASID allocation on behalf of another mm? If not, can we remove the mm argument? Thanks, Jacob > /** > * iommu_sva_alloc_pasid - Allocate a PASID for the mm > @@ -35,11 +44,11 @@ int iommu_sva_alloc_pasid(struct mm_struct *mm, > ioasid_t min, ioasid_t max) mutex_lock(&iommu_sva_lock); > if (mm->pasid) { > if (mm->pasid >= min && mm->pasid <= max) > - ioasid_get(mm->pasid); > + ioasid_get(iommu_sva_pasid, mm->pasid); > else > ret = -EOVERFLOW; > } else { > - pasid = ioasid_alloc(&iommu_sva_pasid, min, max, mm); > + pasid = ioasid_alloc(iommu_sva_pasid, min, max, mm); > if (pasid == INVALID_IOASID) > ret = -ENOMEM; Thanks, Jacob ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
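For what it's worth, if only the current task's mm is ever legitimate, the simplification being suggested would presumably look something like the sketch below, adapted from the quoted hunk (locking kept as-is, entry checks abbreviated):

int iommu_sva_alloc_pasid(ioasid_t min, ioasid_t max)
{
	struct mm_struct *mm = current->mm;	/* no caller-supplied mm */
	ioasid_t pasid;
	int ret = 0;

	if (!mm || min == INVALID_IOASID || max < min)
		return -EINVAL;

	mutex_lock(&iommu_sva_lock);
	if (mm->pasid) {
		if (mm->pasid >= min && mm->pasid <= max)
			ioasid_get(iommu_sva_pasid, mm->pasid);
		else
			ret = -EOVERFLOW;
	} else {
		pasid = ioasid_alloc(iommu_sva_pasid, min, max, mm);
		if (pasid == INVALID_IOASID)
			ret = -ENOMEM;
		else
			mm->pasid = pasid;
	}
	mutex_unlock(&iommu_sva_lock);
	return ret;
}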
Re: A problem of Intel IOMMU hardware ?
On 3/18/21 4:56 PM, Tian, Kevin wrote: From: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) -Original Message- From: Tian, Kevin [mailto:kevin.t...@intel.com] Sent: Thursday, March 18, 2021 4:27 PM To: Longpeng (Mike, Cloud Infrastructure Service Product Dept.) ; Nadav Amit Cc: chenjiashang ; David Woodhouse ; iommu@lists.linux-foundation.org; LKML ; alex.william...@redhat.com; Gonglei (Arei) ; w...@kernel.org Subject: RE: A problem of Intel IOMMU hardware ? From: iommu On Behalf Of Longpeng (Mike, Cloud Infrastructure Service Product Dept.) 2. Consider ensuring that the problem is not somehow related to queued invalidations. Try to use __iommu_flush_iotlb() instead of qi_flush_iotlb(). I tried to force to use __iommu_flush_iotlb(), but maybe something wrong, the system crashed, so I prefer to lower the priority of this operation. The VT-d spec clearly says that register-based invalidation can be used only when queued-invalidations are not enabled. Intel-IOMMU driver doesn't provide an option to disable queued-invalidation though, when the hardware is capable. If you really want to try, tweak the code in intel_iommu_init_qi. Hi Kevin, Thanks to point out this. Do you have any ideas about this problem ? I tried to descript the problem much clear in my reply to Alex, hope you could have a look if you're interested. btw I saw you used 4.18 kernel in this test. What about latest kernel? Also one way to separate sw/hw bug is to trace the low level interface (e.g., qi_flush_iotlb) which actually sends invalidation descriptors to the IOMMU hardware. Check the window between b) and c) and see whether the software does the right thing as expected there. Yes. It's better if we can reproduce this with the latest kernel which has debugfs files to expose page tables and the invalidation queues etc. Best regards, baolu ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH 2/5] iommu/vt-d: Remove WO permissions on second-level paging entries
Hi Joerg, On 3/18/21 5:12 PM, Joerg Roedel wrote: Hi, On Mon, Mar 08, 2021 at 11:47:46AM -0800, Raj, Ashok wrote: That is the primary motivation, given that we have moved to 1st level for general IOVA, first level doesn't have a WO mapping. I didn't know enough about the history to determine if a WO without a READ is very useful. I guess the ZLR was invented to support those cases without a READ in PCIe. I Okay, please update the commit message and re-send. I guess these patches are 5.13 stuff. In that case, Baolu can include them into his pull request later this cycle. Okay! It works for me. Best regards, baolu ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB
On 3/18/2021 4:35 PM, Robin Murphy wrote: > On 2021-03-18 21:31, Florian Fainelli wrote: >> >> >> On 3/18/2021 12:53 PM, Robin Murphy wrote: >>> On 2021-03-18 19:43, Florian Fainelli wrote: On 3/18/2021 12:34 PM, Robin Murphy wrote: > On 2021-03-18 19:22, Florian Fainelli wrote: >> >> >> On 3/18/2021 12:18 PM, Florian Fainelli wrote: >>> It may be useful to disable the SWIOTLB completely for testing or >>> when a >>> platform is known not to have any DRAM addressing limitations >>> what so >>> ever. > > Isn't that what "swiotlb=noforce" is for? If you're confident that > we've > really ironed out *all* the awkward corners that used to blow up if > various internal bits were left uninitialised, then it would make > sense > to just tweak the implementation of what we already have. swiotlb=noforce does prevent dma_direct_map_page() from resorting to the swiotlb, however what I am also after is reclaiming these 64MB of default SWIOTLB bounce buffering memory because my systems run with large amounts of reserved memory into ZONE_MOVABLE and everything in ZONE_NORMAL is precious at that point. >>> >>> It also forces io_tlb_nslabs to the minimum, so it should be claiming >>> considerably less than 64MB. IIRC the original proposal *did* skip >>> initialisation completely, but that turned up the aforementioned issues. >> >> AFAICT in that case we will have iotlb_n_slabs will set to 1, which will >> still make us allocate io_tlb_n_slabs << IO_TLB_SHIFT bytes in >> swiotlb_init(), which still gives us 64MB. > > Eh? When did 2KB become 64MB? IO_TLB_SHIFT is 11, so that's at most one > page in anyone's money... Yes, sorry incorrect shift applied here. Still, and I believe this is what you mean below, architecture code setting swiotlb_force = SWIOTLB_NO_FORCE does not result in not allocating the SWIOTLB, because io_tlb_nslabs is still left set to 0 so swiotlb_init() will proceed with allocating the default size. > > I wouldn't necessarily disagree with adding "off" as an additional > alias > for "noforce", though, since it does come across as a bit wacky for > general use. > >>> Signed-off-by: Florian Fainelli >> >> Christoph, in addition to this change, how would you feel if we >> qualified the swiotlb_init() in arch/arm/mm/init.c with a: >> >> >> if (memblock_end_of_DRAM() >= SZ_4G) >> swiotlb_init(1) > > Modulo "swiotlb=force", of course ;) Indeed, we would need to handle that case as well. Does it sound reasonable to do that to you as well? >>> >>> I wouldn't like it done to me personally, but for arm64, observe what >>> mem_init() in arch/arm64/mm/init.c already does. > > In fact I should have looked more closely at that myself - checking > debugfs on my 4GB arm64 board actually shows io_tlb_nslabs = 0, and > indeed we are bypassing initialisation completely and (ab)using > SWIOTLB_NO_FORCE to cover it up, so I guess it probably *is* safe now > for the noforce option to do the same for itself and save even that one > page. OK, I can submit a patch that does that. 5.12-rc3 works correctly for me here as well and only allocates SWIOTLB when needed which in our case is either: - we have DRAM at PA >= 4GB - we have limited peripherals (Raspberry Pi 4 derivative) that can only address the lower 1GB Now let's see if we can get ARM 32-bit to match :) -- Florian ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: [PATCH] iommu/dma: Fix a typo in a comment
Hi, On 2021/3/18 19:01, Robin Murphy wrote: On 2021-03-18 09:55, chenxiang (M) wrote: Hi will, On 2021/3/18 17:34, Will Deacon wrote: On Thu, Mar 18, 2021 at 11:21:24AM +0800, chenxiang wrote: From: Xiang Chen Fix a typo "SAC" to "DAC" in the comment of function iommu_dma_alloc_iova(). Signed-off-by: Xiang Chen --- drivers/iommu/dma-iommu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/drivers/iommu/dma-iommu.c b/drivers/iommu/dma-iommu.c index af765c8..3bc17ee 100644 --- a/drivers/iommu/dma-iommu.c +++ b/drivers/iommu/dma-iommu.c @@ -443,7 +443,7 @@ static dma_addr_t iommu_dma_alloc_iova(struct iommu_domain *domain, if (domain->geometry.force_aperture) dma_limit = min(dma_limit, (u64)domain->geometry.aperture_end); -/* Try to get PCI devices a SAC address */ +/* Try to get PCI devices a DAC address */ if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev)) iova = alloc_iova_fast(iovad, iova_len, DMA_BIT_MASK(32) >> shift, false); This doesn't look like a typo to me... Please explain. As the author of said comment, it is definitely not a typo. What is SAC short for? I think it means double-address cycle (DAC), and on page 452 of LDD3 there is a description of it: "PCI double-address cycle mappings Normally, the DMA support layer works with 32-bit bus addresses, possibly restricted by a specific device's DMA mask. The PCI bus, however, also supports a 64-bit addressing mode, the double-address cycle (DAC)." Please point it out if I am wrong :) Yes, now look at what the code above does: *if* a PCI device has a DMA mask greater than 32 bits (i.e. can support dual address cycles), we first try an IOVA allocation with an explicit 32-bit limit (i.e. one that can be expressed in a single address cycle), in the hope that we don't have to fall back to allocating from the upper range and forcing the device to actually use DAC (and suffer the potential performance impact). Robin. ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
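In other words, SAC is a single address cycle (a 32-bit bus address) and DAC is a dual/double address cycle (a 64-bit one). For completeness, the retry that actually hands out a DAC-range address sits just below the quoted hunk; from memory it is roughly:

	/* Try to get PCI devices a SAC address */
	if (dma_limit > DMA_BIT_MASK(32) && dev_is_pci(dev))
		iova = alloc_iova_fast(iovad, iova_len,
				       DMA_BIT_MASK(32) >> shift, false);

	/* Fall back to the full (possibly DAC) range only if that failed */
	if (!iova)
		iova = alloc_iova_fast(iovad, iova_len, dma_limit >> shift,
				       true);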
Re: [PATCH 1/1] iommu/vt-d: Fix lockdep splat in intel_pasid_get_entry()
Hi Joerg, On 3/18/21 6:21 PM, Joerg Roedel wrote: On Wed, Mar 17, 2021 at 08:58:34AM +0800, Lu Baolu wrote: The pasid_lock is used to synchronize different threads from modifying a same pasid directory entry at the same time. It causes below lockdep splat. [ 83.296538] [ 83.296538] WARNING: possible irq lock inversion dependency detected [ 83.296539] 5.12.0-rc3+ #25 Tainted: GW [ 83.296539] [ 83.296540] bash/780 just changed the state of lock: [ 83.296540] 82b29c98 (device_domain_lock){..-.}-{2:2}, at: iommu_flush_dev_iotlb.part.0+0x32/0x110 [ 83.296547] but this lock took another, SOFTIRQ-unsafe lock in the past: [ 83.296547] (pasid_lock){+.+.}-{2:2} [ 83.296548] and interrupts could create inverse lock ordering between them. [ 83.296549] other info that might help us debug this: [ 83.296549] Chain exists of: device_domain_lock --> &iommu->lock --> pasid_lock [ 83.296551] Possible interrupt unsafe locking scenario: [ 83.296551]CPU0CPU1 [ 83.296552] [ 83.296552] lock(pasid_lock); [ 83.296553]local_irq_disable(); [ 83.296553]lock(device_domain_lock); [ 83.296554]lock(&iommu->lock); [ 83.296554] [ 83.296554] lock(device_domain_lock); [ 83.296555] *** DEADLOCK *** Fix it by replacing the pasid_lock with an atomic exchange operation. Reported-and-tested-by: Dave Jiang Signed-off-by: Lu Baolu --- drivers/iommu/intel/pasid.c | 14 ++ 1 file changed, 6 insertions(+), 8 deletions(-) diff --git a/drivers/iommu/intel/pasid.c b/drivers/iommu/intel/pasid.c index 9fb3d3e80408..1ddcb8295f72 100644 --- a/drivers/iommu/intel/pasid.c +++ b/drivers/iommu/intel/pasid.c @@ -24,7 +24,6 @@ /* * Intel IOMMU system wide PASID name space: */ -static DEFINE_SPINLOCK(pasid_lock); u32 intel_pasid_max_id = PASID_MAX; int vcmd_alloc_pasid(struct intel_iommu *iommu, u32 *pasid) @@ -259,19 +258,18 @@ struct pasid_entry *intel_pasid_get_entry(struct device *dev, u32 pasid) dir_index = pasid >> PASID_PDE_SHIFT; index = pasid & PASID_PTE_MASK; - spin_lock(&pasid_lock); entries = get_pasid_table_from_pde(&dir[dir_index]); if (!entries) { entries = alloc_pgtable_page(info->iommu->node); - if (!entries) { - spin_unlock(&pasid_lock); + if (!entries) return NULL; - } - WRITE_ONCE(dir[dir_index].val, - (u64)virt_to_phys(entries) | PASID_PTE_PRESENT); + if (cmpxchg64(&dir[dir_index].val, 0ULL, + (u64)virt_to_phys(entries) | PASID_PTE_PRESENT)) { + free_pgtable_page(entries); + entries = get_pasid_table_from_pde(&dir[dir_index]); This is racy, someone could have already cleared the pasid-entry again. This code modifies the pasid directory entry. The pasid directory entries are allocated on demand and will never be freed. What you need to do here is to retry the whole path by adding a goto to before the first get_pasid_table_from_pde() call. Yes. Retrying by adding a goto makes the code clearer. Btw, what makes sure that the pasid_entry does not go away when it is returned here? As explained above, it handles the pasid directory table entry which won't go away. Regards, Joerg Best regards, baolu ___ iommu mailing list iommu@lists.linux-foundation.org https://lists.linuxfoundation.org/mailman/listinfo/iommu
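A sketch of the retry Joerg is describing (not the final patch), restarting from the PDE read whenever the cmpxchg loses the race:

	dir_index = pasid >> PASID_PDE_SHIFT;
	index = pasid & PASID_PTE_MASK;

retry:
	entries = get_pasid_table_from_pde(&dir[dir_index]);
	if (!entries) {
		entries = alloc_pgtable_page(info->iommu->node);
		if (!entries)
			return NULL;

		/*
		 * Someone else may have installed the directory entry
		 * concurrently; if our cmpxchg loses, free our page and
		 * start over from the PDE read.
		 */
		if (cmpxchg64(&dir[dir_index].val, 0ULL,
			      (u64)virt_to_phys(entries) | PASID_PTE_PRESENT)) {
			free_pgtable_page(entries);
			goto retry;
		}
	}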
Re: [PATCH 1/1] iommu/vt-d: Don't set then immediately clear in prq_event_thread()
Hi Joerg,

On 3/18/21 6:10 PM, Joerg Roedel wrote:
> Hi Baolu,
>
> On Tue, Mar 09, 2021 at 08:46:41AM +0800, Lu Baolu wrote:
>> The private data field of a page group response descriptor is set then
>> immediately cleared in prq_event_thread(). Fix this by moving the
>> clearing code up.
>>
>> Fixes: 5b438f4ba315d ("iommu/vt-d: Support page request in scalable mode")
>> Cc: Jacob Pan
>> Reviewed-by: Liu Yi L
>> Signed-off-by: Lu Baolu
>
> Does this fix an actual bug? If so, please state it in the commit

It will cause a real problem according to the VT-d spec. I haven't got a
chance to run this on real hardware yet. I'll add a commit message to
explain why this will cause a problem.

> message and also fix the subject line to state what is set/cleared.

Sure!

> Thanks,
> Joerg

Best regards,
baolu
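The defect being discussed is purely one of ordering: the descriptor is filled in and then wiped by a later clearing step. A minimal illustration, with field names that are purely hypothetical and not the actual prq_event_thread() descriptor layout:

	/* buggy order: private data copied in, then lost to the clear */
	resp.priv_data = req->priv_data;
	resp.priv_data = 0;

	/* fixed order: clear first, then conditionally fill in */
	resp.priv_data = 0;
	if (req->priv_data_present)
		resp.priv_data = req->priv_data;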
Re: [PATCH] swiotlb: Add swiotlb=off to disable SWIOTLB
> >
> > In fact I should have looked more closely at that myself - checking
> > debugfs on my 4GB arm64 board actually shows io_tlb_nslabs = 0, and
> > indeed we are bypassing initialisation completely and (ab)using
> > SWIOTLB_NO_FORCE to cover it up, so I guess it probably *is* safe now
> > for the noforce option to do the same for itself and save even that one
> > page.
>
> OK, I can submit a patch that does that. 5.12-rc3 works correctly for me
> here as well and only allocates SWIOTLB when needed which in our case is
> either:
>
> - we have DRAM at PA >= 4GB
> - we have limited peripherals (Raspberry Pi 4 derivative) that can only
>   address the lower 1GB
>
> Now let's see if we can get ARM 32-bit to match :)

Whatever patch you come up with, if it is against SWIOTLB please base it
on top of devel/for-linus-5.12 in

https://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb.git/

Thx

> --
> Florian
[PATCH] swiotlb: Make SWIOTLB_NO_FORCE perform no allocation
When SWIOTLB_NO_FORCE is used, there should really be no allocations of
io_tlb_nslabs to occur since we are not going to use those slabs. If a
platform was somehow setting swiotlb_no_force and a later call to
swiotlb_init() was to be made we would still be proceeding with
allocating the default SWIOTLB size (64MB), whereas if swiotlb=noforce
was set on the kernel command line we would have only allocated 2KB.

This would be inconsistent and the point of initializing io_tlb_nslabs
to 1 was to avoid hitting the test for io_tlb_nslabs being 0/not
initialized.

Signed-off-by: Florian Fainelli
---
 kernel/dma/swiotlb.c | 10 ++
 1 file changed, 6 insertions(+), 4 deletions(-)

diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
index c10e855a03bc..526c8321b76f 100644
--- a/kernel/dma/swiotlb.c
+++ b/kernel/dma/swiotlb.c
@@ -121,12 +121,10 @@ setup_io_tlb_npages(char *str)
 	}
 	if (*str == ',')
 		++str;
-	if (!strcmp(str, "force")) {
+	if (!strcmp(str, "force"))
 		swiotlb_force = SWIOTLB_FORCE;
-	} else if (!strcmp(str, "noforce")) {
+	else if (!strcmp(str, "noforce"))
 		swiotlb_force = SWIOTLB_NO_FORCE;
-		io_tlb_nslabs = 1;
-	}
 
 	return 0;
 }
@@ -284,6 +282,9 @@ swiotlb_init(int verbose)
 	unsigned char *vstart;
 	unsigned long bytes;
 
+	if (swiotlb_force == SWIOTLB_NO_FORCE)
+		goto out;
+
 	if (!io_tlb_nslabs) {
 		io_tlb_nslabs = (default_size >> IO_TLB_SHIFT);
 		io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
@@ -302,6 +303,7 @@ swiotlb_init(int verbose)
 		io_tlb_start = 0;
 	}
 	pr_warn("Cannot allocate buffer");
+out:
 	no_iotlb_memory = true;
 }
-- 
2.25.1
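As a usage note, the parser in setup_io_tlb_npages() shown in the first hunk accepts an optional slab count followed by an optional keyword after a comma, so typical kernel command lines look like the following. The sizes assume the usual 2KB per slab and are illustrative only:

	swiotlb=noforce        # never bounce; with this patch, allocate nothing
	swiotlb=65536,force    # reserve 65536 slabs (128MB) and always bounce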
[PATCH] ARM: Qualify enabling of swiotlb_init()
We do not need a SWIOTLB unless we have DRAM that is addressable beyond
the arm_dma_limit. Compare max_pfn with arm_dma_pfn_limit to determine
whether we do need a SWIOTLB to be initialized.

Fixes: ad3c7b18c5b3 ("arm: use swiotlb for bounce buffering on LPAE configs")
Signed-off-by: Florian Fainelli
---
 arch/arm/mm/init.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/arch/arm/mm/init.c b/arch/arm/mm/init.c
index 828a2561b229..8356bf1daa28 100644
--- a/arch/arm/mm/init.c
+++ b/arch/arm/mm/init.c
@@ -301,7 +301,11 @@ static void __init free_highpages(void)
 void __init mem_init(void)
 {
 #ifdef CONFIG_ARM_LPAE
-	swiotlb_init(1);
+	if (swiotlb_force == SWIOTLB_FORCE ||
+	    max_pfn > arm_dma_pfn_limit)
+		swiotlb_init(1);
+	else
+		swiotlb_force = SWIOTLB_NO_FORCE;
 #endif
 
 	set_max_mapnr(pfn_to_page(max_pfn) - mem_map);
-- 
2.25.1
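As an illustrative example of the new check (numbers chosen for illustration only): on a 32-bit LPAE board with 1GB of RAM sitting entirely below a 4GB DMA limit, max_pfn stays at or below arm_dma_pfn_limit, so swiotlb_init() is skipped and swiotlb_force is set to SWIOTLB_NO_FORCE. Only a system that populates memory above that limit, or one booted with swiotlb=force, still reserves the bounce buffer.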
Re: [PATCH] swiotlb: Make SWIOTLB_NO_FORCE perform no allocation
On Thu, Mar 18, 2021 at 09:00:54PM -0700, Florian Fainelli wrote:
> When SWIOTLB_NO_FORCE is used, there should really be no allocations of
> io_tlb_nslabs to occur since we are not going to use those slabs. If a
> platform was somehow setting swiotlb_no_force and a later call to
> swiotlb_init() was to be made we would still be proceeding with
> allocating the default SWIOTLB size (64MB), whereas if swiotlb=noforce
> was set on the kernel command line we would have only allocated 2KB.
>
> This would be inconsistent and the point of initializing io_tlb_nslabs
> to 1, was to avoid hitting the test for io_tlb_nslabs being 0/not
> initialized.

Could you rebase this on devel/for-linus-5.13 in
git://git.kernel.org/pub/scm/linux/kernel/git/konrad/swiotlb.git please?

>
> Signed-off-by: Florian Fainelli
> ---
>  kernel/dma/swiotlb.c | 10 ++
>  1 file changed, 6 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/dma/swiotlb.c b/kernel/dma/swiotlb.c
> index c10e855a03bc..526c8321b76f 100644
> --- a/kernel/dma/swiotlb.c
> +++ b/kernel/dma/swiotlb.c
> @@ -121,12 +121,10 @@ setup_io_tlb_npages(char *str)
>  	}
>  	if (*str == ',')
>  		++str;
> -	if (!strcmp(str, "force")) {
> +	if (!strcmp(str, "force"))
>  		swiotlb_force = SWIOTLB_FORCE;
> -	} else if (!strcmp(str, "noforce")) {
> +	else if (!strcmp(str, "noforce"))
>  		swiotlb_force = SWIOTLB_NO_FORCE;
> -		io_tlb_nslabs = 1;
> -	}
>
>  	return 0;
>  }
> @@ -284,6 +282,9 @@ swiotlb_init(int verbose)
>  	unsigned char *vstart;
>  	unsigned long bytes;
>
> +	if (swiotlb_force == SWIOTLB_NO_FORCE)
> +		goto out;
> +
>  	if (!io_tlb_nslabs) {
>  		io_tlb_nslabs = (default_size >> IO_TLB_SHIFT);
>  		io_tlb_nslabs = ALIGN(io_tlb_nslabs, IO_TLB_SEGSIZE);
> @@ -302,6 +303,7 @@ swiotlb_init(int verbose)
>  		io_tlb_start = 0;
>  	}
>  	pr_warn("Cannot allocate buffer");
> +out:
>  	no_iotlb_memory = true;
>  }
>
> --
> 2.25.1
>