Re: [PATCH v2 1/2] uacce: Remove mm_exit() op
On 2020/4/23 8:53 PM, Jean-Philippe Brucker wrote:
> The mm_exit() op will be removed from the SVA API. When a process dies and
> its mm goes away, the IOMMU driver won't notify device drivers anymore.
> Drivers should expect to handle a lot more aborted DMA. On the upside, it
> does greatly simplify the queue management.
>
> The uacce_mm struct, which tracks all queues bound to an mm, was only
> used by the mm_exit() callback. Remove it.
>
> Signed-off-by: Jean-Philippe Brucker
> ---
> v1->v2: clear q->handle after unbind

Thanks Jean

Acked-by: Zhangfei Gao

Thanks
___
iommu mailing list
iommu@lists.linux-foundation.org
https://lists.linuxfoundation.org/mailman/listinfo/iommu
Re: iommu_iova slab eats too much memory
Hello? anyone there?

Bin wrote on Thursday, April 23, 2020 at 5:14 PM:
> Forget to mention, I've already disabled the slab merge, so this is what
> it is.

[quoted original message and slabinfo output snipped -- see the post
"iommu_iova slab eats too much memory" below]
[PATCH] dma-contiguous: fix comment for dma_release_from_contiguous
Commit 90ae409f9eb3 ("dma-direct: fix zone selection after an unaddressable
CMA allocation") changed the logic in dma_release_from_contiguous to remove
the normal pages fallback path, but did not update the comment. Fix that.

Signed-off-by: Peter Collingbourne
---
 kernel/dma/contiguous.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/dma/contiguous.c b/kernel/dma/contiguous.c
index 8bc6f2d670f9..15bc5026c485 100644
--- a/kernel/dma/contiguous.c
+++ b/kernel/dma/contiguous.c
@@ -222,8 +222,8 @@ bool dma_release_from_contiguous(struct device *dev, struct page *pages,
  * @gfp: Allocation flags.
  *
  * This function allocates contiguous memory buffer for specified device. It
- * first tries to use device specific contiguous memory area if available or
- * the default global one, then tries a fallback allocation of normal pages.
+ * tries to use device specific contiguous memory area if available, or the
+ * default global one.
  *
  * Note that it byapss one-page size of allocations from the global area as
  * the addresses within one page are always contiguous, so there is no need
-- 
2.26.2.303.gf8c07b1a785-goog
Re: [PATCHv2] iommu/arm-smmu: Make remove callback message more informative
Hi,

On Thu, Apr 23, 2020 at 2:55 AM Sai Prakash Ranjan wrote:
>
> Currently on reboot/shutdown, the following messages are
> displayed on the console as error messages before the
> system reboots/shutdown as part of remove callback.
>
> On SC7180:
>
> arm-smmu 1500.iommu: removing device with active domains!
> arm-smmu 504.iommu: removing device with active domains!
>
> Make this error message more informative and less scary.
>
> Reported-by: Douglas Anderson
> Suggested-by: Robin Murphy
> Signed-off-by: Sai Prakash Ranjan
> ---
>  drivers/iommu/arm-smmu.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)

Reviewed-by: Douglas Anderson
Re: iommu_iova slab eats too much memory
Forget to mention, I've already disabled the slab merge, so this is what
it is.

[quoted original message and slabinfo output snipped -- see the post
"iommu_iova slab eats too much memory" below]
iommu_iova slab eats too much memory
Hey, guys:

I'm running a batch of CoreOS boxes, the lsb_release is:

```
# cat /etc/lsb-release
DISTRIB_ID="Container Linux by CoreOS"
DISTRIB_RELEASE=2303.3.0
DISTRIB_CODENAME="Rhyolite"
DISTRIB_DESCRIPTION="Container Linux by CoreOS 2303.3.0 (Rhyolite)"
```

```
# uname -a
Linux cloud-worker-25 4.19.86-coreos #1 SMP Mon Dec 2 20:13:38 -00 2019
x86_64 Intel(R) Xeon(R) CPU E5-2640 v2 @ 2.00GHz GenuineIntel GNU/Linux
```

Recently, I found my vms constantly being killed due to OOM, and after
digging into the problem, I finally realized that the kernel is leaking
memory.

Here's my slabinfo:

 Active / Total Objects (% used)    : 83818306 / 84191607 (99.6%)
 Active / Total Slabs (% used)      : 1336293 / 1336293 (100.0%)
 Active / Total Caches (% used)     : 152 / 217 (70.0%)
 Active / Total Size (% used)       : 5828768.08K / 5996848.72K (97.2%)
 Minimum / Average / Maximum Object : 0.01K / 0.07K / 23.25K

     OBJS   ACTIVE  USE OBJ SIZE   SLABS OBJ/SLAB CACHE SIZE NAME
 80253888 80253888 100%    0.06K 1253967       64   5015868K iommu_iova
   489472   489123  99%    0.03K    3824      128     15296K kmalloc-32
   297444   271112  91%    0.19K    7082       42     56656K dentry
   254400   252784  99%    0.06K    3975       64     15900K anon_vma_chain
   222528    39255  17%    0.50K    6954       32    111264K kmalloc-512
   202482   201814  99%    0.19K    4821       42     38568K vm_area_struct
   200192   200192 100%    0.01K     391      512      1564K kmalloc-8
   170528   169359  99%    0.25K    5329       32     42632K filp
   158144   153508  97%    0.06K    2471       64      9884K kmalloc-64
   149914   149365  99%    0.09K    3259       46     13036K anon_vma
   146640   143123  97%    0.10K    3760       39     15040K buffer_head
   130368    32791  25%    0.09K    3104       42     12416K kmalloc-96
   129752   129752 100%    0.07K    2317       56      9268K Acpi-Operand
   105468   105106  99%    0.04K    1034      102      4136K selinux_inode_security
    73080    73080 100%    0.13K    2436       30      9744K kernfs_node_cache
    72360    70261  97%    0.59K    1340       54     42880K inode_cache
    71040    71040 100%    0.12K    2220       32      8880K eventpoll_epi
    68096    59262  87%    0.02K     266      256      1064K kmalloc-16
    53652    53652 100%    0.04K     526      102      2104K pde_opener
    50496    31654  62%    2.00K    3156       16    100992K kmalloc-2048
    46242    46242 100%    0.19K    1101       42      8808K cred_jar
    44496    43013  96%    0.66K     927       48     29664K proc_inode_cache
    44352    44352 100%    0.06K     693       64      2772K task_delay_info
    43516    43471  99%    0.69K     946       46     30272K sock_inode_cache
    37856    27626  72%    1.00K    1183       32     37856K kmalloc-1024
    36736    36736 100%    0.07K     656       56      2624K eventpoll_pwq
    34076    31282  91%    0.57K    1217       28     19472K radix_tree_node
    33660    30528  90%    1.05K    1122       30     35904K ext4_inode_cache
    32760    30959  94%    0.19K     780       42      6240K kmalloc-192
    32028    32028 100%    0.04K     314      102      1256K ext4_extent_status
    30048    30048 100%    0.25K     939       32      7512K skbuff_head_cache
    28736    28736 100%    0.06K     449       64      1796K fs_cache
    24702    24702 100%    0.69K     537       46     17184K files_cache
    23808    23808 100%    0.66K     496       48     15872K ovl_inode
    23104    22945  99%    0.12K     722       32      2888K kmalloc-128
    22724    21307  93%    0.69K     494       46     15808K shmem_inode_cache
    21472    21472 100%    0.12K     671       32      2684K seq_file
    19904    19904 100%    1.00K     622       32     19904K UNIX
    17340    17340 100%    1.06K     578       30     18496K mm_struct
    15980    15980 100%    0.02K      94      170       376K avtab_node
    14070    14070 100%    1.06K     469       30     15008K signal_cache
    13248    13248 100%    0.12K     414       32      1656K pid
    12128    11777  97%    0.25K     379       32      3032K kmalloc-256
    11008    11008 100%    0.02K      43      256       172K selinux_file_security
    10812    10812 100%    0.04K     106      102       424K Acpi-Namespace

This information shows that 'iommu_iova' is the top memory consumer.
In order to optimize the network performance of OpenStack virtual machines,
I enabled the VT-d feature in the BIOS and the SR-IOV feature of the Intel
82599 10G NIC. I'm assuming this is the root cause of this issue.

Is there anything I can do to fix it?
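As a quick cross-check that the slabtop figures above are self-consistent
(assuming 4K pages and one page per slab, which holds for an order-0 cache
of 0.06K objects), the reported CACHE SIZE should equal SLABS times the
page size. A minimal sketch using the iommu_iova row quoted above:

```shell
# Sanity-check the iommu_iova row: CACHE SIZE = SLABS * 4K page size.
line="80253888 80253888 100% 0.06K 1253967 64 5015868K iommu_iova"
set -- $line            # split into positional parameters
slabs=$5                # SLABS column
cache_kb=$(( slabs * 4 ))
echo "${cache_kb}K"     # prints 5015868K, matching the reported cache size
```

The same arithmetic can be run against a live box with
`grep iommu_iova /proc/slabinfo` (root required) to watch whether the
cache keeps growing after DMA-heavy workloads finish.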
[PATCH v2 1/2] uacce: Remove mm_exit() op
The mm_exit() op will be removed from the SVA API. When a process dies and
its mm goes away, the IOMMU driver won't notify device drivers anymore.
Drivers should expect to handle a lot more aborted DMA. On the upside, it
does greatly simplify the queue management.

The uacce_mm struct, which tracks all queues bound to an mm, was only used
by the mm_exit() callback. Remove it.

Signed-off-by: Jean-Philippe Brucker
---
v1->v2: clear q->handle after unbind
---
 include/linux/uacce.h      |  34 ++--
 drivers/misc/uacce/uacce.c | 172 +
 2 files changed, 51 insertions(+), 155 deletions(-)

diff --git a/include/linux/uacce.h b/include/linux/uacce.h
index 0e215e6d0534a..454c2f6672d79 100644
--- a/include/linux/uacce.h
+++ b/include/linux/uacce.h
@@ -68,19 +68,21 @@ enum uacce_q_state {
  * @uacce: pointer to uacce
  * @priv: private pointer
  * @wait: wait queue head
- * @list: index into uacce_mm
- * @uacce_mm: the corresponding mm
+ * @list: index into uacce queues list
  * @qfrs: pointer of qfr regions
  * @state: queue state machine
+ * @pasid: pasid associated to the mm
+ * @handle: iommu_sva handle returned by iommu_sva_bind_device()
  */
 struct uacce_queue {
 	struct uacce_device *uacce;
 	void *priv;
 	wait_queue_head_t wait;
 	struct list_head list;
-	struct uacce_mm *uacce_mm;
 	struct uacce_qfile_region *qfrs[UACCE_MAX_REGION];
 	enum uacce_q_state state;
+	int pasid;
+	struct iommu_sva *handle;
 };
 
 /**
@@ -96,8 +98,8 @@ struct uacce_queue {
  * @cdev: cdev of the uacce
  * @dev: dev of the uacce
  * @priv: private pointer of the uacce
- * @mm_list: list head of uacce_mm->list
- * @mm_lock: lock for mm_list
+ * @queues: list of queues
+ * @queues_lock: lock for queues list
  * @inode: core vfs
  */
 struct uacce_device {
@@ -112,27 +114,9 @@ struct uacce_device {
 	struct cdev *cdev;
 	struct device dev;
 	void *priv;
-	struct list_head mm_list;
-	struct mutex mm_lock;
-	struct inode *inode;
-};
-
-/**
- * struct uacce_mm - keep track of queues bound to a process
- * @list: index into uacce_device
- * @queues: list of queues
- * @mm: the mm struct
- * @lock: protects the list of queues
- * @pasid: pasid of the uacce_mm
- * @handle: iommu_sva handle return from iommu_sva_bind_device
- */
-struct uacce_mm {
-	struct list_head list;
 	struct list_head queues;
-	struct mm_struct *mm;
-	struct mutex lock;
-	int pasid;
-	struct iommu_sva *handle;
+	struct mutex queues_lock;
+	struct inode *inode;
 };
 
 #if IS_ENABLED(CONFIG_UACCE)
diff --git a/drivers/misc/uacce/uacce.c b/drivers/misc/uacce/uacce.c
index d39307f060bd6..107028e77ca37 100644
--- a/drivers/misc/uacce/uacce.c
+++ b/drivers/misc/uacce/uacce.c
@@ -90,109 +90,39 @@ static long uacce_fops_compat_ioctl(struct file *filep,
 }
 #endif
 
-static int uacce_sva_exit(struct device *dev, struct iommu_sva *handle,
-			  void *data)
+static int uacce_bind_queue(struct uacce_device *uacce, struct uacce_queue *q)
 {
-	struct uacce_mm *uacce_mm = data;
-	struct uacce_queue *q;
-
-	/*
-	 * No new queue can be added concurrently because no caller can have a
-	 * reference to this mm. But there may be concurrent calls to
-	 * uacce_mm_put(), so we need the lock.
-	 */
-	mutex_lock(&uacce_mm->lock);
-	list_for_each_entry(q, &uacce_mm->queues, list)
-		uacce_put_queue(q);
-	uacce_mm->mm = NULL;
-	mutex_unlock(&uacce_mm->lock);
+	int pasid;
+	struct iommu_sva *handle;
 
-	return 0;
-}
-
-static struct iommu_sva_ops uacce_sva_ops = {
-	.mm_exit = uacce_sva_exit,
-};
-
-static struct uacce_mm *uacce_mm_get(struct uacce_device *uacce,
-				     struct uacce_queue *q,
-				     struct mm_struct *mm)
-{
-	struct uacce_mm *uacce_mm = NULL;
-	struct iommu_sva *handle = NULL;
-	int ret;
-
-	lockdep_assert_held(&uacce->mm_lock);
-
-	list_for_each_entry(uacce_mm, &uacce->mm_list, list) {
-		if (uacce_mm->mm == mm) {
-			mutex_lock(&uacce_mm->lock);
-			list_add(&q->list, &uacce_mm->queues);
-			mutex_unlock(&uacce_mm->lock);
-			return uacce_mm;
-		}
-	}
-
-	uacce_mm = kzalloc(sizeof(*uacce_mm), GFP_KERNEL);
-	if (!uacce_mm)
-		return NULL;
+	if (!(uacce->flags & UACCE_DEV_SVA))
+		return 0;
 
-	if (uacce->flags & UACCE_DEV_SVA) {
-		/*
-		 * Safe to pass an incomplete uacce_mm, since mm_exit cannot
-		 * fire while we hold a reference to the mm.
-		 */
-		handle = iommu_sva_bind_device(uacce->parent, mm, uacce_mm);
-		if (IS_ERR(handle))
-			goto err_free;
+	handle =
[PATCH v2 2/2] iommu: Remove iommu_sva_ops::mm_exit()
After binding a device to an mm, device drivers currently need to register
an mm_exit handler. This function is called when the mm exits, to
gracefully stop DMA targeting the address space and flush page faults to
the IOMMU.

This is deemed too complex for the MMU release() notifier, which may be
triggered by any mmput() invocation, from about 120 callsites [1]. The
upcoming SVA module has an example of such complexity: the I/O Page Fault
handler would need to call mmput_async() instead of mmput() after handling
an IOPF, to avoid triggering the release() notifier which would in turn
drain the IOPF queue and lock up.

Another concern is the DMA stop function taking too long, up to several
minutes [2]. For some mmput() callers this may disturb other users. For
example, if the OOM killer picks the mm bound to a device as the victim
and that mm's memory is locked, if the release() takes too long, it might
choose additional innocent victims to kill.

To simplify the MMU release notifier, don't forward the notification to
device drivers. Since they don't stop DMA on mm exit anymore, the PASID
lifetime is extended:

(1) The device driver calls bind(). A PASID is allocated. Here any DMA
    fault is handled by mm, and on error we don't print anything to
    dmesg. Userspace can easily trigger errors by issuing DMA on
    unmapped buffers.

(2) exit_mmap(), for example the process took a SIGKILL. This step
    doesn't happen during normal operations. Remove the pgd from the
    PASID table, since the page tables are about to be freed. Invalidate
    the IOTLBs. Here the device may still perform DMA on the address
    space. Incoming transactions are aborted but faults aren't printed
    out. ATS Translation Requests return Successful Translation
    Completions with R=W=0. PRI Page Requests return with Invalid
    Request.

(3) The device driver stops DMA, possibly following release of a fd, and
    calls unbind(). PASID table is cleared, IOTLB invalidated if
    necessary. The page fault queues are drained, and the PASID is
    freed. If DMA for that PASID is still running here, something went
    seriously wrong and errors should be reported.

For now remove iommu_sva_ops entirely. We might need to re-introduce them
at some point, for example to notify device drivers of unhandled IOPF.

[1] https://lore.kernel.org/linux-iommu/20200306174239.gm31...@ziepe.ca/
[2] https://lore.kernel.org/linux-iommu/4d68da96-0ad5-b412-5987-2f7a6aa79...@amd.com/

Signed-off-by: Jean-Philippe Brucker
---
v1->v2: tried to clarify the lifetime of the PASID
---
 include/linux/iommu.h | 30 --
 drivers/iommu/iommu.c | 11 ---
 2 files changed, 41 deletions(-)

diff --git a/include/linux/iommu.h b/include/linux/iommu.h
index 7ef8b0bda6951..bd330d6343b78 100644
--- a/include/linux/iommu.h
+++ b/include/linux/iommu.h
@@ -53,8 +53,6 @@ struct iommu_fault_event;
 typedef int (*iommu_fault_handler_t)(struct iommu_domain *,
 			struct device *, unsigned long, int, void *);
-typedef int (*iommu_mm_exit_handler_t)(struct device *dev, struct iommu_sva *,
-				       void *);
 typedef int (*iommu_dev_fault_handler_t)(struct iommu_fault *, void *);
 
 struct iommu_domain_geometry {
@@ -171,25 +169,6 @@ enum iommu_dev_features {
 
 #define IOMMU_PASID_INVALID	(-1U)
 
-/**
- * struct iommu_sva_ops - device driver callbacks for an SVA context
- *
- * @mm_exit: called when the mm is about to be torn down by exit_mmap. After
- *           @mm_exit returns, the device must not issue any more transaction
- *           with the PASID given as argument.
- *
- *           The @mm_exit handler is allowed to sleep. Be careful about the
- *           locks taken in @mm_exit, because they might lead to deadlocks if
- *           they are also held when dropping references to the mm. Consider the
- *           following call chain:
- *           mutex_lock(A); mmput(mm) -> exit_mm() -> @mm_exit() -> mutex_lock(A)
- *           Using mmput_async() prevents this scenario.
- *
- */
-struct iommu_sva_ops {
-	iommu_mm_exit_handler_t mm_exit;
-};
-
 #ifdef CONFIG_IOMMU_API
 
 /**
@@ -605,7 +584,6 @@ struct iommu_fwspec {
  */
 struct iommu_sva {
 	struct device *dev;
-	const struct iommu_sva_ops *ops;
 };
 
 int iommu_fwspec_init(struct device *dev, struct fwnode_handle *iommu_fwnode,
@@ -653,8 +631,6 @@ struct iommu_sva *iommu_sva_bind_device(struct device *dev,
 					struct mm_struct *mm,
 					void *drvdata);
 void iommu_sva_unbind_device(struct iommu_sva *handle);
-int iommu_sva_set_ops(struct iommu_sva *handle,
-		      const struct iommu_sva_ops *ops);
 int iommu_sva_get_pasid(struct iommu_sva *handle);
 
 #else /* CONFIG_IOMMU_API */
@@ -1058,12 +1034,6 @@ static inline void iommu_sva_unbind_device(struct iommu_sva *handle)
 {
 }
 
-static inline int
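The three lifetime stages in the commit message reduce to a simple
bind/unbind pattern on the driver side. A minimal sketch, using only the
iommu_sva_bind_device(), iommu_sva_get_pasid() and
iommu_sva_unbind_device() calls that survive this patch; the `my_*` names
and context struct are hypothetical, not a real API:

```c
/* Hypothetical driver context; illustrative only. */
struct my_dev_ctx {
	struct iommu_sva *handle;
	int pasid;
};

static int my_driver_start_sva(struct device *dev, struct mm_struct *mm,
			       struct my_dev_ctx *ctx)
{
	/* (1) Bind: allocate a PASID for this address space. */
	ctx->handle = iommu_sva_bind_device(dev, mm, ctx);
	if (IS_ERR(ctx->handle))
		return PTR_ERR(ctx->handle);

	/* No iommu_sva_set_ops()/mm_exit registration anymore. */
	ctx->pasid = iommu_sva_get_pasid(ctx->handle);
	return 0;
}

static void my_driver_stop_sva(struct my_dev_ctx *ctx)
{
	/*
	 * (3) Stop DMA first (e.g. on fd release), then unbind. Between
	 * exit_mmap() in step (2) and this point, outstanding DMA simply
	 * aborts without driver involvement.
	 */
	my_dev_stop_dma(ctx);		/* hypothetical helper */
	iommu_sva_unbind_device(ctx->handle);
	ctx->handle = NULL;
}
```

The point of the sketch is what is absent: there is no callback the core
can invoke on mm exit, so the driver's teardown path is entirely under its
own control.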
[PATCH v2 0/2] iommu: Remove iommu_sva_ops::mm_exit()
The IOMMU SVA API currently requires device drivers to implement an
mm_exit() callback, which stops device jobs that do DMA. This function is
called in the release() MMU notifier, when an address space that is shared
with a device exits.

It has been noted several times during discussions about SVA that
cancelling DMA jobs can be slow and complex, and doing it in the release()
notifier might cause synchronization issues. Device drivers must in any
case call unbind() to remove their bond, after stopping DMA from a more
favorable context (release of a file descriptor).

Patch 1 removes the mm_exit() callback from the uacce module, and patch 2
removes it from the IOMMU API. Since v1 [1] I fixed the uacce unbind
reported by Zhangfei and added details in the commit message of patch 2.

Jean-Philippe Brucker (2):
  uacce: Remove mm_exit() op
  iommu: Remove iommu_sva_ops::mm_exit()

 include/linux/iommu.h      |  30 ---
 include/linux/uacce.h      |  34 ++--
 drivers/iommu/iommu.c      |  11 ---
 drivers/misc/uacce/uacce.c | 172 +
 4 files changed, 51 insertions(+), 196 deletions(-)

-- 
2.26.0
Re: [EXT] Re: [RFC PATCH 1/4] bus: fsl-mc: add custom .dma_configure implementation
On Thu, Apr 23, 2020 at 09:56:54AM +, Makarand Pawagi wrote:

[...]

> > > At a first glance, looks like this could very well fix the ACPI
> > > scenario, but I have some unclarities on the approach:
> > >
> > > * are we going to rely in DT and ACPI generic layers even if these
> > > devices are not published / enumerated through DT or ACPI tables?
> >
> > Assuming you mean the DPRC devices rather than the MC itself, then yes;
> > in that sense it's exactly the same as how we treat
> > dynamically-discovered PCI devices.
> >
> > > * the firmware manages and provides discrete streamids for the
> > > devices it exposes so there's no translation involved. There's no
> > > requestor_id / input_id involved but it seems that we would still
> > > do some kind of translation relying for this on the DT/ACPI functions.
> >
> > Wrong - last time I looked, what that firmware actually manages are
> > *ICIDs* for the devices, not SMMU Stream IDs or GIC Device IDs; what
> > DT/ACPI specifies is a translation from ICID to Stream ID/Device ID.
> > The ICID is very much the requester/input ID for that translation.
> > Yes, in practice the "translation" is effectively always a trivial
> > identity mapping, but conceptually it most definitely exists. Yes,
> > the subtlety is incredibly easy to overlook because it's basically
> > drawing a distinction between one end of some wires vs. the other end,
> > but it matters.
> >
> > (and of course "trivial 1:1 translation" isn't even true in the case
> > of SMMU Stream ID values, since IIRC they are really composed of 5
> > different inputs, only one of which is (part of) the incoming ICID)
> >
> > > * MC firmware has its own stream_id (e.g. on some chips 0x4000,
> > > others 0xf00, so outside the range of stream_ids used for the mc
> > > devices) while for the devices on this bus, MC allocates stream_ids
> > > from a range (e.g. 0x17 - 0x3f). Is it possible to describe this in
> > > the IORT table?
> >
> > If it represents a unique ICID allocated to the MC itself, then sure,
> > it simply goes through the mapping like anything else. Just like a PCI
> > host bridge owns requester ID 0:0.0 and thus whatever Stream ID/Device
> > ID that might map to.
> >
> > If (for the sake of argument, because AIUI everything is an ICID in
> > this particular case) it's some hard-wired thing that exists in Stream
> > ID/Device ID space only, then it's a little trickier, but still in
> > scope. In DT we have a lovely distinction between "originating from
> > the node" and "translated through the node", e.g. "msi-parent" vs.
> > "msi-map"; IORT is not quite as clear-cut, but there are at least a
> > few options. If the valid input ID space is smaller than 32 bits, then
> > the "Named Component as bridge" binding could simply define special
> > out-of-range values to represent IDs originating from the bridge
> > itself, such that the NC driver knows what to do and from IORT's point
> > of view everything is just a normal mapping. Alternatively there's
> > already the example of SMMUv3 where we can have a mix of the normal
> > mappings from Stream ID to Device ID for the upstream masters plus a
> > single mapping for the SMMU's own Device ID - admittedly that depends
> > on the additional SMMUv3-specific Device ID Mapping Index property,
> > but if necessary it might be workable to have a de-facto interface for
> > NCs that only considers single mappings when configuring the NC
> > itself, and only considers normal mappings when configuring its
> > children. Or maybe define a new mapping flag or NC property if there's
> > a real need to specify such a situation unambiguously at the IORT
> > level.
> >
> > > * Regarding the of_map_rid() use you mentioned, I was planning to
> > > decouple the mc bus from the DT layer by dropping the use of
> > > of_map_rid(), see patch 4.
> > > I briefly glanced over the iort code and spotted this static
> > > function: iort_iommu_xlate(). Wouldn't it also help, of course after
> > > making it public?
> >
> > I won't speak for Lorenzo or claim we've agreed on an approach at all
> > (not least because in all honesty we haven't really discussed it
> > beyond these various email threads), but FWIW my vision is that
> > ultimately the DT/ACPI code would expose a *_dma_configure() interface
> > that takes an optional input ID, or (perhaps more likely) an explicit
> > pair of interfaces for "configure this regular device" and "configure
> > this device based on this 'host' device and this ID", and it becomes
> > the bus code's responsibility to pass the right device(s) and deal
> > with multiple IDs (i.e. for starters all the PCI alias stuff goes back
> > to the PCI code where it really should be, rather than having multiple
> > copies of magic PCI awareness deep down in DT/ACPI code).
> >
> > Robin.
>
> Hi Lorenzo, Robin,
RE: [EXT] Re: [RFC PATCH 1/4] bus: fsl-mc: add custom .dma_configure implementation
> -Original Message-
> From: Robin Murphy
> Sent: Wednesday, April 15, 2020 11:02 PM
> To: Laurentiu Tudor ; Lorenzo Pieralisi
> Cc: linux-ker...@vger.kernel.org; iommu@lists.linux-foundation.org;
> linux-arm-ker...@lists.infradead.org; linux-a...@vger.kernel.org;
> ard.biesheu...@linaro.org; Ioana Ciornei ; Diana Madalina Craciun (OSS)
> ; m...@kernel.org; j...@solid-run.com; Pankaj Bansal ; Makarand Pawagi
> ; Calvin Johnson ; Varun Sethi ; Cristi Sovaiala ;
> stuart.yo...@arm.com; jeremy.lin...@arm.com; j...@8bytes.org;
> t...@linutronix.de; ja...@lakedaemon.net
> Subject: [EXT] Re: [RFC PATCH 1/4] bus: fsl-mc: add custom
> .dma_configure implementation
>
> Caution: EXT Email
>
> On 2020-04-15 4:44 pm, Laurentiu Tudor wrote:
> >
> > On 4/14/2020 5:32 PM, Lorenzo Pieralisi wrote:
> >> On Wed, Mar 25, 2020 at 06:48:55PM +0200, Laurentiu Tudor wrote:
> >>> Hi Lorenzo,
> >>>
> >>> On 3/25/2020 2:51 PM, Lorenzo Pieralisi wrote:
> >>>> On Thu, Feb 27, 2020 at 12:05:39PM +0200, laurentiu.tu...@nxp.com
> >>>> wrote:
> >>>>> From: Laurentiu Tudor
> >>>>>
> >>>>> The devices on this bus are not discovered by way of device tree
> >>>>> but by queries to the firmware. It makes little sense to trick the
> >>>>> generic of layer into thinking that these devices are of related
> >>>>> so that we can get our dma configuration. Instead of doing that,
> >>>>> add our custom dma configuration implementation.
> >>>>>
> >>>>> Signed-off-by: Laurentiu Tudor
> >>>>> ---
> >>>>>  drivers/bus/fsl-mc/fsl-mc-bus.c | 31 ++-
> >>>>>  1 file changed, 30 insertions(+), 1 deletion(-)
> >>>>>
> >>>>> diff --git a/drivers/bus/fsl-mc/fsl-mc-bus.c
> >>>>> b/drivers/bus/fsl-mc/fsl-mc-bus.c
> >>>>> index 36eb25f82c8e..eafaa0e0b906 100644
> >>>>> --- a/drivers/bus/fsl-mc/fsl-mc-bus.c
> >>>>> +++ b/drivers/bus/fsl-mc/fsl-mc-bus.c
> >>>>> @@ -132,11 +132,40 @@ static int fsl_mc_bus_uevent(struct device *dev, struct kobj_uevent_env *env)
> >>>>>  static int fsl_mc_dma_configure(struct device *dev)
> >>>>>  {
> >>>>>  	struct device *dma_dev = dev;
> >>>>> +	struct iommu_fwspec *fwspec;
> >>>>> +	const struct iommu_ops *iommu_ops;
> >>>>> +	struct fsl_mc_device *mc_dev = to_fsl_mc_device(dev);
> >>>>> +	int ret;
> >>>>> +	u32 icid;
> >>>>>
> >>>>>  	while (dev_is_fsl_mc(dma_dev))
> >>>>>  		dma_dev = dma_dev->parent;
> >>>>>
> >>>>> -	return of_dma_configure(dev, dma_dev->of_node, 0);
> >>>>> +	fwspec = dev_iommu_fwspec_get(dma_dev);
> >>>>> +	if (!fwspec)
> >>>>> +		return -ENODEV;
> >>>>> +	iommu_ops = iommu_ops_from_fwnode(fwspec->iommu_fwnode);
> >>>>> +	if (!iommu_ops)
> >>>>> +		return -ENODEV;
> >>>>> +
> >>>>> +	ret = iommu_fwspec_init(dev, fwspec->iommu_fwnode, iommu_ops);
> >>>>> +	if (ret)
> >>>>> +		return ret;
> >>>>> +
> >>>>> +	icid = mc_dev->icid;
> >>>>> +	ret = iommu_fwspec_add_ids(dev, &icid, 1);
> >>>>
> >>>> I see. So with this patch we would use the MC named component only
> >>>> to retrieve the iommu_ops
> >>>
> >>> Right. I'd also add that the implementation tries to follow the
> >>> existing standard .dma_configure implementations, e.g.
> >>> of_dma_configure + of_iommu_configure. I'd also note that similarly
> >>> to the ACPI case, this MC FW device is probed as a platform device
> >>> in the DT scenario, binding here [1].
> >>> A similar approach is used for the retrieval of the msi irq domain,
> >>> see following patch.
> >>>
> >>>> - the streamid are injected directly here bypassing OF/IORT
> >>>> bindings translations altogether.
> >>>
> >>> Actually I've submitted a v2 [2] that calls into .of_xlate() to
> >>> allow the smmu driver to do some processing on the raw streamid
> >>> coming from the firmware. I have not yet tested this with ACPI but
> >>> expect it to work, however, it's debatable how valid is this
> >>> approach in the context of ACPI.
> >>
> >> Actually, what I think you need is of_map_rid() (and an IORT
> >> equivalent, that I am going to write - generalizing
> >> iort_msi_map_rid()).
> >>
> >> Would that be enough to enable IORT "normal" mappings in the MC bus
> >> named components ?
> >>
> >
> > At a first glance, looks like this could very well fix the ACPI
> > scenario, but I have some unclarities on the approach:
> >
> > * are we going to rely in DT and ACPI generic layers even if these
> > devices are not published / enumerated through DT or ACPI tables?
>
> Assuming you mean the DPRC devices rather than the MC itself, then yes;
> in that sense it's exactly the same as how we treat
> dynamically-discovered PCI devices.
>
> > * the firmware manages and provides discrete streamids for the
> > devices it exposes so there's no translation involved. There's no
> > requestor_id / input_id involved but it seems that we would still
> > do some kind of translation relying for this on the DT/ACPI functions.
>
> Wrong - last time I looked, what that firmware actually manages are
> *ICIDs* for the devices, not SMMU
Re: [PATCH 04/29] staging: media: ipu3: use vmap instead of reimplementing it
On Tue, Apr 14, 2020 at 03:13:23PM +0200, Christoph Hellwig wrote:
> Just use vmap instead of messing with vmalloc internals.
>
> Signed-off-by: Christoph Hellwig
> Acked-by: Peter Zijlstra (Intel)

Thanks!

Acked-by: Sakari Ailus

-- 
Sakari Ailus
[PATCHv2] iommu/arm-smmu: Make remove callback message more informative
Currently on reboot/shutdown, the following messages are printed on the console as error messages by the remove callback before the system reboots or shuts down.

On SC7180:

arm-smmu 1500.iommu: removing device with active domains!
arm-smmu 504.iommu: removing device with active domains!

Make this error message more informative and less scary.

Reported-by: Douglas Anderson
Suggested-by: Robin Murphy
Signed-off-by: Sai Prakash Ranjan
---
 drivers/iommu/arm-smmu.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/iommu/arm-smmu.c b/drivers/iommu/arm-smmu.c
index e622f4e33379..8ea634876e6c 100644
--- a/drivers/iommu/arm-smmu.c
+++ b/drivers/iommu/arm-smmu.c
@@ -2244,7 +2244,7 @@ static int arm_smmu_device_remove(struct platform_device *pdev)
 		return -ENODEV;
 
 	if (!bitmap_empty(smmu->context_map, ARM_SMMU_MAX_CBS))
-		dev_err(&pdev->dev, "removing device with active domains!\n");
+		dev_info(&pdev->dev, "disabling translation\n");
 
 	arm_smmu_bus_init(NULL);
 	iommu_device_unregister(&smmu->iommu);

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] iommu/arm-smmu: Demote error messages to debug in shutdown callback
On 2020-04-23 14:58, Robin Murphy wrote:
> On 2020-04-23 9:17 am, Sai Prakash Ranjan wrote:
>> [...]
>>> Any update on the status here? If I'm reading the conversation above,
>>> Robin said: "we'll *always* see the warning because there's no way to
>>> tear down the default DMA domains, and even if all devices *have* been
>>> nicely quiesced there's no way to tell". Did I understand that
>>> properly? If so, it seems like it's fully expected to see this message
>>> on every reboot and it doesn't necessarily signify anything bad.
>>
>> Understanding is the same, waiting for Will and Robin to check if it's
>> OK to make the message more friendly.
>
> The way I see it, we essentially just want *something* visible that
> will correlate with any misbehaviour that *might* result from turning
> off a possibly-live context. How about simply "disabling translation",
> at dev_warn or dev_info level?

Sounds good, I'll go with "disabling translation" with dev_info.

Thanks,
Sai

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
Re: [PATCH] iommu/arm-smmu: Demote error messages to debug in shutdown callback
On 2020-04-23 9:17 am, Sai Prakash Ranjan wrote:
> [...]
>> Any update on the status here? If I'm reading the conversation above,
>> Robin said: "we'll *always* see the warning because there's no way to
>> tear down the default DMA domains, and even if all devices *have* been
>> nicely quiesced there's no way to tell". Did I understand that
>> properly? If so, it seems like it's fully expected to see this message
>> on every reboot and it doesn't necessarily signify anything bad.
>
> Understanding is the same, waiting for Will and Robin to check if it's
> OK to make the message more friendly.

The way I see it, we essentially just want *something* visible that will correlate with any misbehaviour that *might* result from turning off a possibly-live context. How about simply "disabling translation", at dev_warn or dev_info level?

Robin.
Re: [PATCH] iommu/arm-smmu: Demote error messages to debug in shutdown callback
Hi, On 2020-04-23 01:19, Doug Anderson wrote: Hi, On Tue, Mar 31, 2020 at 12:53 AM Sai Prakash Ranjan wrote: Hi Will, On 2020-03-31 13:14, Will Deacon wrote: > On Tue, Mar 31, 2020 at 01:06:11PM +0530, Sai Prakash Ranjan wrote: >> On 2020-03-30 23:54, Doug Anderson wrote: >> > On Sat, Mar 28, 2020 at 12:35 AM Sai Prakash Ranjan >> > wrote: >> > > >> > > > Of course the fact that in practice we'll *always* see the warning >> > > > because there's no way to tear down the default DMA domains, and even >> > > > if all devices *have* been nicely quiesced there's no way to tell, is >> > > > certainly less than ideal. Like I say, it's not entirely clear-cut >> > > > either way... >> > > > >> > > >> > > Thanks for these examples, good to know these scenarios in case we >> > > come >> > > across these. >> > > However, if we see these error/warning messages appear everytime then >> > > what will be >> > > the credibility of these messages? We will just ignore these messages >> > > when >> > > these issues you mention actually appears because we see them >> > > everytime >> > > on >> > > reboot or shutdown. >> > >> > I would agree that if these messages are expected to be seen every >> > time, there's no way to fix them, and they're not indicative of any >> > problem then something should be done. Seeing something printed at >> > "dev_error" level with an exclamation point (!) at the end makes me >> > feel like this is something that needs immediate action on my part. >> > >> > If we really can't do better but feel that the messages need to be >> > there, at least make them dev_info and less scary like: >> > >> > arm-smmu 1500.iommu: turning off; DMA should be quiesced before >> > now >> > >> > ...that would still give you a hint in the logs that if you saw a DMA >> > transaction after the message that it was a bug but also wouldn't >> > sound scary to someone who wasn't seeing any other problems. >> > >> >> We can do this if Robin is OK? 
>>> It would be nice if you could figure out which domains are still live
>>> when the SMMU is being shut down in your case and verify that it *is*
>>> in fact benign before we start making the message more friendly. As
>>> Robin said earlier, rogue DMA is a real nightmare to debug.
>>
>> I could see this error message for all the clients of apps_smmu. I
>> checked manually enabling bypass and removing the iommus dt property
>> for each client of apps_smmu.
>
> Any update on the status here? If I'm reading the conversation above,
> Robin said: "we'll *always* see the warning because there's no way to
> tear down the default DMA domains, and even if all devices *have* been
> nicely quiesced there's no way to tell". Did I understand that
> properly? If so, it seems like it's fully expected to see this message
> on every reboot and it doesn't necessarily signify anything bad.

Understanding is the same, waiting for Will and Robin to check if it's OK to make the message more friendly.

-Sai

-- 
QUALCOMM INDIA, on behalf of Qualcomm Innovation Center, Inc. is a member of Code Aurora Forum, hosted by The Linux Foundation
RE: [PATCH v12 2/8] iommu/vt-d: Use a helper function to skip agaw for SL
> From: Jacob Pan
> Sent: Wednesday, April 22, 2020 2:53 AM
>
> An Intel iommu domain uses 5-level page table by default. If the
> iommu that the domain tries to attach supports less page levels,
> the top level page tables should be skipped. Add a helper to do
> this so that it could be used in other places.
>
> Signed-off-by: Jacob Pan
> Reviewed-by: Eric Auger
> ---
>  drivers/iommu/intel-pasid.c | 33 +++++++++++++++++++++-----------
>  1 file changed, 23 insertions(+), 10 deletions(-)
>
> diff --git a/drivers/iommu/intel-pasid.c b/drivers/iommu/intel-pasid.c
> index 22b30f10b396..d9cea3011b58 100644
> --- a/drivers/iommu/intel-pasid.c
> +++ b/drivers/iommu/intel-pasid.c
> @@ -500,6 +500,25 @@ int intel_pasid_setup_first_level(struct intel_iommu *iommu,
>  }
>
>  /*
> + * Skip top levels of page tables for iommu which has less agaw
> + * than default. Unnecessary for PT mode.
> + */
> +static inline int iommu_skip_agaw(struct dmar_domain *domain,
> +				  struct intel_iommu *iommu,
> +				  struct dma_pte **pgd)
> +{
> +	int agaw;
> +
> +	for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
> +		*pgd = phys_to_virt(dma_pte_addr(*pgd));
> +		if (!dma_pte_present(*pgd))
> +			return -EINVAL;
> +	}
> +
> +	return agaw;
> +}
> +
> +/*
>   * Set up the scalable mode pasid entry for second only translation type.
>   */
>  int intel_pasid_setup_second_level(struct intel_iommu *iommu,
> @@ -522,17 +541,11 @@ int intel_pasid_setup_second_level(struct intel_iommu *iommu,
>  		return -EINVAL;
>  	}
>
> -	/*
> -	 * Skip top levels of page tables for iommu which has less agaw
> -	 * than default. Unnecessary for PT mode.
> -	 */
>  	pgd = domain->pgd;
> -	for (agaw = domain->agaw; agaw > iommu->agaw; agaw--) {
> -		pgd = phys_to_virt(dma_pte_addr(pgd));
> -		if (!dma_pte_present(pgd)) {
> -			dev_err(dev, "Invalid domain page table\n");
> -			return -EINVAL;
> -		}
> +	agaw = iommu_skip_agaw(domain, iommu, &pgd);
> +	if (agaw < 0) {
> +		dev_err(dev, "Invalid domain page table\n");
> +		return -EINVAL;
>  	}
>
>  	pgd_val = virt_to_phys(pgd);
> --
> 2.7.4

Reviewed-by: Kevin Tian