Re: [PATCH] mm: Remove double faults once write a device pfn
"Zhou, Xianrong" writes: > [AMD Official Use Only - General] > >> > The vmf_insert_pfn_prot could cause unnecessary double faults on a >> > device pfn. Because currently the vmf_insert_pfn_prot does not >> > make the pfn writable so the pte entry is normally read-only or >> > dirty catching. >> What? How do you got to this conclusion? >> >>> Sorry. I did not mention that this problem only exists on arm64 platform. >> >> Ok, that makes at least a little bit more sense. >> >> >> >>> Because on arm64 platform the PTE_RDONLY is automatically attached >> >>> to the userspace pte entries even through VM_WRITE + VM_SHARE. >> >>> The PTE_RDONLY needs to be cleared in vmf_insert_pfn_prot. However >> >>> vmf_insert_pfn_prot do not make the pte writable passing false >> >>> @mkwrite to insert_pfn. >> >> Question is why is arm64 doing this? As far as I can see they must >> >> have some hardware reason for that. >> >> >> >> The mkwrite parameter to insert_pfn() was added by commit >> >> b2770da642540 to make insert_pfn() look more like insert_pfn_pmd() so >> >> that the DAX code can insert PTEs which are writable and dirty at the same >> time. >> >> >> > This is one scenario to do so. In fact on arm64 there are many >> > scenarios could be to do so. So we can let vmf_insert_pfn_prot >> > supporting @mkwrite for drivers at core layer and let drivers to >> > decide whether or not to make writable and dirty at one time. The >> > patch did this. Otherwise double faults on arm64 when call >> vmf_insert_pfn_prot. >> >> Well, that doesn't answer my question why arm64 is double faulting in the >> first place,. >> > > > Eh. > > On arm64 When userspace mmap() with PROT_WRITE and MAP_SHARED the > vma->vm_page_prot has the PTE_RDONLY and PTE_WRITE within > PAGE_SHARED_EXEC. (seeing arm64 protection_map) > > When write the userspace virtual address the first fault happen and call > into driver's > .fault->ttm_bo_vm_fault_reserved->vmf_insert_pfn_prot->insert_pfn. > The insert_pfn will establish the pte entry. However the vmf_insert_pfn_prot > pass false @mkwrite to insert_pfn by default and so insert_pfn could not make > the pfn writable and it do not call maybe_mkwrite(pte_mkdirty(entry), vma) > to clear the PTE_RDONLY bit. So the pte entry is actually write protection > for mmu. > So when the first fault return and re-execute the store instruction the second > fault happen again. And in second fault it just only do pte_mkdirty(entry) > which > clear the PTE_RDONLY. It depends if the ARM64 CPU in question supports hardware dirty bit management (DBM). If that is the case and PTE_DBM (ie. PTE_WRITE) is set HW will automatically clear PTE_RDONLY bit to mark the entry dirty instead of raising a write fault. So you shouldn't see a double fault if PTE_DBM/WRITE is set. On ARM64 you can kind of think of PTE_RDONLY as the HW dirty bit and PTE_DBM as the read/write permission bit with SW being responsible for updating PTE_RDONLY via the fault handler if DBM is not supported by HW. At least that's my understanding from having hacked on this in the past. You can see all this weirdness happening in the definitions of pte_dirty() and pte_write() for ARM64. > I think so and hope no wrong. > >> So as long as this isn't sorted out I'm going to reject this patch. >> >> Regards, >> Christian. >> >> > >> >> This is a completely different use case to what you try to use it >> >> here for and that looks extremely fishy to me. >> >> >> >> Regards, >> >> Christian. 
>>> The first fault only sets up the pte entry, which is actually dirty
>>> tracking. And the second immediate fault to the pfn is due to the
>>> first dirty tracking when the cpu re-executes the store instruction.
>>
>> It could be that this is done to work around some hw behavior, but
>> not because of dirty tracking.
>>
>>> Normally if the drivers call vmf_insert_pfn_prot and also supply a
>>> 'pfn_mkwrite' callback within vm_operations_struct, which requires
>>> the pte to be dirty tracking, then the vmf_insert_pfn_prot and the
>>> double fault are reasonable. It is not a problem.
>>
>> Well, as far as I can see that behavior absolutely doesn't make
>> sense.
>>
>> When pfn_mkwrite is requested then the driver should use PAGE_COPY,
>> which is exactly what VMWGFX (the only driver using dirty tracking)
>> is doing.
>>
>> Everybody else uses PAGE_SHARED which should make the pte writeable
>> immediately.
>>
>> Regards,
>> Christian.
>>
>>> However most of the drivers calling vmf_insert_pfn_prot do not
>>> supply the 'pfn_mkwrite' callback, so the second fault is
>>> unnecessary.
>>>
>>> So just like the vmf_insert_mixed and vmf_insert_mixed_mkwrite pair,
>>> we should also supply vmf_insert_pfn_mkwrite for drivers as well.
>>>
>>> Signed-off-by: Xianrong Zhou
>>> ---
>>>  arch/x86/entry/vdso/vma.c | 3 +
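[A minimal sketch of what the proposed helper would plausibly look
like, mirroring the existing vmf_insert_mixed()/vmf_insert_mixed_mkwrite()
pair. This is an assumption based on the discussion, not the posted
patch itself:]

/* Sketch only: insert_pfn() is the existing static helper in
 * mm/memory.c; passing mkwrite = true makes it apply
 * maybe_mkwrite(pte_mkdirty(entry), vma), which clears PTE_RDONLY on
 * arm64 and so avoids the second fault described above. */
vm_fault_t vmf_insert_pfn_mkwrite(struct vm_area_struct *vma,
				  unsigned long addr, unsigned long pfn)
{
	return insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV),
			  vma->vm_page_prot, true);
}
EXPORT_SYMBOL(vmf_insert_pfn_mkwrite);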
Re: [Intel-gfx] [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices
Christian König writes:

> Am 01.12.23 um 06:48 schrieb Zeng, Oak:
>> [SNIP]
>> Besides memory eviction/oversubscription, there are a few other pain
>> points when I use hmm:
>>
>> 1) hmm doesn't support file-backed memory, so it is hard to share
>> memory b/t processes in a gpu environment. You mentioned you have a
>> plan... How hard is it to support file-backed in your approach?
>
> As hard as it is to support it through HMM. That's what I meant that
> this approach doesn't integrate well; as far as I know the problem
> isn't inside HMM or any other solution but rather in the file system
> layer.

In what way does HMM not support file-backed memory? I was under the
impression that at least hmm_range_fault() does.

 - Alistair

> Regards,
> Christian.
>
>> 2) virtual address range based memory attribute/hint: with hmadvise,
>> where do you save the memory attribute of a virtual address range? Do
>> you need to extend vm_area_struct to save it? With hmm, we have to
>> maintain such information in the driver. This ends up with pretty
>> complicated logic to split/merge those address ranges. I know core mm
>> has similar logic to split/merge vmas...
>>
>> Oak
>>
>>> -Weixi
>>>
>>> -----Original Message-----
>>> From: Christian König
>>> Sent: Thursday, November 30, 2023 4:28 PM
>>> To: Zeng, Oak; Christian König ; zhuweixi;
>>> linux-m...@kvack.org; linux-ker...@vger.kernel.org;
>>> a...@linux-foundation.org; Danilo Krummrich; Dave Airlie; Daniel
>>> Vetter
>>> Cc: intel-gvt-...@lists.freedesktop.org; rcampb...@nvidia.com;
>>> mhairgr...@nvidia.com; j...@nvidia.com; weixi@openeuler.sh;
>>> jhubb...@nvidia.com; intel-gfx@lists.freedesktop.org;
>>> apop...@nvidia.com; xinhui@amd.com; amd-...@lists.freedesktop.org;
>>> tvrtko.ursu...@linux.intel.com; ogab...@kernel.org;
>>> jgli...@redhat.com; dri-de...@lists.freedesktop.org; z...@nvidia.com;
>>> Vivi, Rodrigo ; alexander.deuc...@amd.com; leo...@nvidia.com;
>>> felix.kuehl...@amd.com; Wang, Zhi A ; mgor...@suse.de
>>> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory
>>> management) for external memory devices
>>>
>>> Hi Oak,
>>>
>>> yeah, #4 is indeed a really good point and I think Felix will agree
>>> to that as well.
>>>
>>> HMM is basically still missing a way to advise device attributes for
>>> the CPU address space. Both migration strategy as well as device
>>> specific information (like cache preferences) fall into this
>>> category.
>>>
>>> Since there is a device specific component in those attributes as
>>> well I think device specific IOCTLs still make sense to update them,
>>> but HMM should offer the functionality to manage and store that
>>> information.
>>>
>>> Split and merge of VMAs only become a problem if you attach that
>>> information to VMAs; if you keep it completely separate then that
>>> doesn't become an issue either. The down side of this approach is
>>> that you don't get automatically extending attribute ranges for
>>> growing VMAs, for example.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 29.11.23 um 23:23 schrieb Zeng, Oak:
>>>> Hi Weixi,
>>>>
>>>> Even though Christian has listed reasons rejecting this proposal
>>>> (yes they are very reasonable to me), I would open my mind and
>>>> further explore the possibility here. Since current GPU drivers use
>>>> an hmm based implementation (AMD and NV have done this; at Intel we
>>>> are catching up), I want to explore how much we can benefit from
>>>> the proposed approach and how your approach can solve some pain
>>>> points of our development.
>>>> So basically what I am questioning here is: what is the advantage
>>>> of your approach against hmm.
>>>>
>>>> To implement a UVM (unified virtual address space b/t cpu and gpu
>>>> device) with hmm, the driver essentially needs to implement the
>>>> below functions:
>>>>
>>>> 1. device page table update. Your approach requires the same
>>>> because this is device specific code.
>>>>
>>>> 2. Some migration functions to migrate memory b/t system memory and
>>>> GPU local memory. My understanding is, even though you generalized
>>>> this a bit, such as the modified cpu page fault path and the
>>>> provided "general" gm_dev_fault handler... the device driver still
>>>> needs to provide migration functions because migration functions
>>>> have to be device specific (i.e., using the device dma/copy engine
>>>> for performance purposes). Right?
>>>>
>>>> 3. GPU physical memory management; this part is now in drm/buddy,
>>>> shared by all drivers. I think with your approach, the driver still
>>>> needs to provide callback functions to allocate/free physical
>>>> pages. Right? Or do you let linux core mm buddy manage device
>>>> memory directly?
>>>>
>>>> 4. madvise/hints/virtual address range management. This has been a
>>>> pain point for us. Right now the device driver has to maintain
>>>> certain virtual address range data structures to maintain hints and
>>>> other virtual address range based memory attributes. The driver
>>>> needs to sync with linux vma. The driver needs to explicitly
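[Point 2 above refers to the migrate_vma family of helpers. A minimal
sketch of the driver-side pattern follows; the migrate_vma_*() calls
are the real kernel API, while my_migrate_range_to_device() and
my_alloc_device_page() are hypothetical driver hooks, and error
handling plus the device DMA copy are elided. The caller is assumed to
hold the mmap lock:]

static int my_migrate_range_to_device(struct vm_area_struct *vma,
				      unsigned long start, unsigned long end)
{
	unsigned long npages = (end - start) >> PAGE_SHIFT;
	unsigned long *src = kcalloc(npages, sizeof(*src), GFP_KERNEL);
	unsigned long *dst = kcalloc(npages, sizeof(*dst), GFP_KERNEL);
	struct migrate_vma args = {
		.vma	= vma,
		.start	= start,
		.end	= end,
		.src	= src,
		.dst	= dst,
		.flags	= MIGRATE_VMA_SELECT_SYSTEM,
	};
	unsigned long i;
	int ret;

	if (!src || !dst) {
		ret = -ENOMEM;
		goto out;
	}

	ret = migrate_vma_setup(&args);	/* unmaps and collects source pages */
	if (ret)
		goto out;

	for (i = 0; i < npages; i++) {
		struct page *dpage;

		if (!(src[i] & MIGRATE_PFN_MIGRATE))
			continue;
		/* hypothetical: returns a locked device-private page */
		dpage = my_alloc_device_page();
		/* device-specific DMA copy from
		 * migrate_pfn_to_page(src[i]) to dpage goes here */
		dst[i] = migrate_pfn(page_to_pfn(dpage));
	}

	migrate_vma_pages(&args);	/* installs the new entries */
	migrate_vma_finalize(&args);	/* releases the old pages */
out:
	kfree(src);
	kfree(dst);
	return ret;
}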
Re: [Intel-gfx] [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices
"Zeng, Oak" writes: > See inline comments > >> -Original Message- >> From: dri-devel On Behalf Of >> zhuweixi >> Sent: Thursday, November 30, 2023 5:48 AM >> To: Christian König ; Zeng, Oak >> ; Christian König ; linux- >> m...@kvack.org; linux-ker...@vger.kernel.org; a...@linux-foundation.org; >> Danilo Krummrich ; Dave Airlie ; Daniel >> Vetter >> Cc: tvrtko.ursu...@linux.intel.com; rcampb...@nvidia.com; apop...@nvidia.com; >> z...@nvidia.com; weixi@openeuler.sh; jhubb...@nvidia.com; intel- >> g...@lists.freedesktop.org; mhairgr...@nvidia.com; Wang, Zhi A >> ; xinhui@amd.com; amd-...@lists.freedesktop.org; >> jgli...@redhat.com; dri-de...@lists.freedesktop.org; j...@nvidia.com; Vivi, >> Rodrigo ; alexander.deuc...@amd.com; >> felix.kuehl...@amd.com; intel-gvt-...@lists.freedesktop.org; >> ogab...@kernel.org; leo...@nvidia.com; mgor...@suse.de >> Subject: RE: [RFC PATCH 0/6] Supporting GMEM (generalized memory >> management) for external memory devices >> >> Glad to know that there is a common demand for a new syscall like >> hmadvise(). I >> expect it would also be useful for homogeneous NUMA cases. Credits to >> cudaMemAdvise() API which brought this idea to GMEM's design. >> >> To answer @Oak's questions about GMEM vs. HMM, >> >> Here is the major difference: >> GMEM's main target is to stop drivers from reinventing MM code, while >> HMM/MMU notifiers provide a compatible struct page solution and a >> coordination mechanism for existing device driver MMs that requires adding >> extra code to interact with CPU MM. >> >> A straightforward qualitative result for the main target: after integrating >> Huawei's >> Ascend NPU driver with GMEM's interface, 30,000 lines of MM code were cut, >> leaving <100 lines invoking GMEM interface and 3,700 lines implementing >> vendor- >> specific functions. Some code from the 3,700 lines should be further moved to >> GMEM as a generalized feature like device memory oversubscription, but not >> included in this RFC patch yet. >> >> A list of high-level differences: >> 1. With HMM/MMU notifiers, drivers need to first implement a full MM >> subsystem. With GMEM, drivers can reuse Linux's core MM. > > A full mm subsystem essentially has below functions: > > Physical memory management: neither your approach nor hmm-based > solution provide device physical memory management. You mentioned you > have a plan but at least for now driver need to mange device physical > memory. > > Virtual address space management: both approach leverage linux core mm, vma > for this. > > Data eviction, migration: with hmm, driver need to implement this. It > is not clear whether gmem has this function. I guess even gmem has it, > it might be slow cpu data copy, compared to modern gpu's fast data > copy engine. > > Device page table update, va-pa mapping: I think it is driver's > responsibility in both approach. > > So from the point of re-use core MM, I don't see big difference. Maybe > you did it more elegantly. I think it is very possible with your > approach driver can be simpler, less codes. > >> >> 2. HMM encodes device mapping information in the CPU arch-dependent PTEs, >> while GMEM proposes an abstraction layer in vm_object. Since GMEM's >> approach further decouples the arch-related stuff, drivers do not need to >> implement separate code for X86/ARM and etc. > > I don't understand this...with hmm, when a virtual address range's > backing store is in device memory, cpu pte is encoded to point to > device memory. 
> Device page table is also encoded to point to the same device memory
> location. But since device memory is not accessible to the CPU
> (DEVICE_PRIVATE), when the cpu accesses this virtual address there is
> a cpu page fault. Device mapping info is still in the device page
> table, not in cpu ptes.
>
> I do not see with hmm why the driver needs to implement x86/arm
> code... the driver only takes care of the device page table. Hmm/core
> mm take care of the cpu page table, right?

I see our replies have crossed, but that is my understanding as well.

>> 3. MMU notifiers register hooks at certain core MM events, while GMEM
>> declares basic functions and internally invokes them. GMEM requires
>> less from the driver side -- no need to understand what core MM does
>> at certain MMU events. GMEM also expects fewer bugs than MMU
>> notifiers: implementing basic operations with standard declarations
>> vs. implementing whatever random device MM logic in MMU notifiers.
>
> This seems true to me. I feel the mmu notifier thing, especially the
> synchronization/lock design (those sequence numbers, interacting with
> driver locks, and the mmap lock) is very complicated. I indeed spent
> time to understand the specification documented in hmm.rst...

No argument there, but I think that's something we could look at
providing an improved interface for. I don't think it needs a whole new
subsystem to fix. Probably just a version of hmm_range_fault() that
takes the lock and s
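[For readers unfamiliar with the locking being discussed, the canonical
hmm_range_fault() sequence, simplified from the example in
Documentation/mm/hmm.rst, looks roughly like this; the driver-side lock
that the interval notifier's invalidate() callback also takes is
elided, and my_snapshot_range() is a hypothetical name:]

static int my_snapshot_range(struct mmu_interval_notifier *notifier,
			     struct mm_struct *mm,
			     unsigned long start, unsigned long end,
			     unsigned long *pfns)
{
	struct hmm_range range = {
		.notifier	= notifier,
		.start		= start,
		.end		= end,
		.hmm_pfns	= pfns,
		.default_flags	= HMM_PFN_REQ_FAULT,
	};
	int ret;

again:
	range.notifier_seq = mmu_interval_read_begin(notifier);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	if (ret) {
		if (ret == -EBUSY)
			goto again;	/* collided with an invalidation */
		return ret;
	}

	/* take the driver page-table lock here ... */
	if (mmu_interval_read_retry(notifier, range.notifier_seq))
		goto again;	/* ... dropping it before retrying */

	/* pfns[] is now a consistent snapshot; program the device PTEs
	 * under the same lock the invalidate() callback takes. */
	return 0;
}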
Re: [Intel-gfx] [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices
zhuweixi writes:

> Glad to know that there is a common demand for a new syscall like
> hmadvise(). I expect it would also be useful for homogeneous NUMA
> cases. Credits to the cudaMemAdvise() API which brought this idea to
> GMEM's design.

It's not clear to me that this would need to be a new syscall. Scanning
the patches it looks like you're adding a NUMA node anyway, so the
existing interfaces (eg. madvise) with their various options
(MPOL_PREFERRED/PREFERRED_MANY) and
set_mempolicy/set_mempolicy_home_node() could potentially cover this
for both NUMA and hNUMA nodes. The main advantage here would be
providing a common userspace interface for setting these kinds of
hints.

> To answer @Oak's questions about GMEM vs. HMM,
>
> Here is the major difference:
> GMEM's main target is to stop drivers from reinventing MM code, while
> HMM/MMU notifiers provide a compatible struct page solution and a
> coordination mechanism for existing device driver MMs that requires
> adding extra code to interact with CPU MM.
>
> A straightforward qualitative result for the main target: after
> integrating Huawei's Ascend NPU driver with GMEM's interface, 30,000
> lines of MM code were cut, leaving <100 lines invoking the GMEM
> interface and 3,700 lines implementing vendor-specific functions. Some
> code from the 3,700 lines should be further moved to GMEM as a
> generalized feature like device memory oversubscription, but it is not
> included in this RFC patch yet.

I think it would be helpful if you could be a bit more specific on what
functionality the current HMM/migrate_vma/MMU notifier interfaces are
missing that every driver has to implement in a common way. Because I'm
not convinced we can't either improve those interfaces to provide
what's needed or add specific components (eg. a physical page
allocator) instead of a whole new framework.

> A list of high-level differences:
>
> 1. With HMM/MMU notifiers, drivers need to first implement a full MM
> subsystem. With GMEM, drivers can reuse Linux's core MM.

What do the common bits of this full MM subsystem look like?
Fundamentally the existing HMM functionality can already make use of
Linux core MM to manage page tables and migrate pages, and everything
else seems pretty device specific (ie. actual copying of data,
programming of MMUs, TLBs, etc.)

I can see that there would be scope to have say a generic memory
allocator, which I vaguely recall discussing in relation to
DEVICE_PRIVATE pages in the past, but @Oak suggests something close
already exists (drm/buddy). Potentially I suppose there is VA
allocation that might be common across devices. However I have not had
any experience working with devices with VA requirements different
enough from the CPU to matter. If they are so different I'm not
convinced it would be easy to have a common implementation anyway.

> 2. HMM encodes device mapping information in the CPU arch-dependent
> PTEs, while GMEM proposes an abstraction layer in vm_object. Since
> GMEM's approach further decouples the arch-related stuff, drivers do
> not need to implement separate code for X86/ARM and etc.

I'm not following this. At present all HMM encodes in CPU PTEs is the
fact a page has been migrated to the device and what permissions it
has. I'm not aware of needing to treat X86 and ARM differently for
example here. Are you saying you want somewhere to store other bits
attached to a particular VA?

> 3. MMU notifiers register hooks at certain core MM events, while GMEM
> declares basic functions and internally invokes them.
> GMEM requires less from the driver side -- no need to understand what
> core MM does at certain MMU events. GMEM also expects fewer bugs than
> MMU notifiers: implementing basic operations with standard
> declarations vs. implementing whatever random device MM logic in MMU
> notifiers.

How is this proposal any different though? From what I can see it
replaces MMU notifier callbacks with TLB invalidation callbacks, but
that is essentially what MMU notifier callbacks are anyway. The "random
device MM logic" should just be clearing device TLBs. What other MM
logic has to be implemented in the MMU notifier callbacks that is the
same between devices?

> 4. GMEM plans to support a more lightweight physical memory
> management. The discussion about this part can be found in my cover
> letter. The question is whether struct page should be compatible
> (directly use HMM's ZONE_DEVICE solution) or whether a trimmed,
> smaller struct page that satisfies generalized demands from
> accelerators is more preferable?

What is wrong with the current ZONE_DEVICE solution? You mention the
size of struct page, but that is already being worked on through the
conversion to folios. Admittedly higher order HMM ZONE_DEVICE folios
are not currently supported, but that is something I'm actively working
on at the moment.

> 5. GMEM has been demonstrated to allow device memory oversubscription
> (a GMEM-based 32GB NPU card can run a GPT model oversubscribin
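[On the point raised earlier that existing interfaces could cover
hmadvise()'s use case, a minimal userspace sketch, assuming the device
memory is exposed as NUMA node 2 (a placeholder id); there is no glibc
wrapper for set_mempolicy_home_node() (Linux 5.17+), hence the raw
syscall:]

#define _GNU_SOURCE
#include <numaif.h>		/* mbind() and MPOL_* constants */
#include <sys/syscall.h>
#include <unistd.h>

#ifndef MPOL_PREFERRED_MANY
#define MPOL_PREFERRED_MANY 5	/* Linux 5.15+; older numaif.h may lack it */
#endif

/* Hint that [addr, addr+len) should preferably live on node 2 --
 * roughly the "preferred location" half of what cudaMemAdvise()
 * offers, but through the existing NUMA interfaces. */
static int prefer_device_node(void *addr, unsigned long len)
{
	unsigned long nodemask = 1UL << 2;	/* assumed hNUMA node id */

	if (mbind(addr, len, MPOL_PREFERRED_MANY, &nodemask,
		  sizeof(nodemask) * 8 + 1, 0))
		return -1;

	/* Refine the hint with a home node for the range; applies to
	 * MPOL_BIND/MPOL_PREFERRED_MANY policies. */
	return syscall(SYS_set_mempolicy_home_node,
		       (unsigned long)addr, len, 2UL, 0UL);
}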
[Intel-gfx] [PATCH v4 1/5] arm64/smmu: Use TLBI ASID when invalidating entire range
The ARM SMMU has a specific command for invalidating the TLB for an
entire ASID. Currently this is used for the IO_PGTABLE API but not for
ATS when called from the MMU notifier.

The current implementation of notifiers does not attempt to invalidate
such a large address range, instead walking each VMA and invalidating
each range individually during mmap removal. However in future SMMU TLB
invalidations are going to be sent as part of the normal flush_tlb_*()
kernel calls. To better deal with that add handling to use TLBI ASID
when invalidating the entire address space.

Signed-off-by: Alistair Popple
Reviewed-by: Jason Gunthorpe
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index a5a63b1..2a19784 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -200,10 +200,20 @@ static void arm_smmu_mm_invalidate_range(struct mmu_notifier *mn,
 	 * range. So do a simple translation here by calculating size correctly.
 	 */
 	size = end - start;
+	if (size == ULONG_MAX)
+		size = 0;
+
+	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_BTM)) {
+		if (!size)
+			arm_smmu_tlb_inv_asid(smmu_domain->smmu,
+					      smmu_mn->cd->asid);
+		else
+			arm_smmu_tlb_inv_range_asid(start, size,
+						    smmu_mn->cd->asid,
+						    PAGE_SIZE, false,
+						    smmu_domain);
+	}
 
-	if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_BTM))
-		arm_smmu_tlb_inv_range_asid(start, size, smmu_mn->cd->asid,
-					    PAGE_SIZE, false, smmu_domain);
 	arm_smmu_atc_inv_domain(smmu_domain, mm->pasid, start, size);
 }

-- 
git-series 0.9.1
[Intel-gfx] [PATCH v4 2/5] mmu_notifiers: Fixup comment in mmu_interval_read_begin()
The comment in mmu_interval_read_begin() refers to a function that
doesn't exist and uses the wrong callback name. The op for mmu interval
notifiers is mmu_interval_notifier_ops->invalidate(), so fix the
comment up to reflect that.

Signed-off-by: Alistair Popple
Reviewed-by: Jason Gunthorpe
---
 mm/mmu_notifier.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 50c0dde..b7ad155 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -199,7 +199,7 @@ mmu_interval_read_begin(struct mmu_interval_notifier *interval_sub)
 	 * invalidate_start/end and is colliding.
 	 *
 	 * The locking looks broadly like this:
-	 *    mn_tree_invalidate_start():         mmu_interval_read_begin():
+	 *    mn_itree_inv_start():               mmu_interval_read_begin():
 	 *                                        spin_lock
 	 *                                         seq = READ_ONCE(interval_sub->invalidate_seq);
 	 *                                         seq == subs->invalidate_seq
@@ -207,7 +207,7 @@ mmu_interval_read_begin(struct mmu_interval_notifier *interval_sub)
 	 *    spin_lock
 	 *     seq = ++subscriptions->invalidate_seq
 	 *    spin_unlock
-	 *     op->invalidate_range():
+	 *     op->invalidate():
 	 *       user_lock
 	 *        mmu_interval_set_seq()
 	 *         interval_sub->invalidate_seq = seq

-- 
git-series 0.9.1
[Intel-gfx] [PATCH v4 5/5] mmu_notifiers: Rename invalidate_range notifier
There are two main use cases for mmu notifiers. One is by KVM which
uses mmu_notifier_invalidate_range_start()/end() to manage a software
TLB.

The other is to manage hardware TLBs which need to use the
invalidate_range() callback because HW can establish new TLB entries at
any time. Hence using start/end() can lead to memory corruption as
these callbacks happen too soon/late during page unmap.

mmu notifier users should therefore either use the start()/end()
callbacks or the invalidate_range() callbacks. To make this usage
clearer rename the invalidate_range() callback to
arch_invalidate_secondary_tlbs() and update documentation.

Signed-off-by: Alistair Popple
Suggested-by: Jason Gunthorpe
Acked-by: Catalin Marinas
Reviewed-by: Jason Gunthorpe
---
 arch/arm64/include/asm/tlbflush.h               |  6 +-
 arch/powerpc/mm/book3s64/radix_hugetlbpage.c    |  2 +-
 arch/powerpc/mm/book3s64/radix_tlb.c            |  8 +--
 arch/x86/include/asm/tlbflush.h                 |  2 +-
 arch/x86/mm/tlb.c                               |  2 +-
 drivers/iommu/amd/iommu_v2.c                    | 10 ++--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 13 ++---
 drivers/iommu/intel/svm.c                       |  8 +--
 drivers/misc/ocxl/link.c                        |  8 +--
 include/linux/mmu_notifier.h                    | 48 +-
 mm/huge_memory.c                                |  4 +-
 mm/hugetlb.c                                    |  7 +--
 mm/mmu_notifier.c                               | 21 ++--
 13 files changed, 76 insertions(+), 63 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index a99349d..84a05a0 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -253,7 +253,7 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	__tlbi(aside1is, asid);
 	__tlbi_user(aside1is, asid);
 	dsb(ish);
-	mmu_notifier_invalidate_range(mm, 0, -1UL);
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
 }
 
 static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
@@ -265,7 +265,7 @@ static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
 	addr = __TLBI_VADDR(uaddr, ASID(mm));
 	__tlbi(vale1is, addr);
 	__tlbi_user(vale1is, addr);
-	mmu_notifier_invalidate_range(mm, uaddr & PAGE_MASK,
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
 						(uaddr & PAGE_MASK) + PAGE_SIZE);
 }
 
@@ -400,7 +400,7 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 		scale++;
 	}
 	dsb(ish);
-	mmu_notifier_invalidate_range(vma->vm_mm, start, end);
+	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
 }
 
 static inline void flush_tlb_range(struct vm_area_struct *vma,
diff --git a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
index f3fb49f..17075c7 100644
--- a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
+++ b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
@@ -39,7 +39,7 @@ void radix__flush_hugetlb_tlb_range(struct vm_area_struct *vma, unsigned long st
 		radix__flush_tlb_pwc_range_psize(vma->vm_mm, start, end, psize);
 	else
 		radix__flush_tlb_range_psize(vma->vm_mm, start, end, psize);
-	mmu_notifier_invalidate_range(vma->vm_mm, start, end);
+	mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
 }
 
 void radix__huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 4d44902..06e647e 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -987,7 +987,7 @@ void radix__flush_tlb_mm(struct mm_struct *mm)
 		}
 	}
 	preempt_enable();
-	mmu_notifier_invalidate_range(mm, 0, -1UL);
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
 }
 EXPORT_SYMBOL(radix__flush_tlb_mm);
 
@@ -1021,7 +1021,7 @@ static void __flush_all_mm(struct mm_struct *mm, bool fullmm)
 		_tlbiel_pid_multicast(mm, pid, RIC_FLUSH_ALL);
 	}
 	preempt_enable();
-	mmu_notifier_invalidate_range(mm, 0, -1UL);
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
 }
 
 void radix__flush_all_mm(struct mm_struct *mm)
@@ -1230,7 +1230,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
 	}
 out:
 	preempt_enable();
-	mmu_notifier_invalidate_range(mm, start, end);
+	mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
 }
 
 void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
@@ -1395,7 +1395,7 @@ static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
 	}
 out:
 	preempt_enable();
-	mmu_notifier_invalidate_range(mm, start, end);
Re: [Intel-gfx] Regression in linux-next
Thanks Chaitanya for the detailed report. Dan Carpenter also reported a
Smatch warning for this:

https://lore.kernel.org/linux-mm/38ed0627-1283-4da2-827a-e90484d7bd7d@moroto.mountain/

The below should fix the problem; will respin the series to include the
fix.

---
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 63c8eb740af7..ec3b068cbbe6 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -621,9 +621,10 @@ int __mmu_notifier_register(struct mmu_notifier *subscription,
 	 * Subsystems should only register for invalidate_secondary_tlbs() or
 	 * invalidate_range_start()/end() callbacks, not both.
 	 */
-	if (WARN_ON_ONCE(subscription->ops->arch_invalidate_secondary_tlbs &&
-			 (subscription->ops->invalidate_range_start ||
-			  subscription->ops->invalidate_range_end)))
+	if (WARN_ON_ONCE(subscription &&
+			 (subscription->ops->arch_invalidate_secondary_tlbs &&
+			  (subscription->ops->invalidate_range_start ||
+			   subscription->ops->invalidate_range_end))))
 		return -EINVAL;
 
 	if (!mm->notifier_subscriptions) {

"Borah, Chaitanya Kumar" writes:

> Hello Alistair,
>
> Hope you are doing well. I am Chaitanya from the linux graphics team
> at Intel.
>
> This mail is regarding a regression we are seeing in our CI runs[1] on
> the linux-next repository.
>
> On next-20230720 [2], we are seeing the following error
>
> <4>[ 76.189375] Hardware name: Intel Corporation Meteor Lake Client Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3271.D81.2307101805 07/10/2023
> <4>[ 76.202534] RIP: 0010:__mmu_notifier_register+0x40/0x210
> <4>[ 76.207804] Code: 1a 71 5a 01 85 c0 0f 85 ec 00 00 00 48 8b 85 30 01 00 00 48 85 c0 0f 84 04 01 00 00 8b 85 cc 00 00 00 85 c0 0f 8e bb 01 00 00 <49> 8b 44 24 10 48 83 78 38 00 74 1a 48 83 78 28 00 74 0c 0f 0b b8
> <4>[ 76.226368] RSP: 0018:c900019d7ca8 EFLAGS: 00010202
> <4>[ 76.231549] RAX: 0001 RBX: 1000 RCX: 0001
> <4>[ 76.238613] RDX: RSI: 823ceb7b RDI: 823ee12d
> <4>[ 76.245680] RBP: 888102ec9b40 R08: R09: 0001
> <4>[ 76.252747] R10: 0001 R11: 8881157cd2c0 R12:
> <4>[ 76.259811] R13: 888102ec9c70 R14: a07de500 R15: 888102ec9ce0
> <4>[ 76.266875] FS: 7fbcabe11c00() GS:88846ec0() knlGS:
> <4>[ 76.274884] CS: 0010 DS: ES: CR0: 80050033
> <4>[ 76.280578] CR2: 0010 CR3: 00010d4c2005 CR4: 00f70ee0
> <4>[ 76.287643] DR0: DR1: DR2:
> <4>[ 76.294711] DR3: DR6: 07f0 DR7: 0400
> <4>[ 76.301775] PKRU: 5554
> <4>[ 76.304463] Call Trace:
> <4>[ 76.306893]
> <4>[ 76.308983] ? __die_body+0x1a/0x60
> <4>[ 76.312444] ? page_fault_oops+0x156/0x450
> <4>[ 76.316510] ? do_user_addr_fault+0x65/0x980
> <4>[ 76.320747] ? exc_page_fault+0x68/0x1a0
> <4>[ 76.324643] ? asm_exc_page_fault+0x26/0x30
> <4>[ 76.328796] ? __mmu_notifier_register+0x40/0x210
> <4>[ 76.333460] ? __mmu_notifier_register+0x11c/0x210
> <4>[ 76.338206] ? preempt_count_add+0x4c/0xa0
> <4>[ 76.342273] mmu_notifier_register+0x30/0xe0
> <4>[ 76.346509] mmu_interval_notifier_insert+0x74/0xb0
> <4>[ 76.351344] i915_gem_userptr_ioctl+0x21a/0x320 [i915]
> <4>[ 76.356565] ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
> <4>[ 76.362271] drm_ioctl_kernel+0xb4/0x150
> <4>[ 76.366159] drm_ioctl+0x21d/0x420
> <4>[ 76.369537] ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
> <4>[ 76.375242] ? find_held_lock+0x2b/0x80
> <4>[ 76.379046] __x64_sys_ioctl+0x79/0xb0
> <4>[ 76.382766] do_syscall_64+0x3c/0x90
> <4>[ 76.386312] entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> <4>[ 76.391317] RIP: 0033:0x7fbcae63f3ab
>
> Detailed logs can be found in [3].
>
> After bisecting the tree, the following patch seems to be causing the
> regression.
>
> commit 828fe4085cae77acb3abf7dd3d25b3ed6c560edf
> Author: Alistair Popple apop...@nvidia.com
> Date:   Wed Jul 19 22:18:46 2023 +1000
>
>     mmu_notifiers: rename invalidate_range notifier
>
>     There are two main use cases for mmu notifiers. One is by KVM
>     which uses mmu_notifier_invalidate_range_start()/end() to manage a
>     software TLB.
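[For context on why subscription can be NULL in the check being fixed:
mmu_interval_notifier_insert() sets up the mm's notifier state by
registering a NULL subscription when none exists yet. Roughly,
paraphrased from mm/mmu_notifier.c (details vary by version):]

int mmu_interval_notifier_insert(struct mmu_interval_notifier *interval_sub,
				 struct mm_struct *mm, unsigned long start,
				 unsigned long length,
				 const struct mmu_interval_notifier_ops *ops)
{
	struct mmu_notifier_subscriptions *subscriptions;
	int ret;

	subscriptions = smp_load_acquire(&mm->notifier_subscriptions);
	if (!subscriptions || !subscriptions->has_itree) {
		/* NULL subscription: only allocates the mm's notifier
		 * state, so __mmu_notifier_register() must check for
		 * NULL before dereferencing subscription->ops. */
		ret = mmu_notifier_register(NULL, mm);
		if (ret)
			return ret;
		subscriptions = mm->notifier_subscriptions;
	}
	return __mmu_interval_notifier_insert(interval_sub, mm,
					      subscriptions, start,
					      length, ops);
}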
[Intel-gfx] [PATCH v4 0/5] Invalidate secondary IOMMU TLB on permission upgrade
The main change is to move secondary TLB invalidation mmu notifier
callbacks into the architecture specific TLB flushing functions. This
makes secondary TLB invalidation mostly match CPU invalidation while
still allowing efficient range based invalidations based on the
existing TLB batching code.

Changes for v4:

 - Fixed a NULL pointer dereference when registering the first notifier
   with mmu_interval_notifier_insert() instead of
   mmu_notifier_register() - thanks Dan and Chaitanya for the bug
   reports.
 - Collected Acked/Reviewed tags.
 - Don't call the notifier from radix__local_flush_tlb_page() on
   PowerPC.

Changes for v3:

 - On x86 call the invalidation when adding pending TLB invalidates
   rather than when flushing the batch. This is because the mm is
   required. It also matches what happens on ARM64. Fixes a bug
   reported by SeongJae Park (thanks!)

Changes for v2:

 - Rebased on linux-next commit 906fa30154ef ("mm/rmap: correct stale
   comment of rmap_walk_anon and rmap_walk_file") to fix a minor
   integration conflict with "arm64: support batched/deferred tlb
   shootdown during page reclamation/migration". This series will need
   to be applied after the conflicting patch.
 - Reordered the function rename until the end of the series as many
   places that were getting renamed ended up being removed anyway.
 - Fixed a couple of build issues which broke bisection.
 - Added a minor patch to fix up a stale/incorrect comment.

== Background ==

The arm64 architecture specifies TLB permission bits may be cached and
therefore the TLB must be invalidated during permission upgrades. For
the CPU this currently occurs in the architecture specific
ptep_set_access_flags() routine.

Secondary TLBs such as implemented by the SMMU IOMMU match the CPU
architecture specification and may also cache permission bits and
require the same TLB invalidations. This may be achieved in one of two
ways.

Some SMMU implementations implement broadcast TLB maintenance
(BTM). This snoops CPU TLB invalidates and will invalidate any
secondary TLB at the same time as the CPU. However implementations are
not required to implement BTM.

Implementations without BTM rely on mmu notifier callbacks to send
explicit TLB invalidation commands to invalidate the SMMU TLB.
Therefore either generic kernel code or architecture specific code
needs to call the mmu notifier on permission upgrade.

Currently that doesn't happen, so devices will fault indefinitely when
writing to a PTE that was previously read-only as nothing invalidates
the SMMU TLB.

== Solution ==

To fix this the series first renames the .invalidate_range() callback
to .arch_invalidate_secondary_tlbs() as suggested by Jason and Sean to
make it clear this callback is only used for secondary TLBs. That was
made possible thanks to Sean's series [1] to remove KVM's incorrect
usage.

Based on feedback from Jason [2] the proposed solution to the bug is to
move the calls to mmu_notifier_arch_invalidate_secondary_tlbs() closer
to the architecture specific TLB invalidation code. This ensures the
secondary TLB won't miss invalidations, including the existing
invalidation in the ARM64 code to deal with permission upgrade.

Currently only ARM64, PowerPC and x86 have IOMMUs with secondary TLBs
requiring SW invalidation, so the notifier is only called for those
architectures. It is also not called for invalidation of kernel
mappings as no secondary IOMMU implementations can access those and
hence it is not required.
[1] - https://lore.kernel.org/all/20230602011518.787006-1-sea...@google.com/
[2] - https://lore.kernel.org/linux-mm/zjmr5bw8l+bbz...@ziepe.ca/

Alistair Popple (5):
  arm64/smmu: Use TLBI ASID when invalidating entire range
  mmu_notifiers: Fixup comment in mmu_interval_read_begin()
  mmu_notifiers: Call invalidate_range() when invalidating TLBs
  mmu_notifiers: Don't invalidate secondary TLBs as part of
    mmu_notifier_invalidate_range_end()
  mmu_notifiers: Rename invalidate_range notifier

 arch/arm64/include/asm/tlbflush.h               |   5 +-
 arch/powerpc/include/asm/book3s/64/tlbflush.h   |   1 +-
 arch/powerpc/mm/book3s64/radix_hugetlbpage.c    |   1 +-
 arch/powerpc/mm/book3s64/radix_tlb.c            |   4 +-
 arch/x86/include/asm/tlbflush.h                 |   2 +-
 arch/x86/mm/tlb.c                               |   2 +-
 drivers/iommu/amd/iommu_v2.c                    |  10 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c |  29 +++--
 drivers/iommu/intel/svm.c                       |   8 +-
 drivers/misc/ocxl/link.c                        |   8 +-
 include/asm-generic/tlb.h                       |   1 +-
 include/linux/mmu_notifier.h                    | 104 +-----
 kernel/events/uprobes.c                         |   2 +-
 mm/huge_memory.c                                |  29 +----
 mm/hugetlb.c                                    |   8 +-
 mm/memory.
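[For driver authors following along, after patch 5 a secondary-TLB user
hooks in roughly as below. The ops field and registration call are from
the series; my_dev, my_dev_flush_tlb() and my_dev_bind_mm() are
hypothetical driver names. Such users must not also register the
invalidate_range_start()/end() callbacks:]

static void my_arch_invalidate_secondary_tlbs(struct mmu_notifier *mn,
					      struct mm_struct *mm,
					      unsigned long start,
					      unsigned long end)
{
	struct my_dev *dev = container_of(mn, struct my_dev, mn);

	/* Issue the device TLB invalidation for [start, end). After this
	 * series the callback is driven from the arch TLB flush paths,
	 * so it also covers permission upgrades such as read-only ->
	 * read-write. */
	my_dev_flush_tlb(dev, start, end);
}

static const struct mmu_notifier_ops my_mn_ops = {
	.arch_invalidate_secondary_tlbs = my_arch_invalidate_secondary_tlbs,
};

static int my_dev_bind_mm(struct my_dev *dev, struct mm_struct *mm)
{
	dev->mn.ops = &my_mn_ops;
	return mmu_notifier_register(&dev->mn, mm);
}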
[Intel-gfx] [PATCH v4 4/5] mmu_notifiers: Don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()
Secondary TLBs are now invalidated from the architecture specific TLB
invalidation functions. Therefore there is no need to explicitly notify
or invalidate as part of the range end functions. This means we can
remove mmu_notifier_invalidate_range_end_only() and some of the
ptep_*_notify() functions.

Signed-off-by: Alistair Popple
Reviewed-by: Jason Gunthorpe
---
 include/linux/mmu_notifier.h | 56 +
 kernel/events/uprobes.c      |  2 +-
 mm/huge_memory.c             | 25 ++---
 mm/hugetlb.c                 |  1 +-
 mm/memory.c                  |  8 +
 mm/migrate_device.c          |  9 +-
 mm/mmu_notifier.c            | 25 ++---
 mm/rmap.c                    | 40 +--
 8 files changed, 14 insertions(+), 152 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 64a3e05..f2e9edc 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -395,8 +395,7 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
 				      unsigned long address, pte_t pte);
 extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
-extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r,
-						bool only_end);
+extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r);
 extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
 					    unsigned long start, unsigned long end);
 extern bool
@@ -481,14 +480,7 @@ mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range)
 	might_sleep();
 
 	if (mm_has_notifiers(range->mm))
-		__mmu_notifier_invalidate_range_end(range, false);
-}
-
-static inline void
-mmu_notifier_invalidate_range_only_end(struct mmu_notifier_range *range)
-{
-	if (mm_has_notifiers(range->mm))
-		__mmu_notifier_invalidate_range_end(range, true);
+		__mmu_notifier_invalidate_range_end(range);
 }
 
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -582,45 +574,6 @@ static inline void mmu_notifier_range_init_owner(
 	__young;							\
 })
 
-#define	ptep_clear_flush_notify(__vma, __address, __ptep)		\
-({									\
-	unsigned long ___addr = __address & PAGE_MASK;			\
-	struct mm_struct *___mm = (__vma)->vm_mm;			\
-	pte_t ___pte;							\
-									\
-	___pte = ptep_clear_flush(__vma, __address, __ptep);		\
-	mmu_notifier_invalidate_range(___mm, ___addr,			\
-				      ___addr + PAGE_SIZE);		\
-									\
-	___pte;								\
-})
-
-#define pmdp_huge_clear_flush_notify(__vma, __haddr, __pmd)		\
-({									\
-	unsigned long ___haddr = __haddr & HPAGE_PMD_MASK;		\
-	struct mm_struct *___mm = (__vma)->vm_mm;			\
-	pmd_t ___pmd;							\
-									\
-	___pmd = pmdp_huge_clear_flush(__vma, __haddr, __pmd);		\
-	mmu_notifier_invalidate_range(___mm, ___haddr,			\
-				      ___haddr + HPAGE_PMD_SIZE);	\
-									\
-	___pmd;								\
-})
-
-#define pudp_huge_clear_flush_notify(__vma, __haddr, __pud)		\
-({									\
-	unsigned long ___haddr = __haddr & HPAGE_PUD_MASK;		\
-	struct mm_struct *___mm = (__vma)->vm_mm;			\
-	pud_t ___pud;							\
-									\
-	___pud = pudp_huge_clear_flush(__vma, __haddr, __pud);		\
-	mmu_notifier_invalidate_range(___mm, ___haddr,			\
-				      ___haddr + HPAGE_PUD_SIZE);	\
-									\
-	___pud;								\
-})
-
 /*
  * set_pte_at_notify() sets the pte _after_ running the notifier.
  * This is safe to start by updating the
[Intel-gfx] [PATCH v4 3/5] mmu_notifiers: Call invalidate_range() when invalidating TLBs
The invalidate_range() is going to become an architecture specific mmu
notifier used to keep the TLB of secondary MMUs such as an IOMMU in
sync with the CPU page tables. Currently it is called from separate
code paths to the main CPU TLB invalidations. This can lead to a
secondary TLB not getting invalidated when required and makes it hard
to reason about when exactly the secondary TLB is invalidated.

To fix this move the notifier call to the architecture specific TLB
maintenance functions for architectures that have secondary MMUs
requiring explicit software invalidations.

This fixes a SMMU bug on ARM64. On ARM64 PTE permission upgrades
require a TLB invalidation. This invalidation is done by the
architecture specific ptep_set_access_flags() which calls
flush_tlb_page() if required. However this doesn't call the notifier,
resulting in infinite faults being generated by devices using the SMMU
if it has previously cached a read-only PTE in its TLB.

Moving the invalidations into the TLB invalidation functions ensures
all invalidations happen at the same time as the CPU invalidation. The
architecture specific flush_tlb_all() routines do not call the notifier
as none of the IOMMUs require this.

Signed-off-by: Alistair Popple
Suggested-by: Jason Gunthorpe
Tested-by: SeongJae Park
Acked-by: Catalin Marinas
Reviewed-by: Jason Gunthorpe
---
 arch/arm64/include/asm/tlbflush.h             | 5 +
 arch/powerpc/include/asm/book3s/64/tlbflush.h | 1 +
 arch/powerpc/mm/book3s64/radix_hugetlbpage.c  | 1 +
 arch/powerpc/mm/book3s64/radix_tlb.c          | 4 ++++
 arch/x86/include/asm/tlbflush.h               | 2 ++
 arch/x86/mm/tlb.c                             | 2 ++
 include/asm-generic/tlb.h                     | 1 -
 7 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 3456866..a99349d 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -13,6 +13,7 @@
 #include <linux/bitfield.h>
 #include <linux/mm_types.h>
 #include <linux/sched.h>
+#include <linux/mmu_notifier.h>
 #include <asm/cputype.h>
 #include <asm/mmu.h>
 
@@ -252,6 +253,7 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
 	__tlbi(aside1is, asid);
 	__tlbi_user(aside1is, asid);
 	dsb(ish);
+	mmu_notifier_invalidate_range(mm, 0, -1UL);
 }
 
 static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
@@ -263,6 +265,8 @@ static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
 	addr = __TLBI_VADDR(uaddr, ASID(mm));
 	__tlbi(vale1is, addr);
 	__tlbi_user(vale1is, addr);
+	mmu_notifier_invalidate_range(mm, uaddr & PAGE_MASK,
+				      (uaddr & PAGE_MASK) + PAGE_SIZE);
 }
 
 static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
@@ -396,6 +400,7 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
 		scale++;
 	}
 	dsb(ish);
+	mmu_notifier_invalidate_range(vma->vm_mm, start, end);
 }
 
 static inline void flush_tlb_range(struct vm_area_struct *vma,
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index 0d0c144..dca0477 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -5,6 +5,7 @@
 #define MMU_NO_CONTEXT	~0UL
 
 #include <linux/mm_types.h>
+#include <linux/mmu_notifier.h>
 #include <asm/book3s/64/tlbflush-hash.h>
 #include <asm/book3s/64/tlbflush-radix.h>
 
diff --git a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
index 5e31955..f3fb49f 100644
--- a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
+++ b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
@@ -39,6 +39,7 @@ void radix__flush_hugetlb_tlb_range(struct vm_area_struct *vma, unsigned long st
 		radix__flush_tlb_pwc_range_psize(vma->vm_mm, start, end, psize);
 	else
 		radix__flush_tlb_range_psize(vma->vm_mm, start, end, psize);
+	mmu_notifier_invalidate_range(vma->vm_mm, start, end);
 }
 
 void radix__huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 0bd4866..4d44902 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -987,6 +987,7 @@ void radix__flush_tlb_mm(struct mm_struct *mm)
 		}
 	}
 	preempt_enable();
+	mmu_notifier_invalidate_range(mm, 0, -1UL);
 }
 EXPORT_SYMBOL(radix__flush_tlb_mm);
 
@@ -1020,6 +1021,7 @@ static void __flush_all_mm(struct mm_struct *mm, bool fullmm)
 		_tlbiel_pid_multicast(mm, pid, RIC_FLUSH_ALL);
 	}
 	preempt_enable();
+	mmu_notifier_invalidate_range(mm, 0, -1UL);
 }
 
 void radix__flush_all_mm(struct mm_struct *mm)
@@ -1228,6 +1230,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
 	}
 out:
 	preempt_enable();
+	mmu_notifier_invalidate_range(mm, start, end);
 }
 
 void radix__flush_tlb_ra