Re: [PATCH] mm: Remove double faults once write a device pfn

2024-01-24 Thread Alistair Popple


"Zhou, Xianrong"  writes:

> [AMD Official Use Only - General]
>
>> > The vmf_insert_pfn_prot could cause unnecessary double faults on a
>> > device pfn, because currently vmf_insert_pfn_prot does not make the
>> > pfn writable, so the pte entry is normally read-only, i.e. set up for
>> > dirty catching.
>>  What? How did you get to this conclusion?
>> >>> Sorry. I did not mention that this problem only exists on the arm64 platform.
>> >> Ok, that makes at least a little bit more sense.
>> >>
>> >>> Because on the arm64 platform PTE_RDONLY is automatically attached
>> >>> to userspace pte entries even with VM_WRITE + VM_SHARED. The
>> >>> PTE_RDONLY bit needs to be cleared in vmf_insert_pfn_prot. However
>> >>> vmf_insert_pfn_prot does not make the pte writable, since it passes
>> >>> false for @mkwrite to insert_pfn.
>> >> The question is why arm64 is doing this. As far as I can see they must
>> >> have some hardware reason for that.
>> >>
>> >> The mkwrite parameter to insert_pfn() was added by commit
>> >> b2770da642540 to make insert_pfn() look more like insert_pfn_pmd() so
>> >> that the DAX code can insert PTEs which are writable and dirty at the same
>> time.
>> >>
>> > This is one scenario where it makes sense to do so. In fact on arm64
>> > there are many scenarios where it does. So we can let
>> > vmf_insert_pfn_prot support @mkwrite for drivers at the core layer
>> > and let drivers decide whether or not to make the pte writable and
>> > dirty at the same time. That is what the patch does. Otherwise there
>> > are double faults on arm64 when calling vmf_insert_pfn_prot.
>>
>> Well, that doesn't answer my question why arm64 is double faulting in the
>> first place.
>>
>
>
> Eh.
>
> On arm64, when userspace calls mmap() with PROT_WRITE and MAP_SHARED, the
> vma->vm_page_prot has both PTE_RDONLY and PTE_WRITE set via
> PAGE_SHARED_EXEC (see the arm64 protection_map).
>
> When userspace writes to the virtual address the first fault happens and
> calls into the driver's
> .fault->ttm_bo_vm_fault_reserved->vmf_insert_pfn_prot->insert_pfn.
> insert_pfn establishes the pte entry. However vmf_insert_pfn_prot passes
> false for @mkwrite to insert_pfn by default, so insert_pfn cannot make the
> pfn writable and does not call maybe_mkwrite(pte_mkdirty(entry), vma) to
> clear the PTE_RDONLY bit. So the pte entry is actually write-protected as
> far as the mmu is concerned.
> So when the first fault returns and the store instruction is re-executed,
> a second fault happens. The second fault only does pte_mkdirty(entry),
> which clears PTE_RDONLY.

It depends on whether the ARM64 CPU in question supports hardware dirty bit
management (DBM). If that is the case and PTE_DBM (i.e. PTE_WRITE) is set,
HW will automatically clear the PTE_RDONLY bit to mark the entry dirty
instead of raising a write fault. So you shouldn't see a double fault if
PTE_DBM/WRITE is set.

On ARM64 you can kind of think of PTE_RDONLY as the HW dirty bit and
PTE_DBM as the read/write permission bit with SW being responsible for
updating PTE_RDONLY via the fault handler if DBM is not supported by HW.

At least that's my understanding from having hacked on this in the
past. You can see all this weirdness happening in the definitions of
pte_dirty() and pte_write() for ARM64.
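
For reference, those arm64 helpers look roughly like the sketch below. This
is a paraphrase of arch/arm64/include/asm/pgtable.h rather than a verbatim
quote, so treat it as illustrative: PTE_WRITE doubles as the DBM bit, and a
pte only counts as HW-dirty once it is writable and PTE_RDONLY has been
cleared.

/* Illustrative paraphrase of the arm64 definitions mentioned above. */
#define pte_write(pte)		(!!(pte_val(pte) & PTE_WRITE))	/* PTE_WRITE == PTE_DBM */
#define pte_hw_dirty(pte)	(pte_write(pte) && !(pte_val(pte) & PTE_RDONLY))
#define pte_sw_dirty(pte)	(!!(pte_val(pte) & PTE_DIRTY))
#define pte_dirty(pte)		(pte_sw_dirty(pte) || pte_hw_dirty(pte))

static inline pte_t pte_mkdirty(pte_t pte)
{
	pte = set_pte_bit(pte, __pgprot(PTE_DIRTY));

	/* A writable (DBM-capable) entry is marked dirty by clearing PTE_RDONLY */
	if (pte_write(pte))
		pte = clear_pte_bit(pte, __pgprot(PTE_RDONLY));

	return pte;
}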

> I think so, and I hope I am not wrong.
>
>> So as long as this isn't sorted out I'm going to reject this patch.
>>
>> Regards,
>> Christian.
>>
>> >
>> >> This is a completely different use case to what you try to use it
>> >> here for and that looks extremely fishy to me.
>> >>
>> >> Regards,
>> >> Christian.
>> >>
>> > The first fault only sets up the pte entry, which is actually still
>> > dirty catching (write protected). The second fault then happens
>> > immediately, caused by that write protection, when the cpu re-executes
>> > the store instruction.
>>  It could be that this is done to work around some hw behavior, but
>>  not because of dirty catching.
>> 
>> > Normally, if a driver calls vmf_insert_pfn_prot and also supplies a
>> > 'pfn_mkwrite' callback in its vm_operations_struct, which requires
>> > the pte to be dirty catching, then vmf_insert_pfn_prot and the
>> > double fault are reasonable. That is not a problem.
>>  Well, as far as I can see that behavior absolutely doesn't make sense.
>> 
>>  When pfn_mkwrite is requested then the driver should use PAGE_COPY,
>>  which is exactly what VMWGFX (the only driver using dirty tracking)
>>  is
>> >> doing.
>>  Everybody else uses PAGE_SHARED which should make the pte writeable
>>  immediately.
>> 
>>  Regards,
>>  Christian.
>> 
>> > However most drivers calling vmf_insert_pfn_prot do not supply a
>> > 'pfn_mkwrite' callback, so the second fault is unnecessary.
>> >
>> > So, just like the vmf_insert_mixed and vmf_insert_mixed_mkwrite pair,
>> > we should also supply vmf_insert_pfn_mkwrite for drivers.
>> >
>> > Signed-off-by: Xianrong Zhou 
>> > ---
>> > arch/x86/entry/vdso/vma.c  |  3 +
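
For illustration only, a hypothetical mkwrite variant in the spirit of the
quoted suggestion could look like the sketch below. This is not the posted
patch: the function name is made up, and insert_pfn() here refers to the
existing static helper in mm/memory.c.

/*
 * Hypothetical sketch -- not the posted patch. A mkwrite variant of
 * vmf_insert_pfn_prot(), mirroring the vmf_insert_mixed()/
 * vmf_insert_mixed_mkwrite() pairing, so the first write fault can install
 * a writable+dirty PTE in one go (clearing PTE_RDONLY on arm64).
 */
vm_fault_t vmf_insert_pfn_prot_mkwrite(struct vm_area_struct *vma,
				       unsigned long addr, unsigned long pfn,
				       pgprot_t pgprot)
{
	/*
	 * insert_pfn() is the existing static helper in mm/memory.c; passing
	 * mkwrite=true makes it run maybe_mkwrite(pte_mkdirty(entry), vma).
	 */
	return insert_pfn(vma, addr, __pfn_to_pfn_t(pfn, PFN_DEV), pgprot, true);
}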

Re: [Intel-gfx] [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices

2023-12-03 Thread Alistair Popple


Christian König  writes:

> Am 01.12.23 um 06:48 schrieb Zeng, Oak:
>> [SNIP]

>> Besides memory eviction/oversubscription, there are a few other pain points 
>> when I use hmm:
>>
>> 1) hmm doesn't support file-backed memory, so it is hard to share
> memory between processes in a gpu environment. You mentioned you have a
> plan... How hard is it to support file-backed memory in your approach?
>
> As hard as it is to support it through HMM. That's what I meant when I
> said this approach doesn't integrate well; as far as I know the problem
> isn't inside HMM or any other solution but rather in the file system
> layer.

In what way does HMM not support file-backed memory? I was under the
impression that at least hmm_range_fault() does.

 - Alistair
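
(For reference, a minimal driver-side hmm_range_fault() walk looks roughly
like the sketch below; 'notifier', 'mm', 'start', 'end' and 'pfns' are
assumed to be set up by the driver. Nothing in this interface distinguishes
anonymous from file-backed VMAs.)

	/* Minimal sketch of a driver-side hmm_range_fault() walk. */
	struct hmm_range range = {
		.notifier = &notifier,		/* driver's mmu_interval_notifier */
		.start = start,
		.end = end,
		.hmm_pfns = pfns,		/* one unsigned long per page */
		.default_flags = HMM_PFN_REQ_FAULT | HMM_PFN_REQ_WRITE,
	};
	int ret;

	range.notifier_seq = mmu_interval_read_begin(&notifier);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);	/* faults in anonymous and file-backed VMAs alike */
	mmap_read_unlock(mm);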

> Regards,
> Christian.
>
>> 2) virtual address range based memory attributes/hints: with hmadvise,
> where do you save the memory attribute of a virtual address range? Do
> you need to extend vm_area_struct to save it? With hmm, we have to
> maintain such information in the driver. This ends up with pretty
> complicated logic to split/merge those address ranges. I know core mm
> has similar logic to split/merge vmas...
>>
>> Oak
>>
>>
>>> -Weixi
>>>
>>> -Original Message-
>>> From: Christian König
>>> Sent: Thursday, November 30, 2023 4:28 PM
>>> To: Zeng, Oak; Christian König
>>> ; zhuweixi; linux-
>>> m...@kvack.org;linux-ker...@vger.kernel.org;a...@linux-foundation.org;
>>> Danilo Krummrich; Dave Airlie; Daniel
>>> Vetter
>>> Cc:intel-gvt-...@lists.freedesktop.org;rcampb...@nvidia.com;
>>> mhairgr...@nvidia.com;j...@nvidia.com;weixi@openeuler.sh;
>>> jhubb...@nvidia.com;intel-gfx@lists.freedesktop.org;apop...@nvidia.com;
>>> xinhui@amd.com;amd-...@lists.freedesktop.org;
>>> tvrtko.ursu...@linux.intel.com;ogab...@kernel.org;jgli...@redhat.com; dri-
>>> de...@lists.freedesktop.org;z...@nvidia.com; Vivi, Rodrigo
>>> ;alexander.deuc...@amd.com;leo...@nvidia.com;
>>> felix.kuehl...@amd.com; Wang, Zhi A;
>>> mgor...@suse.de
>>> Subject: Re: [RFC PATCH 0/6] Supporting GMEM (generalized memory
>>> management) for external memory devices
>>>
>>> Hi Oak,
>>>
>>> yeah, #4 is indeed a really good point and I think Felix will agree to that 
>>> as well.
>>>
>>> HMM is basically still missing a way to advise device attributes for the CPU
>>> address space. Both migration strategy as well as device specific 
>>> information (like
>>> cache preferences) fall into this category.
>>>
>>> Since there is a device specific component in those attributes as well I think
>>> device specific IOCTLs still make sense to update them, but HMM should offer
>>> the functionality to manage and store that information.
>>>
>>> Split and merge of VMAs only becomes a problem if you attach that information
>>> to VMAs; if you keep them completely separate then it doesn't become an
>>> issue either. The downside of this approach is that you don't get automatically
>>> extending attribute ranges for growing VMAs, for example.
>>>
>>> Regards,
>>> Christian.
>>>
>>> Am 29.11.23 um 23:23 schrieb Zeng, Oak:
 Hi Weixi,

 Even though Christian has listed reasons for rejecting this proposal (yes they 
 are
>>> very reasonable to me), I would open my mind and further explore the 
>>> possibility
>>> here. Since the current GPU drivers use an hmm based implementation (AMD and
>>> NV have done this; at Intel we are catching up), I want to explore how much 
>>> we
>>> can benefit from the proposed approach and how your approach can solve some
>>> pain points of our development. So basically what I am questioning here is: 
>>> what
>>> is the advantage of your approach against hmm.
 To implement a UVM (unified virtual address space b/t cpu and gpu device),
>>> with hmm, drivers essentially need to implement the functions below:
 1. device page table update. Your approach requires the same because
 this is device specific code

 2. Some migration functions to migrate memory b/t system memory and GPU
>>> local memory. My understanding is, even though you generalized this a bit, 
>>> such
>>> as modified cpu page fault path, provided "general" gm_dev_fault handler... 
>>> but
>>> device drivers still need to provide migration functions because migration
>>> functions have to be device specific (i.e., using device dma/copy engine for
>>> performance purpose). Right?
 3. GPU physical memory management, this part is now in drm/buddy, shared
>>> by all drivers. I think with your approach, drivers still need to provide 
>>> callback
>>> functions to allocate/free physical pages. Right? Or do you let linux core 
>>> mm
>>> buddy manage device memory directly?
 4. madvise/hints/virtual address range management. This has been a pain point
>>> for us. Right now device driver has to maintain certain virtual address 
>>> range data
>>> structure to maintain hints and other virtual address range based memory
>>> attributes. Drivers need to sync with linux vmas. Drivers need to explicitly 
>

Re: [Intel-gfx] [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices

2023-11-30 Thread Alistair Popple


"Zeng, Oak"  writes:

> See inline comments
>
>> -Original Message-
>> From: dri-devel  On Behalf Of
>> zhuweixi
>> Sent: Thursday, November 30, 2023 5:48 AM
>> To: Christian König ; Zeng, Oak
>> ; Christian König ; linux-
>> m...@kvack.org; linux-ker...@vger.kernel.org; a...@linux-foundation.org;
>> Danilo Krummrich ; Dave Airlie ; Daniel
>> Vetter 
>> Cc: tvrtko.ursu...@linux.intel.com; rcampb...@nvidia.com; apop...@nvidia.com;
>> z...@nvidia.com; weixi@openeuler.sh; jhubb...@nvidia.com; intel-
>> g...@lists.freedesktop.org; mhairgr...@nvidia.com; Wang, Zhi A
>> ; xinhui@amd.com; amd-...@lists.freedesktop.org;
>> jgli...@redhat.com; dri-de...@lists.freedesktop.org; j...@nvidia.com; Vivi,
>> Rodrigo ; alexander.deuc...@amd.com;
>> felix.kuehl...@amd.com; intel-gvt-...@lists.freedesktop.org;
>> ogab...@kernel.org; leo...@nvidia.com; mgor...@suse.de
>> Subject: RE: [RFC PATCH 0/6] Supporting GMEM (generalized memory
>> management) for external memory devices
>> 
>> Glad to know that there is a common demand for a new syscall like 
>> hmadvise(). I
>> expect it would also be useful for homogeneous NUMA cases. Credits to
>> cudaMemAdvise() API which brought this idea to GMEM's design.
>> 
>> To answer @Oak's questions about GMEM vs. HMM,
>> 
>> Here is the major difference:
>>   GMEM's main target is to stop drivers from reinventing MM code, while
>> HMM/MMU notifiers provide a compatible struct page solution and a
>> coordination mechanism for existing device driver MMs that requires adding
>> extra code to interact with CPU MM.
>> 
>> A straightforward qualitative result for the main target: after integrating 
>> Huawei's
>> Ascend NPU driver with GMEM's interface, 30,000 lines of MM code were cut,
>> leaving <100 lines invoking GMEM interface and 3,700 lines implementing 
>> vendor-
>> specific functions. Some code from the 3,700 lines should be further moved to
>> GMEM as a generalized feature like device memory oversubscription, but not
>> included in this RFC patch yet.
>> 
>> A list of high-level differences:
>>   1. With HMM/MMU notifiers, drivers need to first implement a full MM
>> subsystem. With GMEM, drivers can reuse Linux's core MM.
>
> A full mm subsystem essentially has the following functions:
>
> Physical memory management: neither your approach nor the hmm-based
> solution provides device physical memory management. You mentioned you
> have a plan, but at least for now the driver needs to manage device
> physical memory.
>
> Virtual address space management: both approaches leverage linux core mm
> vmas for this.
>
> Data eviction, migration: with hmm, the driver needs to implement this. It
> is not clear whether gmem has this function. I guess even if gmem has it,
> it might be a slow cpu data copy, compared to a modern gpu's fast data
> copy engine.
>
> Device page table update, va-pa mapping: I think it is the driver's
> responsibility in both approaches.
>
> So from the point of view of re-using core MM, I don't see a big
> difference. Maybe you did it more elegantly. I think it is very possible
> that with your approach the driver can be simpler, with less code.
>
>> 
>>   2. HMM encodes device mapping information in the CPU arch-dependent PTEs,
>> while GMEM proposes an abstraction layer in vm_object. Since GMEM's
>> approach further decouples the arch-related stuff, drivers do not need to
>> implement separate code for X86/ARM and etc.
>
> I don't understand this... with hmm, when a virtual address range's
> backing store is in device memory, the cpu pte is encoded to point to
> device memory. The device page table is also encoded to point to the same
> device memory location. But since device memory is not accessible to the
> CPU (DEVICE_PRIVATE), when the cpu accesses this virtual address there
> is a cpu page fault. Device mapping info is still in the device page
> table, not in cpu ptes.
>
> I do not see why with hmm the driver needs to implement x86/arm
> code... the driver only takes care of the device page table. Hmm/core mm
> takes care of the cpu page table, right?

I see our replies have crossed, but that is my understanding as well.

>> 
>>   3. MMU notifiers register hooks at certain core MM events, while GMEM
>> declares basic functions and internally invokes them. GMEM requires less from
>> the driver side -- no need to understand what core MM behaves at certain MMU
>> events. GMEM also expects fewer bugs than MMU notifiers: implementing basic
>> operations with standard declarations vs. implementing whatever random device
>> MM logic in MMU notifiers.
>
> This seems true to me. I feel the mmu notifier thing, especially the
> synchronization/lock design (those sequence numbers, interacting with the
> driver lock, and the mmap lock), is very complicated. I indeed spent
> time understanding the specification documented in hmm.rst...

No argument there, but I think that's something we could look at
providing an improved interface for. I don't think it needs a whole new
subsystem to fix. Probably just a version of hmm_range_fault() that
takes the lock and s
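
(For reference, the locking dance being discussed is roughly the retry
pattern documented in Documentation/mm/hmm.rst, sketched below;
'driver_lock', 'notifier', 'mm' and 'range' are driver-side names assumed
for this example.)

	/* Sketch of the hmm.rst sequence-number/retry pattern. */
again:
	range.notifier_seq = mmu_interval_read_begin(&notifier);
	mmap_read_lock(mm);
	ret = hmm_range_fault(&range);
	mmap_read_unlock(mm);
	if (ret) {
		if (ret == -EBUSY)
			goto again;
		return ret;
	}

	mutex_lock(&driver_lock);
	if (mmu_interval_read_retry(&notifier, range.notifier_seq)) {
		mutex_unlock(&driver_lock);
		goto again;		/* raced with an invalidation, start over */
	}
	/* ... program the device page tables from range.hmm_pfns ... */
	mutex_unlock(&driver_lock);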

Re: [Intel-gfx] [RFC PATCH 0/6] Supporting GMEM (generalized memory management) for external memory devices

2023-11-30 Thread Alistair Popple


zhuweixi  writes:

> Glad to know that there is a common demand for a new syscall like
> hmadvise(). I expect it would also be useful for homogeneous NUMA
> cases. Credits to cudaMemAdvise() API which brought this idea to
> GMEM's design.

It's not clear to me that this would need to be a new syscall. Scanning
the patches it looks like you're adding a NUMA node anyway, so the
existing interfaces (eg. madvise) with their various options
(MPOL_PREFERRED/PREFERRED_MANY) and
set_mempolicy/set_mempolicy_home_node() could potentially cover this for
both NUMA and hNUMA nodes. The main advantage here would be providing a
common userspace interface for setting these kind of hints.
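
(A hedged userspace sketch of what that could look like with the existing
interfaces, assuming the device memory shows up as NUMA node 'dev_node';
'addr' and 'len' are placeholders, and set_mempolicy_home_node() is invoked
via syscall(2) since it may lack a libc wrapper.)

/* Userspace sketch only: steer a range towards a (h)NUMA node with the
 * existing interfaces rather than a new hmadvise(). */
#include <numaif.h>
#include <sys/syscall.h>
#include <unistd.h>

static void prefer_device_node(void *addr, size_t len, int dev_node)
{
	unsigned long nodemask = 1UL << dev_node;

	/* Prefer placing (and migrating) this range on the device node */
	mbind(addr, len, MPOL_PREFERRED, &nodemask, sizeof(nodemask) * 8,
	      MPOL_MF_MOVE);

	/* Record the preferred "home node" for future policy decisions
	 * (needs a libc exposing __NR_set_mempolicy_home_node) */
	syscall(SYS_set_mempolicy_home_node, (unsigned long)addr, len,
		(unsigned long)dev_node, 0);
}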

> To answer @Oak's questions about GMEM vs. HMM,
>
> Here is the major difference:
>   GMEM's main target is to stop drivers from reinventing MM code,
> while HMM/MMU notifiers provide a compatible struct page solution and
> a coordination mechanism for existing device driver MMs that requires
> adding extra code to interact with CPU MM.
>
> A straightforward qualitative result for the main target: after
> integrating Huawei's Ascend NPU driver with GMEM's interface, 30,000
> lines of MM code were cut, leaving <100 lines invoking GMEM interface
> and 3,700 lines implementing vendor-specific functions. Some code from
> the 3,700 lines should be further moved to GMEM as a generalized
> feature like device memory oversubscription, but not included in this
> RFC patch yet.

I think it would be helpful if you could be a bit more specific on what
functionality the current HMM/migrate_vma/MMU notifier interfaces are
missing that every driver has to implement in a common way. Because I'm
not convinced we can't either improve those interfaces to provide what's
needed or add specific components (eg. a physical page allocator)
instead of a whole new framework.

> A list of high-level differences: 
>   1. With HMM/MMU notifiers, drivers need to first implement a full MM 
> subsystem. With GMEM, drivers can reuse Linux's core MM.

What do the common bits of this full MM subsystem look like?
Fundamentally the existing HMM functionality can already make use of
Linux core MM to manage page tables and migrate pages and everything
else seems pretty device specific (ie. actual copying of data,
programming of MMUs, TLBs, etc.)

I can see that there would be scope to have say a generic memory
allocator, which I vaguely recall discussing in relation to
DEVICE_PRIVATE pages in the past, but @Oak suggests something close
already exists (drm/buddy).

Potentially I suppose there is VA allocation that might be common across
devices. However I have not had any experience working with devices with
VA requirements different enough from the CPU to matter. If they are so
different I'm not convinced it would be easy to have a common
implementation anyway.

>   2. HMM encodes device mapping information in the CPU arch-dependent
> PTEs, while GMEM proposes an abstraction layer in vm_object. Since
> GMEM's approach further decouples the arch-related stuff, drivers do
> not need to implement separate code for X86/ARM and etc.

I'm not following this. At present all HMM encodes in CPU PTEs is the
fact a page has been migrated to the device and what permissions it
has. I'm not aware of needing to treat X86 and ARM differently for
example here. Are you saying you want somewhere to store other bits
attached to a particular VA?
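
(For reference, what gets encoded in the CPU PTE for a migrated page is a
device-private swap entry; a rough sketch using the helpers from
include/linux/swapops.h, with 'pte' assumed to come from a page-table walk.)

	/* Rough sketch: the arch-independent view of a migrated-to-device PTE. */
	if (is_swap_pte(pte)) {
		swp_entry_t entry = pte_to_swp_entry(pte);

		if (is_device_private_entry(entry)) {
			/* The ZONE_DEVICE page backing the device copy */
			struct page *page = pfn_swap_entry_to_page(entry);
			/* Whether the device mapping was allowed to write */
			bool writable = is_writable_device_private_entry(entry);
		}
	}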

>   3. MMU notifiers register hooks at certain core MM events, while
> GMEM declares basic functions and internally invokes them. GMEM
> requires less from the driver side -- no need to understand how core
> MM behaves at certain MMU events. GMEM also expects fewer bugs than
> MMU notifiers: implementing basic operations with standard
> declarations vs. implementing whatever random device MM logic in MMU
> notifiers.

How is this proposal any different though? From what I can see it
replaces MMU notifier callbacks with TLB invalidation callbacks, but
that is essentially what MMU notifier callbacks are anyway. The "random
device MM logic" should just be clearing device TLBs. What other MM
logic has to be implemented in the MMU notifier callbacks that is the
same between devices?
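
(To make that concrete, a minimal interval-notifier callback usually boils
down to something like the sketch below; 'struct my_dev_range', its lock
and my_dev_invalidate_tlb() are made-up driver-side names.)

/* Minimal sketch of the "random device MM logic" in question: bump the
 * sequence count and zap the device TLB. */
static bool my_range_invalidate(struct mmu_interval_notifier *mni,
				const struct mmu_notifier_range *range,
				unsigned long cur_seq)
{
	struct my_dev_range *dr = container_of(mni, struct my_dev_range, notifier);

	if (!mmu_notifier_range_blockable(range))
		return false;

	mutex_lock(&dr->lock);
	mmu_interval_set_seq(mni, cur_seq);
	my_dev_invalidate_tlb(dr->dev, range->start, range->end);
	mutex_unlock(&dr->lock);

	return true;
}

static const struct mmu_interval_notifier_ops my_range_ops = {
	.invalidate = my_range_invalidate,
};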

>   4. GMEM plans to support a more lightweight physical memory
> management. The discussion about this part can be found in my cover
> letter. The question is whether struct page should be compatible
> (directly use HMM's ZONE_DEVICE solution) or a trimmed, smaller struct
> page that satisfies generalized demands from accelerators is more
> preferable?

What is wrong with the current ZONE_DEVICE solution? You mention size of
struct page, but that is already being worked on through the conversion
to folios. Admittedly higher order HMM ZONE_DEVICE folios are not
currently supported, but that is something I'm actively working on at
the moment.

>   5. GMEM has been demonstrated to allow device memory
> oversubscription (a GMEM-based 32GB NPU card can run a GPT model
> oversubscribin

[Intel-gfx] [PATCH v4 1/5] arm64/smmu: Use TLBI ASID when invalidating entire range

2023-07-26 Thread Alistair Popple
The ARM SMMU has a specific command for invalidating the TLB for an
entire ASID. Currently this is used for the IO_PGTABLE API but not for
ATS when called from the MMU notifier.

The current implementation of notifiers does not attempt to invalidate
such a large address range, instead walking each VMA and invalidating
each range individually during mmap removal. However in future SMMU
TLB invalidations are going to be sent as part of the normal
flush_tlb_*() kernel calls. To better deal with that add handling to
use TLBI ASID when invalidating the entire address space.

Signed-off-by: Alistair Popple 
Reviewed-by: Jason Gunthorpe 
---
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 16 +---
 1 file changed, 13 insertions(+), 3 deletions(-)

diff --git a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
index a5a63b1..2a19784 100644
--- a/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
+++ b/drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c
@@ -200,10 +200,20 @@ static void arm_smmu_mm_invalidate_range(struct mmu_notifier *mn,
 * range. So do a simple translation here by calculating size correctly.
 */
size = end - start;
+   if (size == ULONG_MAX)
+   size = 0;
+
+   if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_BTM)) {
+   if (!size)
+   arm_smmu_tlb_inv_asid(smmu_domain->smmu,
+ smmu_mn->cd->asid);
+   else
+   arm_smmu_tlb_inv_range_asid(start, size,
+   smmu_mn->cd->asid,
+   PAGE_SIZE, false,
+   smmu_domain);
+   }
 
-   if (!(smmu_domain->smmu->features & ARM_SMMU_FEAT_BTM))
-   arm_smmu_tlb_inv_range_asid(start, size, smmu_mn->cd->asid,
-   PAGE_SIZE, false, smmu_domain);
arm_smmu_atc_inv_domain(smmu_domain, mm->pasid, start, size);
 }
 
-- 
git-series 0.9.1


[Intel-gfx] [PATCH v4 2/5] mmu_notifiers: Fixup comment in mmu_interval_read_begin()

2023-07-26 Thread Alistair Popple
The comment in mmu_interval_read_begin() refers to a function that
doesn't exist and uses the wrong call-back name. The op for mmu
interval notifiers is mmu_interval_notifier_ops->invalidate() so fix
the comment up to reflect that.

Signed-off-by: Alistair Popple 
Reviewed-by: Jason Gunthorpe 
---
 mm/mmu_notifier.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 50c0dde..b7ad155 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -199,7 +199,7 @@ mmu_interval_read_begin(struct mmu_interval_notifier *interval_sub)
 * invalidate_start/end and is colliding.
 *
 * The locking looks broadly like this:
-*   mn_tree_invalidate_start():  mmu_interval_read_begin():
+*   mn_itree_inv_start(): mmu_interval_read_begin():
 * spin_lock
 *  seq = READ_ONCE(interval_sub->invalidate_seq);
 *  seq == subs->invalidate_seq
@@ -207,7 +207,7 @@ mmu_interval_read_begin(struct mmu_interval_notifier *interval_sub)
 *spin_lock
 * seq = ++subscriptions->invalidate_seq
 *spin_unlock
-* op->invalidate_range():
+* op->invalidate():
 *   user_lock
 *mmu_interval_set_seq()
 * interval_sub->invalidate_seq = seq
-- 
git-series 0.9.1


[Intel-gfx] [PATCH v4 5/5] mmu_notifiers: Rename invalidate_range notifier

2023-07-26 Thread Alistair Popple
There are two main use cases for mmu notifiers. One is by KVM which
uses mmu_notifier_invalidate_range_start()/end() to manage a software
TLB.

The other is to manage hardware TLBs which need to use the
invalidate_range() callback because HW can establish new TLB entries
at any time. Hence using start/end() can lead to memory corruption as
these callbacks happen too soon/late during page unmap.

mmu notifier users should therefore either use the start()/end()
callbacks or the invalidate_range() callbacks. To make this usage
clearer rename the invalidate_range() callback to
arch_invalidate_secondary_tlbs() and update the documentation.

Signed-off-by: Alistair Popple 
Suggested-by: Jason Gunthorpe 
Acked-by: Catalin Marinas 
Reviewed-by: Jason Gunthorpe 
---
 arch/arm64/include/asm/tlbflush.h   |  6 +-
 arch/powerpc/mm/book3s64/radix_hugetlbpage.c|  2 +-
 arch/powerpc/mm/book3s64/radix_tlb.c|  8 +--
 arch/x86/include/asm/tlbflush.h |  2 +-
 arch/x86/mm/tlb.c   |  2 +-
 drivers/iommu/amd/iommu_v2.c| 10 ++--
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c | 13 ++---
 drivers/iommu/intel/svm.c   |  8 +--
 drivers/misc/ocxl/link.c|  8 +--
 include/linux/mmu_notifier.h| 48 +-
 mm/huge_memory.c|  4 +-
 mm/hugetlb.c|  7 +--
 mm/mmu_notifier.c   | 21 ++--
 13 files changed, 76 insertions(+), 63 deletions(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index a99349d..84a05a0 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -253,7 +253,7 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
__tlbi(aside1is, asid);
__tlbi_user(aside1is, asid);
dsb(ish);
-   mmu_notifier_invalidate_range(mm, 0, -1UL);
+   mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
 }
 
 static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
@@ -265,7 +265,7 @@ static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
addr = __TLBI_VADDR(uaddr, ASID(mm));
__tlbi(vale1is, addr);
__tlbi_user(vale1is, addr);
-   mmu_notifier_invalidate_range(mm, uaddr & PAGE_MASK,
+   mmu_notifier_arch_invalidate_secondary_tlbs(mm, uaddr & PAGE_MASK,
(uaddr & PAGE_MASK) + PAGE_SIZE);
 }
 
@@ -400,7 +400,7 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
scale++;
}
dsb(ish);
-   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
+   mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
 }
 
 static inline void flush_tlb_range(struct vm_area_struct *vma,
diff --git a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
index f3fb49f..17075c7 100644
--- a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
+++ b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
@@ -39,7 +39,7 @@ void radix__flush_hugetlb_tlb_range(struct vm_area_struct *vma, unsigned long st
radix__flush_tlb_pwc_range_psize(vma->vm_mm, start, end, psize);
else
radix__flush_tlb_range_psize(vma->vm_mm, start, end, psize);
-   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
+   mmu_notifier_arch_invalidate_secondary_tlbs(vma->vm_mm, start, end);
 }
 
 void radix__huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 4d44902..06e647e 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -987,7 +987,7 @@ void radix__flush_tlb_mm(struct mm_struct *mm)
}
}
preempt_enable();
-   mmu_notifier_invalidate_range(mm, 0, -1UL);
+   mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
 }
 EXPORT_SYMBOL(radix__flush_tlb_mm);
 
@@ -1021,7 +1021,7 @@ static void __flush_all_mm(struct mm_struct *mm, bool fullmm)
_tlbiel_pid_multicast(mm, pid, RIC_FLUSH_ALL);
}
preempt_enable();
-   mmu_notifier_invalidate_range(mm, 0, -1UL);
+   mmu_notifier_arch_invalidate_secondary_tlbs(mm, 0, -1UL);
 }
 
 void radix__flush_all_mm(struct mm_struct *mm)
@@ -1230,7 +1230,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
}
 out:
preempt_enable();
-   mmu_notifier_invalidate_range(mm, start, end);
+   mmu_notifier_arch_invalidate_secondary_tlbs(mm, start, end);
 }
 
 void radix__flush_tlb_range(struct vm_area_struct *vma, unsigned long start,
@@ -1395,7 +1395,7 @@ static void __radix__flush_tlb_range_psize(struct mm_struct *mm,
}
 out:
preempt_enable();
-   mmu_notifier_i

Re: [Intel-gfx] Regression in linux-next

2023-07-26 Thread Alistair Popple


Thanks Chaitanya for the detailed report. Dan Carpenter also reported a
Smatch warning for this:

https://lore.kernel.org/linux-mm/38ed0627-1283-4da2-827a-e90484d7bd7d@moroto.mountain/

The below should fix the problem, will respin the series to include the
fix.

---

diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
index 63c8eb740af7..ec3b068cbbe6 100644
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -621,9 +621,10 @@ int __mmu_notifier_register(struct mmu_notifier *subscription,
 * Subsystems should only register for invalidate_secondary_tlbs() or
 * invalidate_range_start()/end() callbacks, not both.
 */
-   if (WARN_ON_ONCE(subscription->ops->arch_invalidate_secondary_tlbs &&
-   (subscription->ops->invalidate_range_start ||
-   subscription->ops->invalidate_range_end)))
+   if (WARN_ON_ONCE(subscription &&
+(subscription->ops->arch_invalidate_secondary_tlbs &&
+(subscription->ops->invalidate_range_start ||
+ subscription->ops->invalidate_range_end))))
return -EINVAL;
 
if (!mm->notifier_subscriptions) {


"Borah, Chaitanya Kumar"  writes:

> Hello Alistair,
>
> Hope you are doing well. I am Chaitanya from the linux graphics team at Intel.
>  
> This mail is regarding a regression we are seeing in our CI runs[1] on the
> linux-next repository.
>  
> On next-20230720 [2], we are seeing the following error
>
> <4>[   76.189375] Hardware name: Intel Corporation Meteor Lake Client 
> Platform/MTL-P DDR5 SODIMM SBS RVP, BIOS MTLPFWI1.R00.3271.D81.2307101805 
> 07/10/2023
> <4>[   76.202534] RIP: 0010:__mmu_notifier_register+0x40/0x210
> <4>[ 76.207804] Code: 1a 71 5a 01 85 c0 0f 85 ec 00 00 00 48 8b 85 30
> 01 00 00 48 85 c0 0f 84 04 01 00 00 8b 85 cc 00 00 00 85 c0 0f 8e bb
> 01 00 00 <49> 8b 44 24 10 48 83 78 38 00 74 1a 48 83 78 28 00 74 0c 0f
> 0b b8
> <4>[   76.226368] RSP: 0018:c900019d7ca8 EFLAGS: 00010202
> <4>[   76.231549] RAX: 0001 RBX: 1000 RCX: 
> 0001
> <4>[   76.238613] RDX:  RSI: 823ceb7b RDI: 
> 823ee12d
> <4>[   76.245680] RBP: 888102ec9b40 R08:  R09: 
> 0001
> <4>[   76.252747] R10: 0001 R11: 8881157cd2c0 R12: 
> 
> <4>[   76.259811] R13: 888102ec9c70 R14: a07de500 R15: 
> 888102ec9ce0
> <4>[   76.266875] FS:  7fbcabe11c00() GS:88846ec0() 
> knlGS:
> <4>[   76.274884] CS:  0010 DS:  ES:  CR0: 80050033
> <4>[   76.280578] CR2: 0010 CR3: 00010d4c2005 CR4: 
> 00f70ee0
> <4>[   76.287643] DR0:  DR1:  DR2: 
> 
> <4>[   76.294711] DR3:  DR6: 07f0 DR7: 
> 0400
> <4>[   76.301775] PKRU: 5554
> <4>[   76.304463] Call Trace:
> <4>[   76.306893]  
> <4>[   76.308983]  ? __die_body+0x1a/0x60
> <4>[   76.312444]  ? page_fault_oops+0x156/0x450
> <4>[   76.316510]  ? do_user_addr_fault+0x65/0x980
> <4>[   76.320747]  ? exc_page_fault+0x68/0x1a0
> <4>[   76.324643]  ? asm_exc_page_fault+0x26/0x30
> <4>[   76.328796]  ? __mmu_notifier_register+0x40/0x210
> <4>[   76.333460]  ? __mmu_notifier_register+0x11c/0x210
> <4>[   76.338206]  ? preempt_count_add+0x4c/0xa0
> <4>[   76.342273]  mmu_notifier_register+0x30/0xe0
> <4>[   76.346509]  mmu_interval_notifier_insert+0x74/0xb0
> <4>[   76.351344]  i915_gem_userptr_ioctl+0x21a/0x320 [i915]
> <4>[   76.356565]  ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
> <4>[   76.362271]  drm_ioctl_kernel+0xb4/0x150
> <4>[   76.366159]  drm_ioctl+0x21d/0x420
> <4>[   76.369537]  ? __pfx_i915_gem_userptr_ioctl+0x10/0x10 [i915]
> <4>[   76.375242]  ? find_held_lock+0x2b/0x80
> <4>[   76.379046]  __x64_sys_ioctl+0x79/0xb0
> <4>[   76.382766]  do_syscall_64+0x3c/0x90
> <4>[   76.386312]  entry_SYSCALL_64_after_hwframe+0x6e/0xd8
> <4>[   76.391317] RIP: 0033:0x7fbcae63f3ab
>
> Detailed logs can be found in [3].
>
> After bisecting the tree, the following patch seems to be causing the
> regression.
>
> commit 828fe4085cae77acb3abf7dd3d25b3ed6c560edf
> Author: Alistair Popple apop...@nvidia.com
> Date:   Wed Jul 19 22:18:46 2023 +1000
>
> mmu_notifiers: rename invalidate_range notifier
>
> There are two main use cases for mmu notifiers.  One is by KV

[Intel-gfx] [PATCH v4 0/5] Invalidate secondary IOMMU TLB on permission upgrade

2023-07-26 Thread Alistair Popple
The main change is to move secondary TLB invalidation mmu notifier
callbacks into the architecture specific TLB flushing functions. This
makes secondary TLB invalidation mostly match CPU invalidation while
still allowing efficient range based invalidations based on the
existing TLB batching code.

Changes for v4:

 - Fixed a NULL pointer dereference when registering the first
   notifier with mmu_interval_notifier_insert() instead of
   mmu_notifier_register() - Thanks Dan and Chaitanya for the bug
   reports.

 - Collected Acked/Reviewed tags.

 - Don't call the notifier from radix__local_flush_tlb_page() on
   PowerPC.

Changes for v3:

 - On x86 call the invalidation when adding pending TLB invalidates
   rather than when flushing the batch. This is because the mm is
   required. It also matches what happens on ARM64. Fixes a bug
   reported by SeongJae Park (thanks!)

Changes for v2:

 - Rebased on linux-next commit 906fa30154ef ("mm/rmap: correct stale
   comment of rmap_walk_anon and rmap_walk_file") to fix a minor
   integration conflict with "arm64: support batched/deferred tlb
   shootdown during page reclamation/migration". This series will need
   to be applied after the conflicting patch.

 - Reordered the function rename until the end of the series as many
   places that were getting renamed ended up being removed anyway.

 - Fixed a couple of build issues which broke bisection.

 - Added a minor patch to fix up a stale/incorrect comment.

==
Background
==

The arm64 architecture specifies TLB permission bits may be cached and
therefore the TLB must be invalidated during permission upgrades. For
the CPU this currently occurs in the architecture specific
ptep_set_access_flags() routine.

Secondary TLBs such as implemented by the SMMU IOMMU match the CPU
architecture specification and may also cache permission bits and
require the same TLB invalidations. This may be achieved in one of two
ways.

Some SMMU implementations implement broadcast TLB maintenance
(BTM). This snoops CPU TLB invalidates and will invalidate any
secondary TLB at the same time as the CPU. However implementations are
not required to implement BTM.

Implementations without BTM rely on mmu notifier callbacks to send
explicit TLB invalidation commands to invalidate SMMU TLB. Therefore
either generic kernel code or architecture specific code needs to call
the mmu notifier on permission upgrade.

Currently that doesn't happen so devices will fault indefinitely when
writing to a PTE that was previously read-only as nothing invalidates
the SMMU TLB.


Solution


To fix this the series first renames the .invalidate_range() callback
to .arch_invalidate_secondary_tlbs() as suggested by Jason and Sean to
make it clear this callback is only used for secondary TLBs. That was
made possible thanks to Sean's series [1] to remove KVM's incorrect
usage.

Based on feedback from Jason [2] the proposed solution to the bug is
to move the calls to mmu_notifier_arch_invalidate_secondary_tlbs()
closer to the architecture specific TLB invalidation code. This
ensures the secondary TLB won't miss invalidations, including the
existing invalidation in the ARM64 code to deal with permission
upgrade.

Currently only ARM64, PowerPC and x86 have IOMMU with secondary TLBs
requiring SW invalidation so the notifier is only called for those
architectures. It is also not called for invalidation of kernel
mappings as no secondary IOMMU implementations can access those and
hence it is not required.

[1] - https://lore.kernel.org/all/20230602011518.787006-1-sea...@google.com/
[2] - https://lore.kernel.org/linux-mm/zjmr5bw8l+bbz...@ziepe.ca/

Alistair Popple (5):
  arm64/smmu: Use TLBI ASID when invalidating entire range
  mmu_notifiers: Fixup comment in mmu_interval_read_begin()
  mmu_notifiers: Call invalidate_range() when invalidating TLBs
  mmu_notifiers: Don't invalidate secondary TLBs as part of 
mmu_notifier_invalidate_range_end()
  mmu_notifiers: Rename invalidate_range notifier

 arch/arm64/include/asm/tlbflush.h   |   5 +-
 arch/powerpc/include/asm/book3s/64/tlbflush.h   |   1 +-
 arch/powerpc/mm/book3s64/radix_hugetlbpage.c|   1 +-
 arch/powerpc/mm/book3s64/radix_tlb.c|   4 +-
 arch/x86/include/asm/tlbflush.h |   2 +-
 arch/x86/mm/tlb.c   |   2 +-
 drivers/iommu/amd/iommu_v2.c|  10 +-
 drivers/iommu/arm/arm-smmu-v3/arm-smmu-v3-sva.c |  29 +++--
 drivers/iommu/intel/svm.c   |   8 +-
 drivers/misc/ocxl/link.c|   8 +-
 include/asm-generic/tlb.h   |   1 +-
 include/linux/mmu_notifier.h| 104 -
 kernel/events/uprobes.c |   2 +-
 mm/huge_memory.c|  29 +
 mm/hugetlb.c|   8 +-
 mm/memory.

[Intel-gfx] [PATCH v4 4/5] mmu_notifiers: Don't invalidate secondary TLBs as part of mmu_notifier_invalidate_range_end()

2023-07-26 Thread Alistair Popple
Secondary TLBs are now invalidated from the architecture specific TLB
invalidation functions. Therefore there is no need to explicitly
notify or invalidate as part of the range end functions. This means we
can remove mmu_notifier_invalidate_range_only_end() and some of the
ptep_*_notify() functions.

Signed-off-by: Alistair Popple 
Reviewed-by: Jason Gunthorpe 
---
 include/linux/mmu_notifier.h | 56 +
 kernel/events/uprobes.c  |  2 +-
 mm/huge_memory.c | 25 ++---
 mm/hugetlb.c |  1 +-
 mm/memory.c  |  8 +
 mm/migrate_device.c  |  9 +-
 mm/mmu_notifier.c| 25 ++---
 mm/rmap.c| 40 +--
 8 files changed, 14 insertions(+), 152 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index 64a3e05..f2e9edc 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -395,8 +395,7 @@ extern int __mmu_notifier_test_young(struct mm_struct *mm,
 extern void __mmu_notifier_change_pte(struct mm_struct *mm,
  unsigned long address, pte_t pte);
 extern int __mmu_notifier_invalidate_range_start(struct mmu_notifier_range *r);
-extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r,
- bool only_end);
+extern void __mmu_notifier_invalidate_range_end(struct mmu_notifier_range *r);
 extern void __mmu_notifier_invalidate_range(struct mm_struct *mm,
  unsigned long start, unsigned long end);
 extern bool
@@ -481,14 +480,7 @@ mmu_notifier_invalidate_range_end(struct mmu_notifier_range *range)
might_sleep();
 
if (mm_has_notifiers(range->mm))
-   __mmu_notifier_invalidate_range_end(range, false);
-}
-
-static inline void
-mmu_notifier_invalidate_range_only_end(struct mmu_notifier_range *range)
-{
-   if (mm_has_notifiers(range->mm))
-   __mmu_notifier_invalidate_range_end(range, true);
+   __mmu_notifier_invalidate_range_end(range);
 }
 
 static inline void mmu_notifier_invalidate_range(struct mm_struct *mm,
@@ -582,45 +574,6 @@ static inline void mmu_notifier_range_init_owner(
__young;\
 })
 
-#define ptep_clear_flush_notify(__vma, __address, __ptep)		\
-({ \
-   unsigned long ___addr = __address & PAGE_MASK;  \
-   struct mm_struct *___mm = (__vma)->vm_mm;   \
-   pte_t ___pte;   \
-   \
-   ___pte = ptep_clear_flush(__vma, __address, __ptep);\
-   mmu_notifier_invalidate_range(___mm, ___addr,   \
-   ___addr + PAGE_SIZE);   \
-   \
-   ___pte; \
-})
-
-#define pmdp_huge_clear_flush_notify(__vma, __haddr, __pmd)\
-({ \
-   unsigned long ___haddr = __haddr & HPAGE_PMD_MASK;  \
-   struct mm_struct *___mm = (__vma)->vm_mm;   \
-   pmd_t ___pmd;   \
-   \
-   ___pmd = pmdp_huge_clear_flush(__vma, __haddr, __pmd);  \
-   mmu_notifier_invalidate_range(___mm, ___haddr,  \
- ___haddr + HPAGE_PMD_SIZE);   \
-   \
-   ___pmd; \
-})
-
-#define pudp_huge_clear_flush_notify(__vma, __haddr, __pud)\
-({ \
-   unsigned long ___haddr = __haddr & HPAGE_PUD_MASK;  \
-   struct mm_struct *___mm = (__vma)->vm_mm;   \
-   pud_t ___pud;   \
-   \
-   ___pud = pudp_huge_clear_flush(__vma, __haddr, __pud);  \
-   mmu_notifier_invalidate_range(___mm, ___haddr,  \
- ___haddr + HPAGE_PUD_SIZE);   \
-   \
-   ___pud; \
-})
-
 /*
  * set_pte_at_notify() sets the pte _after_ running the notifier.
  * This is safe to start by updating the 

[Intel-gfx] [PATCH v4 3/5] mmu_notifiers: Call invalidate_range() when invalidating TLBs

2023-07-26 Thread Alistair Popple
The invalidate_range() is going to become an architecture specific mmu
notifier used to keep the TLB of secondary MMUs such as an IOMMU in
sync with the CPU page tables. Currently it is called from separate
code paths to the main CPU TLB invalidations. This can lead to a
secondary TLB not getting invalidated when required and makes it hard
to reason about when exactly the secondary TLB is invalidated.

To fix this move the notifier call to the architecture specific TLB
maintenance functions for architectures that have secondary MMUs
requiring explicit software invalidations.

This fixes a SMMU bug on ARM64. On ARM64 PTE permission upgrades
require a TLB invalidation. This invalidation is done by the
architecture specific ptep_set_access_flags() which calls
flush_tlb_page() if required. However this doesn't call the notifier
resulting in infinite faults being generated by devices using the SMMU
if it has previously cached a read-only PTE in its TLB.

Moving the invalidations into the TLB invalidation functions ensures
all invalidations happen at the same time as the CPU invalidation. The
architecture specific flush_tlb_all() routines do not call the
notifier as none of the IOMMUs require this.

Signed-off-by: Alistair Popple 
Suggested-by: Jason Gunthorpe 
Tested-by: SeongJae Park 
Acked-by: Catalin Marinas 
Reviewed-by: Jason Gunthorpe 
---
 arch/arm64/include/asm/tlbflush.h | 5 +
 arch/powerpc/include/asm/book3s/64/tlbflush.h | 1 +
 arch/powerpc/mm/book3s64/radix_hugetlbpage.c  | 1 +
 arch/powerpc/mm/book3s64/radix_tlb.c  | 4 
 arch/x86/include/asm/tlbflush.h   | 2 ++
 arch/x86/mm/tlb.c | 2 ++
 include/asm-generic/tlb.h | 1 -
 7 files changed, 15 insertions(+), 1 deletion(-)

diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h
index 3456866..a99349d 100644
--- a/arch/arm64/include/asm/tlbflush.h
+++ b/arch/arm64/include/asm/tlbflush.h
@@ -13,6 +13,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 
@@ -252,6 +253,7 @@ static inline void flush_tlb_mm(struct mm_struct *mm)
__tlbi(aside1is, asid);
__tlbi_user(aside1is, asid);
dsb(ish);
+   mmu_notifier_invalidate_range(mm, 0, -1UL);
 }
 
 static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
@@ -263,6 +265,8 @@ static inline void __flush_tlb_page_nosync(struct mm_struct *mm,
addr = __TLBI_VADDR(uaddr, ASID(mm));
__tlbi(vale1is, addr);
__tlbi_user(vale1is, addr);
+   mmu_notifier_invalidate_range(mm, uaddr & PAGE_MASK,
+   (uaddr & PAGE_MASK) + PAGE_SIZE);
 }
 
 static inline void flush_tlb_page_nosync(struct vm_area_struct *vma,
@@ -396,6 +400,7 @@ static inline void __flush_tlb_range(struct vm_area_struct *vma,
scale++;
}
dsb(ish);
+   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
 }
 
 static inline void flush_tlb_range(struct vm_area_struct *vma,
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index 0d0c144..dca0477 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -5,6 +5,7 @@
 #define MMU_NO_CONTEXT ~0UL
 
 #include 
+#include 
 #include 
 #include 
 
diff --git a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
index 5e31955..f3fb49f 100644
--- a/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
+++ b/arch/powerpc/mm/book3s64/radix_hugetlbpage.c
@@ -39,6 +39,7 @@ void radix__flush_hugetlb_tlb_range(struct vm_area_struct *vma, unsigned long st
radix__flush_tlb_pwc_range_psize(vma->vm_mm, start, end, psize);
else
radix__flush_tlb_range_psize(vma->vm_mm, start, end, psize);
+   mmu_notifier_invalidate_range(vma->vm_mm, start, end);
 }
 
 void radix__huge_ptep_modify_prot_commit(struct vm_area_struct *vma,
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c b/arch/powerpc/mm/book3s64/radix_tlb.c
index 0bd4866..4d44902 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -987,6 +987,7 @@ void radix__flush_tlb_mm(struct mm_struct *mm)
}
}
preempt_enable();
+   mmu_notifier_invalidate_range(mm, 0, -1UL);
 }
 EXPORT_SYMBOL(radix__flush_tlb_mm);
 
@@ -1020,6 +1021,7 @@ static void __flush_all_mm(struct mm_struct *mm, bool fullmm)
_tlbiel_pid_multicast(mm, pid, RIC_FLUSH_ALL);
}
preempt_enable();
+   mmu_notifier_invalidate_range(mm, 0, -1UL);
 }
 
 void radix__flush_all_mm(struct mm_struct *mm)
@@ -1228,6 +1230,7 @@ static inline void __radix__flush_tlb_range(struct mm_struct *mm,
}
 out:
preempt_enable();
+   mmu_notifier_invalidate_range(mm, start, end);
 }
 
 void radix__flush_tlb_ra