Re: Kernel build failure with MEM_ALLOC_PROFILING=y set (Talos II, v6.10-rc5)
On Thu, Aug 22, 2024 at 11:04 AM Suren Baghdasaryan wrote: > > On Thu, Aug 22, 2024 at 10:18 AM LEROY Christophe > wrote: > > > > > > > > On 21/07/2024 at 01:09, Erhard Furtner wrote: > > > > > > On Sat, 29 Jun 2024 15:31:28 +0200 > > > Erhard Furtner wrote: > > > > > >> I get a build failure on v6.10-rc5 on my Talos II when > > >> MEM_ALLOC_PROFILING=y is enabled: > > >> > > >> [...] > > >>LD [M] fs/xfs/xfs.o > > >>LD [M] fs/bcachefs/bcachefs.o > > >>AR built-in.a > > >>AR vmlinux.a > > >>LD vmlinux.o > > >>OBJCOPY modules.builtin.modinfo > > >>GEN modules.builtin > > >>GEN .vmlinux.objs > > >>MODPOST Module.symvers > > >> ERROR: modpost: "page_ext_get" [arch/powerpc/kvm/kvm-hv.ko] undefined! > > >> ERROR: modpost: "mem_alloc_profiling_key" [arch/powerpc/kvm/kvm-hv.ko] > > >> undefined! > > >> ERROR: modpost: "page_ext_put" [arch/powerpc/kvm/kvm-hv.ko] undefined! > > >> make[2]: *** [scripts/Makefile.modpost:145: Module.symvers] Fehler 1 > > >> make[1]: *** [/usr/src/linux-stable/Makefile:1886: modpost] Fehler 2 > > >> make: *** [Makefile:240: __sub-make] Fehler 2 > > >> > > >> Same .config builds fine without MEM_ALLOC_PROFILING set. Kernel .config > > >> attached. > > > > > > Build problem still there on now released v6.10 with > > > MEM_ALLOC_PROFILING=y. > > > > > > Can't bisect as build with MEM_ALLOC_PROFILING fails since its > > > introduction in v6.10-rc1. > > > > > > > I guess those three functions are missing EXPORT_SYMBOL_GPL() tagging. > > The issue should have been fixed by the > https://lore.kernel.org/all/20240717181239.2510054-1-sur...@google.com/ > patchset. I probably forgot to CC stable@ for these changes. Let me > check and follow up. The issue is fixed by the patch titled "alloc_tag: outline and export free_reserved_page()", which is commit b3bebe44306e in Linus' tree and first appears in v6.10.3 as commit 4a9a52b70cce. > Thanks, > Suren. > > > > > Christophe
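For context on why the modpost errors appear and how the named fix resolves them, here is a pattern sketch (hypothetical helper name and body, not the upstream diff): a static inline helper in a header drags references to page_ext_get(), page_ext_put() and the mem_alloc_profiling_key static key into every module that inlines it (here kvm-hv.ko), and modpost fails because those symbols are not exported. Outlining the helper into builtin mm code and exporting only the helper removes those references from modules:

void release_reserved_page(struct page *page)	/* hypothetical name */
{
	if (mem_alloc_profiling_enabled()) {
		/*
		 * The alloc-tag bookkeeping that needs page_ext_get()/
		 * page_ext_put() now runs here, in builtin code, so modules
		 * no longer reference those unexported symbols directly.
		 */
	}
	ClearPageReserved(page);
	init_page_count(page);
	__free_page(page);
	adjust_managed_page_count(page, 1);
}
EXPORT_SYMBOL(release_reserved_page);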
Re: Kernel build failure with MEM_ALLOC_PROFILING=y set (Talos II, v6.10-rc5)
On Thu, Aug 22, 2024 at 10:18 AM LEROY Christophe wrote: > > > > On 21/07/2024 at 01:09, Erhard Furtner wrote: > > > > On Sat, 29 Jun 2024 15:31:28 +0200 > > Erhard Furtner wrote: > > > >> I get a build failure on v6.10-rc5 on my Talos II when > >> MEM_ALLOC_PROFILING=y is enabled: > >> > >> [...] > >>LD [M] fs/xfs/xfs.o > >>LD [M] fs/bcachefs/bcachefs.o > >>AR built-in.a > >>AR vmlinux.a > >>LD vmlinux.o > >>OBJCOPY modules.builtin.modinfo > >>GEN modules.builtin > >>GEN .vmlinux.objs > >>MODPOST Module.symvers > >> ERROR: modpost: "page_ext_get" [arch/powerpc/kvm/kvm-hv.ko] undefined! > >> ERROR: modpost: "mem_alloc_profiling_key" [arch/powerpc/kvm/kvm-hv.ko] > >> undefined! > >> ERROR: modpost: "page_ext_put" [arch/powerpc/kvm/kvm-hv.ko] undefined! > >> make[2]: *** [scripts/Makefile.modpost:145: Module.symvers] Fehler 1 > >> make[1]: *** [/usr/src/linux-stable/Makefile:1886: modpost] Fehler 2 > >> make: *** [Makefile:240: __sub-make] Fehler 2 > >> > >> Same .config builds fine without MEM_ALLOC_PROFILING set. Kernel .config > >> attached. > > > > Build problem still there on now released v6.10 with MEM_ALLOC_PROFILING=y. > > > > Can't bisect as build with MEM_ALLOC_PROFILING fails since its > > introduction in v6.10-rc1. > > > > I guess those three functions are missing EXPORT_SYMBOL_GPL() tagging. The issue should have been fixed by the https://lore.kernel.org/all/20240717181239.2510054-1-sur...@google.com/ patchset. I probably forgot to CC stable@ for these changes. Let me check and follow up. Thanks, Suren. > > Christophe
Re: [PATCH 7/7] x86: mm: accelerate pagefault when badaccess
On Wed, Apr 3, 2024 at 12:58 AM Kefeng Wang wrote: > > > > On 2024/4/3 13:59, Suren Baghdasaryan wrote: > > On Tue, Apr 2, 2024 at 12:53 AM Kefeng Wang > > wrote: > >> > >> The vm_flags of vma already checked under per-VMA lock, if it is a > >> bad access, directly handle error and return, there is no need to > >> lock_mm_and_find_vma() and check vm_flags again. > >> > >> Signed-off-by: Kefeng Wang > > > > Looks safe to me. > > Using (mm != NULL) to indicate that we are holding mmap_lock is not > > ideal but I guess that works. > > > > Yes, I will add this part it into change too, > > The access_error() of vma already checked under per-VMA lock, if it > is a bad access, directly handle error, no need to retry with mmap_lock > again. In order to release the correct lock, pass the mm_struct into > bad_area_access_error(), if mm is NULL, release vma lock, or release > mmap_lock. Since the page faut is handled under per-VMA lock, count it > as a vma lock event with VMA_LOCK_SUCCESS. The part about passing mm_struct is unnecessary IMHO. It explains "how you do things" but changelog should describe only "what you do" and "why you do that". The rest we can see from the code. > > Thanks. > > > > Reviewed-by: Suren Baghdasaryan > > > >> --- > >> arch/x86/mm/fault.c | 23 ++- > >> 1 file changed, 14 insertions(+), 9 deletions(-) > >> > >> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > >> index a4cc20d0036d..67b18adc75dd 100644 > >> --- a/arch/x86/mm/fault.c > >> +++ b/arch/x86/mm/fault.c > >> @@ -866,14 +866,17 @@ bad_area_nosemaphore(struct pt_regs *regs, unsigned > >> long error_code, > >> > >> static void > >> __bad_area(struct pt_regs *regs, unsigned long error_code, > >> - unsigned long address, u32 pkey, int si_code) > >> + unsigned long address, struct mm_struct *mm, > >> + struct vm_area_struct *vma, u32 pkey, int si_code) > >> { > >> - struct mm_struct *mm = current->mm; > >> /* > >> * Something tried to access memory that isn't in our memory map.. > >> * Fix it, but check if it's kernel or user first.. > >> */ > >> - mmap_read_unlock(mm); > >> + if (mm) > >> + mmap_read_unlock(mm); > >> + else > >> + vma_end_read(vma); > >> > >> __bad_area_nosemaphore(regs, error_code, address, pkey, si_code); > >> } > >> @@ -897,7 +900,8 @@ static inline bool bad_area_access_from_pkeys(unsigned > >> long error_code, > >> > >> static noinline void > >> bad_area_access_error(struct pt_regs *regs, unsigned long error_code, > >> - unsigned long address, struct vm_area_struct *vma) > >> + unsigned long address, struct mm_struct *mm, > >> + struct vm_area_struct *vma) > >> { > >> /* > >> * This OSPKE check is not strictly necessary at runtime. 
> >> @@ -927,9 +931,9 @@ bad_area_access_error(struct pt_regs *regs, unsigned > >> long error_code, > >> */ > >> u32 pkey = vma_pkey(vma); > >> > >> - __bad_area(regs, error_code, address, pkey, SEGV_PKUERR); > >> + __bad_area(regs, error_code, address, mm, vma, pkey, > >> SEGV_PKUERR); > >> } else { > >> - __bad_area(regs, error_code, address, 0, SEGV_ACCERR); > >> + __bad_area(regs, error_code, address, mm, vma, 0, > >> SEGV_ACCERR); > >> } > >> } > >> > >> @@ -1357,8 +1361,9 @@ void do_user_addr_fault(struct pt_regs *regs, > >> goto lock_mmap; > >> > >> if (unlikely(access_error(error_code, vma))) { > >> - vma_end_read(vma); > >> - goto lock_mmap; > >> + bad_area_access_error(regs, error_code, address, NULL, > >> vma); > >> + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); > >> + return; > >> } > >> fault = handle_mm_fault(vma, address, flags | > >> FAULT_FLAG_VMA_LOCK, regs); > >> if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) > >> @@ -1394,7 +1399,7 @@ void do_user_addr_fault(struct pt_regs *regs, > >> * we can handle it.. > >> */ > >> if (unlikely(access_error(error_code, vma))) { > >> - bad_area_access_error(regs, error_code, address, vma); > >> + bad_area_access_error(regs, error_code, address, mm, vma); > >> return; > >> } > >> > >> -- > >> 2.27.0 > >>
Re: [PATCH 7/7] x86: mm: accelerate pagefault when badaccess
On Tue, Apr 2, 2024 at 12:53 AM Kefeng Wang wrote: > > The vm_flags of vma already checked under per-VMA lock, if it is a > bad access, directly handle error and return, there is no need to > lock_mm_and_find_vma() and check vm_flags again. > > Signed-off-by: Kefeng Wang Looks safe to me. Using (mm != NULL) to indicate that we are holding mmap_lock is not ideal but I guess that works. Reviewed-by: Suren Baghdasaryan > --- > arch/x86/mm/fault.c | 23 ++- > 1 file changed, 14 insertions(+), 9 deletions(-) > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > index a4cc20d0036d..67b18adc75dd 100644 > --- a/arch/x86/mm/fault.c > +++ b/arch/x86/mm/fault.c > @@ -866,14 +866,17 @@ bad_area_nosemaphore(struct pt_regs *regs, unsigned > long error_code, > > static void > __bad_area(struct pt_regs *regs, unsigned long error_code, > - unsigned long address, u32 pkey, int si_code) > + unsigned long address, struct mm_struct *mm, > + struct vm_area_struct *vma, u32 pkey, int si_code) > { > - struct mm_struct *mm = current->mm; > /* > * Something tried to access memory that isn't in our memory map.. > * Fix it, but check if it's kernel or user first.. > */ > - mmap_read_unlock(mm); > + if (mm) > + mmap_read_unlock(mm); > + else > + vma_end_read(vma); > > __bad_area_nosemaphore(regs, error_code, address, pkey, si_code); > } > @@ -897,7 +900,8 @@ static inline bool bad_area_access_from_pkeys(unsigned > long error_code, > > static noinline void > bad_area_access_error(struct pt_regs *regs, unsigned long error_code, > - unsigned long address, struct vm_area_struct *vma) > + unsigned long address, struct mm_struct *mm, > + struct vm_area_struct *vma) > { > /* > * This OSPKE check is not strictly necessary at runtime. > @@ -927,9 +931,9 @@ bad_area_access_error(struct pt_regs *regs, unsigned long > error_code, > */ > u32 pkey = vma_pkey(vma); > > - __bad_area(regs, error_code, address, pkey, SEGV_PKUERR); > + __bad_area(regs, error_code, address, mm, vma, pkey, > SEGV_PKUERR); > } else { > - __bad_area(regs, error_code, address, 0, SEGV_ACCERR); > + __bad_area(regs, error_code, address, mm, vma, 0, > SEGV_ACCERR); > } > } > > @@ -1357,8 +1361,9 @@ void do_user_addr_fault(struct pt_regs *regs, > goto lock_mmap; > > if (unlikely(access_error(error_code, vma))) { > - vma_end_read(vma); > - goto lock_mmap; > + bad_area_access_error(regs, error_code, address, NULL, vma); > + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); > + return; > } > fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, > regs); > if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) > @@ -1394,7 +1399,7 @@ void do_user_addr_fault(struct pt_regs *regs, > * we can handle it.. > */ > if (unlikely(access_error(error_code, vma))) { > - bad_area_access_error(regs, error_code, address, vma); > + bad_area_access_error(regs, error_code, address, mm, vma); > return; > } > > -- > 2.27.0 >
Re: [PATCH 5/7] riscv: mm: accelerate pagefault when badaccess
On Tue, Apr 2, 2024 at 12:53 AM Kefeng Wang wrote: > > The vm_flags of vma already checked under per-VMA lock, if it is a > bad access, directly handle error and return, there is no need to > lock_mm_and_find_vma() and check vm_flags again. > > Signed-off-by: Kefeng Wang Reviewed-by: Suren Baghdasaryan > --- > arch/riscv/mm/fault.c | 5 - > 1 file changed, 4 insertions(+), 1 deletion(-) > > diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c > index 3ba1d4dde5dd..b3fcf7d67efb 100644 > --- a/arch/riscv/mm/fault.c > +++ b/arch/riscv/mm/fault.c > @@ -292,7 +292,10 @@ void handle_page_fault(struct pt_regs *regs) > > if (unlikely(access_error(cause, vma))) { > vma_end_read(vma); > - goto lock_mmap; > + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); > + tsk->thread.bad_cause = SEGV_ACCERR; > + bad_area_nosemaphore(regs, code, addr); > + return; > } > > fault = handle_mm_fault(vma, addr, flags | FAULT_FLAG_VMA_LOCK, regs); > -- > 2.27.0 >
Re: [PATCH 4/7] powerpc: mm: accelerate pagefault when badaccess
On Tue, Apr 2, 2024 at 12:53 AM Kefeng Wang wrote: > > The vm_flags of vma already checked under per-VMA lock, if it is a > bad access, directly handle error and return, there is no need to > lock_mm_and_find_vma() and check vm_flags again. > > Signed-off-by: Kefeng Wang Code-wise looks correct to me and almost identical to x86 change but someone with more experience with this architecture should review. > --- > arch/powerpc/mm/fault.c | 33 - > 1 file changed, 20 insertions(+), 13 deletions(-) > > diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c > index 53335ae21a40..215690452495 100644 > --- a/arch/powerpc/mm/fault.c > +++ b/arch/powerpc/mm/fault.c > @@ -71,23 +71,26 @@ static noinline int bad_area_nosemaphore(struct pt_regs > *regs, unsigned long add > return __bad_area_nosemaphore(regs, address, SEGV_MAPERR); > } > > -static int __bad_area(struct pt_regs *regs, unsigned long address, int > si_code) > +static int __bad_area(struct pt_regs *regs, unsigned long address, int > si_code, > + struct mm_struct *mm, struct vm_area_struct *vma) > { > - struct mm_struct *mm = current->mm; > > /* > * Something tried to access memory that isn't in our memory map.. > * Fix it, but check if it's kernel or user first.. > */ > - mmap_read_unlock(mm); > + if (mm) > + mmap_read_unlock(mm); > + else > + vma_end_read(vma); > > return __bad_area_nosemaphore(regs, address, si_code); > } > > static noinline int bad_access_pkey(struct pt_regs *regs, unsigned long > address, > + struct mm_struct *mm, > struct vm_area_struct *vma) > { > - struct mm_struct *mm = current->mm; > int pkey; > > /* > @@ -109,7 +112,10 @@ static noinline int bad_access_pkey(struct pt_regs > *regs, unsigned long address, > */ > pkey = vma_pkey(vma); > > - mmap_read_unlock(mm); > + if (mm) > + mmap_read_unlock(mm); > + else > + vma_end_read(vma); > > /* > * If we are in kernel mode, bail out with a SEGV, this will > @@ -124,9 +130,10 @@ static noinline int bad_access_pkey(struct pt_regs > *regs, unsigned long address, > return 0; > } > > -static noinline int bad_access(struct pt_regs *regs, unsigned long address) > +static noinline int bad_access(struct pt_regs *regs, unsigned long address, > + struct mm_struct *mm, struct vm_area_struct > *vma) > { > - return __bad_area(regs, address, SEGV_ACCERR); > + return __bad_area(regs, address, SEGV_ACCERR, mm, vma); > } > > static int do_sigbus(struct pt_regs *regs, unsigned long address, > @@ -479,13 +486,13 @@ static int ___do_page_fault(struct pt_regs *regs, > unsigned long address, > > if (unlikely(access_pkey_error(is_write, is_exec, >(error_code & DSISR_KEYFAULT), vma))) { > - vma_end_read(vma); > - goto lock_mmap; > + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); > + return bad_access_pkey(regs, address, NULL, vma); > } > > if (unlikely(access_error(is_write, is_exec, vma))) { > - vma_end_read(vma); > - goto lock_mmap; > + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); > + return bad_access(regs, address, NULL, vma); > } > > fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, > regs); > @@ -521,10 +528,10 @@ static int ___do_page_fault(struct pt_regs *regs, > unsigned long address, > > if (unlikely(access_pkey_error(is_write, is_exec, >(error_code & DSISR_KEYFAULT), vma))) > - return bad_access_pkey(regs, address, vma); > + return bad_access_pkey(regs, address, mm, vma); > > if (unlikely(access_error(is_write, is_exec, vma))) > - return bad_access(regs, address); > + return bad_access(regs, address, mm, vma); > > /* > * If for any reason at all we couldn't handle the 
fault, > -- > 2.27.0 >
Re: [PATCH 3/7] arm: mm: accelerate pagefault when VM_FAULT_BADACCESS
On Tue, Apr 2, 2024 at 12:53 AM Kefeng Wang wrote: > > The vm_flags of vma already checked under per-VMA lock, if it is a > bad access, directly set fault to VM_FAULT_BADACCESS and handle error, > so no need to lock_mm_and_find_vma() and check vm_flags again. > > Signed-off-by: Kefeng Wang Reviewed-by: Suren Baghdasaryan > --- > arch/arm/mm/fault.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c > index 439dc6a26bb9..5c4b417e24f9 100644 > --- a/arch/arm/mm/fault.c > +++ b/arch/arm/mm/fault.c > @@ -294,7 +294,9 @@ do_page_fault(unsigned long addr, unsigned int fsr, > struct pt_regs *regs) > > if (!(vma->vm_flags & vm_flags)) { > vma_end_read(vma); > - goto lock_mmap; > + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); > + fault = VM_FAULT_BADACCESS; > + goto bad_area; > } > fault = handle_mm_fault(vma, addr, flags | FAULT_FLAG_VMA_LOCK, regs); > if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) > -- > 2.27.0 >
Re: [PATCH 2/7] arm64: mm: accelerate pagefault when VM_FAULT_BADACCESS
On Tue, Apr 2, 2024 at 10:19 PM Suren Baghdasaryan wrote: > > On Tue, Apr 2, 2024 at 12:53 AM Kefeng Wang > wrote: > > > > The vm_flags of vma already checked under per-VMA lock, if it is a > > bad access, directly set fault to VM_FAULT_BADACCESS and handle error, > > no need to lock_mm_and_find_vma() and check vm_flags again, the latency > > time reduce 34% in lmbench 'lat_sig -P 1 prot lat_sig'. > > The change makes sense to me. Per-VMA lock is enough to keep > vma->vm_flags stable, so no need to retry with mmap_lock. > > > > > Signed-off-by: Kefeng Wang > > Reviewed-by: Suren Baghdasaryan > > > --- > > arch/arm64/mm/fault.c | 4 +++- > > 1 file changed, 3 insertions(+), 1 deletion(-) > > > > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c > > index 9bb9f395351a..405f9aa831bd 100644 > > --- a/arch/arm64/mm/fault.c > > +++ b/arch/arm64/mm/fault.c > > @@ -572,7 +572,9 @@ static int __kprobes do_page_fault(unsigned long far, > > unsigned long esr, > > > > if (!(vma->vm_flags & vm_flags)) { > > vma_end_read(vma); > > - goto lock_mmap; > > + fault = VM_FAULT_BADACCESS; > > + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); > > nit: VMA_LOCK_SUCCESS accounting here seems correct to me but > unrelated to the main change. Either splitting into a separate patch > or mentioning this additional fixup in the changelog would be helpful. The above nit applies to all the patches after this one, so I won't comment on each one separately. If you decide to split or adjust the changelog please do that for each patch. > > > + goto done; > > } > > fault = handle_mm_fault(vma, addr, mm_flags | FAULT_FLAG_VMA_LOCK, > > regs); > > if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) > > -- > > 2.27.0 > >
Re: [PATCH 2/7] arm64: mm: accelerate pagefault when VM_FAULT_BADACCESS
On Tue, Apr 2, 2024 at 12:53 AM Kefeng Wang wrote: > > The vm_flags of vma already checked under per-VMA lock, if it is a > bad access, directly set fault to VM_FAULT_BADACCESS and handle error, > no need to lock_mm_and_find_vma() and check vm_flags again, the latency > time reduce 34% in lmbench 'lat_sig -P 1 prot lat_sig'. The change makes sense to me. Per-VMA lock is enough to keep vma->vm_flags stable, so no need to retry with mmap_lock. > > Signed-off-by: Kefeng Wang Reviewed-by: Suren Baghdasaryan > --- > arch/arm64/mm/fault.c | 4 +++- > 1 file changed, 3 insertions(+), 1 deletion(-) > > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c > index 9bb9f395351a..405f9aa831bd 100644 > --- a/arch/arm64/mm/fault.c > +++ b/arch/arm64/mm/fault.c > @@ -572,7 +572,9 @@ static int __kprobes do_page_fault(unsigned long far, > unsigned long esr, > > if (!(vma->vm_flags & vm_flags)) { > vma_end_read(vma); > - goto lock_mmap; > + fault = VM_FAULT_BADACCESS; > + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); nit: VMA_LOCK_SUCCESS accounting here seems correct to me but unrelated to the main change. Either splitting into a separate patch or mentioning this additional fixup in the changelog would be helpful. > + goto done; > } > fault = handle_mm_fault(vma, addr, mm_flags | FAULT_FLAG_VMA_LOCK, > regs); > if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) > -- > 2.27.0 >
Re: [PATCH 1/7] arm64: mm: cleanup __do_page_fault()
On Tue, Apr 2, 2024 at 12:53 AM Kefeng Wang wrote: > > The __do_page_fault() only check vma->flags and call handle_mm_fault(), > and only called by do_page_fault(), let's squash it into do_page_fault() > to cleanup code. > > Signed-off-by: Kefeng Wang Reviewed-by: Suren Baghdasaryan > --- > arch/arm64/mm/fault.c | 27 +++ > 1 file changed, 7 insertions(+), 20 deletions(-) > > diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c > index 8251e2fea9c7..9bb9f395351a 100644 > --- a/arch/arm64/mm/fault.c > +++ b/arch/arm64/mm/fault.c > @@ -486,25 +486,6 @@ static void do_bad_area(unsigned long far, unsigned long > esr, > } > } > > -#define VM_FAULT_BADMAP((__force vm_fault_t)0x01) > -#define VM_FAULT_BADACCESS ((__force vm_fault_t)0x02) > - > -static vm_fault_t __do_page_fault(struct mm_struct *mm, > - struct vm_area_struct *vma, unsigned long > addr, > - unsigned int mm_flags, unsigned long > vm_flags, > - struct pt_regs *regs) > -{ > - /* > -* Ok, we have a good vm_area for this memory access, so we can handle > -* it. > -* Check that the permissions on the VMA allow for the fault which > -* occurred. > -*/ > - if (!(vma->vm_flags & vm_flags)) > - return VM_FAULT_BADACCESS; > - return handle_mm_fault(vma, addr, mm_flags, regs); > -} > - > static bool is_el0_instruction_abort(unsigned long esr) > { > return ESR_ELx_EC(esr) == ESR_ELx_EC_IABT_LOW; > @@ -519,6 +500,9 @@ static bool is_write_abort(unsigned long esr) > return (esr & ESR_ELx_WNR) && !(esr & ESR_ELx_CM); > } > > +#define VM_FAULT_BADMAP((__force vm_fault_t)0x01) > +#define VM_FAULT_BADACCESS ((__force vm_fault_t)0x02) > + > static int __kprobes do_page_fault(unsigned long far, unsigned long esr, >struct pt_regs *regs) > { > @@ -617,7 +601,10 @@ static int __kprobes do_page_fault(unsigned long far, > unsigned long esr, > goto done; > } > > - fault = __do_page_fault(mm, vma, addr, mm_flags, vm_flags, regs); > + if (!(vma->vm_flags & vm_flags)) > + fault = VM_FAULT_BADACCESS; > + else > + fault = handle_mm_fault(vma, addr, mm_flags, regs); > > /* Quick path to respond to signals */ > if (fault_signal_pending(fault, regs)) { > -- > 2.27.0 >
Re: [PATCH 1/1] arch/arm/mm: fix major fault accounting when retrying under per-VMA lock
On Thu, Jan 25, 2024 at 6:04 PM Andrew Morton wrote: > > On Mon, 22 Jan 2024 22:43:05 -0800 Suren Baghdasaryan > wrote: > > > The change [1] missed ARM architecture when fixing major fault accounting > > for page fault retry under per-VMA lock. Add missing code to fix ARM > > architecture fault accounting. > > > > [1] 46e714c729c8 ("arch/mm/fault: fix major fault accounting when retrying > > under per-VMA lock") > > > > Fixes: 12214eba1992 ("mm: handle read faults under the VMA lock") > > What are the userspace-visible runtime effects of this change? The user-visible effect is that it restores correct major fault accounting, which was broken after [2] was merged in the 6.7 kernel. A more detailed description is in [3]; this patch simply adds the same fix to the ARM architecture, which I missed in [3]. I can re-send the patch with the full description from [3] if needed. > > Is a cc:stable backport desirable? Yes, I guess since [2] was merged in 6.7, cc-ing stable would be desirable. Please let me know if you want me to re-send the patch with the full description and CC stable. [2] https://lore.kernel.org/all/20231006195318.4087158-6-wi...@infradead.org/ [3] https://lore.kernel.org/all/20231226214610.109282-1-sur...@google.com/ > > > Reported-by: Russell King (Oracle) > > Signed-off-by: Suren Baghdasaryan
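To illustrate what the restored accounting means to userspace (this snippet is not taken from the referenced Android test): the counters in question are the per-task major/minor fault counts, which a process can read, for example, through getrusage():

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
	struct rusage ru;

	if (getrusage(RUSAGE_SELF, &ru) != 0) {
		perror("getrusage");
		return 1;
	}
	/* ru_majflt is the counter that stops incrementing correctly without the fix */
	printf("major faults: %ld, minor faults: %ld\n",
	       ru.ru_majflt, ru.ru_minflt);
	return 0;
}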
Re: [PATCH 1/1] arch/mm/fault: fix major fault accounting when retrying under per-VMA lock
On Sun, Jan 21, 2024 at 11:38 PM Suren Baghdasaryan wrote: > > On Sat, Jan 20, 2024 at 1:15 PM Russell King (Oracle) > wrote: > > > > On Sat, Jan 20, 2024 at 09:09:47PM +, > > patchwork-bot+linux-ri...@kernel.org wrote: > > > Hello: > > > > > > This patch was applied to riscv/linux.git (fixes) > > > by Andrew Morton : > > > > > > On Tue, 26 Dec 2023 13:46:10 -0800 you wrote: > > > > A test [1] in Android test suite started failing after [2] was merged. > > > > It turns out that after handling a major fault under per-VMA lock, the > > > > process major fault counter does not register that fault as major. > > > > Before [2] read faults would be done under mmap_lock, in which case > > > > FAULT_FLAG_TRIED flag is set before retrying. That in turn causes > > > > mm_account_fault() to account the fault as major once retry completes. > > > > With per-VMA locks we often retry because a fault can't be handled > > > > without locking the whole mm using mmap_lock. Therefore such retries > > > > do not set FAULT_FLAG_TRIED flag. This logic does not work after [2] > > > > because we can now handle read major faults under per-VMA lock and > > > > upon retry the fact there was a major fault gets lost. Fix this by > > > > setting FAULT_FLAG_TRIED after retrying under per-VMA lock if > > > > VM_FAULT_MAJOR was returned. Ideally we would use an additional > > > > VM_FAULT bit to indicate the reason for the retry (could not handle > > > > under per-VMA lock vs other reason) but this simpler solution seems > > > > to work, so keeping it simple. > > > > > > > > [...] > > > > > > Here is the summary with links: > > > - [1/1] arch/mm/fault: fix major fault accounting when retrying under > > > per-VMA lock > > > https://git.kernel.org/riscv/c/46e714c729c8 > > > > > > You are awesome, thank you! > > > > Now that 32-bit ARM has support for the per-VMA lock, does that also > > need to be patched? > > Yes, I think so. I missed the ARM32 change that added support for > per-VMA locks. Will post a similar patch for it tomorrow. Fix for ARM posted at https://lore.kernel.org/all/20240123064305.2829244-1-sur...@google.com/ > Thanks, > Suren. > > > > > -- > > RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ > > FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
[PATCH 1/1] arch/arm/mm: fix major fault accounting when retrying under per-VMA lock
The change [1] missed ARM architecture when fixing major fault accounting for page fault retry under per-VMA lock. Add missing code to fix ARM architecture fault accounting. [1] 46e714c729c8 ("arch/mm/fault: fix major fault accounting when retrying under per-VMA lock") Fixes: 12214eba1992 ("mm: handle read faults under the VMA lock") Reported-by: Russell King (Oracle) Signed-off-by: Suren Baghdasaryan --- arch/arm/mm/fault.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/arch/arm/mm/fault.c b/arch/arm/mm/fault.c index e96fb40b9cc3..07565b593ed6 100644 --- a/arch/arm/mm/fault.c +++ b/arch/arm/mm/fault.c @@ -298,6 +298,8 @@ do_page_fault(unsigned long addr, unsigned int fsr, struct pt_regs *regs) goto done; } count_vm_vma_lock_event(VMA_LOCK_RETRY); + if (fault & VM_FAULT_MAJOR) + flags |= FAULT_FLAG_TRIED; /* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { -- 2.43.0.429.g432eaa2c6b-goog
Re: [PATCH 1/1] arch/mm/fault: fix major fault accounting when retrying under per-VMA lock
On Sat, Jan 20, 2024 at 1:15 PM Russell King (Oracle) wrote: > > On Sat, Jan 20, 2024 at 09:09:47PM +, > patchwork-bot+linux-ri...@kernel.org wrote: > > Hello: > > > > This patch was applied to riscv/linux.git (fixes) > > by Andrew Morton : > > > > On Tue, 26 Dec 2023 13:46:10 -0800 you wrote: > > > A test [1] in Android test suite started failing after [2] was merged. > > > It turns out that after handling a major fault under per-VMA lock, the > > > process major fault counter does not register that fault as major. > > > Before [2] read faults would be done under mmap_lock, in which case > > > FAULT_FLAG_TRIED flag is set before retrying. That in turn causes > > > mm_account_fault() to account the fault as major once retry completes. > > > With per-VMA locks we often retry because a fault can't be handled > > > without locking the whole mm using mmap_lock. Therefore such retries > > > do not set FAULT_FLAG_TRIED flag. This logic does not work after [2] > > > because we can now handle read major faults under per-VMA lock and > > > upon retry the fact there was a major fault gets lost. Fix this by > > > setting FAULT_FLAG_TRIED after retrying under per-VMA lock if > > > VM_FAULT_MAJOR was returned. Ideally we would use an additional > > > VM_FAULT bit to indicate the reason for the retry (could not handle > > > under per-VMA lock vs other reason) but this simpler solution seems > > > to work, so keeping it simple. > > > > > > [...] > > > > Here is the summary with links: > > - [1/1] arch/mm/fault: fix major fault accounting when retrying under > > per-VMA lock > > https://git.kernel.org/riscv/c/46e714c729c8 > > > > You are awesome, thank you! > > Now that 32-bit ARM has support for the per-VMA lock, does that also > need to be patched? Yes, I think so. I missed the ARM32 change that added support for per-VMA locks. Will post a similar patch for it tomorrow. Thanks, Suren. > > -- > RMK's Patch system: https://www.armlinux.org.uk/developer/patches/ > FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!
[PATCH 1/1] arch/mm/fault: fix major fault accounting when retrying under per-VMA lock
A test [1] in Android test suite started failing after [2] was merged. It turns out that after handling a major fault under per-VMA lock, the process major fault counter does not register that fault as major. Before [2] read faults would be done under mmap_lock, in which case FAULT_FLAG_TRIED flag is set before retrying. That in turn causes mm_account_fault() to account the fault as major once retry completes. With per-VMA locks we often retry because a fault can't be handled without locking the whole mm using mmap_lock. Therefore such retries do not set FAULT_FLAG_TRIED flag. This logic does not work after [2] because we can now handle read major faults under per-VMA lock and upon retry the fact there was a major fault gets lost. Fix this by setting FAULT_FLAG_TRIED after retrying under per-VMA lock if VM_FAULT_MAJOR was returned. Ideally we would use an additional VM_FAULT bit to indicate the reason for the retry (could not handle under per-VMA lock vs other reason) but this simpler solution seems to work, so keeping it simple. [1] https://cs.android.com/android/platform/superproject/+/master:test/vts-testcase/kernel/api/drop_caches_prop/drop_caches_test.cpp [2] https://lore.kernel.org/all/20231006195318.4087158-6-wi...@infradead.org/ Fixes: 12214eba1992 ("mm: handle read faults under the VMA lock") Cc: Matthew Wilcox Signed-off-by: Suren Baghdasaryan --- arch/arm64/mm/fault.c | 2 ++ arch/powerpc/mm/fault.c | 2 ++ arch/riscv/mm/fault.c | 2 ++ arch/s390/mm/fault.c| 3 +++ arch/x86/mm/fault.c | 2 ++ 5 files changed, 11 insertions(+) diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index 460d799e1296..55f6455a8284 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -607,6 +607,8 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, goto done; } count_vm_vma_lock_event(VMA_LOCK_RETRY); + if (fault & VM_FAULT_MAJOR) + mm_flags |= FAULT_FLAG_TRIED; /* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 9e49ede2bc1c..53335ae21a40 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -497,6 +497,8 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address, goto done; } count_vm_vma_lock_event(VMA_LOCK_RETRY); + if (fault & VM_FAULT_MAJOR) + flags |= FAULT_FLAG_TRIED; if (fault_signal_pending(fault, regs)) return user_mode(regs) ? 
0 : SIGBUS; diff --git a/arch/riscv/mm/fault.c b/arch/riscv/mm/fault.c index 90d4ba36d1d0..081339ddf47e 100644 --- a/arch/riscv/mm/fault.c +++ b/arch/riscv/mm/fault.c @@ -304,6 +304,8 @@ void handle_page_fault(struct pt_regs *regs) goto done; } count_vm_vma_lock_event(VMA_LOCK_RETRY); + if (fault & VM_FAULT_MAJOR) + flags |= FAULT_FLAG_TRIED; if (fault_signal_pending(fault, regs)) { if (!user_mode(regs)) diff --git a/arch/s390/mm/fault.c b/arch/s390/mm/fault.c index 249aefcf7c4e..ab4098886e56 100644 --- a/arch/s390/mm/fault.c +++ b/arch/s390/mm/fault.c @@ -337,6 +337,9 @@ static void do_exception(struct pt_regs *regs, int access) return; } count_vm_vma_lock_event(VMA_LOCK_RETRY); + if (fault & VM_FAULT_MAJOR) + flags |= FAULT_FLAG_TRIED; + /* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { if (!user_mode(regs)) diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index ab778eac1952..679b09cfe241 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -1370,6 +1370,8 @@ void do_user_addr_fault(struct pt_regs *regs, goto done; } count_vm_vma_lock_event(VMA_LOCK_RETRY); + if (fault & VM_FAULT_MAJOR) + flags |= FAULT_FLAG_TRIED; /* Quick path to respond to signals */ if (fault_signal_pending(fault, regs)) { -- 2.43.0.472.g3155946c3a-goog
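For context, the accounting this patch relies on is the major/minor decision in mm_account_fault(); a simplified sketch of that decision (not a verbatim copy of mm/memory.c):

/*
 * A fault is counted as major if the final attempt returned VM_FAULT_MAJOR
 * or if the handler is retrying with FAULT_FLAG_TRIED set -- which is why
 * setting FAULT_FLAG_TRIED after a major fault under the per-VMA lock
 * preserves the major count once the retry under mmap_lock completes.
 */
static void account_fault_sketch(struct task_struct *tsk,
				 unsigned int flags, vm_fault_t ret)
{
	bool major = (ret & VM_FAULT_MAJOR) || (flags & FAULT_FLAG_TRIED);

	if (major)
		tsk->maj_flt++;
	else
		tsk->min_flt++;
}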
Re: [PATCH v2 3/3] fork: lock VMAs of the parent process when forking
On Wed, Aug 9, 2023 at 2:07 PM Mateusz Guzik wrote: > > On 8/5/23, Suren Baghdasaryan wrote: > > On Fri, Aug 4, 2023 at 6:06 PM Mateusz Guzik wrote: > >> > >> On 8/5/23, Linus Torvalds wrote: > >> > On Fri, 4 Aug 2023 at 16:25, Mateusz Guzik wrote: > >> >> > >> >> I know of these guys, I think they are excluded as is -- they go > >> >> through access_remote_vm, starting with: > >> >> if (mmap_read_lock_killable(mm)) > >> >> return 0; > >> >> > >> >> while dup_mmap already write locks the parent's mm. > >> > > >> > Oh, you're only worried about vma_start_write()? > >> > > >> > That's a non-issue. It doesn't take the lock normally, since it starts > >> > off > >> > with > >> > > >> > if (__is_vma_write_locked(vma, &mm_lock_seq)) > >> > return; > >> > > >> > which catches on the lock sequence number already being set. > >> > > >> > So no extra locking there. > >> > > >> > Well, technically there's extra locking because the code stupidly > >> > doesn't initialize new vma allocations to the right sequence number, > >> > but that was talked about here: > >> > > >> > > >> > https://lore.kernel.org/all/CAHk-=wicrwaoeesbuogoqqufvesicbgp3cx0lykgevsfazn...@mail.gmail.com/ > >> > > >> > and it's a separate issue. > >> > > >> > >> I'm going to bet one beer this is the issue. > >> > >> The patch I'm responding to only consists of adding the call to > >> vma_start_write and claims the 5% slowdown from it, while fixing > >> crashes if the forking process is multithreaded. > >> > >> For the fix to work it has to lock something against the parent. > >> > >> VMA_ITERATOR(old_vmi, oldmm, 0); > >> [..] > >> for_each_vma(old_vmi, mpnt) { > >> [..] > >> vma_start_write(mpnt); > >> > >> the added line locks an obj in the parent's vm space. > >> > >> The problem you linked looks like pessimization for freshly allocated > >> vmas, but that's what is being operated on here. > > > > Sorry, now I'm having trouble understanding the problem you are > > describing. We are locking the parent's vma before copying it and the > > newly created vma is locked before it's added into the vma tree. What > > is the problem then? > > > > Sorry for the late reply! > > Looks there has been a bunch of weird talking past one another in this > thread and I don't think trying to straighten it all out is worth any > time. > > I think at least the two of us agree that if a single-threaded process > enters dup_mmap an > down_writes the mmap semaphore, then no new thread can pop up in said > process, thus no surprise page faults from that angle. 3rd parties are > supposed to interfaces like access_remote_vm, which down_read said > semaphore and are consequently also not a problem. The only worry here > is that someone is messing with another process memory without the > semaphore, but is very unlikely and patchable in the worst case -- but > someone(tm) has to audit. With all these conditions satisfied one can > elide vma_start_write for a perf win. > > Finally, I think we agreed you are going to do the audit ;) Ack. I'll look into this once the dust settles. Thanks! > > Cheers, > -- > Mateusz Guzik
Re: [PATCH v2 3/3] fork: lock VMAs of the parent process when forking
On Fri, Aug 4, 2023 at 6:06 PM Mateusz Guzik wrote: > > On 8/5/23, Linus Torvalds wrote: > > On Fri, 4 Aug 2023 at 16:25, Mateusz Guzik wrote: > >> > >> I know of these guys, I think they are excluded as is -- they go > >> through access_remote_vm, starting with: > >> if (mmap_read_lock_killable(mm)) > >> return 0; > >> > >> while dup_mmap already write locks the parent's mm. > > > > Oh, you're only worried about vma_start_write()? > > > > That's a non-issue. It doesn't take the lock normally, since it starts off > > with > > > > if (__is_vma_write_locked(vma, &mm_lock_seq)) > > return; > > > > which catches on the lock sequence number already being set. > > > > So no extra locking there. > > > > Well, technically there's extra locking because the code stupidly > > doesn't initialize new vma allocations to the right sequence number, > > but that was talked about here: > > > > > > https://lore.kernel.org/all/CAHk-=wicrwaoeesbuogoqqufvesicbgp3cx0lykgevsfazn...@mail.gmail.com/ > > > > and it's a separate issue. > > > > I'm going to bet one beer this is the issue. > > The patch I'm responding to only consists of adding the call to > vma_start_write and claims the 5% slowdown from it, while fixing > crashes if the forking process is multithreaded. > > For the fix to work it has to lock something against the parent. > > VMA_ITERATOR(old_vmi, oldmm, 0); > [..] > for_each_vma(old_vmi, mpnt) { > [..] > vma_start_write(mpnt); > > the added line locks an obj in the parent's vm space. > > The problem you linked looks like pessimization for freshly allocated > vmas, but that's what is being operated on here. Sorry, now I'm having trouble understanding the problem you are describing. We are locking the parent's vma before copying it and the newly created vma is locked before it's added into the vma tree. What is the problem then? > > -- > Mateusz Guzik
Re: [PATCH v2 3/3] fork: lock VMAs of the parent process when forking
On Fri, Aug 4, 2023 at 6:17 PM Mateusz Guzik wrote: > > On 8/5/23, Suren Baghdasaryan wrote: > > On Fri, Aug 4, 2023 at 5:49 PM Mateusz Guzik wrote: > >> However, the other users (that I know of ) go through the mmap > >> semaphore to mess with anything which means they will wait for > >> dup_mmap to finish (or do their work first). I would be surprised if > >> there were any cases which don't take the semaphore, given that it was > >> a requirement prior to the vma patchset (unless you patched some to no > >> longer need it?). I would guess worst case the semaphore can be added > >> if missing. > > > > No, the only mmap_lock read-lock that is affected is during the page > > fault, which is expected. > > > > I have difficulty parsing your statement. I was just saying that vma lock patchset did not touch any other mmap_locking paths except for the page fault one where we try to skip read-locking mmap_lock. > > I am saying that any 3rd parties which can trigger page faults already > read lock mmap_lock or can be made to do it (and I don't know any case > which does not already, but I'm not willing to spend time poking > around to make sure). One can consider 3rd parties as not a problem, > modulo the audit. > > Past that there does is no known source of trouble? In my original > e-mail I was worried about processes up the chain in ancestry, perhaps > some of the state is shared(?) and the locking at hand neuters any > problems. I'm guessing this is not necessary. > > Bottom line though it looks like this will work fine? > > That said, I'm not going to submit a patch I can't confidently defend. > As I did not dig into any of the VMA code and can't be arsed to audit > all places which mess with "foreign" mm, I'm definitely not submitting > this myself. You are most welcome to write your own variant at your > leisure. :) Ok, I see. I'll need to double check locking when a 3rd party is involved. Will post a patch when I'm confident enough it's safe. Thanks! > > -- > Mateusz Guzik
Re: [PATCH v2 3/3] fork: lock VMAs of the parent process when forking
On Fri, Aug 4, 2023 at 5:49 PM Mateusz Guzik wrote: > > On 8/5/23, Suren Baghdasaryan wrote: > > On Fri, Aug 4, 2023 at 5:26 PM Suren Baghdasaryan > > wrote: > >> > >> On Fri, Aug 4, 2023 at 5:15 PM Linus Torvalds > >> wrote: > >> > > >> > On Fri, 4 Aug 2023 at 16:25, Mateusz Guzik wrote: > >> > > > >> > > I know of these guys, I think they are excluded as is -- they go > >> > > through access_remote_vm, starting with: > >> > > if (mmap_read_lock_killable(mm)) > >> > > return 0; > >> > > > >> > > while dup_mmap already write locks the parent's mm. > >> > > >> > Oh, you're only worried about vma_start_write()? > >> > > >> > That's a non-issue. It doesn't take the lock normally, since it starts > >> > off with > >> > > >> > if (__is_vma_write_locked(vma, &mm_lock_seq)) > >> > return; > >> > > >> > which catches on the lock sequence number already being set. > >> > >> That check will prevent re-locking but if vma is not already locked > >> then the call will proceed with obtaining the lock and setting > >> vma->vm_lock_seq to mm->mm_lock_seq. > > > > The optimization Mateusz describes looks valid to me. If there is > > nobody else to fault a page and mm_users is stable (which I think it > > is because we are holding mmap_lock for write) then we can skip vma > > locking, I think. > > > > mm_users is definitely *not* stable -- it can be bumped by > get_task_mm, which is only synchronized with task lock. Ugh, you are of course correct. Poor choice for saying no new users (threads) can appear from under us. > > However, the other users (that I know of ) go through the mmap > semaphore to mess with anything which means they will wait for > dup_mmap to finish (or do their work first). I would be surprised if > there were any cases which don't take the semaphore, given that it was > a requirement prior to the vma patchset (unless you patched some to no > longer need it?). I would guess worst case the semaphore can be added > if missing. No, the only mmap_lock read-lock that is affected is during the page fault, which is expected. > > What is guaranteed is that if the forking process is single-threaded, > there will be no threads added out of nowhere -- the only thread which > could do it is busy creating one in dup_mmap. If multithreaded > operation of the forking process was the only problem, that's it. > > >> > >> > > >> > So no extra locking there. > >> > > >> > Well, technically there's extra locking because the code stupidly > >> > doesn't initialize new vma allocations to the right sequence number, > >> > but that was talked about here: > >> > > >> > > >> > https://lore.kernel.org/all/CAHk-=wicrwaoeesbuogoqqufvesicbgp3cx0lykgevsfazn...@mail.gmail.com/ > >> > > >> > and it's a separate issue. > >> > > >> > Linus > > > > > -- > Mateusz Guzik
Re: [PATCH v2 3/3] fork: lock VMAs of the parent process when forking
On Fri, Aug 4, 2023 at 5:26 PM Suren Baghdasaryan wrote: > > On Fri, Aug 4, 2023 at 5:15 PM Linus Torvalds > wrote: > > > > On Fri, 4 Aug 2023 at 16:25, Mateusz Guzik wrote: > > > > > > I know of these guys, I think they are excluded as is -- they go > > > through access_remote_vm, starting with: > > > if (mmap_read_lock_killable(mm)) > > > return 0; > > > > > > while dup_mmap already write locks the parent's mm. > > > > Oh, you're only worried about vma_start_write()? > > > > That's a non-issue. It doesn't take the lock normally, since it starts off > > with > > > > if (__is_vma_write_locked(vma, &mm_lock_seq)) > > return; > > > > which catches on the lock sequence number already being set. > > That check will prevent re-locking but if vma is not already locked > then the call will proceed with obtaining the lock and setting > vma->vm_lock_seq to mm->mm_lock_seq. The optimization Mateusz describes looks valid to me. If there is nobody else to fault a page and mm_users is stable (which I think it is because we are holding mmap_lock for write) then we can skip vma locking, I think. > > > > > So no extra locking there. > > > > Well, technically there's extra locking because the code stupidly > > doesn't initialize new vma allocations to the right sequence number, > > but that was talked about here: > > > > > > https://lore.kernel.org/all/CAHk-=wicrwaoeesbuogoqqufvesicbgp3cx0lykgevsfazn...@mail.gmail.com/ > > > > and it's a separate issue. > > > > Linus
Re: [PATCH v2 3/3] fork: lock VMAs of the parent process when forking
On Fri, Aug 4, 2023 at 5:15 PM Linus Torvalds wrote: > > On Fri, 4 Aug 2023 at 16:25, Mateusz Guzik wrote: > > > > I know of these guys, I think they are excluded as is -- they go > > through access_remote_vm, starting with: > > if (mmap_read_lock_killable(mm)) > > return 0; > > > > while dup_mmap already write locks the parent's mm. > > Oh, you're only worried about vma_start_write()? > > That's a non-issue. It doesn't take the lock normally, since it starts off > with > > if (__is_vma_write_locked(vma, &mm_lock_seq)) > return; > > which catches on the lock sequence number already being set. That check will prevent re-locking but if vma is not already locked then the call will proceed with obtaining the lock and setting vma->vm_lock_seq to mm->mm_lock_seq. > > So no extra locking there. > > Well, technically there's extra locking because the code stupidly > doesn't initialize new vma allocations to the right sequence number, > but that was talked about here: > > > https://lore.kernel.org/all/CAHk-=wicrwaoeesbuogoqqufvesicbgp3cx0lykgevsfazn...@mail.gmail.com/ > > and it's a separate issue. > > Linus
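For reference, the fast path discussed above looks roughly like this in the 6.4/6.5-era code (simplified sketch, details trimmed, not a verbatim copy):

static inline void vma_start_write(struct vm_area_struct *vma)
{
	int mm_lock_seq;

	/*
	 * Already write-locked during this mmap_lock write cycle: the
	 * sequence numbers match and no per-VMA lock is taken at all.
	 */
	if (__is_vma_write_locked(vma, &mm_lock_seq))
		return;

	down_write(&vma->vm_lock->lock);
	/* Marks the VMA write-locked until mm->mm_lock_seq is bumped again. */
	vma->vm_lock_seq = mm_lock_seq;
	up_write(&vma->vm_lock->lock);
}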
Re: [PATCH rfc -next 01/10] mm: add a generic VMA lock-based page fault handler
On Thu, Jul 13, 2023 at 9:15 AM Matthew Wilcox wrote: > > > +int try_vma_locked_page_fault(struct vm_locked_fault *vmlf, vm_fault_t > > *ret) > > +{ > > + struct vm_area_struct *vma; > > + vm_fault_t fault; > > > On Thu, Jul 13, 2023 at 05:53:29PM +0800, Kefeng Wang wrote: > > +#define VM_LOCKED_FAULT_INIT(_name, _mm, _address, _fault_flags, > > _vm_flags, _regs, _fault_code) \ > > + _name.mm= _mm; \ > > + _name.address = _address; \ > > + _name.fault_flags = _fault_flags; \ > > + _name.vm_flags = _vm_flags;\ > > + _name.regs = _regs;\ > > + _name.fault_code= _fault_code > > More consolidated code is a good idea; no question. But I don't think > this is the right way to do it. > > > +int __weak arch_vma_check_access(struct vm_area_struct *vma, > > + struct vm_locked_fault *vmlf); > > This should be: > > #ifndef vma_check_access > bool vma_check_access(struct vm_area_struct *vma, ) > { > return (vma->vm_flags & vm_flags) == 0; > } > #endif > > and then arches which want to do something different can just define > vma_check_access. > > > +int try_vma_locked_page_fault(struct vm_locked_fault *vmlf, vm_fault_t > > *ret) > > +{ > > + struct vm_area_struct *vma; > > + vm_fault_t fault; > > Declaring the vmf in this function and then copying it back is just wrong. > We need to declare vm_fault_t earlier (in the arch fault handler) and > pass it in. Did you mean to say "we need to declare vmf (struct vm_fault) earlier (in the arch fault handler) and pass it in." ? > I don't think that creating struct vm_locked_fault is the > right idea either. > > > + if (!(vmlf->fault_flags & FAULT_FLAG_USER)) > > + return -EINVAL; > > + > > + vma = lock_vma_under_rcu(vmlf->mm, vmlf->address); > > + if (!vma) > > + return -EINVAL; > > + > > + if (arch_vma_check_access(vma, vmlf)) { > > + vma_end_read(vma); > > + return -EINVAL; > > + } > > + > > + fault = handle_mm_fault(vma, vmlf->address, > > + vmlf->fault_flags | FAULT_FLAG_VMA_LOCK, > > + vmlf->regs); > > + *ret = fault; > > + > > + if (!(fault & (VM_FAULT_RETRY | VM_FAULT_COMPLETED))) > > + vma_end_read(vma); > > + > > + if ((fault & VM_FAULT_RETRY)) > > + count_vm_vma_lock_event(VMA_LOCK_RETRY); > > + else > > + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); > > + > > + return 0; > > +} > > + > > #endif /* CONFIG_PER_VMA_LOCK */ > > > > #ifndef __PAGETABLE_P4D_FOLDED > > -- > > 2.27.0 > > > >
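The override convention Matthew is pointing at is the usual kernel "#ifndef plus self-#define" pattern. A minimal sketch of how an architecture would supply its own check under that scheme (the names, placement and parameter list are placeholders, not the RFC's final API):

/*
 * Hypothetical arch header (included before the generic header): the arch
 * provides its own check and announces it with a self-referencing #define.
 */
static inline bool vma_check_access_arch_foo(struct vm_area_struct *vma,
					     unsigned long vm_flags)
{
	if ((vma->vm_flags & vm_flags) == 0)
		return true;	/* access not permitted by this VMA */
	/* extra arch-specific checks (e.g. protection keys) would go here */
	return false;
}
#define vma_check_access vma_check_access_arch_foo

/*
 * Generic fallback in common code, only used when no arch override was
 * defined.  Returns true when the access is NOT permitted, matching the
 * failure-path use in try_vma_locked_page_fault() above.
 */
#ifndef vma_check_access
static inline bool vma_check_access(struct vm_area_struct *vma,
				    unsigned long vm_flags)
{
	return (vma->vm_flags & vm_flags) == 0;
}
#endif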
Re: [PATCH v4 00/33] Per-VMA locks
On Tue, Jul 11, 2023 at 4:09 AM Leon Romanovsky wrote: > > On Tue, Jul 11, 2023 at 02:01:41PM +0300, Leon Romanovsky wrote: > > On Tue, Jul 11, 2023 at 12:39:34PM +0200, Vlastimil Babka wrote: > > > On 7/11/23 12:35, Leon Romanovsky wrote: > > > > > > > > On Mon, Feb 27, 2023 at 09:35:59AM -0800, Suren Baghdasaryan wrote: > > > > > > > > <...> > > > > > > > >> Laurent Dufour (1): > > > >> powerc/mm: try VMA lock-based page fault handling first > > > > > > > > Hi, > > > > > > > > This series and specifically the commit above broke docker over PPC. > > > > It causes to docker service stuck while trying to activate. Revert of > > > > this commit allows us to use docker again. > > > > > > Hi, > > > > > > there have been follow-up fixes, that are part of 6.4.3 stable (also > > > 6.5-rc1) Does that version work for you? > > > > I'll recheck it again on clean system, but for the record: > > 1. We are running 6.5-rc1 kernels. > > 2. PPC doesn't compile for us on -rc1 without this fix. > > https://lore.kernel.org/all/20230629124500.1.I55e2f4e7903d686c4484cb23c033c6a9e1a9d4c4@changeid/ > > Ohh, I see it in -rc1, let's recheck. Hi Leon, Please let us know how it goes. > > > 3. I didn't see anything relevant -rc1 with "git log > > arch/powerpc/mm/fault.c". The fixes Vlastimil was referring to are not in the fault.c, they are in the main mm and fork code. More specifically, check for these patches to exist in the branch you are testing: mm: lock newly mapped VMA with corrected ordering fork: lock VMAs of the parent process when forking mm: lock newly mapped VMA which can be modified after it becomes visible mm: lock a vma before stack expansion Thanks, Suren. > > > > Do you have in mind anything specific to check? > > > > Thanks > > > > -- > To unsubscribe from this group and stop receiving emails from it, send an > email to kernel-team+unsubscr...@android.com. >
Re: [PATCH] mm: lock newly mapped VMA with corrected ordering
On Sat, Jul 8, 2023 at 4:10 PM Suren Baghdasaryan wrote: > > On Sat, Jul 8, 2023 at 4:04 PM Hugh Dickins wrote: > > > > Lockdep is certainly right to complain about > > (&vma->vm_lock->lock){}-{3:3}, at: vma_start_write+0x2d/0x3f > >but task is already holding lock: > > (&mapping->i_mmap_rwsem){+.+.}-{3:3}, at: mmap_region+0x4dc/0x6db > > Invert those to the usual ordering. > > Doh! Thanks Hugh! > I totally forgot to run this with lockdep enabled :( I verified both the lockdep warning and the fix. Thanks again, Hugh! Tested-by: Suren Baghdasaryan > > > > > Fixes: 33313a747e81 ("mm: lock newly mapped VMA which can be modified after > > it becomes visible") > > Cc: sta...@vger.kernel.org > > Signed-off-by: Hugh Dickins > > --- > > mm/mmap.c | 4 ++-- > > 1 file changed, 2 insertions(+), 2 deletions(-) > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > index 84c71431a527..3eda23c9ebe7 100644 > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -2809,11 +2809,11 @@ unsigned long mmap_region(struct file *file, > > unsigned long addr, > > if (vma_iter_prealloc(&vmi)) > > goto close_and_free_vma; > > > > + /* Lock the VMA since it is modified after insertion into VMA tree > > */ > > + vma_start_write(vma); > > if (vma->vm_file) > > i_mmap_lock_write(vma->vm_file->f_mapping); > > > > - /* Lock the VMA since it is modified after insertion into VMA tree > > */ > > - vma_start_write(vma); > > vma_iter_store(&vmi, vma); > > mm->map_count++; > > if (vma->vm_file) { > > -- > > 2.35.3
Re: [PATCH] mm: lock newly mapped VMA with corrected ordering
On Sat, Jul 8, 2023 at 4:04 PM Hugh Dickins wrote: > > Lockdep is certainly right to complain about > (&vma->vm_lock->lock){}-{3:3}, at: vma_start_write+0x2d/0x3f >but task is already holding lock: > (&mapping->i_mmap_rwsem){+.+.}-{3:3}, at: mmap_region+0x4dc/0x6db > Invert those to the usual ordering. Doh! Thanks Hugh! I totally forgot to run this with lockdep enabled :( > > Fixes: 33313a747e81 ("mm: lock newly mapped VMA which can be modified after > it becomes visible") > Cc: sta...@vger.kernel.org > Signed-off-by: Hugh Dickins > --- > mm/mmap.c | 4 ++-- > 1 file changed, 2 insertions(+), 2 deletions(-) > > diff --git a/mm/mmap.c b/mm/mmap.c > index 84c71431a527..3eda23c9ebe7 100644 > --- a/mm/mmap.c > +++ b/mm/mmap.c > @@ -2809,11 +2809,11 @@ unsigned long mmap_region(struct file *file, unsigned > long addr, > if (vma_iter_prealloc(&vmi)) > goto close_and_free_vma; > > + /* Lock the VMA since it is modified after insertion into VMA tree */ > + vma_start_write(vma); > if (vma->vm_file) > i_mmap_lock_write(vma->vm_file->f_mapping); > > - /* Lock the VMA since it is modified after insertion into VMA tree */ > - vma_start_write(vma); > vma_iter_store(&vmi, vma); > mm->map_count++; > if (vma->vm_file) { > -- > 2.35.3
Re: [PATCH v2 3/3] fork: lock VMAs of the parent process when forking
On Sat, Jul 8, 2023 at 3:54 PM Linus Torvalds wrote: > > On Sat, 8 Jul 2023 at 15:36, Suren Baghdasaryan wrote: > > > > On Sat, Jul 8, 2023 at 2:18 PM Linus Torvalds > > > > > > Again - maybe I messed up, but it really feels like the missing > > > vma_start_write() was more fundamental, and not some "TLB coherency" > > > issue. > > > > Sounds plausible. I'll try to use the reproducer to verify if that's > > indeed happening here. > > I really don't think that's what people are reporting, I was just > trying to make up a completely different case that has nothing to do > with any TLB issues. > > My real point was simply this one: > > > It's likely there are multiple problematic > > scenarios due to this missing lock though. > > Right. That's my issue. I felt your explanation was *too* targeted at > some TLB non-coherency thing, when I think the problem was actually a > much larger "page faults simply must not happen while we're copying > the page tables because data isn't coherent". > > The anon_vma case was just meant as another random example of the > other kinds of things I suspect can go wrong, because we're simply not > able to do this whole "copy vma while it's being modified by page > faults". > > Now, I agree that the PTE problem is real, and probable the main > thing, ie when we as part of fork() do this: > > /* > * If it's a COW mapping, write protect it both > * in the parent and the child > */ > if (is_cow_mapping(vm_flags) && pte_write(pte)) { > ptep_set_wrprotect(src_mm, addr, src_pte); > pte = pte_wrprotect(pte); > } > > and the thing that can go wrong before the TLB flush happens is that - > because the TLB's haven't been flushed yet - some threads in the > parent happily continue to write to the page and didn't see the > wrprotect happening. > > And then you get into the situation where *some* thread see the page > protections change (maybe they had a TLB flush event on that CPU for > random reasons), and they will take a page fault and do the COW thing > and create a new page. > > And all the while *other* threads still see the old writeable TLB > state, and continue to write to the old page. > > So now you have a page that gets its data copied *while* somebody is > still writing to it, and the end result is that some write easily gets > lost, and so when that new copy is installed, you see it as data > corruption. > > And I agree completely that that is probably the thing that most > people actually saw and reacted to as corruption. > > But the reason I didn't like the explanation was that I think this is > just one random example of the more fundamental issue of "we simply > must not take page faults while copying". > > Your explanation made me think "stale TLB is the problem", and *that* > was what I objected to. The stale TLB was just one random sign of the > much larger problem. > > It might even have been the most common symptom, but I think it was > just a *symptom*, not the *cause* of the problem. > > And I must have been bad at explaining that, because David Hildenbrand > also reacted negatively to my change. > > So I'll happily take a patch that adds more commentary about this, and > gives several examples of the things that go wrong. How about adding your example to the original description as yet another scenario which is broken without this change? I guess having both issues described would not hurt. > > Linus
Re: [PATCH v2 3/3] fork: lock VMAs of the parent process when forking
On Sat, Jul 8, 2023 at 2:18 PM Linus Torvalds wrote: > > On Sat, 8 Jul 2023 at 12:12, Suren Baghdasaryan wrote: > > > > kernel/fork.c | 1 + > > 1 file changed, 1 insertion(+) > > I ended up editing your explanation a lot. > > I'm not convinced that the bug has much to do with the delayed tlb flushing. > > I think it's more fundamental than some tlb coherence issue: our VM > copying simply expects to not have any unrelated concurrent page fault > activity, and various random internal data structures simply rely on > that. > > I made up an example that I'm not sure is relevant to any of the > particular failures, but that I think is a non-TLB case: the parent > 'vma->anon_vma' chain is copied by dup_mmap() in anon_vma_fork(), and > it's possible that the parent vma didn't have any anon_vma associated > with it at that point. > > But a concurrent page fault to the same vma - even *before* the page > tables have been copied, and when the TLB is still entirely coherent - > could then cause a anon_vma_prepare() on that parent vma, and > associate one of the pages with that anon-vma. > > Then the page table copy happens, and that page gets marked read-only > again, and is added to both the parent and the child vma's, but the > child vma never got associated with the parents new anon_vma, because > it didn't exist when anon_vma_fork() happened. > > Does this ever happen? I have no idea. But it would seem to be an > example that really has nothing to do with any TLB state, and is just > simply "we cannot handle concurrent page faults while we're busy > copying the mm". > > Again - maybe I messed up, but it really feels like the missing > vma_start_write() was more fundamental, and not some "TLB coherency" > issue. Sounds plausible. I'll try to use the reproducer to verify if that's indeed happening here. It's likely there are multiple problematic scenarios due to this missing lock though. Thanks, Suren. > > Linus
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On Sat, Jul 8, 2023 at 12:23 PM Linus Torvalds wrote: > > On Sat, 8 Jul 2023 at 12:17, Suren Baghdasaryan wrote: > > > > Do you want me to disable per-VMA locks by default as well? > > No. I seriously believe that if the per-vma locking is so broken that > it needs to be disabled in a development kernel, we should just admit > failure, and revert it all. > > And not in a "revert it for a later attempt" kind of way. > > So it would be a "revert it because it added insurmountable problems > that we couldn't figure out" thing that implies *not* trying it again > in that form at all, and much soul-searching before somebody decides > that they have a more maintainable model for it all. Got it. I hope that's not the case and so far we haven't received an indication that the fixes were insufficient. > > If stable decides that the fixes are not back-portable, and the whole > thing needs to be disabled for stable, that's one thing. But if we > decide that in mainline, it's a "this was a failure" thing. The patches applied cleanly to 6.4.y stable branch the last time I checked, so should not be a problem. Thanks, Suren. > >Linus
Re: [PATCH v2 3/3] fork: lock VMAs of the parent process when forking
On Sat, Jul 8, 2023 at 12:12 PM Suren Baghdasaryan wrote: > > When forking a child process, parent write-protects an anonymous page > and COW-shares it with the child being forked using copy_present_pte(). > Parent's TLB is flushed right before we drop the parent's mmap_lock in > dup_mmap(). If we get a write-fault before that TLB flush in the parent, > and we end up replacing that anonymous page in the parent process in > do_wp_page() (because, COW-shared with the child), this might lead to > some stale writable TLB entries targeting the wrong (old) page. > Similar issue happened in the past with userfaultfd (see flush_tlb_page() > call inside do_wp_page()). > Lock VMAs of the parent process when forking a child, which prevents > concurrent page faults during fork operation and avoids this issue. > This fix can potentially regress some fork-heavy workloads. Kernel build > time did not show noticeable regression on a 56-core machine while a > stress test mapping 1 VMAs and forking 5000 times in a tight loop > shows ~5% regression. If such fork time regression is unacceptable, > disabling CONFIG_PER_VMA_LOCK should restore its performance. Further > optimizations are possible if this regression proves to be problematic. Sending this earlier version of the patch per request from Linus and with his explanation here: https://lore.kernel.org/all/CAHk-=wi-99-DyMOGywTbjRnRRC+XfpPm=r=pei4A=mel0qd...@mail.gmail.com/ > > Suggested-by: David Hildenbrand > Reported-by: Jiri Slaby > Closes: > https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf...@kernel.org/ > Reported-by: Holger Hoffstätte > Closes: > https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a...@applied-asynchrony.com/ > Reported-by: Jacob Young > Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624 > Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first") > Cc: sta...@vger.kernel.org > Signed-off-by: Suren Baghdasaryan > --- > kernel/fork.c | 1 + > 1 file changed, 1 insertion(+) > > diff --git a/kernel/fork.c b/kernel/fork.c > index b85814e614a5..d2e12b6d2b18 100644 > --- a/kernel/fork.c > +++ b/kernel/fork.c > @@ -686,6 +686,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, > for_each_vma(old_vmi, mpnt) { > struct file *file; > > + vma_start_write(mpnt); > if (mpnt->vm_flags & VM_DONTCOPY) { > vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt)); > continue; > -- > 2.41.0.390.g38632f3daf-goog >
[PATCH v2 3/3] fork: lock VMAs of the parent process when forking
When forking a child process, parent write-protects an anonymous page and COW-shares it with the child being forked using copy_present_pte(). Parent's TLB is flushed right before we drop the parent's mmap_lock in dup_mmap(). If we get a write-fault before that TLB flush in the parent, and we end up replacing that anonymous page in the parent process in do_wp_page() (because, COW-shared with the child), this might lead to some stale writable TLB entries targeting the wrong (old) page. Similar issue happened in the past with userfaultfd (see flush_tlb_page() call inside do_wp_page()). Lock VMAs of the parent process when forking a child, which prevents concurrent page faults during fork operation and avoids this issue. This fix can potentially regress some fork-heavy workloads. Kernel build time did not show noticeable regression on a 56-core machine while a stress test mapping 1 VMAs and forking 5000 times in a tight loop shows ~5% regression. If such fork time regression is unacceptable, disabling CONFIG_PER_VMA_LOCK should restore its performance. Further optimizations are possible if this regression proves to be problematic. Suggested-by: David Hildenbrand Reported-by: Jiri Slaby Closes: https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf...@kernel.org/ Reported-by: Holger Hoffstätte Closes: https://lore.kernel.org/all/b198d649-f4bf-b971-31d0-e8433ec2a...@applied-asynchrony.com/ Reported-by: Jacob Young Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217624 Fixes: 0bff0aaea03e ("x86/mm: try VMA lock-based page fault handling first") Cc: sta...@vger.kernel.org Signed-off-by: Suren Baghdasaryan --- kernel/fork.c | 1 + 1 file changed, 1 insertion(+) diff --git a/kernel/fork.c b/kernel/fork.c index b85814e614a5..d2e12b6d2b18 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -686,6 +686,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, for_each_vma(old_vmi, mpnt) { struct file *file; + vma_start_write(mpnt); if (mpnt->vm_flags & VM_DONTCOPY) { vm_stat_account(mm, mpnt->vm_flags, -vma_pages(mpnt)); continue; -- 2.41.0.390.g38632f3daf-goog
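The ~5% number quoted above comes from a stress test that maps a large number of VMAs and then forks in a tight loop. Below is a rough sketch of that kind of microbenchmark; the VMA count and iteration count are assumptions for illustration, not the exact parameters behind the quoted figure.

#include <stdio.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

#define NVMAS  10000
#define NFORKS 5000

int main(void)
{
        long pagesz = sysconf(_SC_PAGESIZE);

        /* Alternate protections so neighbouring mappings cannot merge and the
         * address space really ends up with NVMAS separate VMAs to copy. */
        for (int i = 0; i < NVMAS; i++) {
                int prot = (i & 1) ? PROT_READ : PROT_READ | PROT_WRITE;
                if (mmap(NULL, pagesz, prot,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0) == MAP_FAILED) {
                        perror("mmap");
                        return 1;
                }
        }

        struct timespec start, end;
        clock_gettime(CLOCK_MONOTONIC, &start);

        for (int i = 0; i < NFORKS; i++) {
                pid_t pid = fork();
                if (pid < 0) { perror("fork"); return 1; }
                if (pid == 0)
                        _exit(0);
                waitpid(pid, NULL, 0);
        }

        clock_gettime(CLOCK_MONOTONIC, &end);
        double secs = (end.tv_sec - start.tv_sec) +
                      (end.tv_nsec - start.tv_nsec) / 1e9;
        printf("%d forks over %d VMAs: %.3f s (%.1f us per fork)\n",
               NFORKS, NVMAS, secs, secs * 1e6 / NFORKS);
        return 0;
}

Comparing the output of a CONFIG_PER_VMA_LOCK=y kernel before and after this patch gives a rough picture of the per-VMA vma_start_write() cost added to dup_mmap().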
[PATCH v2 2/3] mm: lock newly mapped VMA which can be modified after it becomes visible
mmap_region adds a newly created VMA into VMA tree and might modify it afterwards before dropping the mmap_lock. This poses a problem for page faults handled under per-VMA locks because they don't take the mmap_lock and can stumble on this VMA while it's still being modified. Currently this does not pose a problem since post-addition modifications are done only for file-backed VMAs, which are not handled under per-VMA lock. However, once support for handling file-backed page faults with per-VMA locks is added, this will become a race. Fix this by write-locking the VMA before inserting it into the VMA tree. Other places where a new VMA is added into VMA tree do not modify it after the insertion, so do not need the same locking. Cc: sta...@vger.kernel.org Signed-off-by: Suren Baghdasaryan --- mm/mmap.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/mm/mmap.c b/mm/mmap.c index c66e4622a557..84c71431a527 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2812,6 +2812,8 @@ unsigned long mmap_region(struct file *file, unsigned long addr, if (vma->vm_file) i_mmap_lock_write(vma->vm_file->f_mapping); + /* Lock the VMA since it is modified after insertion into VMA tree */ + vma_start_write(vma); vma_iter_store(&vmi, vma); mm->map_count++; if (vma->vm_file) { -- 2.41.0.390.g38632f3daf-goog
[PATCH v2 1/3] mm: lock a vma before stack expansion
With recent changes necessitating mmap_lock to be held for write while expanding a stack, per-VMA locks should follow the same rules and be write-locked to prevent page faults into the VMA being expanded. Add the necessary locking. Cc: sta...@vger.kernel.org Signed-off-by: Suren Baghdasaryan --- mm/mmap.c | 4 1 file changed, 4 insertions(+) diff --git a/mm/mmap.c b/mm/mmap.c index 204ddcd52625..c66e4622a557 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -1977,6 +1977,8 @@ static int expand_upwards(struct vm_area_struct *vma, unsigned long address) return -ENOMEM; } + /* Lock the VMA before expanding to prevent concurrent page faults */ + vma_start_write(vma); /* * vma->vm_start/vm_end cannot change under us because the caller * is required to hold the mmap_lock in read mode. We need the @@ -2064,6 +2066,8 @@ int expand_downwards(struct vm_area_struct *vma, unsigned long address) return -ENOMEM; } + /* Lock the VMA before expanding to prevent concurrent page faults */ + vma_start_write(vma); /* * vma->vm_start/vm_end cannot change under us because the caller * is required to hold the mmap_lock in read mode. We need the -- 2.41.0.390.g38632f3daf-goog
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On Sat, Jul 8, 2023 at 11:05 AM Linus Torvalds wrote: > > On Sat, 8 Jul 2023 at 10:39, Andrew Morton wrote: > > > > That was the v1 fix, but after some discussion > > (https://lkml.kernel.org/r/20230705063711.2670599-1-sur...@google.com) > > it was decided to take the "excessive" approach. > > That makes absolutely _zero_ sense. > > It seems to be complete voodoo programming. > > To some degree I don't care what happens in stable kernels, but > there's no way we'll do that kind of thing in mainline without some > logic or reason, when it makes no sense. > > flush_cache_dup_mm() is entirely irrelevant to the whole issue, for > several reason, but the core one being that it only matters on broken > virtually indexed caches, so none of the architectures that do per-vma > locking. > > And the argument that "After the mmap_write_lock_killable(), there > will still be a period where page faults can happen" may be true > (that's kind of the *point* of per-vma locking), but it's irrelevant. > > It's true for *all* users of mmap_write_lock_killable, whether in fork > or anywhere else. What makes fork() so magically special? > > It's why we have that vma_start_write(), to say "I'm now modifying > *this* vma, so stop accessing it in parallel". > > Because no, flush_cache_dup_mm() is not the magical reason to do that thing. My understanding was that flush_cache_dup_mm() is there to ensure nothing is in the cache, so locking VMAs before doing that would ensure that no page faults would pollute the caches after we flushed them. Is that reasoning incorrect? > > Maybe there is something else going on, but no, we don't write crazy > code without a reason for it. That's completely unmaintainable, > because people will look at that code, not understand it (because > there is nothing to understand) and be afraid to touch it. For no > actual reason. > > The obvious place to say "I'm now starting to modify the vma" is when > you actually start to modify the vma. > > > Also, this change needs a couple more updates: > > Those updates seem sane, and come with explanations of why they exist. > Looks fine to me. > > Suren, please send me the proper fixes. Not the voodoo one. The ones > you can explain. Ok, I think these two are non-controversial: https://lkml.kernel.org/r/20230707043211.3682710-1-sur...@google.com https://lkml.kernel.org/r/20230707043211.3682710-2-sur...@google.com and the question now is how we fix the fork() case: https://lore.kernel.org/all/20230706011400.2949242-2-sur...@google.com/ (if my above explanation makes sense to you) or https://lore.kernel.org/all/20230705063711.2670599-2-sur...@google.com/ Please let me know which ones and I'll send you the patchset including these patches. Thanks, Suren. > > And if stable wants to do something else, then that's fine. But for > the development kernel,. we have two options: > > - fix the PER_VMA_LOCK code > > - decide that it's not worth it, and just revert it all > > and honestly, I'm ok with that second option, simply because this has > all been way too much pain. > > But no, we don't mark it broken thinking we can't deal with it, or do > random non-sensible code code we can't explain. > > Linus
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On Wed, Jul 5, 2023 at 9:14 AM Suren Baghdasaryan wrote: > > On Wed, Jul 5, 2023 at 8:49 AM Andrew Morton > wrote: > > > > On Wed, 5 Jul 2023 10:51:57 +0200 "Linux regression tracking (Thorsten > > Leemhuis)" wrote: > > > > > >>> I'm in wait-a-few-days-mode on this. To see if we have a backportable > > > >>> fix rather than disabling the feature in -stable. > > > > > > Andrew, how long will you remain in "wait-a-few-days-mode"? Given what > > > Greg said below and that we already had three reports I know of I'd > > > prefer if we could fix this rather sooner than later in mainline -- > > > especially as Arch Linux and openSUSE Tumbleweed likely have switched to > > > 6.4.y already or will do so soon. > > > > I'll send today's 2-patch series to Linus today or tomorrow. > > I need to make a correction to the patch fixing the issue (the first > one in the series) and will post it within an hour. Thanks! Corrected patch is posted at https://lore.kernel.org/all/20230705171213.2843068-2-sur...@google.com/. The other patch is unchanged but I posted v3 of the entire patchset anyway. Andrew, could you please replace the old version with this one before sending it to Linus? Thanks, Suren.
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On Wed, Jul 5, 2023 at 8:49 AM Andrew Morton wrote: > > On Wed, 5 Jul 2023 10:51:57 +0200 "Linux regression tracking (Thorsten > Leemhuis)" wrote: > > > >>> I'm in wait-a-few-days-mode on this. To see if we have a backportable > > >>> fix rather than disabling the feature in -stable. > > > > Andrew, how long will you remain in "wait-a-few-days-mode"? Given what > > Greg said below and that we already had three reports I know of I'd > > prefer if we could fix this rather sooner than later in mainline -- > > especially as Arch Linux and openSUSE Tumbleweed likely have switched to > > 6.4.y already or will do so soon. > > I'll send today's 2-patch series to Linus today or tomorrow. I need to make a correction to the patch fixing the issue (the first one in the series) and will post it within an hour. Thanks!
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On Tue, Jul 4, 2023 at 3:04 PM Suren Baghdasaryan wrote: > > On Tue, Jul 4, 2023 at 2:28 PM Andrew Morton > wrote: > > > > On Tue, 4 Jul 2023 13:22:54 -0700 Suren Baghdasaryan > > wrote: > > > > > On Tue, Jul 4, 2023 at 9:18 AM Andrew Morton > > > wrote: > > > > > > > > On Tue, 4 Jul 2023 09:00:19 +0100 Greg KH > > > > wrote: > > > > > > > > > > > > > Thanks! I'll investigate this later today. After discussing > > > > > > > > > with > > > > > > > > > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by > > > > > > > > > default until > > > > > > > > > the issue is fixed. I'll post a patch shortly. > > > > > > > > > > > > > > > > Posted at: > > > > > > > > https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/ > > > > > > > > > > > > > > As that change fixes something in 6.4, why not cc: stable on it > > > > > > > as well? > > > > > > > > > > > > Sorry, I thought since per-VMA locks were introduced in 6.4 and this > > > > > > patch is fixing 6.4 I didn't need to send it to stable for older > > > > > > versions. Did I miss something? > > > > > > > > > > 6.4.y is a stable kernel tree right now, so yes, it needs to be > > > > > included > > > > > there :) > > > > > > > > I'm in wait-a-few-days-mode on this. To see if we have a backportable > > > > fix rather than disabling the feature in -stable. > > > > > > Ok, I think we have a fix posted at [2] and it's cleanly applies to > > > 6.4.y stable branch as well. However fork() performance might slightly > > > regress, therefore disabling per-VMA locks by default for now seems to > > > be preferable even with this fix (see discussion at > > > https://lore.kernel.org/all/54cd9ffb-8f4b-003f-c2d6-3b6b0d2cb...@google.com/). > > > IOW, both [1] and [2] should be applied to 6.4.y stable. Both apply > > > cleanly and I CC'ed stable on [2]. Greg, should I send [1] separately > > > to stable@vger? > > > > > > [1] > > > https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/ > > > > This one isn't sufficient for .configs which already have > > PER_VMA_LOCK=y. Using `depends on BROKEN' would be better. > > > > > [2] > > > https://lore.kernel.org/all/20230704200656.2526715-1-sur...@google.com/ > > > > > > > We're still awaiting tester input on this? > > Yeah, and it seems to be negative... Anyway, I'll post a dependency on BROKEN. I posted the patchset at https://lore.kernel.org/all/20230705063711.2670599-1-sur...@google.com/ CC'ing stable@vger with the cover letter explaining the situation. The negative report might have been a fluke, so let's wait for more testing. In the meantime we can disable the feature by applying the last patch in that series. > > > > > I think a clean new fully-changelogged two-patch series would be the > > best way to handle this. Please ensure that the [0/2] intro clearly > > explains what we're proposing here, and why. Done. > > > > Also, "might slightly regress" is a bit weak. These things are > > measurable, no? Because a better solution would be to fix 6.4.x and > > mainline and leave it at that. They are measurable and they were included in the fix I posted. I added the numbers in the new cover letter as well. Thanks, Suren. > >
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On Tue, Jul 4, 2023 at 2:28 PM Andrew Morton wrote: > > On Tue, 4 Jul 2023 13:22:54 -0700 Suren Baghdasaryan > wrote: > > > On Tue, Jul 4, 2023 at 9:18 AM Andrew Morton > > wrote: > > > > > > On Tue, 4 Jul 2023 09:00:19 +0100 Greg KH > > > wrote: > > > > > > > > > > > Thanks! I'll investigate this later today. After discussing with > > > > > > > > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default > > > > > > > > until > > > > > > > > the issue is fixed. I'll post a patch shortly. > > > > > > > > > > > > > > Posted at: > > > > > > > https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/ > > > > > > > > > > > > As that change fixes something in 6.4, why not cc: stable on it as > > > > > > well? > > > > > > > > > > Sorry, I thought since per-VMA locks were introduced in 6.4 and this > > > > > patch is fixing 6.4 I didn't need to send it to stable for older > > > > > versions. Did I miss something? > > > > > > > > 6.4.y is a stable kernel tree right now, so yes, it needs to be included > > > > there :) > > > > > > I'm in wait-a-few-days-mode on this. To see if we have a backportable > > > fix rather than disabling the feature in -stable. > > > > Ok, I think we have a fix posted at [2] and it's cleanly applies to > > 6.4.y stable branch as well. However fork() performance might slightly > > regress, therefore disabling per-VMA locks by default for now seems to > > be preferable even with this fix (see discussion at > > https://lore.kernel.org/all/54cd9ffb-8f4b-003f-c2d6-3b6b0d2cb...@google.com/). > > IOW, both [1] and [2] should be applied to 6.4.y stable. Both apply > > cleanly and I CC'ed stable on [2]. Greg, should I send [1] separately > > to stable@vger? > > > > [1] https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/ > > This one isn't sufficient for .configs which already have > PER_VMA_LOCK=y. Using `depends on BROKEN' would be better. > > > [2] https://lore.kernel.org/all/20230704200656.2526715-1-sur...@google.com/ > > > > We're still awaiting tester input on this? Yeah, and it seems to be negative... Anyway, I'll post a dependency on BROKEN. > > I think a clean new fully-changelogged two-patch series would be the > best way to handle this. Please ensure that the [0/2] intro clearly > explains what we're proposing here, and why. > > Also, "might slightly regress" is a bit weak. These things are > measurable, no? Because a better solution would be to fix 6.4.x and > mainline and leave it at that. >
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On Tue, Jul 4, 2023 at 9:18 AM Andrew Morton wrote: > > On Tue, 4 Jul 2023 09:00:19 +0100 Greg KH wrote: > > > > > > > Thanks! I'll investigate this later today. After discussing with > > > > > > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default > > > > > > until > > > > > > the issue is fixed. I'll post a patch shortly. > > > > > > > > > > Posted at: > > > > > https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/ > > > > > > > > As that change fixes something in 6.4, why not cc: stable on it as well? > > > > > > Sorry, I thought since per-VMA locks were introduced in 6.4 and this > > > patch is fixing 6.4 I didn't need to send it to stable for older > > > versions. Did I miss something? > > > > 6.4.y is a stable kernel tree right now, so yes, it needs to be included > > there :) > > I'm in wait-a-few-days-mode on this. To see if we have a backportable > fix rather than disabling the feature in -stable. Ok, I think we have a fix posted at [2] and it's cleanly applies to 6.4.y stable branch as well. However fork() performance might slightly regress, therefore disabling per-VMA locks by default for now seems to be preferable even with this fix (see discussion at https://lore.kernel.org/all/54cd9ffb-8f4b-003f-c2d6-3b6b0d2cb...@google.com/). IOW, both [1] and [2] should be applied to 6.4.y stable. Both apply cleanly and I CC'ed stable on [2]. Greg, should I send [1] separately to stable@vger? [1] https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/ [2] https://lore.kernel.org/all/20230704200656.2526715-1-sur...@google.com/ >
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On Mon, Jul 3, 2023 at 11:44 AM Greg KH wrote: > > On Mon, Jul 03, 2023 at 11:27:19AM -0700, Suren Baghdasaryan wrote: > > On Mon, Jul 3, 2023 at 11:08 AM Suren Baghdasaryan > > wrote: > > > > > > On Mon, Jul 3, 2023 at 2:53 AM Linux regression tracking (Thorsten > > > Leemhuis) wrote: > > > > > > > > On 02.07.23 14:27, Bagas Sanjaya wrote: > > > > > I notice a regression report on Bugzilla [1]. Quoting from it: > > > > > > > > > >> After upgrading to kernel version 6.4.0 from 6.3.9, I noticed > > > > >> frequent but random crashes in a user space program. After a lot of > > > > >> reduction, I have come up with the following reproducer program: > > > > > [...] > > > > >> After tuning the various parameters for my computer, exit code 2, > > > > >> which indicates that memory corruption was detected, occurs > > > > >> approximately 99% of the time. Exit code 1, which occurs > > > > >> approximately 1% of the time, means it ran out of > > > > >> statically-allocated memory before reproducing the issue, and > > > > >> increasing the memory usage any more only leads to diminishing > > > > >> returns. There is also something like a 0.1% chance that it > > > > >> segfaults due to memory corruption elsewhere than in the > > > > >> statically-allocated buffer. > > > > >> > > > > >> With this reproducer in hand, I was able to perform the following > > > > >> bisection: > > > > > [...] > > > > > > > > > > See Bugzilla for the full thread. > > > > > > > > Additional details from > > > > https://bugzilla.kernel.org/show_bug.cgi?id=217624#c5 : > > > > > > > > ``` > > > > I can confirm that v6.4 with 0bff0aaea03e2a3ed6bfa302155cca8a432a1829 > > > > reverted no longer causes any memory corruption with either my > > > > reproducer or the original program. > > > > ``` > > > > > > > > FWIW: 0bff0aaea03 ("x86/mm: try VMA lock-based page fault handling > > > > first") [merged for v6.4-rc1, authored by Suren Baghdasaryan [already > > > > CCed]] > > > > > > > > That's the same commit that causes build problems with go: > > > > > > > > https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf...@kernel.org/ > > > > > > Thanks! I'll investigate this later today. After discussing with > > > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until > > > the issue is fixed. I'll post a patch shortly. > > > > Posted at: > > https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/ > > As that change fixes something in 6.4, why not cc: stable on it as well? Sorry, I thought since per-VMA locks were introduced in 6.4 and this patch is fixing 6.4 I didn't need to send it to stable for older versions. Did I miss something? Thanks, Suren. > > thanks, > > greg k-h
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On Mon, Jul 3, 2023 at 11:08 AM Suren Baghdasaryan wrote: > > On Mon, Jul 3, 2023 at 2:53 AM Linux regression tracking (Thorsten > Leemhuis) wrote: > > > > On 02.07.23 14:27, Bagas Sanjaya wrote: > > > I notice a regression report on Bugzilla [1]. Quoting from it: > > > > > >> After upgrading to kernel version 6.4.0 from 6.3.9, I noticed frequent > > >> but random crashes in a user space program. After a lot of reduction, I > > >> have come up with the following reproducer program: > > > [...] > > >> After tuning the various parameters for my computer, exit code 2, which > > >> indicates that memory corruption was detected, occurs approximately 99% > > >> of the time. Exit code 1, which occurs approximately 1% of the time, > > >> means it ran out of statically-allocated memory before reproducing the > > >> issue, and increasing the memory usage any more only leads to > > >> diminishing returns. There is also something like a 0.1% chance that it > > >> segfaults due to memory corruption elsewhere than in the > > >> statically-allocated buffer. > > >> > > >> With this reproducer in hand, I was able to perform the following > > >> bisection: > > > [...] > > > > > > See Bugzilla for the full thread. > > > > Additional details from > > https://bugzilla.kernel.org/show_bug.cgi?id=217624#c5 : > > > > ``` > > I can confirm that v6.4 with 0bff0aaea03e2a3ed6bfa302155cca8a432a1829 > > reverted no longer causes any memory corruption with either my > > reproducer or the original program. > > ``` > > > > FWIW: 0bff0aaea03 ("x86/mm: try VMA lock-based page fault handling > > first") [merged for v6.4-rc1, authored by Suren Baghdasaryan [already CCed]] > > > > That's the same commit that causes build problems with go: > > > > https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf...@kernel.org/ > > Thanks! I'll investigate this later today. After discussing with > Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until > the issue is fixed. I'll post a patch shortly. Posted at: https://lore.kernel.org/all/20230703182150.2193578-1-sur...@google.com/ > > > > > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) > > -- > > Everything you wanna know about Linux kernel regression tracking: > > https://linux-regtracking.leemhuis.info/about/#tldr > > If I did something stupid, please tell me, as explained on that page. > > > > #regzbot introduced: 0bff0aaea03e2a3
Re: Fwd: Memory corruption in multithreaded user space program while calling fork
On Mon, Jul 3, 2023 at 2:53 AM Linux regression tracking (Thorsten Leemhuis) wrote: > > On 02.07.23 14:27, Bagas Sanjaya wrote: > > I notice a regression report on Bugzilla [1]. Quoting from it: > > > >> After upgrading to kernel version 6.4.0 from 6.3.9, I noticed frequent but > >> random crashes in a user space program. After a lot of reduction, I have > >> come up with the following reproducer program: > > [...] > >> After tuning the various parameters for my computer, exit code 2, which > >> indicates that memory corruption was detected, occurs approximately 99% of > >> the time. Exit code 1, which occurs approximately 1% of the time, means > >> it ran out of statically-allocated memory before reproducing the issue, > >> and increasing the memory usage any more only leads to diminishing > >> returns. There is also something like a 0.1% chance that it segfaults due > >> to memory corruption elsewhere than in the statically-allocated buffer. > >> > >> With this reproducer in hand, I was able to perform the following > >> bisection: > > [...] > > > > See Bugzilla for the full thread. > > Additional details from > https://bugzilla.kernel.org/show_bug.cgi?id=217624#c5 : > > ``` > I can confirm that v6.4 with 0bff0aaea03e2a3ed6bfa302155cca8a432a1829 > reverted no longer causes any memory corruption with either my > reproducer or the original program. > ``` > > FWIW: 0bff0aaea03 ("x86/mm: try VMA lock-based page fault handling > first") [merged for v6.4-rc1, authored by Suren Baghdasaryan [already CCed]] > > That's the same commit that causes build problems with go: > > https://lore.kernel.org/all/dbdef34c-3a07-5951-e1ae-e9c6e3cdf...@kernel.org/ Thanks! I'll investigate this later today. After discussing with Andrew, we would like to disable CONFIG_PER_VMA_LOCK by default until the issue is fixed. I'll post a patch shortly. > > Ciao, Thorsten (wearing his 'the Linux kernel's regression tracker' hat) > -- > Everything you wanna know about Linux kernel regression tracking: > https://linux-regtracking.leemhuis.info/about/#tldr > If I did something stupid, please tell me, as explained on that page. > > #regzbot introduced: 0bff0aaea03e2a3
Re: [PATCH v4 29/33] x86/mm: try VMA lock-based page fault handling first
On Fri, Jun 30, 2023 at 1:43 AM Jiri Slaby wrote: > > On 30. 06. 23, 10:28, Jiri Slaby wrote: > > > 2348 > > clone3({flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, > > child_tid=0x7fcaa5882990, parent_tid=0x7fcaa5882990, exit_signal=0, > > stack=0x7fcaa5082000, stack_size=0x7ffe00, tls=0x7fcaa58826c0} => > > {parent_tid=[2351]}, 88) = 2351 > > > 2350 <... clone3 resumed> => {parent_tid=[2372]}, 88) = 2372 > > > 2351 <... clone3 resumed> => {parent_tid=[2354]}, 88) = 2354 > > > 2351 <... clone3 resumed> => {parent_tid=[2357]}, 88) = 2357 > > > 2354 <... clone3 resumed> => {parent_tid=[2355]}, 88) = 2355 > > > 2355 <... clone3 resumed> => {parent_tid=[2370]}, 88) = 2370 > > > 2370 mmap(NULL, 262144, PROT_READ|PROT_WRITE, > > MAP_PRIVATE|MAP_ANONYMOUS, -1, 0 > > > 2370 <... mmap resumed>) = 0x7fca68249000 > > > 2372 <... clone3 resumed> => {parent_tid=[2384]}, 88) = 2384 > > > 2384 <... clone3 resumed> => {parent_tid=[2388]}, 88) = 2388 > > > 2388 <... clone3 resumed> => {parent_tid=[2392]}, 88) = 2392 > > > 2392 <... clone3 resumed> => {parent_tid=[2395]}, 88) = 2395 > > > 2395 write(2, "runtime: marked free object in s"..., 36 > ...> > > > > I.e. IIUC, all are threads (CLONE_VM) and thread 2370 mapped ANON > > 0x7fca68249000 - 0x7fca6827 and go in thread 2395 thinks for some > > reason 0x7fca6824bec8 in that region is "bad". Thanks for the analysis Jiri. Is it possible from these logs to identify whether 2370 finished the mmap operation before 2395 tried to access 0x7fca6824bec8? That access has to happen only after mmap finishes mapping the region. > > As I was noticed, this might be as well be a fail of the go's > inter-thread communication (or alike) too. It might now be only more > exposed with vma-based locks as we can do more parallelism now. Yes, with multithreaded processes like these where threads are mapping and accessing memory areas, per-VMA locks should allow for greater parallelism. So, if there is a race like the one I asked above, it might become more pronounced with per-VMA locks. I'll double check the code, but from Kernel's POV mmap would take the mmap_lock for write then will lock the VMA lock for write. That should prevent any page fault handlers from accessing this VMA in parallel until writers release the locks. Page fault path would try to find the VMA without any lock and then will try to read-lock that VMA. If it fails it will fall back to mmap_lock. So, if the writer started first and obtained the VMA lock, the reader will fall back to mmap_lock and will block until the writer releases the mmap_lock. If the reader got VMA read lock first then the writer will block while obtaining the VMA's write lock. However for your scenario, the reader (page fault) might be getting here before the writer (mmap) and upon not finding the VMA it is looking for, it will fail. Please let me know if you can verify this scenario. Thanks, Suren. > > There are older hard to reproduce bugs in go with similar symptoms (we > see this error sometimes now too): > https://github.com/golang/go/issues/15246 > > Or this 2016 bug is a red herring. Hard to tell... > > >> thanks, > -- > js > suse labs >
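For readers trying to picture the scenario under discussion (thread 2370 mapping memory, thread 2395 touching it), here is a stand-in sketch in C rather than Go; it illustrates the access pattern, not the Go runtime's actual code. One thread mmaps anonymous memory and publishes the pointer only after mmap() returns, and a second thread faults the region in as soon as it sees the pointer. The faulting thread can still run concurrently with the mapper's next mmap()/munmap() calls, which is exactly the parallelism the per-VMA lock fault path has to handle by falling back to mmap_lock whenever it cannot find or read-lock the VMA.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define ITERS 10000
#define SZ    (256 * 1024)

static _Atomic(unsigned char *) slot;   /* handoff cell between the threads */

static void *mapper(void *arg)
{
        (void)arg;
        for (int i = 0; i < ITERS; i++) {
                unsigned char *p = mmap(NULL, SZ, PROT_READ | PROT_WRITE,
                                        MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
                if (p == MAP_FAILED) { perror("mmap"); exit(1); }
                atomic_store(&slot, p);         /* publish after mmap() returned */
                while (atomic_load(&slot))      /* wait until the consumer is done */
                        ;
                munmap(p, SZ);
        }
        atomic_store(&slot, (unsigned char *)MAP_FAILED);   /* done marker */
        return NULL;
}

int main(void)
{
        pthread_t t;

        if (pthread_create(&t, NULL, mapper, NULL))
                return 1;

        for (;;) {
                unsigned char *p = atomic_load(&slot);
                if (!p)
                        continue;
                if (p == (unsigned char *)MAP_FAILED)
                        break;
                memset(p, 0xaa, SZ);            /* fault the fresh VMA in */
                if (p[SZ - 1] != 0xaa) {        /* sanity check on the handoff */
                        fprintf(stderr, "unexpected value after fault-in\n");
                        return 2;
                }
                atomic_store(&slot, NULL);      /* hand the region back */
        }
        pthread_join(t, NULL);
        return 0;
}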
Re: [PATCH v4 29/33] x86/mm: try VMA lock-based page fault handling first
On Thu, Jun 29, 2023 at 7:40 AM Jiri Slaby wrote: > > Hi, > > On 27. 02. 23, 18:36, Suren Baghdasaryan wrote: > > Attempt VMA lock-based page fault handling first, and fall back to the > > existing mmap_lock-based handling if that fails. > > > > Signed-off-by: Suren Baghdasaryan > > --- > > arch/x86/Kconfig| 1 + > > arch/x86/mm/fault.c | 36 > > 2 files changed, 37 insertions(+) > > > > diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig > > index a825bf031f49..df21fba77db1 100644 > > --- a/arch/x86/Kconfig > > +++ b/arch/x86/Kconfig > > @@ -27,6 +27,7 @@ config X86_64 > > # Options that are inherently 64-bit kernel only: > > select ARCH_HAS_GIGANTIC_PAGE > > select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 > > + select ARCH_SUPPORTS_PER_VMA_LOCK > > select ARCH_USE_CMPXCHG_LOCKREF > > select HAVE_ARCH_SOFT_DIRTY > > select MODULES_USE_ELF_RELA > > diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c > > index a498ae1fbe66..e4399983c50c 100644 > > --- a/arch/x86/mm/fault.c > > +++ b/arch/x86/mm/fault.c > > @@ -19,6 +19,7 @@ > > #include /* faulthandler_disabled() */ > > #include /* > > efi_crash_gracefully_on_page_fault()*/ > > #include > > +#include /* find_and_lock_vma() */ > > > > #include /* boot_cpu_has, ...*/ > > #include /* dotraplinkage, ... > > */ > > @@ -1333,6 +1334,38 @@ void do_user_addr_fault(struct pt_regs *regs, > > } > > #endif > > > > +#ifdef CONFIG_PER_VMA_LOCK > > + if (!(flags & FAULT_FLAG_USER)) > > + goto lock_mmap; > > + > > + vma = lock_vma_under_rcu(mm, address); > > + if (!vma) > > + goto lock_mmap; > > + > > + if (unlikely(access_error(error_code, vma))) { > > + vma_end_read(vma); > > + goto lock_mmap; > > + } > > + fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, > > regs); > > + vma_end_read(vma); > > + > > + if (!(fault & VM_FAULT_RETRY)) { > > + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); > > + goto done; > > + } > > + count_vm_vma_lock_event(VMA_LOCK_RETRY); > > This is apparently not strong enough as it causes go build failures like: > > [ 409s] strconv > [ 409s] releasep: m=0x579e2000 m->p=0x5781c600 p->m=0x0 p->status=2 > [ 409s] fatal error: releasep: invalid p state > [ 409s] > > [ 325s] hash/adler32 > [ 325s] hash/crc32 > [ 325s] cmd/internal/codesign > [ 336s] fatal error: runtime: out of memory Hi Jiri, Thanks for reporting! I'm not familiar with go builds. Could you please explain the error to me or point me to some documentation to decipher that error? Thanks, Suren. > > There are many kinds of similar errors. It happens in 1-3 out of 20 > builds only. > > If I revert the commit on top of 6.4, they all dismiss. Any idea? > > The downstream report: > https://bugzilla.suse.com/show_bug.cgi?id=1212775 > > > + > > + /* Quick path to respond to signals */ > > + if (fault_signal_pending(fault, regs)) { > > + if (!user_mode(regs)) > > + kernelmode_fixup_or_oops(regs, error_code, address, > > + SIGBUS, BUS_ADRERR, > > + ARCH_DEFAULT_PKEY); > > + return; > > + } > > +lock_mmap: > > +#endif /* CONFIG_PER_VMA_LOCK */ > > + > > /* > >* Kernel-mode access to the user address space should only occur > >* on well-defined single instructions listed in the exception > > @@ -1433,6 +1466,9 @@ void do_user_addr_fault(struct pt_regs *regs, > > } > > > > mmap_read_unlock(mm); > > +#ifdef CONFIG_PER_VMA_LOCK > > +done: > > +#endif > > if (likely(!(fault & VM_FAULT_ERROR))) > > return; > > > > thanks, > -- > js > suse labs >
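The #ifdef CONFIG_PER_VMA_LOCK hunk above relies on the VMA lock and sequence-count protocol. Below is a simplified userspace model of that protocol, written on the assumption that it mirrors the upstream design; the struct layout, field names and locking primitives are illustrative, not the kernel implementation.

#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>

struct mm_model {
        pthread_rwlock_t mmap_lock;
        atomic_int mm_lock_seq;         /* bumped by vma_end_write_all() */
};

struct vma_model {
        struct mm_model *mm;
        pthread_rwlock_t lock;
        int vm_lock_seq;                /* equals mm_lock_seq while write-locked */
};

/* Writer side: caller holds mm->mmap_lock for write. */
static void vma_start_write(struct vma_model *vma)
{
        if (vma->vm_lock_seq == atomic_load(&vma->mm->mm_lock_seq))
                return;                 /* already write-locked in this cycle */
        pthread_rwlock_wrlock(&vma->lock);      /* waits out current readers */
        vma->vm_lock_seq = atomic_load(&vma->mm->mm_lock_seq);
        pthread_rwlock_unlock(&vma->lock);
}

/* Called when the writer drops mmap_lock: releases every write-locked VMA. */
static void vma_end_write_all(struct mm_model *mm)
{
        atomic_fetch_add(&mm->mm_lock_seq, 1);
}

/* Reader side (page fault): a trylock that fails whenever a writer may own
 * the VMA, so the caller falls back to taking mmap_lock for read. */
static bool vma_start_read(struct vma_model *vma)
{
        if (vma->vm_lock_seq == atomic_load(&vma->mm->mm_lock_seq))
                return false;           /* a writer holds this VMA */
        if (pthread_rwlock_tryrdlock(&vma->lock) != 0)
                return false;           /* writer is inside vma_start_write() */
        if (vma->vm_lock_seq == atomic_load(&vma->mm->mm_lock_seq)) {
                pthread_rwlock_unlock(&vma->lock);
                return false;           /* lost the race with a writer */
        }
        return true;
}

static void vma_end_read(struct vma_model *vma)
{
        pthread_rwlock_unlock(&vma->lock);
}

A fault handler built on this model calls vma_start_read() and, whenever it returns false, takes mmap_lock for read instead, which corresponds to the goto lock_mmap fallback in the patch quoted above.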
Re: [PATCH v4 0/7] introduce vm_flags modifier functions
On Fri, Mar 17, 2023 at 3:41 PM Alex Williamson wrote: > > On Fri, 17 Mar 2023 12:08:32 -0700 > Suren Baghdasaryan wrote: > > > On Tue, Mar 14, 2023 at 1:11 PM Alex Williamson > > wrote: > > > > > > On Thu, 26 Jan 2023 11:37:45 -0800 > > > Suren Baghdasaryan wrote: > > > > > > > This patchset was originally published as a part of per-VMA locking [1] > > > > and > > > > was split after suggestion that it's viable on its own and to facilitate > > > > the review process. It is now a preprequisite for the next version of > > > > per-VMA > > > > lock patchset, which reuses vm_flags modifier functions to lock the VMA > > > > when > > > > vm_flags are being updated. > > > > > > > > VMA vm_flags modifications are usually done under exclusive mmap_lock > > > > protection because this attrubute affects other decisions like VMA > > > > merging > > > > or splitting and races should be prevented. Introduce vm_flags modifier > > > > functions to enforce correct locking. > > > > > > > > The patchset applies cleanly over mm-unstable branch of mm tree. > > > > > > With this series, vfio-pci developed a bunch of warnings around not > > > holding the mmap_lock write semaphore while calling > > > io_remap_pfn_range() from our fault handler, vfio_pci_mmap_fault(). > > > > > > I suspect vdpa has the same issue for their use of remap_pfn_range() > > > from their fault handler, JasonW, MST, FYI. > > > > > > It also looks like gru_fault() would have the same issue, Dimitri. > > > > > > In all cases, we're preemptively setting vm_flags to what > > > remap_pfn_range_notrack() uses, so I thought we were safe here as I > > > specifically remember trying to avoid changing vm_flags from the > > > fault handler. But apparently that doesn't take into account > > > track_pfn_remap() where VM_PAT comes into play. > > > > > > The reason for using remap_pfn_range() on fault in vfio-pci is that > > > we're mapping device MMIO to userspace, where that MMIO can be disabled > > > and we'd rather zap the mapping when that occurs so that we can sigbus > > > the user rather than allow the user to trigger potentially fatal bus > > > errors on the host. > > > > > > Peter Xu has suggested offline that a non-lazy approach to reinsert the > > > mappings might be more inline with mm expectations relative to touching > > > vm_flags during fault. What's the right solution here? Can the fault > > > handling be salvaged, is proactive remapping the right approach, or is > > > there something better? Thanks, > > > > Hi Alex, > > If in your case it's safe to change vm_flags without holding exclusive > > mmap_lock, maybe you can use __vm_flags_mod() the way I used it in > > https://lore.kernel.org/all/20230126193752.297968-7-sur...@google.com, > > while explaining why this should be safe? > > Hi Suren, > > Thanks for the reply, but I'm not sure I'm following. Are you > suggesting a bool arg added to io_remap_pfn_range(), or some new > variant of that function to conditionally use __vm_flags_mod() in place > of vm_flags_set() across the call chain? Thanks, I think either way could work but after taking a closer look, both ways would be quite ugly. If we could somehow identify that we are handling a page fault and use __vm_flags_mod() without additional parameters it would be more palatable IMHO... Peter's suggestion to avoid touching vm_flags during fault would be much cleaner but I'm not sure how easily that can be done. 
> > Alex >
Re: [PATCH v4 0/7] introduce vm_flags modifier functions
On Tue, Mar 14, 2023 at 1:11 PM Alex Williamson wrote: > > On Thu, 26 Jan 2023 11:37:45 -0800 > Suren Baghdasaryan wrote: > > > This patchset was originally published as a part of per-VMA locking [1] and > > was split after suggestion that it's viable on its own and to facilitate > > the review process. It is now a preprequisite for the next version of > > per-VMA > > lock patchset, which reuses vm_flags modifier functions to lock the VMA when > > vm_flags are being updated. > > > > VMA vm_flags modifications are usually done under exclusive mmap_lock > > protection because this attrubute affects other decisions like VMA merging > > or splitting and races should be prevented. Introduce vm_flags modifier > > functions to enforce correct locking. > > > > The patchset applies cleanly over mm-unstable branch of mm tree. > > With this series, vfio-pci developed a bunch of warnings around not > holding the mmap_lock write semaphore while calling > io_remap_pfn_range() from our fault handler, vfio_pci_mmap_fault(). > > I suspect vdpa has the same issue for their use of remap_pfn_range() > from their fault handler, JasonW, MST, FYI. > > It also looks like gru_fault() would have the same issue, Dimitri. > > In all cases, we're preemptively setting vm_flags to what > remap_pfn_range_notrack() uses, so I thought we were safe here as I > specifically remember trying to avoid changing vm_flags from the > fault handler. But apparently that doesn't take into account > track_pfn_remap() where VM_PAT comes into play. > > The reason for using remap_pfn_range() on fault in vfio-pci is that > we're mapping device MMIO to userspace, where that MMIO can be disabled > and we'd rather zap the mapping when that occurs so that we can sigbus > the user rather than allow the user to trigger potentially fatal bus > errors on the host. > > Peter Xu has suggested offline that a non-lazy approach to reinsert the > mappings might be more inline with mm expectations relative to touching > vm_flags during fault. What's the right solution here? Can the fault > handling be salvaged, is proactive remapping the right approach, or is > there something better? Thanks, Hi Alex, If in your case it's safe to change vm_flags without holding exclusive mmap_lock, maybe you can use __vm_flags_mod() the way I used it in https://lore.kernel.org/all/20230126193752.297968-7-sur...@google.com, while explaining why this should be safe? > > Alex >
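As a side note on the design Alex describes (zap the mapping when the device MMIO is disabled so the user takes SIGBUS instead of triggering potentially fatal bus errors), the user-visible contract can be illustrated from plain userspace with an analogous case where the backing of a mapping disappears. In the sketch below the backing is removed with ftruncate() rather than by disabling MMIO, so this is only an analogy, not vfio-pci behaviour:

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

static sigjmp_buf env;

static void on_sigbus(int sig)
{
        (void)sig;
        siglongjmp(env, 1);
}

int main(void)
{
        char path[] = "/tmp/sigbus-demo-XXXXXX";
        int fd = mkstemp(path);
        if (fd < 0) { perror("mkstemp"); return 1; }
        unlink(path);

        if (ftruncate(fd, 4096) < 0) { perror("ftruncate"); return 1; }
        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED) { perror("mmap"); return 1; }

        p[0] = 1;                       /* backing present: the store succeeds */

        if (ftruncate(fd, 0) < 0) { perror("ftruncate"); return 1; }

        signal(SIGBUS, on_sigbus);
        if (sigsetjmp(env, 1) == 0) {
                p[0] = 2;               /* backing gone: this access faults */
                puts("no SIGBUS (unexpected)");
                return 1;
        }
        puts("got SIGBUS once the backing was removed, as expected");
        return 0;
}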
Re: [PATCH v4 31/33] powerc/mm: try VMA lock-based page fault handling first
On Mon, Feb 27, 2023 at 9:37 AM Suren Baghdasaryan wrote: > > From: Laurent Dufour > > Attempt VMA lock-based page fault handling first, and fall back to the > existing mmap_lock-based handling if that fails. > Copied from "x86/mm: try VMA lock-based page fault handling first" Hi Andrew, Laurent posted a fix for this patch at https://lore.kernel.org/all/20230306154244.17560-1-lduf...@linux.ibm.com/. Could you please squash the fix into this patch? Thanks, Suren. > > Signed-off-by: Laurent Dufour > Signed-off-by: Suren Baghdasaryan > --- > arch/powerpc/mm/fault.c| 41 ++ > arch/powerpc/platforms/powernv/Kconfig | 1 + > arch/powerpc/platforms/pseries/Kconfig | 1 + > 3 files changed, 43 insertions(+) > > diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c > index 2bef19cc1b98..c7ae86b04b8a 100644 > --- a/arch/powerpc/mm/fault.c > +++ b/arch/powerpc/mm/fault.c > @@ -469,6 +469,44 @@ static int ___do_page_fault(struct pt_regs *regs, > unsigned long address, > if (is_exec) > flags |= FAULT_FLAG_INSTRUCTION; > > +#ifdef CONFIG_PER_VMA_LOCK > + if (!(flags & FAULT_FLAG_USER)) > + goto lock_mmap; > + > + vma = lock_vma_under_rcu(mm, address); > + if (!vma) > + goto lock_mmap; > + > + if (unlikely(access_pkey_error(is_write, is_exec, > + (error_code & DSISR_KEYFAULT), vma))) { > + int rc = bad_access_pkey(regs, address, vma); > + > + vma_end_read(vma); > + return rc; > + } > + > + if (unlikely(access_error(is_write, is_exec, vma))) { > + int rc = bad_access(regs, address); > + > + vma_end_read(vma); > + return rc; > + } > + > + fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, > regs); > + vma_end_read(vma); > + > + if (!(fault & VM_FAULT_RETRY)) { > + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); > + goto done; > + } > + count_vm_vma_lock_event(VMA_LOCK_RETRY); > + > + if (fault_signal_pending(fault, regs)) > + return user_mode(regs) ? 0 : SIGBUS; > + > +lock_mmap: > +#endif /* CONFIG_PER_VMA_LOCK */ > + > /* When running in the kernel we expect faults to occur only to > * addresses in user space. All other faults represent errors in the > * kernel and should generate an OOPS. Unfortunately, in the case of > an > @@ -545,6 +583,9 @@ static int ___do_page_fault(struct pt_regs *regs, > unsigned long address, > > mmap_read_unlock(current->mm); > > +#ifdef CONFIG_PER_VMA_LOCK > +done: > +#endif > if (unlikely(fault & VM_FAULT_ERROR)) > return mm_fault_error(regs, address, fault); > > diff --git a/arch/powerpc/platforms/powernv/Kconfig > b/arch/powerpc/platforms/powernv/Kconfig > index ae248a161b43..70a46acc70d6 100644 > --- a/arch/powerpc/platforms/powernv/Kconfig > +++ b/arch/powerpc/platforms/powernv/Kconfig > @@ -16,6 +16,7 @@ config PPC_POWERNV > select PPC_DOORBELL > select MMU_NOTIFIER > select FORCE_SMP > + select ARCH_SUPPORTS_PER_VMA_LOCK > default y > > config OPAL_PRD > diff --git a/arch/powerpc/platforms/pseries/Kconfig > b/arch/powerpc/platforms/pseries/Kconfig > index b481c5c8bae1..9c205fe0e619 100644 > --- a/arch/powerpc/platforms/pseries/Kconfig > +++ b/arch/powerpc/platforms/pseries/Kconfig > @@ -21,6 +21,7 @@ config PPC_PSERIES > select HOTPLUG_CPU > select FORCE_SMP > select SWIOTLB > + select ARCH_SUPPORTS_PER_VMA_LOCK > default y > > config PARAVIRT > -- > 2.39.2.722.g9855ee24e9-goog >
Re: [PATCH] powerpc/mm: fix mmap_lock bad unlock
On Mon, Mar 6, 2023 at 6:09 AM Laurent Dufour wrote: > > On 06/03/2023 15:07:26, David Hildenbrand wrote: > > On 06.03.23 14:55, Laurent Dufour wrote: > >> When page fault is tried holding the per VMA lock, bad_access_pkey() and > >> bad_access() should not be called because it is assuming the mmap_lock is > >> held. > >> In the case a bad access is detected, fall back to the default path, > >> grabbing the mmap_lock to handle the fault and report the error. > >> > >> Fixes: 169db3bb4609 ("powerc/mm: try VMA lock-based page fault handling > >> first") > >> Reported-by: Sachin Sant > >> Link: > >> https://lore.kernel.org/linux-mm/842502fb-f99c-417c-9648-a37d0ecdc...@linux.ibm.com > >> Cc: Suren Baghdasaryan > >> Signed-off-by: Laurent Dufour > >> --- > >> arch/powerpc/mm/fault.c | 8 ++-- > >> 1 file changed, 2 insertions(+), 6 deletions(-) > >> > >> diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c > >> index c7ae86b04b8a..e191b3ebd8d6 100644 > >> --- a/arch/powerpc/mm/fault.c > >> +++ b/arch/powerpc/mm/fault.c > >> @@ -479,17 +479,13 @@ static int ___do_page_fault(struct pt_regs *regs, > >> unsigned long address, > >> if (unlikely(access_pkey_error(is_write, is_exec, > >> (error_code & DSISR_KEYFAULT), vma))) { > >> -int rc = bad_access_pkey(regs, address, vma); > >> - > >> vma_end_read(vma); > >> -return rc; > >> +goto lock_mmap; > >> } > >> if (unlikely(access_error(is_write, is_exec, vma))) { > >> -int rc = bad_access(regs, address); > >> - > >> vma_end_read(vma); > >> -return rc; > >> +goto lock_mmap; > >> } > >> fault = handle_mm_fault(vma, address, flags | > >> FAULT_FLAG_VMA_LOCK, regs); > > > > IIUC, that commit is neither upstream not in mm-stable -- it's unstable. > > Maybe raise that as a review comment in reply to the original patch, so we > > can easily connect the dots and squash it into the original, problematic > > patch that is still under review. > > > Oh yes, I missed that. I'll reply to the Suren's thread. Thanks Laurent! Seems simple enough to patch the original change. > > Thanks, > Laurent.
Re: [PATCH 1/1] mm/nommu: remove unnecessary VMA locking
On Fri, Mar 3, 2023 at 1:05 AM David Hildenbrand wrote: > > >> > >> Just a general comment: usually, if review of the original series is > >> still going on, it makes a lot more sense to raise such things in the > >> original series so the author can fixup while things are still in > >> mm-unstable. Once the series is in mm-stable, it's a different story. In > >> that case, it is usually good to have the mail subjects be something > >> like "[PATCH mm-stable 1/1] ...". > > > > Ok... For my education, do you mean the title of this patch should > > somehow reflect that it should be folded into the original patch? Just > > trying to understand the actionable item here. How would you change > > this patch when posting for mm-unstable and for mm-stable? > > For patches that fixup something in mm-stable (stable commit ID but not > yet master -> we cannot squash anymore so we need separate commits), > it's good to include "mm-stable". The main difference to patches that > target master is that by indicating "mm-stable", everyone knows that > this is not broken in some upstream/production kernel. > > > For patches that fixup something that is in mm-unstable (no stable > commit ID -> still under review and fixup easily possible), IMHO we > distinguish between two cases: > > (1) You fixup your own patches: simply send the fixup as reply to the > original patch. Andrew will pick it up and squash it before including it > in mm-stable. Sometimes a complete resend of a series makes sense instead. > > (2) You fixup patches from someone else: simply raise it as a review > comment in reply to the original patch. It might make sense to send a > patch, but usually you just raise the issue to the patch author as a > review comment and the author will address that. Again, Andrew will pick > it up and squash it before moving it to mm-stable. > > > That way, it's clearer when stumbling over patches on the mailing list > if they fix a real issue in upstream, fix a issue in > soon-to-be-upstream, or are simply part of a WIP series that is still > under review. Thanks for the detailed explanation, David. I'll post fixups to mm-unstable patches by replying to the original ones from now on. Interestingly enough, I have another fix today (internal syzcaller found a potential deadlock) which might be interesting enough to be in a separate patch. So, I'll post it as a separate patch and we can discuss whether it should be squashed or kept apart. Thanks, Suren. > > -- > Thanks, > > David / dhildenb >
Re: [PATCH 1/1] mm/nommu: remove unnecessary VMA locking
On Thu, Mar 2, 2023 at 1:41 AM David Hildenbrand wrote: > > On 01.03.23 20:04, Suren Baghdasaryan wrote: > > Since CONFIG_PER_VMA_LOCK depends on CONFIG_MMU, the changes in nommu > > are not needed. Remove them. > > > > Fixes: bad94decd6a4 ("mm: write-lock VMAs before removing them from VMA > > tree") > > Reported-by: Hyeonggon Yoo <42.hye...@gmail.com> > > Link: https://lore.kernel.org/all/Y%2F8CJQGNuMUTdLwP@localhost/ > > Signed-off-by: Suren Baghdasaryan > > --- > > Fix cleanly applies over mm-unstable, SHA in "Fixes" is from that tree. > > > > mm/nommu.c | 5 - > > 1 file changed, 5 deletions(-) > > > > diff --git a/mm/nommu.c b/mm/nommu.c > > index 2ab162d773e2..57ba243c6a37 100644 > > --- a/mm/nommu.c > > +++ b/mm/nommu.c > > @@ -588,7 +588,6 @@ static int delete_vma_from_mm(struct vm_area_struct > > *vma) > > current->pid); > > return -ENOMEM; > > } > > - vma_start_write(vma); > > cleanup_vma_from_mm(vma); > > > > /* remove from the MM's tree and list */ > > @@ -1520,10 +1519,6 @@ void exit_mmap(struct mm_struct *mm) > >*/ > > mmap_write_lock(mm); > > for_each_vma(vmi, vma) { > > - /* > > - * No need to lock VMA because this is the only mm user and no > > - * page fault handled can race with it. > > - */ > > cleanup_vma_from_mm(vma); > > delete_vma(mm, vma); > > cond_resched(); > > So, i assume this should be squashed. Yes, that would be best. > > Reviewed-by: David Hildenbrand Thanks1 > > > Just a general comment: usually, if review of the original series is > still going on, it makes a lot more sense to raise such things in the > original series so the author can fixup while things are still in > mm-unstable. Once the series is in mm-stable, it's a different story. In > that case, it is usually good to have the mail subjects be something > like "[PATCH mm-stable 1/1] ...". Ok... For my education, do you mean the title of this patch should somehow reflect that it should be folded into the original patch? Just trying to understand the actionable item here. How would you change this patch when posting for mm-unstable and for mm-stable? Thanks, Suren. > > -- > Thanks, > > David / dhildenb >
Re: [PATCH v4 18/33] mm: write-lock VMAs before removing them from VMA tree
On Wed, Mar 1, 2023 at 4:54 PM Hyeonggon Yoo <42.hye...@gmail.com> wrote: > > On Wed, Mar 01, 2023 at 10:42:48AM -0800, Suren Baghdasaryan wrote: > > On Wed, Mar 1, 2023 at 10:34 AM Suren Baghdasaryan > > wrote: > > > > > > On Tue, Feb 28, 2023 at 11:57 PM Hyeonggon Yoo <42.hye...@gmail.com> > > > wrote: > > > > > > > > On Wed, Mar 01, 2023 at 07:43:33AM +0000, Hyeonggon Yoo wrote: > > > > > On Mon, Feb 27, 2023 at 09:36:17AM -0800, Suren Baghdasaryan wrote: > > > > > > Write-locking VMAs before isolating them ensures that page fault > > > > > > handlers don't operate on isolated VMAs. > > > > > > > > > > > > Signed-off-by: Suren Baghdasaryan > > > > > > --- > > > > > > mm/mmap.c | 1 + > > > > > > mm/nommu.c | 5 + > > > > > > 2 files changed, 6 insertions(+) > > > > > > > > > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > > > > > index 1f42b9a52b9b..f7ed357056c4 100644 > > > > > > --- a/mm/mmap.c > > > > > > +++ b/mm/mmap.c > > > > > > @@ -2255,6 +2255,7 @@ int split_vma(struct vma_iterator *vmi, > > > > > > struct vm_area_struct *vma, > > > > > > static inline int munmap_sidetree(struct vm_area_struct *vma, > > > > > >struct ma_state *mas_detach) > > > > > > { > > > > > > + vma_start_write(vma); > > > > > > mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1); > > > > > > > > > > I may be missing something, but have few questions: > > > > > > > > > > 1) Why does a writer need to both write-lock a VMA and mark the > > > > > VMA detached > > > > > when unmapping it, isn't it enough to just only write-lock a > > > > > VMA? > > > > > > We need to mark the VMA detached to avoid handling page fault in a > > > detached VMA. The possible scenario is: > > > > > > lock_vma_under_rcu > > > vma = mas_walk(&mas) > > > munmap_sidetree > > > > > > vma_start_write(vma) > > > > > > mas_store_gfp() // remove VMA from the tree > > > > > > vma_end_write_all() > > > vma_start_read(vma) > > > // we locked the VMA but it is not part of the tree anymore. > > > > > > So, marking the VMA locked before vma_end_write_all() and checking > > > > Sorry, I should have said "marking the VMA *detached* before > > vma_end_write_all() and checking vma->detached after vma_start_read() > > helps us avoid handling faults in the detached VMA." > > > > > vma->detached after vma_start_read() helps us avoid handling faults in > > > the detached VMA. > > Thank you for explanation, that makes sense! > > By the way, if there are no 32bit users of Per-VMA lock (are there?), > "detached" bool could be a VMA flag (i.e. making it depend on 64BIT > and selecting ARCH_USES_HIGH_VMA_FLAGS) Yeah, I thought about it but didn't want to make assumptions about potential users just yet. Besides, I heard there are attempts to make vm_flags to be always 64-bit (I think Matthew mentioned that to me once). If that happens, we won't need any dependencies here. Either way, this conversion into a flag can be done as an additional optimization later on. I prefer to keep the main patchset as simple as possible for now. Thanks, Suren. > > Thanks, > Hyeonggon > > -- > To unsubscribe from this group and stop receiving emails from it, send an > email to kernel-team+unsubscr...@android.com. >
Re: [PATCH v4 18/33] mm: write-lock VMAs before removing them from VMA tree
On Wed, Mar 1, 2023 at 10:34 AM Suren Baghdasaryan wrote: > > On Tue, Feb 28, 2023 at 11:57 PM Hyeonggon Yoo <42.hye...@gmail.com> wrote: > > > > On Wed, Mar 01, 2023 at 07:43:33AM +, Hyeonggon Yoo wrote: > > > On Mon, Feb 27, 2023 at 09:36:17AM -0800, Suren Baghdasaryan wrote: > > > > Write-locking VMAs before isolating them ensures that page fault > > > > handlers don't operate on isolated VMAs. > > > > > > > > Signed-off-by: Suren Baghdasaryan > > > > --- > > > > mm/mmap.c | 1 + > > > > mm/nommu.c | 5 + > > > > 2 files changed, 6 insertions(+) > > > > > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > > > index 1f42b9a52b9b..f7ed357056c4 100644 > > > > --- a/mm/mmap.c > > > > +++ b/mm/mmap.c > > > > @@ -2255,6 +2255,7 @@ int split_vma(struct vma_iterator *vmi, struct > > > > vm_area_struct *vma, > > > > static inline int munmap_sidetree(struct vm_area_struct *vma, > > > >struct ma_state *mas_detach) > > > > { > > > > + vma_start_write(vma); > > > > mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1); > > > > > > I may be missing something, but have few questions: > > > > > > 1) Why does a writer need to both write-lock a VMA and mark the VMA > > > detached > > > when unmapping it, isn't it enough to just only write-lock a VMA? > > We need to mark the VMA detached to avoid handling page fault in a > detached VMA. The possible scenario is: > > lock_vma_under_rcu > vma = mas_walk(&mas) > munmap_sidetree > vma_start_write(vma) > > mas_store_gfp() // remove VMA from the tree > vma_end_write_all() > vma_start_read(vma) > // we locked the VMA but it is not part of the tree anymore. > > So, marking the VMA locked before vma_end_write_all() and checking > vma->detached after vma_start_read() helps us avoid handling faults in > the detached VMA. > > > > > > > > 2) as VMAs that are going to be removed are already locked in > > > vma_prepare(), > > > so I think this hunk could be dropped? > > > > After sending this just realized that I did not consider simple munmap case > > :) > > But I still think 1) and 3) are valid question. > > > > > > > > > if (mas_store_gfp(mas_detach, vma, GFP_KERNEL)) > > > > return -ENOMEM; > > > > diff --git a/mm/nommu.c b/mm/nommu.c > > > > index 57ba243c6a37..2ab162d773e2 100644 > > > > --- a/mm/nommu.c > > > > +++ b/mm/nommu.c > > > > @@ -588,6 +588,7 @@ static int delete_vma_from_mm(struct vm_area_struct > > > > *vma) > > > >current->pid); > > > > return -ENOMEM; > > > > } > > > > + vma_start_write(vma); > > > > cleanup_vma_from_mm(vma); > > > > > > 3) I think this hunk could be dropped as Per-VMA lock depends on > > > MMU anyway. > > Ah, yes, you are right. We can safely remove the changes in nommu.c > Andrew, should I post a fixup or you can make the removal directly in > mm-unstable? I went ahead and posted the fixup for this at: https://lore.kernel.org/all/20230301190457.1498985-1-sur...@google.com/ > Thanks, > Suren. > > > > > > > Thanks, > > > Hyeonggon > > > > > > > > > > > /* remove from the MM's tree and list */ > > > > @@ -1519,6 +1520,10 @@ void exit_mmap(struct mm_struct *mm) > > > > */ > > > > mmap_write_lock(mm); > > > > for_each_vma(vmi, vma) { > > > > + /* > > > > +* No need to lock VMA because this is the only mm user and > > > > no > > > > +* page fault handled can race with it. 
> > > > +*/ > > > > cleanup_vma_from_mm(vma); > > > > delete_vma(mm, vma); > > > > cond_resched(); > > > > -- > > > > 2.39.2.722.g9855ee24e9-goog > > > > > > > > > > > > > > > -- > > To unsubscribe from this group and stop receiving emails from it, send an > > email to kernel-team+unsubscr...@android.com. > >
[PATCH 1/1] mm/nommu: remove unnecessary VMA locking
Since CONFIG_PER_VMA_LOCK depends on CONFIG_MMU, the changes in nommu are not needed. Remove them. Fixes: bad94decd6a4 ("mm: write-lock VMAs before removing them from VMA tree") Reported-by: Hyeonggon Yoo <42.hye...@gmail.com> Link: https://lore.kernel.org/all/Y%2F8CJQGNuMUTdLwP@localhost/ Signed-off-by: Suren Baghdasaryan --- Fix cleanly applies over mm-unstable, SHA in "Fixes" is from that tree. mm/nommu.c | 5 - 1 file changed, 5 deletions(-) diff --git a/mm/nommu.c b/mm/nommu.c index 2ab162d773e2..57ba243c6a37 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -588,7 +588,6 @@ static int delete_vma_from_mm(struct vm_area_struct *vma) current->pid); return -ENOMEM; } - vma_start_write(vma); cleanup_vma_from_mm(vma); /* remove from the MM's tree and list */ @@ -1520,10 +1519,6 @@ void exit_mmap(struct mm_struct *mm) */ mmap_write_lock(mm); for_each_vma(vmi, vma) { - /* -* No need to lock VMA because this is the only mm user and no -* page fault handled can race with it. -*/ cleanup_vma_from_mm(vma); delete_vma(mm, vma); cond_resched(); -- 2.40.0.rc0.216.gc4246ad0f0-goog
Re: [PATCH v4 18/33] mm: write-lock VMAs before removing them from VMA tree
On Wed, Mar 1, 2023 at 10:34 AM Suren Baghdasaryan wrote: > > On Tue, Feb 28, 2023 at 11:57 PM Hyeonggon Yoo <42.hye...@gmail.com> wrote: > > > > On Wed, Mar 01, 2023 at 07:43:33AM +, Hyeonggon Yoo wrote: > > > On Mon, Feb 27, 2023 at 09:36:17AM -0800, Suren Baghdasaryan wrote: > > > > Write-locking VMAs before isolating them ensures that page fault > > > > handlers don't operate on isolated VMAs. > > > > > > > > Signed-off-by: Suren Baghdasaryan > > > > --- > > > > mm/mmap.c | 1 + > > > > mm/nommu.c | 5 + > > > > 2 files changed, 6 insertions(+) > > > > > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > > > index 1f42b9a52b9b..f7ed357056c4 100644 > > > > --- a/mm/mmap.c > > > > +++ b/mm/mmap.c > > > > @@ -2255,6 +2255,7 @@ int split_vma(struct vma_iterator *vmi, struct > > > > vm_area_struct *vma, > > > > static inline int munmap_sidetree(struct vm_area_struct *vma, > > > >struct ma_state *mas_detach) > > > > { > > > > + vma_start_write(vma); > > > > mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1); > > > > > > I may be missing something, but have few questions: > > > > > > 1) Why does a writer need to both write-lock a VMA and mark the VMA > > > detached > > > when unmapping it, isn't it enough to just only write-lock a VMA? > > We need to mark the VMA detached to avoid handling page fault in a > detached VMA. The possible scenario is: > > lock_vma_under_rcu > vma = mas_walk(&mas) > munmap_sidetree > vma_start_write(vma) > > mas_store_gfp() // remove VMA from the tree > vma_end_write_all() > vma_start_read(vma) > // we locked the VMA but it is not part of the tree anymore. > > So, marking the VMA locked before vma_end_write_all() and checking Sorry, I should have said "marking the VMA *detached* before vma_end_write_all() and checking vma->detached after vma_start_read() helps us avoid handling faults in the detached VMA." > vma->detached after vma_start_read() helps us avoid handling faults in > the detached VMA. > > > > > > > > 2) as VMAs that are going to be removed are already locked in > > > vma_prepare(), > > > so I think this hunk could be dropped? > > > > After sending this just realized that I did not consider simple munmap case > > :) > > But I still think 1) and 3) are valid question. > > > > > > > > > if (mas_store_gfp(mas_detach, vma, GFP_KERNEL)) > > > > return -ENOMEM; > > > > diff --git a/mm/nommu.c b/mm/nommu.c > > > > index 57ba243c6a37..2ab162d773e2 100644 > > > > --- a/mm/nommu.c > > > > +++ b/mm/nommu.c > > > > @@ -588,6 +588,7 @@ static int delete_vma_from_mm(struct vm_area_struct > > > > *vma) > > > >current->pid); > > > > return -ENOMEM; > > > > } > > > > + vma_start_write(vma); > > > > cleanup_vma_from_mm(vma); > > > > > > 3) I think this hunk could be dropped as Per-VMA lock depends on > > > MMU anyway. > > Ah, yes, you are right. We can safely remove the changes in nommu.c > Andrew, should I post a fixup or you can make the removal directly in > mm-unstable? > Thanks, > Suren. > > > > > > > Thanks, > > > Hyeonggon > > > > > > > > > > > /* remove from the MM's tree and list */ > > > > @@ -1519,6 +1520,10 @@ void exit_mmap(struct mm_struct *mm) > > > > */ > > > > mmap_write_lock(mm); > > > > for_each_vma(vmi, vma) { > > > > + /* > > > > +* No need to lock VMA because this is the only mm user and > > > > no > > > > +* page fault handled can race with it. 
> > > > +*/ > > > > cleanup_vma_from_mm(vma); > > > > delete_vma(mm, vma); > > > > cond_resched(); > > > > -- > > > > 2.39.2.722.g9855ee24e9-goog > > > > > > > > > > > > > > > -- > > To unsubscribe from this group and stop receiving emails from it, send an > > email to kernel-team+unsubscr...@android.com. > >
Re: [PATCH v4 18/33] mm: write-lock VMAs before removing them from VMA tree
On Tue, Feb 28, 2023 at 11:57 PM Hyeonggon Yoo <42.hye...@gmail.com> wrote: > > On Wed, Mar 01, 2023 at 07:43:33AM +, Hyeonggon Yoo wrote: > > On Mon, Feb 27, 2023 at 09:36:17AM -0800, Suren Baghdasaryan wrote: > > > Write-locking VMAs before isolating them ensures that page fault > > > handlers don't operate on isolated VMAs. > > > > > > Signed-off-by: Suren Baghdasaryan > > > --- > > > mm/mmap.c | 1 + > > > mm/nommu.c | 5 + > > > 2 files changed, 6 insertions(+) > > > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > > index 1f42b9a52b9b..f7ed357056c4 100644 > > > --- a/mm/mmap.c > > > +++ b/mm/mmap.c > > > @@ -2255,6 +2255,7 @@ int split_vma(struct vma_iterator *vmi, struct > > > vm_area_struct *vma, > > > static inline int munmap_sidetree(struct vm_area_struct *vma, > > >struct ma_state *mas_detach) > > > { > > > + vma_start_write(vma); > > > mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1); > > > > I may be missing something, but have few questions: > > > > 1) Why does a writer need to both write-lock a VMA and mark the VMA > > detached > > when unmapping it, isn't it enough to just only write-lock a VMA? We need to mark the VMA detached to avoid handling page fault in a detached VMA. The possible scenario is: lock_vma_under_rcu vma = mas_walk(&mas) munmap_sidetree vma_start_write(vma) mas_store_gfp() // remove VMA from the tree vma_end_write_all() vma_start_read(vma) // we locked the VMA but it is not part of the tree anymore. So, marking the VMA locked before vma_end_write_all() and checking vma->detached after vma_start_read() helps us avoid handling faults in the detached VMA. > > > > 2) as VMAs that are going to be removed are already locked in > > vma_prepare(), > > so I think this hunk could be dropped? > > After sending this just realized that I did not consider simple munmap case :) > But I still think 1) and 3) are valid question. > > > > > > if (mas_store_gfp(mas_detach, vma, GFP_KERNEL)) > > > return -ENOMEM; > > > diff --git a/mm/nommu.c b/mm/nommu.c > > > index 57ba243c6a37..2ab162d773e2 100644 > > > --- a/mm/nommu.c > > > +++ b/mm/nommu.c > > > @@ -588,6 +588,7 @@ static int delete_vma_from_mm(struct vm_area_struct > > > *vma) > > >current->pid); > > > return -ENOMEM; > > > } > > > + vma_start_write(vma); > > > cleanup_vma_from_mm(vma); > > > > 3) I think this hunk could be dropped as Per-VMA lock depends on MMU > > anyway. Ah, yes, you are right. We can safely remove the changes in nommu.c Andrew, should I post a fixup or you can make the removal directly in mm-unstable? Thanks, Suren. > > > > Thanks, > > Hyeonggon > > > > > > > > /* remove from the MM's tree and list */ > > > @@ -1519,6 +1520,10 @@ void exit_mmap(struct mm_struct *mm) > > > */ > > > mmap_write_lock(mm); > > > for_each_vma(vmi, vma) { > > > + /* > > > +* No need to lock VMA because this is the only mm user and no > > > +* page fault handled can race with it. > > > +*/ > > > cleanup_vma_from_mm(vma); > > > delete_vma(mm, vma); > > > cond_resched(); > > > -- > > > 2.39.2.722.g9855ee24e9-goog > > > > > > > > > > -- > To unsubscribe from this group and stop receiving emails from it, send an > email to kernel-team+unsubscr...@android.com. >
Re: [PATCH v2 00/33] Per-VMA locks
On Tue, Feb 28, 2023 at 4:06 AM Punit Agrawal wrote: > > Punit Agrawal writes: > > > Suren Baghdasaryan writes: > > > >> Previous version: > >> v1: https://lore.kernel.org/all/20230109205336.3665937-1-sur...@google.com/ > >> RFC: https://lore.kernel.org/all/20220901173516.702122-1-sur...@google.com/ > >> > >> LWN article describing the feature: > >> https://lwn.net/Articles/906852/ > >> > >> Per-vma locks idea that was discussed during SPF [1] discussion at LSF/MM > >> last year [2], which concluded with suggestion that “a reader/writer > >> semaphore could be put into the VMA itself; that would have the effect of > >> using the VMA as a sort of range lock. There would still be contention at > >> the VMA level, but it would be an improvement.” This patchset implements > >> this suggested approach. > > > > I took the patches for a spin on a 2-socket 32 core (64 threads) system > > with Intel 8336C (Ice Lake) and 512GB of RAM. > > > > For the initial testing, "pft-threads" from the mm-tests suite[0] was > > used. The test mmaps a memory region (~100GB on the test system) and > > triggers access by a number of threads executing in parallel. For each > > degree of parallelism, the test is repeated 10 times to get a better > > feel for the behaviour. Below is an excerpt of the harmonic mean > > reported by 'compare_kernel' script[1] included with mm-tests. > > > > The first column is results for mm-unstable as of 2023-02-10, the second > > column is the patches posted here while the third column includes > > optimizations to reclaim some of the observed regression. > > > > From the results, there is a drop in page fault/second for low number of > > CPUs but good improvement with higher CPUs. > > > > 6.2.0-rc46.2.0-rc4 > > 6.2.0-rc4 > > mm-unstable-20230210 pvl-v2 > > pvl-v2+opt > > > > Hmean faults/cpu-1 898792.9338 ( 0.00%) 894597.0474 * -0.47%* > > 895933.2782 * -0.32%* > > Hmean faults/cpu-4 751903.9803 ( 0.00%) 677764.2975 * -9.86%* > > 688643.8163 * -8.41%* > > Hmean faults/cpu-7 612275.5663 ( 0.00%) 565363.4137 * -7.66%* > > 597538.9396 * -2.41%* > > Hmean faults/cpu-12434460.9074 ( 0.00%) 410974.2708 * -5.41%* > > 452501.4290 * 4.15%* > > Hmean faults/cpu-21291475.5165 ( 0.00%) 293936.8460 ( 0.84%) > > 308712.2434 * 5.91%* > > Hmean faults/cpu-30218021.3980 ( 0.00%) 228265.0559 * 4.70%* > > 241897.5225 * 10.95%* > > Hmean faults/cpu-48141798.5030 ( 0.00%) 162322.5972 * 14.47%* > > 166081.9459 * 17.13%* > > Hmean faults/cpu-79 90060.9577 ( 0.00%) 107028.7779 * 18.84%* > > 109810.4488 * 21.93%* > > Hmean faults/cpu-11064729.3561 ( 0.00%)80597.7246 * 24.51%* > > 83134.0679 * 28.43%* > > Hmean faults/cpu-12855740.1334 ( 0.00%)68395.4426 * 22.70%* > > 69248.2836 * 24.23%* > > > > Hmean faults/sec-1 898781.7694 ( 0.00%) 894247.3174 * -0.50%* > > 894440.3118 * -0.48%* > > Hmean faults/sec-42965588.9697 ( 0.00%) 2683651.5664 * -9.51%* > > 2726450.9710 * -8.06%* > > Hmean faults/sec-74144512.3996 ( 0.00%) 3891644.2128 * -6.10%* > > 4099918.8601 ( -1.08%) > > Hmean faults/sec-12 4969513.6934 ( 0.00%) 4829731.4355 * -2.81%* > > 5264682.7371 * 5.94%* > > Hmean faults/sec-21 5814379.4789 ( 0.00%) 5941405.3116 * 2.18%* > > 6263716.3903 * 7.73%* > > Hmean faults/sec-30 6153685.3709 ( 0.00%) 6489311.6634 * 5.45%* > > 6910843.5858 * 12.30%* > > Hmean faults/sec-48 6197953.1327 ( 0.00%) 7216320.7727 * 16.43%* > > 7412782.2927 * 19.60%* > > Hmean faults/sec-79 6167135.3738 ( 0.00%) 7425927.1022 * 20.41%* > > 7637042.2198 * 23.83%* > > Hmean faults/sec-110 6264768.2247 ( 0.00%) 7813329.3863 * 24.72%* > > 
7984344.4005 * 27.45%* > > Hmean faults/sec-128 6460727.8216 ( 0.00%) 7875664.8999 * 21.90%* > > 8049910.3601 * 24.60%* > > > The above workload represent the worst case with regards to per-VMA > locks as it creates a single large VMA. As a follow-up, I modified > pft[2] to create a VMA per thread to understand the behaviour in > scenarios where per-VMA locks should show the most benefit. > >
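The two cases being compared can be pictured with a small stand-alone user-space sketch: one large anonymous mapping shared by all threads (a single VMA, the worst case described above) versus one private mapping per thread (one VMA each, where the series should help most). This is only an illustration of the mapping patterns, not the actual pft modification.

#include <pthread.h>
#include <string.h>
#include <sys/mman.h>

#define NTHREADS	8
#define PER_THREAD_SZ	(256UL << 20)	/* 256 MB touched per thread */

static char *shared_region;	/* one big VMA, all threads fault into it */

static void *fault_shared(void *arg)
{
	long idx = (long)arg;

	memset(shared_region + idx * PER_THREAD_SZ, 1, PER_THREAD_SZ);
	return NULL;
}

static void *fault_private(void *arg)
{
	/* each thread creates and faults its own VMA */
	char *p = mmap(NULL, PER_THREAD_SZ, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

	(void)arg;
	if (p != MAP_FAILED)
		memset(p, 1, PER_THREAD_SZ);
	return NULL;
}

int main(void)
{
	pthread_t tid[NTHREADS];
	long i;

	shared_region = mmap(NULL, NTHREADS * PER_THREAD_SZ,
			     PROT_READ | PROT_WRITE,
			     MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (shared_region == MAP_FAILED)
		return 1;
	for (i = 0; i < NTHREADS; i++)
		pthread_create(&tid[i], NULL, fault_shared, (void *)i);
	for (i = 0; i < NTHREADS; i++)
		pthread_join(tid[i], NULL);
	return 0;
}

Switching fault_shared for fault_private in the pthread_create() call gives the per-thread-VMA variant that the modified pft exercises.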
[PATCH v4 33/33] mm: separate vma->lock from vm_area_struct
vma->lock being part of the vm_area_struct causes performance regression during page faults because during contention its count and owner fields are constantly updated and having other parts of vm_area_struct used during page fault handling next to them causes constant cache line bouncing. Fix that by moving the lock outside of the vm_area_struct. All attempts to keep vma->lock inside vm_area_struct in a separate cache line still produce performance regression especially on NUMA machines. Smallest regression was achieved when lock is placed in the fourth cache line but that bloats vm_area_struct to 256 bytes. Considering performance and memory impact, separate lock looks like the best option. It increases memory footprint of each VMA but that can be optimized later if the new size causes issues. Note that after this change vma_init() does not allocate or initialize vma->lock anymore. A number of drivers allocate a pseudo VMA on the stack but they never use the VMA's lock, therefore it does not need to be allocated. The future drivers which might need the VMA lock should use vm_area_alloc()/vm_area_free() to allocate the VMA. Signed-off-by: Suren Baghdasaryan --- include/linux/mm.h | 23 ++--- include/linux/mm_types.h | 6 +++- kernel/fork.c| 73 3 files changed, 74 insertions(+), 28 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 5e142bfe7a58..3d4bb18dfcb7 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -627,12 +627,6 @@ struct vm_operations_struct { }; #ifdef CONFIG_PER_VMA_LOCK -static inline void vma_init_lock(struct vm_area_struct *vma) -{ - init_rwsem(&vma->lock); - vma->vm_lock_seq = -1; -} - /* * Try to read-lock a vma. The function is allowed to occasionally yield false * locked result to avoid performance overhead, in which case we fall back to @@ -644,17 +638,17 @@ static inline bool vma_start_read(struct vm_area_struct *vma) if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq)) return false; - if (unlikely(down_read_trylock(&vma->lock) == 0)) + if (unlikely(down_read_trylock(&vma->vm_lock->lock) == 0)) return false; /* * Overflow might produce false locked result. * False unlocked result is impossible because we modify and check -* vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq +* vma->vm_lock_seq under vma->vm_lock protection and mm->mm_lock_seq * modification invalidates all existing locks. */ if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) { - up_read(&vma->lock); + up_read(&vma->vm_lock->lock); return false; } return true; @@ -663,7 +657,7 @@ static inline bool vma_start_read(struct vm_area_struct *vma) static inline void vma_end_read(struct vm_area_struct *vma) { rcu_read_lock(); /* keeps vma alive till the end of up_read */ - up_read(&vma->lock); + up_read(&vma->vm_lock->lock); rcu_read_unlock(); } @@ -681,9 +675,9 @@ static inline void vma_start_write(struct vm_area_struct *vma) if (vma->vm_lock_seq == mm_lock_seq) return; - down_write(&vma->lock); + down_write(&vma->vm_lock->lock); vma->vm_lock_seq = mm_lock_seq; - up_write(&vma->lock); + up_write(&vma->vm_lock->lock); } static inline void vma_assert_write_locked(struct vm_area_struct *vma) @@ -720,6 +714,10 @@ static inline void vma_mark_detached(struct vm_area_struct *vma, #endif /* CONFIG_PER_VMA_LOCK */ +/* + * WARNING: vma_init does not initialize vma->vm_lock. + * Use vm_area_alloc()/vm_area_free() if vma needs locking. 
+ */ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm) { static const struct vm_operations_struct dummy_vm_ops = {}; @@ -729,7 +727,6 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm) vma->vm_ops = &dummy_vm_ops; INIT_LIST_HEAD(&vma->anon_vma_chain); vma_mark_detached(vma, false); - vma_init_lock(vma); } /* Use when VMA is not part of the VMA tree and needs no locking */ diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 6768533a6b7c..89bbf7d8a312 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -471,6 +471,10 @@ struct anon_vma_name { char name[]; }; +struct vma_lock { + struct rw_semaphore lock; +}; + /* * This struct describes a virtual memory area. There is one of these * per VM-area/task. A VM area is any part of the process virtual memory @@ -510,7 +514,7 @@ struct vm_area_struct { #ifdef CONFIG_PER_VMA_LOCK i
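The kernel/fork.c hunk (73 changed lines in the diffstat) is cut off above. Going by the changelog, the allocation side has to give every VMA created through vm_area_alloc()/vm_area_dup() its own vma_lock; the sketch below shows that shape. The slab-cache and helper names (vma_lock_cachep, vma_lock_alloc) are illustrative, not quoted from the patch.

/* Sketch of the allocation side implied by the changelog above. */
static struct kmem_cache *vma_lock_cachep;	/* assumed dedicated cache */

static bool vma_lock_alloc(struct vm_area_struct *vma)
{
	vma->vm_lock = kmem_cache_alloc(vma_lock_cachep, GFP_KERNEL);
	if (!vma->vm_lock)
		return false;

	init_rwsem(&vma->vm_lock->lock);
	vma->vm_lock_seq = -1;
	return true;
}

struct vm_area_struct *vm_area_alloc(struct mm_struct *mm)
{
	struct vm_area_struct *vma;

	vma = kmem_cache_alloc(vm_area_cachep, GFP_KERNEL);
	if (!vma)
		return NULL;

	vma_init(vma, mm);		/* no longer initializes the lock */
	if (!vma_lock_alloc(vma)) {
		kmem_cache_free(vm_area_cachep, vma);
		return NULL;
	}
	return vma;
}

vm_area_dup() and __vm_area_free() would need the matching allocate/free, while the stack-allocated pseudo VMAs mentioned in the changelog simply never get a lock because they bypass vm_area_alloc().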
[PATCH v4 32/33] mm/mmap: free vm_area_struct without call_rcu in exit_mmap
call_rcu() can take a long time when callback offloading is enabled. Its use in the vm_area_free can cause regressions in the exit path when multiple VMAs are being freed. Because exit_mmap() is called only after the last mm user drops its refcount, the page fault handlers can't be racing with it. Any other possible user like oom-reaper or process_mrelease are already synchronized using mmap_lock. Therefore exit_mmap() can free VMAs directly, without the use of call_rcu(). Expose __vm_area_free() and use it from exit_mmap() to avoid possible call_rcu() floods and performance regressions caused by it. Signed-off-by: Suren Baghdasaryan --- include/linux/mm.h | 2 ++ kernel/fork.c | 2 +- mm/mmap.c | 11 +++ 3 files changed, 10 insertions(+), 5 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index d07ac92f..5e142bfe7a58 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -256,6 +256,8 @@ void setup_initial_init_mm(void *start_code, void *end_code, struct vm_area_struct *vm_area_alloc(struct mm_struct *); struct vm_area_struct *vm_area_dup(struct vm_area_struct *); void vm_area_free(struct vm_area_struct *); +/* Use only if VMA has no other users */ +void __vm_area_free(struct vm_area_struct *vma); #ifndef CONFIG_MMU extern struct rb_root nommu_region_tree; diff --git a/kernel/fork.c b/kernel/fork.c index bdb55f25895d..ad37f1d0c5ab 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -480,7 +480,7 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) return new; } -static void __vm_area_free(struct vm_area_struct *vma) +void __vm_area_free(struct vm_area_struct *vma) { free_anon_vma_name(vma); kmem_cache_free(vm_area_cachep, vma); diff --git a/mm/mmap.c b/mm/mmap.c index df13c33498db..0cd3714c2182 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -133,7 +133,7 @@ void unlink_file_vma(struct vm_area_struct *vma) /* * Close a vm structure and free it. */ -static void remove_vma(struct vm_area_struct *vma) +static void remove_vma(struct vm_area_struct *vma, bool unreachable) { might_sleep(); if (vma->vm_ops && vma->vm_ops->close) @@ -141,7 +141,10 @@ static void remove_vma(struct vm_area_struct *vma) if (vma->vm_file) fput(vma->vm_file); mpol_put(vma_policy(vma)); - vm_area_free(vma); + if (unreachable) + __vm_area_free(vma); + else + vm_area_free(vma); } static inline struct vm_area_struct *vma_prev_limit(struct vma_iterator *vmi, @@ -2130,7 +2133,7 @@ static inline void remove_mt(struct mm_struct *mm, struct ma_state *mas) if (vma->vm_flags & VM_ACCOUNT) nr_accounted += nrpages; vm_stat_account(mm, vma->vm_flags, -nrpages); - remove_vma(vma); + remove_vma(vma, false); } vm_unacct_memory(nr_accounted); validate_mm(mm); @@ -3070,7 +3073,7 @@ void exit_mmap(struct mm_struct *mm) do { if (vma->vm_flags & VM_ACCOUNT) nr_accounted += vma_pages(vma); - remove_vma(vma); + remove_vma(vma, true); count++; cond_resched(); } while ((vma = mas_find(&mas, ULONG_MAX)) != NULL); -- 2.39.2.722.g9855ee24e9-goog
[PATCH v4 31/33] powerpc/mm: try VMA lock-based page fault handling first
From: Laurent Dufour Attempt VMA lock-based page fault handling first, and fall back to the existing mmap_lock-based handling if that fails. Copied from "x86/mm: try VMA lock-based page fault handling first" Signed-off-by: Laurent Dufour Signed-off-by: Suren Baghdasaryan --- arch/powerpc/mm/fault.c| 41 ++ arch/powerpc/platforms/powernv/Kconfig | 1 + arch/powerpc/platforms/pseries/Kconfig | 1 + 3 files changed, 43 insertions(+) diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c index 2bef19cc1b98..c7ae86b04b8a 100644 --- a/arch/powerpc/mm/fault.c +++ b/arch/powerpc/mm/fault.c @@ -469,6 +469,44 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address, if (is_exec) flags |= FAULT_FLAG_INSTRUCTION; +#ifdef CONFIG_PER_VMA_LOCK + if (!(flags & FAULT_FLAG_USER)) + goto lock_mmap; + + vma = lock_vma_under_rcu(mm, address); + if (!vma) + goto lock_mmap; + + if (unlikely(access_pkey_error(is_write, is_exec, + (error_code & DSISR_KEYFAULT), vma))) { + int rc = bad_access_pkey(regs, address, vma); + + vma_end_read(vma); + return rc; + } + + if (unlikely(access_error(is_write, is_exec, vma))) { + int rc = bad_access(regs, address); + + vma_end_read(vma); + return rc; + } + + fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); + vma_end_read(vma); + + if (!(fault & VM_FAULT_RETRY)) { + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + goto done; + } + count_vm_vma_lock_event(VMA_LOCK_RETRY); + + if (fault_signal_pending(fault, regs)) + return user_mode(regs) ? 0 : SIGBUS; + +lock_mmap: +#endif /* CONFIG_PER_VMA_LOCK */ + /* When running in the kernel we expect faults to occur only to * addresses in user space. All other faults represent errors in the * kernel and should generate an OOPS. Unfortunately, in the case of an @@ -545,6 +583,9 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned long address, mmap_read_unlock(current->mm); +#ifdef CONFIG_PER_VMA_LOCK +done: +#endif if (unlikely(fault & VM_FAULT_ERROR)) return mm_fault_error(regs, address, fault); diff --git a/arch/powerpc/platforms/powernv/Kconfig b/arch/powerpc/platforms/powernv/Kconfig index ae248a161b43..70a46acc70d6 100644 --- a/arch/powerpc/platforms/powernv/Kconfig +++ b/arch/powerpc/platforms/powernv/Kconfig @@ -16,6 +16,7 @@ config PPC_POWERNV select PPC_DOORBELL select MMU_NOTIFIER select FORCE_SMP + select ARCH_SUPPORTS_PER_VMA_LOCK default y config OPAL_PRD diff --git a/arch/powerpc/platforms/pseries/Kconfig b/arch/powerpc/platforms/pseries/Kconfig index b481c5c8bae1..9c205fe0e619 100644 --- a/arch/powerpc/platforms/pseries/Kconfig +++ b/arch/powerpc/platforms/pseries/Kconfig @@ -21,6 +21,7 @@ config PPC_PSERIES select HOTPLUG_CPU select FORCE_SMP select SWIOTLB + select ARCH_SUPPORTS_PER_VMA_LOCK default y config PARAVIRT -- 2.39.2.722.g9855ee24e9-goog
[PATCH v4 30/33] arm64/mm: try VMA lock-based page fault handling first
Attempt VMA lock-based page fault handling first, and fall back to the existing mmap_lock-based handling if that fails. Signed-off-by: Suren Baghdasaryan --- arch/arm64/Kconfig| 1 + arch/arm64/mm/fault.c | 36 2 files changed, 37 insertions(+) diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig index 27b2592698b0..412207d789b1 100644 --- a/arch/arm64/Kconfig +++ b/arch/arm64/Kconfig @@ -95,6 +95,7 @@ config ARM64 select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 select ARCH_SUPPORTS_NUMA_BALANCING select ARCH_SUPPORTS_PAGE_TABLE_CHECK + select ARCH_SUPPORTS_PER_VMA_LOCK select ARCH_WANT_COMPAT_IPC_PARSE_VERSION if COMPAT select ARCH_WANT_DEFAULT_BPF_JIT select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT diff --git a/arch/arm64/mm/fault.c b/arch/arm64/mm/fault.c index f4cb0f85ccf4..9e0db5c387e3 100644 --- a/arch/arm64/mm/fault.c +++ b/arch/arm64/mm/fault.c @@ -535,6 +535,9 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, unsigned long vm_flags; unsigned int mm_flags = FAULT_FLAG_DEFAULT; unsigned long addr = untagged_addr(far); +#ifdef CONFIG_PER_VMA_LOCK + struct vm_area_struct *vma; +#endif if (kprobe_page_fault(regs, esr)) return 0; @@ -585,6 +588,36 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, perf_sw_event(PERF_COUNT_SW_PAGE_FAULTS, 1, regs, addr); +#ifdef CONFIG_PER_VMA_LOCK + if (!(mm_flags & FAULT_FLAG_USER)) + goto lock_mmap; + + vma = lock_vma_under_rcu(mm, addr); + if (!vma) + goto lock_mmap; + + if (!(vma->vm_flags & vm_flags)) { + vma_end_read(vma); + goto lock_mmap; + } + fault = handle_mm_fault(vma, addr & PAGE_MASK, + mm_flags | FAULT_FLAG_VMA_LOCK, regs); + vma_end_read(vma); + + if (!(fault & VM_FAULT_RETRY)) { + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + goto done; + } + count_vm_vma_lock_event(VMA_LOCK_RETRY); + + /* Quick path to respond to signals */ + if (fault_signal_pending(fault, regs)) { + if (!user_mode(regs)) + goto no_context; + return 0; + } +lock_mmap: +#endif /* CONFIG_PER_VMA_LOCK */ /* * As per x86, we may deadlock here. However, since the kernel only * validly references user space from well defined areas of the code, @@ -628,6 +661,9 @@ static int __kprobes do_page_fault(unsigned long far, unsigned long esr, } mmap_read_unlock(mm); +#ifdef CONFIG_PER_VMA_LOCK +done: +#endif /* * Handle the "normal" (no error) case first. */ -- 2.39.2.722.g9855ee24e9-goog
[PATCH v4 29/33] x86/mm: try VMA lock-based page fault handling first
Attempt VMA lock-based page fault handling first, and fall back to the existing mmap_lock-based handling if that fails. Signed-off-by: Suren Baghdasaryan --- arch/x86/Kconfig| 1 + arch/x86/mm/fault.c | 36 2 files changed, 37 insertions(+) diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig index a825bf031f49..df21fba77db1 100644 --- a/arch/x86/Kconfig +++ b/arch/x86/Kconfig @@ -27,6 +27,7 @@ config X86_64 # Options that are inherently 64-bit kernel only: select ARCH_HAS_GIGANTIC_PAGE select ARCH_SUPPORTS_INT128 if CC_HAS_INT128 + select ARCH_SUPPORTS_PER_VMA_LOCK select ARCH_USE_CMPXCHG_LOCKREF select HAVE_ARCH_SOFT_DIRTY select MODULES_USE_ELF_RELA diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c index a498ae1fbe66..e4399983c50c 100644 --- a/arch/x86/mm/fault.c +++ b/arch/x86/mm/fault.c @@ -19,6 +19,7 @@ #include /* faulthandler_disabled() */ #include /* efi_crash_gracefully_on_page_fault()*/ #include +#include /* find_and_lock_vma() */ #include /* boot_cpu_has, ...*/ #include /* dotraplinkage, ... */ @@ -1333,6 +1334,38 @@ void do_user_addr_fault(struct pt_regs *regs, } #endif +#ifdef CONFIG_PER_VMA_LOCK + if (!(flags & FAULT_FLAG_USER)) + goto lock_mmap; + + vma = lock_vma_under_rcu(mm, address); + if (!vma) + goto lock_mmap; + + if (unlikely(access_error(error_code, vma))) { + vma_end_read(vma); + goto lock_mmap; + } + fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs); + vma_end_read(vma); + + if (!(fault & VM_FAULT_RETRY)) { + count_vm_vma_lock_event(VMA_LOCK_SUCCESS); + goto done; + } + count_vm_vma_lock_event(VMA_LOCK_RETRY); + + /* Quick path to respond to signals */ + if (fault_signal_pending(fault, regs)) { + if (!user_mode(regs)) + kernelmode_fixup_or_oops(regs, error_code, address, +SIGBUS, BUS_ADRERR, +ARCH_DEFAULT_PKEY); + return; + } +lock_mmap: +#endif /* CONFIG_PER_VMA_LOCK */ + /* * Kernel-mode access to the user address space should only occur * on well-defined single instructions listed in the exception @@ -1433,6 +1466,9 @@ void do_user_addr_fault(struct pt_regs *regs, } mmap_read_unlock(mm); +#ifdef CONFIG_PER_VMA_LOCK +done: +#endif if (likely(!(fault & VM_FAULT_ERROR))) return; -- 2.39.2.722.g9855ee24e9-goog
[PATCH v4 28/33] mm: introduce per-VMA lock statistics
Add a new CONFIG_PER_VMA_LOCK_STATS config option to dump extra statistics about handling page fault under VMA lock. Signed-off-by: Suren Baghdasaryan --- include/linux/vm_event_item.h | 6 ++ include/linux/vmstat.h| 6 ++ mm/Kconfig.debug | 6 ++ mm/memory.c | 2 ++ mm/vmstat.c | 6 ++ 5 files changed, 26 insertions(+) diff --git a/include/linux/vm_event_item.h b/include/linux/vm_event_item.h index 7f5d1caf5890..8abfa1240040 100644 --- a/include/linux/vm_event_item.h +++ b/include/linux/vm_event_item.h @@ -149,6 +149,12 @@ enum vm_event_item { PGPGIN, PGPGOUT, PSWPIN, PSWPOUT, #ifdef CONFIG_X86 DIRECT_MAP_LEVEL2_SPLIT, DIRECT_MAP_LEVEL3_SPLIT, +#endif +#ifdef CONFIG_PER_VMA_LOCK_STATS + VMA_LOCK_SUCCESS, + VMA_LOCK_ABORT, + VMA_LOCK_RETRY, + VMA_LOCK_MISS, #endif NR_VM_EVENT_ITEMS }; diff --git a/include/linux/vmstat.h b/include/linux/vmstat.h index 19cf5b6892ce..fed855bae6d8 100644 --- a/include/linux/vmstat.h +++ b/include/linux/vmstat.h @@ -125,6 +125,12 @@ static inline void vm_events_fold_cpu(int cpu) #define count_vm_tlb_events(x, y) do { (void)(y); } while (0) #endif +#ifdef CONFIG_PER_VMA_LOCK_STATS +#define count_vm_vma_lock_event(x) count_vm_event(x) +#else +#define count_vm_vma_lock_event(x) do {} while (0) +#endif + #define __count_zid_vm_events(item, zid, delta) \ __count_vm_events(item##_NORMAL - ZONE_NORMAL + zid, delta) diff --git a/mm/Kconfig.debug b/mm/Kconfig.debug index c3547a373c9c..4965a7333a3f 100644 --- a/mm/Kconfig.debug +++ b/mm/Kconfig.debug @@ -279,3 +279,9 @@ config DEBUG_KMEMLEAK_AUTO_SCAN If unsure, say Y. +config PER_VMA_LOCK_STATS + bool "Statistics for per-vma locks" + depends on PER_VMA_LOCK + default y + help + Statistics for per-vma locks. diff --git a/mm/memory.c b/mm/memory.c index f734f80d28ca..255b2f4fdd4a 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5273,6 +5273,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, /* Check if the VMA got isolated after we found it */ if (vma->detached) { vma_end_read(vma); + count_vm_vma_lock_event(VMA_LOCK_MISS); /* The area was replaced with another one */ goto retry; } @@ -5281,6 +5282,7 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, return vma; inval: rcu_read_unlock(); + count_vm_vma_lock_event(VMA_LOCK_ABORT); return NULL; } #endif /* CONFIG_PER_VMA_LOCK */ diff --git a/mm/vmstat.c b/mm/vmstat.c index 1ea6a5ce1c41..4f1089a1860e 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1399,6 +1399,12 @@ const char * const vmstat_text[] = { "direct_map_level2_splits", "direct_map_level3_splits", #endif +#ifdef CONFIG_PER_VMA_LOCK_STATS + "vma_lock_success", + "vma_lock_abort", + "vma_lock_retry", + "vma_lock_miss", +#endif #endif /* CONFIG_VM_EVENT_COUNTERS || CONFIG_MEMCG */ }; #endif /* CONFIG_PROC_FS || CONFIG_SYSFS || CONFIG_NUMA || CONFIG_MEMCG */ -- 2.39.2.722.g9855ee24e9-goog
[PATCH v4 27/33] mm: prevent userfaults to be handled under per-vma lock
Due to the possibility of handle_userfault dropping mmap_lock, avoid fault handling under VMA lock and retry holding mmap_lock. This can be handled more gracefully in the future. Signed-off-by: Suren Baghdasaryan Suggested-by: Peter Xu --- mm/memory.c | 9 + 1 file changed, 9 insertions(+) diff --git a/mm/memory.c b/mm/memory.c index af3c2c59cd11..f734f80d28ca 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5255,6 +5255,15 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, if (!vma_start_read(vma)) goto inval; + /* +* Due to the possibility of userfault handler dropping mmap_lock, avoid +* it for now and fall back to page fault handling under mmap_lock. +*/ + if (userfaultfd_armed(vma)) { + vma_end_read(vma); + goto inval; + } + /* Check since vm_start/vm_end might change before we lock the VMA */ if (unlikely(address < vma->vm_start || address >= vma->vm_end)) { vma_end_read(vma); -- 2.39.2.722.g9855ee24e9-goog
[PATCH v4 26/33] mm: prevent do_swap_page from handling page faults under VMA lock
Due to the possibility of do_swap_page dropping mmap_lock, abort fault handling under VMA lock and retry holding mmap_lock. This can be handled more gracefully in the future. Signed-off-by: Suren Baghdasaryan Reviewed-by: Laurent Dufour --- mm/memory.c | 5 + 1 file changed, 5 insertions(+) diff --git a/mm/memory.c b/mm/memory.c index 8855846a361b..af3c2c59cd11 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -3689,6 +3689,11 @@ vm_fault_t do_swap_page(struct vm_fault *vmf) if (!pte_unmap_same(vmf)) goto out; + if (vmf->flags & FAULT_FLAG_VMA_LOCK) { + ret = VM_FAULT_RETRY; + goto out; + } + entry = pte_to_swp_entry(vmf->orig_pte); if (unlikely(non_swap_entry(entry))) { if (is_migration_entry(entry)) { -- 2.39.2.722.g9855ee24e9-goog
[PATCH v4 24/33] mm: fall back to mmap_lock if vma->anon_vma is not yet set
When vma->anon_vma is not set, page fault handler will set it by either reusing anon_vma of an adjacent VMA if VMAs are compatible or by allocating a new one. find_mergeable_anon_vma() walks VMA tree to find a compatible adjacent VMA and that requires not only the faulting VMA to be stable but also the tree structure and other VMAs inside that tree. Therefore locking just the faulting VMA is not enough for this search. Fall back to taking mmap_lock when vma->anon_vma is not set. This situation happens only on the first page fault and should not affect overall performance. Signed-off-by: Suren Baghdasaryan --- mm/memory.c | 4 1 file changed, 4 insertions(+) diff --git a/mm/memory.c b/mm/memory.c index bda4c1a991f0..8855846a361b 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5243,6 +5243,10 @@ struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, if (!vma_is_anonymous(vma)) goto inval; + /* find_mergeable_anon_vma uses adjacent vmas which are not locked */ + if (!vma->anon_vma) + goto inval; + if (!vma_start_read(vma)) goto inval; -- 2.39.2.722.g9855ee24e9-goog
[PATCH v4 25/33] mm: add FAULT_FLAG_VMA_LOCK flag
Add a new flag to distinguish page faults handled under protection of per-vma lock. Signed-off-by: Suren Baghdasaryan Reviewed-by: Laurent Dufour --- include/linux/mm.h | 3 ++- include/linux/mm_types.h | 1 + 2 files changed, 3 insertions(+), 1 deletion(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index 46d2db743b1a..d07ac92f 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -478,7 +478,8 @@ static inline bool fault_flag_allow_retry_first(enum fault_flag flags) { FAULT_FLAG_USER, "USER" }, \ { FAULT_FLAG_REMOTE,"REMOTE" }, \ { FAULT_FLAG_INSTRUCTION, "INSTRUCTION" }, \ - { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" } + { FAULT_FLAG_INTERRUPTIBLE, "INTERRUPTIBLE" }, \ + { FAULT_FLAG_VMA_LOCK, "VMA_LOCK" } /* * vm_fault is filled by the pagefault handler and passed to the vma's diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 45a219d33c6b..6768533a6b7c 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -1097,6 +1097,7 @@ enum fault_flag { FAULT_FLAG_INTERRUPTIBLE = 1 << 9, FAULT_FLAG_UNSHARE =1 << 10, FAULT_FLAG_ORIG_PTE_VALID = 1 << 11, + FAULT_FLAG_VMA_LOCK = 1 << 12, }; typedef unsigned int __bitwise zap_flags_t; -- 2.39.2.722.g9855ee24e9-goog
[PATCH v4 23/33] mm: introduce lock_vma_under_rcu to be used from arch-specific code
Introduce lock_vma_under_rcu function to lookup and lock a VMA during page fault handling. When VMA is not found, can't be locked or changes after being locked, the function returns NULL. The lookup is performed under RCU protection to prevent the found VMA from being destroyed before the VMA lock is acquired. VMA lock statistics are updated according to the results. For now only anonymous VMAs can be searched this way. In other cases the function returns NULL. Signed-off-by: Suren Baghdasaryan --- include/linux/mm.h | 3 +++ mm/memory.c| 46 ++ 2 files changed, 49 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 895bb3950e8a..46d2db743b1a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -701,6 +701,9 @@ static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached) vma->detached = detached; } +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, + unsigned long address); + #else /* CONFIG_PER_VMA_LOCK */ static inline void vma_init_lock(struct vm_area_struct *vma) {} diff --git a/mm/memory.c b/mm/memory.c index f7f412833e42..bda4c1a991f0 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -5221,6 +5221,52 @@ vm_fault_t handle_mm_fault(struct vm_area_struct *vma, unsigned long address, } EXPORT_SYMBOL_GPL(handle_mm_fault); +#ifdef CONFIG_PER_VMA_LOCK +/* + * Lookup and lock a VMA under RCU protection. Returned VMA is guaranteed to be + * stable and not isolated. If the VMA is not found or is being modified the + * function returns NULL. + */ +struct vm_area_struct *lock_vma_under_rcu(struct mm_struct *mm, + unsigned long address) +{ + MA_STATE(mas, &mm->mm_mt, address, address); + struct vm_area_struct *vma; + + rcu_read_lock(); +retry: + vma = mas_walk(&mas); + if (!vma) + goto inval; + + /* Only anonymous vmas are supported for now */ + if (!vma_is_anonymous(vma)) + goto inval; + + if (!vma_start_read(vma)) + goto inval; + + /* Check since vm_start/vm_end might change before we lock the VMA */ + if (unlikely(address < vma->vm_start || address >= vma->vm_end)) { + vma_end_read(vma); + goto inval; + } + + /* Check if the VMA got isolated after we found it */ + if (vma->detached) { + vma_end_read(vma); + /* The area was replaced with another one */ + goto retry; + } + + rcu_read_unlock(); + return vma; +inval: + rcu_read_unlock(); + return NULL; +} +#endif /* CONFIG_PER_VMA_LOCK */ + #ifndef __PAGETABLE_P4D_FOLDED /* * Allocate p4d page table. -- 2.39.2.722.g9855ee24e9-goog
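Since the changelog says this helper is meant to be called from arch-specific code, it may help to restate the calling pattern that the x86, arm64 and powerpc patches earlier in this archive all follow. This is a condensed summary, not code from the series: arch_access_error() is a placeholder for the architecture's permission check, and signal handling plus the mmap_lock retry remain in the caller.

static bool try_vma_locked_fault(struct mm_struct *mm, struct pt_regs *regs,
				 unsigned long address, unsigned int flags,
				 vm_fault_t *ret)
{
	struct vm_area_struct *vma;

	if (!(flags & FAULT_FLAG_USER))
		return false;

	vma = lock_vma_under_rcu(mm, address);
	if (!vma)
		return false;

	if (unlikely(arch_access_error(flags, vma))) {
		vma_end_read(vma);
		return false;			/* retry under mmap_lock */
	}

	*ret = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
	vma_end_read(vma);

	if (!(*ret & VM_FAULT_RETRY)) {
		count_vm_vma_lock_event(VMA_LOCK_SUCCESS);
		return true;			/* handled without mmap_lock */
	}
	count_vm_vma_lock_event(VMA_LOCK_RETRY);
	return false;				/* caller checks signals, then retries */
}

On VM_FAULT_RETRY the caller must still check fault_signal_pending() before falling back to the mmap_lock path, as each of the arch patches does.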
[PATCH v4 22/33] mm: introduce vma detached flag
Per-vma locking mechanism will search for VMA under RCU protection and then after locking it, has to ensure it was not removed from the VMA tree after we found it. To make this check efficient, introduce a vma->detached flag to mark VMAs which were removed from the VMA tree. Signed-off-by: Suren Baghdasaryan --- include/linux/mm.h | 11 +++ include/linux/mm_types.h | 3 +++ mm/mmap.c| 2 ++ 3 files changed, 16 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 3d5e8666892d..895bb3950e8a 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -693,6 +693,14 @@ static inline void vma_assert_write_locked(struct vm_area_struct *vma) VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma); } +static inline void vma_mark_detached(struct vm_area_struct *vma, bool detached) +{ + /* When detaching vma should be write-locked */ + if (detached) + vma_assert_write_locked(vma); + vma->detached = detached; +} + #else /* CONFIG_PER_VMA_LOCK */ static inline void vma_init_lock(struct vm_area_struct *vma) {} @@ -701,6 +709,8 @@ static inline bool vma_start_read(struct vm_area_struct *vma) static inline void vma_end_read(struct vm_area_struct *vma) {} static inline void vma_start_write(struct vm_area_struct *vma) {} static inline void vma_assert_write_locked(struct vm_area_struct *vma) {} +static inline void vma_mark_detached(struct vm_area_struct *vma, +bool detached) {} #endif /* CONFIG_PER_VMA_LOCK */ @@ -712,6 +722,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm) vma->vm_mm = mm; vma->vm_ops = &dummy_vm_ops; INIT_LIST_HEAD(&vma->anon_vma_chain); + vma_mark_detached(vma, false); vma_init_lock(vma); } diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index a4e7493bacd7..45a219d33c6b 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -511,6 +511,9 @@ struct vm_area_struct { #ifdef CONFIG_PER_VMA_LOCK int vm_lock_seq; struct rw_semaphore lock; + + /* Flag to indicate areas detached from the mm->mm_mt tree */ + bool detached; #endif /* diff --git a/mm/mmap.c b/mm/mmap.c index b947d82e8522..df13c33498db 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -600,6 +600,7 @@ static inline void vma_complete(struct vma_prepare *vp, if (vp->remove) { again: + vma_mark_detached(vp->remove, true); if (vp->file) { uprobe_munmap(vp->remove, vp->remove->vm_start, vp->remove->vm_end); @@ -2261,6 +2262,7 @@ static inline int munmap_sidetree(struct vm_area_struct *vma, if (mas_store_gfp(mas_detach, vma, GFP_KERNEL)) return -ENOMEM; + vma_mark_detached(vma, true); if (vma->vm_flags & VM_LOCKED) vma->vm_mm->locked_vm -= vma_pages(vma); -- 2.39.2.722.g9855ee24e9-goog
[PATCH v4 21/33] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration
Page fault handlers might need to fire MMU notifications while a new notifier is being registered. Modify mm_take_all_locks to write-lock all VMAs and prevent this race with page fault handlers that would hold VMA locks. VMAs are locked before i_mmap_rwsem and anon_vma to keep the same locking order as in page fault handlers. Signed-off-by: Suren Baghdasaryan --- mm/mmap.c | 9 + 1 file changed, 9 insertions(+) diff --git a/mm/mmap.c b/mm/mmap.c index ec745586785c..b947d82e8522 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3486,6 +3486,7 @@ static void vm_lock_mapping(struct mm_struct *mm, struct address_space *mapping) * of mm/rmap.c: * - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for * hugetlb mapping); + * - all vmas marked locked * - all i_mmap_rwsem locks; * - all anon_vma->rwseml * @@ -3508,6 +3509,13 @@ int mm_take_all_locks(struct mm_struct *mm) mutex_lock(&mm_all_locks_mutex); + mas_for_each(&mas, vma, ULONG_MAX) { + if (signal_pending(current)) + goto out_unlock; + vma_start_write(vma); + } + + mas_set(&mas, 0); mas_for_each(&mas, vma, ULONG_MAX) { if (signal_pending(current)) goto out_unlock; @@ -3597,6 +3605,7 @@ void mm_drop_all_locks(struct mm_struct *mm) if (vma->vm_file && vma->vm_file->f_mapping) vm_unlock_mapping(vma->vm_file->f_mapping); } + vma_end_write_all(mm); mutex_unlock(&mm_all_locks_mutex); } -- 2.39.2.722.g9855ee24e9-goog
[PATCH v4 20/33] kernel/fork: assert no VMA readers during its destruction
Assert there are no holders of VMA lock for reading when it is about to be destroyed. Signed-off-by: Suren Baghdasaryan --- kernel/fork.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/kernel/fork.c b/kernel/fork.c index e1dd79c7738c..bdb55f25895d 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -491,6 +491,9 @@ static void vm_area_free_rcu_cb(struct rcu_head *head) { struct vm_area_struct *vma = container_of(head, struct vm_area_struct, vm_rcu); + + /* The vma should not be locked while being destroyed. */ + VM_BUG_ON_VMA(rwsem_is_locked(&vma->lock), vma); __vm_area_free(vma); } #endif -- 2.39.2.722.g9855ee24e9-goog
[PATCH v4 19/33] mm: conditionally write-lock VMA in free_pgtables
Normally free_pgtables needs to lock affected VMAs except for the case when VMAs were isolated under VMA write-lock. munmap() does just that, isolating while holding appropriate locks and then downgrading mmap_lock and dropping per-VMA locks before freeing page tables. Add a parameter to free_pgtables for such scenario. Signed-off-by: Suren Baghdasaryan --- mm/internal.h | 2 +- mm/memory.c | 6 +- mm/mmap.c | 5 +++-- 3 files changed, 9 insertions(+), 4 deletions(-) diff --git a/mm/internal.h b/mm/internal.h index 08ce56dbb1d9..fce94775819c 100644 --- a/mm/internal.h +++ b/mm/internal.h @@ -105,7 +105,7 @@ void folio_activate(struct folio *folio); void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt, struct vm_area_struct *start_vma, unsigned long floor, - unsigned long ceiling); + unsigned long ceiling, bool mm_wr_locked); void pmd_install(struct mm_struct *mm, pmd_t *pmd, pgtable_t *pte); struct zap_details; diff --git a/mm/memory.c b/mm/memory.c index bfa3100ec5a3..f7f412833e42 100644 --- a/mm/memory.c +++ b/mm/memory.c @@ -348,7 +348,7 @@ void free_pgd_range(struct mmu_gather *tlb, void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt, struct vm_area_struct *vma, unsigned long floor, - unsigned long ceiling) + unsigned long ceiling, bool mm_wr_locked) { MA_STATE(mas, mt, vma->vm_end, vma->vm_end); @@ -366,6 +366,8 @@ void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt, * Hide vma from rmap and truncate_pagecache before freeing * pgtables */ + if (mm_wr_locked) + vma_start_write(vma); unlink_anon_vmas(vma); unlink_file_vma(vma); @@ -380,6 +382,8 @@ void free_pgtables(struct mmu_gather *tlb, struct maple_tree *mt, && !is_vm_hugetlb_page(next)) { vma = next; next = mas_find(&mas, ceiling - 1); + if (mm_wr_locked) + vma_start_write(vma); unlink_anon_vmas(vma); unlink_file_vma(vma); } diff --git a/mm/mmap.c b/mm/mmap.c index f7ed357056c4..ec745586785c 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2152,7 +2152,8 @@ static void unmap_region(struct mm_struct *mm, struct maple_tree *mt, update_hiwater_rss(mm); unmap_vmas(&tlb, mt, vma, start, end, mm_wr_locked); free_pgtables(&tlb, mt, vma, prev ? prev->vm_end : FIRST_USER_ADDRESS, -next ? next->vm_start : USER_PGTABLES_CEILING); +next ? next->vm_start : USER_PGTABLES_CEILING, +mm_wr_locked); tlb_finish_mmu(&tlb); } @@ -3056,7 +3057,7 @@ void exit_mmap(struct mm_struct *mm) mmap_write_lock(mm); mt_clear_in_rcu(&mm->mm_mt); free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS, - USER_PGTABLES_CEILING); + USER_PGTABLES_CEILING, true); tlb_finish_mmu(&tlb); /* -- 2.39.2.722.g9855ee24e9-goog
[PATCH v4 18/33] mm: write-lock VMAs before removing them from VMA tree
Write-locking VMAs before isolating them ensures that page fault handlers don't operate on isolated VMAs. Signed-off-by: Suren Baghdasaryan --- mm/mmap.c | 1 + mm/nommu.c | 5 + 2 files changed, 6 insertions(+) diff --git a/mm/mmap.c b/mm/mmap.c index 1f42b9a52b9b..f7ed357056c4 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2255,6 +2255,7 @@ int split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, static inline int munmap_sidetree(struct vm_area_struct *vma, struct ma_state *mas_detach) { + vma_start_write(vma); mas_set_range(mas_detach, vma->vm_start, vma->vm_end - 1); if (mas_store_gfp(mas_detach, vma, GFP_KERNEL)) return -ENOMEM; diff --git a/mm/nommu.c b/mm/nommu.c index 57ba243c6a37..2ab162d773e2 100644 --- a/mm/nommu.c +++ b/mm/nommu.c @@ -588,6 +588,7 @@ static int delete_vma_from_mm(struct vm_area_struct *vma) current->pid); return -ENOMEM; } + vma_start_write(vma); cleanup_vma_from_mm(vma); /* remove from the MM's tree and list */ @@ -1519,6 +1520,10 @@ void exit_mmap(struct mm_struct *mm) */ mmap_write_lock(mm); for_each_vma(vmi, vma) { + /* +* No need to lock VMA because this is the only mm user and no +* page fault handled can race with it. +*/ cleanup_vma_from_mm(vma); delete_vma(mm, vma); cond_resched(); -- 2.39.2.722.g9855ee24e9-goog
[PATCH v4 17/33] mm/mremap: write-lock VMA while remapping it to a new address range
Write-lock VMA as locked before copying it and when copy_vma produces a new VMA. Signed-off-by: Suren Baghdasaryan Reviewed-by: Laurent Dufour --- mm/mmap.c | 1 + mm/mremap.c | 1 + 2 files changed, 2 insertions(+) diff --git a/mm/mmap.c b/mm/mmap.c index e73fbb84ce12..1f42b9a52b9b 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -3189,6 +3189,7 @@ struct vm_area_struct *copy_vma(struct vm_area_struct **vmap, get_file(new_vma->vm_file); if (new_vma->vm_ops && new_vma->vm_ops->open) new_vma->vm_ops->open(new_vma); + vma_start_write(new_vma); if (vma_link(mm, new_vma)) goto out_vma_link; *need_rmap_locks = false; diff --git a/mm/mremap.c b/mm/mremap.c index 1ddf7beb62e9..327c38eb132e 100644 --- a/mm/mremap.c +++ b/mm/mremap.c @@ -623,6 +623,7 @@ static unsigned long move_vma(struct vm_area_struct *vma, return -ENOMEM; } + vma_start_write(vma); new_pgoff = vma->vm_pgoff + ((old_addr - vma->vm_start) >> PAGE_SHIFT); new_vma = copy_vma(&vma, new_addr, new_len, new_pgoff, &need_rmap_locks); -- 2.39.2.722.g9855ee24e9-goog
[PATCH v4 16/33] mm/mmap: write-lock VMAs in vma_prepare before modifying them
Write-lock all VMAs which might be affected by a merge, split, expand or shrink operations. All these operations use vma_prepare() before making the modifications, therefore it provides a centralized place to perform VMA locking. Signed-off-by: Suren Baghdasaryan --- mm/mmap.c | 10 ++ 1 file changed, 10 insertions(+) diff --git a/mm/mmap.c b/mm/mmap.c index 92893d86c0af..e73fbb84ce12 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -502,6 +502,16 @@ static inline void init_vma_prep(struct vma_prepare *vp, */ static inline void vma_prepare(struct vma_prepare *vp) { + if (vp->vma) + vma_start_write(vp->vma); + if (vp->adj_next) + vma_start_write(vp->adj_next); + /* vp->insert is always a newly created VMA, no need for locking */ + if (vp->remove) + vma_start_write(vp->remove); + if (vp->remove2) + vma_start_write(vp->remove2); + if (vp->file) { uprobe_munmap(vp->vma, vp->vma->vm_start, vp->vma->vm_end); -- 2.39.2.722.g9855ee24e9-goog
[PATCH v4 15/33] mm/khugepaged: write-lock VMA while collapsing a huge page
Protect VMA from concurrent page fault handler while collapsing a huge page. Page fault handler needs a stable PMD to use PTL and relies on per-VMA lock to prevent concurrent PMD changes. pmdp_collapse_flush(), set_huge_pmd() and collapse_and_free_pmd() can modify a PMD, which will not be detected by a page fault handler without proper locking. Before this patch, page tables can be walked under any one of the mmap_lock, the mapping lock, and the anon_vma lock; so when khugepaged unlinks and frees page tables, it must ensure that all of those either are locked or don't exist. This patch adds a fourth lock under which page tables can be traversed, and so khugepaged must also lock out that one. Signed-off-by: Suren Baghdasaryan --- mm/khugepaged.c | 5 + mm/rmap.c | 31 --- 2 files changed, 21 insertions(+), 15 deletions(-) diff --git a/mm/khugepaged.c b/mm/khugepaged.c index 941d1c7ea910..c64e01f03f27 100644 --- a/mm/khugepaged.c +++ b/mm/khugepaged.c @@ -1147,6 +1147,7 @@ static int collapse_huge_page(struct mm_struct *mm, unsigned long address, if (result != SCAN_SUCCEED) goto out_up_write; + vma_start_write(vma); anon_vma_lock_write(vma->anon_vma); mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, mm, address, @@ -1614,6 +1615,9 @@ int collapse_pte_mapped_thp(struct mm_struct *mm, unsigned long addr, goto drop_hpage; } + /* Lock the vma before taking i_mmap and page table locks */ + vma_start_write(vma); + /* * We need to lock the mapping so that from here on, only GUP-fast and * hardware page walks can access the parts of the page tables that @@ -1819,6 +1823,7 @@ static int retract_page_tables(struct address_space *mapping, pgoff_t pgoff, result = SCAN_PTE_UFFD_WP; goto unlock_next; } + vma_start_write(vma); collapse_and_free_pmd(mm, vma, addr, pmd); if (!cc->is_khugepaged && is_target) result = set_huge_pmd(vma, addr, pmd, hpage); diff --git a/mm/rmap.c b/mm/rmap.c index 8632e02661ac..cfdaa56cad3e 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -25,21 +25,22 @@ * mapping->invalidate_lock (in filemap_fault) * page->flags PG_locked (lock_page) * hugetlbfs_i_mmap_rwsem_key (in huge_pmd_share, see hugetlbfs below) - * mapping->i_mmap_rwsem - * anon_vma->rwsem - * mm->page_table_lock or pte_lock - * swap_lock (in swap_duplicate, swap_info_get) - * mmlist_lock (in mmput, drain_mmlist and others) - * mapping->private_lock (in block_dirty_folio) - * folio_lock_memcg move_lock (in block_dirty_folio) - * i_pages lock (widely used) - * lruvec->lru_lock (in folio_lruvec_lock_irq) - * inode->i_lock (in set_page_dirty's __mark_inode_dirty) - * bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty) - * sb_lock (within inode_lock in fs/fs-writeback.c) - * i_pages lock (widely used, in set_page_dirty, - * in arch-dependent flush_dcache_mmap_lock, - * within bdi.wb->list_lock in __sync_single_inode) + * vma_start_write + * mapping->i_mmap_rwsem + * anon_vma->rwsem + * mm->page_table_lock or pte_lock + * swap_lock (in swap_duplicate, swap_info_get) + * mmlist_lock (in mmput, drain_mmlist and others) + * mapping->private_lock (in block_dirty_folio) + * folio_lock_memcg move_lock (in block_dirty_folio) + * i_pages lock (widely used) + * lruvec->lru_lock (in folio_lruvec_lock_irq) + * inode->i_lock (in set_page_dirty's __mark_inode_dirty) + * bdi.wb->list_lock (in set_page_dirty's __mark_inode_dirty) + * sb_lock (within inode_lock in fs/fs-writeback.c) + * i_pages lock (widely used, in set_page_dirty, + * in arch-dependent flush_dcache_mmap_lock, + * within bdi.wb->list_lock in __sync_single_inode) * * 
anon_vma->rwsem,mapping->i_mmap_rwsem (memory_failure, collect_procs_anon) * ->tasklist_lock -- 2.39.2.722.g9855ee24e9-goog
[PATCH v4 14/33] mm/mmap: move vma_prepare before vma_adjust_trans_huge
vma_prepare() acquires all locks required before VMA modifications. Move vma_prepare() before vma_adjust_trans_huge() so that VMA is locked before any modification. Signed-off-by: Suren Baghdasaryan --- mm/mmap.c | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/mm/mmap.c b/mm/mmap.c index c234443ee24c..92893d86c0af 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -683,12 +683,12 @@ int vma_expand(struct vma_iterator *vmi, struct vm_area_struct *vma, if (vma_iter_prealloc(vmi)) goto nomem; + vma_prepare(&vp); vma_adjust_trans_huge(vma, start, end, 0); /* VMA iterator points to previous, so set to start if necessary */ if (vma_iter_addr(vmi) != start) vma_iter_set(vmi, start); - vma_prepare(&vp); vma->vm_start = start; vma->vm_end = end; vma->vm_pgoff = pgoff; @@ -723,8 +723,8 @@ int vma_shrink(struct vma_iterator *vmi, struct vm_area_struct *vma, return -ENOMEM; init_vma_prep(&vp, vma); - vma_adjust_trans_huge(vma, start, end, 0); vma_prepare(&vp); + vma_adjust_trans_huge(vma, start, end, 0); if (vma->vm_start < start) vma_iter_clear(vmi, vma->vm_start, start); @@ -994,12 +994,12 @@ struct vm_area_struct *vma_merge(struct vma_iterator *vmi, struct mm_struct *mm, if (vma_iter_prealloc(vmi)) return NULL; - vma_adjust_trans_huge(vma, vma_start, vma_end, adj_next); init_multi_vma_prep(&vp, vma, adjust, remove, remove2); VM_WARN_ON(vp.anon_vma && adjust && adjust->anon_vma && vp.anon_vma != adjust->anon_vma); vma_prepare(&vp); + vma_adjust_trans_huge(vma, vma_start, vma_end, adj_next); if (vma_start < vma->vm_start || vma_end > vma->vm_end) vma_expanded = true; @@ -2198,10 +2198,10 @@ int __split_vma(struct vma_iterator *vmi, struct vm_area_struct *vma, if (new->vm_ops && new->vm_ops->open) new->vm_ops->open(new); - vma_adjust_trans_huge(vma, vma->vm_start, addr, 0); init_vma_prep(&vp, vma); vp.insert = new; vma_prepare(&vp); + vma_adjust_trans_huge(vma, vma->vm_start, addr, 0); if (new_below) { vma->vm_start = addr; @@ -2910,9 +2910,9 @@ static int do_brk_flags(struct vma_iterator *vmi, struct vm_area_struct *vma, if (vma_iter_prealloc(vmi)) goto unacct_fail; - vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0); init_vma_prep(&vp, vma); vma_prepare(&vp); + vma_adjust_trans_huge(vma, vma->vm_start, addr + len, 0); vma->vm_end = addr + len; vm_flags_set(vma, VM_SOFTDIRTY); vma_iter_store(vmi, vma); -- 2.39.2.722.g9855ee24e9-goog
[PATCH v4 13/33] mm: mark VMA as being written when changing vm_flags
Updates to vm_flags have to be done with VMA marked as being written for preventing concurrent page faults or other modifications. Signed-off-by: Suren Baghdasaryan --- include/linux/mm.h | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/include/linux/mm.h b/include/linux/mm.h index bbad5d4fa81b..3d5e8666892d 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -726,28 +726,28 @@ static inline void vm_flags_init(struct vm_area_struct *vma, static inline void vm_flags_reset(struct vm_area_struct *vma, vm_flags_t flags) { - mmap_assert_write_locked(vma->vm_mm); + vma_start_write(vma); vm_flags_init(vma, flags); } static inline void vm_flags_reset_once(struct vm_area_struct *vma, vm_flags_t flags) { - mmap_assert_write_locked(vma->vm_mm); + vma_start_write(vma); WRITE_ONCE(ACCESS_PRIVATE(vma, __vm_flags), flags); } static inline void vm_flags_set(struct vm_area_struct *vma, vm_flags_t flags) { - mmap_assert_write_locked(vma->vm_mm); + vma_start_write(vma); ACCESS_PRIVATE(vma, __vm_flags) |= flags; } static inline void vm_flags_clear(struct vm_area_struct *vma, vm_flags_t flags) { - mmap_assert_write_locked(vma->vm_mm); + vma_start_write(vma); ACCESS_PRIVATE(vma, __vm_flags) &= ~flags; } @@ -768,7 +768,7 @@ static inline void __vm_flags_mod(struct vm_area_struct *vma, static inline void vm_flags_mod(struct vm_area_struct *vma, vm_flags_t set, vm_flags_t clear) { - mmap_assert_write_locked(vma->vm_mm); + vma_start_write(vma); __vm_flags_mod(vma, set, clear); } -- 2.39.2.722.g9855ee24e9-goog
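As an illustration of what this buys (not code from the patch): any flag update funneled through these wrappers now takes the per-VMA write lock as a side effect, so a concurrent lock_vma_under_rcu() reader either completes before the update or fails vma_start_read() and falls back to mmap_lock.

/* Illustration only: a VM_SOFTDIRTY update like the one in do_brk_flags() */
static void mark_vma_softdirty(struct vm_area_struct *vma)
{
	/*
	 * vm_flags_set() now expands to vma_start_write(vma) followed by
	 * the flag update, instead of merely asserting mmap_write_lock.
	 */
	vm_flags_set(vma, VM_SOFTDIRTY);
}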
[PATCH v4 12/33] mm: add per-VMA lock and helper functions to control it
Introduce per-VMA locking. The lock implementation relies on a per-vma and per-mm sequence counters to note exclusive locking: - read lock - (implemented by vma_start_read) requires the vma (vm_lock_seq) and mm (mm_lock_seq) sequence counters to differ. If they match then there must be a vma exclusive lock held somewhere. - read unlock - (implemented by vma_end_read) is a trivial vma->lock unlock. - write lock - (vma_start_write) requires the mmap_lock to be held exclusively and the current mm counter is assigned to the vma counter. This will allow multiple vmas to be locked under a single mmap_lock write lock (e.g. during vma merging). The vma counter is modified under exclusive vma lock. - write unlock - (vma_end_write_all) is a batch release of all vma locks held. It doesn't pair with a specific vma_start_write! It is done before exclusive mmap_lock is released by incrementing mm sequence counter (mm_lock_seq). - write downgrade - if the mmap_lock is downgraded to the read lock, all vma write locks are released as well (effectivelly same as write unlock). Signed-off-by: Suren Baghdasaryan --- include/linux/mm.h| 82 +++ include/linux/mm_types.h | 8 include/linux/mmap_lock.h | 13 +++ kernel/fork.c | 4 ++ mm/init-mm.c | 3 ++ 5 files changed, 110 insertions(+) diff --git a/include/linux/mm.h b/include/linux/mm.h index 1f79667824eb..bbad5d4fa81b 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -623,6 +623,87 @@ struct vm_operations_struct { unsigned long addr); }; +#ifdef CONFIG_PER_VMA_LOCK +static inline void vma_init_lock(struct vm_area_struct *vma) +{ + init_rwsem(&vma->lock); + vma->vm_lock_seq = -1; +} + +/* + * Try to read-lock a vma. The function is allowed to occasionally yield false + * locked result to avoid performance overhead, in which case we fall back to + * using mmap_lock. The function should never yield false unlocked result. + */ +static inline bool vma_start_read(struct vm_area_struct *vma) +{ + /* Check before locking. A race might cause false locked result. */ + if (vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq)) + return false; + + if (unlikely(down_read_trylock(&vma->lock) == 0)) + return false; + + /* +* Overflow might produce false locked result. +* False unlocked result is impossible because we modify and check +* vma->vm_lock_seq under vma->lock protection and mm->mm_lock_seq +* modification invalidates all existing locks. +*/ + if (unlikely(vma->vm_lock_seq == READ_ONCE(vma->vm_mm->mm_lock_seq))) { + up_read(&vma->lock); + return false; + } + return true; +} + +static inline void vma_end_read(struct vm_area_struct *vma) +{ + rcu_read_lock(); /* keeps vma alive till the end of up_read */ + up_read(&vma->lock); + rcu_read_unlock(); +} + +static inline void vma_start_write(struct vm_area_struct *vma) +{ + int mm_lock_seq; + + mmap_assert_write_locked(vma->vm_mm); + + /* +* current task is holding mmap_write_lock, both vma->vm_lock_seq and +* mm->mm_lock_seq can't be concurrently modified. +*/ + mm_lock_seq = READ_ONCE(vma->vm_mm->mm_lock_seq); + if (vma->vm_lock_seq == mm_lock_seq) + return; + + down_write(&vma->lock); + vma->vm_lock_seq = mm_lock_seq; + up_write(&vma->lock); +} + +static inline void vma_assert_write_locked(struct vm_area_struct *vma) +{ + mmap_assert_write_locked(vma->vm_mm); + /* +* current task is holding mmap_write_lock, both vma->vm_lock_seq and +* mm->mm_lock_seq can't be concurrently modified. 
+*/ + VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), vma); +} + +#else /* CONFIG_PER_VMA_LOCK */ + +static inline void vma_init_lock(struct vm_area_struct *vma) {} +static inline bool vma_start_read(struct vm_area_struct *vma) + { return false; } +static inline void vma_end_read(struct vm_area_struct *vma) {} +static inline void vma_start_write(struct vm_area_struct *vma) {} +static inline void vma_assert_write_locked(struct vm_area_struct *vma) {} + +#endif /* CONFIG_PER_VMA_LOCK */ + static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm) { static const struct vm_operations_struct dummy_vm_ops = {}; @@ -631,6 +712,7 @@ static inline void vma_init(struct vm_area_struct *vma, struct mm_struct *mm) vma->vm_mm = mm; vma->vm_ops = &dummy_vm_ops; INIT_LIST_HEAD(&vma->anon_vma_chain); + vma_init_lock(vma); } /* Use when VMA is
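A hedged sketch of the read-side calling convention these helpers are designed for; example_fault_locked() is a made-up name, and returning VM_FAULT_RETRY to mean "retry under mmap_lock" is only this sketch's convention (the real lookup helper, lock_vma_under_rcu(), and the FAULT_FLAG_VMA_LOCK plumbing arrive later in the series):

static vm_fault_t example_fault_locked(struct vm_area_struct *vma,
				       unsigned long addr, unsigned int flags,
				       struct pt_regs *regs)
{
	vm_fault_t ret;

	/* false means a writer holds, or raced with, the VMA lock */
	if (!vma_start_read(vma))
		return VM_FAULT_RETRY;	/* caller falls back to the mmap_lock path */

	ret = handle_mm_fault(vma, addr, flags, regs);
	vma_end_read(vma);

	return ret;
}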
[PATCH v4 11/33] mm: move mmap_lock assert function definitions
Move mmap_lock assert function definitions up so that they can be used by other mmap_lock routines. Signed-off-by: Suren Baghdasaryan --- include/linux/mmap_lock.h | 24 1 file changed, 12 insertions(+), 12 deletions(-) diff --git a/include/linux/mmap_lock.h b/include/linux/mmap_lock.h index 96e113e23d04..e49ba91bb1f0 100644 --- a/include/linux/mmap_lock.h +++ b/include/linux/mmap_lock.h @@ -60,6 +60,18 @@ static inline void __mmap_lock_trace_released(struct mm_struct *mm, bool write) #endif /* CONFIG_TRACING */ +static inline void mmap_assert_locked(struct mm_struct *mm) +{ + lockdep_assert_held(&mm->mmap_lock); + VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm); +} + +static inline void mmap_assert_write_locked(struct mm_struct *mm) +{ + lockdep_assert_held_write(&mm->mmap_lock); + VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm); +} + static inline void mmap_init_lock(struct mm_struct *mm) { init_rwsem(&mm->mmap_lock); @@ -150,18 +162,6 @@ static inline void mmap_read_unlock_non_owner(struct mm_struct *mm) up_read_non_owner(&mm->mmap_lock); } -static inline void mmap_assert_locked(struct mm_struct *mm) -{ - lockdep_assert_held(&mm->mmap_lock); - VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm); -} - -static inline void mmap_assert_write_locked(struct mm_struct *mm) -{ - lockdep_assert_held_write(&mm->mmap_lock); - VM_BUG_ON_MM(!rwsem_is_locked(&mm->mmap_lock), mm); -} - static inline int mmap_lock_is_contended(struct mm_struct *mm) { return rwsem_is_contended(&mm->mmap_lock); -- 2.39.2.722.g9855ee24e9-goog
[PATCH v4 10/33] mm: rcu safe VMA freeing
From: Michel Lespinasse This prepares for page faults handling under VMA lock, looking up VMAs under protection of an rcu read lock, instead of the usual mmap read lock. Signed-off-by: Michel Lespinasse Signed-off-by: Suren Baghdasaryan --- include/linux/mm_types.h | 13 ++--- kernel/fork.c| 20 +++- 2 files changed, 29 insertions(+), 4 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 22b2ac82bffd..64a6b3f6b74f 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -480,9 +480,16 @@ struct anon_vma_name { struct vm_area_struct { /* The first cache line has the info for VMA tree walking. */ - unsigned long vm_start; /* Our start address within vm_mm. */ - unsigned long vm_end; /* The first byte after our end address - within vm_mm. */ + union { + struct { + /* VMA covers [vm_start; vm_end) addresses within mm */ + unsigned long vm_start; + unsigned long vm_end; + }; +#ifdef CONFIG_PER_VMA_LOCK + struct rcu_head vm_rcu; /* Used for deferred freeing. */ +#endif + }; struct mm_struct *vm_mm;/* The address space we belong to. */ diff --git a/kernel/fork.c b/kernel/fork.c index abfcf95734c7..a63b739aeca9 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -479,12 +479,30 @@ struct vm_area_struct *vm_area_dup(struct vm_area_struct *orig) return new; } -void vm_area_free(struct vm_area_struct *vma) +static void __vm_area_free(struct vm_area_struct *vma) { free_anon_vma_name(vma); kmem_cache_free(vm_area_cachep, vma); } +#ifdef CONFIG_PER_VMA_LOCK +static void vm_area_free_rcu_cb(struct rcu_head *head) +{ + struct vm_area_struct *vma = container_of(head, struct vm_area_struct, + vm_rcu); + __vm_area_free(vma); +} +#endif + +void vm_area_free(struct vm_area_struct *vma) +{ +#ifdef CONFIG_PER_VMA_LOCK + call_rcu(&vma->vm_rcu, vm_area_free_rcu_cb); +#else + __vm_area_free(vma); +#endif +} + static void account_kernel_stack(struct task_struct *tsk, int account) { if (IS_ENABLED(CONFIG_VMAP_STACK)) { -- 2.39.2.722.g9855ee24e9-goog
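A minimal sketch (example_peek_vm_start() is a made-up name) of the reader-side guarantee this buys: a VMA found under rcu_read_lock() cannot be freed out from under the reader, even if a racing unmap has already taken it out of the tree, because vm_area_free() now defers the actual free past a grace period:

static unsigned long example_peek_vm_start(struct mm_struct *mm,
					   unsigned long addr)
{
	MA_STATE(mas, &mm->mm_mt, addr, addr);
	struct vm_area_struct *vma;
	unsigned long start = 0;

	rcu_read_lock();
	vma = mas_walk(&mas);		/* lock-free maple tree lookup */
	if (vma)
		start = vma->vm_start;	/* may be stale, but the object is still valid memory */
	rcu_read_unlock();

	return start;
}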
[PATCH v4 09/33] mm: introduce CONFIG_PER_VMA_LOCK
This configuration variable will be used to build the support for VMA
locking during page fault handling.

This is enabled on supported architectures with SMP and MMU set.

The architecture support is needed since the page fault handler is called
from the architecture's page faulting code which needs modifications to
handle faults under VMA lock.

Signed-off-by: Suren Baghdasaryan
---
 mm/Kconfig | 12
 1 file changed, 12 insertions(+)

diff --git a/mm/Kconfig b/mm/Kconfig
index ca98b2072df5..2e4a7e61768a 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -1211,6 +1211,18 @@ config LRU_GEN_STATS
 	  This option has a per-memcg and per-node memory overhead.
 # }
 
+config ARCH_SUPPORTS_PER_VMA_LOCK
+	def_bool n
+
+config PER_VMA_LOCK
+	def_bool y
+	depends on ARCH_SUPPORTS_PER_VMA_LOCK && MMU && SMP
+	help
+	  Allow per-vma locking during page fault handling.
+
+	  This feature allows locking each virtual memory area separately when
+	  handling page faults instead of taking mmap_lock.
+
 source "mm/damon/Kconfig"
 
 endmenu
-- 
2.39.2.722.g9855ee24e9-goog
[PATCH v4 08/33] mm: Enable maple tree RCU mode by default.
From: "Liam R. Howlett" Use the maple tree in RCU mode for VMA tracking. This is necessary for the use of per-VMA locking. RCU mode is enabled by default but disabled when exiting an mm and for the new tree during a fork. Also enable RCU for the tree used in munmap operations to ensure the nodes remain valid for readers. Signed-off-by: Liam R. Howlett Signed-off-by: Suren Baghdasaryan --- include/linux/mm_types.h | 3 ++- kernel/fork.c| 3 +++ mm/mmap.c| 4 +++- 3 files changed, 8 insertions(+), 2 deletions(-) diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index 417d25c6a262..22b2ac82bffd 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -779,7 +779,8 @@ struct mm_struct { unsigned long cpu_bitmap[]; }; -#define MM_MT_FLAGS(MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN) +#define MM_MT_FLAGS(MT_FLAGS_ALLOC_RANGE | MT_FLAGS_LOCK_EXTERN | \ +MT_FLAGS_USE_RCU) extern struct mm_struct init_mm; /* Pointer magic because the dynamic array size confuses some compilers. */ diff --git a/kernel/fork.c b/kernel/fork.c index 0cbfdc4b509e..abfcf95734c7 100644 --- a/kernel/fork.c +++ b/kernel/fork.c @@ -617,6 +617,7 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, if (retval) goto out; + mt_clear_in_rcu(vmi.mas.tree); for_each_vma(old_vmi, mpnt) { struct file *file; @@ -700,6 +701,8 @@ static __latent_entropy int dup_mmap(struct mm_struct *mm, retval = arch_dup_mmap(oldmm, mm); loop_out: vma_iter_free(&vmi); + if (!retval) + mt_set_in_rcu(vmi.mas.tree); out: mmap_write_unlock(mm); flush_tlb_mm(oldmm); diff --git a/mm/mmap.c b/mm/mmap.c index 740b54be3ed4..c234443ee24c 100644 --- a/mm/mmap.c +++ b/mm/mmap.c @@ -2277,7 +2277,8 @@ do_vmi_align_munmap(struct vma_iterator *vmi, struct vm_area_struct *vma, int count = 0; int error = -ENOMEM; MA_STATE(mas_detach, &mt_detach, 0, 0); - mt_init_flags(&mt_detach, MT_FLAGS_LOCK_EXTERN); + mt_init_flags(&mt_detach, vmi->mas.tree->ma_flags & + (MT_FLAGS_LOCK_MASK | MT_FLAGS_USE_RCU)); mt_set_external_lock(&mt_detach, &mm->mmap_lock); /* @@ -3042,6 +3043,7 @@ void exit_mmap(struct mm_struct *mm) */ set_bit(MMF_OOM_SKIP, &mm->flags); mmap_write_lock(mm); + mt_clear_in_rcu(&mm->mm_mt); free_pgtables(&tlb, &mm->mm_mt, vma, FIRST_USER_ADDRESS, USER_PGTABLES_CEILING); tlb_finish_mmu(&tlb); -- 2.39.2.722.g9855ee24e9-goog
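As a hedged aside, the same mode can be requested for any maple tree at initialization time; for mm_mt it is simply folded into MM_MT_FLAGS above:

struct maple_tree example_tree;

static void example_tree_init(void)
{
	/* readers may walk example_tree under rcu_read_lock() alone */
	mt_init_flags(&example_tree, MT_FLAGS_USE_RCU);
}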
[PATCH v4 07/33] maple_tree: Add RCU lock checking to rcu callback functions
From: "Liam R. Howlett" Dereferencing RCU objects within the RCU callback without the RCU check has caused lockdep to complain. Fix the RCU dereferencing by using the RCU callback lock to ensure the operation is safe. Also stop creating a new lock to use for dereferencing during destruction of the tree or subtree. Instead, pass through a pointer to the tree that has the lock that is held for RCU dereferencing checking. It also does not make sense to use the maple state in the freeing scenario as the tree walk is a special case where the tree no longer has the normal encodings and parent pointers. Fixes: 54a611b60590 ("Maple Tree: add new data structure") Reported-by: Suren Baghdasaryan Signed-off-by: Liam R. Howlett --- lib/maple_tree.c | 188 --- 1 file changed, 96 insertions(+), 92 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 8ad2d1669fad..2be86368237d 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -824,6 +824,11 @@ static inline void *mt_slot(const struct maple_tree *mt, return rcu_dereference_check(slots[offset], mt_locked(mt)); } +static inline void *mt_slot_locked(struct maple_tree *mt, void __rcu **slots, + unsigned char offset) +{ + return rcu_dereference_protected(slots[offset], mt_locked(mt)); +} /* * mas_slot_locked() - Get the slot value when holding the maple tree lock. * @mas: The maple state @@ -835,7 +840,7 @@ static inline void *mt_slot(const struct maple_tree *mt, static inline void *mas_slot_locked(struct ma_state *mas, void __rcu **slots, unsigned char offset) { - return rcu_dereference_protected(slots[offset], mt_locked(mas->tree)); + return mt_slot_locked(mas->tree, slots, offset); } /* @@ -907,34 +912,35 @@ static inline void ma_set_meta(struct maple_node *mn, enum maple_type mt, } /* - * mas_clear_meta() - clear the metadata information of a node, if it exists - * @mas: The maple state + * mt_clear_meta() - clear the metadata information of a node, if it exists + * @mt: The maple tree * @mn: The maple node - * @mt: The maple node type + * @type: The maple node type * @offset: The offset of the highest sub-gap in this node. * @end: The end of the data in this node. */ -static inline void mas_clear_meta(struct ma_state *mas, struct maple_node *mn, - enum maple_type mt) +static inline void mt_clear_meta(struct maple_tree *mt, struct maple_node *mn, + enum maple_type type) { struct maple_metadata *meta; unsigned long *pivots; void __rcu **slots; void *next; - switch (mt) { + switch (type) { case maple_range_64: pivots = mn->mr64.pivot; if (unlikely(pivots[MAPLE_RANGE64_SLOTS - 2])) { slots = mn->mr64.slot; - next = mas_slot_locked(mas, slots, - MAPLE_RANGE64_SLOTS - 1); - if (unlikely((mte_to_node(next) && mte_node_type(next - return; /* The last slot is a node, no metadata */ + next = mt_slot_locked(mt, slots, + MAPLE_RANGE64_SLOTS - 1); + if (unlikely((mte_to_node(next) && + mte_node_type(next + return; /* no metadata, could be node */ } fallthrough; case maple_arange_64: - meta = ma_meta(mn, mt); + meta = ma_meta(mn, type); break; default: return; @@ -5497,7 +5503,7 @@ static inline int mas_rev_alloc(struct ma_state *mas, unsigned long min, } /* - * mas_dead_leaves() - Mark all leaves of a node as dead. + * mte_dead_leaves() - Mark all leaves of a node as dead. * @mas: The maple state * @slots: Pointer to the slot array * @type: The maple node type @@ -5507,16 +5513,16 @@ static inline int mas_rev_alloc(struct ma_state *mas, unsigned long min, * Return: The number of leaves marked as dead. 
*/ static inline -unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots, - enum maple_type mt) +unsigned char mte_dead_leaves(struct maple_enode *enode, struct maple_tree *mt, + void __rcu **slots) { struct maple_node *node; enum maple_type type; void *entry; int offset; - for (offset = 0; offset < mt_slots[mt]; offset++) { - entry = mas_slot_locked(mas, slots, offset); + for (offset = 0; offset < mt_slot_count(enode); offset++) { + entry = mt_slot(mt, slots, offset); type = mte_node
[PATCH v4 06/33] maple_tree: Add smp_rmb() to dead node detection
From: "Liam R. Howlett"

Add an smp_rmb() before reading the parent pointer to ensure that
anything read from the node prior to the parent pointer hasn't been
reordered ahead of this check. This is necessary for RCU mode.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam R. Howlett
Signed-off-by: Suren Baghdasaryan
---
 lib/maple_tree.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 6b6eddadd9d2..8ad2d1669fad 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -539,9 +539,11 @@ static inline struct maple_node *mte_parent(const struct maple_enode *enode)
  */
 static inline bool ma_dead_node(const struct maple_node *node)
 {
-	struct maple_node *parent = (void *)((unsigned long)
-				node->parent & ~MAPLE_NODE_MASK);
+	struct maple_node *parent;
 
+	/* Do not reorder reads from the node prior to the parent check */
+	smp_rmb();
+	parent = (void *)((unsigned long) node->parent & ~MAPLE_NODE_MASK);
 	return (parent == node);
 }
 
@@ -556,6 +558,8 @@ static inline bool mte_dead_node(const struct maple_enode *enode)
 	struct maple_node *parent, *node;
 
 	node = mte_to_node(enode);
+	/* Do not reorder reads from the node prior to the parent check */
+	smp_rmb();
 	parent = mte_parent(enode);
 	return (parent == node);
 }
-- 
2.39.2.722.g9855ee24e9-goog
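The barrier pairing being relied on, sketched as a comment (the writer side is mte_set_node_dead(), which points node->parent back at the node and then issues smp_wmb()):

/*
 *	writer (marking a node dead)		reader (walking the tree)
 *	----------------------------		-------------------------
 *	node->parent = node;			data = reads from the node;
 *	smp_wmb();				smp_rmb();
 *	reuse/queue the node			if (ma_dead_node(node))
 *							discard data and retry;
 *
 * Without the smp_rmb(), the reads that produced 'data' could effectively
 * happen after the dead-node check, so the walk could keep using contents
 * of a node that had already been marked dead and reused.
 */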
[PATCH v4 05/33] maple_tree: Fix write memory barrier of nodes once dead for RCU mode
From: "Liam R. Howlett" During the development of the maple tree, the strategy of freeing multiple nodes changed and, in the process, the pivots were reused to store pointers to dead nodes. To ensure the readers see accurate pivots, the writers need to mark the nodes as dead and call smp_wmb() to ensure any readers can identify the node as dead before using the pivot values. There were two places where the old method of marking the node as dead without smp_wmb() were being used, which resulted in RCU readers seeing the wrong pivot value before seeing the node was dead. Fix this race condition by using mte_set_node_dead() which has the smp_wmb() call to ensure the race is closed. Add a WARN_ON() to the ma_free_rcu() call to ensure all nodes being freed are marked as dead to ensure there are no other call paths besides the two updated paths. This is necessary for the RCU mode of the maple tree. Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam R. Howlett Signed-off-by: Suren Baghdasaryan --- lib/maple_tree.c | 7 +-- tools/testing/radix-tree/maple.c | 16 2 files changed, 21 insertions(+), 2 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 3d5ab02f981a..6b6eddadd9d2 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -185,7 +185,7 @@ static void mt_free_rcu(struct rcu_head *head) */ static void ma_free_rcu(struct maple_node *node) { - node->parent = ma_parent_ptr(node); + WARN_ON(node->parent != ma_parent_ptr(node)); call_rcu(&node->rcu, mt_free_rcu); } @@ -1778,8 +1778,10 @@ static inline void mas_replace(struct ma_state *mas, bool advanced) rcu_assign_pointer(slots[offset], mas->node); } - if (!advanced) + if (!advanced) { + mte_set_node_dead(old_enode); mas_free(mas, old_enode); + } } /* @@ -4218,6 +4220,7 @@ static inline bool mas_wr_node_store(struct ma_wr_state *wr_mas) done: mas_leaf_set_meta(mas, newnode, dst_pivots, maple_leaf_64, new_end); if (in_rcu) { + mte_set_node_dead(mas->node); mas->node = mt_mk_node(newnode, wr_mas->type); mas_replace(mas, false); } else { diff --git a/tools/testing/radix-tree/maple.c b/tools/testing/radix-tree/maple.c index 958ee9bdb316..4c89ff333f6f 100644 --- a/tools/testing/radix-tree/maple.c +++ b/tools/testing/radix-tree/maple.c @@ -108,6 +108,7 @@ static noinline void check_new_node(struct maple_tree *mt) MT_BUG_ON(mt, mn->slot[1] != NULL); MT_BUG_ON(mt, mas_allocated(&mas) != 0); + mn->parent = ma_parent_ptr(mn); ma_free_rcu(mn); mas.node = MAS_START; mas_nomem(&mas, GFP_KERNEL); @@ -160,6 +161,7 @@ static noinline void check_new_node(struct maple_tree *mt) MT_BUG_ON(mt, mas_allocated(&mas) != i); MT_BUG_ON(mt, !mn); MT_BUG_ON(mt, not_empty(mn)); + mn->parent = ma_parent_ptr(mn); ma_free_rcu(mn); } @@ -192,6 +194,7 @@ static noinline void check_new_node(struct maple_tree *mt) MT_BUG_ON(mt, not_empty(mn)); MT_BUG_ON(mt, mas_allocated(&mas) != i - 1); MT_BUG_ON(mt, !mn); + mn->parent = ma_parent_ptr(mn); ma_free_rcu(mn); } @@ -210,6 +213,7 @@ static noinline void check_new_node(struct maple_tree *mt) mn = mas_pop_node(&mas); MT_BUG_ON(mt, not_empty(mn)); MT_BUG_ON(mt, mas_allocated(&mas) != j - 1); + mn->parent = ma_parent_ptr(mn); ma_free_rcu(mn); } MT_BUG_ON(mt, mas_allocated(&mas) != 0); @@ -233,6 +237,7 @@ static noinline void check_new_node(struct maple_tree *mt) MT_BUG_ON(mt, mas_allocated(&mas) != i - j); mn = mas_pop_node(&mas); MT_BUG_ON(mt, not_empty(mn)); + mn->parent = ma_parent_ptr(mn); ma_free_rcu(mn); MT_BUG_ON(mt, mas_allocated(&mas) != i - j - 1); } @@ -269,6 +274,7 @@ static noinline 
void check_new_node(struct maple_tree *mt) mn = mas_pop_node(&mas); /* get the next node. */ MT_BUG_ON(mt, mn == NULL); MT_BUG_ON(mt, not_empty(mn)); + mn->parent = ma_parent_ptr(mn); ma_free_rcu(mn); } MT_BUG_ON(mt, mas_allocated(&mas) != 0); @@ -294,6 +300,7 @@ static noinline void check_new_node(struct maple_tree *mt) mn = mas_pop_node(&mas2); /* get the next node. */
[PATCH v4 04/33] maple_tree: remove extra smp_wmb() from mas_dead_leaves()
From: Liam Howlett

The call to mte_set_node_dead() just before this smp_wmb() already issues
an smp_wmb() of its own, so this one is not needed. This is an
optimization for the RCU mode of the maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett
Signed-off-by: Suren Baghdasaryan
---
 lib/maple_tree.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index 44d6ce30b28e..3d5ab02f981a 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -5517,7 +5517,6 @@ unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots,
 			break;
 
 		mte_set_node_dead(entry);
-		smp_wmb(); /* Needed for RCU */
 		node->type = type;
 		rcu_assign_pointer(slots[offset], node);
 	}
-- 
2.39.2.722.g9855ee24e9-goog
[PATCH v4 03/33] maple_tree: Fix freeing of nodes in rcu mode
From: Liam Howlett The walk to destroy the nodes was not always setting the node type and would result in a destroy method potentially using the values as nodes. Avoid this by setting the correct node types. This is necessary for the RCU mode of the maple tree. Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam Howlett Signed-off-by: Suren Baghdasaryan --- lib/maple_tree.c | 73 1 file changed, 62 insertions(+), 11 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 089cd76ec379..44d6ce30b28e 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -902,6 +902,44 @@ static inline void ma_set_meta(struct maple_node *mn, enum maple_type mt, meta->end = end; } +/* + * mas_clear_meta() - clear the metadata information of a node, if it exists + * @mas: The maple state + * @mn: The maple node + * @mt: The maple node type + * @offset: The offset of the highest sub-gap in this node. + * @end: The end of the data in this node. + */ +static inline void mas_clear_meta(struct ma_state *mas, struct maple_node *mn, + enum maple_type mt) +{ + struct maple_metadata *meta; + unsigned long *pivots; + void __rcu **slots; + void *next; + + switch (mt) { + case maple_range_64: + pivots = mn->mr64.pivot; + if (unlikely(pivots[MAPLE_RANGE64_SLOTS - 2])) { + slots = mn->mr64.slot; + next = mas_slot_locked(mas, slots, + MAPLE_RANGE64_SLOTS - 1); + if (unlikely((mte_to_node(next) && mte_node_type(next + return; /* The last slot is a node, no metadata */ + } + fallthrough; + case maple_arange_64: + meta = ma_meta(mn, mt); + break; + default: + return; + } + + meta->gap = 0; + meta->end = 0; +} + /* * ma_meta_end() - Get the data end of a node from the metadata * @mn: The maple node @@ -5455,20 +5493,22 @@ static inline int mas_rev_alloc(struct ma_state *mas, unsigned long min, * mas_dead_leaves() - Mark all leaves of a node as dead. * @mas: The maple state * @slots: Pointer to the slot array + * @type: The maple node type * * Must hold the write lock. * * Return: The number of leaves marked as dead. 
*/ static inline -unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots) +unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots, + enum maple_type mt) { struct maple_node *node; enum maple_type type; void *entry; int offset; - for (offset = 0; offset < mt_slot_count(mas->node); offset++) { + for (offset = 0; offset < mt_slots[mt]; offset++) { entry = mas_slot_locked(mas, slots, offset); type = mte_node_type(entry); node = mte_to_node(entry); @@ -5487,14 +5527,13 @@ unsigned char mas_dead_leaves(struct ma_state *mas, void __rcu **slots) static void __rcu **mas_dead_walk(struct ma_state *mas, unsigned char offset) { - struct maple_node *node, *next; + struct maple_node *next; void __rcu **slots = NULL; next = mas_mn(mas); do { - mas->node = ma_enode_ptr(next); - node = mas_mn(mas); - slots = ma_slots(node, node->type); + mas->node = mt_mk_node(next, next->type); + slots = ma_slots(next, next->type); next = mas_slot_locked(mas, slots, offset); offset = 0; } while (!ma_is_leaf(next->type)); @@ -5558,11 +5597,14 @@ static inline void __rcu **mas_destroy_descend(struct ma_state *mas, node = mas_mn(mas); slots = ma_slots(node, mte_node_type(mas->node)); next = mas_slot_locked(mas, slots, 0); - if ((mte_dead_node(next))) + if ((mte_dead_node(next))) { + mte_to_node(next)->type = mte_node_type(next); next = mas_slot_locked(mas, slots, 1); + } mte_set_node_dead(mas->node); node->type = mte_node_type(mas->node); + mas_clear_meta(mas, node, node->type); node->piv_parent = prev; node->parent_slot = offset; offset = 0; @@ -5582,13 +5624,18 @@ static void mt_destroy_walk(struct maple_enode *enode, unsigned char ma_flags, MA_STATE(mas, &mt, 0, 0); - if (mte_is_leaf(enode)) + mas.node = enode; + if (mte_is_leaf(enode)) { + node->type = mte_node_type(enode); goto free_leaf; + } + m
[PATCH v4 02/33] maple_tree: Detect dead nodes in mas_start()
From: Liam Howlett

When initially starting a search, the root node may already be in the
process of being replaced in RCU mode. Detect and restart the walk if
this is the case. This is necessary for RCU mode of the maple tree.

Fixes: 54a611b60590 ("Maple Tree: add new data structure")
Signed-off-by: Liam Howlett
Signed-off-by: Suren Baghdasaryan
---
 lib/maple_tree.c | 4
 1 file changed, 4 insertions(+)

diff --git a/lib/maple_tree.c b/lib/maple_tree.c
index cc356b8369ad..089cd76ec379 100644
--- a/lib/maple_tree.c
+++ b/lib/maple_tree.c
@@ -1360,12 +1360,16 @@ static inline struct maple_enode *mas_start(struct ma_state *mas)
 		mas->max = ULONG_MAX;
 		mas->depth = 0;
 
+retry:
 		root = mas_root(mas);
 		/* Tree with nodes */
 		if (likely(xa_is_node(root))) {
 			mas->depth = 1;
 			mas->node = mte_safe_root(root);
 			mas->offset = 0;
+			if (mte_dead_node(mas->node))
+				goto retry;
+
 			return NULL;
 		}
-- 
2.39.2.722.g9855ee24e9-goog
[PATCH v4 01/33] maple_tree: Be more cautious about dead nodes
From: Liam Howlett ma_pivots() and ma_data_end() may be called with a dead node. Ensure to that the node isn't dead before using the returned values. This is necessary for RCU mode of the maple tree. Fixes: 54a611b60590 ("Maple Tree: add new data structure") Signed-off-by: Liam Howlett Signed-off-by: Suren Baghdasaryan --- lib/maple_tree.c | 52 +++- 1 file changed, 43 insertions(+), 9 deletions(-) diff --git a/lib/maple_tree.c b/lib/maple_tree.c index 646297cae5d1..cc356b8369ad 100644 --- a/lib/maple_tree.c +++ b/lib/maple_tree.c @@ -544,6 +544,7 @@ static inline bool ma_dead_node(const struct maple_node *node) return (parent == node); } + /* * mte_dead_node() - check if the @enode is dead. * @enode: The encoded maple node @@ -625,6 +626,8 @@ static inline unsigned int mas_alloc_req(const struct ma_state *mas) * @node - the maple node * @type - the node type * + * In the event of a dead node, this array may be %NULL + * * Return: A pointer to the maple node pivots */ static inline unsigned long *ma_pivots(struct maple_node *node, @@ -1096,8 +1099,11 @@ static int mas_ascend(struct ma_state *mas) a_type = mas_parent_enum(mas, p_enode); a_node = mte_parent(p_enode); a_slot = mte_parent_slot(p_enode); - pivots = ma_pivots(a_node, a_type); a_enode = mt_mk_node(a_node, a_type); + pivots = ma_pivots(a_node, a_type); + + if (unlikely(ma_dead_node(a_node))) + return 1; if (!set_min && a_slot) { set_min = true; @@ -1401,6 +1407,9 @@ static inline unsigned char ma_data_end(struct maple_node *node, { unsigned char offset; + if (!pivots) + return 0; + if (type == maple_arange_64) return ma_meta_end(node, type); @@ -1436,6 +1445,9 @@ static inline unsigned char mas_data_end(struct ma_state *mas) return ma_meta_end(node, type); pivots = ma_pivots(node, type); + if (unlikely(ma_dead_node(node))) + return 0; + offset = mt_pivots[type] - 1; if (likely(!pivots[offset])) return ma_meta_end(node, type); @@ -4505,6 +4517,9 @@ static inline int mas_prev_node(struct ma_state *mas, unsigned long min) node = mas_mn(mas); slots = ma_slots(node, mt); pivots = ma_pivots(node, mt); + if (unlikely(ma_dead_node(node))) + return 1; + mas->max = pivots[offset]; if (offset) mas->min = pivots[offset - 1] + 1; @@ -4526,6 +4541,9 @@ static inline int mas_prev_node(struct ma_state *mas, unsigned long min) slots = ma_slots(node, mt); pivots = ma_pivots(node, mt); offset = ma_data_end(node, mt, pivots, mas->max); + if (unlikely(ma_dead_node(node))) + return 1; + if (offset) mas->min = pivots[offset - 1] + 1; @@ -4574,6 +4592,7 @@ static inline int mas_next_node(struct ma_state *mas, struct maple_node *node, struct maple_enode *enode; int level = 0; unsigned char offset; + unsigned char node_end; enum maple_type mt; void __rcu **slots; @@ -4597,7 +4616,11 @@ static inline int mas_next_node(struct ma_state *mas, struct maple_node *node, node = mas_mn(mas); mt = mte_node_type(mas->node); pivots = ma_pivots(node, mt); - } while (unlikely(offset == ma_data_end(node, mt, pivots, mas->max))); + node_end = ma_data_end(node, mt, pivots, mas->max); + if (unlikely(ma_dead_node(node))) + return 1; + + } while (unlikely(offset == node_end)); slots = ma_slots(node, mt); pivot = mas_safe_pivot(mas, pivots, ++offset, mt); @@ -4613,6 +4636,9 @@ static inline int mas_next_node(struct ma_state *mas, struct maple_node *node, mt = mte_node_type(mas->node); slots = ma_slots(node, mt); pivots = ma_pivots(node, mt); + if (unlikely(ma_dead_node(node))) + return 1; + offset = 0; pivot = pivots[0]; } @@ -4659,11 +4685,14 @@ static inline void 
*mas_next_nentry(struct ma_state *mas, return NULL; } - pivots = ma_pivots(node, type); slots = ma_slots(node, type); - mas->index = mas_safe_min(mas, pivots, mas->offset); + pivots = ma_pivots(node, type); count = ma_data_end(node, type, pivots, mas->max); - if (ma_dead_node(node)) + if (unlikely(ma_dead_node(node))) + return NULL; + + mas->index = mas_safe_min(mas, pivots, mas->offset); +
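Distilled, the pattern patch 01 applies at each call site looks like this (a hedged sketch, not a literal hunk):

	pivots = ma_pivots(node, type);		/* may be stale, or NULL, if the node died */
	end = ma_data_end(node, type, pivots, mas->max);
	if (unlikely(ma_dead_node(node)))
		return 1;			/* caller restarts the walk from a safe point */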
[PATCH v4 00/33] Per-VMA locks
that the peak thread creation rate was high enough that thread creation is no longer a bottleneck. TCP zerocopy receive: >From the point of view of TCP zerocopy receive, the per-vma lock patch is massively beneficial. In today's implementation, a process with N threads where N - 1 are performing zerocopy receive and 1 thread is performing madvise() with the write lock taken (e.g. needs to change vm_flags) will result in all N -1 receive threads blocking until the madvise is done. Conversely, on a busy process receiving a lot of data, an madvise operation that does need to take the mmap lock in write mode will need to wait for all of the receives to be done - a lose:lose proposition. Per-VMA locking _removes_ by definition this source of contention entirely. There are other benefits for receive as well, chiefly a reduction in cacheline bouncing across receiving threads for locking/unlocking the single mmap lock. On an RPC style synthetic workload with 4KB RPCs: 1a) The find+lock+unlock VMA path in the base case, without the per-vma lock patchset, is about 0.7% of cycles as measured by perf. 1b) mmap_read_lock + mmap_read_unlock in the base case is about 0.5% cycles overall - most of this is within the TCP read hotpath (a small fraction is 'other' usage in the system). 2a) The find+lock+unlock VMA path, with the per-vma patchset and a trivial patch written to take advantage of it in TCP, is about 0.4% of cycles (down from 0.7% above) 2b) mmap_read_lock + mmap_read_unlock in the per-vma patchset is < 0.1% cycles and is out of the TCP read hotpath entirely (down from 0.5% before, the remaining usage is the 'other' usage in the system). So, in addition to entirely removing an onerous source of contention, it also reduces the CPU cycles of TCP receive zerocopy by about 0.5%+ (compared to overall cycles in perf) for the 'small' RPC scenario. The patchset structure is: 0001-0008: Enable maple-tree RCU mode 0009-0031: Main per-vma locks patchset 0032-0033: Performance optimizations Changes since v3: - Changed patch [3] to move vma_prepare before vma_adjust_trans_huge - Dropped patch [4] from the set as unnecessary, per Hyeonggon Yoo - Changed patch [5] to do VMA locking inside vma_prepare, per Liam Howlett - Dropped patch [6] from the set as unnecessary, per Liam Howlett [1] https://lore.kernel.org/all/20220128131006.67712-1-mic...@lespinasse.org/ [2] https://lwn.net/Articles/893906/ [3] https://lore.kernel.org/all/20230216051750.3125598-15-sur...@google.com/ [4] https://lore.kernel.org/all/20230216051750.3125598-17-sur...@google.com/ [5] https://lore.kernel.org/all/20230216051750.3125598-18-sur...@google.com/ [6] https://lore.kernel.org/all/20230216051750.3125598-22-sur...@google.com/ The patchset applies cleanly over mm-unstable branch. Laurent Dufour (1): powerc/mm: try VMA lock-based page fault handling first Liam Howlett (4): maple_tree: Be more cautious about dead nodes maple_tree: Detect dead nodes in mas_start() maple_tree: Fix freeing of nodes in rcu mode maple_tree: remove extra smp_wmb() from mas_dead_leaves() Liam R. Howlett (4): maple_tree: Fix write memory barrier of nodes once dead for RCU mode maple_tree: Add smp_rmb() to dead node detection maple_tree: Add RCU lock checking to rcu callback functions mm: Enable maple tree RCU mode by default. 
Michel Lespinasse (1): mm: rcu safe VMA freeing Suren Baghdasaryan (23): mm: introduce CONFIG_PER_VMA_LOCK mm: move mmap_lock assert function definitions mm: add per-VMA lock and helper functions to control it mm: mark VMA as being written when changing vm_flags mm/mmap: move vma_prepare before vma_adjust_trans_huge mm/khugepaged: write-lock VMA while collapsing a huge page mm/mmap: write-lock VMAs in vma_prepare before modifying them mm/mremap: write-lock VMA while remapping it to a new address range mm: write-lock VMAs before removing them from VMA tree mm: conditionally write-lock VMA in free_pgtables kernel/fork: assert no VMA readers during its destruction mm/mmap: prevent pagefault handler from racing with mmu_notifier registration mm: introduce vma detached flag mm: introduce lock_vma_under_rcu to be used from arch-specific code mm: fall back to mmap_lock if vma->anon_vma is not yet set mm: add FAULT_FLAG_VMA_LOCK flag mm: prevent do_swap_page from handling page faults under VMA lock mm: prevent userfaults to be handled under per-vma lock mm: introduce per-VMA lock statistics x86/mm: try VMA lock-based page fault handling first arm64/mm: try VMA lock-based page fault handling first mm/mmap: free vm_area_struct without call_rcu in exit_mmap mm: separate vma->lock from vm_area_struct arch/arm64/Kconfig | 1 + arch/arm64/mm/fault.c | 36 arch/powerpc/mm/fault.c| 41 arch/powerpc/platforms/powernv/Kconfig | 1 + arch/powerpc/platforms/pseries/Kconfig
Re: [PATCH v3 17/35] mm/mmap: write-lock VMA before shrinking or expanding it
On Fri, Feb 24, 2023 at 8:19 AM Suren Baghdasaryan wrote: > > On Fri, Feb 24, 2023 at 8:14 AM Liam R. Howlett > wrote: > > > > * Suren Baghdasaryan [230223 21:06]: > > > On Thu, Feb 23, 2023 at 5:46 PM Liam R. Howlett > > > wrote: > > > > > > > > * Suren Baghdasaryan [230223 16:16]: > > > > > On Thu, Feb 23, 2023 at 12:28 PM Liam R. Howlett > > > > > wrote: > > > > > > > > > > > > > > > > > > Wait, I figured a better place to do this. > > > > > > > > > > > > init_multi_vma_prep() should vma_start_write() on any VMA that is > > > > > > passed > > > > > > in.. that we we catch any modifications here & in vma_merge(), > > > > > > which I > > > > > > think is missed in this patch set? > > > > > > > > > > Hmm. That looks like a good idea but in that case, why not do the > > > > > locking inside vma_prepare() itself? From the description of that > > > > > function it sounds like it was designed to acquire locks before VMA > > > > > modifications, so would be the ideal location for doing that. WDYT? > > > > > > > > That might be even better. I think it will result in even less code. > > > > > > Yes. > > > > > > > > > > > There is also a vma_complete() which might work to call > > > > vma_end_write_all() as well? > > > > > > If there are other VMAs already locked before vma_prepare() then we > > > would unlock them too. Safer to just let mmap_unlock do > > > vma_end_write_all(). > > > > > > > > > > > > The only concern is vma_adjust_trans_huge() being called before > > > > > vma_prepare() but I *think* that's safe because > > > > > vma_adjust_trans_huge() does its modifications after acquiring PTL > > > > > lock, which page fault handlers also have to take. Does that sound > > > > > right? > > > > > > > > I am not sure. We are certainly safe the way it is, and the PTL has to > > > > be safe for concurrent faults.. but this could alter the walk to a page > > > > table while that walk is occurring and I don't think that happens today. > > > > > > > > It might be best to leave the locking order the way you have it, unless > > > > someone can tell us it's safe? > > > > > > Yes, I have the same feelings about changing this. > > > > > > > > > > > We could pass through the three extra variables that are needed to move > > > > the vma_adjust_trans_huge() call within that function as well? This > > > > would have the added benefit of having all locking grouped in the one > > > > location, but the argument list would be getting long, however we could > > > > use the struct. > > > > > > Any issues if I change the order to have vma_prepare() called always > > > before vma_adjust_trans_huge()? That way the VMA will always be locked > > > before vma_adjust_trans_huge() executes and we don't need any > > > additional arguments. > > > > I preserved the locking order from __vma_adjust() to ensure there was no > > issues. > > > > I am not sure but, looking through the page table information [1], it > > seems that vma_adjust_trans_huge() uses the pmd lock, which is part of > > the split page table lock. According to the comment in rmap, it should > > be fine to reverse the ordering here. 
> > > > Instead of: > > > > mmap_lock() > > vma_adjust_trans_huge() > > pte_lock > > pte_unlock > > > > vma_prepare() > > mapping->i_mmap_rwsem lock > > anon_vma->rwsem lock > > > > > > > > vma_complete() > > anon_vma->rwsem unlock > > mapping->i_mmap_rwsem unlock > > > > mmap_unlock() > > > > - > > > > We would have: > > > > mmap_lock() > > vma_prepare() > > mapping->i_mmap_rwsem lock > > anon_vma->rwsem lock > > > > vma_adjust_trans_huge() > > pte_lock > > pte_unlock > > > > > > > > vma_complete() > > anon_vma->rwsem unlock > > mapping->i_mmap_rwsem unlock > > > > mmap_unlock() > > > > > > Essentially, increasing the nesting of the pte lock, but not violating > > the ordering. > > > > 1. https://docs.kernel.org/mm/split_page_table_lock.html > > Thanks for the confirmation, Liam. I'll make the changes and test over > the weekend. If everything's still fine, I will post the next version > with these and other requested changes on Monday. Weekend testing results look good with all the changes. I'll post v4 shortly. Thanks, Suren. > > > > > > > > > > > > > > remove & remove2 should be be detached in vma_prepare() or > > > > vma_complete() as well? > > > > > > They are marked detached in vma_complete() (see > > > https://lore.kernel.org/all/20230216051750.3125598-25-sur...@google.com/) > > > and that should be enough. We should be safe as long as we mark them > > > detached before unlocking the VMA. > > > > > > > Right, Thanks. > > > > ... > > > > -- > > To unsubscribe from this group and stop receiving emails from it, send an > > email to kernel-team+unsubscr...@android.com. > >
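For reference, a minimal sketch of what "doing the locking inside vma_prepare()" could look like; field names follow struct vma_prepare in mm/mmap.c, but the hunk is only illustrative, not the final patch:

static inline void vma_prepare_lock_vmas(struct vma_prepare *vp)
{
	/* write-lock every VMA this operation may modify, before any change */
	vma_start_write(vp->vma);
	if (vp->adj_next)
		vma_start_write(vp->adj_next);
	if (vp->insert)
		vma_start_write(vp->insert);
	if (vp->remove)
		vma_start_write(vp->remove);
	if (vp->remove2)
		vma_start_write(vp->remove2);
}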
Re: [PATCH v3 00/35] Per-VMA locks
On Mon, Feb 27, 2023 at 9:19 AM Davidlohr Bueso wrote: > > On Fri, 24 Feb 2023, freak07 wrote: > > >Here are some measurements from a Pixel 7 Pro that´s running a kernel either > >with the Per-VMA locks patchset or without. > >If there´s interest I can provide results of other specific apps as well. > > > >Results are from consecutive cold app launches issued with "am start" > >command spawning the main activity of Slack Android app. > > > >https://docs.google.com/spreadsheets/d/1ktujfcyDmIJoQMWsoizGIE-0A_jMS9VMw_seehUY9s0/edit?usp=sharing > > > >There´s quite a noticeable improvement, as can be seen in the graph. The > >results were reproducible. > > Thanks for sharing. I am not surprised - after all, per-vma locks narrow some > of the performance gaps > between vanilla and speculative pfs, with less complexity (albeit this is now > a 35 patch series :). Yes, depending on the specific app we would expect some launch time improvement (in this case average improvement is 7%). Thanks for sharing the numbers! I'll be posting v4 today, which is a 33 patch series now, so a bit better :) Thanks, Suren. > > Thanks, > Davidlohr >
Re: [PATCH v3 17/35] mm/mmap: write-lock VMA before shrinking or expanding it
On Fri, Feb 24, 2023 at 8:14 AM Liam R. Howlett wrote: > > * Suren Baghdasaryan [230223 21:06]: > > On Thu, Feb 23, 2023 at 5:46 PM Liam R. Howlett > > wrote: > > > > > > * Suren Baghdasaryan [230223 16:16]: > > > > On Thu, Feb 23, 2023 at 12:28 PM Liam R. Howlett > > > > wrote: > > > > > > > > > > > > > > > Wait, I figured a better place to do this. > > > > > > > > > > init_multi_vma_prep() should vma_start_write() on any VMA that is > > > > > passed > > > > > in.. that we we catch any modifications here & in vma_merge(), which I > > > > > think is missed in this patch set? > > > > > > > > Hmm. That looks like a good idea but in that case, why not do the > > > > locking inside vma_prepare() itself? From the description of that > > > > function it sounds like it was designed to acquire locks before VMA > > > > modifications, so would be the ideal location for doing that. WDYT? > > > > > > That might be even better. I think it will result in even less code. > > > > Yes. > > > > > > > > There is also a vma_complete() which might work to call > > > vma_end_write_all() as well? > > > > If there are other VMAs already locked before vma_prepare() then we > > would unlock them too. Safer to just let mmap_unlock do > > vma_end_write_all(). > > > > > > > > > The only concern is vma_adjust_trans_huge() being called before > > > > vma_prepare() but I *think* that's safe because > > > > vma_adjust_trans_huge() does its modifications after acquiring PTL > > > > lock, which page fault handlers also have to take. Does that sound > > > > right? > > > > > > I am not sure. We are certainly safe the way it is, and the PTL has to > > > be safe for concurrent faults.. but this could alter the walk to a page > > > table while that walk is occurring and I don't think that happens today. > > > > > > It might be best to leave the locking order the way you have it, unless > > > someone can tell us it's safe? > > > > Yes, I have the same feelings about changing this. > > > > > > > > We could pass through the three extra variables that are needed to move > > > the vma_adjust_trans_huge() call within that function as well? This > > > would have the added benefit of having all locking grouped in the one > > > location, but the argument list would be getting long, however we could > > > use the struct. > > > > Any issues if I change the order to have vma_prepare() called always > > before vma_adjust_trans_huge()? That way the VMA will always be locked > > before vma_adjust_trans_huge() executes and we don't need any > > additional arguments. > > I preserved the locking order from __vma_adjust() to ensure there was no > issues. > > I am not sure but, looking through the page table information [1], it > seems that vma_adjust_trans_huge() uses the pmd lock, which is part of > the split page table lock. According to the comment in rmap, it should > be fine to reverse the ordering here. > > Instead of: > > mmap_lock() > vma_adjust_trans_huge() > pte_lock > pte_unlock > > vma_prepare() > mapping->i_mmap_rwsem lock > anon_vma->rwsem lock > > > > vma_complete() > anon_vma->rwsem unlock > mapping->i_mmap_rwsem unlock > > mmap_unlock() > > - > > We would have: > > mmap_lock() > vma_prepare() > mapping->i_mmap_rwsem lock > anon_vma->rwsem lock > > vma_adjust_trans_huge() > pte_lock > pte_unlock > > > > vma_complete() > anon_vma->rwsem unlock > mapping->i_mmap_rwsem unlock > > mmap_unlock() > > > Essentially, increasing the nesting of the pte lock, but not violating > the ordering. > > 1. 
https://docs.kernel.org/mm/split_page_table_lock.html Thanks for the confirmation, Liam. I'll make the changes and test over the weekend. If everything's still fine, I will post the next version with these and other requested changes on Monday. > > > > > > > > > remove & remove2 should be be detached in vma_prepare() or > > > vma_complete() as well? > > > > They are marked detached in vma_complete() (see > > https://lore.kernel.org/all/20230216051750.3125598-25-sur...@google.com/) > > and that should be enough. We should be safe as long as we mark them > > detached before unlocking the VMA. > > > > Right, Thanks. > > ... > > -- > To unsubscribe from this group and stop receiving emails from it, send an > email to kernel-team+unsubscr...@android.com. >
Re: [PATCH v3 17/35] mm/mmap: write-lock VMA before shrinking or expanding it
On Thu, Feb 23, 2023 at 5:46 PM Liam R. Howlett wrote: > > * Suren Baghdasaryan [230223 16:16]: > > On Thu, Feb 23, 2023 at 12:28 PM Liam R. Howlett > > wrote: > > > > > > > > > Wait, I figured a better place to do this. > > > > > > init_multi_vma_prep() should vma_start_write() on any VMA that is passed > > > in.. that we we catch any modifications here & in vma_merge(), which I > > > think is missed in this patch set? > > > > Hmm. That looks like a good idea but in that case, why not do the > > locking inside vma_prepare() itself? From the description of that > > function it sounds like it was designed to acquire locks before VMA > > modifications, so would be the ideal location for doing that. WDYT? > > That might be even better. I think it will result in even less code. Yes. > > There is also a vma_complete() which might work to call > vma_end_write_all() as well? If there are other VMAs already locked before vma_prepare() then we would unlock them too. Safer to just let mmap_unlock do vma_end_write_all(). > > > The only concern is vma_adjust_trans_huge() being called before > > vma_prepare() but I *think* that's safe because > > vma_adjust_trans_huge() does its modifications after acquiring PTL > > lock, which page fault handlers also have to take. Does that sound > > right? > > I am not sure. We are certainly safe the way it is, and the PTL has to > be safe for concurrent faults.. but this could alter the walk to a page > table while that walk is occurring and I don't think that happens today. > > It might be best to leave the locking order the way you have it, unless > someone can tell us it's safe? Yes, I have the same feelings about changing this. > > We could pass through the three extra variables that are needed to move > the vma_adjust_trans_huge() call within that function as well? This > would have the added benefit of having all locking grouped in the one > location, but the argument list would be getting long, however we could > use the struct. Any issues if I change the order to have vma_prepare() called always before vma_adjust_trans_huge()? That way the VMA will always be locked before vma_adjust_trans_huge() executes and we don't need any additional arguments. > > remove & remove2 should be be detached in vma_prepare() or > vma_complete() as well? They are marked detached in vma_complete() (see https://lore.kernel.org/all/20230216051750.3125598-25-sur...@google.com/) and that should be enough. We should be safe as long as we mark them detached before unlocking the VMA. > > > > > > > > > > > > * Liam R. Howlett [230223 15:20]: > > > > Reviewed-by: Liam R. Howlett > > > > > > > > * Suren Baghdasaryan [230216 00:18]: > > > > > vma_expand and vma_shrink change VMA boundaries. Expansion might also > > > > > result in freeing of an adjacent VMA. Write-lock affected VMAs to > > > > > prevent > > > > > concurrent page faults. > > > > > > > > > > Signed-off-by: Suren Baghdasaryan > > > > > --- > > > > > mm/mmap.c | 5 + > > > > > 1 file changed, 5 insertions(+) > > > > > > > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > > > > index ec2f8d0af280..f079e5bbcd57 100644 > > > > > --- a/mm/mmap.c > > > > > +++ b/mm/mmap.c > > > > > @@ -674,6 +674,9 @@ int vma_expand(struct vma_iterator *vmi, struct > > > > > vm_area_struct *vma, > > > > > ret = dup_anon_vma(vma, next); > > > > > if (ret) > > > > > return ret; > > > > > + > > > > > + /* Lock the VMA before removing it */ > > > > > + vma_start_write(next); > > > > > } > > > > > > > > > > init_multi_vma_prep(&vp, vma, NULL, remove_next ? 
next : NULL, > > > > > NULL); > > > > > @@ -686,6 +689,7 @@ int vma_expand(struct vma_iterator *vmi, struct > > > > > vm_area_struct *vma, > > > > > if (vma_iter_prealloc(vmi)) > > > > > goto nomem; > > > > > > > > > > + vma_start_write(vma); > > > > > vma_adjust_trans_huge(vma, start, end, 0); > > > > > /* VMA iterator points to previous, so set to start if necessary > > > > > */ > > > > > if (vma_iter_addr(vmi) != start) > > > > > @@ -725,6 +729,7 @@ int vma_shrink(struct vma_iterator *vmi, struct > > > > > vm_area_struct *vma, > > > > > if (vma_iter_prealloc(vmi)) > > > > > return -ENOMEM; > > > > > > > > > > + vma_start_write(vma); > > > > > init_vma_prep(&vp, vma); > > > > > vma_adjust_trans_huge(vma, start, end, 0); > > > > > vma_prepare(&vp); > > > > > -- > > > > > 2.39.1 > > > > > > > > > > > -- > > > To unsubscribe from this group and stop receiving emails from it, send an > > > email to kernel-team+unsubscr...@android.com. > > > > > -- > To unsubscribe from this group and stop receiving emails from it, send an > email to kernel-team+unsubscr...@android.com. >
Re: [PATCH v3 17/35] mm/mmap: write-lock VMA before shrinking or expanding it
On Thu, Feb 23, 2023 at 12:28 PM Liam R. Howlett wrote: > > > Wait, I figured a better place to do this. > > init_multi_vma_prep() should vma_start_write() on any VMA that is passed > in.. that we we catch any modifications here & in vma_merge(), which I > think is missed in this patch set? Hmm. That looks like a good idea but in that case, why not do the locking inside vma_prepare() itself? From the description of that function it sounds like it was designed to acquire locks before VMA modifications, so would be the ideal location for doing that. WDYT? The only concern is vma_adjust_trans_huge() being called before vma_prepare() but I *think* that's safe because vma_adjust_trans_huge() does its modifications after acquiring PTL lock, which page fault handlers also have to take. Does that sound right? > > > * Liam R. Howlett [230223 15:20]: > > Reviewed-by: Liam R. Howlett > > > > * Suren Baghdasaryan [230216 00:18]: > > > vma_expand and vma_shrink change VMA boundaries. Expansion might also > > > result in freeing of an adjacent VMA. Write-lock affected VMAs to prevent > > > concurrent page faults. > > > > > > Signed-off-by: Suren Baghdasaryan > > > --- > > > mm/mmap.c | 5 + > > > 1 file changed, 5 insertions(+) > > > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > > index ec2f8d0af280..f079e5bbcd57 100644 > > > --- a/mm/mmap.c > > > +++ b/mm/mmap.c > > > @@ -674,6 +674,9 @@ int vma_expand(struct vma_iterator *vmi, struct > > > vm_area_struct *vma, > > > ret = dup_anon_vma(vma, next); > > > if (ret) > > > return ret; > > > + > > > + /* Lock the VMA before removing it */ > > > + vma_start_write(next); > > > } > > > > > > init_multi_vma_prep(&vp, vma, NULL, remove_next ? next : NULL, NULL); > > > @@ -686,6 +689,7 @@ int vma_expand(struct vma_iterator *vmi, struct > > > vm_area_struct *vma, > > > if (vma_iter_prealloc(vmi)) > > > goto nomem; > > > > > > + vma_start_write(vma); > > > vma_adjust_trans_huge(vma, start, end, 0); > > > /* VMA iterator points to previous, so set to start if necessary */ > > > if (vma_iter_addr(vmi) != start) > > > @@ -725,6 +729,7 @@ int vma_shrink(struct vma_iterator *vmi, struct > > > vm_area_struct *vma, > > > if (vma_iter_prealloc(vmi)) > > > return -ENOMEM; > > > > > > + vma_start_write(vma); > > > init_vma_prep(&vp, vma); > > > vma_adjust_trans_huge(vma, start, end, 0); > > > vma_prepare(&vp); > > > -- > > > 2.39.1 > > > > > -- > To unsubscribe from this group and stop receiving emails from it, send an > email to kernel-team+unsubscr...@android.com. >
Re: [PATCH v3 24/35] mm: introduce vma detached flag
On Thu, Feb 23, 2023 at 12:08 PM Liam R. Howlett wrote: > > > Can we change this to active/inactive? I think there is potential for > reusing this functionality to even larger degrees and that name would > fit better and would still make sense in this context. > > ie: vma_mark_active() and vma_mark_inactive() ? Those names sound too generic (not obvious what active/inactive means), while detached/isolated I think is more clear and specific. Does not really make a huge difference to me but maybe you can come up with better naming that addresses my concern and meets your usecase? > > * Suren Baghdasaryan [230216 00:18]: > > Per-vma locking mechanism will search for VMA under RCU protection and > > then after locking it, has to ensure it was not removed from the VMA > > tree after we found it. To make this check efficient, introduce a > > vma->detached flag to mark VMAs which were removed from the VMA tree. > > > > Signed-off-by: Suren Baghdasaryan > > --- > > include/linux/mm.h | 11 +++ > > include/linux/mm_types.h | 3 +++ > > mm/mmap.c| 2 ++ > > 3 files changed, 16 insertions(+) > > > > diff --git a/include/linux/mm.h b/include/linux/mm.h > > index f4f702224ec5..3f98344f829c 100644 > > --- a/include/linux/mm.h > > +++ b/include/linux/mm.h > > @@ -693,6 +693,14 @@ static inline void vma_assert_write_locked(struct > > vm_area_struct *vma) > > VM_BUG_ON_VMA(vma->vm_lock_seq != READ_ONCE(vma->vm_mm->mm_lock_seq), > > vma); > > } > > > > +static inline void vma_mark_detached(struct vm_area_struct *vma, bool > > detached) > > +{ > > + /* When detaching vma should be write-locked */ > > + if (detached) > > + vma_assert_write_locked(vma); > > + vma->detached = detached; > > +} > > + > > #else /* CONFIG_PER_VMA_LOCK */ > > > > static inline void vma_init_lock(struct vm_area_struct *vma) {} > > @@ -701,6 +709,8 @@ static inline bool vma_start_read(struct vm_area_struct > > *vma) > > static inline void vma_end_read(struct vm_area_struct *vma) {} > > static inline void vma_start_write(struct vm_area_struct *vma) {} > > static inline void vma_assert_write_locked(struct vm_area_struct *vma) {} > > +static inline void vma_mark_detached(struct vm_area_struct *vma, > > + bool detached) {} > > > > #endif /* CONFIG_PER_VMA_LOCK */ > > > > @@ -712,6 +722,7 @@ static inline void vma_init(struct vm_area_struct *vma, > > struct mm_struct *mm) > > vma->vm_mm = mm; > > vma->vm_ops = &dummy_vm_ops; > > INIT_LIST_HEAD(&vma->anon_vma_chain); > > + vma_mark_detached(vma, false); > > vma_init_lock(vma); > > } > > > > diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h > > index e268723eaf44..939f4f5a1115 100644 > > --- a/include/linux/mm_types.h > > +++ b/include/linux/mm_types.h > > @@ -511,6 +511,9 @@ struct vm_area_struct { > > #ifdef CONFIG_PER_VMA_LOCK > > int vm_lock_seq; > > struct rw_semaphore lock; > > + > > + /* Flag to indicate areas detached from the mm->mm_mt tree */ > > + bool detached; > > #endif > > > > /* > > diff --git a/mm/mmap.c b/mm/mmap.c > > index 801608726be8..adf40177e68f 100644 > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -593,6 +593,7 @@ static inline void vma_complete(struct vma_prepare *vp, > > > > if (vp->remove) { > > again: > > + vma_mark_detached(vp->remove, true); > > if (vp->file) { > > uprobe_munmap(vp->remove, vp->remove->vm_start, > > vp->remove->vm_end); > > @@ -2267,6 +2268,7 @@ static inline int munmap_sidetree(struct > > vm_area_struct *vma, > > if (mas_store_gfp(mas_detach, vma, GFP_KERNEL)) > > return -ENOMEM; > > > > + vma_mark_detached(vma, true); > > if 
(vma->vm_flags & VM_LOCKED) > > vma->vm_mm->locked_vm -= vma_pages(vma); > > > > -- > > 2.39.1 > > > > -- > To unsubscribe from this group and stop receiving emails from it, send an > email to kernel-team+unsubscr...@android.com. >
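A minimal sketch of how the flag is meant to be consumed by the RCU lookup path (example_lookup_vma() is a placeholder name; the real lock_vma_under_rcu() is introduced later in the series):

static struct vm_area_struct *example_lookup_vma(struct mm_struct *mm,
						 unsigned long addr)
{
	MA_STATE(mas, &mm->mm_mt, addr, addr);
	struct vm_area_struct *vma;

	rcu_read_lock();
	vma = mas_walk(&mas);
	if (vma && !vma_start_read(vma))
		vma = NULL;			/* contended: fall back to mmap_lock */
	if (vma && vma->detached) {
		vma_end_read(vma);		/* raced with unmap; the VMA already left the tree */
		vma = NULL;
	}
	rcu_read_unlock();

	return vma;				/* caller drops it with vma_end_read() */
}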
Re: [PATCH v3 23/35] mm/mmap: prevent pagefault handler from racing with mmu_notifier registration
On Thu, Feb 23, 2023 at 12:06 PM Liam R. Howlett wrote: > > * Suren Baghdasaryan [230216 00:18]: > > Page fault handlers might need to fire MMU notifications while a new > > notifier is being registered. Modify mm_take_all_locks to write-lock all > > VMAs and prevent this race with page fault handlers that would hold VMA > > locks. VMAs are locked before i_mmap_rwsem and anon_vma to keep the same > > locking order as in page fault handlers. > > > > Signed-off-by: Suren Baghdasaryan > > --- > > mm/mmap.c | 9 + > > 1 file changed, 9 insertions(+) > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > index 00f8c5798936..801608726be8 100644 > > --- a/mm/mmap.c > > +++ b/mm/mmap.c > > @@ -3501,6 +3501,7 @@ static void vm_lock_mapping(struct mm_struct *mm, > > struct address_space *mapping) > > * of mm/rmap.c: > > * - all hugetlbfs_i_mmap_rwsem_key locks (aka mapping->i_mmap_rwsem for > > * hugetlb mapping); > > + * - all vmas marked locked > > * - all i_mmap_rwsem locks; > > * - all anon_vma->rwseml > > * > > @@ -3523,6 +3524,13 @@ int mm_take_all_locks(struct mm_struct *mm) > > > > mutex_lock(&mm_all_locks_mutex); > > > > + mas_for_each(&mas, vma, ULONG_MAX) { > > + if (signal_pending(current)) > > + goto out_unlock; > > + vma_start_write(vma); > > + } > > + > > + mas_set(&mas, 0); > > mas_for_each(&mas, vma, ULONG_MAX) { > > if (signal_pending(current)) > > goto out_unlock; > > Do we need a vma_end_write_all(mm) in the out_unlock unrolling? We can't really do that because some VMAs might have been locked before mm_take_all_locks() got called. So, we will have to wait until mmap write lock is dropped and vma_end_write_all() is called from there. Getting a signal while executing mm_take_all_locks() is probably not too common and won't pose a performance issue. > > Also, does this need to honour the strict locking order that we have to > add an entire new loop? This function is...suboptimal today, but if we > could get away with not looping through every VMA for a 4th time, that > would be nice. That's what I used to do until Jann pointed out the locking order requirement to avoid deadlocks in here: https://lore.kernel.org/all/CAG48ez3EAai=1ghnCMF6xcgUebQRm-u2xhwcpYsfP9=r=ov...@mail.gmail.com/. > > > @@ -3612,6 +3620,7 @@ void mm_drop_all_locks(struct mm_struct *mm) > > if (vma->vm_file && vma->vm_file->f_mapping) > > vm_unlock_mapping(vma->vm_file->f_mapping); > > } > > + vma_end_write_all(mm); > > > > mutex_unlock(&mm_all_locks_mutex); > > } > > -- > > 2.39.1 > > > > -- > To unsubscribe from this group and stop receiving emails from it, send an > email to kernel-team+unsubscr...@android.com. >
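The ordering the extra loop preserves, sketched as a comment (a hedged summary of the commit message and the thread above, not new analysis):

/*
 *	mm_take_all_locks()			page fault under per-VMA lock
 *	-------------------			-----------------------------
 *	mmap_write_lock(mm)			vma_start_read(vma)
 *	vma_start_write() on each VMA		fault handling may then take
 *	i_mmap_rwsem of each mapping		    i_mmap_rwsem / anon_vma->rwsem
 *	anon_vma->rwsem of each anon_vma
 *
 * Write-locking the VMAs first keeps the left column consistent with the
 * fault path on the right, which is why the VMA loop cannot simply be folded
 * into the existing loops below it.
 */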
Re: [PATCH v3 26/35] mm: fall back to mmap_lock if vma->anon_vma is not yet set
On Fri, Feb 17, 2023 at 2:21 AM Hyeonggon Yoo <42.hye...@gmail.com> wrote: > > On Fri, Feb 17, 2023 at 11:15 AM Suren Baghdasaryan wrote: > > > > On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan > > wrote: > > > > > > On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox > > > wrote: > > > > > > > > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote: > > > > > When vma->anon_vma is not set, page fault handler will set it by > > > > > either > > > > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by > > > > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find > > > > > a compatible adjacent VMA and that requires not only the faulting VMA > > > > > to be stable but also the tree structure and other VMAs inside that > > > > > tree. > > > > > Therefore locking just the faulting VMA is not enough for this search. > > > > > Fall back to taking mmap_lock when vma->anon_vma is not set. This > > > > > situation happens only on the first page fault and should not affect > > > > > overall performance. > > > > > > > > I think I asked this before, but don't remember getting an aswer. > > > > Why do we defer setting anon_vma to the first fault? Why don't we > > > > set it up at mmap time? > > > > > > Yeah, I remember that conversation Matthew and I could not find the > > > definitive answer at the time. I'll look into that again or maybe > > > someone can answer it here. > > > > After looking into it again I'm still under the impression that > > vma->anon_vma is populated lazily (during the first page fault rather > > than at mmap time) to avoid doing extra work for areas which are never > > faulted. Though I might be missing some important detail here. > > I think this is because the kernel cannot merge VMAs that have > different anon_vmas? > > Enabling lazy population of anon_vma could potentially increase the > chances of merging VMAs. Hmm. Do you have a clear explanation why merging chances increase this way? A couple of possibilities I can think of would be: 1. If after mmap'ing a VMA and before faulting the first page into it we often change something that affects anon_vma_compatible() decision, like vm_policy; 2. When mmap'ing VMAs we do not map them consecutively but the final arrangement is actually contiguous. Don't think either of those cases would be very representative of a usual case but maybe I'm wrong or there is another reason? > > > > In the end rather than changing that logic I decided to skip > > > vma->anon_vma==NULL cases because I measured them being less than > > > 0.01% of all page faults, so ROI from changing that would be quite > > > low. But I agree that the logic is weird and maybe we can improve > > > that. I will have to review that again when I'm working on eliminating > > > all these special cases we skip, like swap/userfaults/etc. > > -- > To unsubscribe from this group and stop receiving emails from it, send an > email to kernel-team+unsubscr...@android.com. >
Re: [PATCH v3 26/35] mm: fall back to mmap_lock if vma->anon_vma is not yet set
On Fri, Feb 17, 2023 at 8:05 AM Matthew Wilcox wrote: > > On Thu, Feb 16, 2023 at 06:14:59PM -0800, Suren Baghdasaryan wrote: > > On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan > > wrote: > > > > > > On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox > > > wrote: > > > > > > > > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote: > > > > > When vma->anon_vma is not set, page fault handler will set it by > > > > > either > > > > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by > > > > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find > > > > > a compatible adjacent VMA and that requires not only the faulting VMA > > > > > to be stable but also the tree structure and other VMAs inside that > > > > > tree. > > > > > Therefore locking just the faulting VMA is not enough for this search. > > > > > Fall back to taking mmap_lock when vma->anon_vma is not set. This > > > > > situation happens only on the first page fault and should not affect > > > > > overall performance. > > > > > > > > I think I asked this before, but don't remember getting an answer. > > > > Why do we defer setting anon_vma to the first fault? Why don't we > > > > set it up at mmap time? > > > > > > Yeah, I remember that conversation; Matthew and I could not find the > > > definitive answer at the time. I'll look into that again or maybe > > > someone can answer it here. > > > > After looking into it again I'm still under the impression that > > vma->anon_vma is populated lazily (during the first page fault rather > > than at mmap time) to avoid doing extra work for areas which are never > > faulted. Though I might be missing some important detail here. > > How often does userspace call mmap() and then _never_ fault on it? > I appreciate that userspace might mmap() gigabytes of address space and > then only end up using a small amount of it, so populating it lazily > makes sense. But creating a region and never faulting on it? The only > use-case I can think of is loading shared libraries: > > openat(AT_FDCWD, "/lib/x86_64-linux-gnu/libc.so.6", O_RDONLY|O_CLOEXEC) = 3 > (...) > mmap(NULL, 197, PROT_READ, MAP_PRIVATE|MAP_DENYWRITE, 3, 0) = > 0x7f0ce612e000 > mmap(0x7f0ce6154000, 1396736, PROT_READ|PROT_EXEC, > MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x26000) = 0x7f0ce6154000 > mmap(0x7f0ce62a9000, 339968, PROT_READ, MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, > 3, 0x17b000) = 0x7f0ce62a9000 > mmap(0x7f0ce62fc000, 24576, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_FIXED|MAP_DENYWRITE, 3, 0x1ce000) = 0x7f0ce62fc000 > mmap(0x7f0ce6302000, 53072, PROT_READ|PROT_WRITE, > MAP_PRIVATE|MAP_FIXED|MAP_ANONYMOUS, -1, 0) = 0x7f0ce6302000 > > but that's a file-backed VMA, not an anon VMA. Might the case of dup_mmap() while forking be the reason why a VMA in the child process might never be used while the parent uses it (or vice versa)? Again, I'm not sure this is the reason but I can find no other good explanation.
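For illustration only, the dup_mmap() scenario speculated about above can be reproduced from userspace with a hypothetical toy program such as the sketch below (not taken from the thread): the parent faults a private anonymous mapping, then fork()s, and the child exits without ever touching its inherited copy of that VMA.

    #include <string.h>
    #include <sys/mman.h>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    int main(void)
    {
            size_t len = 64UL << 20;        /* 64 MiB private anonymous mapping */
            char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                           MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            pid_t child;

            if (p == MAP_FAILED)
                    return 1;
            memset(p, 0xaa, len);           /* parent faults in the whole range */

            child = fork();                 /* child gets its own copy of the VMA */
            if (child == 0)
                    _exit(0);               /* child exits without touching it */
            if (child > 0)
                    waitpid(child, NULL, 0);
            munmap(p, len);
            return 0;
    }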
Re: [PATCH v3 21/35] mm/mmap: write-lock adjacent VMAs if they can grow into unmapped area
On Fri, Feb 17, 2023 at 6:51 AM Liam R. Howlett wrote: > > * Suren Baghdasaryan [230216 14:36]: > > On Thu, Feb 16, 2023 at 7:34 AM Liam R. Howlett > > wrote: > > > > > > > > > First, sorry I didn't see this before v3.. > > > > Feedback at any time is highly appreciated! > > > > > > > > * Suren Baghdasaryan [230216 00:18]: > > > > While unmapping VMAs, adjacent VMAs might be able to grow into the area > > > > being unmapped. In such cases write-lock adjacent VMAs to prevent this > > > > growth. > > > > > > > > Signed-off-by: Suren Baghdasaryan > > > > --- > > > > mm/mmap.c | 8 +--- > > > > 1 file changed, 5 insertions(+), 3 deletions(-) > > > > > > > > diff --git a/mm/mmap.c b/mm/mmap.c > > > > index 118b2246bba9..00f8c5798936 100644 > > > > --- a/mm/mmap.c > > > > +++ b/mm/mmap.c > > > > @@ -2399,11 +2399,13 @@ do_vmi_align_munmap(struct vma_iterator *vmi, > > > > struct vm_area_struct *vma, > > > >* down_read(mmap_lock) and collide with the VMA we are about to > > > > unmap. > > > >*/ > > > > if (downgrade) { > > > > - if (next && (next->vm_flags & VM_GROWSDOWN)) > > > > + if (next && (next->vm_flags & VM_GROWSDOWN)) { > > > > + vma_start_write(next); > > > > downgrade = false; > > > > > > If the mmap write lock is insufficient to protect us from next/prev > > > modifications then we need to move *most* of this block above the maple > > > tree write operation, otherwise we have a race here. When I say most, I > > > mean everything besides the call to mmap_write_downgrade() needs to be > > > moved. > > > > Which prior maple tree write operation are you referring to? I see > > __split_vma() and munmap_sidetree() which both already lock the VMAs > > they operate on, so page faults can't happen in those VMAs. > > The write that removes the VMAs from the maple tree a few lines above.. > /* Point of no return */ > > If the mmap lock is not sufficient, then we need to move the > vma_start_write() of prev/next to above the call to > vma_iter_clear_gfp() in do_vmi_align_munmap(). > > But I still think it IS enough. > > > > > > > > > If the mmap write lock is sufficient to protect us from next/prev > > > modifications then we don't need to write lock the vmas themselves. > > > > mmap write lock is not sufficient because with per-VMA locks we do not > > take mmap lock at all. > > Understood, but it also does not expand VMAs. > > > > > > > > > I believe this is for expand_stack() protection, so I believe it's okay > > > to not vma write lock these vmas.. I don't think there are other areas > > > where we can modify the vmas without holding the mmap lock, but others > > > on the CC list please chime in if I've forgotten something. > > > > > > So, if I am correct, then you shouldn't lock next/prev and allow the > > > vma locking fault method on these vmas. This will work because > > > lock_vma_under_rcu() uses mas_walk() on the faulting address. That is, > > > your lock_vma_under_rcu() will fail to find anything that needs to be > > > grown and go back to mmap lock protection. As it is written today, the > > > vma locking fault handler will fail and we will wait for the mmap lock > > > to be released even when the vma isn't going to expand. > > > > So, let's consider a case when the next VMA is not being removed (so > > it was neither removed nor locked by munmap_sidetree()) and it is > > found by lock_vma_under_rcu() in the page fault handling path. > > By this point next VMA is either NULL or outside the munmap area, so > what you said here is always true. 
> > >Page > > fault handler can now expand it and push into the area we are > > unmapping in unmap_region(). That is the race I'm trying to prevent > > here by locking the next/prev VMAs which can be expanded before > > unmap_region() unmaps them. Am I missing something? > > Yes, I think the part you are missing (or I am missing..) is that > expand_stack() will never be called without the mmap lock. We don't use > the vma locking to expand the stack. Ah, yes, you are absolutely right. I missed that when the VMA expands as a result of a page fault, lock_vma_under_rcu() can't find the faulting VMA (the fault is outside of the area and hence the need to expand) and will fall back to mmap read locking. Since do_vmi_align_munmap() holds the mmap write lock and does not downgrade it, the race will be avoided and expansion will wait until we drop the mmap write lock. Good catch, Liam! We can drop this patch completely from the series. Thanks, Suren. > > ... > > -- > To unsubscribe from this group and stop receiving emails from it, send an > email to kernel-team+unsubscr...@android.com. >
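Liam's point about expand_stack() can be made concrete with a rough sketch of the fault-handler shape used with this series (function names follow the patchset, but the structure is simplified and most error and retry handling is omitted, so treat it as an illustration rather than actual arch code). A fault that needs stack expansion lands in an unmapped gap, so the mas_walk() inside lock_vma_under_rcu() finds no VMA and the handler drops back to the mmap_lock path, which is the only place expansion happens.

    static vm_fault_t fault_sketch(struct mm_struct *mm, struct pt_regs *regs,
                                   unsigned long address, unsigned int flags)
    {
            struct vm_area_struct *vma;
            vm_fault_t fault;

            vma = lock_vma_under_rcu(mm, address);  /* mas_walk() at 'address' */
            if (!vma)
                    goto fallback;  /* unmapped gap, e.g. below a VM_GROWSDOWN stack */

            fault = handle_mm_fault(vma, address, flags | FAULT_FLAG_VMA_LOCK, regs);
            vma_end_read(vma);
            return fault;

    fallback:
            mmap_read_lock(mm);
            /* Stack expansion only ever happens under the mmap_lock. */
            vma = find_extend_vma(mm, address);
            if (!vma) {
                    mmap_read_unlock(mm);
                    return VM_FAULT_SIGSEGV;
            }
            /* Assumes the caller did not request retry semantics in 'flags'. */
            fault = handle_mm_fault(vma, address, flags, regs);
            mmap_read_unlock(mm);
            return fault;
    }

Since do_vmi_align_munmap() holds the mmap write lock across the window in question, that fallback path simply waits, which is why the extra write-locking of prev/next turned out to be unnecessary.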
Re: [PATCH v3 26/35] mm: fall back to mmap_lock if vma->anon_vma is not yet set
On Thu, Feb 16, 2023 at 11:43 AM Suren Baghdasaryan wrote: > > On Thu, Feb 16, 2023 at 7:44 AM Matthew Wilcox wrote: > > > > On Wed, Feb 15, 2023 at 09:17:41PM -0800, Suren Baghdasaryan wrote: > > > When vma->anon_vma is not set, page fault handler will set it by either > > > reusing anon_vma of an adjacent VMA if VMAs are compatible or by > > > allocating a new one. find_mergeable_anon_vma() walks VMA tree to find > > > a compatible adjacent VMA and that requires not only the faulting VMA > > > to be stable but also the tree structure and other VMAs inside that tree. > > > Therefore locking just the faulting VMA is not enough for this search. > > > Fall back to taking mmap_lock when vma->anon_vma is not set. This > > > situation happens only on the first page fault and should not affect > > > overall performance. > > > > I think I asked this before, but don't remember getting an answer. > > Why do we defer setting anon_vma to the first fault? Why don't we > > set it up at mmap time? > > Yeah, I remember that conversation; Matthew and I could not find the > definitive answer at the time. I'll look into that again or maybe > someone can answer it here. After looking into it again I'm still under the impression that vma->anon_vma is populated lazily (during the first page fault rather than at mmap time) to avoid doing extra work for areas which are never faulted. Though I might be missing some important detail here. > > In the end rather than changing that logic I decided to skip > vma->anon_vma==NULL cases because I measured them being less than > 0.01% of all page faults, so ROI from changing that would be quite > low. But I agree that the logic is weird and maybe we can improve > that. I will have to review that again when I'm working on eliminating > all these special cases we skip, like swap/userfaults/etc.
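In code terms, the fallback this patch adds amounts to a check like the hypothetical helper below (the helper name is made up for illustration; in the series the test sits inside the lockless VMA lookup used by the page fault path). If an anonymous VMA has no anon_vma yet, this must be its first fault, and only the mmap_lock path makes the find_mergeable_anon_vma() walk over neighbouring VMAs safe.

    /* Hypothetical helper, for illustration only. */
    static bool can_handle_under_vma_lock(struct vm_area_struct *vma)
    {
            /*
             * First fault on an anonymous VMA: setting up anon_vma may call
             * find_mergeable_anon_vma(), which inspects adjacent VMAs, so
             * fall back to the mmap_lock path instead.
             */
            if (vma_is_anonymous(vma) && !vma->anon_vma)
                    return false;

            return true;
    }

As noted above, this only triggers on the first fault of an anonymous VMA, measured in the thread at less than 0.01% of all page faults, so the cost of the fallback is negligible.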