Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code

2024-04-02 Thread Ryan Roberts
On 02/04/2024 17:20, Peter Xu wrote: > On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote: >> On 02.04.24 16:48, Ryan Roberts wrote: >>> Hi Peter, > > Hey, Ryan, > > Thanks for the report! > >>> >>> On 27/03/2024 15:23, pet...@r

Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code

2024-04-02 Thread Ryan Roberts
On 02/04/2024 17:00, Matthew Wilcox wrote: > On Tue, Apr 02, 2024 at 05:26:28PM +0200, David Hildenbrand wrote: >>> The oops trigger is at mm/gup.c:778: >>> VM_BUG_ON_PAGE(!PageHead(page) && !is_zone_device_page(page), page); >>> >>> So 2M passed ok, and it's failing for 32M, which is cont-pmd. I'm

Re: [PATCH v4 13/13] mm/gup: Handle hugetlb in the generic follow_page_mask code

2024-04-02 Thread Ryan Roberts
Hi Peter, On 27/03/2024 15:23, pet...@redhat.com wrote: > From: Peter Xu > > Now follow_page() is ready to handle hugetlb pages in whatever form, and > over all architectures. Switch to the generic code path. > > Time to retire hugetlb_follow_page_mask(), following the previous > retirement

Re: [PATCH RFC 0/3] mm/gup: consistently call it GUP-fast

2024-03-27 Thread Ryan Roberts
> > Some of them look like mm-unstable issue, For example, arm64 fails with > >   CC  arch/arm64/mm/extable.o > In file included from ./include/linux/hugetlb.h:828, > from security/commoncap.c:19: > ./arch/arm64/include/asm/hugetlb.h:25:34: error: redefinition of >

Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-20 Thread Ryan Roberts
On 19/02/2024 15:18, Catalin Marinas wrote: > On Fri, Feb 16, 2024 at 12:53:43PM +0000, Ryan Roberts wrote: >> On 16/02/2024 12:25, Catalin Marinas wrote: >>> On Thu, Feb 15, 2024 at 10:31:59AM +0000, Ryan Roberts wrote: >>>> +pte_t contpte_ptep

Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-20 Thread Ryan Roberts
On 16/02/2024 19:54, John Hubbard wrote: > On 2/16/24 08:56, Catalin Marinas wrote: > ... >>> The problem is that the contpte_* symbols are called from the ptep_* inline >>> functions. So where those inlines are called from modules, we need to make >>> sure >>> the contpte_* symbols are

Re: [PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-16 Thread Ryan Roberts
Hi Catalin, Thanks for the review! Comments below... On 16/02/2024 12:25, Catalin Marinas wrote: > On Thu, Feb 15, 2024 at 10:31:59AM +0000, Ryan Roberts wrote: >> arch/arm64/mm/contpte.c | 285 +++ > > Nitpick: I think most symbols i

[PATCH v6 14/18] arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs

2024-02-15 Thread Ryan Roberts
. Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 67 arch/arm64/mm/contpte.c | 17 2 files changed, 84 insertions(+) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h

[PATCH v6 13/18] arm64/mm: Implement new wrprotect_ptes() batch API

2024-02-15 Thread Ryan Roberts
viour when 'Misprogramming the Contiguous bit'. See section D21194 at https://developer.arm.com/documentation/102105/ja-07/ Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 61 ++-- arch/arm64/mm/contpte.c | 38 +

[PATCH v6 12/18] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-15 Thread Ryan Roberts
sheuvel Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/Kconfig | 9 + arch/arm64/include/asm/pgtable.h | 167 ++ arch/arm64/mm/Makefile | 1 + arch/arm64/mm/contpte.c | 285 +++ include/linux/efi

[PATCH v6 09/18] arm64/mm: Convert ptep_clear() to ptep_get_and_clear()

2024-02-15 Thread Ryan Roberts
convert it to directly call ptep_get_and_clear(). Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/mm/hugetlbpage.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm64/mm/hugetlbpage.c b/arch/arm64/mm/hugetlbpage.c index 27f6160890d1..48e8b429879d 100644

[PATCH v6 11/18] arm64/mm: Split __flush_tlb_range() to elide trailing DSB

2024-02-15 Thread Ryan Roberts
if needed. Reviewed-by: David Hildenbrand Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/tlbflush.h | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflush.h index 1d

[PATCH v6 10/18] arm64/mm: New ptep layer to manage contig bit

2024-02-15 Thread Ryan Roberts
. The following APIs are treated this way: - ptep_get - set_pte - set_ptes - pte_clear - ptep_get_and_clear - ptep_test_and_clear_young - ptep_clear_flush_young - ptep_set_wrprotect - ptep_set_access_flags Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/include/asm

[PATCH v6 15/18] mm: Add pte_batch_hint() to reduce scanning in folio_pte_batch()

2024-02-15 Thread Ryan Roberts
of contptes. Acked-by: David Hildenbrand Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- include/linux/pgtable.h | 21 + mm/memory.c | 19 --- 2 files changed, 33 insertions(+), 7 deletions(-) diff --git a/include/linux/pgtable.h b/include

[PATCH v6 17/18] arm64/mm: __always_inline to improve fork() perf

2024-02-15 Thread Ryan Roberts
with order-0 folios (the common case). Acked-by: Mark Rutland Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index d759a20d2929

[PATCH v6 18/18] arm64/mm: Automatically fold contpte mappings

2024-02-15 Thread Ryan Roberts
_at() -> set_ptes(nr=1)) and only when we are setting the final PTE in a contpte-aligned block. Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 26 + arch/arm64/mm/contpte.c | 64 2 files changed, 90 insertions(+) diff --git

[PATCH v6 16/18] arm64/mm: Implement pte_batch_hint()

2024-02-15 Thread Ryan Roberts
their iterators to skip getting the contpte tail ptes when gathering the batch of ptes to operate on. This results in the number of PTE reads returning to 1 per pte. Acked-by: Mark Rutland Reviewed-by: David Hildenbrand Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/include

[PATCH v6 08/18] arm64/mm: Convert set_pte_at() to set_ptes(..., 1)

2024-02-15 Thread Ryan Roberts
() rather than the arch-private __set_ptes(). Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 2 +- arch/arm64/kernel/mte.c | 2 +- arch/arm64/kvm/guest.c | 2 +- arch/arm64/mm/fault.c| 2 +- arch/arm64/mm/hugetlbpage.c

[PATCH v6 07/18] arm64/mm: Convert READ_ONCE(*ptep) to ptep_get(ptep)

2024-02-15 Thread Ryan Roberts
support. In this case, ptep_get() will become more complex so we now have all the code abstracted through it. Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 12 +--- arch/arm64/kernel/efi.c | 2 +- arch/arm64/mm/fault.c| 4

[PATCH v6 05/18] x86/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-15 Thread Ryan Roberts
Core-mm needs to be able to advance the pfn by an arbitrary amount, so override the new pte_advance_pfn() API to do so. Signed-off-by: Ryan Roberts --- arch/x86/include/asm/pgtable.h | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b

[PATCH v6 06/18] mm: Tidy up pte_next_pfn() definition

2024-02-15 Thread Ryan Roberts
Now that all the architecture overrides of pte_next_pfn() have been replaced with pte_advance_pfn(), we can simplify the definition of the generic pte_next_pfn() macro so that it is unconditionally defined. Signed-off-by: Ryan Roberts --- include/linux/pgtable.h | 2 -- 1 file changed, 2

[PATCH v6 04/18] arm64/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-15 Thread Ryan Roberts
Core-mm needs to be able to advance the pfn by an arbitrary amount, so override the new pte_advance_pfn() API to do so. Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b

[PATCH v6 03/18] mm: Introduce pte_advance_pfn() and use for pte_next_pfn()

2024-02-15 Thread Ryan Roberts
overriding architecture's pte_next_pfn() to pte_advance_pfn(). Signed-off-by: Ryan Roberts --- include/linux/pgtable.h | 9 ++--- 1 file changed, 6 insertions(+), 3 deletions(-) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index 231370e1b80f..b7ac8358f2aa 100644 --- a/include

[PATCH v6 02/18] mm: thp: Batch-collapse PMD with set_ptes()

2024-02-15 Thread Ryan Roberts
are set as a batch, the contpte blocks can be initially set up pre-folded (once the arm64 contpte support is added in the next few patches). This leads to noticeable performance improvement during split. Acked-by: David Hildenbrand Signed-off-by: Ryan Roberts --- mm/huge_memory.c | 58

[PATCH v6 00/18] Transparent Contiguous PTEs for User Mappings

2024-02-15 Thread Ryan Roberts
.com/ [6] https://lore.kernel.org/lkml/08c16f7d-f3b3-4f22-9acc-da943f647...@arm.com/ [7] https://lore.kernel.org/linux-mm/20240214204435.167852-1-da...@redhat.com/ [8] https://lore.kernel.org/linux-mm/c507308d-bdd4-5f9e-d4ff-e96e4520b...@nvidia.com/ [9] https://gitlab.arm.com/linux-arm/linux-rr/-/t

[PATCH v6 01/18] mm: Clarify the spec for set_ptes()

2024-02-15 Thread Ryan Roberts
s must initially be not-present. All set_ptes() callsites already conform to this requirement. Stating it explicitly is useful because it allows for a simplification to the upcoming arm64 contpte implementation. Acked-by: David Hildenbrand Signed-off-by: Ryan Roberts --- include/linux/pgtable

Re: [PATCH v5 25/25] arm64/mm: Automatically fold contpte mappings

2024-02-13 Thread Ryan Roberts
On 13/02/2024 17:44, Mark Rutland wrote: > On Fri, Feb 02, 2024 at 08:07:56AM +0000, Ryan Roberts wrote: >> There are situations where a change to a single PTE could cause the >> contpte block in which it resides to become foldable (i.e. could be >> repainted wi

Re: [PATCH v5 21/25] arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs

2024-02-13 Thread Ryan Roberts
On 13/02/2024 16:43, Mark Rutland wrote: > On Fri, Feb 02, 2024 at 08:07:52AM +0000, Ryan Roberts wrote: >> Optimize the contpte implementation to fix some of the >> exit/munmap/dontneed performance regression introduced by the initial >> contpte commit. Subsequent patches w

Re: [PATCH v5 20/25] arm64/mm: Implement new wrprotect_ptes() batch API

2024-02-13 Thread Ryan Roberts
On 13/02/2024 16:31, Mark Rutland wrote: > On Fri, Feb 02, 2024 at 08:07:51AM +0000, Ryan Roberts wrote: >> Optimize the contpte implementation to fix some of the fork performance >> regression introduced by the initial contpte commit. Subsequent patches >> will solve it e

Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-13 Thread Ryan Roberts
On 12/02/2024 16:24, David Hildenbrand wrote: > On 12.02.24 16:34, Ryan Roberts wrote: >> On 12/02/2024 15:26, David Hildenbrand wrote: >>> On 12.02.24 15:45, Ryan Roberts wrote: >>>> On 12/02/2024 13:54, David Hildenbrand wrote: >>>>>>> If

Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-13 Thread Ryan Roberts
On 13/02/2024 14:08, Ard Biesheuvel wrote: > On Tue, 13 Feb 2024 at 15:05, David Hildenbrand wrote: >> >> On 13.02.24 15:02, Ryan Roberts wrote: >>> On 13/02/2024 13:45, David Hildenbrand wrote: >>>> On 13.02.24 14:33, Ard Biesheuvel wrote: >>>>>

Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-13 Thread Ryan Roberts
On 13/02/2024 13:45, David Hildenbrand wrote: > On 13.02.24 14:33, Ard Biesheuvel wrote: >> On Tue, 13 Feb 2024 at 14:21, Ryan Roberts wrote: >>> >>> On 13/02/2024 13:13, David Hildenbrand wrote: >>>> On 13.02.24 14:06, Ryan Roberts wrote: >>>

Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-13 Thread Ryan Roberts
On 13/02/2024 13:22, David Hildenbrand wrote: > On 13.02.24 14:20, Ryan Roberts wrote: >> On 13/02/2024 13:13, David Hildenbrand wrote: >>> On 13.02.24 14:06, Ryan Roberts wrote: >>>> On 13/02/2024 12:19, David Hildenbrand wrote: >>>>> On 13.02.24 13:06

Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-13 Thread Ryan Roberts
On 13/02/2024 13:13, David Hildenbrand wrote: > On 13.02.24 14:06, Ryan Roberts wrote: >> On 13/02/2024 12:19, David Hildenbrand wrote: >>> On 13.02.24 13:06, Ryan Roberts wrote: >>>> On 12/02/2024 20:38, Ryan Roberts wrote: >>>>> [...] >>

Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-13 Thread Ryan Roberts
On 13/02/2024 12:19, David Hildenbrand wrote: > On 13.02.24 13:06, Ryan Roberts wrote: >> On 12/02/2024 20:38, Ryan Roberts wrote: >>> [...] >>> >>>>>>> +static inline bool mm_is_user(struct mm_struct *mm) >>>>>>> +{ >>

Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-13 Thread Ryan Roberts
On 13/02/2024 12:02, Mark Rutland wrote: > On Mon, Feb 12, 2024 at 12:59:57PM +0000, Ryan Roberts wrote: >> On 12/02/2024 12:00, Mark Rutland wrote: >>> Hi Ryan, > > [...] > >>>> +static inline void set_pte(pte_t *ptep, pte_t pte) >>>> +{ >

Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-13 Thread Ryan Roberts
On 12/02/2024 20:38, Ryan Roberts wrote: > [...] > >>>>> +static inline bool mm_is_user(struct mm_struct *mm) >>>>> +{ >>>>> + /* >>>>> + * Don't attempt to apply the contig bit to kernel mappings, because >>

Re: [PATCH v5 03/25] mm: Make pte_next_pfn() a wrapper around pte_advance_pfn()

2024-02-12 Thread Ryan Roberts
On 12/02/2024 14:29, David Hildenbrand wrote: > On 12.02.24 15:10, Ryan Roberts wrote: >> On 12/02/2024 12:14, David Hildenbrand wrote: >>> On 02.02.24 09:07, Ryan Roberts wrote: >>>> The goal is to be able to advance a PTE by an arbitrary number of PFNs. >>

Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-12 Thread Ryan Roberts
[...] +static inline bool mm_is_user(struct mm_struct *mm) +{ + /* + * Don't attempt to apply the contig bit to kernel mappings, because + * dynamically adding/removing the contig bit can cause page faults. + * These racing faults are ok for user space, since

Re: [PATCH v5 22/25] mm: Add pte_batch_hint() to reduce scanning in folio_pte_batch()

2024-02-12 Thread Ryan Roberts
On 12/02/2024 13:43, David Hildenbrand wrote: > On 02.02.24 09:07, Ryan Roberts wrote: >> Some architectures (e.g. arm64) can tell from looking at a pte, if some >> follow-on ptes also map contiguous physical memory with the same pgprot. >> (for arm64, these are contpte

Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-12 Thread Ryan Roberts
On 12/02/2024 15:26, David Hildenbrand wrote: > On 12.02.24 15:45, Ryan Roberts wrote: >> On 12/02/2024 13:54, David Hildenbrand wrote: >>>>> If so, I wonder if we could instead do that comparison modulo the >>>>> access/dirty >>>>> bits, >

Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-12 Thread Ryan Roberts
On 12/02/2024 12:59, Ryan Roberts wrote: > On 12/02/2024 12:00, Mark Rutland wrote: >> Hi Ryan, >> >> Overall this looks pretty good; I have a bunch of minor comments below, and a >> bigger question on the way ptep_get_lockless() works. > > OK great - thanks f

Re: [PATCH v5 22/25] mm: Add pte_batch_hint() to reduce scanning in folio_pte_batch()

2024-02-12 Thread Ryan Roberts
On 12/02/2024 13:43, David Hildenbrand wrote: > On 02.02.24 09:07, Ryan Roberts wrote: >> Some architectures (e.g. arm64) can tell from looking at a pte, if some >> follow-on ptes also map contiguous physical memory with the same pgprot. >> (for arm64, these are contpte

Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-12 Thread Ryan Roberts
On 12/02/2024 13:54, David Hildenbrand wrote: >>> If so, I wonder if we could instead do that comparison modulo the >>> access/dirty >>> bits, >> >> I think that would work - but will need to think a bit more on it. >> >>> and leave ptep_get_lockless() only reading a single entry? >> >> I think

Re: [PATCH v5 03/25] mm: Make pte_next_pfn() a wrapper around pte_advance_pfn()

2024-02-12 Thread Ryan Roberts
On 12/02/2024 12:14, David Hildenbrand wrote: > On 02.02.24 09:07, Ryan Roberts wrote: >> The goal is to be able to advance a PTE by an arbitrary number of PFNs. >> So introduce a new API that takes a nr param. >> >> We are going to remove pte_next_pfn() and replace

Re: [PATCH v5 18/25] arm64/mm: Split __flush_tlb_range() to elide trailing DSB

2024-02-12 Thread Ryan Roberts
On 12/02/2024 13:15, David Hildenbrand wrote: > On 12.02.24 14:05, Ryan Roberts wrote: >> On 12/02/2024 12:44, David Hildenbrand wrote: >>> On 02.02.24 09:07, Ryan Roberts wrote: >>>> Split __flush_tlb_range() into __flush_tlb_range_nosync() + >>>

Re: [PATCH v5 18/25] arm64/mm: Split __flush_tlb_range() to elide trailing DSB

2024-02-12 Thread Ryan Roberts
On 12/02/2024 12:44, David Hildenbrand wrote: > On 02.02.24 09:07, Ryan Roberts wrote: >> Split __flush_tlb_range() into __flush_tlb_range_nosync() + >> __flush_tlb_range(), in the same way as the existing flush_tlb_page() >> arrangement. This allows calling __flush_tlb_ra

Re: [PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-12 Thread Ryan Roberts
Feb 02, 2024 at 08:07:50AM +0000, Ryan Roberts wrote: >> With the ptep API sufficiently refactored, we can now introduce a new >> "contpte" API layer, which transparently manages the PTE_CONT bit for >> user mappings. >> >> In this initial implementation, only s

Re: [PATCH v2 09/10] mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing

2024-02-12 Thread Ryan Roberts
On 12/02/2024 11:05, David Hildenbrand wrote: > On 12.02.24 11:56, David Hildenbrand wrote: >> On 12.02.24 11:32, Ryan Roberts wrote: >>> On 12/02/2024 10:11, David Hildenbrand wrote: >>>> Hi Ryan, >>>> >>>>>> -static void tlb_b

Re: [PATCH v2 09/10] mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing

2024-02-12 Thread Ryan Roberts
On 12/02/2024 10:11, David Hildenbrand wrote: > Hi Ryan, > >>> -static void tlb_batch_pages_flush(struct mmu_gather *tlb) >>> +static void __tlb_batch_free_encoded_pages(struct mmu_gather_batch *batch) >>>   { >>> -    struct mmu_gather_batch *batch; >>> - >>> -    for (batch = &tlb->local; batch &&

Re: [PATCH v2 10/10] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-02-12 Thread Ryan Roberts
ave a cheap > folio_mapcount(), we might just want to check for underflows there. > > To keep small folios as fast as possible force inlining of a specialized > variant using __always_inline with nr=1. > > Signed-off-by: David Hildenbrand Reviewed-by: Ryan Roberts > --- >

Re: [PATCH v2 09/10] mm/mmu_gather: improve cond_resched() handling with large folios and expensive page freeing

2024-02-12 Thread Ryan Roberts
On 09/02/2024 22:15, David Hildenbrand wrote: > It's a pain that we have to handle cond_resched() in > tlb_batch_pages_flush() manually and cannot simply handle it in > release_pages() -- release_pages() can be called from atomic context. > Well, in a perfect world we wouldn't have to make our

Re: [PATCH v2 08/10] mm/mmu_gather: add __tlb_remove_folio_pages()

2024-02-12 Thread Ryan Roberts
> - bool delay_rmap, int page_size) > +static bool __tlb_remove_folio_pages_size(struct mmu_gather *tlb, > + struct page *page, unsigned int nr_pages, bool delay_rmap, > + int page_size) > { > int flags = delay_rmap ? ENCODED_PAGE_B

Re: [PATCH v2 01/10] mm/memory: factor out zapping of present pte into zap_present_pte()

2024-02-12 Thread Ryan Roberts
On 09/02/2024 22:15, David Hildenbrand wrote: > Let's prepare for further changes by factoring out processing of present > PTEs. > > Signed-off-by: David Hildenbrand Reviewed-by: Ryan Roberts > --- > mm/memory.c | 94 ++--- >

Re: [PATCH v5 00/25] Transparent Contiguous PTEs for User Mappings

2024-02-09 Thread Ryan Roberts
On 09/02/2024 22:16, David Hildenbrand wrote: >>> 1) Convert READ_ONCE() -> ptep_get() >>> 2) Convert set_pte_at() -> set_ptes() >>> 3) All the "New layer" renames and addition of the trivial wrappers >> >> Yep that makes sense. I'll start prepping that today. I'll hold off reposting >> until I

Re: [PATCH v5 00/25] Transparent Contiguous PTEs for User Mappings

2024-02-09 Thread Ryan Roberts
On 08/02/2024 17:34, Mark Rutland wrote: > On Fri, Feb 02, 2024 at 08:07:31AM +0000, Ryan Roberts wrote: >> Hi All, > > Hi Ryan, > > I assume this is the same as your 'features/granule_perf/contpte-lkml_v' > branch > on https://gitlab.arm.com/linux-arm/linux-rr/

[PATCH v5 25/25] arm64/mm: Automatically fold contpte mappings

2024-02-02 Thread Ryan Roberts
_at() -> set_ptes(nr=1)) and only when we are setting the final PTE in a contpte-aligned block. Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 26 + arch/arm64/mm/contpte.c | 64 2 files changed, 90 insertions(+) diff --git

[PATCH v5 24/25] arm64/mm: __always_inline to improve fork() perf

2024-02-02 Thread Ryan Roberts
with order-0 folios (the common case). Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 10 +- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 353ea67b5d75..cdc310880a3b 100644

[PATCH v5 23/25] arm64/mm: Implement pte_batch_hint()

2024-02-02 Thread Ryan Roberts
their iterators to skip getting the contpte tail ptes when gathering the batch of ptes to operate on. This results in the number of PTE reads returning to 1 per pte. Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 9 + 1 file changed, 9 insertions

[PATCH v5 22/25] mm: Add pte_batch_hint() to reduce scanning in folio_pte_batch()

2024-02-02 Thread Ryan Roberts
of contptes. Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- include/linux/pgtable.h | 18 ++ mm/memory.c | 20 +--- 2 files changed, 31 insertions(+), 7 deletions(-) diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h index

[PATCH v5 21/25] arm64/mm: Implement new [get_and_]clear_full_ptes() batch APIs

2024-02-02 Thread Ryan Roberts
. Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 67 arch/arm64/mm/contpte.c | 17 2 files changed, 84 insertions(+) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h

[PATCH v5 20/25] arm64/mm: Implement new wrprotect_ptes() batch API

2024-02-02 Thread Ryan Roberts
viour when 'Misprogramming the Contiguous bit'. See section D21194 at https://developer.arm.com/documentation/102105/latest/ Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 61 ++-- arch/arm64/mm/contpte.c | 35 +++

[PATCH v5 18/25] arm64/mm: Split __flush_tlb_range() to elide trailing DSB

2024-02-02 Thread Ryan Roberts
the young bit from a contiguous range of ptes. Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/tlbflush.h | 13 +++-- 1 file changed, 11 insertions(+), 2 deletions(-) diff --git a/arch/arm64/include/asm/tlbflush.h b/arch/arm64/include/asm/tlbflu

[PATCH v5 19/25] arm64/mm: Wire up PTE_CONT for user mappings

2024-02-02 Thread Ryan Roberts
lts to enabled as long as its dependency, TRANSPARENT_HUGEPAGE, is also enabled. The core-mm depends upon TRANSPARENT_HUGEPAGE to be able to allocate large folios, so if it's not enabled, then there is no chance of meeting the physical contiguity requirement for contpte mappings. Tested-by: J

[PATCH v5 14/25] arm64/mm: ptep_clear_flush_young(): New layer to manage contig bit

2024-02-02 Thread Ryan Roberts
. Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 7 --- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 77a8b100e1cd..2870bc12f288 100644 --- a/arch/arm64/include

[PATCH v5 13/25] arm64/mm: ptep_test_and_clear_young(): New layer to manage contig bit

2024-02-02 Thread Ryan Roberts
. Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 18 +++--- 1 file changed, 7 insertions(+), 11 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 5f560326116e..77a8b100e1cd 100644 --- a/arch

[PATCH v5 12/25] arm64/mm: ptep_get_and_clear(): New layer to manage contig bit

2024-02-02 Thread Ryan Roberts
. Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 5 +++-- arch/arm64/mm/hugetlbpage.c | 6 +++--- 2 files changed, 6 insertions(+), 5 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 3b0ff58109c5

[PATCH v5 11/25] arm64/mm: pte_clear(): New layer to manage contig bit

2024-02-02 Thread Ryan Roberts
. Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 3 ++- arch/arm64/mm/fixmap.c | 2 +- arch/arm64/mm/hugetlbpage.c | 2 +- arch/arm64/mm/mmu.c | 2 +- 4 files changed, 5 insertions(+), 4 deletions(-) diff --git a/arch/arm64

[PATCH v5 10/25] arm64/mm: set_ptes()/set_pte_at(): New layer to manage contig bit

2024-02-02 Thread Ryan Roberts
managing their own loop. This is left for future improvement. Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 10 +- arch/arm64/kernel/mte.c | 2 +- arch/arm64/kvm/guest.c | 2 +- arch/arm64/mm/fault.c| 2 +- arch

[PATCH v5 09/25] arm64/mm: set_pte(): New layer to manage contig bit

2024-02-02 Thread Ryan Roberts
. Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 11 +++ arch/arm64/kernel/efi.c | 2 +- arch/arm64/mm/fixmap.c | 2 +- arch/arm64/mm/kasan_init.c | 4 ++-- arch/arm64/mm/mmu.c | 2 +- arch/arm64/mm

[PATCH v5 16/25] arm64/mm: ptep_set_access_flags(): New layer to manage contig bit

2024-02-02 Thread Ryan Roberts
. Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 10 ++ arch/arm64/mm/fault.c| 6 +++--- arch/arm64/mm/hugetlbpage.c | 2 +- 3 files changed, 10 insertions(+), 8 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch

[PATCH v5 15/25] arm64/mm: ptep_set_wrprotect(): New layer to manage contig bit

2024-02-02 Thread Ryan Roberts
. Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 10 ++ arch/arm64/mm/hugetlbpage.c | 2 +- 2 files changed, 7 insertions(+), 5 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b/arch/arm64/include/asm/pgtable.h index 2870bc12f288

[PATCH v5 17/25] arm64/mm: ptep_get(): New layer to manage contig bit

2024-02-02 Thread Ryan Roberts
() so convert those to the private API. While other callsites were doing direct READ_ONCE(), so convert those to use the appropriate (public/private) API too. Tested-by: John Hubbard Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 12 +--- arch/arm64/kernel/efi.c

[PATCH v5 07/25] x86/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-02 Thread Ryan Roberts
Core-mm needs to be able to advance the pfn by an arbitrary amount, so improve the API to do so and change the name. Signed-off-by: Ryan Roberts --- arch/x86/include/asm/pgtable.h | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/x86/include/asm/pgtable.h b/arch

[PATCH v5 05/25] arm64/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-02 Thread Ryan Roberts
Core-mm needs to be able to advance the pfn by an arbitrary amount, so improve the API to do so and change the name. Signed-off-by: Ryan Roberts --- arch/arm64/include/asm/pgtable.h | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/arch/arm64/include/asm/pgtable.h b

[PATCH v5 06/25] powerpc/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-02 Thread Ryan Roberts
Core-mm needs to be able to advance the pfn by an arbitrary amount, so improve the API to do so and change the name. Signed-off-by: Ryan Roberts --- arch/powerpc/mm/pgtable.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/powerpc/mm/pgtable.c b/arch/powerpc/mm

[PATCH v5 08/25] mm: Remove pte_next_pfn() and replace with pte_advance_pfn()

2024-02-02 Thread Ryan Roberts
Now that the architectures are converted over to pte_advance_pfn(), we can remove the pte_next_pfn() wrapper and convert the callers to call pte_advance_pfn(). Signed-off-by: Ryan Roberts --- include/linux/pgtable.h | 9 + mm/memory.c | 4 ++-- 2 files changed, 3 insertions

[PATCH v5 04/25] arm/mm: Convert pte_next_pfn() to pte_advance_pfn()

2024-02-02 Thread Ryan Roberts
Core-mm needs to be able to advance the pfn by an arbitrary amount, so improve the API to do so and change the name. Signed-off-by: Ryan Roberts --- arch/arm/mm/mmu.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c index c24e29c0b9a4

[PATCH v5 03/25] mm: Make pte_next_pfn() a wrapper around pte_advance_pfn()

2024-02-02 Thread Ryan Roberts
incrementally switch the architectures over. Once all arches are moved over, we will change all the core-mm callers to call pte_advance_pfn() directly and remove the wrapper. Signed-off-by: Ryan Roberts --- include/linux/pgtable.h | 8 +++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git

[PATCH v5 02/25] mm: thp: Batch-collapse PMD with set_ptes()

2024-02-02 Thread Ryan Roberts
are set as a batch, the contpte blocks can be initially set up pre-folded (once the arm64 contpte support is added in the next few patches). This leads to noticeable performance improvement during split. Acked-by: David Hildenbrand Signed-off-by: Ryan Roberts --- mm/huge_memory.c | 58

[PATCH v5 01/25] mm: Clarify the spec for set_ptes()

2024-02-02 Thread Ryan Roberts
s must initially be not-present. All set_ptes() callsites already conform to this requirement. Stating it explicitly is useful because it allows for a simplification to the upcoming arm64 contpte implementation. Signed-off-by: Ryan Roberts --- include/linux/pgtable.h | 4 1 file chan

[PATCH v5 00/25] Transparent Contiguous PTEs for User Mappings

2024-02-02 Thread Ryan Roberts
943f647...@arm.com/ [6] https://lore.kernel.org/lkml/20240129124649.189745-1-da...@redhat.com/ [7] https://lore.kernel.org/lkml/20240129143221.263763-1-da...@redhat.com/ [8] https://lore.kernel.org/linux-mm/c507308d-bdd4-5f9e-d4ff-e96e4520b...@nvidia.com/ Thanks, Ryan Ryan Roberts (25): mm: Clarify t

Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread Ryan Roberts
On 31/01/2024 15:05, David Hildenbrand wrote: > On 31.01.24 16:02, Ryan Roberts wrote: >> On 31/01/2024 14:29, David Hildenbrand wrote: >>>>> Note that regarding NUMA effects, I mean when some memory access within >>>>> the >>>>> same >

Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread Ryan Roberts
On 31/01/2024 14:29, David Hildenbrand wrote: >>> Note that regarding NUMA effects, I mean when some memory access within the >>> same >>> socket is faster/slower even with only a single node. On AMD EPYC that's >>> possible, depending on which core you are running and on which memory >>>

Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread Ryan Roberts
On 31/01/2024 13:38, David Hildenbrand wrote: Nope: looks the same. I've taken my test harness out of the picture and done everything manually from the ground up, with the old tests and the new. Headline is that I see similar numbers from both. >>> >>> It took me a while

Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread Ryan Roberts
On 31/01/2024 12:56, David Hildenbrand wrote: > On 31.01.24 13:37, Ryan Roberts wrote: >> On 31/01/2024 11:49, Ryan Roberts wrote: >>> On 31/01/2024 11:28, David Hildenbrand wrote: >>>> On 31.01.24 12:16, Ryan Roberts wrote: >>>>> On 31/01/2024 11:06, D

Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread Ryan Roberts
On 31/01/2024 11:49, Ryan Roberts wrote: > On 31/01/2024 11:28, David Hildenbrand wrote: >> On 31.01.24 12:16, Ryan Roberts wrote: >>> On 31/01/2024 11:06, David Hildenbrand wrote: >>>> On 31.01.24 11:43, Ryan Roberts wrote: >>>>> On 29/01/20

Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread Ryan Roberts
On 31/01/2024 11:28, David Hildenbrand wrote: > On 31.01.24 12:16, Ryan Roberts wrote: >> On 31/01/2024 11:06, David Hildenbrand wrote: >>> On 31.01.24 11:43, Ryan Roberts wrote: >>>> On 29/01/2024 12:46, David Hildenbrand wrote: >>>>> Now that t

Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread Ryan Roberts
On 31/01/2024 11:06, David Hildenbrand wrote: > On 31.01.24 11:43, Ryan Roberts wrote: >> On 29/01/2024 12:46, David Hildenbrand wrote: >>> Now that the rmap overhaul[1] is upstream that provides a clean interface >>> for rmap batching, let's implement PTE batching

Re: [PATCH v3 00/15] mm/memory: optimize fork() with PTE-mapped THP

2024-01-31 Thread Ryan Roberts
https://lkml.kernel.org/r/20231220224504.646757-1-da...@redhat.com > [2] https://lkml.kernel.org/r/20231218105100.172635-1-ryan.robe...@arm.com > [3] https://lkml.kernel.org/r/20230809083256.699513-1-da...@redhat.com > [4] https://lkml.kernel.org/r/20231124132626.235350-1-da...@redha

Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread Ryan Roberts
On 31/01/2024 10:21, David Hildenbrand wrote: > >>> + >>> +#ifndef clear_full_ptes >>> +/** >>> + * clear_full_ptes - Clear PTEs that map consecutive pages of the same >>> folio. >> >> I know its implied from "pages of the same folio" (and even more so for the >> above variant due to mention of
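The `clear_full_ptes()` helper discussed above clears a run of PTEs that all map pages of the same folio, folding the per-entry access/dirty state into one result. As a rough illustration of those semantics, here is a minimal userspace model — the `clear_ptes_model` name, the software bit layout, and the fold-into-first-pte behaviour are assumptions for demonstration, not the kernel's actual `pte_t` representation:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative userspace model: a "pte" is a u64 holding a pfn plus
 * software present/young/dirty bits. Not the kernel's pte_t. */
#define PTE_PRESENT   (1ull << 0)
#define PTE_YOUNG     (1ull << 1)
#define PTE_DIRTY     (1ull << 2)
#define PTE_PFN_SHIFT 12

/* Clear nr consecutive entries, returning the first pte with the
 * young/dirty bits of the whole batch folded in -- mirroring the
 * batched-clear semantics the thread describes for one folio. */
static uint64_t clear_ptes_model(uint64_t *ptep, unsigned int nr)
{
    uint64_t folded = ptep[0];

    for (unsigned int i = 0; i < nr; i++) {
        folded |= ptep[i] & (PTE_YOUNG | PTE_DIRTY);
        ptep[i] = 0;    /* pte_clear() equivalent in this model */
    }
    return folded;
}
```

The point of folding is that the caller gets one pte value summarising whether *any* page of the folio was accessed or dirtied, so rmap and writeback bookkeeping can be done once per folio rather than once per page.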

Re: [PATCH v1 0/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-31 Thread Ryan Roberts
On 31/01/2024 10:16, David Hildenbrand wrote: > On 31.01.24 03:20, Yin Fengwei wrote: >> On 1/29/24 22:32, David Hildenbrand wrote: >>> This series is based on [1] and must be applied on top of it. >>> Similar to what we did with fork(), let's implement PTE batching >>> during unmap/zap when

Re: [PATCH v1 9/9] mm/memory: optimize unmap/zap with PTE-mapped THP

2024-01-30 Thread Ryan Roberts
On 29/01/2024 14:32, David Hildenbrand wrote: > Similar to how we optimized fork(), let's implement PTE batching when > consecutive (present) PTEs map consecutive pages of the same large > folio. > > Most infrastructure we need for batching (mmu gather, rmap) is already > there. We only have to
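The batching described in this patch hinges on recognising a run of present PTEs that map consecutive pages of the same large folio, so unmap/zap can process the whole run at once. The sketch below models that scan in userspace; the `pte_batch_model` name and the pfn-only pte representation are assumptions for illustration (the kernel's version also compares other pte bits):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Model pte: just a pfn; 0 means not present. Illustrative only. */
static size_t pte_batch_model(const uint64_t *pfns, size_t max,
                              uint64_t folio_start_pfn, size_t folio_nr)
{
    size_t nr = 0;

    while (nr < max) {
        uint64_t pfn = pfns[nr];

        /* Stop at a hole, or when we leave the folio's pfn range. */
        if (pfn == 0 || pfn < folio_start_pfn ||
            pfn >= folio_start_pfn + folio_nr)
            break;
        /* Require consecutive pfns so one rmap/TLB batch covers all. */
        if (nr > 0 && pfn != pfns[nr - 1] + 1)
            break;
        nr++;
    }
    return nr;
}
```

The returned count is then used to clear all the entries, adjust the folio's mapcount once, and queue one batched TLB invalidation, instead of repeating each step per page.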

Re: [PATCH v1 8/9] mm/mmu_gather: add tlb_remove_tlb_entries()

2024-01-30 Thread Ryan Roberts
ke the compiler happy (and avoid making tlb_remove_tlb_entries() a > macro). > > Signed-off-by: David Hildenbrand Reviewed-by: Ryan Roberts > --- > arch/powerpc/include/asm/tlb.h | 2 ++ > include/asm-generic/tlb.h | 20 > 2 files changed, 22

Re: [PATCH v1 7/9] mm/mmu_gather: add __tlb_remove_folio_pages()

2024-01-30 Thread Ryan Roberts
On 29/01/2024 14:32, David Hildenbrand wrote: > Add __tlb_remove_folio_pages(), which will remove multiple consecutive > pages that belong to the same large folio, instead of only a single > page. We'll be using this function when optimizing unmapping/zapping of > large folios that are mapped by
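`__tlb_remove_folio_pages()` lets the mmu_gather machinery queue multiple consecutive pages of one folio in a single call rather than one page at a time. A tiny userspace model of that idea — a gather buffer storing `(pfn, nr)` runs and coalescing adjacent ones — is sketched below; all names and the fixed-size buffer are assumptions for illustration, not the kernel's mmu_gather layout:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GATHER_SLOTS 8

struct gather_entry { uint64_t pfn; unsigned int nr; };
struct gather { struct gather_entry ent[GATHER_SLOTS]; size_t used; };

/* Queue nr pages starting at pfn; coalesce with the previous run
 * when contiguous. Returns -1 when full (caller would flush). */
static int gather_add_pages(struct gather *g, uint64_t pfn, unsigned int nr)
{
    if (g->used &&
        g->ent[g->used - 1].pfn + g->ent[g->used - 1].nr == pfn) {
        g->ent[g->used - 1].nr += nr;
        return 0;
    }
    if (g->used == GATHER_SLOTS)
        return -1;
    g->ent[g->used].pfn = pfn;
    g->ent[g->used].nr = nr;
    g->used++;
    return 0;
}
```

Storing runs instead of individual pages is what keeps the gather buffer from filling up `nr` times faster when zapping PTE-mapped THPs.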

Re: [PATCH v1 6/9] mm/mmu_gather: define ENCODED_PAGE_FLAG_DELAY_RMAP

2024-01-30 Thread Ryan Roberts
m in an array of encoded pages is a "nr_pages" argument and > not an encoded page. > > Signed-off-by: David Hildenbrand Reviewed-by: Ryan Roberts > --- > include/linux/mm_types.h | 17 +++-- > mm/mmu_gather.c | 5 +++-- > 2 files changed, 14

Re: [PATCH v1 4/9] mm/memory: factor out zapping folio pte into zap_present_folio_pte()

2024-01-30 Thread Ryan Roberts
On 29/01/2024 14:32, David Hildenbrand wrote: > Let's prepare for further changes by factoring it out into a separate > function. > > Signed-off-by: David Hildenbrand Reviewed-by: Ryan Roberts > --- > mm/memory.c | 53 -

Re: [PATCH v1 1/9] mm/memory: factor out zapping of present pte into zap_present_pte()

2024-01-30 Thread Ryan Roberts
On 30/01/2024 08:41, David Hildenbrand wrote: > On 30.01.24 09:13, Ryan Roberts wrote: >> On 29/01/2024 14:32, David Hildenbrand wrote: >>> Let's prepare for further changes by factoring out processing of present >>> PTEs. >>> >>> Signed-off-by: Davi

Re: [PATCH v1 3/9] mm/memory: further separate anon and pagecache folio handling in zap_present_pte()

2024-01-30 Thread Ryan Roberts
On 30/01/2024 08:37, David Hildenbrand wrote: > On 30.01.24 09:31, Ryan Roberts wrote: >> On 29/01/2024 14:32, David Hildenbrand wrote: >>> We don't need up-to-date accessed-dirty information for anon folios and can >>> simply work with the ptent we already have. Al

Re: [PATCH v1 5/9] mm/mmu_gather: pass "delay_rmap" instead of encoded page to __tlb_remove_page_size()

2024-01-30 Thread Ryan Roberts
ly for batching of > multiple pages of the same folio, specifying that the next encoded page > pointer in an array is actually "nr_pages". So pass page + delay_rmap flag > instead of an encoded page, to handle the encoding internally. > > Signed-off-by: David Hildenbrand R
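The "encoded page" scheme this patch builds on exploits the fact that `struct page` pointers are aligned, so their low bits are free to carry flags — including, per this series, a flag saying the *next* array slot holds an `nr_pages` count rather than a pointer. The sketch below shows the low-bit tagging trick in isolation; the macro names and two-bit mask are assumptions for illustration, not the kernel's `encoded_page` API:

```c
#include <assert.h>
#include <stdint.h>

/* Page pointers are at least 4-byte aligned, so the low two bits
 * are free to carry flags -- the trick behind encoded pages. */
#define ENC_DELAY_RMAP 1ul
#define ENC_NR_NEXT    2ul   /* next slot holds nr_pages, not a pointer */
#define ENC_FLAG_MASK  3ul

static uintptr_t encode_page(const void *page, unsigned long flags)
{
    return (uintptr_t)page | (flags & ENC_FLAG_MASK);
}

static void *encoded_page_ptr(uintptr_t enc)
{
    return (void *)(enc & ~ENC_FLAG_MASK);
}

static unsigned long encoded_page_flags(uintptr_t enc)
{
    return enc & ENC_FLAG_MASK;
}
```

Passing `page + delay_rmap` separately, as the patch does, keeps the encoding internal to the mmu_gather code so callers never construct encoded pointers themselves.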
