Re: [PATCH v5 26/33] nios2: Convert __pte_free_tlb() to use ptdescs

2023-06-26 Thread Guenter Roeck
On Thu, Jun 22, 2023 at 01:57:38PM -0700, Vishal Moola (Oracle) wrote:
> Part of the conversions to replace pgtable constructor/destructors with
> ptdesc equivalents.
> 
> Signed-off-by: Vishal Moola (Oracle) 
> Acked-by: Mike Rapoport (IBM) 

This patch causes all nios2 builds to fail.

Building nios2:allnoconfig ... failed
--
Error log:
<stdin>:1519:2: warning: #warning syscall clone3 not implemented [-Wcpp]
In file included from mm/memory.c:85:
mm/memory.c: In function 'free_pte_range':
arch/nios2/include/asm/pgalloc.h:33:17: error: implicit declaration of function 'pagetable_pte_dtor'; did you mean 'pgtable_pte_page_dtor'? [-Werror=implicit-function-declaration]
   33 | pagetable_pte_dtor(page_ptdesc(pte));   \
  | ^~
include/asm-generic/tlb.h:666:17: note: in expansion of macro '__pte_free_tlb'
  666 | __pte_free_tlb(tlb, ptep, address); \
  | ^~
mm/memory.c:193:9: note: in expansion of macro 'pte_free_tlb'
  193 | pte_free_tlb(tlb, token, addr);
  | ^~~~
arch/nios2/include/asm/pgalloc.h:33:36: error: implicit declaration of function 'page_ptdesc' [-Werror=implicit-function-declaration]
   33 | pagetable_pte_dtor(page_ptdesc(pte));   \
  |^~~
include/asm-generic/tlb.h:666:17: note: in expansion of macro '__pte_free_tlb'
  666 | __pte_free_tlb(tlb, ptep, address); \
  | ^~
mm/memory.c:193:9: note: in expansion of macro 'pte_free_tlb'
  193 | pte_free_tlb(tlb, token, addr);
  | ^~~~
arch/nios2/include/asm/pgalloc.h:34:17: error: implicit declaration of function 'tlb_remove_page_ptdesc'; did you mean 'tlb_remove_page_size'? [-Werror=implicit-function-declaration]
   34 | tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));  \
  | ^~
include/asm-generic/tlb.h:666:17: note: in expansion of macro '__pte_free_tlb'
  666 | __pte_free_tlb(tlb, ptep, address); \
  | ^~
mm/memory.c:193:9: note: in expansion of macro 'pte_free_tlb'
  193 | pte_free_tlb(tlb, token, addr);

> ---
>  arch/nios2/include/asm/pgalloc.h | 8 
>  1 file changed, 4 insertions(+), 4 deletions(-)
> 
> diff --git a/arch/nios2/include/asm/pgalloc.h 
> b/arch/nios2/include/asm/pgalloc.h
> index ecd1657bb2ce..ce6bb8e74271 100644
> --- a/arch/nios2/include/asm/pgalloc.h
> +++ b/arch/nios2/include/asm/pgalloc.h
> @@ -28,10 +28,10 @@ static inline void pmd_populate(struct mm_struct *mm, 
> pmd_t *pmd,
>  
>  extern pgd_t *pgd_alloc(struct mm_struct *mm);
>  
> -#define __pte_free_tlb(tlb, pte, addr)   \
> - do {\
> - pgtable_pte_page_dtor(pte); \
> - tlb_remove_page((tlb), (pte));  \
> +#define __pte_free_tlb(tlb, pte, addr)   
> \
> + do {\
> + pagetable_pte_dtor(page_ptdesc(pte));   \
> + tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));  \
>   } while (0)
>  
>  #endif /* _ASM_NIOS2_PGALLOC_H */
> -- 
> 2.40.1
> 
> 


Re: [PATCH v6 00/33] Split ptdesc from struct page

2023-06-26 Thread Hugh Dickins
On Mon, 26 Jun 2023, Vishal Moola (Oracle) wrote:

> The MM subsystem is trying to shrink struct page. This patchset
> introduces a memory descriptor for page table tracking - struct ptdesc.
...
>  39 files changed, 686 insertions(+), 455 deletions(-)

I don't see the point of this patchset: to me it is just obfuscation of
the present-day tight relationship between page table and struct page.

Matthew already explained:

> The intent is to get ptdescs to be dynamically allocated at some point
> in the ~2-3 years out future when we have finished the folio project ...

So in a kindly mood, I'd say that this patchset is ahead of its time.
But I can certainly adapt to it, if everyone else sees some point to it.

Hugh
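
For readers following the thread: the descriptor being debated is, today, a
typed view over the same memory as struct page. A simplified sketch of the
idea, with the field set inferred from the accessors the later patches in
this series use (not the exact upstream layout):

	/* Sketch only; overlays struct page field-for-field for now. */
	struct ptdesc {
		unsigned long __page_flags;		/* overlays page->flags */
		union {
			struct rcu_head pt_rcu_head;	/* deferred freeing */
			struct list_head pt_list;	/* x86 pgd_list, s390 crst_list, ... */
		};
		unsigned long _pt_s390_gaddr;		/* overlays page->mapping (s390 gmap) */
		union {
			struct mm_struct *pt_mm;	/* x86 pgds only */
			atomic_t pt_frag_refcount;	/* powerpc pte/pmd fragments */
		};
		spinlock_t ptl;				/* split page table lock */
	};

Once every user goes through ptdesc accessors such as page_ptdesc() and
ptdesc_page(), the descriptor can later be allocated separately from struct
page without touching the architectures again, which is the "2-3 years out"
goal quoted above.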


[PATCH v6 33/33] mm: Remove pgtable_{pmd, pte}_page_{ctor, dtor}() wrappers

2023-06-26 Thread Vishal Moola (Oracle)
These functions are no longer necessary. Remove them and clean up the
Documentation referencing them.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 Documentation/mm/split_page_table_lock.rst| 12 +--
 .../zh_CN/mm/split_page_table_lock.rst| 14 ++---
 include/linux/mm.h| 20 ---
 3 files changed, 13 insertions(+), 33 deletions(-)

diff --git a/Documentation/mm/split_page_table_lock.rst 
b/Documentation/mm/split_page_table_lock.rst
index a834fad9de12..e4f6972eb6c0 100644
--- a/Documentation/mm/split_page_table_lock.rst
+++ b/Documentation/mm/split_page_table_lock.rst
@@ -58,7 +58,7 @@ Support of split page table lock by an architecture
 ===
 
 There's no need in special enabling of PTE split page table lock: everything
-required is done by pgtable_pte_page_ctor() and pgtable_pte_page_dtor(), which
+required is done by pagetable_pte_ctor() and pagetable_pte_dtor(), which
 must be called on PTE table allocation / freeing.
 
 Make sure the architecture doesn't use slab allocator for page table
@@ -68,8 +68,8 @@ This field shares storage with page->ptl.
 PMD split lock only makes sense if you have more than two page table
 levels.
 
-PMD split lock enabling requires pgtable_pmd_page_ctor() call on PMD table
-allocation and pgtable_pmd_page_dtor() on freeing.
+PMD split lock enabling requires pagetable_pmd_ctor() call on PMD table
+allocation and pagetable_pmd_dtor() on freeing.
 
 Allocation usually happens in pmd_alloc_one(), freeing in pmd_free() and
 pmd_free_tlb(), but make sure you cover all PMD table allocation / freeing
@@ -77,7 +77,7 @@ paths: i.e X86_PAE preallocate few PMDs on pgd_alloc().
 
 With everything in place you can set CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK.
 
-NOTE: pgtable_pte_page_ctor() and pgtable_pmd_page_ctor() can fail -- it must
+NOTE: pagetable_pte_ctor() and pagetable_pmd_ctor() can fail -- it must
 be handled properly.
 
 page->ptl
@@ -97,7 +97,7 @@ trick:
split lock with enabled DEBUG_SPINLOCK or DEBUG_LOCK_ALLOC, but costs
one more cache line for indirect access;
 
-The spinlock_t allocated in pgtable_pte_page_ctor() for PTE table and in
-pgtable_pmd_page_ctor() for PMD table.
+The spinlock_t allocated in pagetable_pte_ctor() for PTE table and in
+pagetable_pmd_ctor() for PMD table.
 
 Please, never access page->ptl directly -- use appropriate helper.
diff --git a/Documentation/translations/zh_CN/mm/split_page_table_lock.rst 
b/Documentation/translations/zh_CN/mm/split_page_table_lock.rst
index 4fb7aa666037..a2c288670a24 100644
--- a/Documentation/translations/zh_CN/mm/split_page_table_lock.rst
+++ b/Documentation/translations/zh_CN/mm/split_page_table_lock.rst
@@ -56,16 +56,16 @@ Hugetlb特定的辅助函数:
 架构对分页表锁的支持
 
 
-没有必要特别启用PTE分页表锁:所有需要的东西都由pgtable_pte_page_ctor()
-和pgtable_pte_page_dtor()完成,它们必须在PTE表分配/释放时被调用。
+没有必要特别启用PTE分页表锁:所有需要的东西都由pagetable_pte_ctor()
+和pagetable_pte_dtor()完成,它们必须在PTE表分配/释放时被调用。
 
 确保架构不使用slab分配器来分配页表:slab使用page->slab_cache来分配其页
 面。这个区域与page->ptl共享存储。
 
 PMD分页锁只有在你有两个以上的页表级别时才有意义。
 
-启用PMD分页锁需要在PMD表分配时调用pgtable_pmd_page_ctor(),在释放时调
-用pgtable_pmd_page_dtor()。
+启用PMD分页锁需要在PMD表分配时调用pagetable_pmd_ctor(),在释放时调
+用pagetable_pmd_dtor()。
 
 分配通常发生在pmd_alloc_one()中,释放发生在pmd_free()和pmd_free_tlb()
 中,但要确保覆盖所有的PMD表分配/释放路径:即X86_PAE在pgd_alloc()中预先
@@ -73,7 +73,7 @@ PMD分页锁只有在你有两个以上的页表级别时才有意义。
 
 一切就绪后,你可以设置CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK。
 
-注意:pgtable_pte_page_ctor()和pgtable_pmd_page_ctor()可能失败--必
+注意:pagetable_pte_ctor()和pagetable_pmd_ctor()可能失败--必
 须正确处理。
 
 page->ptl
@@ -90,7 +90,7 @@ page->ptl用于访问分割页表锁,其中'page'是包含该表的页面struc
的指针并动态分配它。这允许在启用DEBUG_SPINLOCK或DEBUG_LOCK_ALLOC的
情况下使用分页锁,但由于间接访问而多花了一个缓存行。
 
-PTE表的spinlock_t分配在pgtable_pte_page_ctor()中,PMD表的spinlock_t
-分配在pgtable_pmd_page_ctor()中。
+PTE表的spinlock_t分配在pagetable_pte_ctor()中,PMD表的spinlock_t
+分配在pagetable_pmd_ctor()中。
 
 请不要直接访问page->ptl - -使用适当的辅助函数。
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0e4d5f6d10e5..dc0f19f35424 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2873,11 +2873,6 @@ static inline bool pagetable_pte_ctor(struct ptdesc 
*ptdesc)
return true;
 }
 
-static inline bool pgtable_pte_page_ctor(struct page *page)
-{
-   return pagetable_pte_ctor(page_ptdesc(page));
-}
-
 static inline void pagetable_pte_dtor(struct ptdesc *ptdesc)
 {
struct folio *folio = ptdesc_folio(ptdesc);
@@ -2887,11 +2882,6 @@ static inline void pagetable_pte_dtor(struct ptdesc 
*ptdesc)
lruvec_stat_sub_folio(folio, NR_PAGETABLE);
 }
 
-static inline void pgtable_pte_page_dtor(struct page *page)
-{
-   pagetable_pte_dtor(page_ptdesc(page));
-}
-
 pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp);
 static inline pte_t *pte_offset_map(pmd_t *pmd, unsigned long addr)
 {
@@ -2993,11 +2983,6 @@ static inline bool pagetable_pmd_ctor(struct ptdesc 
*ptdesc)
return true;
 }
 

[PATCH v6 32/33] um: Convert {pmd, pte}_free_tlb() to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
Part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents. Also cleans up some spacing issues.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/um/include/asm/pgalloc.h | 18 +-
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/arch/um/include/asm/pgalloc.h b/arch/um/include/asm/pgalloc.h
index 8ec7cd46dd96..de5e31c64793 100644
--- a/arch/um/include/asm/pgalloc.h
+++ b/arch/um/include/asm/pgalloc.h
@@ -25,19 +25,19 @@
  */
 extern pgd_t *pgd_alloc(struct mm_struct *);
 
-#define __pte_free_tlb(tlb,pte, address)   \
-do {   \
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb),(pte));   \
+#define __pte_free_tlb(tlb, pte, address)  \
+do {   \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));  \
 } while (0)
 
 #ifdef CONFIG_3_LEVEL_PGTABLES
 
-#define __pmd_free_tlb(tlb, pmd, address)  \
-do {   \
-   pgtable_pmd_page_dtor(virt_to_page(pmd));   \
-   tlb_remove_page((tlb),virt_to_page(pmd));   \
-} while (0)\
+#define __pmd_free_tlb(tlb, pmd, address)  \
+do {   \
+   pagetable_pmd_dtor(virt_to_ptdesc(pmd));\
+   tlb_remove_page_ptdesc((tlb), virt_to_ptdesc(pmd)); \
+} while (0)
 
 #endif
 
-- 
2.40.1



[PATCH v6 31/33] sparc: Convert pgtable_pte_page_{ctor, dtor}() to ptdesc equivalents

2023-06-26 Thread Vishal Moola (Oracle)
Part of the conversions to replace pgtable pte constructor/destructors with
ptdesc equivalents.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/sparc/mm/srmmu.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/arch/sparc/mm/srmmu.c b/arch/sparc/mm/srmmu.c
index 13f027afc875..8393faa3e596 100644
--- a/arch/sparc/mm/srmmu.c
+++ b/arch/sparc/mm/srmmu.c
@@ -355,7 +355,8 @@ pgtable_t pte_alloc_one(struct mm_struct *mm)
return NULL;
page = pfn_to_page(__nocache_pa((unsigned long)ptep) >> PAGE_SHIFT);
spin_lock(&mm->page_table_lock);
-   if (page_ref_inc_return(page) == 2 && !pgtable_pte_page_ctor(page)) {
+   if (page_ref_inc_return(page) == 2 &&
+   !pagetable_pte_ctor(page_ptdesc(page))) {
page_ref_dec(page);
ptep = NULL;
}
@@ -371,7 +372,7 @@ void pte_free(struct mm_struct *mm, pgtable_t ptep)
page = pfn_to_page(__nocache_pa((unsigned long)ptep) >> PAGE_SHIFT);
spin_lock(&mm->page_table_lock);
if (page_ref_dec_return(page) == 1)
-   pgtable_pte_page_dtor(page);
+   pagetable_pte_dtor(page_ptdesc(page));
spin_unlock(&mm->page_table_lock);
 
srmmu_free_nocache(ptep, SRMMU_PTE_TABLE_SIZE);
-- 
2.40.1



[PATCH v6 30/33] sparc64: Convert various functions to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/sparc/mm/init_64.c | 17 +
 1 file changed, 9 insertions(+), 8 deletions(-)

diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
index 04f9db0c3111..105915cd2eee 100644
--- a/arch/sparc/mm/init_64.c
+++ b/arch/sparc/mm/init_64.c
@@ -2893,14 +2893,15 @@ pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 
 pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
-   struct page *page = alloc_page(GFP_KERNEL | __GFP_ZERO);
-   if (!page)
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL | __GFP_ZERO, 0);
+
+   if (!ptdesc)
return NULL;
-   if (!pgtable_pte_page_ctor(page)) {
-   __free_page(page);
+   if (!pagetable_pte_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
return NULL;
}
-   return (pte_t *) page_address(page);
+   return ptdesc_address(ptdesc);
 }
 
 void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
@@ -2910,10 +2911,10 @@ void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 
 static void __pte_free(pgtable_t pte)
 {
-   struct page *page = virt_to_page(pte);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pte);
 
-   pgtable_pte_page_dtor(page);
-   __free_page(page);
+   pagetable_pte_dtor(ptdesc);
+   pagetable_free(ptdesc);
 }
 
 void pte_free(struct mm_struct *mm, pgtable_t pte)
-- 
2.40.1



[PATCH v6 29/33] sh: Convert pte_free_tlb() to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
Part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents. Also cleans up some spacing issues.

Signed-off-by: Vishal Moola (Oracle) 
Reviewed-by: Geert Uytterhoeven 
Acked-by: John Paul Adrian Glaubitz 
Acked-by: Mike Rapoport (IBM) 
---
 arch/sh/include/asm/pgalloc.h | 9 +
 1 file changed, 5 insertions(+), 4 deletions(-)

diff --git a/arch/sh/include/asm/pgalloc.h b/arch/sh/include/asm/pgalloc.h
index a9e98233c4d4..5d8577ab1591 100644
--- a/arch/sh/include/asm/pgalloc.h
+++ b/arch/sh/include/asm/pgalloc.h
@@ -2,6 +2,7 @@
 #ifndef __ASM_SH_PGALLOC_H
 #define __ASM_SH_PGALLOC_H
 
+#include 
 #include 
 
 #define __HAVE_ARCH_PMD_ALLOC_ONE
@@ -31,10 +32,10 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t 
*pmd,
set_pmd(pmd, __pmd((unsigned long)page_address(pte)));
 }
 
-#define __pte_free_tlb(tlb,pte,addr)   \
-do {   \
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb), (pte));  \
+#define __pte_free_tlb(tlb, pte, addr) \
+do {   \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));  \
 } while (0)
 
 #endif /* __ASM_SH_PGALLOC_H */
-- 
2.40.1



[PATCH v6 28/33] riscv: Convert alloc_{pmd, pte}_late() to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.
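
The GFP_KERNEL & ~__GFP_HIGHMEM pattern in the conversion below keeps the old
semantics: __get_free_page() strips __GFP_HIGHMEM internally because it must
return a kernel virtual address, while pagetable_alloc() forwards its gfp mask
to the page allocator unchanged. A minimal sketch of the equivalence, assuming
pagetable_alloc() and ptdesc_address() behave as introduced earlier in the
series (statements inside an allocation helper):

	/* old: never returns a highmem page */
	pte_t *pte_old = (pte_t *)__get_free_page(GFP_KERNEL);

	/* new: mask the flag explicitly, then take the kernel mapping */
	struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
	pte_t *pte_new = ptdesc ? ptdesc_address(ptdesc) : NULL;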

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Palmer Dabbelt 
Acked-by: Mike Rapoport (IBM) 
---
 arch/riscv/include/asm/pgalloc.h |  8 
 arch/riscv/mm/init.c | 16 ++--
 2 files changed, 10 insertions(+), 14 deletions(-)

diff --git a/arch/riscv/include/asm/pgalloc.h b/arch/riscv/include/asm/pgalloc.h
index 59dc12b5b7e8..d169a4f41a2e 100644
--- a/arch/riscv/include/asm/pgalloc.h
+++ b/arch/riscv/include/asm/pgalloc.h
@@ -153,10 +153,10 @@ static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 
 #endif /* __PAGETABLE_PMD_FOLDED */
 
-#define __pte_free_tlb(tlb, pte, buf)   \
-do {\
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb), pte);\
+#define __pte_free_tlb(tlb, pte, buf)  \
+do {   \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), page_ptdesc(pte));\
 } while (0)
 #endif /* CONFIG_MMU */
 
diff --git a/arch/riscv/mm/init.c b/arch/riscv/mm/init.c
index 4b95d8999120..efff9c752fcf 100644
--- a/arch/riscv/mm/init.c
+++ b/arch/riscv/mm/init.c
@@ -354,12 +354,10 @@ static inline phys_addr_t __init 
alloc_pte_fixmap(uintptr_t va)
 
 static phys_addr_t __init alloc_pte_late(uintptr_t va)
 {
-   unsigned long vaddr;
-
-   vaddr = __get_free_page(GFP_KERNEL);
-   BUG_ON(!vaddr || !pgtable_pte_page_ctor(virt_to_page((void *)vaddr)));
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
 
-   return __pa(vaddr);
+   BUG_ON(!ptdesc || !pagetable_pte_ctor(ptdesc));
+   return __pa((pte_t *)ptdesc_address(ptdesc));
 }
 
 static void __init create_pte_mapping(pte_t *ptep,
@@ -437,12 +435,10 @@ static phys_addr_t __init alloc_pmd_fixmap(uintptr_t va)
 
 static phys_addr_t __init alloc_pmd_late(uintptr_t va)
 {
-   unsigned long vaddr;
-
-   vaddr = __get_free_page(GFP_KERNEL);
-   BUG_ON(!vaddr || !pgtable_pmd_page_ctor(virt_to_page((void *)vaddr)));
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
 
-   return __pa(vaddr);
+   BUG_ON(!ptdesc || !pagetable_pmd_ctor(ptdesc));
+   return __pa((pmd_t *)ptdesc_address(ptdesc));
 }
 
 static void __init create_pmd_mapping(pmd_t *pmdp,
-- 
2.40.1



[PATCH v6 27/33] openrisc: Convert __pte_free_tlb() to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
Part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/openrisc/include/asm/pgalloc.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/openrisc/include/asm/pgalloc.h 
b/arch/openrisc/include/asm/pgalloc.h
index b7b2b8d16fad..c6a73772a546 100644
--- a/arch/openrisc/include/asm/pgalloc.h
+++ b/arch/openrisc/include/asm/pgalloc.h
@@ -66,10 +66,10 @@ extern inline pgd_t *pgd_alloc(struct mm_struct *mm)
 
 extern pte_t *pte_alloc_one_kernel(struct mm_struct *mm);
 
-#define __pte_free_tlb(tlb, pte, addr) \
-do {   \
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb), (pte));  \
+#define __pte_free_tlb(tlb, pte, addr) \
+do {   \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));  \
 } while (0)
 
 #endif
-- 
2.40.1



[PATCH v6 26/33] nios2: Convert __pte_free_tlb() to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
Part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
Acked-by: Dinh Nguyen 
---
 arch/nios2/include/asm/pgalloc.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/nios2/include/asm/pgalloc.h b/arch/nios2/include/asm/pgalloc.h
index ecd1657bb2ce..ce6bb8e74271 100644
--- a/arch/nios2/include/asm/pgalloc.h
+++ b/arch/nios2/include/asm/pgalloc.h
@@ -28,10 +28,10 @@ static inline void pmd_populate(struct mm_struct *mm, pmd_t 
*pmd,
 
 extern pgd_t *pgd_alloc(struct mm_struct *mm);
 
-#define __pte_free_tlb(tlb, pte, addr) \
-   do {\
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb), (pte));  \
+#define __pte_free_tlb(tlb, pte, addr) \
+   do {\
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));  \
} while (0)
 
 #endif /* _ASM_NIOS2_PGALLOC_H */
-- 
2.40.1



[PATCH v6 25/33] mips: Convert various functions to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/mips/include/asm/pgalloc.h | 32 ++--
 arch/mips/mm/pgtable.c  |  8 +---
 2 files changed, 23 insertions(+), 17 deletions(-)

diff --git a/arch/mips/include/asm/pgalloc.h b/arch/mips/include/asm/pgalloc.h
index f72e737dda21..40e40a7eb94a 100644
--- a/arch/mips/include/asm/pgalloc.h
+++ b/arch/mips/include/asm/pgalloc.h
@@ -51,13 +51,13 @@ extern pgd_t *pgd_alloc(struct mm_struct *mm);
 
 static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
-   free_pages((unsigned long)pgd, PGD_TABLE_ORDER);
+   pagetable_free(virt_to_ptdesc(pgd));
 }
 
-#define __pte_free_tlb(tlb,pte,address)\
-do {   \
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb), pte);\
+#define __pte_free_tlb(tlb, pte, address)  \
+do {   \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), page_ptdesc(pte));\
 } while (0)
 
 #ifndef __PAGETABLE_PMD_FOLDED
@@ -65,18 +65,18 @@ do {
\
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
pmd_t *pmd;
-   struct page *pg;
+   struct ptdesc *ptdesc;
 
-   pg = alloc_pages(GFP_KERNEL_ACCOUNT, PMD_TABLE_ORDER);
-   if (!pg)
+   ptdesc = pagetable_alloc(GFP_KERNEL_ACCOUNT, PMD_TABLE_ORDER);
+   if (!ptdesc)
return NULL;
 
-   if (!pgtable_pmd_page_ctor(pg)) {
-   __free_pages(pg, PMD_TABLE_ORDER);
+   if (!pagetable_pmd_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
return NULL;
}
 
-   pmd = (pmd_t *)page_address(pg);
+   pmd = ptdesc_address(ptdesc);
pmd_init(pmd);
return pmd;
 }
@@ -90,10 +90,14 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, 
unsigned long address)
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address)
 {
pud_t *pud;
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM,
+   PUD_TABLE_ORDER);
 
-   pud = (pud_t *) __get_free_pages(GFP_KERNEL, PUD_TABLE_ORDER);
-   if (pud)
-   pud_init(pud);
+   if (!ptdesc)
+   return NULL;
+   pud = ptdesc_address(ptdesc);
+
+   pud_init(pud);
return pud;
 }
 
diff --git a/arch/mips/mm/pgtable.c b/arch/mips/mm/pgtable.c
index b13314be5d0e..1506e458040d 100644
--- a/arch/mips/mm/pgtable.c
+++ b/arch/mips/mm/pgtable.c
@@ -10,10 +10,12 @@
 
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-   pgd_t *ret, *init;
+   pgd_t *init, *ret = NULL;
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM,
+   PGD_TABLE_ORDER);
 
-   ret = (pgd_t *) __get_free_pages(GFP_KERNEL, PGD_TABLE_ORDER);
-   if (ret) {
+   if (ptdesc) {
+   ret = ptdesc_address(ptdesc);
init = pgd_offset(&init_mm, 0UL);
pgd_init(ret);
memcpy(ret + USER_PTRS_PER_PGD, init + USER_PTRS_PER_PGD,
-- 
2.40.1



[PATCH v6 24/33] m68k: Convert various functions to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
Acked-by: Geert Uytterhoeven 
---
 arch/m68k/include/asm/mcf_pgalloc.h  | 47 ++--
 arch/m68k/include/asm/sun3_pgalloc.h |  8 ++---
 arch/m68k/mm/motorola.c  |  4 +--
 3 files changed, 30 insertions(+), 29 deletions(-)

diff --git a/arch/m68k/include/asm/mcf_pgalloc.h 
b/arch/m68k/include/asm/mcf_pgalloc.h
index 5c2c0a864524..302c5bf67179 100644
--- a/arch/m68k/include/asm/mcf_pgalloc.h
+++ b/arch/m68k/include/asm/mcf_pgalloc.h
@@ -5,22 +5,22 @@
 #include 
 #include 
 
-extern inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
+static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 {
-   free_page((unsigned long) pte);
+   pagetable_free(virt_to_ptdesc(pte));
 }
 
 extern const char bad_pmd_string[];
 
-extern inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
+static inline pte_t *pte_alloc_one_kernel(struct mm_struct *mm)
 {
-   unsigned long page = __get_free_page(GFP_DMA);
+   struct ptdesc *ptdesc = pagetable_alloc((GFP_DMA | __GFP_ZERO) &
+   ~__GFP_HIGHMEM, 0);
 
-   if (!page)
+   if (!ptdesc)
return NULL;
 
-   memset((void *)page, 0, PAGE_SIZE);
-   return (pte_t *) (page);
+   return ptdesc_address(ptdesc);
 }
 
 extern inline pmd_t *pmd_alloc_kernel(pgd_t *pgd, unsigned long address)
@@ -35,36 +35,34 @@ extern inline pmd_t *pmd_alloc_kernel(pgd_t *pgd, unsigned 
long address)
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pgtable,
  unsigned long address)
 {
-   struct page *page = virt_to_page(pgtable);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pgtable);
 
-   pgtable_pte_page_dtor(page);
-   __free_page(page);
+   pagetable_pte_dtor(ptdesc);
+   pagetable_free(ptdesc);
 }
 
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
-   struct page *page = alloc_pages(GFP_DMA, 0);
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_DMA | __GFP_ZERO, 0);
pte_t *pte;
 
-   if (!page)
+   if (!ptdesc)
return NULL;
-   if (!pgtable_pte_page_ctor(page)) {
-   __free_page(page);
+   if (!pagetable_pte_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
return NULL;
}
 
-   pte = page_address(page);
-   clear_page(pte);
-
+   pte = ptdesc_address(ptdesc);
return pte;
 }
 
 static inline void pte_free(struct mm_struct *mm, pgtable_t pgtable)
 {
-   struct page *page = virt_to_page(pgtable);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pgtable);
 
-   pgtable_pte_page_dtor(page);
-   __free_page(page);
+   pagetable_pte_dtor(ptdesc);
+   pagetable_free(ptdesc);
 }
 
 /*
@@ -75,16 +73,19 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t 
pgtable)
 
 static inline void pgd_free(struct mm_struct *mm, pgd_t *pgd)
 {
-   free_page((unsigned long) pgd);
+   pagetable_free(virt_to_ptdesc(pgd));
 }
 
 static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 {
pgd_t *new_pgd;
+   struct ptdesc *ptdesc = pagetable_alloc((GFP_DMA | __GFP_NOWARN) &
+   ~__GFP_HIGHMEM, 0);
 
-   new_pgd = (pgd_t *)__get_free_page(GFP_DMA | __GFP_NOWARN);
-   if (!new_pgd)
+   if (!ptdesc)
return NULL;
+   new_pgd = ptdesc_address(ptdesc);
+
memcpy(new_pgd, swapper_pg_dir, PTRS_PER_PGD * sizeof(pgd_t));
memset(new_pgd, 0, PAGE_OFFSET >> PGDIR_SHIFT);
return new_pgd;
diff --git a/arch/m68k/include/asm/sun3_pgalloc.h 
b/arch/m68k/include/asm/sun3_pgalloc.h
index 198036aff519..ff48573db2c0 100644
--- a/arch/m68k/include/asm/sun3_pgalloc.h
+++ b/arch/m68k/include/asm/sun3_pgalloc.h
@@ -17,10 +17,10 @@
 
 extern const char bad_pmd_string[];
 
-#define __pte_free_tlb(tlb,pte,addr)   \
-do {   \
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb), pte);\
+#define __pte_free_tlb(tlb, pte, addr) \
+do {   \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), page_ptdesc(pte));\
 } while (0)
 
 static inline void pmd_populate_kernel(struct mm_struct *mm, pmd_t *pmd, pte_t 
*pte)
diff --git a/arch/m68k/mm/motorola.c b/arch/m68k/mm/motorola.c
index c75984e2d86b..594575a0780c 100644
--- a/arch/m68k/mm/motorola.c
+++ b/arch/m68k/mm/motorola.c
@@ -161,7 +161,7 @@ 

[PATCH v6 23/33] loongarch: Convert various functions to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/loongarch/include/asm/pgalloc.h | 27 +++
 arch/loongarch/mm/pgtable.c  |  7 ---
 2 files changed, 19 insertions(+), 15 deletions(-)

diff --git a/arch/loongarch/include/asm/pgalloc.h 
b/arch/loongarch/include/asm/pgalloc.h
index af1d1e4a6965..23f5b1107246 100644
--- a/arch/loongarch/include/asm/pgalloc.h
+++ b/arch/loongarch/include/asm/pgalloc.h
@@ -45,9 +45,9 @@ extern void pagetable_init(void);
 extern pgd_t *pgd_alloc(struct mm_struct *mm);
 
 #define __pte_free_tlb(tlb, pte, address)  \
-do {   \
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page((tlb), pte);\
+do {   \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc((tlb), page_ptdesc(pte));\
 } while (0)
 
 #ifndef __PAGETABLE_PMD_FOLDED
@@ -55,18 +55,18 @@ do {
\
 static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, unsigned long address)
 {
pmd_t *pmd;
-   struct page *pg;
+   struct ptdesc *ptdesc;
 
-   pg = alloc_page(GFP_KERNEL_ACCOUNT);
-   if (!pg)
+   ptdesc = pagetable_alloc(GFP_KERNEL_ACCOUNT, 0);
+   if (!ptdesc)
return NULL;
 
-   if (!pgtable_pmd_page_ctor(pg)) {
-   __free_page(pg);
+   if (!pagetable_pmd_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
return NULL;
}
 
-   pmd = (pmd_t *)page_address(pg);
+   pmd = ptdesc_address(ptdesc);
pmd_init(pmd);
return pmd;
 }
@@ -80,10 +80,13 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, 
unsigned long address)
 static inline pud_t *pud_alloc_one(struct mm_struct *mm, unsigned long address)
 {
pud_t *pud;
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
 
-   pud = (pud_t *) __get_free_page(GFP_KERNEL);
-   if (pud)
-   pud_init(pud);
+   if (!ptdesc)
+   return NULL;
+   pud = ptdesc_address(ptdesc);
+
+   pud_init(pud);
return pud;
 }
 
diff --git a/arch/loongarch/mm/pgtable.c b/arch/loongarch/mm/pgtable.c
index 36a6dc0148ae..5bd102b51f7c 100644
--- a/arch/loongarch/mm/pgtable.c
+++ b/arch/loongarch/mm/pgtable.c
@@ -11,10 +11,11 @@
 
 pgd_t *pgd_alloc(struct mm_struct *mm)
 {
-   pgd_t *ret, *init;
+   pgd_t *init, *ret = NULL;
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL & ~__GFP_HIGHMEM, 0);
 
-   ret = (pgd_t *) __get_free_page(GFP_KERNEL);
-   if (ret) {
+   if (ptdesc) {
+   ret = (pgd_t *)ptdesc_address(ptdesc);
init = pgd_offset(&init_mm, 0UL);
pgd_init(ret);
memcpy(ret + USER_PTRS_PER_PGD, init + USER_PTRS_PER_PGD,
-- 
2.40.1



[PATCH v6 22/33] hexagon: Convert __pte_free_tlb() to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
Part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/hexagon/include/asm/pgalloc.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/arch/hexagon/include/asm/pgalloc.h 
b/arch/hexagon/include/asm/pgalloc.h
index f0c47e6a7427..55988625e6fb 100644
--- a/arch/hexagon/include/asm/pgalloc.h
+++ b/arch/hexagon/include/asm/pgalloc.h
@@ -87,10 +87,10 @@ static inline void pmd_populate_kernel(struct mm_struct 
*mm, pmd_t *pmd,
max_kernel_seg = pmdindex;
 }
 
-#define __pte_free_tlb(tlb, pte, addr) \
-do {   \
-   pgtable_pte_page_dtor((pte));   \
-   tlb_remove_page((tlb), (pte));  \
+#define __pte_free_tlb(tlb, pte, addr) \
+do {   \
+   pagetable_pte_dtor((page_ptdesc(pte))); \
+   tlb_remove_page_ptdesc((tlb), (page_ptdesc(pte)));  \
 } while (0)
 
 #endif
-- 
2.40.1



[PATCH v6 21/33] csky: Convert __pte_free_tlb() to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
Part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Guo Ren 
Acked-by: Mike Rapoport (IBM) 
---
 arch/csky/include/asm/pgalloc.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/csky/include/asm/pgalloc.h b/arch/csky/include/asm/pgalloc.h
index 7d57e5da0914..9c84c9012e53 100644
--- a/arch/csky/include/asm/pgalloc.h
+++ b/arch/csky/include/asm/pgalloc.h
@@ -63,8 +63,8 @@ static inline pgd_t *pgd_alloc(struct mm_struct *mm)
 
 #define __pte_free_tlb(tlb, pte, address)  \
 do {   \
-   pgtable_pte_page_dtor(pte); \
-   tlb_remove_page(tlb, pte);  \
+   pagetable_pte_dtor(page_ptdesc(pte));   \
+   tlb_remove_page_ptdesc(tlb, page_ptdesc(pte));  \
 } while (0)
 
 extern void pagetable_init(void);
-- 
2.40.1



[PATCH v6 20/33] arm64: Convert various functions to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.
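
The tlb_remove_ptdesc()/tlb_remove_page_ptdesc() calls used in this and the
other architecture patches are thin forwarders added earlier in the series, so
nothing changes in what is actually queued on the mmu_gather. Roughly (sketch
of the asm-generic helpers this series introduces):

	static inline void tlb_remove_ptdesc(struct mmu_gather *tlb, void *pt)
	{
		tlb_remove_table(tlb, pt);
	}

	/* Like tlb_remove_ptdesc(), but for page-backed page tables. */
	static inline void tlb_remove_page_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt)
	{
		tlb_remove_page(tlb, ptdesc_page(pt));
	}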

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
Acked-by: Catalin Marinas 
---
 arch/arm64/include/asm/tlb.h | 14 --
 arch/arm64/mm/mmu.c  |  7 ---
 2 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/arm64/include/asm/tlb.h b/arch/arm64/include/asm/tlb.h
index c995d1f4594f..2c29239d05c3 100644
--- a/arch/arm64/include/asm/tlb.h
+++ b/arch/arm64/include/asm/tlb.h
@@ -75,18 +75,20 @@ static inline void tlb_flush(struct mmu_gather *tlb)
 static inline void __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte,
  unsigned long addr)
 {
-   pgtable_pte_page_dtor(pte);
-   tlb_remove_table(tlb, pte);
+   struct ptdesc *ptdesc = page_ptdesc(pte);
+
+   pagetable_pte_dtor(ptdesc);
+   tlb_remove_ptdesc(tlb, ptdesc);
 }
 
 #if CONFIG_PGTABLE_LEVELS > 2
 static inline void __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp,
  unsigned long addr)
 {
-   struct page *page = virt_to_page(pmdp);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pmdp);
 
-   pgtable_pmd_page_dtor(page);
-   tlb_remove_table(tlb, page);
+   pagetable_pmd_dtor(ptdesc);
+   tlb_remove_ptdesc(tlb, ptdesc);
 }
 #endif
 
@@ -94,7 +96,7 @@ static inline void __pmd_free_tlb(struct mmu_gather *tlb, 
pmd_t *pmdp,
 static inline void __pud_free_tlb(struct mmu_gather *tlb, pud_t *pudp,
  unsigned long addr)
 {
-   tlb_remove_table(tlb, virt_to_page(pudp));
+   tlb_remove_ptdesc(tlb, virt_to_ptdesc(pudp));
 }
 #endif
 
diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index 95d360805f8a..47781bec6171 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -426,6 +426,7 @@ static phys_addr_t __pgd_pgtable_alloc(int shift)
 static phys_addr_t pgd_pgtable_alloc(int shift)
 {
phys_addr_t pa = __pgd_pgtable_alloc(shift);
+   struct ptdesc *ptdesc = page_ptdesc(phys_to_page(pa));
 
/*
 * Call proper page table ctor in case later we need to
@@ -433,12 +434,12 @@ static phys_addr_t pgd_pgtable_alloc(int shift)
 * this pre-allocated page table.
 *
 * We don't select ARCH_ENABLE_SPLIT_PMD_PTLOCK if pmd is
-* folded, and if so pgtable_pmd_page_ctor() becomes nop.
+* folded, and if so pagetable_pte_ctor() becomes nop.
 */
if (shift == PAGE_SHIFT)
-   BUG_ON(!pgtable_pte_page_ctor(phys_to_page(pa)));
+   BUG_ON(!pagetable_pte_ctor(ptdesc));
else if (shift == PMD_SHIFT)
-   BUG_ON(!pgtable_pmd_page_ctor(phys_to_page(pa)));
+   BUG_ON(!pagetable_pmd_ctor(ptdesc));
 
return pa;
 }
-- 
2.40.1



[PATCH v6 19/33] arm: Convert various functions to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

late_alloc() also uses the __get_free_pages() helper function. Convert
this to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/arm/include/asm/tlb.h | 12 +++-
 arch/arm/mm/mmu.c  |  7 ---
 2 files changed, 11 insertions(+), 8 deletions(-)

diff --git a/arch/arm/include/asm/tlb.h b/arch/arm/include/asm/tlb.h
index b8cbe03ad260..f40d06ad5d2a 100644
--- a/arch/arm/include/asm/tlb.h
+++ b/arch/arm/include/asm/tlb.h
@@ -39,7 +39,9 @@ static inline void __tlb_remove_table(void *_table)
 static inline void
 __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte, unsigned long addr)
 {
-   pgtable_pte_page_dtor(pte);
+   struct ptdesc *ptdesc = page_ptdesc(pte);
+
+   pagetable_pte_dtor(ptdesc);
 
 #ifndef CONFIG_ARM_LPAE
/*
@@ -50,17 +52,17 @@ __pte_free_tlb(struct mmu_gather *tlb, pgtable_t pte, 
unsigned long addr)
__tlb_adjust_range(tlb, addr - PAGE_SIZE, 2 * PAGE_SIZE);
 #endif
 
-   tlb_remove_table(tlb, pte);
+   tlb_remove_ptdesc(tlb, ptdesc);
 }
 
 static inline void
 __pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmdp, unsigned long addr)
 {
 #ifdef CONFIG_ARM_LPAE
-   struct page *page = virt_to_page(pmdp);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pmdp);
 
-   pgtable_pmd_page_dtor(page);
-   tlb_remove_table(tlb, page);
+   pagetable_pmd_dtor(ptdesc);
+   tlb_remove_ptdesc(tlb, ptdesc);
 #endif
 }
 
diff --git a/arch/arm/mm/mmu.c b/arch/arm/mm/mmu.c
index 13fc4bb5f792..fdeaee30d167 100644
--- a/arch/arm/mm/mmu.c
+++ b/arch/arm/mm/mmu.c
@@ -737,11 +737,12 @@ static void __init *early_alloc(unsigned long sz)
 
 static void *__init late_alloc(unsigned long sz)
 {
-   void *ptr = (void *)__get_free_pages(GFP_PGTABLE_KERNEL, get_order(sz));
+   void *ptdesc = pagetable_alloc(GFP_PGTABLE_KERNEL & ~__GFP_HIGHMEM,
+   get_order(sz));
 
-   if (!ptr || !pgtable_pte_page_ctor(virt_to_page(ptr)))
+   if (!ptdesc || !pagetable_pte_ctor(ptdesc))
BUG();
-   return ptr;
+   return ptdesc_to_virt(ptdesc);
 }
 
 static pte_t * __init arm_pte_alloc(pmd_t *pmd, unsigned long addr,
-- 
2.40.1



[PATCH v6 18/33] pgalloc: Convert various functions to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.
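
pagetable_alloc() and pagetable_free(), added earlier in the series, are the
ptdesc counterparts of alloc_pages()/__free_pages(); ptdesc_address() then
plays the role page_address() played for the old *get*page*() callers.
Roughly (a sketch, not the verbatim definitions):

	static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
	{
		struct page *page = alloc_pages(gfp | __GFP_COMP, order);

		return page_ptdesc(page);
	}

	static inline void pagetable_free(struct ptdesc *pt)
	{
		struct page *page = ptdesc_page(pt);

		__free_pages(page, compound_order(page));
	}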

Signed-off-by: Vishal Moola (Oracle) 
---
 include/asm-generic/pgalloc.h | 88 +--
 1 file changed, 52 insertions(+), 36 deletions(-)

diff --git a/include/asm-generic/pgalloc.h b/include/asm-generic/pgalloc.h
index a7cf825befae..c75d4a753849 100644
--- a/include/asm-generic/pgalloc.h
+++ b/include/asm-generic/pgalloc.h
@@ -8,7 +8,7 @@
 #define GFP_PGTABLE_USER   (GFP_PGTABLE_KERNEL | __GFP_ACCOUNT)
 
 /**
- * __pte_alloc_one_kernel - allocate a page for PTE-level kernel page table
+ * __pte_alloc_one_kernel - allocate memory for a PTE-level kernel page table
  * @mm: the mm_struct of the current context
  *
  * This function is intended for architectures that need
@@ -18,12 +18,17 @@
  */
 static inline pte_t *__pte_alloc_one_kernel(struct mm_struct *mm)
 {
-   return (pte_t *)__get_free_page(GFP_PGTABLE_KERNEL);
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_PGTABLE_KERNEL &
+   ~__GFP_HIGHMEM, 0);
+
+   if (!ptdesc)
+   return NULL;
+   return ptdesc_address(ptdesc);
 }
 
 #ifndef __HAVE_ARCH_PTE_ALLOC_ONE_KERNEL
 /**
- * pte_alloc_one_kernel - allocate a page for PTE-level kernel page table
+ * pte_alloc_one_kernel - allocate memory for a PTE-level kernel page table
  * @mm: the mm_struct of the current context
  *
  * Return: pointer to the allocated memory or %NULL on error
@@ -35,40 +40,40 @@ static inline pte_t *pte_alloc_one_kernel(struct mm_struct 
*mm)
 #endif
 
 /**
- * pte_free_kernel - free PTE-level kernel page table page
+ * pte_free_kernel - free PTE-level kernel page table memory
  * @mm: the mm_struct of the current context
  * @pte: pointer to the memory containing the page table
  */
 static inline void pte_free_kernel(struct mm_struct *mm, pte_t *pte)
 {
-   free_page((unsigned long)pte);
+   pagetable_free(virt_to_ptdesc(pte));
 }
 
 /**
- * __pte_alloc_one - allocate a page for PTE-level user page table
+ * __pte_alloc_one - allocate memory for a PTE-level user page table
  * @mm: the mm_struct of the current context
  * @gfp: GFP flags to use for the allocation
  *
- * Allocates a page and runs the pgtable_pte_page_ctor().
+ * Allocate memory for a page table and ptdesc and runs pagetable_pte_ctor().
  *
  * This function is intended for architectures that need
  * anything beyond simple page allocation or must have custom GFP flags.
  *
- * Return: `struct page` initialized as page table or %NULL on error
+ * Return: `struct page` referencing the ptdesc or %NULL on error
  */
 static inline pgtable_t __pte_alloc_one(struct mm_struct *mm, gfp_t gfp)
 {
-   struct page *pte;
+   struct ptdesc *ptdesc;
 
-   pte = alloc_page(gfp);
-   if (!pte)
+   ptdesc = pagetable_alloc(gfp, 0);
+   if (!ptdesc)
return NULL;
-   if (!pgtable_pte_page_ctor(pte)) {
-   __free_page(pte);
+   if (!pagetable_pte_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
return NULL;
}
 
-   return pte;
+   return ptdesc_page(ptdesc);
 }
 
 #ifndef __HAVE_ARCH_PTE_ALLOC_ONE
@@ -76,9 +81,9 @@ static inline pgtable_t __pte_alloc_one(struct mm_struct *mm, 
gfp_t gfp)
  * pte_alloc_one - allocate a page for PTE-level user page table
  * @mm: the mm_struct of the current context
  *
- * Allocates a page and runs the pgtable_pte_page_ctor().
+ * Allocate memory for a page table and ptdesc and runs pagetable_pte_ctor().
  *
- * Return: `struct page` initialized as page table or %NULL on error
+ * Return: `struct page` referencing the ptdesc or %NULL on error
  */
 static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
 {
@@ -92,14 +97,16 @@ static inline pgtable_t pte_alloc_one(struct mm_struct *mm)
  */
 
 /**
- * pte_free - free PTE-level user page table page
+ * pte_free - free PTE-level user page table memory
  * @mm: the mm_struct of the current context
- * @pte_page: the `struct page` representing the page table
+ * @pte_page: the `struct page` referencing the ptdesc
  */
 static inline void pte_free(struct mm_struct *mm, struct page *pte_page)
 {
-   pgtable_pte_page_dtor(pte_page);
-   __free_page(pte_page);
+   struct ptdesc *ptdesc = page_ptdesc(pte_page);
+
+   pagetable_pte_dtor(ptdesc);
+   pagetable_free(ptdesc);
 }
 
 
@@ -107,10 +114,11 @@ static inline void pte_free(struct mm_struct *mm, struct 
page *pte_page)
 
 #ifndef __HAVE_ARCH_PMD_ALLOC_ONE
 /**
- * pmd_alloc_one - allocate a page for PMD-level page table
+ * pmd_alloc_one - allocate memory for a PMD-level page table
  * @mm: the mm_struct of the current context
  *
- * Allocates a page and runs the 

[PATCH v6 17/33] mm: Remove page table members from struct page

2023-06-26 Thread Vishal Moola (Oracle)
The page table members are now split out into their own ptdesc struct.
Remove them from struct page.
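
The TABLE_MATCH() assertions adjusted in the hunk below are what keep this
removal safe: every remaining ptdesc field must still sit at the same offset
as the struct page field it overlays. A sketch of the guard, assuming the
macro introduced when struct ptdesc was added:

	#define TABLE_MATCH(pg, pt)						\
		static_assert(offsetof(struct page, pg) == offsetof(struct ptdesc, pt))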

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 include/linux/mm_types.h | 14 --
 include/linux/pgtable.h  |  3 ---
 2 files changed, 17 deletions(-)

diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h
index fbbe4e93a9ba..434e54440686 100644
--- a/include/linux/mm_types.h
+++ b/include/linux/mm_types.h
@@ -141,20 +141,6 @@ struct page {
struct {/* Tail pages of compound page */
unsigned long compound_head;/* Bit zero is set */
};
-   struct {/* Page table pages */
-   unsigned long _pt_pad_1;/* compound_head */
-   pgtable_t pmd_huge_pte; /* protected by page->ptl */
-   unsigned long _pt_s390_gaddr;   /* mapping */
-   union {
-   struct mm_struct *pt_mm; /* x86 pgds only */
-   atomic_t pt_frag_refcount; /* powerpc */
-   };
-#if ALLOC_SPLIT_PTLOCKS
-   spinlock_t *ptl;
-#else
-   spinlock_t ptl;
-#endif
-   };
struct {/* ZONE_DEVICE pages */
/** @pgmap: Points to the hosting device page map. */
struct dev_pagemap *pgmap;
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e9bb5f18cade..daeacfe3930d 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1044,10 +1044,7 @@ struct ptdesc {
 TABLE_MATCH(flags, __page_flags);
 TABLE_MATCH(compound_head, pt_list);
 TABLE_MATCH(compound_head, _pt_pad_1);
-TABLE_MATCH(pmd_huge_pte, pmd_huge_pte);
 TABLE_MATCH(mapping, _pt_s390_gaddr);
-TABLE_MATCH(pt_mm, pt_mm);
-TABLE_MATCH(ptl, ptl);
 TABLE_MATCH(rcu_head, pt_rcu_head);
 #ifdef CONFIG_MEMCG
 TABLE_MATCH(memcg_data, pt_memcg_data);
-- 
2.40.1



[PATCH v6 16/33] s390: Convert various pgalloc functions to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
As part of the conversions to replace pgtable constructor/destructors with
ptdesc equivalents, convert various page table functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/s390/include/asm/pgalloc.h |   4 +-
 arch/s390/include/asm/tlb.h |   4 +-
 arch/s390/mm/pgalloc.c  | 108 
 3 files changed, 59 insertions(+), 57 deletions(-)

diff --git a/arch/s390/include/asm/pgalloc.h b/arch/s390/include/asm/pgalloc.h
index 17eb618f1348..00ad9b88fda9 100644
--- a/arch/s390/include/asm/pgalloc.h
+++ b/arch/s390/include/asm/pgalloc.h
@@ -86,7 +86,7 @@ static inline pmd_t *pmd_alloc_one(struct mm_struct *mm, 
unsigned long vmaddr)
if (!table)
return NULL;
crst_table_init(table, _SEGMENT_ENTRY_EMPTY);
-   if (!pgtable_pmd_page_ctor(virt_to_page(table))) {
+   if (!pagetable_pmd_ctor(virt_to_ptdesc(table))) {
crst_table_free(mm, table);
return NULL;
}
@@ -97,7 +97,7 @@ static inline void pmd_free(struct mm_struct *mm, pmd_t *pmd)
 {
if (mm_pmd_folded(mm))
return;
-   pgtable_pmd_page_dtor(virt_to_page(pmd));
+   pagetable_pmd_dtor(virt_to_ptdesc(pmd));
crst_table_free(mm, (unsigned long *) pmd);
 }
 
diff --git a/arch/s390/include/asm/tlb.h b/arch/s390/include/asm/tlb.h
index b91f4a9b044c..383b1f91442c 100644
--- a/arch/s390/include/asm/tlb.h
+++ b/arch/s390/include/asm/tlb.h
@@ -89,12 +89,12 @@ static inline void pmd_free_tlb(struct mmu_gather *tlb, 
pmd_t *pmd,
 {
if (mm_pmd_folded(tlb->mm))
return;
-   pgtable_pmd_page_dtor(virt_to_page(pmd));
+   pagetable_pmd_dtor(virt_to_ptdesc(pmd));
__tlb_adjust_range(tlb, address, PAGE_SIZE);
tlb->mm->context.flush_mm = 1;
tlb->freed_tables = 1;
tlb->cleared_puds = 1;
-   tlb_remove_table(tlb, pmd);
+   tlb_remove_ptdesc(tlb, pmd);
 }
 
 /*
diff --git a/arch/s390/mm/pgalloc.c b/arch/s390/mm/pgalloc.c
index 66ab68db9842..79b1c2458d85 100644
--- a/arch/s390/mm/pgalloc.c
+++ b/arch/s390/mm/pgalloc.c
@@ -43,17 +43,17 @@ __initcall(page_table_register_sysctl);
 
 unsigned long *crst_table_alloc(struct mm_struct *mm)
 {
-   struct page *page = alloc_pages(GFP_KERNEL, CRST_ALLOC_ORDER);
+   struct ptdesc *ptdesc = pagetable_alloc(GFP_KERNEL, CRST_ALLOC_ORDER);
 
-   if (!page)
+   if (!ptdesc)
return NULL;
-   arch_set_page_dat(page, CRST_ALLOC_ORDER);
-   return (unsigned long *) page_to_virt(page);
+   arch_set_page_dat(ptdesc_page(ptdesc), CRST_ALLOC_ORDER);
+   return (unsigned long *) ptdesc_to_virt(ptdesc);
 }
 
 void crst_table_free(struct mm_struct *mm, unsigned long *table)
 {
-   free_pages((unsigned long)table, CRST_ALLOC_ORDER);
+   pagetable_free(virt_to_ptdesc(table));
 }
 
 static void __crst_table_upgrade(void *arg)
@@ -140,21 +140,21 @@ static inline unsigned int atomic_xor_bits(atomic_t *v, 
unsigned int bits)
 
 struct page *page_table_alloc_pgste(struct mm_struct *mm)
 {
-   struct page *page;
+   struct ptdesc *ptdesc;
u64 *table;
 
-   page = alloc_page(GFP_KERNEL);
-   if (page) {
-   table = (u64 *)page_to_virt(page);
+   ptdesc = pagetable_alloc(GFP_KERNEL, 0);
+   if (ptdesc) {
+   table = (u64 *)ptdesc_to_virt(ptdesc);
memset64(table, _PAGE_INVALID, PTRS_PER_PTE);
memset64(table + PTRS_PER_PTE, 0, PTRS_PER_PTE);
}
-   return page;
+   return ptdesc_page(ptdesc);
 }
 
 void page_table_free_pgste(struct page *page)
 {
-   __free_page(page);
+   pagetable_free(page_ptdesc(page));
 }
 
 #endif /* CONFIG_PGSTE */
@@ -233,7 +233,7 @@ void page_table_free_pgste(struct page *page)
 unsigned long *page_table_alloc(struct mm_struct *mm)
 {
unsigned long *table;
-   struct page *page;
+   struct ptdesc *ptdesc;
unsigned int mask, bit;
 
/* Try to get a fragment of a 4K page as a 2K page table */
@@ -241,9 +241,9 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
table = NULL;
spin_lock_bh(&mm->context.lock);
if (!list_empty(&mm->context.pgtable_list)) {
-   page = list_first_entry(&mm->context.pgtable_list,
-   struct page, lru);
-   mask = atomic_read(&page->_refcount) >> 24;
+   ptdesc = list_first_entry(&mm->context.pgtable_list,
+   struct ptdesc, pt_list);
+   mask = atomic_read(&ptdesc->_refcount) >> 24;
/*
 * The pending removal bits must also be checked.

[PATCH v6 15/33] s390: Convert various gmap functions to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
In order to split struct ptdesc from struct page, convert various
functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Since we're now using pagetable_free(), set _pt_s390_gaddr (which
aliases with page->mapping) to NULL in that function instead.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/s390/mm/gmap.c | 217 +++-
 include/linux/mm.h  |   3 +
 2 files changed, 117 insertions(+), 103 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index beb4804d9ca8..8dbe0fdc0e44 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -34,7 +34,7 @@
 static struct gmap *gmap_alloc(unsigned long limit)
 {
struct gmap *gmap;
-   struct page *page;
+   struct ptdesc *ptdesc;
unsigned long *table;
unsigned long etype, atype;
 
@@ -67,12 +67,12 @@ static struct gmap *gmap_alloc(unsigned long limit)
spin_lock_init(&gmap->guest_table_lock);
spin_lock_init(&gmap->shadow_lock);
refcount_set(&gmap->ref_count, 1);
-   page = alloc_pages(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
-   if (!page)
+   ptdesc = pagetable_alloc(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
+   if (!ptdesc)
goto out_free;
-   page->_pt_s390_gaddr = 0;
-   list_add(&page->lru, &gmap->crst_list);
-   table = page_to_virt(page);
+   ptdesc->_pt_s390_gaddr = 0;
+   list_add(&ptdesc->pt_list, &gmap->crst_list);
+   table = ptdesc_to_virt(ptdesc);
crst_table_init(table, etype);
gmap->table = table;
gmap->asce = atype | _ASCE_TABLE_LENGTH |
@@ -181,25 +181,23 @@ static void gmap_rmap_radix_tree_free(struct 
radix_tree_root *root)
  */
 static void gmap_free(struct gmap *gmap)
 {
-   struct page *page, *next;
+   struct ptdesc *ptdesc, *next;
 
/* Flush tlb of all gmaps (if not already done for shadows) */
if (!(gmap_is_shadow(gmap) && gmap->removed))
gmap_flush_tlb(gmap);
/* Free all segment & region tables. */
-   list_for_each_entry_safe(page, next, &gmap->crst_list, lru) {
-   page->_pt_s390_gaddr = 0;
-   __free_pages(page, CRST_ALLOC_ORDER);
+   list_for_each_entry_safe(ptdesc, next, &gmap->crst_list, pt_list) {
+   pagetable_free(ptdesc);
}
gmap_radix_tree_free(&gmap->guest_to_host);
gmap_radix_tree_free(&gmap->host_to_guest);
 
/* Free additional data for a shadow gmap */
if (gmap_is_shadow(gmap)) {
-   /* Free all page tables. */
-   list_for_each_entry_safe(page, next, &gmap->pt_list, lru) {
-   page->_pt_s390_gaddr = 0;
-   page_table_free_pgste(page);
+   /* Free all ptdesc tables. */
+   list_for_each_entry_safe(ptdesc, next, &gmap->pt_list, pt_list) {
+   page_table_free_pgste(ptdesc_page(ptdesc));
}
gmap_rmap_radix_tree_free(&gmap->host_to_rmap);
/* Release reference to the parent */
@@ -308,28 +306,27 @@ EXPORT_SYMBOL_GPL(gmap_get_enabled);
 static int gmap_alloc_table(struct gmap *gmap, unsigned long *table,
unsigned long init, unsigned long gaddr)
 {
-   struct page *page;
+   struct ptdesc *ptdesc;
unsigned long *new;
 
/* since we dont free the gmap table until gmap_free we can unlock */
-   page = alloc_pages(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
-   if (!page)
+   ptdesc = pagetable_alloc(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
+   if (!ptdesc)
return -ENOMEM;
-   new = page_to_virt(page);
+   new = ptdesc_to_virt(ptdesc);
crst_table_init(new, init);
spin_lock(&gmap->guest_table_lock);
if (*table & _REGION_ENTRY_INVALID) {
-   list_add(&page->lru, &gmap->crst_list);
+   list_add(&ptdesc->pt_list, &gmap->crst_list);
*table = __pa(new) | _REGION_ENTRY_LENGTH |
(*table & _REGION_ENTRY_TYPE_MASK);
-   page->_pt_s390_gaddr = gaddr;
-   page = NULL;
+   ptdesc->_pt_s390_gaddr = gaddr;
+   ptdesc = NULL;
}
spin_unlock(&gmap->guest_table_lock);
-   if (page) {
-   page->_pt_s390_gaddr = 0;
-   __free_pages(page, CRST_ALLOC_ORDER);
-   }
+   if (ptdesc)
+   pagetable_free(ptdesc);
+
return 0;
 }
 
@@ -341,15 +338,15 @@ static int gmap_alloc_table(struct gmap *gmap, unsigned 
long *table,
  */
 static unsigned long __gmap_segment_gaddr(unsigned long *entry)
 {
-   struct page *page;
+   struct ptdesc *ptdesc;
unsigned long offset, mask;
 
offset = (unsigned long) entry / sizeof(unsigned long);
offset = (offset & (PTRS_PER_PMD - 1)) * PMD_SIZE;
mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
-

[PATCH v6 14/33] x86: Convert various functions to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
In order to split struct ptdesc from struct page, convert various
functions to use ptdescs.

Some of the functions use the *get*page*() helper functions. Convert
these to use pagetable_alloc() and ptdesc_address() instead to help
standardize page tables further.

Signed-off-by: Vishal Moola (Oracle) 
---
 arch/x86/mm/pgtable.c | 47 ++-
 1 file changed, 28 insertions(+), 19 deletions(-)

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 15a8009a4480..d3a93e8766ee 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -52,7 +52,7 @@ early_param("userpte", setup_userpte);
 
 void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
 {
-   pgtable_pte_page_dtor(pte);
+   pagetable_pte_dtor(page_ptdesc(pte));
paravirt_release_pte(page_to_pfn(pte));
paravirt_tlb_remove_table(tlb, pte);
 }
@@ -60,7 +60,7 @@ void ___pte_free_tlb(struct mmu_gather *tlb, struct page *pte)
 #if CONFIG_PGTABLE_LEVELS > 2
 void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
 {
-   struct page *page = virt_to_page(pmd);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pmd);
paravirt_release_pmd(__pa(pmd) >> PAGE_SHIFT);
/*
 * NOTE! For PAE, any changes to the top page-directory-pointer-table
@@ -69,8 +69,8 @@ void ___pmd_free_tlb(struct mmu_gather *tlb, pmd_t *pmd)
 #ifdef CONFIG_X86_PAE
tlb->need_flush_all = 1;
 #endif
-   pgtable_pmd_page_dtor(page);
-   paravirt_tlb_remove_table(tlb, page);
+   pagetable_pmd_dtor(ptdesc);
+   paravirt_tlb_remove_table(tlb, ptdesc_page(ptdesc));
 }
 
 #if CONFIG_PGTABLE_LEVELS > 3
@@ -92,16 +92,16 @@ void ___p4d_free_tlb(struct mmu_gather *tlb, p4d_t *p4d)
 
 static inline void pgd_list_add(pgd_t *pgd)
 {
-   struct page *page = virt_to_page(pgd);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pgd);
 
-   list_add(&page->lru, &pgd_list);
+   list_add(&ptdesc->pt_list, &pgd_list);
 }
 
 static inline void pgd_list_del(pgd_t *pgd)
 {
-   struct page *page = virt_to_page(pgd);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pgd);
 
-   list_del(&page->lru);
+   list_del(&ptdesc->pt_list);
 }
 
 #define UNSHARED_PTRS_PER_PGD  \
@@ -112,12 +112,12 @@ static inline void pgd_list_del(pgd_t *pgd)
 
 static void pgd_set_mm(pgd_t *pgd, struct mm_struct *mm)
 {
-   virt_to_page(pgd)->pt_mm = mm;
+   virt_to_ptdesc(pgd)->pt_mm = mm;
 }
 
 struct mm_struct *pgd_page_get_mm(struct page *page)
 {
-   return page->pt_mm;
+   return page_ptdesc(page)->pt_mm;
 }
 
 static void pgd_ctor(struct mm_struct *mm, pgd_t *pgd)
@@ -213,11 +213,14 @@ void pud_populate(struct mm_struct *mm, pud_t *pudp, 
pmd_t *pmd)
 static void free_pmds(struct mm_struct *mm, pmd_t *pmds[], int count)
 {
int i;
+   struct ptdesc *ptdesc;
 
for (i = 0; i < count; i++)
if (pmds[i]) {
-   pgtable_pmd_page_dtor(virt_to_page(pmds[i]));
-   free_page((unsigned long)pmds[i]);
+   ptdesc = virt_to_ptdesc(pmds[i]);
+
+   pagetable_pmd_dtor(ptdesc);
+   pagetable_free(ptdesc);
mm_dec_nr_pmds(mm);
}
 }
@@ -230,18 +233,24 @@ static int preallocate_pmds(struct mm_struct *mm, pmd_t 
*pmds[], int count)
 
if (mm == _mm)
gfp &= ~__GFP_ACCOUNT;
+   gfp &= ~__GFP_HIGHMEM;
 
for (i = 0; i < count; i++) {
-   pmd_t *pmd = (pmd_t *)__get_free_page(gfp);
-   if (!pmd)
+   pmd_t *pmd = NULL;
+   struct ptdesc *ptdesc = pagetable_alloc(gfp, 0);
+
+   if (!ptdesc)
failed = true;
-   if (pmd && !pgtable_pmd_page_ctor(virt_to_page(pmd))) {
-   free_page((unsigned long)pmd);
-   pmd = NULL;
+   if (ptdesc && !pagetable_pmd_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
+   ptdesc = NULL;
failed = true;
}
-   if (pmd)
+   if (ptdesc) {
mm_inc_nr_pmds(mm);
+   pmd = ptdesc_address(ptdesc);
+   }
+
pmds[i] = pmd;
}
 
@@ -830,7 +839,7 @@ int pud_free_pmd_page(pud_t *pud, unsigned long addr)
 
free_page((unsigned long)pmd_sv);
 
-   pgtable_pmd_page_dtor(virt_to_page(pmd));
+   pagetable_pmd_dtor(virt_to_ptdesc(pmd));
free_page((unsigned long)pmd);
 
return 1;
-- 
2.40.1



[PATCH v6 13/33] powerpc: Convert various functions to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
In order to split struct ptdesc from struct page, convert various
functions to use ptdescs.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/powerpc/mm/book3s64/mmu_context.c | 10 +++---
 arch/powerpc/mm/book3s64/pgtable.c | 32 +-
 arch/powerpc/mm/pgtable-frag.c | 46 +-
 3 files changed, 44 insertions(+), 44 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/mmu_context.c 
b/arch/powerpc/mm/book3s64/mmu_context.c
index c766e4c26e42..1715b07c630c 100644
--- a/arch/powerpc/mm/book3s64/mmu_context.c
+++ b/arch/powerpc/mm/book3s64/mmu_context.c
@@ -246,15 +246,15 @@ static void destroy_contexts(mm_context_t *ctx)
 static void pmd_frag_destroy(void *pmd_frag)
 {
int count;
-   struct page *page;
+   struct ptdesc *ptdesc;
 
-   page = virt_to_page(pmd_frag);
+   ptdesc = virt_to_ptdesc(pmd_frag);
/* drop all the pending references */
count = ((unsigned long)pmd_frag & ~PAGE_MASK) >> PMD_FRAG_SIZE_SHIFT;
/* We allow PTE_FRAG_NR fragments from a PTE page */
-   if (atomic_sub_and_test(PMD_FRAG_NR - count, &page->pt_frag_refcount)) {
-   pgtable_pmd_page_dtor(page);
-   __free_page(page);
+   if (atomic_sub_and_test(PMD_FRAG_NR - count, &ptdesc->pt_frag_refcount)) {
+   pagetable_pmd_dtor(ptdesc);
+   pagetable_free(ptdesc);
}
 }
 
diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/arch/powerpc/mm/book3s64/pgtable.c
index 85c84e89e3ea..1212deeabe15 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -306,22 +306,22 @@ static pmd_t *get_pmd_from_cache(struct mm_struct *mm)
 static pmd_t *__alloc_for_pmdcache(struct mm_struct *mm)
 {
void *ret = NULL;
-   struct page *page;
+   struct ptdesc *ptdesc;
gfp_t gfp = GFP_KERNEL_ACCOUNT | __GFP_ZERO;
 
if (mm == &init_mm)
gfp &= ~__GFP_ACCOUNT;
-   page = alloc_page(gfp);
-   if (!page)
+   ptdesc = pagetable_alloc(gfp, 0);
+   if (!ptdesc)
return NULL;
-   if (!pgtable_pmd_page_ctor(page)) {
-   __free_pages(page, 0);
+   if (!pagetable_pmd_ctor(ptdesc)) {
+   pagetable_free(ptdesc);
return NULL;
}
 
-   atomic_set(&page->pt_frag_refcount, 1);
+   atomic_set(&ptdesc->pt_frag_refcount, 1);
 
-   ret = page_address(page);
+   ret = ptdesc_address(ptdesc);
/*
 * if we support only one fragment just return the
 * allocated page.
@@ -331,12 +331,12 @@ static pmd_t *__alloc_for_pmdcache(struct mm_struct *mm)
 
spin_lock(&mm->page_table_lock);
/*
-* If we find pgtable_page set, we return
+* If we find ptdesc_page set, we return
 * the allocated page with single fragment
 * count.
 */
if (likely(!mm->context.pmd_frag)) {
-   atomic_set(&page->pt_frag_refcount, PMD_FRAG_NR);
+   atomic_set(&ptdesc->pt_frag_refcount, PMD_FRAG_NR);
mm->context.pmd_frag = ret + PMD_FRAG_SIZE;
}
spin_unlock(&mm->page_table_lock);
@@ -357,15 +357,15 @@ pmd_t *pmd_fragment_alloc(struct mm_struct *mm, unsigned long vmaddr)
 
 void pmd_fragment_free(unsigned long *pmd)
 {
-   struct page *page = virt_to_page(pmd);
+   struct ptdesc *ptdesc = virt_to_ptdesc(pmd);
 
-   if (PageReserved(page))
-   return free_reserved_page(page);
+   if (pagetable_is_reserved(ptdesc))
+   return free_reserved_ptdesc(ptdesc);
 
-   BUG_ON(atomic_read(&page->pt_frag_refcount) <= 0);
-   if (atomic_dec_and_test(&page->pt_frag_refcount)) {
-   pgtable_pmd_page_dtor(page);
-   __free_page(page);
+   BUG_ON(atomic_read(&ptdesc->pt_frag_refcount) <= 0);
+   if (atomic_dec_and_test(&ptdesc->pt_frag_refcount)) {
+   pagetable_pmd_dtor(ptdesc);
+   pagetable_free(ptdesc);
}
 }
 
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..8961f1540209 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -18,15 +18,15 @@
 void pte_frag_destroy(void *pte_frag)
 {
int count;
-   struct page *page;
+   struct ptdesc *ptdesc;
 
-   page = virt_to_page(pte_frag);
+   ptdesc = virt_to_ptdesc(pte_frag);
/* drop all the pending references */
count = ((unsigned long)pte_frag & ~PAGE_MASK) >> PTE_FRAG_SIZE_SHIFT;
/* We allow PTE_FRAG_NR fragments from a PTE page */
-   if (atomic_sub_and_test(PTE_FRAG_NR - count, &page->pt_frag_refcount)) {
-   pgtable_pte_page_dtor(page);
-   __free_page(page);
+   if (atomic_sub_and_test(PTE_FRAG_NR - count, &ptdesc->pt_frag_refcount)) {
+   pagetable_pte_dtor(ptdesc);
+   pagetable_free(ptdesc);
}
 }
 
@@ -55,25 +55,25 @@ static pte_t *get_pte_from_cache(struct mm_struct *mm)
 static pte_t 

[PATCH v6 12/33] mm: Create ptdesc equivalents for pgtable_{pte,pmd}_page_{ctor,dtor}

2023-06-26 Thread Vishal Moola (Oracle)
Create pagetable_pte_ctor(), pagetable_pmd_ctor(), pagetable_pte_dtor(),
and pagetable_pmd_dtor() and make the original pgtable
constructor/destructors wrappers.
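
For illustration only (not part of this patch), a minimal sketch of how an
arch-level pte allocation path could use the new helpers, assuming pgtable_t
is struct page * (as on most architectures); the example_* names are made up:

	static pgtable_t example_pte_alloc_one(struct mm_struct *mm)
	{
		struct ptdesc *ptdesc = pagetable_alloc(GFP_PGTABLE_USER, 0);

		if (!ptdesc)
			return NULL;
		if (!pagetable_pte_ctor(ptdesc)) {	/* split ptlock + stats */
			pagetable_free(ptdesc);
			return NULL;
		}
		return ptdesc_page(ptdesc);
	}

	static void example_pte_free(struct mm_struct *mm, pgtable_t pte_page)
	{
		struct ptdesc *ptdesc = page_ptdesc(pte_page);

		pagetable_pte_dtor(ptdesc);	/* release ptlock + stats */
		pagetable_free(ptdesc);
	}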

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 include/linux/mm.h | 56 ++
 1 file changed, 42 insertions(+), 14 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 69e6d6696c44..356e79984cf9 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2859,20 +2859,34 @@ static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; }
 static inline void ptlock_free(struct ptdesc *ptdesc) {}
 #endif /* USE_SPLIT_PTE_PTLOCKS */
 
-static inline bool pgtable_pte_page_ctor(struct page *page)
+static inline bool pagetable_pte_ctor(struct ptdesc *ptdesc)
 {
-   if (!ptlock_init(page_ptdesc(page)))
+   struct folio *folio = ptdesc_folio(ptdesc);
+
+   if (!ptlock_init(ptdesc))
return false;
-   __SetPageTable(page);
-   inc_lruvec_page_state(page, NR_PAGETABLE);
+   __folio_set_pgtable(folio);
+   lruvec_stat_add_folio(folio, NR_PAGETABLE);
return true;
 }
 
+static inline bool pgtable_pte_page_ctor(struct page *page)
+{
+   return pagetable_pte_ctor(page_ptdesc(page));
+}
+
+static inline void pagetable_pte_dtor(struct ptdesc *ptdesc)
+{
+   struct folio *folio = ptdesc_folio(ptdesc);
+
+   ptlock_free(ptdesc);
+   __folio_clear_pgtable(folio);
+   lruvec_stat_sub_folio(folio, NR_PAGETABLE);
+}
+
 static inline void pgtable_pte_page_dtor(struct page *page)
 {
-   ptlock_free(page_ptdesc(page));
-   __ClearPageTable(page);
-   dec_lruvec_page_state(page, NR_PAGETABLE);
+   pagetable_pte_dtor(page_ptdesc(page));
 }
 
 pte_t *__pte_offset_map(pmd_t *pmd, unsigned long addr, pmd_t *pmdvalp);
@@ -2965,20 +2979,34 @@ static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
return ptl;
 }
 
-static inline bool pgtable_pmd_page_ctor(struct page *page)
+static inline bool pagetable_pmd_ctor(struct ptdesc *ptdesc)
 {
-   if (!pmd_ptlock_init(page_ptdesc(page)))
+   struct folio *folio = ptdesc_folio(ptdesc);
+
+   if (!pmd_ptlock_init(ptdesc))
return false;
-   __SetPageTable(page);
-   inc_lruvec_page_state(page, NR_PAGETABLE);
+   __folio_set_pgtable(folio);
+   lruvec_stat_add_folio(folio, NR_PAGETABLE);
return true;
 }
 
+static inline bool pgtable_pmd_page_ctor(struct page *page)
+{
+   return pagetable_pmd_ctor(page_ptdesc(page));
+}
+
+static inline void pagetable_pmd_dtor(struct ptdesc *ptdesc)
+{
+   struct folio *folio = ptdesc_folio(ptdesc);
+
+   pmd_ptlock_free(ptdesc);
+   __folio_clear_pgtable(folio);
+   lruvec_stat_sub_folio(folio, NR_PAGETABLE);
+}
+
 static inline void pgtable_pmd_page_dtor(struct page *page)
 {
-   pmd_ptlock_free(page_ptdesc(page));
-   __ClearPageTable(page);
-   dec_lruvec_page_state(page, NR_PAGETABLE);
+   pagetable_pmd_dtor(page_ptdesc(page));
 }
 
 /*
-- 
2.40.1



[PATCH v6 11/33] mm: Convert ptlock_free() to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 include/linux/mm.h | 10 +-
 mm/memory.c|  4 ++--
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0221675e4dc5..69e6d6696c44 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2799,7 +2799,7 @@ static inline void pagetable_free(struct ptdesc *pt)
 #if ALLOC_SPLIT_PTLOCKS
 void __init ptlock_cache_init(void);
 bool ptlock_alloc(struct ptdesc *ptdesc);
-extern void ptlock_free(struct page *page);
+void ptlock_free(struct ptdesc *ptdesc);
 
 static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc)
 {
@@ -2815,7 +2815,7 @@ static inline bool ptlock_alloc(struct ptdesc *ptdesc)
return true;
 }
 
-static inline void ptlock_free(struct page *page)
+static inline void ptlock_free(struct ptdesc *ptdesc)
 {
 }
 
@@ -2856,7 +2856,7 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
 }
 static inline void ptlock_cache_init(void) {}
 static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; }
-static inline void ptlock_free(struct page *page) {}
+static inline void ptlock_free(struct ptdesc *ptdesc) {}
 #endif /* USE_SPLIT_PTE_PTLOCKS */
 
 static inline bool pgtable_pte_page_ctor(struct page *page)
@@ -2870,7 +2870,7 @@ static inline bool pgtable_pte_page_ctor(struct page *page)
 
 static inline void pgtable_pte_page_dtor(struct page *page)
 {
-   ptlock_free(page);
+   ptlock_free(page_ptdesc(page));
__ClearPageTable(page);
dec_lruvec_page_state(page, NR_PAGETABLE);
 }
@@ -2939,7 +2939,7 @@ static inline void pmd_ptlock_free(struct ptdesc *ptdesc)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
VM_BUG_ON_PAGE(ptdesc->pmd_huge_pte, ptdesc_page(ptdesc));
 #endif
-   ptlock_free(ptdesc_page(ptdesc));
+   ptlock_free(ptdesc);
 }
 
 #define pmd_huge_pte(mm, pmd) (pmd_ptdesc(pmd)->pmd_huge_pte)
diff --git a/mm/memory.c b/mm/memory.c
index 2ff14f50c7b3..8743aef6095b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5931,8 +5931,8 @@ bool ptlock_alloc(struct ptdesc *ptdesc)
return true;
 }
 
-void ptlock_free(struct page *page)
+void ptlock_free(struct ptdesc *ptdesc)
 {
-   kmem_cache_free(page_ptl_cachep, page->ptl);
+   kmem_cache_free(page_ptl_cachep, ptdesc->ptl);
 }
 #endif
-- 
2.40.1



[PATCH v6 10/33] mm: Convert pmd_ptlock_free() to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 include/linux/mm.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4af424e4015a..0221675e4dc5 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2934,12 +2934,12 @@ static inline bool pmd_ptlock_init(struct ptdesc *ptdesc)
return ptlock_init(ptdesc);
 }
 
-static inline void pmd_ptlock_free(struct page *page)
+static inline void pmd_ptlock_free(struct ptdesc *ptdesc)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-   VM_BUG_ON_PAGE(page->pmd_huge_pte, page);
+   VM_BUG_ON_PAGE(ptdesc->pmd_huge_pte, ptdesc_page(ptdesc));
 #endif
-   ptlock_free(page);
+   ptlock_free(ptdesc_page(ptdesc));
 }
 
 #define pmd_huge_pte(mm, pmd) (pmd_ptdesc(pmd)->pmd_huge_pte)
@@ -2952,7 +2952,7 @@ static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
 }
 
 static inline bool pmd_ptlock_init(struct ptdesc *ptdesc) { return true; }
-static inline void pmd_ptlock_free(struct page *page) {}
+static inline void pmd_ptlock_free(struct ptdesc *ptdesc) {}
 
 #define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
 
@@ -2976,7 +2976,7 @@ static inline bool pgtable_pmd_page_ctor(struct page *page)
 
 static inline void pgtable_pmd_page_dtor(struct page *page)
 {
-   pmd_ptlock_free(page);
+   pmd_ptlock_free(page_ptdesc(page));
__ClearPageTable(page);
dec_lruvec_page_state(page, NR_PAGETABLE);
 }
-- 
2.40.1



[PATCH v6 09/33] mm: Convert ptlock_init() to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 include/linux/mm.h | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1c4c6a7b69b3..4af424e4015a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2830,7 +2830,7 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
 }
 
-static inline bool ptlock_init(struct page *page)
+static inline bool ptlock_init(struct ptdesc *ptdesc)
 {
/*
 * prep_new_page() initialize page->private (and therefore page->ptl)
@@ -2839,10 +2839,10 @@ static inline bool ptlock_init(struct page *page)
 * It can happen if arch try to use slab for page table allocation:
 * slab code uses page->slab_cache, which share storage with page->ptl.
 */
-   VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
-   if (!ptlock_alloc(page_ptdesc(page)))
+   VM_BUG_ON_PAGE(*(unsigned long *)&ptdesc->ptl, ptdesc_page(ptdesc));
+   if (!ptlock_alloc(ptdesc))
return false;
-   spin_lock_init(ptlock_ptr(page_ptdesc(page)));
+   spin_lock_init(ptlock_ptr(ptdesc));
return true;
 }
 
@@ -2855,13 +2855,13 @@ static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
return &mm->page_table_lock;
 }
 static inline void ptlock_cache_init(void) {}
-static inline bool ptlock_init(struct page *page) { return true; }
+static inline bool ptlock_init(struct ptdesc *ptdesc) { return true; }
 static inline void ptlock_free(struct page *page) {}
 #endif /* USE_SPLIT_PTE_PTLOCKS */
 
 static inline bool pgtable_pte_page_ctor(struct page *page)
 {
-   if (!ptlock_init(page))
+   if (!ptlock_init(page_ptdesc(page)))
return false;
__SetPageTable(page);
inc_lruvec_page_state(page, NR_PAGETABLE);
@@ -2931,7 +2931,7 @@ static inline bool pmd_ptlock_init(struct ptdesc *ptdesc)
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
ptdesc->pmd_huge_pte = NULL;
 #endif
-   return ptlock_init(ptdesc_page(ptdesc));
+   return ptlock_init(ptdesc);
 }
 
 static inline void pmd_ptlock_free(struct page *page)
-- 
2.40.1



[PATCH v6 08/33] mm: Convert pmd_ptlock_init() to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 include/linux/mm.h | 10 +-
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0b230d5d229a..1c4c6a7b69b3 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2926,12 +2926,12 @@ static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
return ptlock_ptr(pmd_ptdesc(pmd));
 }
 
-static inline bool pmd_ptlock_init(struct page *page)
+static inline bool pmd_ptlock_init(struct ptdesc *ptdesc)
 {
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
-   page->pmd_huge_pte = NULL;
+   ptdesc->pmd_huge_pte = NULL;
 #endif
-   return ptlock_init(page);
+   return ptlock_init(ptdesc_page(ptdesc));
 }
 
 static inline void pmd_ptlock_free(struct page *page)
@@ -2951,7 +2951,7 @@ static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
return &mm->page_table_lock;
 }
 
-static inline bool pmd_ptlock_init(struct page *page) { return true; }
+static inline bool pmd_ptlock_init(struct ptdesc *ptdesc) { return true; }
 static inline void pmd_ptlock_free(struct page *page) {}
 
 #define pmd_huge_pte(mm, pmd) ((mm)->pmd_huge_pte)
@@ -2967,7 +2967,7 @@ static inline spinlock_t *pmd_lock(struct mm_struct *mm, pmd_t *pmd)
 
 static inline bool pgtable_pmd_page_ctor(struct page *page)
 {
-   if (!pmd_ptlock_init(page))
+   if (!pmd_ptlock_init(page_ptdesc(page)))
return false;
__SetPageTable(page);
inc_lruvec_page_state(page, NR_PAGETABLE);
-- 
2.40.1



[PATCH v6 07/33] mm: Convert ptlock_ptr() to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/x86/xen/mmu_pv.c |  2 +-
 include/linux/mm.h| 14 +++---
 2 files changed, 8 insertions(+), 8 deletions(-)

diff --git a/arch/x86/xen/mmu_pv.c b/arch/x86/xen/mmu_pv.c
index e0a975165de7..8796ec310483 100644
--- a/arch/x86/xen/mmu_pv.c
+++ b/arch/x86/xen/mmu_pv.c
@@ -667,7 +667,7 @@ static spinlock_t *xen_pte_lock(struct page *page, struct mm_struct *mm)
spinlock_t *ptl = NULL;
 
 #if USE_SPLIT_PTE_PTLOCKS
-   ptl = ptlock_ptr(page);
+   ptl = ptlock_ptr(page_ptdesc(page));
spin_lock_nest_lock(ptl, &mm->page_table_lock);
 #endif
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 39b0a4661e44..0b230d5d229a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2801,9 +2801,9 @@ void __init ptlock_cache_init(void);
 bool ptlock_alloc(struct ptdesc *ptdesc);
 extern void ptlock_free(struct page *page);
 
-static inline spinlock_t *ptlock_ptr(struct page *page)
+static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc)
 {
-   return page->ptl;
+   return ptdesc->ptl;
 }
 #else /* ALLOC_SPLIT_PTLOCKS */
 static inline void ptlock_cache_init(void)
@@ -2819,15 +2819,15 @@ static inline void ptlock_free(struct page *page)
 {
 }
 
-static inline spinlock_t *ptlock_ptr(struct page *page)
+static inline spinlock_t *ptlock_ptr(struct ptdesc *ptdesc)
 {
-   return &page->ptl;
+   return &ptdesc->ptl;
 }
 #endif /* ALLOC_SPLIT_PTLOCKS */
 
 static inline spinlock_t *pte_lockptr(struct mm_struct *mm, pmd_t *pmd)
 {
-   return ptlock_ptr(pmd_page(*pmd));
+   return ptlock_ptr(page_ptdesc(pmd_page(*pmd)));
 }
 
 static inline bool ptlock_init(struct page *page)
@@ -2842,7 +2842,7 @@ static inline bool ptlock_init(struct page *page)
VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
if (!ptlock_alloc(page_ptdesc(page)))
return false;
-   spin_lock_init(ptlock_ptr(page));
+   spin_lock_init(ptlock_ptr(page_ptdesc(page)));
return true;
 }
 
@@ -2923,7 +2923,7 @@ static inline struct ptdesc *pmd_ptdesc(pmd_t *pmd)
 
 static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
 {
-   return ptlock_ptr(ptdesc_page(pmd_ptdesc(pmd)));
+   return ptlock_ptr(pmd_ptdesc(pmd));
 }
 
 static inline bool pmd_ptlock_init(struct page *page)
-- 
2.40.1



[PATCH v6 06/33] mm: Convert ptlock_alloc() to use ptdescs

2023-06-26 Thread Vishal Moola (Oracle)
This removes some direct accesses to struct page, working towards
splitting out struct ptdesc from struct page.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 include/linux/mm.h | 6 +++---
 mm/memory.c| 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 1511faf0263c..39b0a4661e44 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2798,7 +2798,7 @@ static inline void pagetable_free(struct ptdesc *pt)
 #if USE_SPLIT_PTE_PTLOCKS
 #if ALLOC_SPLIT_PTLOCKS
 void __init ptlock_cache_init(void);
-extern bool ptlock_alloc(struct page *page);
+bool ptlock_alloc(struct ptdesc *ptdesc);
 extern void ptlock_free(struct page *page);
 
 static inline spinlock_t *ptlock_ptr(struct page *page)
@@ -2810,7 +2810,7 @@ static inline void ptlock_cache_init(void)
 {
 }
 
-static inline bool ptlock_alloc(struct page *page)
+static inline bool ptlock_alloc(struct ptdesc *ptdesc)
 {
return true;
 }
@@ -2840,7 +2840,7 @@ static inline bool ptlock_init(struct page *page)
 * slab code uses page->slab_cache, which share storage with page->ptl.
 */
VM_BUG_ON_PAGE(*(unsigned long *)&page->ptl, page);
-   if (!ptlock_alloc(page))
+   if (!ptlock_alloc(page_ptdesc(page)))
return false;
spin_lock_init(ptlock_ptr(page));
return true;
diff --git a/mm/memory.c b/mm/memory.c
index 80faf3e76232..2ff14f50c7b3 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -5920,14 +5920,14 @@ void __init ptlock_cache_init(void)
SLAB_PANIC, NULL);
 }
 
-bool ptlock_alloc(struct page *page)
+bool ptlock_alloc(struct ptdesc *ptdesc)
 {
spinlock_t *ptl;
 
ptl = kmem_cache_alloc(page_ptl_cachep, GFP_KERNEL);
if (!ptl)
return false;
-   page->ptl = ptl;
+   ptdesc->ptl = ptl;
return true;
 }
 
-- 
2.40.1



[PATCH v6 05/33] mm: Convert pmd_pgtable_page() to pmd_ptdesc()

2023-06-26 Thread Vishal Moola (Oracle)
Converts pmd_pgtable_page() to pmd_ptdesc() and all its callers. This
removes some direct accesses to struct page, working towards splitting
out struct ptdesc from struct page.
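
For illustration only (not part of this patch), the arithmetic behind the new
helper, restated with the numbers filled in for 4K pages (PTRS_PER_PMD == 512,
sizeof(pmd_t) == 8); the example_* name is made up:

	static struct ptdesc *example_pmd_ptdesc(pmd_t *pmd)
	{
		/*
		 * mask == ~(512 * 8 - 1) == ~0xfff here, so any pmd_t *
		 * pointing into a PMD page rounds down to that page's base;
		 * the ptdesc of that page is what holds the split lock and
		 * pmd_huge_pte.
		 */
		unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);

		return virt_to_ptdesc((void *)((unsigned long)pmd & mask));
	}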

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 include/linux/mm.h | 8 
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 14d95d494958..1511faf0263c 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2915,15 +2915,15 @@ pte_t *pte_offset_map_nolock(struct mm_struct *mm, pmd_t *pmd,
 
 #if USE_SPLIT_PMD_PTLOCKS
 
-static inline struct page *pmd_pgtable_page(pmd_t *pmd)
+static inline struct ptdesc *pmd_ptdesc(pmd_t *pmd)
 {
unsigned long mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
-   return virt_to_page((void *)((unsigned long) pmd & mask));
+   return virt_to_ptdesc((void *)((unsigned long) pmd & mask));
 }
 
 static inline spinlock_t *pmd_lockptr(struct mm_struct *mm, pmd_t *pmd)
 {
-   return ptlock_ptr(pmd_pgtable_page(pmd));
+   return ptlock_ptr(ptdesc_page(pmd_ptdesc(pmd)));
 }
 
 static inline bool pmd_ptlock_init(struct page *page)
@@ -2942,7 +2942,7 @@ static inline void pmd_ptlock_free(struct page *page)
ptlock_free(page);
 }
 
-#define pmd_huge_pte(mm, pmd) (pmd_pgtable_page(pmd)->pmd_huge_pte)
+#define pmd_huge_pte(mm, pmd) (pmd_ptdesc(pmd)->pmd_huge_pte)
 
 #else
 
-- 
2.40.1



[PATCH v6 04/33] mm: add utility functions for ptdesc

2023-06-26 Thread Vishal Moola (Oracle)
Introduce utility functions setting the foundation for ptdescs. These
will also assist in the splitting out of ptdesc from struct page.

Functions that focus on the descriptor are prefixed with ptdesc_* while
functions that focus on the pagetable are prefixed with pagetable_*.

pagetable_alloc() is defined to allocate new ptdesc pages as compound
pages. This is to standardize ptdescs by allowing for one allocation
and one free function, in contrast to 2 allocation and 2 free functions.
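
For illustration only (not part of this patch), a minimal sketch of the
intended one-alloc/one-free call pattern; the example_* names are made up
and GFP_PGTABLE_KERNEL comes from include/asm-generic/pgalloc.h:

	static pte_t *example_pte_alloc_one_kernel(struct mm_struct *mm)
	{
		/* one allocation covers the table memory and its descriptor */
		struct ptdesc *ptdesc = pagetable_alloc(GFP_PGTABLE_KERNEL, 0);

		if (!ptdesc)
			return NULL;
		return ptdesc_address(ptdesc);
	}

	static void example_pte_free_kernel(struct mm_struct *mm, pte_t *pte)
	{
		/* one free; pagetable_free() handles the compound order */
		pagetable_free(virt_to_ptdesc(pte));
	}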

Signed-off-by: Vishal Moola (Oracle) 
---
 include/asm-generic/tlb.h | 11 
 include/linux/mm.h| 56 +++
 include/linux/pgtable.h   | 12 +
 3 files changed, 79 insertions(+)

diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index b46617207c93..6bade9e0e799 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -481,6 +481,17 @@ static inline void tlb_remove_page(struct mmu_gather *tlb, struct page *page)
return tlb_remove_page_size(tlb, page, PAGE_SIZE);
 }
 
+static inline void tlb_remove_ptdesc(struct mmu_gather *tlb, void *pt)
+{
+   tlb_remove_table(tlb, pt);
+}
+
+/* Like tlb_remove_ptdesc, but for page-like page directories. */
+static inline void tlb_remove_page_ptdesc(struct mmu_gather *tlb, struct ptdesc *pt)
+{
+   tlb_remove_page(tlb, ptdesc_page(pt));
+}
+
 static inline void tlb_change_page_size(struct mmu_gather *tlb,
 unsigned int page_size)
 {
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 0dad5f40ef96..14d95d494958 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2744,6 +2744,57 @@ static inline pmd_t *pmd_alloc(struct mm_struct *mm, pud_t *pud, unsigned long a
 }
 #endif /* CONFIG_MMU */
 
+static inline struct ptdesc *virt_to_ptdesc(const void *x)
+{
+   return page_ptdesc(virt_to_page(x));
+}
+
+static inline void *ptdesc_to_virt(const struct ptdesc *pt)
+{
+   return page_to_virt(ptdesc_page(pt));
+}
+
+static inline void *ptdesc_address(const struct ptdesc *pt)
+{
+   return folio_address(ptdesc_folio(pt));
+}
+
+static inline bool pagetable_is_reserved(struct ptdesc *pt)
+{
+   return folio_test_reserved(ptdesc_folio(pt));
+}
+
+/**
+ * pagetable_alloc - Allocate pagetables
+ * @gfp:        GFP flags
+ * @order:      desired pagetable order
+ *
+ * pagetable_alloc allocates memory for page tables as well as a page table
+ * descriptor to describe that memory.
+ *
+ * Return: The ptdesc describing the allocated page tables.
+ */
+static inline struct ptdesc *pagetable_alloc(gfp_t gfp, unsigned int order)
+{
+   struct page *page = alloc_pages(gfp | __GFP_COMP, order);
+
+   return page_ptdesc(page);
+}
+
+/**
+ * pagetable_free - Free pagetables
+ * @pt:         The page table descriptor
+ *
+ * pagetable_free frees the memory of all page tables described by a page
+ * table descriptor and the memory for the descriptor itself.
+ */
+static inline void pagetable_free(struct ptdesc *pt)
+{
+   struct page *page = ptdesc_page(pt);
+
+   __free_pages(page, compound_order(page));
+}
+
 #if USE_SPLIT_PTE_PTLOCKS
 #if ALLOC_SPLIT_PTLOCKS
 void __init ptlock_cache_init(void);
@@ -2981,6 +3032,11 @@ static inline void mark_page_reserved(struct page *page)
adjust_managed_page_count(page, -1);
 }
 
+static inline void free_reserved_ptdesc(struct ptdesc *pt)
+{
+   free_reserved_page(ptdesc_page(pt));
+}
+
 /*
  * Default method to free all the __init memory into the buddy system.
  * The freed pages will be poisoned with pattern "poison" if it's within
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index d46cb709ce08..e9bb5f18cade 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1055,6 +1055,18 @@ TABLE_MATCH(memcg_data, pt_memcg_data);
 #undef TABLE_MATCH
 static_assert(sizeof(struct ptdesc) <= sizeof(struct page));
 
+#define ptdesc_page(pt)        (_Generic((pt),                      \
+   const struct ptdesc *:  (const struct page *)(pt),  \
+   struct ptdesc *:        (struct page *)(pt)))
+
+#define ptdesc_folio(pt)       (_Generic((pt),                      \
+   const struct ptdesc *:  (const struct folio *)(pt), \
+   struct ptdesc *:        (struct folio *)(pt)))
+
+#define page_ptdesc(p)         (_Generic((p),                       \
+   const struct page *:    (const struct ptdesc *)(p), \
+   struct page *:          (struct ptdesc *)(p)))
+
 /*
  * No-op macros that just return the current protection value. Defined here
  * because these macros can be used even if CONFIG_MMU is not defined.
-- 
2.40.1



[PATCH v6 03/33] pgtable: Create struct ptdesc

2023-06-26 Thread Vishal Moola (Oracle)
Currently, page table information is stored within struct page. As part
of simplifying struct page, create struct ptdesc for page table
information.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 include/linux/pgtable.h | 68 +
 1 file changed, 68 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 5063b482e34f..d46cb709ce08 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -987,6 +987,74 @@ static inline void ptep_modify_prot_commit(struct vm_area_struct *vma,
 #endif /* __HAVE_ARCH_PTEP_MODIFY_PROT_TRANSACTION */
 #endif /* CONFIG_MMU */
 
+
+/**
+ * struct ptdesc -    Memory descriptor for page tables.
+ * @__page_flags:     Same as page flags. Unused for page tables.
+ * @pt_rcu_head:      For freeing page table pages.
+ * @pt_list:          List of used page tables. Used for s390 and x86.
+ * @_pt_pad_1:        Padding that aliases with page's compound head.
+ * @pmd_huge_pte:     Protected by ptdesc->ptl, used for THPs.
+ * @_pt_s390_gaddr:   Aliases with page's mapping. Used for s390 gmap only.
+ * @pt_mm:            Used for x86 pgds.
+ * @pt_frag_refcount: For fragmented page table tracking. Powerpc and s390 only.
+ * @ptl:              Lock for the page table.
+ * @__page_type:      Same as page->page_type. Unused for page tables.
+ * @_refcount:        Same as page refcount. Used for s390 page tables.
+ * @pt_memcg_data:    Memcg data. Tracked for page tables here.
+ *
+ * This struct overlays struct page for now. Do not modify without a good
+ * understanding of the issues.
+ */
+struct ptdesc {
+   unsigned long __page_flags;
+
+   union {
+   struct rcu_head pt_rcu_head;
+   struct list_head pt_list;
+   struct {
+   unsigned long _pt_pad_1;
+   pgtable_t pmd_huge_pte;
+   };
+   };
+   unsigned long _pt_s390_gaddr;
+
+   union {
+   struct mm_struct *pt_mm;
+   atomic_t pt_frag_refcount;
+   };
+
+   union {
+   unsigned long _pt_pad_2;
+#if ALLOC_SPLIT_PTLOCKS
+   spinlock_t *ptl;
+#else
+   spinlock_t ptl;
+#endif
+   };
+   unsigned int __page_type;
+   atomic_t _refcount;
+#ifdef CONFIG_MEMCG
+   unsigned long pt_memcg_data;
+#endif
+};
+
+#define TABLE_MATCH(pg, pt)    \
+   static_assert(offsetof(struct page, pg) == offsetof(struct ptdesc, pt))
+TABLE_MATCH(flags, __page_flags);
+TABLE_MATCH(compound_head, pt_list);
+TABLE_MATCH(compound_head, _pt_pad_1);
+TABLE_MATCH(pmd_huge_pte, pmd_huge_pte);
+TABLE_MATCH(mapping, _pt_s390_gaddr);
+TABLE_MATCH(pt_mm, pt_mm);
+TABLE_MATCH(ptl, ptl);
+TABLE_MATCH(rcu_head, pt_rcu_head);
+#ifdef CONFIG_MEMCG
+TABLE_MATCH(memcg_data, pt_memcg_data);
+#endif
+#undef TABLE_MATCH
+static_assert(sizeof(struct ptdesc) <= sizeof(struct page));
+
 /*
  * No-op macros that just return the current protection value. Defined here
  * because these macros can be used even if CONFIG_MMU is not defined.
-- 
2.40.1



[PATCH v6 01/33] mm: Add PAGE_TYPE_OP folio functions

2023-06-26 Thread Vishal Moola (Oracle)
No folio equivalents for page type operations have been defined, so
define them for later folio conversions.

Also changes the Page##uname macros to take in const struct page* since
we only read the memory here.
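
For illustration only (not part of this patch), the folio variants generated
by PAGE_TYPE_OPS(Table, table, pgtable) mirror the existing page helpers; the
example_* names are made up:

	static bool example_folio_is_pgtable(const struct folio *folio)
	{
		return folio_test_pgtable(folio);	/* folio form of PageTable() */
	}

	static void example_mark_folio_pgtable(struct folio *folio)
	{
		__folio_set_pgtable(folio);		/* folio form of __SetPageTable() */
	}

	static void example_clear_folio_pgtable(struct folio *folio)
	{
		__folio_clear_pgtable(folio);		/* folio form of __ClearPageTable() */
	}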

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 include/linux/page-flags.h | 30 +++---
 1 file changed, 23 insertions(+), 7 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 92a2063a0a23..9218028caf33 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -908,6 +908,8 @@ static inline bool is_page_hwpoison(struct page *page)
 
 #define PageType(page, flag)   \
((page->page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
+#define folio_test_type(folio, flag)   \
+   ((folio->page.page_type & (PAGE_TYPE_BASE | flag)) == PAGE_TYPE_BASE)
 
 static inline int page_type_has_type(unsigned int page_type)
 {
@@ -919,27 +921,41 @@ static inline int page_has_type(struct page *page)
return page_type_has_type(page->page_type);
 }
 
-#define PAGE_TYPE_OPS(uname, lname)\
-static __always_inline int Page##uname(struct page *page)  \
+#define PAGE_TYPE_OPS(uname, lname, fname) \
+static __always_inline int Page##uname(const struct page *page)    \
 {  \
return PageType(page, PG_##lname);  \
 }  \
+static __always_inline int folio_test_##fname(const struct folio *folio)\
+{  \
+   return folio_test_type(folio, PG_##lname);  \
+}  \
 static __always_inline void __SetPage##uname(struct page *page)    \
 {  \
VM_BUG_ON_PAGE(!PageType(page, 0), page);   \
page->page_type &= ~PG_##lname; \
 }  \
+static __always_inline void __folio_set_##fname(struct folio *folio)   \
+{  \
+   VM_BUG_ON_FOLIO(!folio_test_type(folio, 0), folio); \
+   folio->page.page_type &= ~PG_##lname;   \
+}  \
 static __always_inline void __ClearPage##uname(struct page *page)  \
 {  \
VM_BUG_ON_PAGE(!Page##uname(page), page);   \
page->page_type |= PG_##lname;  \
-}
+}  \
+static __always_inline void __folio_clear_##fname(struct folio *folio) \
+{  \
+   VM_BUG_ON_FOLIO(!folio_test_##fname(folio), folio); \
+   folio->page.page_type |= PG_##lname;\
+}  \
 
 /*
  * PageBuddy() indicates that the page is free and in the buddy system
  * (see mm/page_alloc.c).
  */
-PAGE_TYPE_OPS(Buddy, buddy)
+PAGE_TYPE_OPS(Buddy, buddy, buddy)
 
 /*
  * PageOffline() indicates that the page is logically offline although the
@@ -963,7 +979,7 @@ PAGE_TYPE_OPS(Buddy, buddy)
  * pages should check PageOffline() and synchronize with such drivers using
  * page_offline_freeze()/page_offline_thaw().
  */
-PAGE_TYPE_OPS(Offline, offline)
+PAGE_TYPE_OPS(Offline, offline, offline)
 
 extern void page_offline_freeze(void);
 extern void page_offline_thaw(void);
@@ -973,12 +989,12 @@ extern void page_offline_end(void);
 /*
  * Marks pages in use as page tables.
  */
-PAGE_TYPE_OPS(Table, table)
+PAGE_TYPE_OPS(Table, table, pgtable)
 
 /*
  * Marks guardpages used with debug_pagealloc.
  */
-PAGE_TYPE_OPS(Guard, guard)
+PAGE_TYPE_OPS(Guard, guard, guard)
 
 extern bool is_free_buddy_page(struct page *page);
 
-- 
2.40.1



[PATCH v6 02/33] s390: Use _pt_s390_gaddr for gmap address tracking

2023-06-26 Thread Vishal Moola (Oracle)
s390 uses page->index to keep track of page tables for the guest address
space. In an attempt to consolidate the usage of page fields in s390,
replace _pt_pad_2 with _pt_s390_gaddr to replace page->index in gmap.

Since page->_pt_s390_gaddr aliases with mapping, ensure its set to NULL
before freeing the pages as well.

This also reverts commit 7e25de77bc5ea ("s390/mm: use pmd_pgtable_page()
helper in __gmap_segment_gaddr()") which had s390 use
pmd_pgtable_page() to get a gmap page table, as pmd_pgtable_page()
should be used for more generic process page tables.

Signed-off-by: Vishal Moola (Oracle) 
Acked-by: Mike Rapoport (IBM) 
---
 arch/s390/mm/gmap.c  | 56 +++-
 include/linux/mm_types.h |  2 +-
 2 files changed, 39 insertions(+), 19 deletions(-)

diff --git a/arch/s390/mm/gmap.c b/arch/s390/mm/gmap.c
index f4b6fc746fce..beb4804d9ca8 100644
--- a/arch/s390/mm/gmap.c
+++ b/arch/s390/mm/gmap.c
@@ -70,7 +70,7 @@ static struct gmap *gmap_alloc(unsigned long limit)
page = alloc_pages(GFP_KERNEL_ACCOUNT, CRST_ALLOC_ORDER);
if (!page)
goto out_free;
-   page->index = 0;
+   page->_pt_s390_gaddr = 0;
list_add(&page->lru, &gmap->crst_list);
table = page_to_virt(page);
crst_table_init(table, etype);
@@ -187,16 +187,20 @@ static void gmap_free(struct gmap *gmap)
if (!(gmap_is_shadow(gmap) && gmap->removed))
gmap_flush_tlb(gmap);
/* Free all segment & region tables. */
-   list_for_each_entry_safe(page, next, &gmap->crst_list, lru)
+   list_for_each_entry_safe(page, next, &gmap->crst_list, lru) {
+   page->_pt_s390_gaddr = 0;
__free_pages(page, CRST_ALLOC_ORDER);
+   }
gmap_radix_tree_free(&gmap->guest_to_host);
gmap_radix_tree_free(&gmap->host_to_guest);
 
/* Free additional data for a shadow gmap */
if (gmap_is_shadow(gmap)) {
/* Free all page tables. */
-   list_for_each_entry_safe(page, next, &gmap->pt_list, lru)
+   list_for_each_entry_safe(page, next, &gmap->pt_list, lru) {
+   page->_pt_s390_gaddr = 0;
page_table_free_pgste(page);
+   }
gmap_rmap_radix_tree_free(&gmap->host_to_rmap);
/* Release reference to the parent */
gmap_put(gmap->parent);
@@ -318,12 +322,14 @@ static int gmap_alloc_table(struct gmap *gmap, unsigned long *table,
list_add(&page->lru, &gmap->crst_list);
*table = __pa(new) | _REGION_ENTRY_LENGTH |
(*table & _REGION_ENTRY_TYPE_MASK);
-   page->index = gaddr;
+   page->_pt_s390_gaddr = gaddr;
page = NULL;
}
spin_unlock(&gmap->guest_table_lock);
-   if (page)
+   if (page) {
+   page->_pt_s390_gaddr = 0;
__free_pages(page, CRST_ALLOC_ORDER);
+   }
return 0;
 }
 
@@ -336,12 +342,14 @@ static int gmap_alloc_table(struct gmap *gmap, unsigned long *table,
 static unsigned long __gmap_segment_gaddr(unsigned long *entry)
 {
struct page *page;
-   unsigned long offset;
+   unsigned long offset, mask;
 
offset = (unsigned long) entry / sizeof(unsigned long);
offset = (offset & (PTRS_PER_PMD - 1)) * PMD_SIZE;
-   page = pmd_pgtable_page((pmd_t *) entry);
-   return page->index + offset;
+   mask = ~(PTRS_PER_PMD * sizeof(pmd_t) - 1);
+   page = virt_to_page((void *)((unsigned long) entry & mask));
+
+   return page->_pt_s390_gaddr + offset;
 }
 
 /**
@@ -1351,6 +1359,7 @@ static void gmap_unshadow_pgt(struct gmap *sg, unsigned long raddr)
/* Free page table */
page = phys_to_page(pgt);
list_del(&page->lru);
+   page->_pt_s390_gaddr = 0;
page_table_free_pgste(page);
 }
 
@@ -1379,6 +1388,7 @@ static void __gmap_unshadow_sgt(struct gmap *sg, unsigned long raddr,
/* Free page table */
page = phys_to_page(pgt);
list_del(&page->lru);
+   page->_pt_s390_gaddr = 0;
page_table_free_pgste(page);
}
 }
@@ -1409,6 +1419,7 @@ static void gmap_unshadow_sgt(struct gmap *sg, unsigned long raddr)
/* Free segment table */
page = phys_to_page(sgt);
list_del(&page->lru);
+   page->_pt_s390_gaddr = 0;
__free_pages(page, CRST_ALLOC_ORDER);
 }
 
@@ -1437,6 +1448,7 @@ static void __gmap_unshadow_r3t(struct gmap *sg, unsigned long raddr,
/* Free segment table */
page = phys_to_page(sgt);
list_del(&page->lru);
+   page->_pt_s390_gaddr = 0;
__free_pages(page, CRST_ALLOC_ORDER);
}
 }
@@ -1467,6 +1479,7 @@ static void gmap_unshadow_r3t(struct gmap *sg, unsigned long raddr)
/* Free region 3 table */
page = phys_to_page(r3t);
list_del(&page->lru);
+   page->_pt_s390_gaddr = 0;
__free_pages(page, 

[PATCH v6 00/33] Split ptdesc from struct page

2023-06-26 Thread Vishal Moola (Oracle)
The MM subsystem is trying to shrink struct page. This patchset
introduces a memory descriptor for page table tracking - struct ptdesc.

This patchset introduces ptdesc, splits ptdesc from struct page, and
converts many callers of page table constructor/destructors to use ptdescs.

Ptdesc is a foundation to further standardize page tables, and eventually
allow for dynamic allocation of page tables independent of struct page.
However, the use of pages for page table tracking is quite deeply
ingrained and varied across architectures, so there is still a lot of
work to be done before that can happen.

This is rebased on next-20230626.

There is a minor conflict with patch 24 and the mm-unstable tree in
arch/m68k/mm/motorola.c - The end result of applying the patch should
be the same.

v6:
  Fix compiler warnings/errors

v5:
  More Acked-bys :)
  Cleanup some documentation wording and formatting
  Add pt_rcu_head to ptdesc
  Add memcg to ptdesc (and align it with struct page)
  Ensure all get_free_page() callers prohibit HIGHMEM for 32 bit support.
  Renamed folio_{set, clear}_table() to folio_{set, clear}_pgtable()
  Removed pagetable_clear() as it is not necessary right now
  Dropped s390 _refcount to _pt_frag_refcount conversion

Vishal Moola (Oracle) (33):
  mm: Add PAGE_TYPE_OP folio functions
  s390: Use _pt_s390_gaddr for gmap address tracking
  pgtable: Create struct ptdesc
  mm: add utility functions for ptdesc
  mm: Convert pmd_pgtable_page() to pmd_ptdesc()
  mm: Convert ptlock_alloc() to use ptdescs
  mm: Convert ptlock_ptr() to use ptdescs
  mm: Convert pmd_ptlock_init() to use ptdescs
  mm: Convert ptlock_init() to use ptdescs
  mm: Convert pmd_ptlock_free() to use ptdescs
  mm: Convert ptlock_free() to use ptdescs
  mm: Create ptdesc equivalents for pgtable_{pte,pmd}_page_{ctor,dtor}
  powerpc: Convert various functions to use ptdescs
  x86: Convert various functions to use ptdescs
  s390: Convert various gmap functions to use ptdescs
  s390: Convert various pgalloc functions to use ptdescs
  mm: Remove page table members from struct page
  pgalloc: Convert various functions to use ptdescs
  arm: Convert various functions to use ptdescs
  arm64: Convert various functions to use ptdescs
  csky: Convert __pte_free_tlb() to use ptdescs
  hexagon: Convert __pte_free_tlb() to use ptdescs
  loongarch: Convert various functions to use ptdescs
  m68k: Convert various functions to use ptdescs
  mips: Convert various functions to use ptdescs
  nios2: Convert __pte_free_tlb() to use ptdescs
  openrisc: Convert __pte_free_tlb() to use ptdescs
  riscv: Convert alloc_{pmd, pte}_late() to use ptdescs
  sh: Convert pte_free_tlb() to use ptdescs
  sparc64: Convert various functions to use ptdescs
  sparc: Convert pgtable_pte_page_{ctor, dtor}() to ptdesc equivalents
  um: Convert {pmd, pte}_free_tlb() to use ptdescs
  mm: Remove pgtable_{pmd, pte}_page_{ctor, dtor}() wrappers

 Documentation/mm/split_page_table_lock.rst|  12 +-
 .../zh_CN/mm/split_page_table_lock.rst|  14 +-
 arch/arm/include/asm/tlb.h|  12 +-
 arch/arm/mm/mmu.c |   7 +-
 arch/arm64/include/asm/tlb.h  |  14 +-
 arch/arm64/mm/mmu.c   |   7 +-
 arch/csky/include/asm/pgalloc.h   |   4 +-
 arch/hexagon/include/asm/pgalloc.h|   8 +-
 arch/loongarch/include/asm/pgalloc.h  |  27 ++-
 arch/loongarch/mm/pgtable.c   |   7 +-
 arch/m68k/include/asm/mcf_pgalloc.h   |  47 ++--
 arch/m68k/include/asm/sun3_pgalloc.h  |   8 +-
 arch/m68k/mm/motorola.c   |   4 +-
 arch/mips/include/asm/pgalloc.h   |  32 +--
 arch/mips/mm/pgtable.c|   8 +-
 arch/nios2/include/asm/pgalloc.h  |   8 +-
 arch/openrisc/include/asm/pgalloc.h   |   8 +-
 arch/powerpc/mm/book3s64/mmu_context.c|  10 +-
 arch/powerpc/mm/book3s64/pgtable.c|  32 +--
 arch/powerpc/mm/pgtable-frag.c|  46 ++--
 arch/riscv/include/asm/pgalloc.h  |   8 +-
 arch/riscv/mm/init.c  |  16 +-
 arch/s390/include/asm/pgalloc.h   |   4 +-
 arch/s390/include/asm/tlb.h   |   4 +-
 arch/s390/mm/gmap.c   | 207 ++
 arch/s390/mm/pgalloc.c| 108 -
 arch/sh/include/asm/pgalloc.h |   9 +-
 arch/sparc/mm/init_64.c   |  17 +-
 arch/sparc/mm/srmmu.c |   5 +-
 arch/um/include/asm/pgalloc.h |  18 +-
 arch/x86/mm/pgtable.c |  47 ++--
 arch/x86/xen/mmu_pv.c |   2 +-
 include/asm-generic/pgalloc.h |  88 +---
 include/asm-generic/tlb.h |  11 +
 include/linux/mm.h| 153 +
 include/linux/mm_types.h  |  14 --
 include

[powerpc:next] BUILD SUCCESS bfd8d989210cb6bb1c8e87b7c525831dceb91418

2023-06-26 Thread kernel test robot
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git 
next
branch HEAD: bfd8d989210cb6bb1c8e87b7c525831dceb91418  powerpc/iommu: Only 
build sPAPR access functions on pSeries

elapsed time: 7862m

configs tested: 425
configs skipped: 24

The following configs have been built successfully.
More configs may be tested in the coming days.

tested configs:
alphaalldefconfig   gcc  
alphaallyesconfig   gcc  
alpha   defconfig   gcc  
alpharandconfig-r011-20230622   gcc  
alpharandconfig-r013-20230626   gcc  
alpharandconfig-r014-20230621   gcc  
alpharandconfig-r022-20230622   gcc  
alpharandconfig-r026-20230622   gcc  
alpharandconfig-r031-20230626   gcc  
alpharandconfig-r035-20230621   gcc  
arc  allyesconfig   gcc  
arc  axs101_defconfig   gcc  
arc defconfig   gcc  
arc haps_hs_smp_defconfig   gcc  
arcnsimosci_defconfig   gcc  
arc nsimosci_hs_defconfig   gcc  
arc  randconfig-r003-20230621   gcc  
arc  randconfig-r004-20230622   gcc  
arc  randconfig-r021-20230621   gcc  
arc  randconfig-r024-20230621   gcc  
arc  randconfig-r024-20230622   gcc  
arc  randconfig-r043-20230621   gcc  
arc  randconfig-r043-20230622   gcc  
arc  randconfig-r043-20230626   gcc  
arc   tb10x_defconfig   gcc  
arm  alldefconfig   clang
arm  allmodconfig   gcc  
arm  allyesconfig   gcc  
arm   aspeed_g5_defconfig   gcc  
arm at91_dt_defconfig   gcc  
arm bcm2835_defconfig   clang
armclps711x_defconfig   gcc  
arm  collie_defconfig   clang
arm davinci_all_defconfig   clang
arm defconfig   gcc  
arm  ep93xx_defconfig   clang
arm  gemini_defconfig   gcc  
arm   imx_v4_v5_defconfig   clang
arm  integrator_defconfig   gcc  
arm  ixp4xx_defconfig   clang
armkeystone_defconfig   gcc  
armmmp2_defconfig   clang
armmulti_v7_defconfig   gcc  
armmvebu_v7_defconfig   gcc  
armneponset_defconfig   clang
arm   netwinder_defconfig   clang
arm   omap2plus_defconfig   gcc  
arm  pxa3xx_defconfig   gcc  
arm  pxa910_defconfig   gcc  
arm pxa_defconfig   gcc  
arm  randconfig-r012-20230622   clang
arm  randconfig-r016-20230621   gcc  
arm  randconfig-r016-20230622   clang
arm  randconfig-r022-20230622   clang
arm  randconfig-r025-20230622   clang
arm  randconfig-r034-20230621   clang
arm  randconfig-r046-20230621   gcc  
arm  randconfig-r046-20230622   clang
arm s5pv210_defconfig   clang
armshmobile_defconfig   gcc  
arm   spear13xx_defconfig   clang
armspear3xx_defconfig   clang
armspear6xx_defconfig   gcc  
arm   spitz_defconfig   clang
arm   stm32_defconfig   gcc  
arm   tegra_defconfig   gcc  
arm64allyesconfig   gcc  
arm64   defconfig   gcc  
arm64randconfig-r004-20230621   gcc  
arm64randconfig-r011-20230626   gcc  
arm64randconfig-r013-20230621   clang
arm64randconfig-r013-20230622   gcc  
arm64randconfig-r021-20230622   gcc  
arm64randconfig-r023-20230622   gcc  
csky alldefconfig   gcc  
cskydefconfig   gcc  
csky randconfig-r001-20230621   gcc  
csky randconfig-r004-20230626   gcc  
csky randconfig-r011-20230621   gcc  
csky randconfig-r012-20230621   gcc  
csky randconfig-r013-20230626   gcc  
csky randconfig-r014-20230626   gcc  
csky randconfig-r015-20230622   gcc  
csky randconfig-r016-20230626   gcc  
csky randconfig-r025-20230622   gcc  
csky

Re: [PATCH v3 01/13] kexec: consolidate kexec and crash options into kernel/Kconfig.kexec

2023-06-26 Thread Eric DeVolder




On 6/26/23 11:19, Russell King (Oracle) wrote:

On Mon, Jun 26, 2023 at 12:13:20PM -0400, Eric DeVolder wrote:

+config KEXEC
+   bool "Enable kexec system call"
+   default ARCH_DEFAULT_KEXEC
+   depends on ARCH_SUPPORTS_KEXEC
+   select KEXEC_CORE
+   help
+ kexec is a system call that implements the ability to shutdown your
+ current kernel, and to start another kernel. It is like a reboot
+ but it is independent of the system firmware. And like a reboot
+ you can start any kernel with it, not just Linux.
+
+ The name comes from the similarity to the exec system call.
+
+ It is an ongoing process to be certain the hardware in a machine
+ is properly shutdown, so do not be surprised if this code does not
+ initially work for you. As of this writing the exact hardware
+ interface is strongly in flux, so no good recommendation can be
+ made.


Is this last paragraph still true? Is the hardware interface still
"strongly in flux" ?


Russell,
In short, I don't know. Specifically with respect to the verbiage you point
out, it was present in most of the original Kconfig descriptions. Some archs
are probably in better shape than others, but overall I've always seen people
issue caution statements around kexec/kdump.
$0.02
eric


Re: [PATCH v3 01/13] kexec: consolidate kexec and crash options into kernel/Kconfig.kexec

2023-06-26 Thread Russell King (Oracle)
On Mon, Jun 26, 2023 at 12:13:20PM -0400, Eric DeVolder wrote:
> +config KEXEC
> + bool "Enable kexec system call"
> + default ARCH_DEFAULT_KEXEC
> + depends on ARCH_SUPPORTS_KEXEC
> + select KEXEC_CORE
> + help
> +   kexec is a system call that implements the ability to shutdown your
> +   current kernel, and to start another kernel. It is like a reboot
> +   but it is independent of the system firmware. And like a reboot
> +   you can start any kernel with it, not just Linux.
> +
> +   The name comes from the similarity to the exec system call.
> +
> +   It is an ongoing process to be certain the hardware in a machine
> +   is properly shutdown, so do not be surprised if this code does not
> +   initially work for you. As of this writing the exact hardware
> +   interface is strongly in flux, so no good recommendation can be
> +   made.

Is this last paragraph still true? Is the hardware interface still
"strongly in flux" ?

-- 
RMK's Patch system: https://www.armlinux.org.uk/developer/patches/
FTTP is here! 80Mbps down 10Mbps up. Decent connectivity at last!


[PATCH v3 13/13] sh/kexec: refactor for kernel/Kconfig.kexec

2023-06-26 Thread Eric DeVolder
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder 
Acked-by: John Paul Adrian Glaubitz 
---
 arch/sh/Kconfig | 46 --
 1 file changed, 8 insertions(+), 38 deletions(-)

diff --git a/arch/sh/Kconfig b/arch/sh/Kconfig
index 9652d367fc37..d52e0beed7e9 100644
--- a/arch/sh/Kconfig
+++ b/arch/sh/Kconfig
@@ -546,44 +546,14 @@ menu "Kernel features"
 
 source "kernel/Kconfig.hz"
 
-config KEXEC
-   bool "kexec system call (EXPERIMENTAL)"
-   depends on MMU
-   select KEXEC_CORE
-   help
- kexec is a system call that implements the ability to shutdown your
- current kernel, and to start another kernel.  It is like a reboot
- but it is independent of the system firmware.  And like a reboot
- you can start any kernel with it, not just Linux.
-
- The name comes from the similarity to the exec system call.
-
- It is an ongoing process to be certain the hardware in a machine
- is properly shutdown, so do not be surprised if this code does not
- initially work for you.  As of this writing the exact hardware
- interface is strongly in flux, so no good recommendation can be
- made.
-
-config CRASH_DUMP
-   bool "kernel crash dumps (EXPERIMENTAL)"
-   depends on BROKEN_ON_SMP
-   help
- Generate crash dump after being started by kexec.
- This should be normally only set in special crash dump kernels
- which are loaded in the main kernel with kexec-tools into
- a specially reserved region and then later executed after
- a crash by kdump/kexec. The crash dump kernel must be compiled
- to a memory address not used by the main kernel using
- PHYSICAL_START.
-
- For more details see Documentation/admin-guide/kdump/kdump.rst
-
-config KEXEC_JUMP
-   bool "kexec jump (EXPERIMENTAL)"
-   depends on KEXEC && HIBERNATION
-   help
- Jump between original kernel and kexeced kernel and invoke
- code via KEXEC
+config ARCH_SUPPORTS_KEXEC
+   def_bool MMU
+
+config ARCH_SUPPORTS_CRASH_DUMP
+   def_bool BROKEN_ON_SMP
+
+config ARCH_SUPPORTS_KEXEC_JUMP
+   def_bool y
 
 config PHYSICAL_START
hex "Physical address where the kernel is loaded" if (EXPERT || CRASH_DUMP)
-- 
2.31.1



[PATCH v3 12/13] s390/kexec: refactor for kernel/Kconfig.kexec

2023-06-26 Thread Eric DeVolder
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

NOTE: The original Kconfig has a KEXEC_SIG which depends on
MODULE_SIG_FORMAT. However, attempts to keep the MODULE_SIG_FORMAT
dependency (using the strategy outlined in this series, and other
techniques) results in 'error: recursive dependency detected'
on CRYPTO.

Per Alexander Gordeev : "the MODULE_SIG_FORMAT
dependency was introduced with [git commit below] and in fact was not
necessary, since s390 did/does not use mod_check_sig() anyway.

 commit c8424e776b09 ("MODSIGN: Export module signature definitions")

MODULE_SIG_FORMAT is needed to select SYSTEM_DATA_VERIFICATION. But
SYSTEM_DATA_VERIFICATION is also selected by FS_VERITY*, so dropping
MODULE_SIG_FORMAT does not hurt."

Therefore, the solution is to drop the MODULE_SIG_FORMAT dependency
from KEXEC_SIG. Still results in equivalent .config files for s390.

Signed-off-by: Eric DeVolder 
---
 arch/s390/Kconfig | 65 ++-
 1 file changed, 19 insertions(+), 46 deletions(-)

diff --git a/arch/s390/Kconfig b/arch/s390/Kconfig
index 6dab9c1be508..58dc124433ca 100644
--- a/arch/s390/Kconfig
+++ b/arch/s390/Kconfig
@@ -243,6 +243,25 @@ config PGTABLE_LEVELS
 
 source "kernel/livepatch/Kconfig"
 
+config ARCH_DEFAULT_KEXEC
+   def_bool y
+
+config ARCH_SUPPORTS_KEXEC
+   def_bool y
+
+config ARCH_SUPPORTS_KEXEC_FILE
+   def_bool CRYPTO && CRYPTO_SHA256 && CRYPTO_SHA256_S390
+
+config ARCH_HAS_KEXEC_PURGATORY
+   def_bool KEXEC_FILE
+
+config ARCH_SUPPORTS_CRASH_DUMP
+   def_bool y
+   help
+ Refer to  for more details on 
this.
+ This option also enables s390 zfcpdump.
+ See also 
+
 menu "Processor type and features"
 
 config HAVE_MARCH_Z10_FEATURES
@@ -481,36 +500,6 @@ config SCHED_TOPOLOGY
 
 source "kernel/Kconfig.hz"
 
-config KEXEC
-   def_bool y
-   select KEXEC_CORE
-
-config KEXEC_FILE
-   bool "kexec file based system call"
-   select KEXEC_CORE
-   depends on CRYPTO
-   depends on CRYPTO_SHA256
-   depends on CRYPTO_SHA256_S390
-   help
- Enable the kexec file based system call. In contrast to the normal
- kexec system call this system call takes file descriptors for the
- kernel and initramfs as arguments.
-
-config ARCH_HAS_KEXEC_PURGATORY
-   def_bool y
-   depends on KEXEC_FILE
-
-config KEXEC_SIG
-   bool "Verify kernel signature during kexec_file_load() syscall"
-   depends on KEXEC_FILE && MODULE_SIG_FORMAT
-   help
- This option makes kernel signature verification mandatory for
- the kexec_file_load() syscall.
-
- In addition to that option, you need to enable signature
- verification for the corresponding kernel image type being
- loaded in order for this to work.
-
 config KERNEL_NOBP
def_bool n
prompt "Enable modified branch prediction for the kernel by default"
@@ -732,22 +721,6 @@ config VFIO_AP
 
 endmenu
 
-menu "Dump support"
-
-config CRASH_DUMP
-   bool "kernel crash dumps"
-   select KEXEC
-   help
- Generate crash dump after being started by kexec.
- Crash dump kernels are loaded in the main kernel with kexec-tools
- into a specially reserved region and then later executed after
- a crash by kdump/kexec.
- Refer to  for more details on 
this.
- This option also enables s390 zfcpdump.
- See also 
-
-endmenu
-
 config CCW
def_bool y
 
-- 
2.31.1



[PATCH v3 11/13] riscv/kexec: refactor for kernel/Kconfig.kexec

2023-06-26 Thread Eric DeVolder
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder 
---
 arch/riscv/Kconfig | 48 ++
 1 file changed, 14 insertions(+), 34 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 5966ad97c30c..c484abd9bbfd 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -585,48 +585,28 @@ config RISCV_BOOT_SPINWAIT
 
  If unsure what to do here, say N.
 
-config KEXEC
-   bool "Kexec system call"
-   depends on MMU
+config ARCH_SUPPORTS_KEXEC
+   def_bool MMU
+
+config ARCH_SELECTS_KEXEC
+   def_bool y
+   depends on KEXEC
select HOTPLUG_CPU if SMP
-   select KEXEC_CORE
-   help
- kexec is a system call that implements the ability to shutdown your
- current kernel, and to start another kernel. It is like a reboot
- but it is independent of the system firmware. And like a reboot
- you can start any kernel with it, not just Linux.
 
- The name comes from the similarity to the exec system call.
+config ARCH_SUPPORTS_KEXEC_FILE
+   def_bool 64BIT && MMU && CRYPTO=y && CRYPTO_SHA256=y
 
-config KEXEC_FILE
-   bool "kexec file based systmem call"
-   depends on 64BIT && MMU
-   select HAVE_IMA_KEXEC if IMA
-   select KEXEC_CORE
+config ARCH_SELECTS_KEXEC_FILE
+   def_bool y
+   depends on KEXEC_FILE
select KEXEC_ELF
-   help
- This is new version of kexec system call. This system call is
- file based and takes file descriptors as system call argument
- for kernel and initramfs as opposed to list of segments as
- accepted by previous system call.
-
- If you don't know what to do here, say Y.
+   select HAVE_IMA_KEXEC if IMA
 
 config ARCH_HAS_KEXEC_PURGATORY
def_bool KEXEC_FILE
-   depends on CRYPTO=y
-   depends on CRYPTO_SHA256=y
 
-config CRASH_DUMP
-   bool "Build kdump crash kernel"
-   help
- Generate crash dump after being started by kexec. This should
- be normally only set in special crash dump kernels which are
- loaded in the main kernel with kexec-tools into a specially
- reserved region and then later executed after a crash by
- kdump/kexec.
-
- For more details see Documentation/admin-guide/kdump/kdump.rst
+config ARCH_SUPPORTS_CRASH_DUMP
+   def_bool y
 
 config COMPAT
bool "Kernel support for 32-bit U-mode"
-- 
2.31.1



[PATCH v3 10/13] powerpc/kexec: refactor for kernel/Kconfig.kexec

2023-06-26 Thread Eric DeVolder
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder 
Reviewed-by: Sourabh Jain 
---
 arch/powerpc/Kconfig | 55 ++--
 1 file changed, 17 insertions(+), 38 deletions(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index bff5820b7cda..70edbda08ae3 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -588,41 +588,21 @@ config PPC64_SUPPORTS_MEMORY_FAILURE
default "y" if PPC_POWERNV
select ARCH_SUPPORTS_MEMORY_FAILURE
 
-config KEXEC
-   bool "kexec system call"
-   depends on PPC_BOOK3S || PPC_E500 || (44x && !SMP)
-   select KEXEC_CORE
-   help
- kexec is a system call that implements the ability to shutdown your
- current kernel, and to start another kernel.  It is like a reboot
- but it is independent of the system firmware.   And like a reboot
- you can start any kernel with it, not just Linux.
-
- The name comes from the similarity to the exec system call.
-
- It is an ongoing process to be certain the hardware in a machine
- is properly shutdown, so do not be surprised if this code does not
- initially work for you.  As of this writing the exact hardware
- interface is strongly in flux, so no good recommendation can be
- made.
-
-config KEXEC_FILE
-   bool "kexec file based system call"
-   select KEXEC_CORE
-   select HAVE_IMA_KEXEC if IMA
-   select KEXEC_ELF
-   depends on PPC64
-   depends on CRYPTO=y
-   depends on CRYPTO_SHA256=y
-   help
- This is a new version of the kexec system call. This call is
- file based and takes in file descriptors as system call arguments
- for kernel and initramfs as opposed to a list of segments as is the
- case for the older kexec call.
+config ARCH_SUPPORTS_KEXEC
+   def_bool PPC_BOOK3S || PPC_E500 || (44x && !SMP)
+
+config ARCH_SUPPORTS_KEXEC_FILE
+   def_bool PPC64 && CRYPTO=y && CRYPTO_SHA256=y
 
 config ARCH_HAS_KEXEC_PURGATORY
def_bool KEXEC_FILE
 
+config ARCH_SELECTS_KEXEC_FILE
+   def_bool y
+   depends on KEXEC_FILE
+   select KEXEC_ELF
+   select HAVE_IMA_KEXEC if IMA
+
 config PPC64_BIG_ENDIAN_ELF_ABI_V2
bool "Build big-endian kernel using ELF ABI V2 (EXPERIMENTAL)"
depends on PPC64 && CPU_BIG_ENDIAN
@@ -682,14 +662,13 @@ config RELOCATABLE_TEST
  loaded at, which tends to be non-zero and therefore test the
  relocation code.
 
-config CRASH_DUMP
-   bool "Build a dump capture kernel"
-   depends on PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP)
+config ARCH_SUPPORTS_CRASH_DUMP
+   def_bool PPC64 || PPC_BOOK3S_32 || PPC_85xx || (44x && !SMP)
+
+config ARCH_SELECTS_CRASH_DUMP
+   def_bool y
+   depends on CRASH_DUMP
select RELOCATABLE if PPC64 || 44x || PPC_85xx
-   help
- Build a kernel suitable for use as a dump capture kernel.
- The same kernel binary can be used as production kernel and dump
- capture kernel.
 
 config FA_DUMP
bool "Firmware-assisted dump"
-- 
2.31.1



[PATCH v3 08/13] mips/kexec: refactor for kernel/Kconfig.kexec

2023-06-26 Thread Eric DeVolder
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder 
Acked-by: Thomas Bogendoerfer 
---
 arch/mips/Kconfig | 32 +---
 1 file changed, 5 insertions(+), 27 deletions(-)

diff --git a/arch/mips/Kconfig b/arch/mips/Kconfig
index 675a8660cb85..3d9960942cbd 100644
--- a/arch/mips/Kconfig
+++ b/arch/mips/Kconfig
@@ -2873,33 +2873,11 @@ config HZ
 config SCHED_HRTICK
def_bool HIGH_RES_TIMERS
 
-config KEXEC
-   bool "Kexec system call"
-   select KEXEC_CORE
-   help
- kexec is a system call that implements the ability to shutdown your
- current kernel, and to start another kernel.  It is like a reboot
- but it is independent of the system firmware.   And like a reboot
- you can start any kernel with it, not just Linux.
-
- The name comes from the similarity to the exec system call.
-
- It is an ongoing process to be certain the hardware in a machine
- is properly shutdown, so do not be surprised if this code does not
- initially work for you.  As of this writing the exact hardware
- interface is strongly in flux, so no good recommendation can be
- made.
-
-config CRASH_DUMP
-   bool "Kernel crash dumps"
-   help
- Generate crash dump after being started by kexec.
- This should be normally only set in special crash dump kernels
- which are loaded in the main kernel with kexec-tools into
- a specially reserved region and then later executed after
- a crash by kdump/kexec. The crash dump kernel must be compiled
- to a memory address not used by the main kernel or firmware using
- PHYSICAL_START.
+config ARCH_SUPPORTS_KEXEC
+   def_bool y
+
+config ARCH_SUPPORTS_CRASH_DUMP
+   def_bool y
 
 config PHYSICAL_START
hex "Physical address where the kernel is loaded"
-- 
2.31.1



[PATCH v3 09/13] parisc/kexec: refactor for kernel/Kconfig.kexec

2023-06-26 Thread Eric DeVolder
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder 
---
 arch/parisc/Kconfig | 34 +++---
 1 file changed, 11 insertions(+), 23 deletions(-)

diff --git a/arch/parisc/Kconfig b/arch/parisc/Kconfig
index 967bde65dd0e..8de24bc503aa 100644
--- a/arch/parisc/Kconfig
+++ b/arch/parisc/Kconfig
@@ -348,29 +348,17 @@ config NR_CPUS
default "4" if 64BIT
default "16"
 
-config KEXEC
-   bool "Kexec system call"
-   select KEXEC_CORE
-   help
- kexec is a system call that implements the ability to shutdown your
- current kernel, and to start another kernel.  It is like a reboot
- but it is independent of the system firmware.   And like a reboot
- you can start any kernel with it, not just Linux.
-
- It is an ongoing process to be certain the hardware in a machine
- shutdown, so do not be surprised if this code does not
- initially work for you.
-
-config KEXEC_FILE
-   bool "kexec file based system call"
-   select KEXEC_CORE
-   select KEXEC_ELF
-   help
- This enables the kexec_file_load() System call. This is
- file based and takes file descriptors as system call argument
- for kernel and initramfs as opposed to list of segments as
- accepted by previous system call.
-
 endmenu
 
+config ARCH_SUPPORTS_KEXEC
+   def_bool y
+
+config ARCH_SUPPORTS_KEXEC_FILE
+   def_bool y
+
+config ARCH_SELECTS_KEXEC_FILE
+   def_bool y
+   depends on KEXEC_FILE
+   select KEXEC_ELF
+
 source "drivers/parisc/Kconfig"
-- 
2.31.1



[PATCH v3 00/13] refactor Kconfig to consolidate KEXEC and CRASH options

2023-06-26 Thread Eric DeVolder
The Kconfig is refactored to consolidate KEXEC and CRASH options from
various arch/<arch>/Kconfig files into new file kernel/Kconfig.kexec.

The Kconfig.kexec is now a submenu titled "Kexec and crash features"
located under "General Setup".

The following options are impacted:

 - KEXEC
 - KEXEC_FILE
 - KEXEC_SIG
 - KEXEC_SIG_FORCE
 - KEXEC_BZIMAGE_VERIFY_SIG
 - KEXEC_JUMP
 - CRASH_DUMP

Over time, these options have been copied between Kconfig files and
are very similar to one another, but with slight differences.

The following architectures are impacted by the refactor (because of
use of one or more KEXEC/CRASH options):

 - arm
 - arm64
 - ia64
 - loongarch
 - m68k
 - mips
 - parisc
 - powerpc
 - riscv
 - s390
 - sh
 - x86 

More information:

In the patch series "crash: Kernel handling of CPU and memory hot
un/plug"

 https://lore.kernel.org/lkml/20230503224145.7405-1-eric.devol...@oracle.com/

the new kernel feature introduces the config option CRASH_HOTPLUG.

In reviewing, Thomas Gleixner requested that the new config option
not be placed in x86 Kconfig. Rather the option needs a generic/common
home. To Thomas' point, the KEXEC and CRASH options have largely been
duplicated in the various arch/<arch>/Kconfig files, with minor
differences. This kind of proliferation is to be avoided/stopped.

 https://lore.kernel.org/lkml/875y91yv63.ffs@tglx/

To that end, I have refactored the arch Kconfigs so as to consolidate
the various KEXEC and CRASH options. Generally speaking, this work has
the following themes:

- KEXEC and CRASH options are moved into new file kernel/Kconfig.kexec
  - These items from arch/Kconfig:
  CRASH_CORE KEXEC_CORE KEXEC_ELF HAVE_IMA_KEXEC
  - These items from arch/x86/Kconfig form the common options:
  KEXEC KEXEC_FILE KEXEC_SIG KEXEC_SIG_FORCE
  KEXEC_BZIMAGE_VERIFY_SIG KEXEC_JUMP CRASH_DUMP
  - The crash hotplug series appends CRASH_HOTPLUG to Kconfig.kexec
  NOTE: PHYSICAL_START could be argued to be included in this series.
- The Kconfig.kexec is now a submenu titled "Kexec and crash features"
- The Kconfig.kexec is now listed in "General Setup" submenu from
  init/Kconfig
- To control the main common options, new options ARCH_SUPPORTS_KEXEC,
  ARCH_SUPPORTS_KEXEC_FILE and ARCH_SUPPORTS_CRASH_DUMP are introduced.
  NOTE: The existing ARCH_HAS_KEXEC_PURGATORY remains unchanged.
- To account for the slight differences, new options ARCH_SELECTS_KEXEC,
  ARCH_SELECTS_KEXEC_FILE and ARCH_SELECTS_CRASH_DUMP are used to
  elicit the same side effects as the original arch/<arch>/Kconfig
  files for KEXEC and CRASH options.

An example, from 'make menuconfig', illustrating the submenu:

  > General setup > Kexec and crash features
  [*] Enable kexec system call
  [*] Enable kexec file based system call
  [*]   Verify kernel signature during kexec_file_load() syscall
  [ ] Require a valid signature in kexec_file_load() syscall
  [ ] Enable bzImage signature verification support
  [*] kexec jump
  [*] kernel crash dumps
  [*]   Update the crash elfcorehdr on system configuration changes

The three main options are KEXEC, KEXEC_FILE and CRASH_DUMP. In the
process of consolidating these options, I encountered slight differences
in the coding of these options in several of the architectures. As a
result, I settled on the following solution:

- Each of the three main options has a 'depends on ARCH_SUPPORTS_<option>'
  statement: ARCH_SUPPORTS_KEXEC, ARCH_SUPPORTS_KEXEC_FILE,
  ARCH_SUPPORTS_CRASH_DUMP.

  For example, the KEXEC_FILE option has a 'depends on
  ARCH_SUPPORTS_KEXEC_FILE' statement.

- The boolean ARCH_SUPPORTS_<option> in effect allows the arch to
  determine when the feature is allowed.  Archs which don't have the
  feature simply do not provide the corresponding ARCH_SUPPORTS_<option>.
  For each arch, where there previously were KEXEC and/or CRASH
  options, these have been replaced with the corresponding boolean
  ARCH_SUPPORTS_<option>, and an appropriate def_bool statement.

  For example, if the arch supports KEXEC_FILE, then the
  ARCH_SUPPORTS_KEXEC_FILE simply has a 'def_bool y'. This permits the
  KEXEC_FILE option to be available.

  If the arch has a 'depends on' statement in its original coding
  of the option, then that expression becomes part of the def_bool
  expression. For example, arm64 had:

  config KEXEC
depends on PM_SLEEP_SMP

  and in this solution, this converts to:

  config ARCH_SUPPORTS_KEXEC
def_bool PM_SLEEP_SMP


- In order to account for the differences in the config coding for
  the three common options, the ARCH_SELECTS_<option> is used.
  This option has a 'depends on <option>' statement to couple it
  to the main option, and from there can insert the differences
  from the common option and the arch's original coding of that option.

  For example, a few archs enable CRYPTO and CRYPTO_SHA256 for
  KEXEC_FILE. These require an ARCH_SELECTS_KEXEC_FILE and
  'select CRYPTO' and 'select CRYPTO_SHA256' statements.
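
  For illustration only (a sketch of the pattern, not one of the patches
  in this series), such an arch fragment would look roughly like:

  config ARCH_SUPPORTS_KEXEC_FILE
          def_bool y

  config ARCH_SELECTS_KEXEC_FILE
          def_bool y
          depends on KEXEC_FILE
          select CRYPTO
          select CRYPTO_SHA256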

Illustrating the option relationships:

For KEXEC:
 ARCH_SUPPORTS_KEXEC <- KEXEC <- 

[PATCH v3 03/13] arm/kexec: refactor for kernel/Kconfig.kexec

2023-06-26 Thread Eric DeVolder
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder 
---
 arch/arm/Kconfig | 29 -
 1 file changed, 4 insertions(+), 25 deletions(-)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index 0fb4b218f665..6af0105407af 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1639,20 +1639,8 @@ config XIP_DEFLATED_DATA
  copied, saving some precious ROM space. A possible drawback is a
  slightly longer boot delay.
 
-config KEXEC
-   bool "Kexec system call (EXPERIMENTAL)"
-   depends on (!SMP || PM_SLEEP_SMP)
-   depends on MMU
-   select KEXEC_CORE
-   help
- kexec is a system call that implements the ability to shutdown your
- current kernel, and to start another kernel.  It is like a reboot
- but it is independent of the system firmware.   And like a reboot
- you can start any kernel with it, not just Linux.
-
- It is an ongoing process to be certain the hardware in a machine
- is properly shutdown, so do not be surprised if this code does not
- initially work for you.
+config ARCH_SUPPORTS_KEXEC
+   def_bool (!SMP || PM_SLEEP_SMP) && MMU
 
 config ATAGS_PROC
bool "Export atags in procfs"
@@ -1662,17 +1650,8 @@ config ATAGS_PROC
  Should the atags used to boot the kernel be exported in an "atags"
  file in procfs. Useful with kexec.
 
-config CRASH_DUMP
-   bool "Build kdump crash kernel (EXPERIMENTAL)"
-   help
- Generate crash dump after being started by kexec. This should
- be normally only set in special crash dump kernels which are
- loaded in the main kernel with kexec-tools into a specially
- reserved region and then later executed after a crash by
- kdump/kexec. The crash dump kernel must be compiled to a
- memory address not used by the main kernel
-
- For more details see Documentation/admin-guide/kdump/kdump.rst
+config ARCH_SUPPORTS_CRASH_DUMP
+   def_bool y
 
 config AUTO_ZRELADDR
bool "Auto calculation of the decompressed kernel image address" if 
!ARCH_MULTIPLATFORM
-- 
2.31.1



[PATCH v3 04/13] ia64/kexec: refactor for kernel/Kconfig.kexec

2023-06-26 Thread Eric DeVolder
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder 
---
 arch/ia64/Kconfig | 28 +---
 1 file changed, 5 insertions(+), 23 deletions(-)

diff --git a/arch/ia64/Kconfig b/arch/ia64/Kconfig
index 21fa63ce5ffc..df54a038e6da 100644
--- a/arch/ia64/Kconfig
+++ b/arch/ia64/Kconfig
@@ -360,31 +360,13 @@ config IA64_HP_AML_NFW
  the "force" module parameter, e.g., with the "aml_nfw.force"
  kernel command line option.
 
-config KEXEC
-   bool "kexec system call"
-   depends on !SMP || HOTPLUG_CPU
-   select KEXEC_CORE
-   help
- kexec is a system call that implements the ability to shutdown your
- current kernel, and to start another kernel.  It is like a reboot
- but it is independent of the system firmware.   And like a reboot
- you can start any kernel with it, not just Linux.
-
- The name comes from the similarity to the exec system call.
-
- It is an ongoing process to be certain the hardware in a machine
- is properly shutdown, so do not be surprised if this code does not
- initially work for you.  As of this writing the exact hardware
- interface is strongly in flux, so no good recommendation can be
- made.
+endmenu
 
-config CRASH_DUMP
- bool "kernel crash dumps"
- depends on IA64_MCA_RECOVERY && (!SMP || HOTPLUG_CPU)
- help
-   Generate crash dump after being started by kexec.
+config ARCH_SUPPORTS_KEXEC
+   def_bool !SMP || HOTPLUG_CPU
 
-endmenu
+config ARCH_SUPPORTS_CRASH_DUMP
+   def_bool IA64_MCA_RECOVERY && (!SMP || HOTPLUG_CPU)
 
 menu "Power management and ACPI options"
 
-- 
2.31.1



[PATCH v3 05/13] arm64/kexec: refactor for kernel/Kconfig.kexec

2023-06-26 Thread Eric DeVolder
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder 
---
 arch/arm64/Kconfig | 62 +-
 1 file changed, 12 insertions(+), 50 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 343e1e1cae10..dfe47efa7cc1 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -1433,60 +1433,22 @@ config PARAVIRT_TIME_ACCOUNTING
 
  If in doubt, say N here.
 
-config KEXEC
-   depends on PM_SLEEP_SMP
-   select KEXEC_CORE
-   bool "kexec system call"
-   help
- kexec is a system call that implements the ability to shutdown your
- current kernel, and to start another kernel.  It is like a reboot
- but it is independent of the system firmware.   And like a reboot
- you can start any kernel with it, not just Linux.
-
-config KEXEC_FILE
-   bool "kexec file based system call"
-   select KEXEC_CORE
-   select HAVE_IMA_KEXEC if IMA
-   help
- This is new version of kexec system call. This system call is
- file based and takes file descriptors as system call argument
- for kernel and initramfs as opposed to list of segments as
- accepted by previous system call.
-
-config KEXEC_SIG
-   bool "Verify kernel signature during kexec_file_load() syscall"
-   depends on KEXEC_FILE
-   help
- Select this option to verify a signature with loaded kernel
- image. If configured, any attempt of loading a image without
- valid signature will fail.
-
- In addition to that option, you need to enable signature
- verification for the corresponding kernel image type being
- loaded in order for this to work.
+config ARCH_SUPPORTS_KEXEC
+   def_bool PM_SLEEP_SMP
 
-config KEXEC_IMAGE_VERIFY_SIG
-   bool "Enable Image signature verification support"
-   default y
-   depends on KEXEC_SIG
-   depends on EFI && SIGNED_PE_FILE_VERIFICATION
-   help
- Enable Image signature verification support.
+config ARCH_SUPPORTS_KEXEC_FILE
+   def_bool y
 
-comment "Support for PE file signature verification disabled"
-   depends on KEXEC_SIG
-   depends on !EFI || !SIGNED_PE_FILE_VERIFICATION
+config ARCH_SELECTS_KEXEC_FILE
+   def_bool y
+   depends on KEXEC_FILE
+   select HAVE_IMA_KEXEC if IMA
 
-config CRASH_DUMP
-   bool "Build kdump crash kernel"
-   help
- Generate crash dump after being started by kexec. This should
- be normally only set in special crash dump kernels which are
- loaded in the main kernel with kexec-tools into a specially
- reserved region and then later executed after a crash by
- kdump/kexec.
+config ARCH_DEFAULT_KEXEC_IMAGE_VERIFY_SIG
+   def_bool y
 
- For more details see Documentation/admin-guide/kdump/kdump.rst
+config ARCH_SUPPORTS_CRASH_DUMP
+   def_bool y
 
 config TRANS_TABLE
def_bool y
-- 
2.31.1



[PATCH v3 07/13] m68k/kexec: refactor for kernel/Kconfig.kexec

2023-06-26 Thread Eric DeVolder
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder 
Reviewed-by: Geert Uytterhoeven 
Acked-by: Geert Uytterhoeven 
---
 arch/m68k/Kconfig | 19 ++-
 1 file changed, 2 insertions(+), 17 deletions(-)

diff --git a/arch/m68k/Kconfig b/arch/m68k/Kconfig
index 40198a1ebe27..7b71916d1519 100644
--- a/arch/m68k/Kconfig
+++ b/arch/m68k/Kconfig
@@ -88,23 +88,8 @@ config MMU_SUN3
bool
depends on MMU && !MMU_MOTOROLA && !MMU_COLDFIRE
 
-config KEXEC
-   bool "kexec system call"
-   depends on M68KCLASSIC && MMU
-   select KEXEC_CORE
-   help
- kexec is a system call that implements the ability to shutdown your
- current kernel, and to start another kernel.  It is like a reboot
- but it is independent of the system firmware.   And like a reboot
- you can start any kernel with it, not just Linux.
-
- The name comes from the similarity to the exec system call.
-
- It is an ongoing process to be certain the hardware in a machine
- is properly shutdown, so do not be surprised if this code does not
- initially work for you.  As of this writing the exact hardware
- interface is strongly in flux, so no good recommendation can be
- made.
+config ARCH_SUPPORTS_KEXEC
+   def_bool M68KCLASSIC && MMU
 
 config BOOTINFO_PROC
bool "Export bootinfo in procfs"
-- 
2.31.1



[PATCH v3 02/13] x86/kexec: refactor for kernel/Kconfig.kexec

2023-06-26 Thread Eric DeVolder
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder 
---
 arch/x86/Kconfig | 89 +++-
 1 file changed, 13 insertions(+), 76 deletions(-)

diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 53bab123a8ee..1afc6ca2986b 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -2043,88 +2043,25 @@ config EFI_RUNTIME_MAP
 
 source "kernel/Kconfig.hz"
 
-config KEXEC
-   bool "kexec system call"
-   select KEXEC_CORE
-   help
- kexec is a system call that implements the ability to shutdown your
- current kernel, and to start another kernel.  It is like a reboot
- but it is independent of the system firmware.   And like a reboot
- you can start any kernel with it, not just Linux.
-
- The name comes from the similarity to the exec system call.
-
- It is an ongoing process to be certain the hardware in a machine
- is properly shutdown, so do not be surprised if this code does not
- initially work for you.  As of this writing the exact hardware
- interface is strongly in flux, so no good recommendation can be
- made.
-
-config KEXEC_FILE
-   bool "kexec file based system call"
-   select KEXEC_CORE
-   select HAVE_IMA_KEXEC if IMA
-   depends on X86_64
-   depends on CRYPTO=y
-   depends on CRYPTO_SHA256=y
-   help
- This is new version of kexec system call. This system call is
- file based and takes file descriptors as system call argument
- for kernel and initramfs as opposed to list of segments as
- accepted by previous system call.
+config ARCH_SUPPORTS_KEXEC
+   def_bool y
 
-config ARCH_HAS_KEXEC_PURGATORY
-   def_bool KEXEC_FILE
+config ARCH_SUPPORTS_KEXEC_FILE
+   def_bool X86_64 && CRYPTO && CRYPTO_SHA256
 
-config KEXEC_SIG
-   bool "Verify kernel signature during kexec_file_load() syscall"
+config ARCH_SELECTS_KEXEC_FILE
+   def_bool y
depends on KEXEC_FILE
-   help
-
- This option makes the kexec_file_load() syscall check for a valid
- signature of the kernel image.  The image can still be loaded without
- a valid signature unless you also enable KEXEC_SIG_FORCE, though if
- there's a signature that we can check, then it must be valid.
-
- In addition to this option, you need to enable signature
- verification for the corresponding kernel image type being
- loaded in order for this to work.
-
-config KEXEC_SIG_FORCE
-   bool "Require a valid signature in kexec_file_load() syscall"
-   depends on KEXEC_SIG
-   help
- This option makes kernel signature verification mandatory for
- the kexec_file_load() syscall.
+   select HAVE_IMA_KEXEC if IMA
 
-config KEXEC_BZIMAGE_VERIFY_SIG
-   bool "Enable bzImage signature verification support"
-   depends on KEXEC_SIG
-   depends on SIGNED_PE_FILE_VERIFICATION
-   select SYSTEM_TRUSTED_KEYRING
-   help
- Enable bzImage signature verification support.
+config ARCH_HAS_KEXEC_PURGATORY
+   def_bool KEXEC_FILE
 
-config CRASH_DUMP
-   bool "kernel crash dumps"
-   depends on X86_64 || (X86_32 && HIGHMEM)
-   help
- Generate crash dump after being started by kexec.
- This should be normally only set in special crash dump kernels
- which are loaded in the main kernel with kexec-tools into
- a specially reserved region and then later executed after
- a crash by kdump/kexec. The crash dump kernel must be compiled
- to a memory address not used by the main kernel or BIOS using
- PHYSICAL_START, or it must be built as a relocatable image
- (CONFIG_RELOCATABLE=y).
- For more details see Documentation/admin-guide/kdump/kdump.rst
+config ARCH_SUPPORTS_KEXEC_JUMP
+   def_bool y
 
-config KEXEC_JUMP
-   bool "kexec jump"
-   depends on KEXEC && HIBERNATION
-   help
- Jump between original kernel and kexeced kernel and invoke
- code in physical address mode via KEXEC
+config ARCH_SUPPORTS_CRASH_DUMP
+   def_bool X86_64 || (X86_32 && HIGHMEM)
 
 config PHYSICAL_START
hex "Physical address where the kernel is loaded" if (EXPERT || 
CRASH_DUMP)
-- 
2.31.1



[PATCH v3 06/13] loongarch/kexec: refactor for kernel/Kconfig.kexec

2023-06-26 Thread Eric DeVolder
The kexec and crash kernel options are provided in the common
kernel/Kconfig.kexec. Utilize the common options and provide
the ARCH_SUPPORTS_ and ARCH_SELECTS_ entries to recreate the
equivalent set of KEXEC and CRASH options.

Signed-off-by: Eric DeVolder 
---
 arch/loongarch/Kconfig | 26 +++---
 1 file changed, 7 insertions(+), 19 deletions(-)

diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index d38b066fc931..3542bf669c78 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -481,28 +481,16 @@ config ARCH_STRICT_ALIGN
  to run kernel only on systems with h/w unaligned access support in
  order to optimise for performance.
 
-config KEXEC
-   bool "Kexec system call"
-   select KEXEC_CORE
-   help
- kexec is a system call that implements the ability to shutdown your
- current kernel, and to start another kernel.  It is like a reboot
- but it is independent of the system firmware.   And like a reboot
- you can start any kernel with it, not just Linux.
+config ARCH_SUPPORTS_KEXEC
+   def_bool y
 
- The name comes from the similarity to the exec system call.
+config ARCH_SUPPORTS_CRASH_DUMP
+   def_bool y
 
-config CRASH_DUMP
-   bool "Build kdump crash kernel"
+config ARCH_SELECTS_CRASH_DUMP
+   def_bool y
+   depends on CRASH_DUMP
select RELOCATABLE
-   help
- Generate crash dump after being started by kexec. This should
- be normally only set in special crash dump kernels which are
- loaded in the main kernel with kexec-tools into a specially
- reserved region and then later executed after a crash by
- kdump/kexec.
-
- For more details see Documentation/admin-guide/kdump/kdump.rst
 
 config RELOCATABLE
bool "Relocatable kernel"
-- 
2.31.1



[PATCH v3 01/13] kexec: consolidate kexec and crash options into kernel/Kconfig.kexec

2023-06-26 Thread Eric DeVolder
The config options for kexec and crash features are consolidated
into new file kernel/Kconfig.kexec. Under the "General Setup" submenu
is a new submenu "Kexec and crash handling". All the kexec and
crash options that were once in the arch-dependent submenu "Processor
type and features" are now consolidated in the new submenu.

The following options are impacted:

 - KEXEC
 - KEXEC_FILE
 - KEXEC_SIG
 - KEXEC_SIG_FORCE
 - KEXEC_BZIMAGE_VERIFY_SIG
 - KEXEC_JUMP
 - CRASH_DUMP

The three main options are KEXEC, KEXEC_FILE and CRASH_DUMP.

Architectures specify support of certain KEXEC and CRASH features with
similarly named new ARCH_SUPPORTS_<option> config options.

Architectures can utilize the new ARCH_SELECTS_<option> config
options to specify additional components when <option> is enabled.

To summarize, the ARCH_SUPPORTS_<option> permits the <option> to be
enabled, and the ARCH_SELECTS_<option> handles side effects (i.e.
select statements).

Signed-off-by: Eric DeVolder 
---
 arch/Kconfig |  13 -
 init/Kconfig |   2 +
 kernel/Kconfig.kexec | 110 +++
 3 files changed, 112 insertions(+), 13 deletions(-)
 create mode 100644 kernel/Kconfig.kexec

diff --git a/arch/Kconfig b/arch/Kconfig
index 205fd23e0cad..a37730679730 100644
--- a/arch/Kconfig
+++ b/arch/Kconfig
@@ -11,19 +11,6 @@ source "arch/$(SRCARCH)/Kconfig"
 
 menu "General architecture-dependent options"
 
-config CRASH_CORE
-   bool
-
-config KEXEC_CORE
-   select CRASH_CORE
-   bool
-
-config KEXEC_ELF
-   bool
-
-config HAVE_IMA_KEXEC
-   bool
-
 config ARCH_HAS_SUBPAGE_FAULTS
bool
help
diff --git a/init/Kconfig b/init/Kconfig
index 32c24950c4ce..4424447e23a5 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -1917,6 +1917,8 @@ config BINDGEN_VERSION_TEXT
 config TRACEPOINTS
bool
 
+source "kernel/Kconfig.kexec"
+
 endmenu# General setup
 
 source "arch/Kconfig"
diff --git a/kernel/Kconfig.kexec b/kernel/Kconfig.kexec
new file mode 100644
index ..5d576ddfd999
--- /dev/null
+++ b/kernel/Kconfig.kexec
@@ -0,0 +1,110 @@
+# SPDX-License-Identifier: GPL-2.0-only
+
+menu "Kexec and crash features"
+
+config CRASH_CORE
+   bool
+
+config KEXEC_CORE
+   select CRASH_CORE
+   bool
+
+config KEXEC_ELF
+   bool
+
+config HAVE_IMA_KEXEC
+   bool
+
+config KEXEC
+   bool "Enable kexec system call"
+   default ARCH_DEFAULT_KEXEC
+   depends on ARCH_SUPPORTS_KEXEC
+   select KEXEC_CORE
+   help
+ kexec is a system call that implements the ability to shutdown your
+ current kernel, and to start another kernel. It is like a reboot
+ but it is independent of the system firmware. And like a reboot
+ you can start any kernel with it, not just Linux.
+
+ The name comes from the similarity to the exec system call.
+
+ It is an ongoing process to be certain the hardware in a machine
+ is properly shutdown, so do not be surprised if this code does not
+ initially work for you. As of this writing the exact hardware
+ interface is strongly in flux, so no good recommendation can be
+ made.
+
+config KEXEC_FILE
+   bool "Enable kexec file based system call"
+   depends on ARCH_SUPPORTS_KEXEC_FILE
+   select KEXEC_CORE
+   help
+ This is new version of kexec system call. This system call is
+ file based and takes file descriptors as system call argument
+ for kernel and initramfs as opposed to list of segments as
+ accepted by kexec system call.
+
+config KEXEC_SIG
+   bool "Verify kernel signature during kexec_file_load() syscall"
+   depends on KEXEC_FILE
+   help
+ This option makes the kexec_file_load() syscall check for a valid
+ signature of the kernel image. The image can still be loaded without
+ a valid signature unless you also enable KEXEC_SIG_FORCE, though if
+ there's a signature that we can check, then it must be valid.
+
+ In addition to this option, you need to enable signature
+ verification for the corresponding kernel image type being
+ loaded in order for this to work.
+
+config KEXEC_SIG_FORCE
+   bool "Require a valid signature in kexec_file_load() syscall"
+   depends on KEXEC_SIG
+   help
+ This option makes kernel signature verification mandatory for
+ the kexec_file_load() syscall.
+
+config KEXEC_IMAGE_VERIFY_SIG
+   bool "Enable Image signature verification support"
+   default ARCH_DEFAULT_KEXEC_IMAGE_VERIFY_SIG
+   depends on KEXEC_SIG
+   depends on EFI && SIGNED_PE_FILE_VERIFICATION
+   help
+ Enable Image signature verification support.
+
+config KEXEC_BZIMAGE_VERIFY_SIG
+   bool "Enable bzImage signature verification support"
+   depends on KEXEC_SIG
+   depends on SIGNED_PE_FILE_VERIFICATION
+   select SYSTEM_TRUSTED_KEYRING
+   help
+ Enable 

Re: [kvm-unit-tests PATCH 2/2] Link with "-z noexecstack" to avoid warning from newer versions of ld

2023-06-26 Thread Sean Christopherson
On Fri, Jun 23, 2023, Thomas Huth wrote:
> On 23/06/2023 16.24, Sean Christopherson wrote:
> > On Fri, Jun 23, 2023, Thomas Huth wrote:
> > > Newer versions of ld (from binutils 2.40) complain on s390x and x86:
> > > 
> > >   ld: warning: s390x/cpu.o: missing .note.GNU-stack section implies
> > >executable stack
> > >   ld: NOTE: This behaviour is deprecated and will be removed in a
> > > future version of the linker
> > > 
> > > We can silence these warnings by using "-z noexecstack" for linking
> > > (which should not have any real influence on the kvm-unit-tests since
> > > the information from the ELF header is not used here anyway, so it's
> > > just cosmetics).
> > > 
> > > Signed-off-by: Thomas Huth 
> > > ---
> > >   Makefile | 2 +-
> > >   1 file changed, 1 insertion(+), 1 deletion(-)
> > > 
> > > diff --git a/Makefile b/Makefile
> > > index 0e5d85a1..20f7137c 100644
> > > --- a/Makefile
> > > +++ b/Makefile
> > > @@ -96,7 +96,7 @@ CFLAGS += -Woverride-init -Wmissing-prototypes 
> > > -Wstrict-prototypes
> > >   autodepend-flags = -MMD -MF $(dir $*).$(notdir $*).d
> > > -LDFLAGS += -nostdlib
> > > +LDFLAGS += -nostdlib -z noexecstack
> > 
> > Drat, the pull request[1] I sent to Paolo yesterday only fixes x86[2].
> 
> Oops, sorry, I did not notice that patch in my overcrowded mailboxes (or
> forgot about it during KVM forum...) :-/

Heh, you gave a Reviewed-by[*], so either it's the latter, or you've got a clone
running around :-)

[*] https://lore.kernel.org/all/916aac4f-97b8-70c2-de39-87438eb4a...@redhat.com

> > Paolo, want me to redo the pull request to drop the x86-specific patch?
> 
> I can also respin my patch on top of your series later ... the problem
> currently also only seems to happen on x86 and s390x, on ppc64 and aarch64,
> the linker does not complain ... so maybe it's even better to do it
> per-architecture only anyway? Opinions?

I don't think it makes sense to do this per-arch; other architectures likely
aren't problematic purely because of linker-specific behavior, e.g. see

https://patches.linaro.org/project/binutils/patch/1506025575-1559-1-git-send-email-jim.wil...@linaro.org
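
(As an aside, not from the thread: the same warning can also be silenced
per object file by giving each assembly source an empty GNU-stack note,
i.e. adding a line like

        .section .note.GNU-stack,"",@progbits

to the .S files. The linker flag used above fixes it in one place instead.)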


Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-26 Thread Song Liu
On Mon, Jun 26, 2023 at 5:31 AM Mark Rutland  wrote:
>
[...]
> >
> > So the idea was that jit_text_alloc() will have a cache of large pages
> > mapped ROX, will allocate memory from those caches and there will be
> > jit_update() that uses text poking for writing to that memory.
> >
> > Upon allocation of a large page to increase the cache, that large page will
> > be "invalidated" by filling it with breakpoint instructions (e.g int3 on
> > x86)
>
> Does that work on x86?
>
> That is in no way gauranteed for other architectures; on arm64 you need
> explicit cache maintenance (with I-cache maintenance at the VA to be executed
> from) followed by context-synchronization-events (e.g. via ISB instructions, 
> or
> IPIs).

I guess we need:
1) Invalidate unused part of the huge ROX pages;
2) Do not put two jit users (including module text, bpf, etc.) in the
same cache line;
3) Explicit cache maintenance;
4) context-synchronization-events.

Would these (or a subset of them) be sufficient to protect us from torn reads?

Thanks,
Song


Re: [PATCH v2 2/2] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to MM CPUs

2023-06-26 Thread Dave Hansen
On 6/26/23 07:36, ypode...@redhat.com wrote:
> On Thu, 2023-06-22 at 06:37 -0700, Dave Hansen wrote:
>> On 6/22/23 06:14, ypode...@redhat.com wrote:
>>> I will send a new version with the local variable as you suggested
>>> soon.
>>> As for the config name, what about CONFIG_ARCH_HAS_MM_CPUMASK?
>>
>> The confusing part about that name is that mm_cpumask() and
>> mm->cpu_bitmap[] are defined unconditionally.  So, they're *around*
>> unconditionally but just aren't updated.
>>
> I think you're right about the config name,
> How about the
> CONFIG_ARCH_USE_MM_CPUMASK?
> This has the right semantic as these archs use the cpumask field of the
> mm struct.

"USE" is still a command.  It should, at worst, be "USES".  But that's
still kinda generic.  How about:

CONFIG_ARCH_UPDATES_MM_CPUMASK

?
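
(Purely to illustrate the intent of such an option -- a rough sketch, not
the actual patch; the wrapper name is made up and the mm pointer is
assumed to be available at the call site:)

        static void tlb_remove_table_sync_mm(struct mm_struct *mm)
        {
                if (IS_ENABLED(CONFIG_ARCH_UPDATES_MM_CPUMASK)) {
                        /* arch keeps mm_cpumask() up to date: IPI only the MM's CPUs */
                        on_each_cpu_mask(mm_cpumask(mm), tlb_remove_table_smp_sync,
                                         NULL, true);
                } else {
                        /* cpumask not maintained: fall back to IPI'ing every CPU */
                        smp_call_function(tlb_remove_table_smp_sync, NULL, 1);
                }
        }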

>> BTW, it would also be nice to have _some_ kind of data behind this
>> patch.
>>
>> Fewer IPIs are better I guess, but it would still be nice if you
>> could say:
>>
>>  Before this patch, /proc/interrupts showed 123 IPIs/hour for an
>>  isolated CPU.  After the approach here, it was 0.
>>
>> ... or something.
> 
> This is part of an ongoing effort to remove IPIs and this one was found
> via code inspection.

OK, so it should be something more like:

This was found via code inspection, but fixing it isn't very
important so we didn't bother to test it any more than just
making sure the thing still boots when it is applied.

Does that cover it?



Re: [PATCH v2 2/2] mm/mmu_gather: send tlb_remove_table_smp_sync IPI only to MM CPUs

2023-06-26 Thread ypodemsk
On Thu, 2023-06-22 at 06:37 -0700, Dave Hansen wrote:
> On 6/22/23 06:14, ypode...@redhat.com wrote:
> > I will send a new version with the local variable as you suggested
> > soon.
> > As for the config name, what about CONFIG_ARCH_HAS_MM_CPUMASK?
> 
> The confusing part about that name is that mm_cpumask() and
> mm->cpu_bitmap[] are defined unconditionally.  So, they're *around*
> unconditionally but just aren't updated.
> 
I think you're right about the config name.
How about CONFIG_ARCH_USE_MM_CPUMASK?
This has the right semantics, as these archs use the cpumask field of
the mm struct.

> BTW, it would also be nice to have _some_ kind of data behind this
> patch.
> 
> Fewer IPIs are better I guess, but it would still be nice if you
> could say:
> 
>   Before this patch, /proc/interrupts showed 123 IPIs/hour for an
>   isolated CPU.  After the approach here, it was 0.
> 
> ... or something.

This is part of an ongoing effort to remove IPIs and this one was found
via code inspection.




Re: [PATCH] powerpc/iommu: TCEs are incorrectly manipulated with DLPAR add/remove of memory

2023-06-26 Thread Gaurav Batra

Thanks a lot


On 6/25/23 11:54 PM, Michael Ellerman wrote:

Gaurav Batra  writes:

Hello Michael,

Did you get a chance to look into this patch? I don't mean to rush you.
Just wondering if there is anything I can do to help move the patch
Upstream.

I skimmed it and decided it wasn't a critical bug fix, and hoped someone
else would review it - silly me :D

But the patch looks simple enough, and the explanation is very good so I
think it looks good to merge.

I'll apply it for v6.5.

cheers


On 6/13/23 12:17 PM, Gaurav Batra wrote:

Hello Michael,

I found this bug while going through the code. This bug is exposed when
the DDW is smaller than the max memory of the LPAR. This will result in
creating a DDW which has dynamically mapped TCEs (no direct mapping).

I would like to stress that this bug is exposed only in the upstream
kernel. Current kernel levels in distros are not exposed to this since
they don't have the concept of "dynamically mapped" DDW.

I didn't have access to any of the P10 boxes with a large amount of
memory to re-create the scenario. On P10 we have 2MB TCEs, which
results in a DDW large enough to cover the max memory I could have for
the LPAR. As a result, the IO Bus Addresses generated were always
within DDW limits and no H_PARAMETER was returned by the HCALL.

So, I hacked the kernel to force the use of 64K TCEs. This resulted in
DDW smaller than max memory.

When I tried to DLPAR ADD memory, it failed with error code of -4
(H_PARAMETER) from HCALL (H_PUT_TCE/H_PUT_TCE_INDIRECT), when
iommu_mem_notifier() invoked tce_setrange_multi_pSeriesLP().

I didn't test the DLPAR REMOVE path, to verify whether incorrect TCEs
are removed by tce_clearrange_multi_pSeriesLP(), since I would need to
hack the kernel to force dynamically added TCEs to the high IO Bus
Addresses. But the concept is the same.

Thanks,

Gaurav

On 6/13/23 12:16 PM, Gaurav Batra wrote:

When memory is dynamically added/removed, iommu_mem_notifier() is
invoked. This routine traverses through all the DMA windows (DDW only,
not default windows) to add/remove "direct" TCE mappings. The routines
for this purpose are tce_setrange_multi_pSeriesLP() and
tce_clearrange_multi_pSeriesLP().

Both these routines are designed for Direct mapped DMA windows only.

The issue is that there could be some DMA windows in the list which
are not
"direct" mapped. Calling these routines will either,

1) remove some dynamically mapped TCEs, Or
2) try to add TCEs which are out of bounds and HCALL returns H_PARAMETER

Here are the side effects when these routines are incorrectly invoked
for "dynamically" mapped DMA windows.

tce_setrange_multi_pSeriesLP()

This adds direct mapped TCEs. Now, this could invoke the HCALL to add
TCEs with an out-of-bounds range. In this scenario, the HCALL will
return H_PARAMETER and DLPAR ADD of memory will fail.

tce_clearrange_multi_pSeriesLP()

This will remove a range of TCEs. The TCE range that is calculated,
depending on the memory range being added, could in fact be mapping some
other memory address (for the dynamic DMA window scenario). This will
wipe out those TCEs.

The solution is for iommu_mem_notifier() to only invoke these routines
for "direct" mapped DMA windows.

Signed-off-by: Gaurav Batra 
Reviewed-by: Brian King 
---
   arch/powerpc/platforms/pseries/iommu.c | 17 +
   1 file changed, 13 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/platforms/pseries/iommu.c
b/arch/powerpc/platforms/pseries/iommu.c
index 918f511837db..24dd61636400 100644
--- a/arch/powerpc/platforms/pseries/iommu.c
+++ b/arch/powerpc/platforms/pseries/iommu.c
@@ -363,6 +363,7 @@ struct dynamic_dma_window_prop {
   struct dma_win {
   struct device_node *device;
   const struct dynamic_dma_window_prop *prop;
+    bool    direct;
   struct list_head list;
   };

@@ -1409,6 +1410,8 @@ static bool enable_ddw(struct pci_dev *dev,
struct device_node *pdn)
   goto out_del_prop;

   if (direct_mapping) {
+    window->direct = true;
+
   /* DDW maps the whole partition, so enable direct DMA
mapping */
   ret = walk_system_ram_range(0, memblock_end_of_DRAM() >>
PAGE_SHIFT,
   win64->value,
tce_setrange_multi_pSeriesLP_walk);
@@ -1425,6 +1428,8 @@ static bool enable_ddw(struct pci_dev *dev,
struct device_node *pdn)
   int i;
   unsigned long start = 0, end = 0;

+    window->direct = false;
+
   for (i = 0; i < ARRAY_SIZE(pci->phb->mem_resources); i++) {
   const unsigned long mask = IORESOURCE_MEM_64 |
IORESOURCE_MEM;

@@ -1587,8 +1592,10 @@ static int iommu_mem_notifier(struct
notifier_block *nb, unsigned long action,
   case MEM_GOING_ONLINE:
   spin_lock(&dma_win_list_lock);
   list_for_each_entry(window, &dma_win_list, list) {
-    ret |= tce_setrange_multi_pSeriesLP(arg->start_pfn,
-    arg->nr_pages, window->prop);
+    if (window->direct) {
+    ret |= 

Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-26 Thread Mark Rutland
On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> >> > From: "Mike Rapoport (IBM)" 
> >> >
> >> > module_alloc() is used everywhere as a mean to allocate memory for code.
> >> >
> >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> >> > and puts the burden of code allocation to the modules code.
> >> >
> >> > Several architectures override module_alloc() because of various
> >> > constraints where the executable memory can be located and this causes
> >> > additional obstacles for improvements of code allocation.
> >> >
> >> > Start splitting code allocation from modules by introducing
> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() APIs.
> >> >
> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> >> > module_alloc() and execmem_free() and jit_free() are replacements of
> >> > module_memfree() to allow updating all call sites to use the new APIs.
> >> >
> >> > The intention semantics for new allocation APIs:
> >> >
> >> > * execmem_text_alloc() should be used to allocate memory that must reside
> >> >   close to the kernel image, like loadable kernel modules and generated
> >> >   code that is restricted by relative addressing.
> >> >
> >> > * jit_text_alloc() should be used to allocate memory for generated code
> >> >   when there are no restrictions for the code placement. For
> >> >   architectures that require that any code is within certain distance
> >> >   from the kernel image, jit_text_alloc() will be essentially aliased to
> >> >   execmem_text_alloc().
> >> >
> >> 
> >> Is there anything in this series to help users do the appropriate
> >> synchronization when the actually populate the allocated memory with
> >> code?  See here, for example:
> >
> > This series only factors out the executable allocations from modules and
> > puts them in a central place.
> > Anything else would go on top after this lands.
> 
> Hmm.
> 
> On the one hand, there's nothing wrong with factoring out common code. On the
> other hand, this is probably the right time to at least start thinking about
> synchronization, at least to the extent that it might make us want to change
> this API.  (I'm not at all saying that this series should require changes --
> I'm just saying that this is a good time to think about how this should
> work.)
> 
> The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> look like the one think in the Linux ecosystem that actually intelligently
> and efficiently maps new text into an address space: mmap().
> 
> On x86, you can mmap() an existing file full of executable code PROT_EXEC and
> jump to it with minimal synchronization (just the standard implicit ordering
> in the kernel that populates the pages before setting up the PTEs and
> whatever user synchronization is needed to avoid jumping into the mapping
> before mmap() finishes).  It works across CPUs, and the only possible way
> userspace can screw it up (for a read-only mapping of read-only text, anyway)
> is to jump to the mapping too early, in which case userspace gets a page
> fault.  Incoherence is impossible, and no one needs to "serialize" (in the
> SDM sense).
> 
> I think the same sequence (from userspace's perspective) works on other
> architectures, too, although I think more cache management is needed on the
> kernel's end.  As far as I know, no Linux SMP architecture needs an IPI to
> map executable text into usermode, but I could easily be wrong.  (IIRC RISC-V
> has very developer-unfriendly icache management, but I don't remember the
> details.)

That's my understanding too, with a couple of details:

1) After the copy we perform and complete all the data + instruction cache
   maintenance *before* marking the mapping as executable.

2) Even *after* the mapping is marked executable, a thread could take a
   spurious fault on an instruction fetch for the new instructions. One way to
   think about this is that the CPU attempted to speculate the instructions
   earlier, saw that the mapping was faulting, and placed a "generate a fault
   here" operation into its pipeline to generate that later.

   The CPU pipeline/OoO-engine/whatever is effectively a transient cache for
   operations in-flight which is only ever "invalidated" by a
   context-synchronization-event (akin to an x86 serializing effect).

    We're only guaranteed to have a new instruction fetch (from the I-cache into
    the CPU pipeline) after the next context synchronization event (akin to an x86
    serializing effect), and luckily our exception entry/exit is architecturally
    guaranteed to provide that (unless we explicitly opt out via a control bit).

I know we're a bit lax with that 
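
(For reference, a minimal sketch of the ordering in (1) and (2) above,
using generic kernel helpers; illustrative only -- the function name is
made up and dst is assumed to be a page-aligned, writable but not yet
executable mapping:)

        static void *jit_install_insns(void *dst, const void *src, size_t len)
        {
                /* (1) write the instructions, then D-cache/I-cache maintenance ... */
                memcpy(dst, src, len);
                flush_icache_range((unsigned long)dst, (unsigned long)dst + len);

                /* ... and only then make the mapping executable */
                set_memory_x((unsigned long)dst, DIV_ROUND_UP(len, PAGE_SIZE));

                /*
                 * (2) other CPUs are only guaranteed to fetch the new
                 * instructions after their next context synchronization
                 * event (e.g. exception entry/exit).
                 */
                return dst;
        }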

Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-26 Thread Mark Rutland
On Sun, Jun 25, 2023 at 07:14:17PM +0300, Mike Rapoport wrote:
> On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> > 
> > On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > >> > From: "Mike Rapoport (IBM)" 
> > >> >
> > >> > module_alloc() is used everywhere as a mean to allocate memory for 
> > >> > code.
> > >> >
> > >> > Beside being semantically wrong, this unnecessarily ties all subsystems
> > >> > that need to allocate code, such as ftrace, kprobes and BPF to modules
> > >> > and puts the burden of code allocation to the modules code.
> > >> >
> > >> > Several architectures override module_alloc() because of various
> > >> > constraints where the executable memory can be located and this causes
> > >> > additional obstacles for improvements of code allocation.
> > >> >
> > >> > Start splitting code allocation from modules by introducing
> > >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), jit_free() 
> > >> > APIs.
> > >> >
> > >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers for
> > >> > module_alloc() and execmem_free() and jit_free() are replacements of
> > >> > module_memfree() to allow updating all call sites to use the new APIs.
> > >> >
> > >> > The intention semantics for new allocation APIs:
> > >> >
> > >> > * execmem_text_alloc() should be used to allocate memory that must 
> > >> > reside
> > >> >   close to the kernel image, like loadable kernel modules and generated
> > >> >   code that is restricted by relative addressing.
> > >> >
> > >> > * jit_text_alloc() should be used to allocate memory for generated code
> > >> >   when there are no restrictions for the code placement. For
> > >> >   architectures that require that any code is within certain distance
> > >> >   from the kernel image, jit_text_alloc() will be essentially aliased 
> > >> > to
> > >> >   execmem_text_alloc().
> > >> >
> > >> 
> > >> Is there anything in this series to help users do the appropriate
> > >> synchronization when the actually populate the allocated memory with
> > >> code?  See here, for example:
> > >
> > > This series only factors out the executable allocations from modules and
> > > puts them in a central place.
> > > Anything else would go on top after this lands.
> > 
> > Hmm.
> > 
> > On the one hand, there's nothing wrong with factoring out common code. On
> > the other hand, this is probably the right time to at least start
> > thinking about synchronization, at least to the extent that it might make
> > us want to change this API.  (I'm not at all saying that this series
> > should require changes -- I'm just saying that this is a good time to
> > think about how this should work.)
> > 
> > The current APIs, *and* the proposed jit_text_alloc() API, don't actually
> > look like the one think in the Linux ecosystem that actually
> > intelligently and efficiently maps new text into an address space:
> > mmap().
> > 
> > On x86, you can mmap() an existing file full of executable code PROT_EXEC
> > and jump to it with minimal synchronization (just the standard implicit
> > ordering in the kernel that populates the pages before setting up the
> > PTEs and whatever user synchronization is needed to avoid jumping into
> > the mapping before mmap() finishes).  It works across CPUs, and the only
> > possible way userspace can screw it up (for a read-only mapping of
> > read-only text, anyway) is to jump to the mapping too early, in which
> > case userspace gets a page fault.  Incoherence is impossible, and no one
> > needs to "serialize" (in the SDM sense).
> > 
> > I think the same sequence (from userspace's perspective) works on other
> > architectures, too, although I think more cache management is needed on
> > the kernel's end.  As far as I know, no Linux SMP architecture needs an
> > IPI to map executable text into usermode, but I could easily be wrong.
> > (IIRC RISC-V has very developer-unfriendly icache management, but I don't
> > remember the details.)
> > 
> > Of course, using ptrace or any other FOLL_FORCE to modify text on x86 is
> > rather fraught, and I bet many things do it wrong when userspace is
> > multithreaded.  But not in production because it's mostly not used in
> > production.)
> > 
> > But jit_text_alloc() can't do this, because the order of operations
> > doesn't match.  With jit_text_alloc(), the executable mapping shows up
> > before the text is populated, so there is no atomic change from not-there
> > to populated-and-executable.  Which means that there is an opportunity
> > for CPUs, speculatively or otherwise, to start filling various caches
> > with intermediate states of the text, which means that various
> > architectures (even x86!) may need serialization.
> > 
> > For eBPF- and module- like use cases, where JITting/code gen is quite
> > coarse-grained, perhaps something vaguely 

Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-26 Thread Mark Rutland
On Mon, Jun 26, 2023 at 11:54:02AM +0200, Puranjay Mohan wrote:
> On Mon, Jun 26, 2023 at 8:13 AM Song Liu  wrote:
> >
> > On Sun, Jun 25, 2023 at 11:07 AM Kent Overstreet
> >  wrote:
> > >
> > > On Sun, Jun 25, 2023 at 08:42:57PM +0300, Mike Rapoport wrote:
> > > > On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:
> > > > >
> > > > >
> > > > > On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> > > > > > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> > > > > >>
> > > > > >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > > > > >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > > > > >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > > > > >> >> > From: "Mike Rapoport (IBM)" 
> > > > > >> >> >
> > > > > >> >> > module_alloc() is used everywhere as a mean to allocate 
> > > > > >> >> > memory for code.
> > > > > >> >> >
> > > > > >> >> > Beside being semantically wrong, this unnecessarily ties all 
> > > > > >> >> > subsystems
> > > > > >> >> > that need to allocate code, such as ftrace, kprobes and BPF 
> > > > > >> >> > to modules
> > > > > >> >> > and puts the burden of code allocation to the modules code.
> > > > > >> >> >
> > > > > >> >> > Several architectures override module_alloc() because of 
> > > > > >> >> > various
> > > > > >> >> > constraints where the executable memory can be located and 
> > > > > >> >> > this causes
> > > > > >> >> > additional obstacles for improvements of code allocation.
> > > > > >> >> >
> > > > > >> >> > Start splitting code allocation from modules by introducing
> > > > > >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), 
> > > > > >> >> > jit_free() APIs.
> > > > > >> >> >
> > > > > >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are 
> > > > > >> >> > wrappers for
> > > > > >> >> > module_alloc() and execmem_free() and jit_free() are 
> > > > > >> >> > replacements of
> > > > > >> >> > module_memfree() to allow updating all call sites to use the 
> > > > > >> >> > new APIs.
> > > > > >> >> >
> > > > > >> >> > The intention semantics for new allocation APIs:
> > > > > >> >> >
> > > > > >> >> > * execmem_text_alloc() should be used to allocate memory that 
> > > > > >> >> > must reside
> > > > > >> >> >   close to the kernel image, like loadable kernel modules and 
> > > > > >> >> > generated
> > > > > >> >> >   code that is restricted by relative addressing.
> > > > > >> >> >
> > > > > >> >> > * jit_text_alloc() should be used to allocate memory for 
> > > > > >> >> > generated code
> > > > > >> >> >   when there are no restrictions for the code placement. For
> > > > > >> >> >   architectures that require that any code is within certain 
> > > > > >> >> > distance
> > > > > >> >> >   from the kernel image, jit_text_alloc() will be essentially 
> > > > > >> >> > aliased to
> > > > > >> >> >   execmem_text_alloc().
> > > > > >> >> >
> > > > > >> >>
> > > > > >> >> Is there anything in this series to help users do the 
> > > > > >> >> appropriate
> > > > > >> >> synchronization when the actually populate the allocated memory 
> > > > > >> >> with
> > > > > >> >> code?  See here, for example:
> > > > > >> >
> > > > > >> > This series only factors out the executable allocations from 
> > > > > >> > modules and
> > > > > >> > puts them in a central place.
> > > > > >> > Anything else would go on top after this lands.
> > > > > >>
> > > > > >> Hmm.
> > > > > >>
> > > > > >> On the one hand, there's nothing wrong with factoring out common 
> > > > > >> code. On
> > > > > >> the other hand, this is probably the right time to at least start
> > > > > >> thinking about synchronization, at least to the extent that it 
> > > > > >> might make
> > > > > >> us want to change this API.  (I'm not at all saying that this 
> > > > > >> series
> > > > > >> should require changes -- I'm just saying that this is a good time 
> > > > > >> to
> > > > > >> think about how this should work.)
> > > > > >>
> > > > > >> The current APIs, *and* the proposed jit_text_alloc() API, don't 
> > > > > >> actually
> > > > > >> look like the one think in the Linux ecosystem that actually
> > > > > >> intelligently and efficiently maps new text into an address space:
> > > > > >> mmap().
> > > > > >>
> > > > > >> On x86, you can mmap() an existing file full of executable code 
> > > > > >> PROT_EXEC
> > > > > >> and jump to it with minimal synchronization (just the standard 
> > > > > >> implicit
> > > > > >> ordering in the kernel that populates the pages before setting up 
> > > > > >> the
> > > > > >> PTEs and whatever user synchronization is needed to avoid jumping 
> > > > > >> into
> > > > > >> the mapping before mmap() finishes).  It works across CPUs, and 
> > > > > >> the only
> > > > > >> possible way userspace can screw it up (for a read-only mapping of
> > > > > >> read-only text, anyway) is to jump to the mapping too early, in 
> > > > > >> which
> > > > > >> case userspace 

Re: linux-next: build failure after merge of the crypto tree

2023-06-26 Thread Herbert Xu
On Mon, Jun 26, 2023 at 12:39:46PM +1000, Stephen Rothwell wrote:
> Hi all,
> 
> After merging the crypto tree, today's linux-next build (powerpc
> ppc64_defconfig) failed like this:
> 
> ld: warning: discarding dynamic section .glink
> ld: warning: discarding dynamic section .plt
> ld: linkage table error against `sm2_compute_z_digest'
> ld: stubs don't match calculated size
> ld: can not build stubs: bad value
> ld: crypto/asymmetric_keys/x509_public_key.o: in function 
> `x509_get_sig_params':
> x509_public_key.c:(.text+0x474): undefined reference to `sm2_compute_z_digest'
> 
> Possibly caused by commit
> 
>   e5221fa6a355 ("KEYS: asymmetric: Move sm2 code into x509_public_key")
> 
> This looks like it may be a compiler bug?  Maybe the deep ternary
> expressions may be contributing to that? (cc'ing the ppc guys in case
> they have any ideas.)
> 
> I have reverted that commit (and the following one) for today.

Thanks Stephen.  I've just pushed out a fix for this.
-- 
Email: Herbert Xu 
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt


[RFC PATCH 5/5] powerpc/book3s64/memhotplug: Enable memmap on memory for radix

2023-06-26 Thread Aneesh Kumar K.V
Radix vmemmap mapping can map things correctly at the PMD level or PTE
level based on different device boundary checks. We also use the
altmap.reserve feature to align things correctly at pageblock granularity.
We can end up losing some pages in memory with this. For example: with a
256MB memory block size, we require 4 pages to map the vmemmap pages; in
order to align things correctly we end up adding a reserve of 28 pages,
i.e. for every 4096 pages, 28 pages get reserved.
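
To spell out the arithmetic (assuming the usual 64K base pages and a
64-byte struct page): a 256MB block is 4096 pages, whose vmemmap needs
4096 * 64 bytes = 256KB, i.e. 4 pages; aligning the start of usable
memory up to the next 2MB pageblock (32 pages) then costs 32 - 4 = 28
reserved pages.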

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/mm/book3s64/radix_pgtable.c  | 28 +++
 .../platforms/pseries/hotplug-memory.c|  4 ++-
 3 files changed, 32 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 6bd9ca6f2448..1b0954854a12 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -158,6 +158,7 @@ config PPC
select ARCH_HAS_UBSAN_SANITIZE_ALL
select ARCH_HAVE_NMI_SAFE_CMPXCHG
select ARCH_KEEP_MEMBLOCK
+   select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE if PPC_RADIX_MMU
select ARCH_MIGHT_HAVE_PC_PARPORT
select ARCH_MIGHT_HAVE_PC_SERIO
select ARCH_OPTIONAL_KERNEL_RWX if ARCH_HAS_STRICT_KERNEL_RWX
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index afbae37612ad..e0e292b87b4b 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -1692,3 +1692,31 @@ int pmd_free_pte_page(pmd_t *pmd, unsigned long addr)
 
return 1;
 }
+
+/*
+ * mm/memory_hotplug.c:mhp_supports_memmap_on_memory goes into details
+ * some of the restrictions. We don't check for PMD_SIZE because our
+ * vmemmap allocation code can fallback correctly. The pageblock
+ * alignment requirement is met using altmap->reserve blocks.
+ */
+bool mhp_supports_memmap_on_memory(unsigned long size)
+{
+   if (!radix_enabled())
+   return false;
+   /*
+* The pageblock alignment requirement is met by using
+* reserve blocks in altmap.
+*/
+   return size == memory_block_size_bytes();
+}
+
+unsigned long memory_block_align_base(struct resource *res)
+{
+   unsigned long base_pfn = PHYS_PFN(res->start);
+   unsigned long align, size = resource_size(res);
+   unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
+   unsigned long vmemmap_size = (nr_vmemmap_pages * sizeof(struct 
page))/PAGE_SIZE;
+
+   align = pageblock_align(base_pfn + vmemmap_size) - (base_pfn + 
vmemmap_size);
+   return align;
+}
diff --git a/arch/powerpc/platforms/pseries/hotplug-memory.c 
b/arch/powerpc/platforms/pseries/hotplug-memory.c
index 9c62c2c3b3d0..326db26d773e 100644
--- a/arch/powerpc/platforms/pseries/hotplug-memory.c
+++ b/arch/powerpc/platforms/pseries/hotplug-memory.c
@@ -617,6 +617,7 @@ static int dlpar_memory_remove_by_ic(u32 lmbs_to_remove, 
u32 drc_index)
 
 static int dlpar_add_lmb(struct drmem_lmb *lmb)
 {
+   mhp_t mhp_flags = MHP_NONE;
unsigned long block_sz;
int nid, rc;
 
@@ -637,7 +638,8 @@ static int dlpar_add_lmb(struct drmem_lmb *lmb)
nid = first_online_node;
 
/* Add the memory */
-   rc = __add_memory(nid, lmb->base_addr, block_sz, MHP_NONE);
+   mhp_flags |= get_memmap_on_memory_flags();
+   rc = __add_memory(nid, lmb->base_addr, block_sz, mhp_flags);
if (rc) {
invalidate_lmb_associativity_index(lmb);
return rc;
-- 
2.41.0



[RFC PATCH 4/5] mm/hotplug: Simplify ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE kconfig

2023-06-26 Thread Aneesh Kumar K.V
Instead of adding a menu entry in each supported architecture, add an
mm/Kconfig variable and select it from the supported architectures.

No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/arm64/Kconfig | 4 +---
 arch/x86/Kconfig   | 4 +---
 mm/Kconfig | 3 +++
 3 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/arch/arm64/Kconfig b/arch/arm64/Kconfig
index 343e1e1cae10..20e909dac7ab 100644
--- a/arch/arm64/Kconfig
+++ b/arch/arm64/Kconfig
@@ -78,6 +78,7 @@ config ARM64
select ARCH_INLINE_SPIN_UNLOCK_IRQ if !PREEMPTION
select ARCH_INLINE_SPIN_UNLOCK_IRQRESTORE if !PREEMPTION
select ARCH_KEEP_MEMBLOCK
+   select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
select ARCH_USE_CMPXCHG_LOCKREF
select ARCH_USE_GNU_PROPERTY
select ARCH_USE_MEMTEST
@@ -338,9 +339,6 @@ config GENERIC_CSUM
 config GENERIC_CALIBRATE_DELAY
def_bool y
 
-config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
-   def_bool y
-
 config SMP
def_bool y
 
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index eb383960b6ee..c77c881e35da 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -101,6 +101,7 @@ config X86
select ARCH_HAS_DEBUG_WX
select ARCH_HAS_ZONE_DMA_SET if EXPERT
select ARCH_HAVE_NMI_SAFE_CMPXCHG
+   select ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
select ARCH_MIGHT_HAVE_ACPI_PDC if ACPI
select ARCH_MIGHT_HAVE_PC_PARPORT
select ARCH_MIGHT_HAVE_PC_SERIO
@@ -2656,9 +2657,6 @@ config ARCH_HAS_ADD_PAGES
def_bool y
depends on ARCH_ENABLE_MEMORY_HOTPLUG
 
-config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
-   def_bool y
-
 menu "Power management and ACPI options"
 
 config ARCH_HIBERNATION_HEADER
diff --git a/mm/Kconfig b/mm/Kconfig
index 7b388c10baab..4e5862c001e4 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -544,6 +544,9 @@ config MHP_MEMMAP_ON_MEMORY
depends on MEMORY_HOTPLUG && SPARSEMEM_VMEMMAP
depends on ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
 
+config ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE
+   bool
+
 endif # MEMORY_HOTPLUG
 
 # Heavily threaded applications may benefit from splitting the mm-wide
-- 
2.41.0



[RFC PATCH 3/5] mm/hotplug: Simplify the handling of MHP_MEMMAP_ON_MEMORY flag

2023-06-26 Thread Aneesh Kumar K.V
Instead of checking whether the memmap on memory feature is enabled inside
the functions that check alignment, use the kernel parameter to control the
memory hotplug flags. The generic kernel now enables the memmap on memory
feature if the hotplug flags request it.

The ACPI code can now pass the flag unconditionally because the kernel will
fall back to not using the feature if the alignment rules are not met.
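
With this change the behaviour is driven entirely by the existing module
parameter; for example (illustrative only, both the parameter name and the
sysfs path predate this patch):

    # enable the feature at boot
    memory_hotplug.memmap_on_memory=1

    # the parameter is 0444, so at runtime it can only be inspected
    cat /sys/module/memory_hotplug/parameters/memmap_on_memory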

Signed-off-by: Aneesh Kumar K.V 
---
 drivers/acpi/acpi_memhotplug.c |  3 +--
 include/linux/memory_hotplug.h | 14 ++
 mm/memory_hotplug.c| 19 ---
 3 files changed, 19 insertions(+), 17 deletions(-)

diff --git a/drivers/acpi/acpi_memhotplug.c b/drivers/acpi/acpi_memhotplug.c
index 24f662d8bd39..4d0096fc4cc2 100644
--- a/drivers/acpi/acpi_memhotplug.c
+++ b/drivers/acpi/acpi_memhotplug.c
@@ -211,8 +211,7 @@ static int acpi_memory_enable_device(struct 
acpi_memory_device *mem_device)
if (!info->length)
continue;
 
-   if (mhp_supports_memmap_on_memory(info->length))
-   mhp_flags |= MHP_MEMMAP_ON_MEMORY;
+   mhp_flags |= get_memmap_on_memory_flags();
result = __add_memory(mgid, info->start_addr, info->length,
  mhp_flags);
 
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 2387391ee93a..add3e7829c80 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -362,4 +362,18 @@ bool mhp_supports_memmap_on_memory(unsigned long size);
 bool __mhp_supports_memmap_on_memory(unsigned long size);
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
+#ifdef CONFIG_MHP_MEMMAP_ON_MEMORY
+extern bool memmap_on_memory;
+static inline unsigned long get_memmap_on_memory_flags(void)
+{
+   if (memmap_on_memory)
+   return MHP_MEMMAP_ON_MEMORY;
+   return 0;
+}
+#else
+static inline unsigned long get_memmap_on_memory_flags(void)
+{
+   return 0;
+}
+#endif
 #endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 7cb112fb4996..9cfa6fa31df5 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -46,19 +46,9 @@
 /*
  * memory_hotplug.memmap_on_memory parameter
  */
-static bool memmap_on_memory __ro_after_init;
+bool memmap_on_memory __ro_after_init;
 module_param(memmap_on_memory, bool, 0444);
 MODULE_PARM_DESC(memmap_on_memory, "Enable memmap on memory for memory 
hotplug");
-
-static inline bool mhp_memmap_on_memory(void)
-{
-   return memmap_on_memory;
-}
-#else
-static inline bool mhp_memmap_on_memory(void)
-{
-   return false;
-}
 #endif
 
 enum {
@@ -1317,10 +1307,9 @@ bool __mhp_supports_memmap_on_memory(unsigned long size)
 *   altmap as an alternative source of memory, and we do not 
exactly
 *   populate a single PMD.
 */
-   return mhp_memmap_on_memory() &&
-  size == memory_block_size_bytes() &&
-  IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
-  IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
+   return size == memory_block_size_bytes() &&
+   IS_ALIGNED(vmemmap_size, PMD_SIZE) &&
+   IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
 }
 
 bool __weak mhp_supports_memmap_on_memory(unsigned long size)
-- 
2.41.0



[RFC PATCH 2/5] mm/hotplug: Allow architecture override for memmap on memory feature

2023-06-26 Thread Aneesh Kumar K.V
Some architectures, like ppc64, want to enable this feature only with radix
translation, and their vmemmap mappings have different alignment
requirements. Add an override for mhp_supports_memmap_on_memory() and also
use the altmap.reserve feature to adjust the pageblock alignment
requirement.

The patch also falls back to allocating the memmap outside the memory block
if the alignment rules for memmap on memory allocation are not met. This
allows the feature to be used more widely, allocating as much of the memmap
as possible within the memory block being added.

A follow-up patch enabling memmap on memory for ppc64 will use this.
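
The fallback amounts to roughly the following shape in add_memory_resource();
this is an illustrative sketch only (the posted hunk is truncated below),
reusing the params/mhp_altmap variable names visible in the diff:

	if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
		if (mhp_supports_memmap_on_memory(size))
			params.altmap = &mhp_altmap;	/* self-hosted memmap */
		/* otherwise: no altmap, the memmap is allocated as usual */
	}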

Signed-off-by: Aneesh Kumar K.V 
---
 arch/arm64/mm/mmu.c|  5 +
 arch/x86/mm/init_64.c  |  6 ++
 include/linux/memory_hotplug.h |  3 ++-
 mm/memory_hotplug.c| 36 --
 4 files changed, 39 insertions(+), 11 deletions(-)

diff --git a/arch/arm64/mm/mmu.c b/arch/arm64/mm/mmu.c
index af6bc8403ee4..a5165897ea58 100644
--- a/arch/arm64/mm/mmu.c
+++ b/arch/arm64/mm/mmu.c
@@ -1335,6 +1335,11 @@ void arch_remove_memory(u64 start, u64 size, struct 
vmem_altmap *altmap)
__remove_pgd_mapping(swapper_pg_dir, __phys_to_virt(start), size);
 }
 
+bool mhp_supports_memmap_on_memory(unsigned long size)
+{
+   return __mhp_supports_memmap_on_memory(size);
+}
+
 /*
  * This memory hotplug notifier helps prevent boot memory from being
  * inadvertently removed as it blocks pfn range offlining process in
diff --git a/arch/x86/mm/init_64.c b/arch/x86/mm/init_64.c
index a190aae8ceaf..b318d26a70d4 100644
--- a/arch/x86/mm/init_64.c
+++ b/arch/x86/mm/init_64.c
@@ -1264,6 +1264,12 @@ void __ref arch_remove_memory(u64 start, u64 size, 
struct vmem_altmap *altmap)
__remove_pages(start_pfn, nr_pages, altmap);
kernel_physical_mapping_remove(start, start + size);
 }
+
+bool mhp_supports_memmap_on_memory(unsigned long size)
+{
+   return __mhp_supports_memmap_on_memory(size);
+}
+
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 static struct kcore_list kcore_vsyscall;
diff --git a/include/linux/memory_hotplug.h b/include/linux/memory_hotplug.h
index 9fcbf5706595..2387391ee93a 100644
--- a/include/linux/memory_hotplug.h
+++ b/include/linux/memory_hotplug.h
@@ -358,7 +358,8 @@ extern struct zone *zone_for_pfn_range(int online_type, int 
nid,
 extern int arch_create_linear_mapping(int nid, u64 start, u64 size,
  struct mhp_params *params);
 void arch_remove_linear_mapping(u64 start, u64 size);
-extern bool mhp_supports_memmap_on_memory(unsigned long size);
+bool mhp_supports_memmap_on_memory(unsigned long size);
+bool __mhp_supports_memmap_on_memory(unsigned long size);
 #endif /* CONFIG_MEMORY_HOTPLUG */
 
 #endif /* __LINUX_MEMORY_HOTPLUG_H */
diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
index 88a9c4443fc0..7cb112fb4996 100644
--- a/mm/memory_hotplug.c
+++ b/mm/memory_hotplug.c
@@ -1284,7 +1284,8 @@ static int online_memory_block(struct memory_block *mem, 
void *arg)
return device_online(&mem->dev);
 }
 
-bool mhp_supports_memmap_on_memory(unsigned long size)
+/* Helper function for architecture to use. */
+bool __mhp_supports_memmap_on_memory(unsigned long size)
 {
unsigned long nr_vmemmap_pages = size / PAGE_SIZE;
unsigned long vmemmap_size = nr_vmemmap_pages * sizeof(struct page);
@@ -1322,6 +1323,20 @@ bool mhp_supports_memmap_on_memory(unsigned long size)
   IS_ALIGNED(remaining_size, (pageblock_nr_pages << PAGE_SHIFT));
 }
 
+bool __weak mhp_supports_memmap_on_memory(unsigned long size)
+{
+   return false;
+}
+
+/*
+ * Architectures may want to override the altmap reserve details based
+ * on the alignment requirement for vmemmap mapping.
+ */
+unsigned __weak long memory_block_align_base(struct resource *res)
+{
+   return 0;
+}
+
 /*
  * NOTE: The caller must call lock_device_hotplug() to serialize hotplug
  * and online/offline operations (triggered e.g. by sysfs).
@@ -1332,7 +1347,11 @@ int __ref add_memory_resource(int nid, struct resource 
*res, mhp_t mhp_flags)
 {
struct mhp_params params = { .pgprot = pgprot_mhp(PAGE_KERNEL) };
enum memblock_flags memblock_flags = MEMBLOCK_NONE;
-   struct vmem_altmap mhp_altmap = {};
+   struct vmem_altmap mhp_altmap = {
+   .base_pfn =  PHYS_PFN(res->start),
+   .end_pfn  =  PHYS_PFN(res->end),
+   .reserve  = memory_block_align_base(res),
+   };
struct memory_group *group = NULL;
u64 start, size;
bool new_node = false;
@@ -1376,13 +1395,11 @@ int __ref add_memory_resource(int nid, struct resource 
*res, mhp_t mhp_flags)
 * Self hosted memmap array
 */
if (mhp_flags & MHP_MEMMAP_ON_MEMORY) {
-   if (!mhp_supports_memmap_on_memory(size)) {
-   ret = -EINVAL;
-   goto error;
+   if (mhp_supports_memmap_on_memory(size)) {
+   

[RFC PATCH 1/5] mm/hotplug: Embed vmem_altmap details in memory block

2023-06-26 Thread Aneesh Kumar K.V
With memmap on memory, some architectures need more details about the
altmap, such as base_pfn, end_pfn, etc., to unmap vmemmap memory.

Embed the vmem_altmap data structure in the memory_block and use that
instead of nr_vmemmap_pages.

On memory unplug, if the kernel finds any memory block in the range
using vmem_altmap, it fails the unplug unless the request is for a
single memory block.

Signed-off-by: Aneesh Kumar K.V 
---
 drivers/base/memory.c| 28 ++-
 include/linux/memory.h   | 25 ++--
 include/linux/memremap.h | 18 +
 mm/memory_hotplug.c  | 42 +---
 4 files changed, 57 insertions(+), 56 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index b456ac213610..523cc1d37c81 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -106,6 +106,7 @@ static void memory_block_release(struct device *dev)
 {
struct memory_block *mem = to_memory_block(dev);
 
+   WARN(mem->altmap.alloc, "Altmap not fully unmapped");
kfree(mem);
 }
 
@@ -183,7 +184,7 @@ static int memory_block_online(struct memory_block *mem)
 {
unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
-   unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
+   unsigned long nr_vmemmap_pages = 0;
struct zone *zone;
int ret;
 
@@ -200,6 +201,9 @@ static int memory_block_online(struct memory_block *mem)
 * stage helps to keep accounting easier to follow - e.g vmemmaps
 * belong to the same zone as the memory they backed.
 */
+   if (mem->altmap.alloc)
+   nr_vmemmap_pages = mem->altmap.alloc + mem->altmap.reserve;
+
if (nr_vmemmap_pages) {
ret = mhp_init_memmap_on_memory(start_pfn, nr_vmemmap_pages, 
zone);
if (ret)
@@ -230,7 +234,7 @@ static int memory_block_offline(struct memory_block *mem)
 {
unsigned long start_pfn = section_nr_to_pfn(mem->start_section_nr);
unsigned long nr_pages = PAGES_PER_SECTION * sections_per_block;
-   unsigned long nr_vmemmap_pages = mem->nr_vmemmap_pages;
+   unsigned long nr_vmemmap_pages = 0;
int ret;
 
if (!mem->zone)
@@ -240,6 +244,9 @@ static int memory_block_offline(struct memory_block *mem)
 * Unaccount before offlining, such that unpopulated zone and kthreads
 * can properly be torn down in offline_pages().
 */
+   if (mem->altmap.alloc)
+   nr_vmemmap_pages = mem->altmap.alloc + mem->altmap.reserve;
+
if (nr_vmemmap_pages)
adjust_present_page_count(pfn_to_page(start_pfn), mem->group,
  -nr_vmemmap_pages);
@@ -726,7 +733,7 @@ void memory_block_add_nid(struct memory_block *mem, int nid,
 #endif
 
 static int add_memory_block(unsigned long block_id, unsigned long state,
-   unsigned long nr_vmemmap_pages,
+   struct vmem_altmap *altmap,
struct memory_group *group)
 {
struct memory_block *mem;
@@ -744,7 +751,10 @@ static int add_memory_block(unsigned long block_id, 
unsigned long state,
mem->start_section_nr = block_id * sections_per_block;
mem->state = state;
mem->nid = NUMA_NO_NODE;
-   mem->nr_vmemmap_pages = nr_vmemmap_pages;
+   if (altmap)
+   memcpy(&mem->altmap, altmap, sizeof(*altmap));
+   else
+   mem->altmap.alloc = 0;
INIT_LIST_HEAD(&mem->group_next);
 
 #ifndef CONFIG_NUMA
@@ -783,14 +793,14 @@ static int __init add_boot_memory_block(unsigned long 
base_section_nr)
if (section_count == 0)
return 0;
return add_memory_block(memory_block_id(base_section_nr),
-   MEM_ONLINE, 0,  NULL);
+   MEM_ONLINE, NULL,  NULL);
 }
 
 static int add_hotplug_memory_block(unsigned long block_id,
-   unsigned long nr_vmemmap_pages,
+   struct vmem_altmap *altmap,
struct memory_group *group)
 {
-   return add_memory_block(block_id, MEM_OFFLINE, nr_vmemmap_pages, group);
+   return add_memory_block(block_id, MEM_OFFLINE, altmap, group);
 }
 
 static void remove_memory_block(struct memory_block *memory)
@@ -818,7 +828,7 @@ static void remove_memory_block(struct memory_block *memory)
  * Called under device_hotplug_lock.
  */
 int create_memory_block_devices(unsigned long start, unsigned long size,
-   unsigned long vmemmap_pages,
+   struct vmem_altmap *altmap,
struct memory_group *group)
 {
const unsigned long start_block_id = pfn_to_block_id(PFN_DOWN(start));
@@ -832,7 +842,7 @@ int 

[RFC PATCH 0/5] Add support for memmap on memory feature on ppc64

2023-06-26 Thread Aneesh Kumar K.V
This patch series updates the memmap on memory feature to fall back to
allocating the memmap outside the memory block if the alignment rules
are not met. This makes the feature more useful on architectures like
ppc64, where the alignment rules are different with a 64K page size.

This patch series is dependent on dax vmemmap optimization series
posted here 
https://lore.kernel.org/linux-mm/20230616110826.344417-1-aneesh.ku...@linux.ibm.com


Aneesh Kumar K.V (5):
  mm/hotplug: Embed vmem_altmap details in memory block
  mm/hotplug: Allow architecture override for memmap on memory feature
  mm/hotplug: Simplify the handling of MHP_MEMMAP_ON_MEMORY flag
  mm/hotplug: Simplify ARCH_MHP_MEMMAP_ON_MEMORY_ENABLE kconfig
  powerpc/book3s64/memhotplug: Enable memmap on memory for radix

 arch/arm64/Kconfig|  4 +-
 arch/arm64/mm/mmu.c   |  5 +
 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/mm/book3s64/radix_pgtable.c  | 28 ++
 .../platforms/pseries/hotplug-memory.c|  4 +-
 arch/x86/Kconfig  |  4 +-
 arch/x86/mm/init_64.c |  6 ++
 drivers/acpi/acpi_memhotplug.c|  3 +-
 drivers/base/memory.c | 28 --
 include/linux/memory.h| 25 +++--
 include/linux/memory_hotplug.h| 17 +++-
 include/linux/memremap.h  | 18 +---
 mm/Kconfig|  3 +
 mm/memory_hotplug.c   | 95 +--
 14 files changed, 151 insertions(+), 90 deletions(-)

-- 
2.41.0



Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-26 Thread Puranjay Mohan
On Mon, Jun 26, 2023 at 8:13 AM Song Liu  wrote:
>
> On Sun, Jun 25, 2023 at 11:07 AM Kent Overstreet
>  wrote:
> >
> > On Sun, Jun 25, 2023 at 08:42:57PM +0300, Mike Rapoport wrote:
> > > On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:
> > > >
> > > >
> > > > On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> > > > > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> > > > >>
> > > > >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > > > >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > > > >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > > > >> >> > From: "Mike Rapoport (IBM)" 
> > > > >> >> >
> > > > >> >> > module_alloc() is used everywhere as a mean to allocate memory 
> > > > >> >> > for code.
> > > > >> >> >
> > > > >> >> > Beside being semantically wrong, this unnecessarily ties all 
> > > > >> >> > subsystems
> > > > >> >> > that need to allocate code, such as ftrace, kprobes and BPF to 
> > > > >> >> > modules
> > > > >> >> > and puts the burden of code allocation to the modules code.
> > > > >> >> >
> > > > >> >> > Several architectures override module_alloc() because of various
> > > > >> >> > constraints where the executable memory can be located and this 
> > > > >> >> > causes
> > > > >> >> > additional obstacles for improvements of code allocation.
> > > > >> >> >
> > > > >> >> > Start splitting code allocation from modules by introducing
> > > > >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), 
> > > > >> >> > jit_free() APIs.
> > > > >> >> >
> > > > >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are 
> > > > >> >> > wrappers for
> > > > >> >> > module_alloc() and execmem_free() and jit_free() are 
> > > > >> >> > replacements of
> > > > >> >> > module_memfree() to allow updating all call sites to use the 
> > > > >> >> > new APIs.
> > > > >> >> >
> > > > >> >> > The intention semantics for new allocation APIs:
> > > > >> >> >
> > > > >> >> > * execmem_text_alloc() should be used to allocate memory that 
> > > > >> >> > must reside
> > > > >> >> >   close to the kernel image, like loadable kernel modules and 
> > > > >> >> > generated
> > > > >> >> >   code that is restricted by relative addressing.
> > > > >> >> >
> > > > >> >> > * jit_text_alloc() should be used to allocate memory for 
> > > > >> >> > generated code
> > > > >> >> >   when there are no restrictions for the code placement. For
> > > > >> >> >   architectures that require that any code is within certain 
> > > > >> >> > distance
> > > > >> >> >   from the kernel image, jit_text_alloc() will be essentially 
> > > > >> >> > aliased to
> > > > >> >> >   execmem_text_alloc().
> > > > >> >> >
> > > > >> >>
> > > > >> >> Is there anything in this series to help users do the appropriate
> > > > >> >> synchronization when the actually populate the allocated memory 
> > > > >> >> with
> > > > >> >> code?  See here, for example:
> > > > >> >
> > > > >> > This series only factors out the executable allocations from 
> > > > >> > modules and
> > > > >> > puts them in a central place.
> > > > >> > Anything else would go on top after this lands.
> > > > >>
> > > > >> Hmm.
> > > > >>
> > > > >> On the one hand, there's nothing wrong with factoring out common 
> > > > >> code. On
> > > > >> the other hand, this is probably the right time to at least start
> > > > >> thinking about synchronization, at least to the extent that it might 
> > > > >> make
> > > > >> us want to change this API.  (I'm not at all saying that this series
> > > > >> should require changes -- I'm just saying that this is a good time to
> > > > >> think about how this should work.)
> > > > >>
> > > > >> The current APIs, *and* the proposed jit_text_alloc() API, don't 
> > > > >> actually
> > > > >> look like the one think in the Linux ecosystem that actually
> > > > >> intelligently and efficiently maps new text into an address space:
> > > > >> mmap().
> > > > >>
> > > > >> On x86, you can mmap() an existing file full of executable code 
> > > > >> PROT_EXEC
> > > > >> and jump to it with minimal synchronization (just the standard 
> > > > >> implicit
> > > > >> ordering in the kernel that populates the pages before setting up the
> > > > >> PTEs and whatever user synchronization is needed to avoid jumping 
> > > > >> into
> > > > >> the mapping before mmap() finishes).  It works across CPUs, and the 
> > > > >> only
> > > > >> possible way userspace can screw it up (for a read-only mapping of
> > > > >> read-only text, anyway) is to jump to the mapping too early, in which
> > > > >> case userspace gets a page fault.  Incoherence is impossible, and no 
> > > > >> one
> > > > >> needs to "serialize" (in the SDM sense).
> > > > >>
> > > > >> I think the same sequence (from userspace's perspective) works on 
> > > > >> other
> > > > >> architectures, too, although I think more cache management is needed 
> > > > >> on
> > > > >> the kernel's end.  As 

[PATCH] powernv/opal-prd: Silence memcpy() run-time false positive warnings

2023-06-26 Thread Mahesh Salgaonkar
opal_prd_msg_notifier() extracts the OPAL PRD message size from the message
header and uses it to allocate an opal_prd_msg_queue_item that is large
enough for the message to be copied. However, when running with
CONFIG_FORTIFY_SOURCE=y, it triggers the following run-time warning:

[ 6458.234352] memcpy: detected field-spanning write (size 32) of single field 
"&item->msg" at arch/powerpc/platforms/powernv/opal-prd.c:355 (size 4)
[ 6458.234390] WARNING: CPU: 9 PID: 660 at 
arch/powerpc/platforms/powernv/opal-prd.c:355 opal_prd_msg_notifier+0x174/0x188 
[opal_prd]
[...]
[ 6458.234709] NIP [c0080e0c0e6c] opal_prd_msg_notifier+0x174/0x188 
[opal_prd]
[ 6458.234723] LR [c0080e0c0e68] opal_prd_msg_notifier+0x170/0x188 
[opal_prd]
[ 6458.234736] Call Trace:
[ 6458.234742] [c002acb23c10] [c0080e0c0e68] 
opal_prd_msg_notifier+0x170/0x188 [opal_prd] (unreliable)
[ 6458.234759] [c002acb23ca0] [c019ccc0] 
notifier_call_chain+0xc0/0x1b0
[ 6458.234774] [c002acb23d00] [c019ceac] 
atomic_notifier_call_chain+0x2c/0x40
[ 6458.234788] [c002acb23d20] [c00d69b4] 
opal_message_notify+0xf4/0x2c0
[...]

Add a flexible array member to avoid the false-positive run-time warning.
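
For background, below is a small stand-alone sketch (not the kernel code) of
the same trick: the union lets the copy go through a flexible-array alias of
the header, so fortified memcpy() no longer sees a write spanning past the
fixed-size field. The struct and field names here are made up for the
example, and the anonymous-struct wrapper mirrors what the kernel's
DECLARE_FLEX_ARRAY() expands to (it relies on GNU C extensions):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

struct msg_header {
	unsigned int type;
	unsigned int size;
};

struct queue_item {
	union {
		struct msg_header msg;
		struct {
			struct { } __empty_msg_flex;
			unsigned char msg_flex[];	/* alias of msg + payload */
		};
	};
};

int main(void)
{
	unsigned char payload[32] = { 0 };	/* header + body, 32 bytes */
	size_t msg_size = sizeof(payload);
	struct queue_item *item = malloc(sizeof(*item) + msg_size);

	if (!item)
		return 1;
	/* Copy through the flexible-array alias rather than &item->msg. */
	memcpy(item->msg_flex, payload, msg_size);
	printf("queued %zu bytes\n", msg_size);
	free(item);
	return 0;
}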

Reported-by: Aneesh Kumar K.V 
Signed-off-by: Mahesh Salgaonkar 
---
 arch/powerpc/platforms/powernv/opal-prd.c |7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/arch/powerpc/platforms/powernv/opal-prd.c 
b/arch/powerpc/platforms/powernv/opal-prd.c
index 113bdb151f687..9e2c4775f75f5 100644
--- a/arch/powerpc/platforms/powernv/opal-prd.c
+++ b/arch/powerpc/platforms/powernv/opal-prd.c
@@ -30,7 +30,10 @@
  */
 struct opal_prd_msg_queue_item {
struct list_headlist;
-   struct opal_prd_msg_header  msg;
+   union {
+   struct opal_prd_msg_header  msg;
+   DECLARE_FLEX_ARRAY(__u8, msg_flex);
+   };
 };
 
 static struct device_node *prd_node;
@@ -352,7 +355,7 @@ static int opal_prd_msg_notifier(struct notifier_block *nb,
if (!item)
return -ENOMEM;
 
-   memcpy(&item->msg, msg->params, msg_size);
+   memcpy(&item->msg_flex, msg->params, msg_size);
 
spin_lock_irqsave(&opal_prd_msg_queue_lock, flags);
list_add_tail(&item->list, &opal_prd_msg_queue);




Re: [PATCH 00/24 v2] Documentation: correct lots of spelling errors (series 1)

2023-06-26 Thread Krzysztof Wilczyński
Hello,

> Correct many spelling errors in Documentation/ as reported by codespell.
> 
> Maintainers of specific kernel subsystems are only Cc-ed on their
> respective patches, not the entire series.
> 
> These patches are based on linux-next-20230209.
> 
[...]
>  [PATCH 13/24] Documentation: PCI: correct spelling
[...]

Applied to misc, thank you!

[1/1] Documentation: PCI: correct spelling
  https://git.kernel.org/pci/pci/c/b58d6d89ae02

Krzysztof


Re: [PATCH v2 02/12] mm: introduce execmem_text_alloc() and jit_text_alloc()

2023-06-26 Thread Song Liu
On Sun, Jun 25, 2023 at 11:07 AM Kent Overstreet
 wrote:
>
> On Sun, Jun 25, 2023 at 08:42:57PM +0300, Mike Rapoport wrote:
> > On Sun, Jun 25, 2023 at 09:59:34AM -0700, Andy Lutomirski wrote:
> > >
> > >
> > > On Sun, Jun 25, 2023, at 9:14 AM, Mike Rapoport wrote:
> > > > On Mon, Jun 19, 2023 at 10:09:02AM -0700, Andy Lutomirski wrote:
> > > >>
> > > >> On Sun, Jun 18, 2023, at 1:00 AM, Mike Rapoport wrote:
> > > >> > On Sat, Jun 17, 2023 at 01:38:29PM -0700, Andy Lutomirski wrote:
> > > >> >> On Fri, Jun 16, 2023, at 1:50 AM, Mike Rapoport wrote:
> > > >> >> > From: "Mike Rapoport (IBM)" 
> > > >> >> >
> > > >> >> > module_alloc() is used everywhere as a mean to allocate memory 
> > > >> >> > for code.
> > > >> >> >
> > > >> >> > Beside being semantically wrong, this unnecessarily ties all 
> > > >> >> > subsystems
> > > >> >> > that need to allocate code, such as ftrace, kprobes and BPF to 
> > > >> >> > modules
> > > >> >> > and puts the burden of code allocation to the modules code.
> > > >> >> >
> > > >> >> > Several architectures override module_alloc() because of various
> > > >> >> > constraints where the executable memory can be located and this 
> > > >> >> > causes
> > > >> >> > additional obstacles for improvements of code allocation.
> > > >> >> >
> > > >> >> > Start splitting code allocation from modules by introducing
> > > >> >> > execmem_text_alloc(), execmem_free(), jit_text_alloc(), 
> > > >> >> > jit_free() APIs.
> > > >> >> >
> > > >> >> > Initially, execmem_text_alloc() and jit_text_alloc() are wrappers 
> > > >> >> > for
> > > >> >> > module_alloc() and execmem_free() and jit_free() are replacements 
> > > >> >> > of
> > > >> >> > module_memfree() to allow updating all call sites to use the new 
> > > >> >> > APIs.
> > > >> >> >
> > > >> >> > The intention semantics for new allocation APIs:
> > > >> >> >
> > > >> >> > * execmem_text_alloc() should be used to allocate memory that 
> > > >> >> > must reside
> > > >> >> >   close to the kernel image, like loadable kernel modules and 
> > > >> >> > generated
> > > >> >> >   code that is restricted by relative addressing.
> > > >> >> >
> > > >> >> > * jit_text_alloc() should be used to allocate memory for 
> > > >> >> > generated code
> > > >> >> >   when there are no restrictions for the code placement. For
> > > >> >> >   architectures that require that any code is within certain 
> > > >> >> > distance
> > > >> >> >   from the kernel image, jit_text_alloc() will be essentially 
> > > >> >> > aliased to
> > > >> >> >   execmem_text_alloc().
> > > >> >> >
> > > >> >>
> > > >> >> Is there anything in this series to help users do the appropriate
> > > >> >> synchronization when the actually populate the allocated memory with
> > > >> >> code?  See here, for example:
> > > >> >
> > > >> > This series only factors out the executable allocations from modules 
> > > >> > and
> > > >> > puts them in a central place.
> > > >> > Anything else would go on top after this lands.
> > > >>
> > > >> Hmm.
> > > >>
> > > >> On the one hand, there's nothing wrong with factoring out common code. 
> > > >> On
> > > >> the other hand, this is probably the right time to at least start
> > > >> thinking about synchronization, at least to the extent that it might 
> > > >> make
> > > >> us want to change this API.  (I'm not at all saying that this series
> > > >> should require changes -- I'm just saying that this is a good time to
> > > >> think about how this should work.)
> > > >>
> > > >> The current APIs, *and* the proposed jit_text_alloc() API, don't 
> > > >> actually
> > > >> look like the one think in the Linux ecosystem that actually
> > > >> intelligently and efficiently maps new text into an address space:
> > > >> mmap().
> > > >>
> > > >> On x86, you can mmap() an existing file full of executable code 
> > > >> PROT_EXEC
> > > >> and jump to it with minimal synchronization (just the standard implicit
> > > >> ordering in the kernel that populates the pages before setting up the
> > > >> PTEs and whatever user synchronization is needed to avoid jumping into
> > > >> the mapping before mmap() finishes).  It works across CPUs, and the 
> > > >> only
> > > >> possible way userspace can screw it up (for a read-only mapping of
> > > >> read-only text, anyway) is to jump to the mapping too early, in which
> > > >> case userspace gets a page fault.  Incoherence is impossible, and no 
> > > >> one
> > > >> needs to "serialize" (in the SDM sense).
> > > >>
> > > >> I think the same sequence (from userspace's perspective) works on other
> > > >> architectures, too, although I think more cache management is needed on
> > > >> the kernel's end.  As far as I know, no Linux SMP architecture needs an
> > > >> IPI to map executable text into usermode, but I could easily be wrong.
> > > >> (IIRC RISC-V has very developer-unfriendly icache management, but I 
> > > >> don't
> > > >> remember the details.)
> > > >>
> > > >> Of course, using ptrace or any other