Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()

2023-06-05 Thread Hugh Dickins
On Sun, 28 May 2023, Hugh Dickins wrote:

> Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon.  This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.
> 
> This version is more complicated than the others, because page_table_free()
> needs to know which fragment is being freed, and which mm to link it to.
> 
> page_table_free()'s fragment handling is clever, but I could too easily
> break it: what's done here in pte_free_defer() and pte_free_now() might
> be better integrated with page_table_free()'s cleverness, but not by me!
> 
> By the time that page_table_free() gets called via RCU, it's conceivable
> that mm would already have been freed: so mmgrab() in pte_free_defer()
> and mmdrop() in pte_free_now().  No, that is not a good context to call
> mmdrop() from, so make mmdrop_async() public and use that.

But Matthew Wilcox quickly pointed out that sharing one page->rcu_head
between multiple page tables is tricky: something I knew but had lost
sight of.  So the powerpc and s390 patches were broken: powerpc fairly
easily fixed, but s390 more painful.
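
To make the hazard concrete, here is a minimal sketch (hypothetical helper
names, not the s390 code): each 2K fragment shares its page's single rcu_head,
so freeing the two fragments independently would queue the same rcu_head twice.

/*
 * Sketch only: why one rcu_head per page is not enough for two
 * independently-freed 2K fragments sharing that page.
 */
static void pte_free_now(struct rcu_head *head)
{
	struct page *page = container_of(head, struct page, rcu_head);
	/* ... hand the fragment(s) back to page_table_free() ... */
}

static void pte_free_defer_fragment(struct page *page)
{
	/*
	 * If the other 2K fragment of this page is freed before the
	 * first callback runs, a second call_rcu() on the same rcu_head
	 * corrupts the RCU callback list.
	 */
	call_rcu(&page->rcu_head, pte_free_now);
}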

In https://lore.kernel.org/linux-s390/20230601155751.7c949ca4@thinkpad-T15/
on Thu, 1 Jun 2023 15:57:51 +0200, Gerald Schaefer wrote:
> 
> Yes, we have 2 pagetables in one 4K page, which could result in same
> rcu_head reuse. It might be possible to use the cleverness from our
> page_table_free() function, e.g. to only do the call_rcu() once, for
> the case where both 2K pagetable fragments become unused, similar to
> how we decide when to actually call __free_page().
> 
> However, it might be much worse, and page->rcu_head from a pagetable
> page cannot be used at all for s390, because we also use page->lru
> to keep our list of free 2K pagetable fragments. I always get confused
> by struct page unions, so not completely sure, but it seems to me that
> page->rcu_head would overlay with page->lru, right?

Sigh, yes, page->rcu_head overlays page->lru.  But (please correct me if
I'm wrong) I think that s390 could use exactly the same technique for
its list of free 2K pagetable fragments as it uses for its list of THP
"deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
the first two longs of the page table itself for threading the list.

And while it could use third and fourth longs instead, I don't see any
need for that: a deposited pagetable has been allocated, so would not
be on the list of free fragments.
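
Before the patch itself, a minimal sketch of that idea (invented helper names,
not the code below): the free-list node lives in the first two longs of the
unused 2K fragment, leaving page->lru, and therefore the overlapping
page->rcu_head, untouched.

/*
 * Illustration only: thread the free list through the fragment itself,
 * as arch/s390/mm/pgtable.c already does for THP-deposited page tables.
 */
static void fragment_list_push(struct list_head *head, unsigned long *table)
{
	/* first two longs of the free 2K fragment hold the list node */
	list_add((struct list_head *)table, head);
}

static unsigned long *fragment_list_pop(struct list_head *head)
{
	struct list_head *node = head->next;

	list_del(node);
	return (unsigned long *)node;	/* node address == fragment address */
}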

Below is one of the grossest patches I've ever posted: gross because
it's a rushed attempt to see whether that is viable, while it would take
me longer to understand all the s390 cleverness there (even though the
PP AA commentary above page_table_alloc() is excellent).

I'm hoping the use of page->lru in arch/s390/mm/gmap.c is disjoint.
And cmma_init_nodat()? Ah, that's __init so I guess disjoint.

Gerald, s390 folk: would it be possible for you to give this
a try, suggest corrections and improvements, and then I can make it
a separate patch of the series; and work on avoiding concurrent use
of the rcu_head by pagetable fragment buddies (ideally fit in with
the scheme already there, maybe DD bits to go along with the PP AA).

Why am I even asking you to move away from page->lru: why don't I
thread s390's pte_free_defer() pagetables like THP's deposit does?
I cannot, because the deferred pagetables have to remain accessible
as valid pagetables, until the RCU grace period has elapsed - unless
all the list pointers would appear as pte_none(), which I doubt.

(That may limit our possibilities with the deposited pagetables in
future: I can imagine them too wanting to remain accessible as valid
pagetables.  But that's not needed by this series, and s390 only uses
deposit/withdraw for anon THP; and some are hoping that we might be
able to move away from deposit/withdraw altogether - though powerpc's
special use will make that more difficult.)

Thanks!
Hugh

--- 6.4-rc5/arch/s390/mm/pgalloc.c
+++ linux/arch/s390/mm/pgalloc.c
@@ -232,6 +232,7 @@ void page_table_free_pgste(struct page *page)
  */
 unsigned long *page_table_alloc(struct mm_struct *mm)
 {
+   struct list_head *listed;
unsigned long *table;
struct page *page;
unsigned int mask, bit;
@@ -241,8 +242,8 @@ unsigned long *page_table_alloc(struct mm_struct *mm)
table = NULL;
spin_lock_bh(&mm->context.lock);
if (!list_empty(&mm->context.pgtable_list)) {
-   page = list_first_entry(&mm->context.pgtable_list,
-   struct page, lru);
+   listed = mm->context.pgtable_list.next;
+   page = virt_to_page(listed);
mask = atomic_read(&page->_refcount) >> 24;
/*
 * The pending removal bits must also be 

[PATCH 16/16] powerpc/book3s64/radix: Remove mmu_vmemmap_psize

2023-06-05 Thread Aneesh Kumar K.V
This is not used by radix anymore.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 10 --
 arch/powerpc/mm/init_64.c| 21 ++---
 2 files changed, 14 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 82d36df52a8d..b59219751599 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -601,16 +601,6 @@ void __init radix__early_init_mmu(void)
mmu_virtual_psize = MMU_PAGE_4K;
 #endif
 
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-   /* vmemmap mapping */
-   if (mmu_psize_defs[MMU_PAGE_2M].shift) {
-   /*
-* map vmemmap using 2M if available
-*/
-   mmu_vmemmap_psize = MMU_PAGE_2M;
-   } else
-   mmu_vmemmap_psize = mmu_virtual_psize;
-#endif
/*
 * initialize page table size
 */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 5701faca39ef..6db7a063ba63 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -198,17 +198,12 @@ bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
return false;
 }
 
-int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
-   struct vmem_altmap *altmap)
+int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int node,
+struct vmem_altmap *altmap)
 {
bool altmap_alloc;
unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
 
-#ifdef CONFIG_PPC_BOOK3S_64
-   if (radix_enabled())
-   return radix__vmemmap_populate(start, end, node, altmap);
-#endif
-
/* Align to the page size of the linear mapping. */
start = ALIGN_DOWN(start, page_size);
 
@@ -277,6 +272,18 @@ int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
return 0;
 }
 
+int __meminit vmemmap_populate(unsigned long start, unsigned long end, int node,
+  struct vmem_altmap *altmap)
+{
+
+#ifdef CONFIG_PPC_BOOK3S_64
+   if (radix_enabled())
+   return radix__vmemmap_populate(start, end, node, altmap);
+#endif
+
+   return __vmemmap_populate(start, end, node, altmap);
+}
+
 #ifdef CONFIG_MEMORY_HOTPLUG
 static unsigned long vmemmap_list_free(unsigned long start)
 {
-- 
2.40.1



[PATCH 15/16] powerpc/book3s64/radix: Add support for vmemmap optimization for radix

2023-06-05 Thread Aneesh Kumar K.V
With a 2M PMD-level mapping, we require only 32 struct pages, while a single
vmemmap page can contain 1024 struct pages (PAGE_SIZE/sizeof(struct page)).
Hence with 64K page size, we don't use vmemmap deduplication for PMD-level mapping.
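
The arithmetic behind those thresholds, spelled out as a standalone check (not
kernel code; it assumes sizeof(struct page) == 64, typical for 64-bit configs):

#include <stdio.h>

int main(void)
{
	unsigned long page_size = 64UL * 1024;	/* 64K base page */
	unsigned long sp = 64;			/* assumed sizeof(struct page) */

	/* 2M PMD mapping needs 2M/64K = 32 struct pages... */
	printf("struct pages per 2M PMD: %lu\n", (2UL << 20) / page_size);
	/* ...but one 64K vmemmap page already holds 1024: nothing to dedup */
	printf("struct pages per vmemmap page: %lu\n", page_size / sp);
	/* 1G PUD mapping needs 16384 struct pages = 16 vmemmap pages */
	printf("vmemmap pages per 1G PUD: %lu\n",
	       ((1UL << 30) / page_size) / (page_size / sp));
	return 0;
}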

Signed-off-by: Aneesh Kumar K.V 
---
 Documentation/mm/vmemmap_dedup.rst |   1 +
 Documentation/powerpc/vmemmap_dedup.rst| 101 ++
 arch/powerpc/Kconfig   |   1 +
 arch/powerpc/include/asm/book3s/64/radix.h |   8 +
 arch/powerpc/mm/book3s64/radix_pgtable.c   | 203 +
 5 files changed, 314 insertions(+)
 create mode 100644 Documentation/powerpc/vmemmap_dedup.rst

diff --git a/Documentation/mm/vmemmap_dedup.rst b/Documentation/mm/vmemmap_dedup.rst
index a4b12ff906c4..c573e08b5043 100644
--- a/Documentation/mm/vmemmap_dedup.rst
+++ b/Documentation/mm/vmemmap_dedup.rst
@@ -210,6 +210,7 @@ the device (altmap).
 
 The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
 PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
+For powerpc equivalent details see Documentation/powerpc/vmemmap_dedup.rst
 
 The differences with HugeTLB are relatively minor.
 
diff --git a/Documentation/powerpc/vmemmap_dedup.rst b/Documentation/powerpc/vmemmap_dedup.rst
new file mode 100644
index ..dc4db59fdf87
--- /dev/null
+++ b/Documentation/powerpc/vmemmap_dedup.rst
@@ -0,0 +1,101 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==========
+Device DAX
+==========
+
+The device-dax interface uses the tail deduplication technique explained in
+Documentation/mm/vmemmap_dedup.rst
+
+On powerpc, vmemmap deduplication is only used with radix MMU translation. Also
+with a 64K page size, only the devdax namespace with 1G alignment uses vmemmap
+deduplication.
+
+With 2M PMD level mapping, we require 32 struct pages and a single 64K vmemmap
+page can contain 1024 struct pages (64K/sizeof(struct page)). Hence there is no
+vmemmap deduplication possible.
+
+With 1G PUD level mapping, we require 16384 struct pages and a single 64K
+vmemmap page can contain 1024 struct pages (64K/sizeof(struct page)). Hence we
+require 16 64K pages in vmemmap to map the struct page for 1G PUD level mapping.
+
+Here's how things look like on device-dax after the sections are populated::
+ +---+ ---virt_to_page---> +---+   mapping to   +---+
+ |   | | 0 | -> | 0 |
+ |   | +---++---+
+ |   | | 1 | -> | 1 |
+ |   | +---++---+
+ |   | | 2 | ^ ^ ^ ^ ^ ^
+ |   | +---+   | | | | |
+ |   | | 3 | --+ | | | |
+ |   | +---+ | | | |
+ |   | | 4 | + | | |
+ |PUD| +---+   | | |
+ |   level   | | . | --+ | |
+ |  mapping  | +---+ | |
+ |   | | . | + |
+ |   | +---+   |
+ |   | | 15| --+
+ |   | +---+
+ |   |
+ |   |
+ |   |
+ +---+
+
+
+With 4K page size, 2M PMD level mapping requires 512 struct pages and a single
+4K vmemmap page contains 64 struct pages (4K/sizeof(struct page)). Hence we
+require 8 4K pages in vmemmap to map the struct page for 2M PMD level mapping.
+
+Here's how things look like on device-dax after the sections are populated::
+
+ +---+ ---virt_to_page---> +---+   mapping to   +---+
+ |   | | 0 | -> | 0 |
+ |   | +---++---+
+ |   | | 1 | -> | 1 |
+ |   | +---++---+
+ |   | | 2 | ^ ^ ^ ^ ^ ^
+ |   | +---+   | | | | |
+ |   | | 3 | --+ | | | |
+ |   | +---+ | | | |
+ |   | | 4 | + | | |
+ |PMD| +---+   | | |
+ |   level   | | 5 | --+ | |
+ |  mapping  | +---+ | |
+ |   | 

[PATCH 14/16] powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap handling function

2023-06-05 Thread Aneesh Kumar K.V
This is in preparation to update radix to implement vmemmap optimization for
devdax. Below are the rules w.r.t. radix vmemmap mapping (a rough sketch of the
resulting logic follows the list):

1. First try to map things using a PMD (2M) mapping.
2. With an altmap, if the altmap cross-boundary check returns true, fall back
   to PAGE_SIZE.
3. If we can't allocate PMD_SIZE backing memory for vmemmap, fall back to
   PAGE_SIZE.

On removing a vmemmap mapping, check if every subsection that is using the
vmemmap area is invalid. If found to be invalid, that implies we can safely
free the vmemmap area. We don't use the PAGE_UNUSED pattern used by x86 because
with 64K page size, we need to do the above check even at PAGE_SIZE granularity.
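
In rough pseudocode, the three rules come down to something like this (helper
names invented for illustration; the real logic is radix__vmemmap_populate()
in the diff below):

/* Sketch of the mapping fallback order only, not the patch's code. */
static int radix_vmemmap_map(unsigned long start, unsigned long end,
			     int node, struct vmem_altmap *altmap)
{
	unsigned long addr;

	for (addr = start; addr < end; addr += PMD_SIZE) {
		void *block = NULL;

		/* rule 2: a cross-boundary altmap forces PAGE_SIZE */
		if (!altmap || !altmap_cross_boundary(altmap, addr, PMD_SIZE))
			block = alloc_pmd_block(node, altmap);	/* rule 1 */

		if (block)
			map_pmd(addr, block);			/* 2M mapping */
		else
			map_base_pages(addr, addr + PMD_SIZE,	/* rule 3 */
				       node, altmap);
	}
	return 0;
}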

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/radix.h |   2 +
 arch/powerpc/include/asm/pgtable.h |   3 +
 arch/powerpc/mm/book3s64/radix_pgtable.c   | 293 +++--
 arch/powerpc/mm/init_64.c  |  26 +-
 4 files changed, 293 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/radix.h b/arch/powerpc/include/asm/book3s/64/radix.h
index 8cdff5a05011..87d4c1e62491 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -332,6 +332,8 @@ extern int __meminit radix__vmemmap_create_mapping(unsigned long start,
 unsigned long phys);
 int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end,
  int node, struct vmem_altmap *altmap);
+void __ref radix__vmemmap_free(unsigned long start, unsigned long end,
+  struct vmem_altmap *altmap);
 extern void radix__vmemmap_remove_mapping(unsigned long start,
unsigned long page_size);
 
diff --git a/arch/powerpc/include/asm/pgtable.h b/arch/powerpc/include/asm/pgtable.h
index 9972626ddaf6..6d4cd2ebae6e 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -168,6 +168,9 @@ static inline bool is_ioremap_addr(const void *x)
 
 struct seq_file;
 void arch_report_meminfo(struct seq_file *m);
+int __meminit vmemmap_populated(unsigned long vmemmap_addr, int vmemmap_map_size);
+bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
+  unsigned long page_size);
 #endif /* CONFIG_PPC64 */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index d7e2dd3d4add..65de8630abcb 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -742,8 +742,57 @@ static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
p4d_clear(p4d);
 }
 
+static bool __meminit vmemmap_pmd_is_unused(unsigned long addr, unsigned long end)
+{
+   unsigned long start = ALIGN_DOWN(addr, PMD_SIZE);
+
+   return vmemmap_populated(start, PMD_SIZE);
+}
+
+static bool __meminit vmemmap_page_is_unused(unsigned long addr, unsigned long end)
+{
+   unsigned long start = ALIGN_DOWN(addr, PAGE_SIZE);
+
+   return vmemmap_populated(start, PAGE_SIZE);
+
+}
+
+static void __meminit free_vmemmap_pages(struct page *page,
+struct vmem_altmap *altmap,
+int order)
+{
+   unsigned int nr_pages = 1 << order;
+
+   if (altmap) {
+   unsigned long alt_start, alt_end;
+   unsigned long base_pfn = page_to_pfn(page);
+
+   /*
+* With 1G vmemmap mapping we can have things set up
+* such that even though altmap is specified we never
+* use the altmap.
+*/
+   alt_start = altmap->base_pfn;
+   alt_end = altmap->base_pfn + altmap->reserve +
+   altmap->free + altmap->alloc + altmap->align;
+
+   if (base_pfn >= alt_start && base_pfn < alt_end) {
+   vmem_altmap_free(altmap, nr_pages);
+   return;
+   }
+   }
+
+   if (PageReserved(page)) {
+   /* allocated from memblock */
+   while (nr_pages--)
+   free_reserved_page(page++);
+   } else
+   free_pages((unsigned long)page_address(page), order);
+}
+
 static void remove_pte_table(pte_t *pte_start, unsigned long addr,
-unsigned long end, bool direct)
+unsigned long end, bool direct,
+struct vmem_altmap *altmap)
 {
unsigned long next, pages = 0;
pte_t *pte;
@@ -757,24 +806,23 @@ static void remove_pte_table(pte_t *pte_start, unsigned long addr,
if (!pte_present(*pte))
continue;
 
-   if (!PAGE_ALIGNED(addr) || !PAGE_ALIGNED(next)) {
-   /*
-* The vmemmap_free() and 

[PATCH 13/16] powerpc/book3s64/mm: Enable transparent pud hugepage

2023-06-05 Thread Aneesh Kumar K.V
This is enabled only with radix translation and 1G hugepage size. This will be
used with devdax device memory with a namespace alignment of 1G.

Anon transparent hugepage is not supported even though we do have helpers
checking pud_trans_huge(); we should never find it returning true. The only
expected pte bit combination is _PAGE_PTE | _PAGE_DEVMAP.

Some of the helpers are never expected to get called on hash translation and
hence are marked to call BUG() in such a case.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 156 --
 arch/powerpc/include/asm/book3s/64/radix.h|  37 +
 .../include/asm/book3s/64/tlbflush-radix.h|   2 +
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   8 +
 arch/powerpc/mm/book3s64/pgtable.c|  78 +
 arch/powerpc/mm/book3s64/radix_pgtable.c  |  28 
 arch/powerpc/mm/book3s64/radix_tlb.c  |   7 +
 arch/powerpc/platforms/Kconfig.cputype|   1 +
 include/trace/events/thp.h|  17 ++
 9 files changed, 323 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 4acc9690f599..9a05de007956 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -921,8 +921,29 @@ static inline pud_t pte_pud(pte_t pte)
 {
return __pud_raw(pte_raw(pte));
 }
+
+static inline pte_t *pudp_ptep(pud_t *pud)
+{
+   return (pte_t *)pud;
+}
+
+#define pud_pfn(pud)   pte_pfn(pud_pte(pud))
+#define pud_dirty(pud) pte_dirty(pud_pte(pud))
+#define pud_young(pud) pte_young(pud_pte(pud))
+#define pud_mkold(pud) pte_pud(pte_mkold(pud_pte(pud)))
+#define pud_wrprotect(pud) pte_pud(pte_wrprotect(pud_pte(pud)))
+#define pud_mkdirty(pud)   pte_pud(pte_mkdirty(pud_pte(pud)))
+#define pud_mkclean(pud)   pte_pud(pte_mkclean(pud_pte(pud)))
+#define pud_mkyoung(pud)   pte_pud(pte_mkyoung(pud_pte(pud)))
+#define pud_mkwrite(pud)   pte_pud(pte_mkwrite(pud_pte(pud)))
 #define pud_write(pud) pte_write(pud_pte(pud))
 
+#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#define pud_soft_dirty(pud)pte_soft_dirty(pud_pte(pud))
+#define pud_mksoft_dirty(pud)  pte_pud(pte_mksoft_dirty(pud_pte(pud)))
+#define pud_clear_soft_dirty(pud) pte_pud(pte_clear_soft_dirty(pud_pte(pud)))
+#endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
+
 static inline int pud_bad(pud_t pud)
 {
if (radix_enabled())
@@ -1115,15 +1136,24 @@ static inline bool pmd_access_permitted(pmd_t pmd, bool write)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
+extern pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot);
 extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
 extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot);
 extern void set_pmd_at(struct mm_struct *mm, unsigned long addr,
   pmd_t *pmdp, pmd_t pmd);
+extern void set_pud_at(struct mm_struct *mm, unsigned long addr,
+  pud_t *pudp, pud_t pud);
+
 static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmd)
 {
 }
 
+static inline void update_mmu_cache_pud(struct vm_area_struct *vma,
+   unsigned long addr, pud_t *pud)
+{
+}
+
 extern int hash__has_transparent_hugepage(void);
 static inline int has_transparent_hugepage(void)
 {
@@ -1133,6 +1163,14 @@ static inline int has_transparent_hugepage(void)
 }
 #define has_transparent_hugepage has_transparent_hugepage
 
+static inline int has_transparent_pud_hugepage(void)
+{
+   if (radix_enabled())
+   return radix__has_transparent_pud_hugepage();
+   return 0;
+}
+#define has_transparent_pud_hugepage has_transparent_pud_hugepage
+
 static inline unsigned long
 pmd_hugepage_update(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp,
unsigned long clr, unsigned long set)
@@ -1142,6 +1180,16 @@ pmd_hugepage_update(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp,
return hash__pmd_hugepage_update(mm, addr, pmdp, clr, set);
 }
 
+static inline unsigned long
+pud_hugepage_update(struct mm_struct *mm, unsigned long addr, pud_t *pudp,
+   unsigned long clr, unsigned long set)
+{
+   if (radix_enabled())
+   return radix__pud_hugepage_update(mm, addr, pudp, clr, set);
+   BUG();
+   return pud_val(*pudp);
+}
+
 /*
  * returns true for pmd migration entries, THP, devmap, hugetlb
  * But compile time dependent on THP config
@@ -1151,6 +1199,11 @@ static inline int pmd_large(pmd_t pmd)
return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
 }
 
+static inline int pud_large(pud_t pud)
+{
+   return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
+}
+
 /*
  * For radix we should always find H_PAGE_HASHPTE zero. Hence
  * the below will work for radix too
@@ 

[PATCH 12/16] mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization

2023-06-05 Thread Aneesh Kumar K.V
Arm disabled hugetlb vmemmap optimization [1] because hugetlb vmemmap
optimization includes an update of both the permissions (writeable to read-only)
and the output address (pfn) of the vmemmap ptes. That is not supported without
unmapping the pte (marking it invalid) on some architectures.

With DAX vmemmap optimization we don't require such pte updates and
architectures can enable DAX vmemmap optimization while having hugetlb vmemmap
optimization disabled. Hence split DAX optimization support into a different
config.

loongarch and riscv don't have devdax support, so the DAX config is not enabled
for them. With this change, arm64 should be able to select the DAX optimization.

[1] commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable HUGETLB_PAGE_OPTIMIZE_VMEMMAP")

Signed-off-by: Aneesh Kumar K.V 
---
 arch/loongarch/Kconfig | 2 +-
 arch/riscv/Kconfig | 2 +-
 arch/x86/Kconfig   | 3 ++-
 fs/Kconfig | 2 +-
 include/linux/mm.h | 2 +-
 mm/Kconfig | 5 -
 6 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index d38b066fc931..2060990c4612 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -55,7 +55,7 @@ config LOONGARCH
select ARCH_USE_QUEUED_SPINLOCKS
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
select ARCH_WANT_LD_ORPHAN_WARN
-   select ARCH_WANT_OPTIMIZE_VMEMMAP
+   select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
select ARCH_WANTS_NO_INSTR
select BUILDTIME_TABLE_SORT
select COMMON_CLK
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 348c0fa1fc8c..aafbf201f708 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -49,7 +49,7 @@ config RISCV
select ARCH_WANT_GENERAL_HUGETLB if !RISCV_ISA_SVNAPOT
select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
select ARCH_WANT_LD_ORPHAN_WARN if !XIP_KERNEL
-   select ARCH_WANT_OPTIMIZE_VMEMMAP
+   select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
select ARCH_WANTS_THP_SWAP if HAVE_ARCH_TRANSPARENT_HUGEPAGE
select BINFMT_FLAT_NO_DATA_START_OFFSET if !MMU
select BUILDTIME_TABLE_SORT if MMU
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 53bab123a8ee..eb383960b6ee 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -127,7 +127,8 @@ config X86
select ARCH_WANT_GENERAL_HUGETLB
select ARCH_WANT_HUGE_PMD_SHARE
select ARCH_WANT_LD_ORPHAN_WARN
-   select ARCH_WANT_OPTIMIZE_VMEMMAP   if X86_64
+   select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP   if X86_64
+   select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP   if X86_64
select ARCH_WANTS_THP_SWAP  if X86_64
select ARCH_HAS_PARANOID_L1D_FLUSH
select BUILDTIME_TABLE_SORT
diff --git a/fs/Kconfig b/fs/Kconfig
index 18d034ec7953..9c104c130a6e 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -252,7 +252,7 @@ config HUGETLB_PAGE
 
 config HUGETLB_PAGE_OPTIMIZE_VMEMMAP
def_bool HUGETLB_PAGE
-   depends on ARCH_WANT_OPTIMIZE_VMEMMAP
+   depends on ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
depends on SPARSEMEM_VMEMMAP
 
 config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9a45e61cd83f..6e56ae09f0c1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3553,7 +3553,7 @@ void vmemmap_free(unsigned long start, unsigned long end,
 #endif
 
 #define VMEMMAP_RESERVE_NR 2
-#ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
+#ifdef CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
 static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap,
  struct dev_pagemap *pgmap)
 {
diff --git a/mm/Kconfig b/mm/Kconfig
index 7672a22647b4..7b388c10baab 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -461,7 +461,10 @@ config SPARSEMEM_VMEMMAP
 # Select this config option from the architecture Kconfig, if it is preferred
 # to enable the feature of HugeTLB/dev_dax vmemmap optimization.
 #
-config ARCH_WANT_OPTIMIZE_VMEMMAP
+config ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
+   bool
+
+config ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
bool
 
 config HAVE_MEMBLOCK_PHYS_MAP
-- 
2.40.1



[PATCH 11/16] mm/huge pud: Use transparent huge pud helpers only with CONFIG_TRANSPARENT_HUGEPAGE

2023-06-05 Thread Aneesh Kumar K.V
The pudp_set_wrprotect and move_huge_pud helpers are only used when
CONFIG_TRANSPARENT_HUGEPAGE is enabled. Similar to the pmdp_set_wrprotect and
move_huge_pmd helpers, use the architecture override only if
CONFIG_TRANSPARENT_HUGEPAGE is set.

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/pgtable.h | 2 ++
 mm/mremap.c | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8c5174d1f9db..c7f5806dc9d1 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -550,6 +550,7 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 #endif
 #ifndef __HAVE_ARCH_PUDP_SET_WRPROTECT
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline void pudp_set_wrprotect(struct mm_struct *mm,
  unsigned long address, pud_t *pudp)
 {
@@ -563,6 +564,7 @@ static inline void pudp_set_wrprotect(struct mm_struct *mm,
 {
BUILD_BUG();
 }
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 #endif
 
diff --git a/mm/mremap.c b/mm/mremap.c
index b11ce6c92099..6373db571e5c 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -338,7 +338,7 @@ static inline bool move_normal_pud(struct vm_area_struct *vma,
 }
 #endif
 
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
 static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr,
unsigned long new_addr, pud_t *old_pud, pud_t *new_pud)
 {
-- 
2.40.1



[PATCH 10/16] mm: Add __HAVE_ARCH_PUD_SAME similar to __HAVE_ARCH_P4D_SAME

2023-06-05 Thread Aneesh Kumar K.V
This helps architectures to override pmd_same and pud_same
independently.
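
As a sketch of the usage this enables (illustrative arch-side code, not part
of this patch), an architecture defines the guard and supplies its own
comparison:

/* arch/<arch>/include/asm/pgtable.h -- illustrative only */
#define __HAVE_ARCH_PUD_SAME
static inline int pud_same(pud_t pud_a, pud_t pud_b)
{
	/* arch-specific comparison, e.g. ignoring software-only bits */
	return pud_val(pud_a) == pud_val(pud_b);
}
/* include/linux/pgtable.h then skips its generic pud_same() */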

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 2fe19720075e..8c5174d1f9db 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -681,7 +681,9 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 {
return pmd_val(pmd_a) == pmd_val(pmd_b);
 }
+#endif
 
+#ifndef __HAVE_ARCH_PUD_SAME
 static inline int pud_same(pud_t pud_a, pud_t pud_b)
 {
return pud_val(pud_a) == pud_val(pud_b);
-- 
2.40.1



[PATCH 09/16] mm/vmemmap: Allow architectures to override how vmemmap optimization works

2023-06-05 Thread Aneesh Kumar K.V
Architectures like powerpc would like to use different page table allocators and
mapping mechanisms to implement vmemmap optimization. Similar to
vmemmap_populate, allow architectures to implement vmemmap_populate_compound_pages.
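
The override uses the usual self-#define pattern; roughly (the arch side here
is sketched for illustration, only the guard in mm/sparse-vmemmap.c is from
this patch):

/* arch header (sketch): advertise an arch-specific implementation */
#define vmemmap_populate_compound_pages vmemmap_populate_compound_pages
int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
					      unsigned long start,
					      unsigned long end, int node,
					      struct dev_pagemap *pgmap);

/* mm/sparse-vmemmap.c (this patch): generic code compiles only as fallback */
#ifndef vmemmap_populate_compound_pages
/* ... generic vmemmap_pte_populate() ... vmemmap_populate_compound_pages() ... */
#endif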

Signed-off-by: Aneesh Kumar K.V 
---
 mm/sparse-vmemmap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 10d73a0dfcec..0b83706c08fd 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -141,6 +141,7 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
start, end - 1);
 }
 
+#ifndef vmemmap_populate_compound_pages
pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int node,
   struct vmem_altmap *altmap,
   struct page *reuse)
@@ -446,6 +447,8 @@ static int __meminit vmemmap_populate_compound_pages(unsigned long start_pfn,
return 0;
 }
 
+#endif
+
 struct page * __meminit __populate_section_memmap(unsigned long pfn,
unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
struct dev_pagemap *pgmap)
-- 
2.40.1



[PATCH 08/16] mm/vmemmap: Improve vmemmap_can_optimize and allow architectures to override

2023-06-05 Thread Aneesh Kumar K.V
DAX vmemmap optimization requires a minimum of 2 PAGE_SIZE areas within vmemmap
such that the tail page mapping can point to the second PAGE_SIZE area. Enforce
that in the vmemmap_can_optimize() function.

Architectures like powerpc also want to enable vmemmap optimization
conditionally (only with radix MMU translation). Hence allow architecture
override.
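
With the #ifndef guard added below, a conditional override becomes a small
wrapper; a sketch of what a radix-only enablement could look like (names and
placement are assumptions, not taken from this patch):

/* arch header (sketch): optimize only under radix translation */
static inline bool arch_vmemmap_can_optimize(struct vmem_altmap *altmap,
					     struct dev_pagemap *pgmap)
{
	if (!radix_enabled())
		return false;
	return __vmemmap_can_optimize(altmap, pgmap);	/* generic rule */
}
#define vmemmap_can_optimize arch_vmemmap_can_optimize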

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/mm.h | 30 ++
 mm/mm_init.c   |  2 +-
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..9a45e61cd83f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -31,6 +31,8 @@
 #include 
 #include 
 
+#include 
+
 struct mempolicy;
 struct anon_vma;
 struct anon_vma_chain;
@@ -3550,13 +3552,33 @@ void vmemmap_free(unsigned long start, unsigned long 
end,
struct vmem_altmap *altmap);
 #endif
 
+#define VMEMMAP_RESERVE_NR 2
 #ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
-static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
-  struct dev_pagemap *pgmap)
+static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap,
+ struct dev_pagemap *pgmap)
 {
-   return is_power_of_2(sizeof(struct page)) &&
-   pgmap && (pgmap_vmemmap_nr(pgmap) > 1) && !altmap;
+   if (pgmap) {
+   unsigned long nr_pages;
+   unsigned long nr_vmemmap_pages;
+
+   nr_pages = pgmap_vmemmap_nr(pgmap);
+   nr_vmemmap_pages = ((nr_pages * sizeof(struct page)) >> PAGE_SHIFT);
+   /*
+* For vmemmap optimization with DAX we need minimum 2 vmemmap
+* pages. See layout diagram in Documentation/mm/vmemmap_dedup.rst
+*/
+   return is_power_of_2(sizeof(struct page)) &&
+   (nr_vmemmap_pages > VMEMMAP_RESERVE_NR) && !altmap;
+   }
+   return false;
 }
+/*
+ * If we don't have an architecture override, use the generic rule
+ */
+#ifndef vmemmap_can_optimize
+#define vmemmap_can_optimize __vmemmap_can_optimize
+#endif
+
 #else
 static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
   struct dev_pagemap *pgmap)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7f7f9c677854..d1676afc94f1 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1020,7 +1020,7 @@ static inline unsigned long compound_nr_pages(struct vmem_altmap *altmap,
if (!vmemmap_can_optimize(altmap, pgmap))
return pgmap_vmemmap_nr(pgmap);
 
-   return 2 * (PAGE_SIZE / sizeof(struct page));
+   return VMEMMAP_RESERVE_NR * (PAGE_SIZE / sizeof(struct page));
 }
 
 static void __ref memmap_init_compound(struct page *head,
-- 
2.40.1



[PATCH 07/16] mm: Change pudp_huge_get_and_clear_full take vm_area_struct as arg

2023-06-05 Thread Aneesh Kumar K.V
We will use this in a later patch to do a TLB flush when clearing pud entries on
powerpc. This is similar to
commit 93a98695f2f9 ("mm: change pmdp_huge_get_and_clear_full take vm_area_struct as arg").

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/pgtable.h | 4 ++--
 mm/debug_vm_pgtable.c   | 2 +-
 mm/huge_memory.c| 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b3f4dd0240f5..2fe19720075e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -442,11 +442,11 @@ static inline pmd_t pmdp_huge_get_and_clear_full(struct vm_area_struct *vma,
 #endif
 
 #ifndef __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR_FULL
-static inline pud_t pudp_huge_get_and_clear_full(struct mm_struct *mm,
+static inline pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma,
unsigned long address, pud_t *pudp,
int full)
 {
-   return pudp_huge_get_and_clear(mm, address, pudp);
+   return pudp_huge_get_and_clear(vma->vm_mm, address, pudp);
 }
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index c54177aabebd..c2bf25d5e5cd 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -382,7 +382,7 @@ static void __init pud_advanced_tests(struct pgtable_debug_args *args)
WARN_ON(!(pud_write(pud) && pud_dirty(pud)));
 
 #ifndef __PAGETABLE_PMD_FOLDED
-   pudp_huge_get_and_clear_full(args->mm, vaddr, args->pudp, 1);
+   pudp_huge_get_and_clear_full(args->vma, vaddr, args->pudp, 1);
pud = READ_ONCE(*args->pudp);
WARN_ON(!pud_none(pud));
 #endif /* __PAGETABLE_PMD_FOLDED */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 624671aaa60d..8774b4751a84 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1980,7 +1980,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma,
if (!ptl)
return 0;
 
-   pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm);
+   pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
tlb_remove_pud_tlb_entry(tlb, pud, addr);
if (vma_is_special_huge(vma)) {
spin_unlock(ptl);
-- 
2.40.1



[PATCH 06/16] mm/hugepage pud: Allow arch-specific helper function to check huge page pud support

2023-06-05 Thread Aneesh Kumar K.V
Architectures like powerpc would like to enable transparent huge page pud
support only with radix translation. To support that, add a
has_transparent_pud_hugepage() helper that architectures can override.

Signed-off-by: Aneesh Kumar K.V 
---
 drivers/nvdimm/pfn_devs.c | 2 +-
 include/linux/pgtable.h   | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index af7d9301520c..18ad315581ca 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -100,7 +100,7 @@ static unsigned long *nd_pfn_supported_alignments(unsigned long *alignments)
 
if (has_transparent_hugepage()) {
alignments[1] = HPAGE_PMD_SIZE;
-   if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD))
+   if (has_transparent_pud_hugepage())
alignments[2] = HPAGE_PUD_SIZE;
}
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index c5a51481bbb9..b3f4dd0240f5 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1597,6 +1597,9 @@ typedef unsigned int pgtbl_mod_mask;
 #define has_transparent_hugepage() IS_BUILTIN(CONFIG_TRANSPARENT_HUGEPAGE)
 #endif
 
+#ifndef has_transparent_pud_hugepage
+#define has_transparent_pud_hugepage() IS_BUILTIN(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+#endif
 /*
  * On some architectures it depends on the mm if the p4d/pud or pmd
  * layer of the page table hierarchy is folded or not.
-- 
2.40.1



[PATCH 05/16] powerpc/mm/dax: Fix the condition when checking if altmap vmemap can cross-boundary

2023-06-05 Thread Aneesh Kumar K.V
Without this fix, the last subsection vmemmap can end up in memory even if the
namespace is created with -M mem and has sufficient space in the altmap area.
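
A worked example, assuming altmap->end_pfn is the last usable pfn (inclusive):
with end_pfn = 100, start_pfn = 97 and nr_pfn = 4, the vmemmap occupies pfns
97..100 and fits in the altmap, but the old check computed 97 + 4 = 101 > 100
and wrongly reported a boundary crossing, so that last subsection's vmemmap was
placed in regular memory instead.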

Fixes: cf387d9644d8 ("libnvdimm/altmap: Track namespace boundaries in altmap")
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/init_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 05b0d584e50b..fe1b83020e0d 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -189,7 +189,7 @@ static bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
unsigned long nr_pfn = page_size / sizeof(struct page);
unsigned long start_pfn = page_to_pfn((struct page *)start);
 
-   if ((start_pfn + nr_pfn) > altmap->end_pfn)
+   if ((start_pfn + nr_pfn - 1) > altmap->end_pfn)
return true;
 
if (start_pfn < altmap->base_pfn)
-- 
2.40.1



[PATCH 04/16] powerpc/book3s64/mm: Use PAGE_KERNEL instead of opencoding

2023-06-05 Thread Aneesh Kumar K.V
No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 15a099e53cde..76f6a1f3b9d8 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -910,7 +910,6 @@ int __meminit radix__vmemmap_create_mapping(unsigned long start,
  unsigned long phys)
 {
/* Create a PTE encoding */
-   unsigned long flags = _PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_KERNEL_RW;
int nid = early_pfn_to_nid(phys >> PAGE_SHIFT);
int ret;
 
@@ -919,7 +918,7 @@ int __meminit radix__vmemmap_create_mapping(unsigned long start,
return -1;
}
 
-   ret = __map_kernel_page_nid(start, phys, __pgprot(flags), page_size, nid);
+   ret = __map_kernel_page_nid(start, phys, PAGE_KERNEL, page_size, nid);
BUG_ON(ret);
 
return 0;
-- 
2.40.1



[PATCH 03/16] powerpc/book3s64/mm: Fix DirectMap stats in /proc/meminfo

2023-06-05 Thread Aneesh Kumar K.V
On memory unplug reduce DirectMap page count correctly.
root@ubuntu-guest:# grep Direct /proc/meminfo
DirectMap4k:   0 kB
DirectMap64k:   0 kB
DirectMap2M:115343360 kB
DirectMap1G:   0 kB

Before fix:
root@ubuntu-guest:# ndctl disable-namespace all
disabled 1 namespace
root@ubuntu-guest:# grep Direct /proc/meminfo
DirectMap4k:   0 kB
DirectMap64k:   0 kB
DirectMap2M:115343360 kB
DirectMap1G:   0 kB

After fix:
root@ubuntu-guest:# ndctl disable-namespace all
disabled 1 namespace
root@ubuntu-guest:# grep Direct /proc/meminfo
DirectMap4k:   0 kB
DirectMap64k:   0 kB
DirectMap2M:104857600 kB
DirectMap1G:   0 kB

Fixes: a2dc009afa9a ("powerpc/mm/book3s/radix: Add mapping statistics")
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 34 +++-
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 570add33c02d..15a099e53cde 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -743,9 +743,9 @@ static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
 }
 
 static void remove_pte_table(pte_t *pte_start, unsigned long addr,
-unsigned long end)
+unsigned long end, bool direct)
 {
-   unsigned long next;
+   unsigned long next, pages = 0;
pte_t *pte;
 
pte = pte_start + pte_index(addr);
@@ -767,13 +767,16 @@ static void remove_pte_table(pte_t *pte_start, unsigned long addr,
}
 
pte_clear(&init_mm, addr, pte);
+   pages++;
}
+   if (direct)
+   update_page_count(mmu_virtual_psize, -pages);
 }
 
 static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
-unsigned long end)
+  unsigned long end, bool direct)
 {
-   unsigned long next;
+   unsigned long next, pages = 0;
pte_t *pte_base;
pmd_t *pmd;
 
@@ -791,19 +794,22 @@ static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
continue;
}
pte_clear(&init_mm, addr, (pte_t *)pmd);
+   pages++;
continue;
}
 
pte_base = (pte_t *)pmd_page_vaddr(*pmd);
-   remove_pte_table(pte_base, addr, next);
+   remove_pte_table(pte_base, addr, next, direct);
free_pte_table(pte_base, pmd);
}
+   if (direct)
+   update_page_count(MMU_PAGE_2M, -pages);
 }
 
 static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr,
-unsigned long end)
+  unsigned long end, bool direct)
 {
-   unsigned long next;
+   unsigned long next, pages = 0;
pmd_t *pmd_base;
pud_t *pud;
 
@@ -821,16 +827,20 @@ static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr,
continue;
}
pte_clear(&init_mm, addr, (pte_t *)pud);
+   pages++;
continue;
}
 
pmd_base = pud_pgtable(*pud);
-   remove_pmd_table(pmd_base, addr, next);
+   remove_pmd_table(pmd_base, addr, next, direct);
free_pmd_table(pmd_base, pud);
}
+   if (direct)
+   update_page_count(MMU_PAGE_1G, -pages);
 }
 
-static void __meminit remove_pagetable(unsigned long start, unsigned long end)
+static void __meminit remove_pagetable(unsigned long start, unsigned long end,
+  bool direct)
 {
unsigned long addr, next;
pud_t *pud_base;
@@ -859,7 +869,7 @@ static void __meminit remove_pagetable(unsigned long start, unsigned long end)
}
 
pud_base = p4d_pgtable(*p4d);
-   remove_pud_table(pud_base, addr, next);
+   remove_pud_table(pud_base, addr, next, direct);
free_pud_table(pud_base, p4d);
}
 
@@ -882,7 +892,7 @@ int __meminit radix__create_section_mapping(unsigned long start,
 
int __meminit radix__remove_section_mapping(unsigned long start, unsigned long end)
 {
-   remove_pagetable(start, end);
+   remove_pagetable(start, end, true);
return 0;
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
@@ -918,7 +928,7 @@ int __meminit radix__vmemmap_create_mapping(unsigned long start,
 #ifdef CONFIG_MEMORY_HOTPLUG
void __meminit radix__vmemmap_remove_mapping(unsigned long start, unsigned long page_size)
 {
-   remove_pagetable(start, start + page_size);
+   remove_pagetable(start, start + page_size, false);
 }
 #endif
 #endif
-- 

[PATCH 02/16] powerpc/book3s64/mm: mmu_vmemmap_psize is used by radix

2023-06-05 Thread Aneesh Kumar K.V
This should not be within CONFIG_PPC_64S_HASH_MMU. We use mmu_vmemmap_psize
on radix while mapping the vmemmap area.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 5f8c6fbe8a69..570add33c02d 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -594,7 +594,6 @@ void __init radix__early_init_mmu(void)
 {
unsigned long lpcr;
 
-#ifdef CONFIG_PPC_64S_HASH_MMU
 #ifdef CONFIG_PPC_64K_PAGES
/* PAGE_SIZE mappings */
mmu_virtual_psize = MMU_PAGE_64K;
@@ -611,7 +610,6 @@ void __init radix__early_init_mmu(void)
mmu_vmemmap_psize = MMU_PAGE_2M;
} else
mmu_vmemmap_psize = mmu_virtual_psize;
-#endif
 #endif
/*
 * initialize page table size
-- 
2.40.1



[PATCH 01/16] powerpc/mm/book3s64: Use pmdp_ptep helper instead of typecasting.

2023-06-05 Thread Aneesh Kumar K.V
No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 2297aa764ecd..5f8c6fbe8a69 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -952,7 +952,7 @@ unsigned long radix__pmd_hugepage_update(struct mm_struct *mm, unsigned long addr,
assert_spin_locked(pmd_lockptr(mm, pmdp));
 #endif
 
-   old = radix__pte_update(mm, addr, (pte_t *)pmdp, clr, set, 1);
+   old = radix__pte_update(mm, addr, pmdp_ptep(pmdp), clr, set, 1);
trace_hugepage_update(addr, old, clr, set);
 
return old;
-- 
2.40.1



[PATCH 00/16] Add support for DAX vmemmap optimization for ppc64

2023-06-05 Thread Aneesh Kumar K.V
This patch series implements changes required to support DAX vmemmap
optimization for ppc64. The vmemmap optimization is only enabled with radix MMU
translation and 1GB PUD mapping with 64K page size. The patch series also splits
hugetlb vmemmap optimization out as a separate Kconfig variable so that
architectures can enable DAX vmemmap optimization without enabling hugetlb
vmemmap optimization. This should enable architectures like arm64 to enable DAX
vmemmap optimization while they can't enable hugetlb vmemmap optimization. More
details of the same are in patch "mm/vmemmap optimization: Split hugetlb and
devdax vmemmap optimization".

Aneesh Kumar K.V (16):
  powerpc/mm/book3s64: Use pmdp_ptep helper instead of typecasting.
  powerpc/book3s64/mm: mmu_vmemmap_psize is used by radix
  powerpc/book3s64/mm: Fix DirectMap stats in /proc/meminfo
  powerpc/book3s64/mm: Use PAGE_KERNEL instead of opencoding
  powerpc/mm/dax: Fix the condition when checking if altmap vmemap can
cross-boundary
  mm/hugepage pud: Allow arch-specific helper function to check huge
page pud support
  mm: Change pudp_huge_get_and_clear_full take vm_area_struct as arg
  mm/vmemmap: Improve vmemmap_can_optimize and allow architectures to
override
  mm/vmemmap: Allow architectures to override how vmemmap optimization
works
  mm: Add __HAVE_ARCH_PUD_SAME similar to __HAVE_ARCH_P4D_SAME
  mm/huge pud: Use transparent huge pud helpers only with
CONFIG_TRANSPARENT_HUGEPAGE
  mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization
  powerpc/book3s64/mm: Enable transparent pud hugepage
  powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap
handling function
  powerpc/book3s64/radix: Add support for vmemmap optimization for radix
  powerpc/book3s64/radix: Remove mmu_vmemmap_psize

 Documentation/mm/vmemmap_dedup.rst|   1 +
 Documentation/powerpc/vmemmap_dedup.rst   | 101 
 arch/loongarch/Kconfig|   2 +-
 arch/powerpc/Kconfig  |   1 +
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 156 -
 arch/powerpc/include/asm/book3s/64/radix.h|  47 ++
 .../include/asm/book3s/64/tlbflush-radix.h|   2 +
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   8 +
 arch/powerpc/include/asm/pgtable.h|   3 +
 arch/powerpc/mm/book3s64/pgtable.c|  78 +++
 arch/powerpc/mm/book3s64/radix_pgtable.c  | 551 --
 arch/powerpc/mm/book3s64/radix_tlb.c  |   7 +
 arch/powerpc/mm/init_64.c |  39 +-
 arch/powerpc/platforms/Kconfig.cputype|   1 +
 arch/riscv/Kconfig|   2 +-
 arch/x86/Kconfig  |   3 +-
 drivers/nvdimm/pfn_devs.c |   2 +-
 fs/Kconfig|   2 +-
 include/linux/mm.h|  32 +-
 include/linux/pgtable.h   |  11 +-
 include/trace/events/thp.h|  17 +
 mm/Kconfig|   5 +-
 mm/debug_vm_pgtable.c |   2 +-
 mm/huge_memory.c  |   2 +-
 mm/mm_init.c  |   2 +-
 mm/mremap.c   |   2 +-
 mm/sparse-vmemmap.c   |   3 +
 27 files changed, 1005 insertions(+), 77 deletions(-)
 create mode 100644 Documentation/powerpc/vmemmap_dedup.rst

-- 
2.40.1



Re: [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors

2023-06-05 Thread Grant Grundler
On Wed, May 17, 2023 at 11:11 PM Grant Grundler wrote:

> On Fri, Apr 7, 2023 at 12:46 PM Bjorn Helgaas  wrote:
> ...
> > But I don't think we need output in a single step; we just need a
> > single instance of ratelimit_state (or one for CPER path and another
> > for native AER path), and that can control all the output for a single
> > error.  E.g., print_hmi_event_info() looks like this:
> >
> >   static void print_hmi_event_info(...)
> >   {
> > static DEFINE_RATELIMIT_STATE(rs, ...);
> >
> > if (__ratelimit(&rs)) {
> >   printk("%s%s Hypervisor Maintenance interrupt ...");
> >   printk("%s Error detail: %s\n", ...);
> >   printk("%s  HMER: %016llx\n", ...);
> > }
> >   }
> >
> > I think it's nice that the struct ratelimit_state is explicit and
> > there's no danger of breaking it when adding another printk later.
>
> Since the output is spread across at least two functions, I think your
> proposal is a better solution.
>
> I'm not happy with the patch series I sent in my previous reply as an
> attachment. It's only marginally better than the original code.
>

Despite not being happy about it, after a week of vacation I now think it
would be better to include them as-is, since they solve the immediate
problems, and then to address the two issues above in additional patches. The
two changes I have prepared so far correctly fix the original issues they were
intended to fix and don't affect the new issues we've found.

I'll post a V3 of this series tonight after making sure it at least
compiles and "looks right".

cheers,
grant


> I need another day or two to see if I can implement your proposal
> correctly.
>
> cheers,
> grant
>


[PATCHv3 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors

2023-06-05 Thread Grant Grundler
From: Rajat Khandelwal 

There are many instances where correctable errors tend to inundate
the message buffer. We observe such instances during thunderbolt PCIe
tunneling.

It's true that they are mitigated by the hardware and are non-fatal
but we shouldn't be spamming the logs with such correctable errors as it
confuses other kernel developers less familiar with PCI errors, support
staff, and users who happen to look at the logs, hence rate limit them.

A typical example log inside an HP TBT4 dock:
[54912.661142] pcieport :00:07.0: AER: Multiple Corrected error received: 
:2b:00.0
[54912.661194] igc :2b:00.0: PCIe Bus Error: severity=Corrected, type=Data 
Link Layer, (Transmitter ID)
[54912.661203] igc :2b:00.0:   device [8086:5502] error 
status/mask=1100/2000
[54912.661211] igc :2b:00.0:[ 8] Rollover
[54912.661219] igc :2b:00.0:[12] Timeout
[54982.838760] pcieport :00:07.0: AER: Corrected error received: 
:2b:00.0
[54982.838798] igc :2b:00.0: PCIe Bus Error: severity=Corrected, type=Data 
Link Layer, (Transmitter ID)
[54982.838808] igc :2b:00.0:   device [8086:5502] error 
status/mask=1000/2000
[54982.838817] igc :2b:00.0:[12] Timeout

This gets repeated continuously, thus inundating the buffer.
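
For comparison with the approach below (which batches the report into one
buffer for pci_info_ratelimited()), the single-ratelimit-state shape Bjorn
suggested earlier in the thread would look roughly like this sketch (function
name invented here):

/* Sketch: one static ratelimit state gating the whole multi-line report. */
static void aer_print_correctable(struct pci_dev *dev,
				  struct aer_err_info *info)
{
	static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
				      DEFAULT_RATELIMIT_BURST);

	if (!__ratelimit(&rs))
		return;		/* drop the whole report, not single lines */

	pci_info(dev, "PCIe Bus Error: severity=Corrected, type=...\n");
	pci_info(dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
		 dev->vendor, dev->device, info->status, info->mask);
}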

Signed-off-by: Rajat Khandelwal 
Signed-off-by: Grant Grundler 
---
 drivers/pci/pcie/aer.c | 80 +++---
 1 file changed, 51 insertions(+), 29 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d7bfc6070ddb..830f5a1261c9 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -686,26 +686,36 @@ static void __aer_print_error(struct pci_dev *dev,
  struct aer_err_info *info)
 {
const char **strings;
+   char aer_msg[512];
unsigned long status = info->status & ~info->mask;
-   const char *level, *errmsg;
int i;
 
-   if (info->severity == AER_CORRECTABLE) {
-   strings = aer_correctable_error_string;
-   level = KERN_INFO;
-   } else {
-   strings = aer_uncorrectable_error_string;
-   level = KERN_ERR;
-   }
+   memset(aer_msg, 0, sizeof(aer_msg));
+   snprintf(aer_msg, sizeof(aer_msg), "aer_status: 0x%08x, aer_mask: 0x%08x\n",
+   info->status, info->mask);
+
+   strings = (info->severity == AER_CORRECTABLE) ?
+   aer_correctable_error_string : aer_uncorrectable_error_string;
 
for_each_set_bit(i, &status, 32) {
-   errmsg = strings[i];
+   const char *errmsg = strings[i];
+   char bitmsg[64];
+   memset(bitmsg, 0, sizeof(bitmsg));
+
if (!errmsg)
errmsg = "Unknown Error Bit";

-   pci_printk(level, dev, "   [%2d] %-22s%s\n", i, errmsg,
-   info->first_error == i ? " (First)" : "");
+   snprintf(bitmsg, sizeof(bitmsg), "   [%2d] %-22s%s\n", i, errmsg,
+   info->first_error == i ? " (First)" : "");
+
+   strlcat(aer_msg, bitmsg, sizeof(aer_msg));
}
+
+   if (info->severity == AER_CORRECTABLE)
+   pci_info_ratelimited(dev, "%s", aer_msg);
+   else
+   pci_err(dev, "%s", aer_msg);
+
pci_dev_aer_stats_incr(dev, info);
 }
 
@@ -713,7 +723,6 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
 {
int layer, agent;
int id = ((dev->bus->number << 8) | dev->devfn);
-   const char *level;
 
if (!info->status) {
pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, (Unregistered Agent ID)\n",
@@ -724,14 +733,19 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
layer = AER_GET_LAYER_ERROR(info->severity, info->status);
agent = AER_GET_AGENT(info->severity, info->status);
 
-   level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR;
-
-   pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
-  aer_error_severity_string[info->severity],
-  aer_error_layer[layer], aer_agent_string[agent]);
-
-   pci_printk(level, dev, "  device [%04x:%04x] error 
status/mask=%08x/%08x\n",
-  dev->vendor, dev->device, info->status, info->mask);
+   if (info->severity == AER_CORRECTABLE) {
+   pci_info_ratelimited(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n"
+   "  device [%04x:%04x] error status/mask=%08x/%08x\n",
+aer_error_severity_string[info->severity],
+aer_error_layer[layer], aer_agent_string[agent],
+dev->vendor, dev->device, info->status, info->mask);
+   } else {
+   pci_err(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",

[PATCHv3 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO

2023-06-05 Thread Grant Grundler
Since correctable errors have been corrected (and counted), the dmesg output
should not be reported as a warning, but rather as "informational".

Otherwise, using a certain well known vendor's PCIe parts in a USB4 docking
station, the dmesg buffer can be spammed with correctable errors, 717 bytes
per instance, potentially many MB per day.

Given the "WARN" priority, these messages have already confused the typical
user that stumbles across them, support staff (triaging feedback reports),
and more than a few linux kernel devs. Changing to INFO will hide these
messages from most audiences.

Signed-off-by: Grant Grundler 
---
 drivers/pci/pcie/aer.c | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index f6c24ded134c..d7bfc6070ddb 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -692,7 +692,7 @@ static void __aer_print_error(struct pci_dev *dev,
 
if (info->severity == AER_CORRECTABLE) {
strings = aer_correctable_error_string;
-   level = KERN_WARNING;
+   level = KERN_INFO;
} else {
strings = aer_uncorrectable_error_string;
level = KERN_ERR;
@@ -724,7 +724,7 @@ void aer_print_error(struct pci_dev *dev, struct aer_err_info *info)
layer = AER_GET_LAYER_ERROR(info->severity, info->status);
agent = AER_GET_AGENT(info->severity, info->status);
 
-   level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
+   level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR;
 
pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
   aer_error_severity_string[info->severity],
@@ -797,14 +797,22 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
info.mask = mask;
info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
 
-   pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
+   if (aer_severity == AER_CORRECTABLE)
+   pci_info(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
+   else
+   pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
+
__aer_print_error(dev, &info);
-   pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
-   aer_error_layer[layer], aer_agent_string[agent]);
 
-   if (aer_severity != AER_CORRECTABLE)
+   if (aer_severity == AER_CORRECTABLE) {
+   pci_info(dev, "aer_layer=%s, aer_agent=%s\n",
+   aer_error_layer[layer], aer_agent_string[agent]);
+   } else {
+   pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
+   aer_error_layer[layer], aer_agent_string[agent]);
pci_err(dev, "aer_uncor_severity: 0x%08x\n",
aer->uncor_severity);
+   }
 
if (tlp_header_valid)
__print_tlp_header(dev, &aer->header_log);
-- 
2.41.0.rc0.172.g3f132b7071-goog



[PATCH 2/2] PCI/AER: Rate limit the reporting of the correctable errors

2023-06-05 Thread Grant Grundler
From: Rajat Khandelwal 

There are many instances where correctable errors tend to inundate
the message buffer. We observe such instances during thunderbolt PCIe
tunneling.

It's true that they are mitigated by the hardware and are non-fatal
but we shouldn't be spamming the logs with such correctable errors as it
confuses other kernel developers less familiar with PCI errors, support
staff, and users who happen to look at the logs, hence rate limit them.

A typical example log inside an HP TBT4 dock:
[54912.661142] pcieport :00:07.0: AER: Multiple Corrected error received: 
:2b:00.0
[54912.661194] igc :2b:00.0: PCIe Bus Error: severity=Corrected, type=Data 
Link Layer, (Transmitter ID)
[54912.661203] igc :2b:00.0:   device [8086:5502] error 
status/mask=1100/2000
[54912.661211] igc :2b:00.0:[ 8] Rollover
[54912.661219] igc :2b:00.0:[12] Timeout
[54982.838760] pcieport :00:07.0: AER: Corrected error received: 
:2b:00.0
[54982.838798] igc :2b:00.0: PCIe Bus Error: severity=Corrected, type=Data 
Link Layer, (Transmitter ID)
[54982.838808] igc :2b:00.0:   device [8086:5502] error 
status/mask=1000/2000
[54982.838817] igc :2b:00.0:[12] Timeout

This gets repeated continuously, thus inundating the buffer.

Signed-off-by: Rajat Khandelwal 
Signed-off-by: Grant Grundler 
---
 drivers/pci/pcie/aer.c | 80 +++---
 1 file changed, 51 insertions(+), 29 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d7bfc6070ddb..830f5a1261c9 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -686,26 +686,36 @@ static void __aer_print_error(struct pci_dev *dev,
  struct aer_err_info *info)
 {
const char **strings;
+   char aer_msg[512];
unsigned long status = info->status & ~info->mask;
-   const char *level, *errmsg;
int i;
 
-   if (info->severity == AER_CORRECTABLE) {
-   strings = aer_correctable_error_string;
-   level = KERN_INFO;
-   } else {
-   strings = aer_uncorrectable_error_string;
-   level = KERN_ERR;
-   }
+   memset(aer_msg, 0, sizeof(*aer_msg));
+   snprintf(aer_msg, sizeof(*aer_msg), "aer_status: 0x%08x, aer_mask: 
0x%08x\n",
+   info->status, info->mask);
+
+   strings = (info->severity == AER_CORRECTABLE) ?
+   aer_correctable_error_string : aer_uncorrectable_error_string;
 
for_each_set_bit(i, , 32) {
-   errmsg = strings[i];
+   const char *errmsg = strings[i];
+   char bitmsg[64];
+   memset(bitmsg, 0, sizeof(*bitmsg));
+
if (!errmsg)
errmsg = "Unknown Error Bit";
 
-   pci_printk(level, dev, "   [%2d] %-22s%s\n", i, errmsg,
-   info->first_error == i ? " (First)" : "");
+   snprintf(bitmsg, sizeof(*bitmsg), "   [%2d] %-22s%s\n", i, 
errmsg,
+   info->first_error == i ? " (First)" : "");
+
+   strlcat(aer_msg, bitmsg, sizeof(*aer_msg));
}
+
+   if (info->severity == AER_CORRECTABLE)
+   pci_info_ratelimited(dev, "%s", aer_msg);
+   else
+   pci_err(dev, "%s", aer_msg):
+
pci_dev_aer_stats_incr(dev, info);
 }
 
@@ -713,7 +723,6 @@ void aer_print_error(struct pci_dev *dev, struct 
aer_err_info *info)
 {
int layer, agent;
int id = ((dev->bus->number << 8) | dev->devfn);
-   const char *level;
 
if (!info->status) {
pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, 
(Unregistered Agent ID)\n",
@@ -724,14 +733,19 @@ void aer_print_error(struct pci_dev *dev, struct 
aer_err_info *info)
layer = AER_GET_LAYER_ERROR(info->severity, info->status);
agent = AER_GET_AGENT(info->severity, info->status);
 
-   level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR;
-
-   pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
-  aer_error_severity_string[info->severity],
-  aer_error_layer[layer], aer_agent_string[agent]);
-
-   pci_printk(level, dev, "  device [%04x:%04x] error status/mask=%08x/%08x\n",
-  dev->vendor, dev->device, info->status, info->mask);
+   if (info->severity == AER_CORRECTABLE) {
+   pci_info_ratelimited(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n"
+"  device [%04x:%04x] error status/mask=%08x/%08x\n",
+aer_error_severity_string[info->severity],
+aer_error_layer[layer], aer_agent_string[agent],
+dev->vendor, dev->device, info->status, info->mask);
+   } else {
+   pci_err(dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
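
For reference, the throttling primitive the (truncated) patch above relies
on, pci_info_ratelimited(), is a thin wrapper that gives every call site
its own ratelimit_state -- approximately:

/* Paraphrased from include/linux/pci.h and include/linux/dev_printk.h:
 * each expansion of dev_info_ratelimited() declares its own static
 * DEFINE_RATELIMIT_STATE, so every print site throttles independently --
 * the behaviour questioned later in this thread. */
#define pci_info_ratelimited(pdev, fmt, arg...) \
	dev_info_ratelimited(&(pdev)->dev, fmt, ##arg)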

[PATCH 1/2] PCI/AER: correctable error message as KERN_INFO

2023-06-05 Thread Grant Grundler
Since correctable errors have been corrected (and counted), the dmesg output
should not be reported as a warning, but rather as "informational".

Otherwise, using a certain well known vendor's PCIe parts in a USB4 docking
station, the dmesg buffer can be spammed with correctable errors, 717 bytes
per instance, potentially many MB per day.

Given the "WARN" priority, these messages have already confused the typical
user that stumbles across them, support staff (triaging feedback reports),
and more than a few Linux kernel devs. Changing to INFO will hide these
messages from most audiences.

Signed-off-by: Grant Grundler 
---
 drivers/pci/pcie/aer.c | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index f6c24ded134c..d7bfc6070ddb 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -692,7 +692,7 @@ static void __aer_print_error(struct pci_dev *dev,
 
if (info->severity == AER_CORRECTABLE) {
strings = aer_correctable_error_string;
-   level = KERN_WARNING;
+   level = KERN_INFO;
} else {
strings = aer_uncorrectable_error_string;
level = KERN_ERR;
@@ -724,7 +724,7 @@ void aer_print_error(struct pci_dev *dev, struct 
aer_err_info *info)
layer = AER_GET_LAYER_ERROR(info->severity, info->status);
agent = AER_GET_AGENT(info->severity, info->status);
 
-   level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
+   level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR;
 
pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
   aer_error_severity_string[info->severity],
@@ -797,14 +797,22 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
info.mask = mask;
info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
 
-   pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
+   if (aer_severity == AER_CORRECTABLE)
+   pci_info(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, 
mask);
+   else
+   pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, 
mask);
+
__aer_print_error(dev, &info);
-   pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
-   aer_error_layer[layer], aer_agent_string[agent]);
 
-   if (aer_severity != AER_CORRECTABLE)
+   if (aer_severity == AER_CORRECTABLE) {
+   pci_info(dev, "aer_layer=%s, aer_agent=%s\n",
+   aer_error_layer[layer], aer_agent_string[agent]);
+   } else {
+   pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
+   aer_error_layer[layer], aer_agent_string[agent]);
pci_err(dev, "aer_uncor_severity: 0x%08x\n",
aer->uncor_severity);
+   }
 
if (tlp_header_valid)
__print_tlp_header(dev, &aer->header_log);
-- 
2.41.0.rc0.172.g3f132b7071-goog



Re: [PATCH 06/12] sparc: add pte_free_defer() for pgtables sharing page

2023-06-05 Thread Hugh Dickins
On Sun, 28 May 2023, Hugh Dickins wrote:

> Add sparc-specific pte_free_defer(), to call pte_free() via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon.  This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.

sparc32 supports pagetables sharing a page, but does not support THP;
sparc64 supports THP, but does not support pagetables sharing a page.
So the sparc-specific pte_free_defer() is as simple as the generic one,
except for converting between pgtable_t (a pte_t * there) and struct page *.
The patch should be fine as posted (except its title is misleading).

> 
> Signed-off-by: Hugh Dickins 
> ---
>  arch/sparc/include/asm/pgalloc_64.h |  4 
>  arch/sparc/mm/init_64.c | 16 
>  2 files changed, 20 insertions(+)
> 
> diff --git a/arch/sparc/include/asm/pgalloc_64.h 
> b/arch/sparc/include/asm/pgalloc_64.h
> index 7b5561d17ab1..caa7632be4c2 100644
> --- a/arch/sparc/include/asm/pgalloc_64.h
> +++ b/arch/sparc/include/asm/pgalloc_64.h
> @@ -65,6 +65,10 @@ pgtable_t pte_alloc_one(struct mm_struct *mm);
>  void pte_free_kernel(struct mm_struct *mm, pte_t *pte);
>  void pte_free(struct mm_struct *mm, pgtable_t ptepage);
>  
> +/* arch use pte_free_defer() implementation in arch/sparc/mm/init_64.c */
> +#define pte_free_defer pte_free_defer
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
> +
>  #define pmd_populate_kernel(MM, PMD, PTE)	pmd_set(MM, PMD, PTE)
>  #define pmd_populate(MM, PMD, PTE)   pmd_set(MM, PMD, PTE)
>  
> diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
> index 04f9db0c3111..b7c6aa085ef6 100644
> --- a/arch/sparc/mm/init_64.c
> +++ b/arch/sparc/mm/init_64.c
> @@ -2930,6 +2930,22 @@ void pgtable_free(void *table, bool is_page)
>  }
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static void pte_free_now(struct rcu_head *head)
> +{
> + struct page *page;
> +
> + page = container_of(head, struct page, rcu_head);
> + __pte_free((pgtable_t)page_to_virt(page));
> +}
> +
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> + struct page *page;
> +
> + page = virt_to_page(pgtable);
> +	call_rcu(&page->rcu_head, pte_free_now);
> +}
> +
>  void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
> pmd_t *pmd)
>  {
> -- 
> 2.35.3
> 
> 


Re: [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors

2023-06-05 Thread Grant Grundler
[plain text only this time...]

On Wed, May 17, 2023 at 11:11 PM Grant Grundler  wrote:
>
> On Fri, Apr 7, 2023 at 12:46 PM Bjorn Helgaas  wrote:
> ...
> > But I don't think we need output in a single step; we just need a
> > single instance of ratelimit_state (or one for CPER path and another
> > for native AER path), and that can control all the output for a single
> > error.  E.g., print_hmi_event_info() looks like this:
> >
> >   static void print_hmi_event_info(...)
> >   {
> > static DEFINE_RATELIMIT_STATE(rs, ...);
> >
> > if (__ratelimit()) {
> >   printk("%s%s Hypervisor Maintenance interrupt ...");
> >   printk("%s Error detail: %s\n", ...);
> >   printk("%s  HMER: %016llx\n", ...);
> > }
> >   }
> >
> > I think it's nice that the struct ratelimit_state is explicit and
> > there's no danger of breaking it when adding another printk later.
>
> Since the output is spread across at least two functions, I think your
> proposal is a better solution.
>
> I'm not happy with the patch series I sent in my previous reply as an
> attachment. It's only marginally better than the original code.

Despite not being happy about it, after a week of vacation I now think
it would be better to include them as is since they solve the
immediate problems and then solve the above two issues in additional
patches. The two changes I have prepared so far correctly fix the
original issues they intended to fix and don't affect the new issues
we've found.

I'll post a V3 of this series tonight after making sure it at least
compiles and "looks right".

cheers,
grant

>
> I need another day or two to see if I can implement your proposal correctly.
>
> cheers,
> grant
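
For reference, the single-ratelimit_state scheme Bjorn suggests above might
be sketched like this for the AER path -- the placement and the default
interval/burst values are assumptions here, not taken from any posted patch:

static void aer_print_error_ratelimited(struct pci_dev *dev,
					struct aer_err_info *info)
{
	/* One state covers every line of one error report, so a report
	 * is emitted whole or suppressed whole. */
	static DEFINE_RATELIMIT_STATE(rs, DEFAULT_RATELIMIT_INTERVAL,
				      DEFAULT_RATELIMIT_BURST);

	if (info->severity == AER_CORRECTABLE && !__ratelimit(&rs))
		return;

	/* ... the existing pci_printk()/__aer_print_error() output ... */
}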


Re: [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page

2023-06-05 Thread Hugh Dickins
On Fri, 2 Jun 2023, Jason Gunthorpe wrote:
> On Mon, May 29, 2023 at 03:02:02PM +0100, Matthew Wilcox wrote:
> > On Sun, May 28, 2023 at 11:20:21PM -0700, Hugh Dickins wrote:
> > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > > +{
> > > + struct page *page;
> > > +
> > > + page = virt_to_page(pgtable);
> > > + call_rcu(>rcu_head, pte_free_now);
> > > +}
> > 
> > This can't be safe (on ppc).  IIRC you might have up to 16x4k page
> > tables sharing one 64kB page.  So if you have two page tables from the
> > same page being defer-freed simultaneously, you'll reuse the rcu_head
> > and I cannot imagine things go well from that point.
> > 
> > I have no idea how to solve this problem.
> 
> Maybe power and s390 should allocate a side structure, sort of a
> pre-memdesc thing to store enough extra data?
> 
> If we can get enough bytes then something like this would let a single
> rcu head be shared to manage the free bits.
> 
> struct 64k_page {
> u8 free_pages;
> u8 pending_rcu_free_pages;
> struct rcu_head head;
> }
> 
> free_sub_page(sub_id)
> if (atomic_fetch_or(1 << sub_id, &64k_page->pending_rcu_free_pages))
>  call_rcu(&64k_page->head)
> 
> rcu_func()
>64k_page->free_pages |= atomic_xchg(0, &64k_page->pending_rcu_free_pages)
> 
>if (64k_pages->free_pages == all_ones)
>   free_pgea(64k_page);

Or simply allocate as many rcu_heads as page tables.

I have not thought through your suggestion above, because I'm against
asking s390, or any other architecture, to degrade its page table
implementation by demanding more memory, just for the sake of my patch
series.  In a future memdesc world it might turn out to be reasonable,
but not for this (if I can possibly avoid it).

Below is what I believe to be the correct powerpc patch (built but not
retested).  sparc I thought was going to be an equal problem, but turns
out not: I'll comment on 06/12.  And let's move s390 discussion to 07/12.

[PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page

Add powerpc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This is awkward because the struct page contains only one rcu_head, but
that page may be shared between PTE_FRAG_NR pagetables, each wanting to
use the rcu_head at the same time: account concurrent deferrals with a
heightened refcount, only the first making use of the rcu_head, but
re-deferring if more deferrals arrived during its grace period.

Signed-off-by: Hugh Dickins 
---
 arch/powerpc/include/asm/pgalloc.h |  4 +++
 arch/powerpc/mm/pgtable-frag.c | 51 ++
 2 files changed, 55 insertions(+)

diff --git a/arch/powerpc/include/asm/pgalloc.h 
b/arch/powerpc/include/asm/pgalloc.h
index 3360cad78ace..3a971e2a8c73 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -45,6 +45,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t 
ptepage)
pte_fragment_free((unsigned long *)ptepage, 0);
 }
 
+/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c */
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /*
  * Functions that deal with pagetables that could be at any level of
  * the table need to be passed an "index_size" so they know how to
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..e4f58c5fc2ac 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -120,3 +120,54 @@ void pte_fragment_free(unsigned long *table, int kernel)
__free_page(page);
}
 }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define PTE_FREE_DEFERRED 0x10000 /* beyond any PTE_FRAG_NR */
+
+static void pte_free_now(struct rcu_head *head)
+{
+   struct page *page;
+   int refcount;
+
+   page = container_of(head, struct page, rcu_head);
+   refcount = atomic_sub_return(PTE_FREE_DEFERRED - 1,
+&page->pt_frag_refcount);
+   if (refcount < PTE_FREE_DEFERRED) {
+   pte_fragment_free((unsigned long *)page_address(page), 0);
+   return;
+   }
+   /*
+* One page may be shared between PTE_FRAG_NR pagetables.
+* At least one more call to pte_free_defer() came in while we
+* were already deferring, so the free must be deferred again;
+* but just for one grace period, however many calls came in.
+*/
+   while (refcount >= PTE_FREE_DEFERRED + PTE_FREE_DEFERRED) {
+   refcount = atomic_sub_return(PTE_FREE_DEFERRED,
+&page->pt_frag_refcount);
+   }
+   /* Remove that refcount of 1 left for fragment freeing above */
+   
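
The archived message is truncated here.  Going by the changelog above
(each deferral adds PTE_FREE_DEFERRED to pt_frag_refcount, and only the
first queues the rcu_head), the missing tail would plausibly read
something like the following -- a reconstruction, not Hugh's verbatim
patch:

+	atomic_dec(&page->pt_frag_refcount);
+	call_rcu(&page->rcu_head, pte_free_now);
+}
+
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
+{
+	struct page *page;
+
+	page = virt_to_page(pgtable);
+	/* Only the first deferral in a grace period may use the rcu_head */
+	if (atomic_add_return(PTE_FREE_DEFERRED, &page->pt_frag_refcount) <
+			PTE_FREE_DEFERRED + PTE_FREE_DEFERRED)
+		call_rcu(&page->rcu_head, pte_free_now);
+}
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */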

Re: [PATCH 00/89] i2c: Convert to platform remove callback returning void

2023-06-05 Thread Wolfram Sang
On Thu, Jun 01, 2023 at 03:54:50PM +0200, Wolfram Sang wrote:
> 
> > I wonder how this series will go in. My expectation was that Wolfram
> > picks up the whole series via his tree?!
> 
> Will do. I am currently super-busy, though.

Whole series applied to for-next. I squashed all the commits into one.
These are mostly simple changes which we won't revert anyhow, but fix
incrementally if we ever find an issue.



signature.asc
Description: PGP signature


Re: [PATCH v4 RESEND 0/3] sed-opal: keyrings, discovery, revert, key store

2023-06-05 Thread Jens Axboe
On 6/1/23 4:37 PM, gjo...@linux.vnet.ibm.com wrote:
> From: Greg Joyce 
> 
> This patchset has gone through numerous rounds of review and
> all comments/suggestions have been addressed. I believe that
> this patchset is ready for inclusion.
> 
> TCG SED Opal is a specification from The Trusted Computing Group
> that allows self encrypting storage devices (SED) to be locked at
> power on and require an authentication key to unlock the drive.
> 
> The current SED Opal implementation in the block driver
> requires that authentication keys be provided in an ioctl
> so that they can be presented to the underlying SED
> capable drive. Currently, the key is typically entered by
> a user with an application like sedutil or sedcli. While
> this process works, it does not lend itself to automation
> like unlock by a udev rule.
> 
> The SED block driver has been extended so it can alternatively
> obtain a key from a sed-opal kernel keyring. The SED ioctls
> will indicate the source of the key, either directly in the
> ioctl data or from the keyring.
> 
> Two new SED ioctls have also been added. These are:
>   1) IOC_OPAL_REVERT_LSP to revert LSP state
>   2) IOC_OPAL_DISCOVERY to discover drive capabilities/state
> 
> change log v4:
> - rebase to 6.3-rc7
>   - replaced "255" magic number with U8_MAX

None of this applies for for-6.5/block, and I'm a little puzzled
as to why you'd rebase to an old kernel rather than a 6.4-rc at
least?

Please resend one that is current.

-- 
Jens Axboe




Re: [PATCH 00/13] mm: jit/text allocator

2023-06-05 Thread Kent Overstreet
On Mon, Jun 05, 2023 at 12:20:40PM +0300, Mike Rapoport wrote:
> On Fri, Jun 02, 2023 at 10:35:09AM +0100, Mark Rutland wrote:
> > On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote:
> > > On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote:
> > > > For a while I have wanted to give kprobes its own allocator so that it 
> > > > can work
> > > > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA 
> > > > space in
> > > > the modules area.
> > > > 
> > > > Given that, I think these should have their own allocator functions 
> > > > that can be
> > > > provided independently, even if those happen to use common 
> > > > infrastructure.
> > > 
> > > How much memory can kprobes conceivably use? I think we also want to try
> > > to push back on combinatorial new allocators, if we can.
> > 
> > That depends on who's using it, and how (e.g. via BPF).
> > 
> > To be clear, I'm not necessarily asking for entirely different allocators, 
> > but
> > I do think that we want wrappers that can at least pass distinct start+end
> > parameters to a common allocator, and for arm64's modules code I'd expect 
> > that
> > we'd keep the range fallback logic out of the common allocator, and just 
> > call
> > it twice.
> > 
> > > > > Several architectures override module_alloc() because of various
> > > > > constraints where the executable memory can be located and this causes
> > > > > additional obstacles for improvements of code allocation.
> > > > > 
> > > > > This set splits code allocation from modules by introducing
> > > > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> > > > > sites of module_alloc() and module_memfree() with the new APIs and
> > > > > implements core text and related allocation in a central place.
> > > > > 
> > > > > Instead of architecture specific overrides for module_alloc(), the
> > > > > architectures that require non-default behaviour for text allocation 
> > > > > must
> > > > > fill jit_alloc_params structure and implement jit_alloc_arch_params() 
> > > > > that
> > > > > returns a pointer to that structure. If an architecture does not 
> > > > > implement
> > > > > jit_alloc_arch_params(), the defaults compatible with the current
> > > > > modules::module_alloc() are used.
> > > > 
> > > > As above, I suspect that each of the callsites should probably be using 
> > > > common
> > > > infrastructure, but I don't think that a single jit_alloc_arch_params() 
> > > > makes
> > > > sense, since the parameters for each case may need to be distinct.
> > > 
> > > I don't see how that follows. The whole point of function parameters is
> > > that they may be different :)
> > 
> > What I mean is that jit_alloc_arch_params() tries to aggregate common
> > parameters, but they aren't actually common (e.g. the actual start+end range
> > for allocation).
> 
> jit_alloc_arch_params() tries to aggregate architecture constraints and
> requirements for allocations of executable memory and this exactly what
> the first 6 patches of this set do.
> 
> A while ago Thomas suggested to use a structure that parametrizes
> architecture constraints by the memory type used in modules [1] and Song
> implemented the infrastructure for it and x86 part [2].
> 
> I liked the idea of defining parameters in a single structure, but I
> thought that approaching the problem from the arch side rather than from
> modules perspective will be better starting point, hence these patches.
> 
> I don't see a fundamental reason why a single structure cannot describe
> what is needed for different code allocation cases, be it modules, kprobes
> or bpf. There is of course an assumption that the core allocations will be
> the same for all the users, and it seems to me that something like 
> 
> * allocate physical memory if allocator caches are empty
> * map it in vmalloc or modules address space
> * return memory from the allocator cache to the caller
> 
> will work for all usecases.
> 
> We might need separate caches for different cases on different
> architectures, and a way to specify what cache should be used in the
> allocator API, but that does not contradict a single structure for arch
> specific parameters, but only makes it more elaborate, e.g. something like
> 
> enum jit_type {
>   JIT_MODULES_TEXT,
>   JIT_MODULES_DATA,
>   JIT_KPROBES,
>   JIT_FTRACE,
>   JIT_BPF,
>   JIT_TYPE_MAX,
> };

Why would we actually need different enums for modules_text, kprobes,
ftrace and bpf? Why can't we treat all text allocations the same?

The reason we can't do that currently is because modules need to go in a
128Mb region on some archs, and without sub page allocation
bpf/kprobes/etc. burn a full page for each allocation. But we're doing
sub page allocation - right?

That leaves module data - which really needs to be split out into rw,
ro, ro_after_init - but I'm not sure we'd even want the same API for
those, they need fairly different page permissions 
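
Mike's three-step flow quoted above (refill from fresh mappings when the
cache is empty, otherwise hand out cached chunks) reduces to a small
free-list pattern.  A self-contained userspace sketch, with all names and
sizes purely illustrative:

#include <stdlib.h>

#define CHUNK_SZ  4096
#define CACHE_MAX 64

struct jit_cache { void *slots[CACHE_MAX]; int n; };

/* allocate: refill the cache from the system when empty, then pop */
static void *jit_alloc(struct jit_cache *c)
{
	if (c->n == 0) {
		for (int i = 0; i < 8 && c->n < CACHE_MAX; i++) {
			void *p = malloc(CHUNK_SZ); /* kernel: grab + map pages */
			if (p)
				c->slots[c->n++] = p;
		}
		if (c->n == 0)
			return NULL;
	}
	return c->slots[--c->n];
}

/* free: push the chunk back into the cache, or release it */
static void jit_free(struct jit_cache *c, void *p)
{
	if (c->n < CACHE_MAX)
		c->slots[c->n++] = p;
	else
		free(p);
}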

Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX

2023-06-05 Thread Nadav Amit



> On Jun 5, 2023, at 9:10 AM, Edgecombe, Rick P  
> wrote:
> 
> On Mon, 2023-06-05 at 11:11 +0300, Mike Rapoport wrote:
>> On Sun, Jun 04, 2023 at 10:52:44PM -0400, Steven Rostedt wrote:
>>> On Thu, 1 Jun 2023 16:54:36 -0700
>>> Nadav Amit  wrote:
>>> 
>>>>> The way text_poke() is used here, it is creating a new writable
>>>>> alias and flushing it for *each* write to the module (like for each
>>>>> write of an individual relocation, etc). I was just thinking it might
>>>>> warrant some batching or something.  
>> 
>>>> I am not advocating to do so, but if you want to have many efficient
>>>> writes, perhaps you can just disable CR0.WP. Just saying that if you
>>>> are about to write all over the memory, text_poke() does not provide
>>>> too much security for the poking thread.
>> 
>> Heh, this is definitely an easier hack to implement :)
> 
> I don't know the details, but previously there was some strong dislike
> of CR0.WP toggling. And now there is also the problem of CET. Setting
> CR0.WP=0 will #GP if CR4.CET is 1 (as it currently is for kernel IBT).
> I guess you might get away with toggling them both in some controlled
situation, but it might be a lot easier to hack up than to be made
> fully acceptable. It does sound much more efficient though.

Thanks for highlighting this issue. I understand the limitations of
CR0.WP. There is also always the concerns that without CET or other
control flow integrity mechanism, someone would abuse (using ROP/JOP)
functions that clear CR0.WP…



Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX

2023-06-05 Thread Edgecombe, Rick P
On Mon, 2023-06-05 at 23:42 +0300, Mike Rapoport wrote:
> > I tried this technique previously [0], and I thought it was not too
> > bad. In most of the callers it looks similar to what you have in
> > do_text_poke(). Sometimes less, sometimes more. It might need
> > enlightening of some of the stuff currently using text_poke()
> > during
> > module loading, like jump labels. So that bit is more intrusive,
> > yea.
> > But it sounds so much cleaner and well controlled. Did you have a
> > particular trouble spot in mind?
> 
> Nothing in particular, except the intrusive part. Except the changes
> in
> modules.c we'd need to teach alternatives to deal with a writable
> copy.

I didn't think alternatives piece looked too bad on the caller side (if
that's what you meant):
https://lore.kernel.org/lkml/20201120202426.18009-7-rick.p.edgeco...@intel.com/

The ugly part was in the (poorly named) module_adjust_writable_addr():

+static inline void *module_adjust_writable_addr(void *addr)
+{
+   unsigned long laddr = (unsigned long)addr;
+   struct module *mod;
+
+   mutex_lock(&module_mutex);
+   mod = __module_address(laddr);
+   if (!mod) {
+   mutex_unlock(&module_mutex);
+   return addr;
+   }
+   mutex_unlock(&module_mutex);
+   /* The module shouldn't be going away if someone is trying to write to it */
+
+   return (void *)perm_writable_addr(module_get_allocation(mod, laddr), laddr);
+}
+

It took module_mutex and looked up the module in order to find the
writable buffer from just the executable address. Basically all the
loading code external to modules had to go through that interface. But
now I'm wondering what I was thinking, it seems this could just be an
RCU read lock. That doesn't seem too bad...
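
A sketch of that RCU variant -- assuming __module_address() stays safe
under an RCU-sched read-side critical section, and keeping the
hypothetical perm_writable_addr()/module_get_allocation() helpers from
the linked RFC:

static inline void *module_adjust_writable_addr(void *addr)
{
	unsigned long laddr = (unsigned long)addr;
	struct module *mod;

	rcu_read_lock_sched();
	mod = __module_address(laddr);
	rcu_read_unlock_sched();
	if (!mod)
		return addr;

	/* The module shouldn't be going away while someone writes to it */
	return (void *)perm_writable_addr(module_get_allocation(mod, laddr),
					  laddr);
}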


Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX

2023-06-05 Thread Mike Rapoport
On Mon, Jun 05, 2023 at 04:10:21PM +, Edgecombe, Rick P wrote:
> On Mon, 2023-06-05 at 11:11 +0300, Mike Rapoport wrote:
> > On Sun, Jun 04, 2023 at 10:52:44PM -0400, Steven Rostedt wrote:
> > > On Thu, 1 Jun 2023 16:54:36 -0700
> > > Nadav Amit  wrote:
> > > 
> > > > > The way text_poke() is used here, it is creating a new writable
> > > > > alias
> > > > > and flushing it for *each* write to the module (like for each
> > > > > write of
> > > > > an individual relocation, etc). I was just thinking it might
> > > > > warrant
> > > > > some batching or something.  
> > 
> > > > I am not advocating to do so, but if you want to have many
> > > > efficient
> > > > writes, perhaps you can just disable CR0.WP. Just saying that if
> > > > you
> > > > are about to write all over the memory, text_poke() does not
> > > > provide
> > > > too much security for the poking thread.
> > 
> > Heh, this is definitely an easier hack to implement :)
> 
> I don't know the details, but previously there was some strong dislike
> of CR0.WP toggling. And now there is also the problem of CET. Setting
> CR0.WP=0 will #GP if CR4.CET is 1 (as it currently is for kernel IBT).
> I guess you might get away with toggling them both in some controlled
> situation, but it might be a lot easier to hack up than to be made
> fully acceptable. It does sound much more efficient though.
 
I don't think we'd really want that, especially looking at 

WARN_ONCE(bits_missing, "CR0 WP bit went missing!?\n");

at native_write_cr0().
 
> > > Batching does exist, which is what the text_poke_queue() thing
> > > does.
> > 
> > For module loading text_poke_queue() will still be much slower than a
> > bunch
> > of memset()s for no good reason because we don't need all the
> > complexity of
> > text_poke_bp_batch() for module initialization because we are sure we
> > are
> > not patching live code.
> > 
> > What we'd need here is a new batching mode that will create a
> > writable
> > alias mapping at the beginning of apply_relocate_*() and
> > module_finalize(),
> > then it will use memcpy() to that writable alias and will tear the
> > mapping
> > down in the end.
> 
> It's probably only a tiny bit faster than keeping a separate writable
> allocation and text_poking it in at the end.

Right, but it still will be faster than text_poking every relocation.
 
> > Another option is to teach alternatives to update a writable copy
> > rather
> > than do in place changes like Song suggested. My feeling is that it
> > will be
> > more intrusive change though.
> 
> You mean keeping a separate RW allocation and then text_poking() the
> whole thing in when you are done? That is what I was trying to say at
> the beginning of this thread. The other benefit is you don't make the
> intermediate loading states of the module, executable.
> 
> I tried this technique previously [0], and I thought it was not too
> bad. In most of the callers it looks similar to what you have in
> do_text_poke(). Sometimes less, sometimes more. It might need
> enlightening of some of the stuff currently using text_poke() during
> module loading, like jump labels. So that bit is more intrusive, yea.
> But it sounds so much cleaner and well controlled. Did you have a
> particular trouble spot in mind?

Nothing in particular, except the intrusive part. Except the changes in
modules.c we'd need to teach alternatives to deal with a writable copy.
 
> [0]
> https://lore.kernel.org/lkml/20201120202426.18009-5-rick.p.edgeco...@intel.com/

-- 
Sincerely yours,
Mike.
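
The batching mode described above (map one writable alias per pass, do
plain memcpy() writes through it, tear it down once) might be used roughly
like this around relocation processing -- all names here are hypothetical;
this only illustrates the shape:

static void apply_relocs_batched(void *mod_mem, size_t size,
				 const struct reloc *rel, size_t nrels)
{
	void *alias = text_poke_batch_start(mod_mem, size); /* map RW alias once */
	size_t i;

	for (i = 0; i < nrels; i++)	/* many cheap ordinary writes */
		memcpy(alias + rel[i].offset, rel[i].bytes, rel[i].len);

	text_poke_batch_end(alias, size);	/* flush + unmap once */
}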


[PATCH] powerpc/iommu: Only build sPAPR access functions on pSeries and PowerNV

2023-06-05 Thread Timothy Pearson

A build failure with CONFIG_HAVE_PCI=y set without PSERIES or POWERNV
set was caught by the random configuration checker.  Guard the sPAPR
specific IOMMU functions on CONFIG_PPC_PSERIES || CONFIG_PPC_POWERNV.

Signed-off-by: Timothy Pearson 
---
 arch/powerpc/kernel/iommu.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 67f0b01e6ff5..c52449ae6936 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1090,6 +1090,7 @@ void iommu_tce_kill(struct iommu_table *tbl,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_kill);
 
+#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
 static int iommu_take_ownership(struct iommu_table *tbl)
 {
unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
@@ -1140,6 +1141,7 @@ static void iommu_release_ownership(struct iommu_table 
*tbl)
spin_unlock(&tbl->pools[i].lock);
spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 }
+#endif
 
 int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
 {
@@ -1171,6 +1173,7 @@ int iommu_add_device(struct iommu_table_group 
*table_group, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_add_device);
 
+#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
 /*
  * A simple iommu_table_group_ops which only allows reusing the existing
  * iommu_table. This handles VFIO for POWER7 or the nested KVM.
@@ -1398,5 +1401,6 @@ static int __init 
spapr_tce_setup_phb_iommus_initcall(void)
return 0;
 }
 postcore_initcall_sync(spapr_tce_setup_phb_iommus_initcall);
+#endif
 
 #endif /* CONFIG_IOMMU_API */
-- 
2.30.2



Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX

2023-06-05 Thread Edgecombe, Rick P
On Mon, 2023-06-05 at 11:11 +0300, Mike Rapoport wrote:
> On Sun, Jun 04, 2023 at 10:52:44PM -0400, Steven Rostedt wrote:
> > On Thu, 1 Jun 2023 16:54:36 -0700
> > Nadav Amit  wrote:
> > 
> > > > The way text_poke() is used here, it is creating a new writable
> > > > alias
> > > > and flushing it for *each* write to the module (like for each
> > > > write of
> > > > an individual relocation, etc). I was just thinking it might
> > > > warrant
> > > > some batching or something.  
> 
> > > I am not advocating to do so, but if you want to have many
> > > efficient
> > > writes, perhaps you can just disable CR0.WP. Just saying that if
> > > you
> > > are about to write all over the memory, text_poke() does not
> > > provide
> > > too much security for the poking thread.
> 
> Heh, this is definitely an easier hack to implement :)

I don't know the details, but previously there was some strong dislike
of CR0.WP toggling. And now there is also the problem of CET. Setting
CR0.WP=0 will #GP if CR4.CET is 1 (as it currently is for kernel IBT).
I guess you might get away with toggling them both in some controlled
situation, but it might be a lot easier to hack up than to be made
fully acceptable. It does sound much more efficient though.

> 
> > Batching does exist, which is what the text_poke_queue() thing
> > does.
> 
> For module loading text_poke_queue() will still be much slower than a
> bunch
> of memset()s for no good reason because we don't need all the
> complexity of
> text_poke_bp_batch() for module initialization because we are sure we
> are
> not patching live code.
> 
> What we'd need here is a new batching mode that will create a
> writable
> alias mapping at the beginning of apply_relocate_*() and
> module_finalize(),
> then it will use memcpy() to that writable alias and will tear the
> mapping
> down in the end.

It's probably only a tiny bit faster than keeping a separate writable
allocation and text_poking it in at the end.

> 
> Another option is to teach alternatives to update a writable copy
> rather
> than do in place changes like Song suggested. My feeling is that it
> will be
> more intrusive change though.

You mean keeping a separate RW allocation and then text_poking() the
whole thing in when you are done? That is what I was trying to say at
the beginning of this thread. The other benefit is you don't make the
intermediate loading states of the module, executable.

I tried this technique previously [0], and I thought it was not too
bad. In most of the callers it looks similar to what you have in
do_text_poke(). Sometimes less, sometimes more. It might need
enlightening of some of the stuff currently using text_poke() during
module loading, like jump labels. So that bit is more intrusive, yea.
But it sounds so much cleaner and well controlled. Did you have a
particular trouble spot in mind?


[0]
https://lore.kernel.org/lkml/20201120202426.18009-5-rick.p.edgeco...@intel.com/


Re: [PATCH v3 2/2] serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE

2023-06-05 Thread Ilpo Järvinen
On Mon, 5 Jun 2023, Uwe Kleine-König wrote:

> Hello Ilpo,
> 
> On Mon, Jun 05, 2023 at 04:44:08PM +0300, Ilpo Järvinen wrote:
> > On Mon, 5 Jun 2023, Uwe Kleine-König wrote:
> > > On Mon, Jun 05, 2023 at 04:22:55PM +0300, Ilpo Järvinen wrote:
> > > > On Mon, 5 Jun 2023, Uwe Kleine-König wrote:
> > > > 
> > > > > The need to handle the FSL variant of 8250 in a special way is also
> > > > > present without console support. So soften the dependency for
> > > > > SERIAL_8250_FSL accordingly. Note that with the 8250 driver compiled 
> > > > > as
> > > > > a module, some devices still might not make use of the needed
> > > > > workarounds. That affects the ports instantiated in
> > > > > arch/powerpc/kernel/legacy_serial.c.
> > > > > 
> > > > > This issue was identified by Dominik Andreas Schorpp.
> > > > > 
> > > > > To cope for CONFIG_SERIAL_8250=m + CONFIG_SERIAL_8250_FSL=y, 
> > > > > 8250_fsl.o
> > > > > must be put in the same compilation unit as 8250_port.o because the
> > > > > latter defines some functions needed in the former and so 8250_fsl.o
> > > > > must not be built-in if 8250_port.o is available in a module.
> > > > > 
> > > > > Acked-by: Ilpo Järvinen 
> > > > > Link: 
> > > > > https://lore.kernel.org/r/20230531083230.2702181-1-u.kleine-koe...@pengutronix.de
> > > > > Signed-off-by: Uwe Kleine-König 
> > > > > ---
> > > > >  drivers/tty/serial/8250/Kconfig  | 2 +-
> > > > >  drivers/tty/serial/8250/Makefile | 2 +-
> > > > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/tty/serial/8250/Kconfig 
> > > > > b/drivers/tty/serial/8250/Kconfig
> > > > > index 5313aa31930f..10c09b19c871 100644
> > > > > --- a/drivers/tty/serial/8250/Kconfig
> > > > > +++ b/drivers/tty/serial/8250/Kconfig
> > > > > @@ -378,7 +378,7 @@ config SERIAL_8250_BCM2835AUX
> > > > >  
> > > > >  config SERIAL_8250_FSL
> > > > >   bool "Freescale 16550 UART support" if COMPILE_TEST && !(PPC || 
> > > > > ARM || ARM64)
> > > > > - depends on SERIAL_8250_CONSOLE
> > > > > + depends on SERIAL_8250
> > > > 
> > > > Just one additional thought: After adding the arch side 
> > > > workaround/hack, SERIAL_8250_FSL could become a tristate?
> > > 
> > > I see no benefit for a module separate from 8250_base.ko. There are
> > > dependencies in both directions between 8250_port.o and 8250_fsl.o[1].
> > > So in my book a bool SERIAL_8250_FSL that modifies 8250_base.ko (with
> > > SERIAL_8250=m) is fine.
> > > 
> > > Best regards
> > > Uwe
> > > 
> > > [1] 8250_port.o uses fsl8250_handle_irq() from 8250_fsl.o
> > 
> > Is that after some fix which isn't in tty-next? I see only these:
> > 
> > $ git grep -l fsl8250_handle_irq
> > arch/powerpc/kernel/legacy_serial.c
> > drivers/tty/serial/8250/8250_fsl.c
> > drivers/tty/serial/8250/8250_of.c
> > include/linux/serial_8250.h
> > 
> > No users of fsl8250_handle_irq in 8250_port.c.
> 
> Ah right, I was too quick:
> 
>   8250_of.o uses fsl8250_handle_irq() from 8250_fsl.o
>   8250_fsl.o uses serial8250_modem_status() from 8250_port.o (which is in 
> 8250_base.o)
> 
> 
> However linking 8250_fsl.o in 8250_of.o isn't a good solution either as
> 8250_fsl.o should also be available with CONFIG_SERIAL_OF_PLATFORM
> disabled to provide the ACPI driver. And as 8250_of.o already depends on
> 8250_port.o (e.g. via serial8250_em485_config()) adding 8250_fsl.o
> together with 8250_port.o into 8250_base.ko is fine and doesn't add new
> dependencies.

So we have dependencies one-way only:

8250_fsl ----> 8250_port
    ^              ^
    |              |
8250_of -----------+

(arrows point at the dependency)

There's no loop here, so they could be independent modules, configured 
independently (with a correct IS_*() in 8250_of.c).

-- 
 i.
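
For reference, the tristate arrangement suggested above would presumably
look something like this -- a sketch that also assumes 8250_of.c grows an
IS_REACHABLE() guard around its fsl8250_handle_irq() use:

config SERIAL_8250_FSL
	tristate "Freescale 16550 UART support" if COMPILE_TEST && !(PPC || ARM || ARM64)
	depends on SERIAL_8250
	default PPC || ARM || ARM64

obj-$(CONFIG_SERIAL_8250_FSL)	+= 8250_fsl.o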


Re: [PATCH v3 2/2] serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE

2023-06-05 Thread Uwe Kleine-König
Hello Ilpo,

On Mon, Jun 05, 2023 at 04:44:08PM +0300, Ilpo Järvinen wrote:
> On Mon, 5 Jun 2023, Uwe Kleine-König wrote:
> > On Mon, Jun 05, 2023 at 04:22:55PM +0300, Ilpo Järvinen wrote:
> > > On Mon, 5 Jun 2023, Uwe Kleine-König wrote:
> > > 
> > > > The need to handle the FSL variant of 8250 in a special way is also
> > > > present without console support. So soften the dependency for
> > > > SERIAL_8250_FSL accordingly. Note that with the 8250 driver compiled as
> > > > a module, some devices still might not make use of the needed
> > > > workarounds. That affects the ports instantiated in
> > > > arch/powerpc/kernel/legacy_serial.c.
> > > > 
> > > > This issue was identified by Dominik Andreas Schorpp.
> > > > 
> > > > To cope for CONFIG_SERIAL_8250=m + CONFIG_SERIAL_8250_FSL=y, 8250_fsl.o
> > > > must be put in the same compilation unit as 8250_port.o because the
> > > > latter defines some functions needed in the former and so 8250_fsl.o
> > > > must not be built-in if 8250_port.o is available in a module.
> > > > 
> > > > Acked-by: Ilpo Järvinen 
> > > > Link: 
> > > > https://lore.kernel.org/r/20230531083230.2702181-1-u.kleine-koe...@pengutronix.de
> > > > Signed-off-by: Uwe Kleine-König 
> > > > ---
> > > >  drivers/tty/serial/8250/Kconfig  | 2 +-
> > > >  drivers/tty/serial/8250/Makefile | 2 +-
> > > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/drivers/tty/serial/8250/Kconfig 
> > > > b/drivers/tty/serial/8250/Kconfig
> > > > index 5313aa31930f..10c09b19c871 100644
> > > > --- a/drivers/tty/serial/8250/Kconfig
> > > > +++ b/drivers/tty/serial/8250/Kconfig
> > > > @@ -378,7 +378,7 @@ config SERIAL_8250_BCM2835AUX
> > > >  
> > > >  config SERIAL_8250_FSL
> > > > bool "Freescale 16550 UART support" if COMPILE_TEST && !(PPC || 
> > > > ARM || ARM64)
> > > > -   depends on SERIAL_8250_CONSOLE
> > > > +   depends on SERIAL_8250
> > > 
> > > Just one additional thought: After adding the arch side 
> > > workaround/hack, SERIAL_8250_FSL could become a tristate?
> > 
> > I see no benefit for a module separate from 8250_base.ko. There are
> > dependencies in both directions between 8250_port.o and 8250_fsl.o[1].
> > So in my book a bool SERIAL_8250_FSL that modifies 8250_base.ko (with
> > SERIAL_8250=m) is fine.
> > 
> > Best regards
> > Uwe
> > 
> > [1] 8250_port.o uses fsl8250_handle_irq() from 8250_fsl.o
> 
> Is that after some fix which isn't in tty-next? I see only these:
> 
> $ git grep -l fsl8250_handle_irq
> arch/powerpc/kernel/legacy_serial.c
> drivers/tty/serial/8250/8250_fsl.c
> drivers/tty/serial/8250/8250_of.c
> include/linux/serial_8250.h
> 
> No users of fsl8250_handle_irq in 8250_port.c.

Ah right, I was too quick:

8250_of.o uses fsl8250_handle_irq() from 8250_fsl.o
8250_fsl.o uses serial8250_modem_status() from 8250_port.o (which is in 
8250_base.o)


However linking 8250_fsl.o in 8250_of.o isn't a good solution either as
8250_fsl.o should also be available with CONFIG_SERIAL_OF_PLATFORM
disabled to provide the ACPI driver. And as 8250_of.o already depends on
8250_port.o (e.g. via serial8250_em485_config()) adding 8250_fsl.o
together with 8250_port.o into 8250_base.ko is fine and doesn't add new
dependencies.

Best regards
Uwe

-- 
Pengutronix e.K.   | Uwe Kleine-König|
Industrial Linux Solutions | https://www.pengutronix.de/ |


signature.asc
Description: PGP signature


Re: [PATCH v8 0/7] Add pci_dev_for_each_resource() helper and update users

2023-06-05 Thread Andy Shevchenko
On Wed, May 31, 2023 at 04:30:28PM -0500, Bjorn Helgaas wrote:
> On Wed, May 31, 2023 at 08:48:35PM +0200, Jonas Gorski wrote:

...

> > Looking at the code I understand where coverity is coming from:
> > 
> > #define __pci_dev_for_each_res0(dev, res, ...) \
> >for (unsigned int __b = 0;  \
> > res = pci_resource_n(dev, __b), __b < PCI_NUM_RESOURCES;   \
> > __b++)
> > 
> >  res will be assigned before __b is checked for being less than
> > PCI_NUM_RESOURCES, making it point to behind the array at the end of
> > the last loop iteration.
> > 
> > Rewriting the test expression as
> > 
> > __b < PCI_NUM_RESOURCES && (res = pci_resource_n(dev, __b));
> > 
> > should avoid the (coverity) warning by making use of lazy evaluation.
> > 
> > It probably makes the code slightly less performant as res will now be
> > checked for being not NULL (which will always be true), but I doubt it
> > will be significant (or in any hot paths).
> 
> Thanks a lot for looking into this!  I think you're right, and I think
> the rewritten expression is more logical as well.  Do you want to post
> a patch for it?

Gimme some time, I was on a long leave and now it's a pile to handle.

-- 
With Best Regards,
Andy Shevchenko
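
Concretely, the rewrite Jonas describes would make the macro look roughly
like this (a sketch: it relies on && short-circuiting so that
pci_resource_n() is never evaluated for an out-of-range index, and on the
returned resource pointer never being NULL):

#define __pci_dev_for_each_res0(dev, res, ...)				\
	for (unsigned int __b = 0;					\
	     __b < PCI_NUM_RESOURCES &&					\
		(res = pci_resource_n(dev, __b));			\
	     __b++)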




Re: [PATCH v3 2/2] serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE

2023-06-05 Thread Andy Shevchenko
On Mon, Jun 05, 2023 at 04:44:08PM +0300, Ilpo Järvinen wrote:
> On Mon, 5 Jun 2023, Uwe Kleine-König wrote:
> > On Mon, Jun 05, 2023 at 04:22:55PM +0300, Ilpo Järvinen wrote:
> > > On Mon, 5 Jun 2023, Uwe Kleine-König wrote:
> > > > The need to handle the FSL variant of 8250 in a special way is also
> > > > present without console support. So soften the dependency for
> > > > SERIAL_8250_FSL accordingly. Note that with the 8250 driver compiled as
> > > > a module, some devices still might not make use of the needed
> > > > workarounds. That affects the ports instantiated in
> > > > arch/powerpc/kernel/legacy_serial.c.
> > > > 
> > > > This issue was identified by Dominik Andreas Schorpp.
> > > > 
> > > > To cope for CONFIG_SERIAL_8250=m + CONFIG_SERIAL_8250_FSL=y, 8250_fsl.o
> > > > must be put in the same compilation unit as 8250_port.o because the
> > > > latter defines some functions needed in the former and so 8250_fsl.o
> > > > must not be built-in if 8250_port.o is available in a module.
> > > > 
> > > > Acked-by: Ilpo Järvinen 
> > > > Link: 
> > > > https://lore.kernel.org/r/20230531083230.2702181-1-u.kleine-koe...@pengutronix.de
> > > > Signed-off-by: Uwe Kleine-König 
> > > > ---
> > > >  drivers/tty/serial/8250/Kconfig  | 2 +-
> > > >  drivers/tty/serial/8250/Makefile | 2 +-
> > > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/drivers/tty/serial/8250/Kconfig 
> > > > b/drivers/tty/serial/8250/Kconfig
> > > > index 5313aa31930f..10c09b19c871 100644
> > > > --- a/drivers/tty/serial/8250/Kconfig
> > > > +++ b/drivers/tty/serial/8250/Kconfig
> > > > @@ -378,7 +378,7 @@ config SERIAL_8250_BCM2835AUX
> > > >  
> > > >  config SERIAL_8250_FSL
> > > > bool "Freescale 16550 UART support" if COMPILE_TEST && !(PPC || 
> > > > ARM || ARM64)
> > > > -   depends on SERIAL_8250_CONSOLE
> > > > +   depends on SERIAL_8250
> > > 
> > > Just one additional thought: After adding the arch side 
> > > workaround/hack, SERIAL_8250_FSL could become a tristate?
> > 
> > I see no benefit for a module separate from 8250_base.ko. There are
> > dependencies in both directions between 8250_port.o and 8250_fsl.o[1].
> > So in my book a bool SERIAL_8250_FSL that modifies 8250_base.ko (with
> > SERIAL_8250=m) is fine.

> > [1] 8250_port.o uses fsl8250_handle_irq() from 8250_fsl.o
> 
> Is that after some fix which isn't in tty-next? I see only these:
> 
> $ git grep -l fsl8250_handle_irq
> arch/powerpc/kernel/legacy_serial.c
> drivers/tty/serial/8250/8250_fsl.c
> drivers/tty/serial/8250/8250_of.c
> include/linux/serial_8250.h
> 
> No users of fsl8250_handle_irq in 8250_port.c.

> >, and 8250_fsl.o uses serial8250_modem_status from 8250_port.o.

I don't like 8250_base to be fattened by some stuff that has no
generic meaning. Can we avoid putting every quirk there?

-- 
With Best Regards,
Andy Shevchenko




Re: [PATCH v3 2/2] serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE

2023-06-05 Thread Ilpo Järvinen
On Mon, 5 Jun 2023, Uwe Kleine-König wrote:

> On Mon, Jun 05, 2023 at 04:22:55PM +0300, Ilpo Järvinen wrote:
> > On Mon, 5 Jun 2023, Uwe Kleine-König wrote:
> > 
> > > The need to handle the FSL variant of 8250 in a special way is also
> > > present without console support. So soften the dependency for
> > > SERIAL_8250_FSL accordingly. Note that with the 8250 driver compiled as
> > > a module, some devices still might not make use of the needed
> > > workarounds. That affects the ports instantiated in
> > > arch/powerpc/kernel/legacy_serial.c.
> > > 
> > > This issue was identified by Dominik Andreas Schorpp.
> > > 
> > > To cope for CONFIG_SERIAL_8250=m + CONFIG_SERIAL_8250_FSL=y, 8250_fsl.o
> > > must be put in the same compilation unit as 8250_port.o because the
> > > latter defines some functions needed in the former and so 8250_fsl.o
> > > must not be built-in if 8250_port.o is available in a module.
> > > 
> > > Acked-by: Ilpo Järvinen 
> > > Link: 
> > > https://lore.kernel.org/r/20230531083230.2702181-1-u.kleine-koe...@pengutronix.de
> > > Signed-off-by: Uwe Kleine-König 
> > > ---
> > >  drivers/tty/serial/8250/Kconfig  | 2 +-
> > >  drivers/tty/serial/8250/Makefile | 2 +-
> > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/tty/serial/8250/Kconfig 
> > > b/drivers/tty/serial/8250/Kconfig
> > > index 5313aa31930f..10c09b19c871 100644
> > > --- a/drivers/tty/serial/8250/Kconfig
> > > +++ b/drivers/tty/serial/8250/Kconfig
> > > @@ -378,7 +378,7 @@ config SERIAL_8250_BCM2835AUX
> > >  
> > >  config SERIAL_8250_FSL
> > >   bool "Freescale 16550 UART support" if COMPILE_TEST && !(PPC || ARM || 
> > > ARM64)
> > > - depends on SERIAL_8250_CONSOLE
> > > + depends on SERIAL_8250
> > 
> > Just one additional thought: After adding the arch side 
> > workaround/hack, SERIAL_8250_FSL could become a tristate?
> 
> I see no benefit for a module separate from 8250_base.ko. There are
> dependencies in both directions between 8250_port.o and 8250_fsl.o[1].
> So in my book a bool SERIAL_8250_FSL that modifies 8250_base.ko (with
> SERIAL_8250=m) is fine.
> 
> Best regards
> Uwe
> 
> [1] 8250_port.o uses fsl8250_handle_irq() from 8250_fsl.o

Is that after some fix which isn't in tty-next? I see only these:

$ git grep -l fsl8250_handle_irq
arch/powerpc/kernel/legacy_serial.c
drivers/tty/serial/8250/8250_fsl.c
drivers/tty/serial/8250/8250_of.c
include/linux/serial_8250.h

No users of fsl8250_handle_irq in 8250_port.c.

-- 
 i.

>, and 8250_fsl.o uses serial8250_modem_status from 8250_port.o.


Re: [PATCH v3 2/2] serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE

2023-06-05 Thread Uwe Kleine-König
On Mon, Jun 05, 2023 at 04:22:55PM +0300, Ilpo Järvinen wrote:
> On Mon, 5 Jun 2023, Uwe Kleine-König wrote:
> 
> > The need to handle the FSL variant of 8250 in a special way is also
> > present without console support. So soften the dependency for
> > SERIAL_8250_FSL accordingly. Note that with the 8250 driver compiled as
> > a module, some devices still might not make use of the needed
> > workarounds. That affects the ports instantiated in
> > arch/powerpc/kernel/legacy_serial.c.
> > 
> > This issue was identified by Dominik Andreas Schorpp.
> > 
> > To cope for CONFIG_SERIAL_8250=m + CONFIG_SERIAL_8250_FSL=y, 8250_fsl.o
> > must be put in the same compilation unit as 8250_port.o because the
> > latter defines some functions needed in the former and so 8250_fsl.o
> > must not be built-in if 8250_port.o is available in a module.
> > 
> > Acked-by: Ilpo Järvinen 
> > Link: 
> > https://lore.kernel.org/r/20230531083230.2702181-1-u.kleine-koe...@pengutronix.de
> > Signed-off-by: Uwe Kleine-König 
> > ---
> >  drivers/tty/serial/8250/Kconfig  | 2 +-
> >  drivers/tty/serial/8250/Makefile | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/tty/serial/8250/Kconfig 
> > b/drivers/tty/serial/8250/Kconfig
> > index 5313aa31930f..10c09b19c871 100644
> > --- a/drivers/tty/serial/8250/Kconfig
> > +++ b/drivers/tty/serial/8250/Kconfig
> > @@ -378,7 +378,7 @@ config SERIAL_8250_BCM2835AUX
> >  
> >  config SERIAL_8250_FSL
> > bool "Freescale 16550 UART support" if COMPILE_TEST && !(PPC || ARM || 
> > ARM64)
> > -   depends on SERIAL_8250_CONSOLE
> > +   depends on SERIAL_8250
> 
> Just one additional thought: After adding the arch side 
> workaround/hack, SERIAL_8250_FSL could become a tristate?

I see no benefit for a module separate from 8250_base.ko. There are
dependencies in both directions between 8250_port.o and 8250_fsl.o[1].
So in my book a bool SERIAL_8250_FSL that modifies 8250_base.ko (with
SERIAL_8250=m) is fine.

Best regards
Uwe

[1] 8250_port.o uses fsl8250_handle_irq() from 8250_fsl.o, and
8250_fsl.o uses serial8250_modem_status from 8250_port.o.

-- 
Pengutronix e.K.   | Uwe Kleine-König|
Industrial Linux Solutions | https://www.pengutronix.de/ |


signature.asc
Description: PGP signature


Re: [PATCH v3 2/2] serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE

2023-06-05 Thread Ilpo Järvinen
On Mon, 5 Jun 2023, Uwe Kleine-König wrote:

> The need to handle the FSL variant of 8250 in a special way is also
> present without console support. So soften the dependency for
> SERIAL_8250_FSL accordingly. Note that with the 8250 driver compiled as
> a module, some devices still might not make use of the needed
> workarounds. That affects the ports instantiated in
> arch/powerpc/kernel/legacy_serial.c.
> 
> This issue was identified by Dominik Andreas Schorpp.
> 
> To cope for CONFIG_SERIAL_8250=m + CONFIG_SERIAL_8250_FSL=y, 8250_fsl.o
> must be put in the same compilation unit as 8250_port.o because the
> latter defines some functions needed in the former and so 8250_fsl.o
> must not be built-in if 8250_port.o is available in a module.
> 
> Acked-by: Ilpo Järvinen 
> Link: 
> https://lore.kernel.org/r/20230531083230.2702181-1-u.kleine-koe...@pengutronix.de
> Signed-off-by: Uwe Kleine-König 
> ---
>  drivers/tty/serial/8250/Kconfig  | 2 +-
>  drivers/tty/serial/8250/Makefile | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/tty/serial/8250/Kconfig b/drivers/tty/serial/8250/Kconfig
> index 5313aa31930f..10c09b19c871 100644
> --- a/drivers/tty/serial/8250/Kconfig
> +++ b/drivers/tty/serial/8250/Kconfig
> @@ -378,7 +378,7 @@ config SERIAL_8250_BCM2835AUX
>  
>  config SERIAL_8250_FSL
>   bool "Freescale 16550 UART support" if COMPILE_TEST && !(PPC || ARM || 
> ARM64)
> - depends on SERIAL_8250_CONSOLE
> + depends on SERIAL_8250

Just one additional thought: After adding the arch side 
workaround/hack, SERIAL_8250_FSL could become a tristate?

(1/2 might need a small change to take into account that it can now be a 
module).

>   default PPC || ARM || ARM64
>   help
> Selecting this option enables a workaround for a break-detection
> diff --git a/drivers/tty/serial/8250/Makefile 
> b/drivers/tty/serial/8250/Makefile
> index 4fc2fc1f41b6..8824ba5295b6 100644
> --- a/drivers/tty/serial/8250/Makefile
> +++ b/drivers/tty/serial/8250/Makefile
> @@ -12,6 +12,7 @@ obj-$(CONFIG_SERIAL_8250)   += 8250.o 8250_base.o
>  8250_base-$(CONFIG_SERIAL_8250_DMA)  += 8250_dma.o
>  8250_base-$(CONFIG_SERIAL_8250_DWLIB)+= 8250_dwlib.o
>  8250_base-$(CONFIG_SERIAL_8250_FINTEK)   += 8250_fintek.o
> +8250_base-$(CONFIG_SERIAL_8250_FSL)  += 8250_fsl.o
>  8250_base-$(CONFIG_SERIAL_8250_PCILIB)   += 8250_pcilib.o
>  obj-$(CONFIG_SERIAL_8250_PARISC) += 8250_parisc.o
>  obj-$(CONFIG_SERIAL_8250_PCI)+= 8250_pci.o
> @@ -28,7 +29,6 @@ obj-$(CONFIG_SERIAL_8250_BOCA)  += 8250_boca.o
>  obj-$(CONFIG_SERIAL_8250_EXAR_ST16C554)  += 8250_exar_st16c554.o
>  obj-$(CONFIG_SERIAL_8250_HUB6)   += 8250_hub6.o
>  obj-$(CONFIG_SERIAL_8250_PCI1)   += 8250_pci1.o
> -obj-$(CONFIG_SERIAL_8250_FSL)+= 8250_fsl.o
>  obj-$(CONFIG_SERIAL_8250_MEN_MCB)+= 8250_men_mcb.o
>  obj-$(CONFIG_SERIAL_8250_DFL)+= 8250_dfl.o
>  obj-$(CONFIG_SERIAL_8250_DW) += 8250_dw.o

-- 
 i.


[PATCH v3 2/2] serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE

2023-06-05 Thread Uwe Kleine-König
The need to handle the FSL variant of 8250 in a special way is also
present without console support. So soften the dependency for
SERIAL_8250_FSL accordingly. Note that with the 8250 driver compiled as
a module, some devices still might not make use of the needed
workarounds. That affects the ports instantiated in
arch/powerpc/kernel/legacy_serial.c.

This issue was identified by Dominik Andreas Schorpp.

To cope for CONFIG_SERIAL_8250=m + CONFIG_SERIAL_8250_FSL=y, 8250_fsl.o
must be put in the same compilation unit as 8250_port.o because the
latter defines some functions needed in the former and so 8250_fsl.o
must not be built-in if 8250_port.o is available in a module.

Acked-by: Ilpo Järvinen 
Link: 
https://lore.kernel.org/r/20230531083230.2702181-1-u.kleine-koe...@pengutronix.de
Signed-off-by: Uwe Kleine-König 
---
 drivers/tty/serial/8250/Kconfig  | 2 +-
 drivers/tty/serial/8250/Makefile | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/tty/serial/8250/Kconfig b/drivers/tty/serial/8250/Kconfig
index 5313aa31930f..10c09b19c871 100644
--- a/drivers/tty/serial/8250/Kconfig
+++ b/drivers/tty/serial/8250/Kconfig
@@ -378,7 +378,7 @@ config SERIAL_8250_BCM2835AUX
 
 config SERIAL_8250_FSL
bool "Freescale 16550 UART support" if COMPILE_TEST && !(PPC || ARM || 
ARM64)
-   depends on SERIAL_8250_CONSOLE
+   depends on SERIAL_8250
default PPC || ARM || ARM64
help
  Selecting this option enables a workaround for a break-detection
diff --git a/drivers/tty/serial/8250/Makefile b/drivers/tty/serial/8250/Makefile
index 4fc2fc1f41b6..8824ba5295b6 100644
--- a/drivers/tty/serial/8250/Makefile
+++ b/drivers/tty/serial/8250/Makefile
@@ -12,6 +12,7 @@ obj-$(CONFIG_SERIAL_8250) += 8250.o 8250_base.o
 8250_base-$(CONFIG_SERIAL_8250_DMA)+= 8250_dma.o
 8250_base-$(CONFIG_SERIAL_8250_DWLIB)  += 8250_dwlib.o
 8250_base-$(CONFIG_SERIAL_8250_FINTEK) += 8250_fintek.o
+8250_base-$(CONFIG_SERIAL_8250_FSL)+= 8250_fsl.o
 8250_base-$(CONFIG_SERIAL_8250_PCILIB) += 8250_pcilib.o
 obj-$(CONFIG_SERIAL_8250_PARISC)   += 8250_parisc.o
 obj-$(CONFIG_SERIAL_8250_PCI)  += 8250_pci.o
@@ -28,7 +29,6 @@ obj-$(CONFIG_SERIAL_8250_BOCA)+= 8250_boca.o
 obj-$(CONFIG_SERIAL_8250_EXAR_ST16C554)+= 8250_exar_st16c554.o
 obj-$(CONFIG_SERIAL_8250_HUB6) += 8250_hub6.o
 obj-$(CONFIG_SERIAL_8250_PCI1) += 8250_pci1.o
-obj-$(CONFIG_SERIAL_8250_FSL)  += 8250_fsl.o
 obj-$(CONFIG_SERIAL_8250_MEN_MCB)  += 8250_men_mcb.o
 obj-$(CONFIG_SERIAL_8250_DFL)  += 8250_dfl.o
 obj-$(CONFIG_SERIAL_8250_DW)   += 8250_dw.o
-- 
2.39.2



[PATCH v3 0/2] serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE

2023-06-05 Thread Uwe Kleine-König
Hello,

this is v3 of the series that now also copes for
arch/powerpc/kernel/legacy_serial.c using fsl8250_handle_irq().

For kernel configurations that were already correctly using the Freescale
workarounds before, this is still the case with this series applied, too.
So in all cases the situation doesn't get worse. The upside is that even
with the 8250 driver compiled as a module (or built-in but without
console support) the workarounds are now applied for all devices but for
the ones instantiated in arch/powerpc/kernel/legacy_serial.c. (And even
for these there might not be a problem as they might benefit from
enabling the workarounds in drivers/tty/serial/8250/8250_of.c. Not sure
though.)

Patch #1 is new here. Patch #2 only changed lightly: I restored
alphabetic order in drivers/tty/serial/8250/Kconfig.

As patch #1 is needed for patch #2 to not introduce a build failure,
both patches should be taken together. I suggest to add them to Greg's
serial tree, but the changes pending there should not conflict with this
series such that taking them both via powerpc works, too.

Best regards
Uwe

Uwe Kleine-König (2):
  powerpc/legacy_serial: Warn about 8250 devices operated without active
FSL workarounds
  serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE

 arch/powerpc/kernel/legacy_serial.c | 14 +-
 drivers/tty/serial/8250/Kconfig |  2 +-
 drivers/tty/serial/8250/Makefile|  2 +-
 3 files changed, 11 insertions(+), 7 deletions(-)


base-commit: ac9a78681b921877518763ba0e89202254349d1b
-- 
2.39.2



[PATCH v3 1/2] powerpc/legacy_serial: Warn about 8250 devices operated without active FSL workarounds

2023-06-05 Thread Uwe Kleine-König
If the 8250 driver is built as a module (or built-in without console
support) the Freescale specific workarounds were silently not activated.
Add a warning in this case.

Currently CONFIG_SERIAL_8250_FSL=y implies that the function
fsl8250_handle_irq() is built-in and can be used. However with the
changes of the next commit CONFIG_SERIAL_8250_FSL might be enabled also
when the 8250 driver is a module and so more care is needed when
fsl8250_handle_irq() is to be used. The code added here is able to
handle the new situation already.

Signed-off-by: Uwe Kleine-König 
---
 arch/powerpc/kernel/legacy_serial.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/legacy_serial.c 
b/arch/powerpc/kernel/legacy_serial.c
index c9ad12461d44..fdbd85aafeb1 100644
--- a/arch/powerpc/kernel/legacy_serial.c
+++ b/arch/powerpc/kernel/legacy_serial.c
@@ -508,12 +508,16 @@ static void __init fixup_port_irq(int index,
 
port->irq = virq;
 
-#ifdef CONFIG_SERIAL_8250_FSL
-   if (of_device_is_compatible(np, "fsl,ns16550")) {
-   port->handle_irq = fsl8250_handle_irq;
-   port->has_sysrq = IS_ENABLED(CONFIG_SERIAL_8250_CONSOLE);
+	if (IS_ENABLED(CONFIG_SERIAL_8250) &&
+	    of_device_is_compatible(np, "fsl,ns16550")) {
+		if (IS_REACHABLE(CONFIG_SERIAL_8250)) {
+			port->handle_irq = fsl8250_handle_irq;
+			port->has_sysrq = IS_ENABLED(CONFIG_SERIAL_8250_CONSOLE);
+		} else {
+			pr_warn_once("Not activating Freescale specific workaround for device %pOFP\n",
+				     np);
+		}
}
-#endif
 }
 
 static void __init fixup_port_pio(int index,
-- 
2.39.2



[PATCH 4/4] powerpc/kuap: Make disabling KUAP at boottime optional

2023-06-05 Thread Christophe Leroy
It is possible to disable KUAP at boottime with 'nosmap' parameter.

That is implemented with a jump_label, hence it adds a 'nop' in front
of each open/close of userspace access.

From a security point of view it makes sense to disallow disabling
KUAP. And on processors like the 8xx, where 'nop' is not seamless,
it saves a few cycles.

So add a CONFIG item to make it optional.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/kup.h |  2 +-
 arch/powerpc/mm/init-common.c  |  3 +++
 arch/powerpc/platforms/Kconfig.cputype | 10 ++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kup.h b/arch/powerpc/include/asm/kup.h
index 74b7f4cee2ed..f3280169aeec 100644
--- a/arch/powerpc/include/asm/kup.h
+++ b/arch/powerpc/include/asm/kup.h
@@ -53,7 +53,7 @@ extern struct static_key_false disable_kuap_key;
 
 static __always_inline bool kuap_is_disabled(void)
 {
-	return static_branch_unlikely(&disable_kuap_key);
+	return IS_ENABLED(CONFIG_PPC_KUAP_BOOTTIME) && static_branch_unlikely(&disable_kuap_key);
 }
 #endif
 #else
diff --git a/arch/powerpc/mm/init-common.c b/arch/powerpc/mm/init-common.c
index 74e140b1efef..994ee58f0092 100644
--- a/arch/powerpc/mm/init-common.c
+++ b/arch/powerpc/mm/init-common.c
@@ -48,6 +48,9 @@ early_param("nosmep", parse_nosmep);
 
 static int __init parse_nosmap(char *p)
 {
+   if (!IS_ENABLED(CONFIG_PPC_KUAP_BOOTTIME))
+   return 0;
+
disable_kuap = true;
pr_warn("Disabling Kernel Userspace Access Protection\n");
return 0;
diff --git a/arch/powerpc/platforms/Kconfig.cputype 
b/arch/powerpc/platforms/Kconfig.cputype
index 45fd975ef521..f75c2d5cd182 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -502,6 +502,16 @@ config PPC_KUAP
 
  If you're unsure, say Y.
 
+config PPC_KUAP_BOOTTIME
+   bool "Allow disabling Kernel Userspace Access Protection at boottime"
+   depends on PPC_KUAP
+   default y
+   help
+ Allow the user to disable Kernel Userspace Access Protection (KUAP)
+ at boot time using the 'nosmap' kernel parameter.
+
+ If you're unsure, say Y.
+
 config PPC_KUAP_DEBUG
bool "Extra debugging for Kernel Userspace Access Protection"
depends on PPC_KUAP
-- 
2.40.1



[PATCH 3/4] powerpc/kuap: Refactor static branch for disabling kuap

2023-06-05 Thread Christophe Leroy
All but book3s/64 use a static branch key for disabling kuap.
book3s/64 uses an MMU feature instead.

Refactor all targets except book3s/64.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/book3s/32/kup.h |  7 ---
 arch/powerpc/include/asm/book3s/64/kup.h |  1 +
 arch/powerpc/include/asm/kup.h   | 15 +++
 arch/powerpc/include/asm/nohash/32/kup-8xx.h |  7 ---
 arch/powerpc/include/asm/nohash/kup-booke.h  |  7 ---
 arch/powerpc/mm/book3s32/kuap.c  |  3 ---
 arch/powerpc/mm/init-common.c|  3 +++
 arch/powerpc/mm/nohash/kup.c |  3 ---
 8 files changed, 19 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/32/kup.h 
b/arch/powerpc/include/asm/book3s/32/kup.h
index 466a19cfb4df..8da9997a67ba 100644
--- a/arch/powerpc/include/asm/book3s/32/kup.h
+++ b/arch/powerpc/include/asm/book3s/32/kup.h
@@ -11,8 +11,6 @@
 
 #include 
 
-extern struct static_key_false disable_kuap_key;
-
 static __always_inline bool kuep_is_disabled(void)
 {
return !IS_ENABLED(CONFIG_PPC_KUEP);
@@ -25,11 +23,6 @@ static __always_inline bool kuep_is_disabled(void)
 #define KUAP_NONE  (~0UL)
 #define KUAP_ALL   (~1UL)
 
-static __always_inline bool kuap_is_disabled(void)
-{
-	return static_branch_unlikely(&disable_kuap_key);
-}
-
 static inline void kuap_lock_one(unsigned long addr)
 {
mtsr(mfsr(addr) | SR_KS, addr);
diff --git a/arch/powerpc/include/asm/book3s/64/kup.h 
b/arch/powerpc/include/asm/book3s/64/kup.h
index 1b0215ff3710..f8b8e93c488c 100644
--- a/arch/powerpc/include/asm/book3s/64/kup.h
+++ b/arch/powerpc/include/asm/book3s/64/kup.h
@@ -233,6 +233,7 @@ static __always_inline bool kuap_is_disabled(void)
 {
return !mmu_has_feature(MMU_FTR_BOOK3S_KUAP);
 }
+#define kuap_is_disabled kuap_is_disabled
 
 static inline void kuap_user_restore(struct pt_regs *regs)
 {
diff --git a/arch/powerpc/include/asm/kup.h b/arch/powerpc/include/asm/kup.h
index 0a4e07175612..74b7f4cee2ed 100644
--- a/arch/powerpc/include/asm/kup.h
+++ b/arch/powerpc/include/asm/kup.h
@@ -6,6 +6,12 @@
 #define KUAP_WRITE 2
 #define KUAP_READ_WRITE(KUAP_READ | KUAP_WRITE)
 
+#ifndef __ASSEMBLY__
+#include 
+
+static __always_inline bool kuap_is_disabled(void);
+#endif
+
 #ifdef CONFIG_PPC_BOOK3S_64
 #include 
 #endif
@@ -41,6 +47,15 @@ void setup_kuep(bool disabled);
 
 #ifdef CONFIG_PPC_KUAP
 void setup_kuap(bool disabled);
+
+#ifndef kuap_is_disabled
+extern struct static_key_false disable_kuap_key;
+
+static __always_inline bool kuap_is_disabled(void)
+{
+	return static_branch_unlikely(&disable_kuap_key);
+}
+#endif
 #else
 static inline void setup_kuap(bool disabled) { }
 
diff --git a/arch/powerpc/include/asm/nohash/32/kup-8xx.h 
b/arch/powerpc/include/asm/nohash/32/kup-8xx.h
index a372cd822887..1d53f38c5cd5 100644
--- a/arch/powerpc/include/asm/nohash/32/kup-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/kup-8xx.h
@@ -13,13 +13,6 @@
 
 #include 
 
-extern struct static_key_false disable_kuap_key;
-
-static __always_inline bool kuap_is_disabled(void)
-{
-	return static_branch_unlikely(&disable_kuap_key);
-}
-
 static inline void __kuap_save_and_lock(struct pt_regs *regs)
 {
regs->kuap = mfspr(SPRN_MD_AP);
diff --git a/arch/powerpc/include/asm/nohash/kup-booke.h 
b/arch/powerpc/include/asm/nohash/kup-booke.h
index 71182cbe20c3..07759ae9117b 100644
--- a/arch/powerpc/include/asm/nohash/kup-booke.h
+++ b/arch/powerpc/include/asm/nohash/kup-booke.h
@@ -18,13 +18,6 @@
 
 #include 
 
-extern struct static_key_false disable_kuap_key;
-
-static __always_inline bool kuap_is_disabled(void)
-{
-	return static_branch_unlikely(&disable_kuap_key);
-}
-
 static inline void __kuap_lock(void)
 {
mtspr(SPRN_PID, 0);
diff --git a/arch/powerpc/mm/book3s32/kuap.c b/arch/powerpc/mm/book3s32/kuap.c
index 28676cabb005..c5484729b595 100644
--- a/arch/powerpc/mm/book3s32/kuap.c
+++ b/arch/powerpc/mm/book3s32/kuap.c
@@ -3,9 +3,6 @@
 #include 
 #include 
 
-struct static_key_false disable_kuap_key;
-EXPORT_SYMBOL(disable_kuap_key);
-
 void kuap_lock_all_ool(void)
 {
kuap_lock_all();
diff --git a/arch/powerpc/mm/init-common.c b/arch/powerpc/mm/init-common.c
index 119ef491f797..74e140b1efef 100644
--- a/arch/powerpc/mm/init-common.c
+++ b/arch/powerpc/mm/init-common.c
@@ -32,6 +32,9 @@ EXPORT_SYMBOL_GPL(kernstart_virt_addr);
 bool disable_kuep = !IS_ENABLED(CONFIG_PPC_KUEP);
 bool disable_kuap = !IS_ENABLED(CONFIG_PPC_KUAP);
 
+struct static_key_false disable_kuap_key;
+EXPORT_SYMBOL(disable_kuap_key);
+
 static int __init parse_nosmep(char *p)
 {
if (!IS_ENABLED(CONFIG_PPC_BOOK3S_64))
diff --git a/arch/powerpc/mm/nohash/kup.c b/arch/powerpc/mm/nohash/kup.c
index 552becf90e97..4e22adfa2aa8 100644
--- a/arch/powerpc/mm/nohash/kup.c
+++ b/arch/powerpc/mm/nohash/kup.c
@@ -13,9 +13,6 @@
 #include 
 
 #ifdef CONFIG_PPC_KUAP
-struct static_key_false disable_kuap_key;
-EXPORT_SYMBOL(disable_kuap_key);
-
 void 

[PATCH 2/4] powerpc/kuap: Avoid useless jump_label on empty function

2023-06-05 Thread Christophe Leroy
Disassembly of interrupt_enter_prepare() shows a pointless nop before the mftb

  c000abf0 <interrupt_enter_prepare>:
  c000abf0:   81 23 00 84 lwz r9,132(r3)
  c000abf4:   71 29 40 00 andi.   r9,r9,16384
  c000abf8:   41 82 00 28 beq-    c000ac20 <interrupt_enter_prepare+0x30>

  c000abfc: ===>  60 00 00 00 nop   <===
  c000ac00:   7d 0c 42 e6 mftb    r8
  c000ac04:   80 e2 00 08 lwz r7,8(r2)
  c000ac08:   81 22 00 28 lwz r9,40(r2)
  c000ac0c:   91 02 00 24 stw r8,36(r2)
  c000ac10:   7d 29 38 50 subf    r9,r9,r7
  c000ac14:   7d 29 42 14 add r9,r9,r8
  c000ac18:   91 22 00 08 stw r9,8(r2)
  c000ac1c:   4e 80 00 20 blr
  c000ac20:   60 00 00 00 nop
  c000ac24:   7d 5a c2 a6 mfmd_ap r10
  c000ac28:   3d 20 de 00 lis r9,-8704
  c000ac2c:   91 43 00 b0 stw r10,176(r3)
  c000ac30:   7d 3a c3 a6 mtspr   794,r9
  c000ac34:   4e 80 00 20 blr

That comes from the call to kuap_lock(), although __kuap_lock() is an empty
function on the 8xx.

To avoid that, only perform the kuap_is_disabled() check when there is
something to do in __kuap_lock().

Do the same with __kuap_save_and_lock() and __kuap_get_and_assert_locked().
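
The detection mechanism used in the diffs below is the common
'#define foo foo' idiom: an architecture that provides a hook also
defines a macro of the same name, and the generic header compiles the
kuap_is_disabled() check and the call only when that macro is defined.
A condensed sketch of the idiom, with names taken from the patch but
reduced for illustration:

   /* arch header: provide the hook and advertise it */
   static inline void __kuap_lock(void) { /* arch-specific work */ }
   #define __kuap_lock __kuap_lock

   /* generic header: emit code only if an arch advertised the hook */
   static __always_inline void kuap_lock(void)
   {
   #ifdef __kuap_lock
           if (kuap_is_disabled())
                   return;
           __kuap_lock();
   #endif
   }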

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/book3s/32/kup.h |  6 ++---
 arch/powerpc/include/asm/book3s/64/kup.h | 10 +---
 arch/powerpc/include/asm/kup.h   | 25 ++--
 arch/powerpc/include/asm/nohash/32/kup-8xx.h | 11 -
 arch/powerpc/include/asm/nohash/kup-booke.h  |  8 +--
 5 files changed, 26 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/32/kup.h 
b/arch/powerpc/include/asm/book3s/32/kup.h
index 678f9c9d89b6..466a19cfb4df 100644
--- a/arch/powerpc/include/asm/book3s/32/kup.h
+++ b/arch/powerpc/include/asm/book3s/32/kup.h
@@ -77,10 +77,6 @@ static inline void kuap_unlock(unsigned long addr, bool ool)
kuap_unlock_all_ool();
 }
 
-static inline void __kuap_lock(void)
-{
-}
-
 static inline void __kuap_save_and_lock(struct pt_regs *regs)
 {
unsigned long kuap = current->thread.kuap;
@@ -92,6 +88,7 @@ static inline void __kuap_save_and_lock(struct pt_regs *regs)
current->thread.kuap = KUAP_NONE;
kuap_lock_addr(kuap, false);
 }
+#define __kuap_save_and_lock __kuap_save_and_lock
 
 static inline void kuap_user_restore(struct pt_regs *regs)
 {
@@ -120,6 +117,7 @@ static inline unsigned long __kuap_get_and_assert_locked(void)
 
return kuap;
 }
+#define __kuap_get_and_assert_locked __kuap_get_and_assert_locked
 
 static __always_inline void __allow_user_access(void __user *to, const void 
__user *from,
u32 size, unsigned long dir)
diff --git a/arch/powerpc/include/asm/book3s/64/kup.h 
b/arch/powerpc/include/asm/book3s/64/kup.h
index 54cf46808157..1b0215ff3710 100644
--- a/arch/powerpc/include/asm/book3s/64/kup.h
+++ b/arch/powerpc/include/asm/book3s/64/kup.h
@@ -297,15 +297,7 @@ static inline unsigned long __kuap_get_and_assert_locked(void)
WARN_ON_ONCE(amr != AMR_KUAP_BLOCKED);
return amr;
 }
-
-/* Do nothing, book3s/64 does that in ASM */
-static inline void __kuap_lock(void)
-{
-}
-
-static inline void __kuap_save_and_lock(struct pt_regs *regs)
-{
-}
+#define __kuap_get_and_assert_locked __kuap_get_and_assert_locked
 
 /*
  * We support individually allowing read or write, but we don't support nesting
diff --git a/arch/powerpc/include/asm/kup.h b/arch/powerpc/include/asm/kup.h
index d751ddd08110..0a4e07175612 100644
--- a/arch/powerpc/include/asm/kup.h
+++ b/arch/powerpc/include/asm/kup.h
@@ -52,16 +52,9 @@ __bad_kuap_fault(struct pt_regs *regs, unsigned long address, bool is_write)
return false;
 }
 
-static inline void __kuap_lock(void) { }
-static inline void __kuap_save_and_lock(struct pt_regs *regs) { }
 static inline void kuap_user_restore(struct pt_regs *regs) { }
static inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned long amr) { }
 
-static inline unsigned long __kuap_get_and_assert_locked(void)
-{
-   return 0;
-}
-
 /*
  * book3s/64/kup-radix.h defines these functions for the !KUAP case to flush
  * the L1D cache after user accesses. Only include the empty stubs for other
@@ -87,27 +80,32 @@ bad_kuap_fault(struct pt_regs *regs, unsigned long address, bool is_write)
 
 static __always_inline void kuap_assert_locked(void)
 {
+#if defined(CONFIG_PPC_KUAP_DEBUG) && defined(__kuap_get_and_assert_locked)
if (kuap_is_disabled())
return;
 
-   if (IS_ENABLED(CONFIG_PPC_KUAP_DEBUG))
-   __kuap_get_and_assert_locked();
+   __kuap_get_and_assert_locked();
+#endif
 }
 
 static __always_inline void kuap_lock(void)
 {
+#ifdef __kuap_lock
if (kuap_is_disabled())
return;
 
__kuap_lock();
+#endif
 }
 
 static __always_inline void kuap_save_and_lock(struct pt_regs 

[PATCH 1/4] powerpc/kuap: Avoid unnecessary reads of MD_AP

2023-06-05 Thread Christophe Leroy
A disassembly of interrupt_exit_kernel_prepare() shows a useless read
of the MD_AP register. This is shown by r9 being re-used immediately without
doing anything with the value read.

  c000e0e0:   60 00 00 00 nop
  c000e0e4: ===>  7d 3a c2 a6 mfmd_ap r9   <===
  c000e0e8:   7d 20 00 a6 mfmsr   r9
  c000e0ec:   7c 51 13 a6 mtspr   81,r2
  c000e0f0:   81 3f 00 84 lwz r9,132(r31)
  c000e0f4:   71 29 80 00 andi.   r9,r9,32768

kuap_get_and_assert_locked() is paired with kuap_kernel_restore(),
and both are only used in interrupt_exit_kernel_prepare(). The value
returned by kuap_get_and_assert_locked() is only used by
kuap_kernel_restore().

On 8xx, kuap_kernel_restore() doesn't use the value read by
kuap_get_and_assert_locked() so modify kuap_get_and_assert_locked()
to not perform the read of MD_AP and return 0 instead.

The same applies on BOOKE.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/nohash/32/kup-8xx.h | 8 ++--
 arch/powerpc/include/asm/nohash/kup-booke.h  | 6 ++
 2 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/32/kup-8xx.h 
b/arch/powerpc/include/asm/nohash/32/kup-8xx.h
index c44d97751723..8579210f2a6a 100644
--- a/arch/powerpc/include/asm/nohash/32/kup-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/kup-8xx.h
@@ -41,14 +41,10 @@ static inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned long kua
 
 static inline unsigned long __kuap_get_and_assert_locked(void)
 {
-   unsigned long kuap;
-
-   kuap = mfspr(SPRN_MD_AP);
-
if (IS_ENABLED(CONFIG_PPC_KUAP_DEBUG))
-   WARN_ON_ONCE(kuap >> 16 != MD_APG_KUAP >> 16);
+   WARN_ON_ONCE(mfspr(SPRN_MD_AP) >> 16 != MD_APG_KUAP >> 16);
 
-   return kuap;
+   return 0;
 }
 
 static inline void __allow_user_access(void __user *to, const void __user 
*from,
diff --git a/arch/powerpc/include/asm/nohash/kup-booke.h 
b/arch/powerpc/include/asm/nohash/kup-booke.h
index 49bb41ed0816..823c5a3a96d8 100644
--- a/arch/powerpc/include/asm/nohash/kup-booke.h
+++ b/arch/powerpc/include/asm/nohash/kup-booke.h
@@ -58,12 +58,10 @@ static inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned long kua
 
 static inline unsigned long __kuap_get_and_assert_locked(void)
 {
-   unsigned long kuap = mfspr(SPRN_PID);
-
if (IS_ENABLED(CONFIG_PPC_KUAP_DEBUG))
-   WARN_ON_ONCE(kuap);
+   WARN_ON_ONCE(mfspr(SPRN_PID));
 
-   return kuap;
+   return 0;
 }
 
 static inline void __allow_user_access(void __user *to, const void __user 
*from,
-- 
2.40.1



Re: [PATCH 00/13] mm: jit/text allocator

2023-06-05 Thread Mark Rutland
On Mon, Jun 05, 2023 at 12:20:40PM +0300, Mike Rapoport wrote:
> On Fri, Jun 02, 2023 at 10:35:09AM +0100, Mark Rutland wrote:
> > On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote:
> > > On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote:
> > > > For a while I have wanted to give kprobes its own allocator so that it 
> > > > can work
> > > > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA 
> > > > space in
> > > > the modules area.
> > > > 
> > > > Given that, I think these should have their own allocator functions 
> > > > that can be
> > > > provided independently, even if those happen to use common 
> > > > infrastructure.
> > > 
> > > How much memory can kprobes conceivably use? I think we also want to try
> > > to push back on combinatorial new allocators, if we can.
> > 
> > That depends on who's using it, and how (e.g. via BPF).
> > 
> > To be clear, I'm not necessarily asking for entirely different allocators, but
> > I do think that we want wrappers that can at least pass distinct start+end
> > parameters to a common allocator, and for arm64's modules code I'd expect that
> > we'd keep the range fallback logic out of the common allocator, and just call
> > it twice.
> > 
> > > > > Several architectures override module_alloc() because of various
> > > > > constraints where the executable memory can be located and this causes
> > > > > additional obstacles for improvements of code allocation.
> > > > > 
> > > > > This set splits code allocation from modules by introducing
> > > > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> > > > > sites of module_alloc() and module_memfree() with the new APIs and
> > > > > implements core text and related allocation in a central place.
> > > > > 
> > > > > Instead of architecture specific overrides for module_alloc(), the
> > > > > architectures that require non-default behaviour for text allocation 
> > > > > must
> > > > > fill jit_alloc_params structure and implement jit_alloc_arch_params() 
> > > > > that
> > > > > returns a pointer to that structure. If an architecture does not 
> > > > > implement
> > > > > jit_alloc_arch_params(), the defaults compatible with the current
> > > > > modules::module_alloc() are used.
> > > > 
> > > > As above, I suspect that each of the callsites should probably be using 
> > > > common
> > > > infrastructure, but I don't think that a single jit_alloc_arch_params() 
> > > > makes
> > > > sense, since the parameters for each case may need to be distinct.
> > > 
> > > I don't see how that follows. The whole point of function parameters is
> > > that they may be different :)
> > 
> > What I mean is that jit_alloc_arch_params() tries to aggregate common
> > parameters, but they aren't actually common (e.g. the actual start+end range
> > for allocation).
> 
> jit_alloc_arch_params() tries to aggregate architecture constraints and
> requirements for allocations of executable memory, and this is exactly what
> the first 6 patches of this set do.
> 
> A while ago Thomas suggested to use a structure that parametrizes
> architecture constraints by the memory type used in modules [1] and Song
> implemented the infrastructure for it and x86 part [2].
> 
> I liked the idea of defining parameters in a single structure, but I
> thought that approaching the problem from the arch side rather than from
> the modules perspective will be a better starting point, hence these patches.
> 
> I don't see a fundamental reason why a single structure cannot describe
> what is needed for different code allocation cases, be it modules, kprobes
> or bpf. There is of course an assumption that the core allocations will be
> the same for all the users, and it seems to me that something like 
> 
> * allocate physical memory if allocator caches are empty
> * map it in vmalloc or modules address space
> * return memory from the allocator cache to the caller
> 
> will work for all usecases.
> 
> We might need separate caches for different cases on different
> architectures, and a way to specify what cache should be used in the
> allocator API, but that does not contradict a single structure for arch
> specific parameters, but only makes it more elaborate, e.g. something like
> 
> enum jit_type {
>   JIT_MODULES_TEXT,
>   JIT_MODULES_DATA,
>   JIT_KPROBES,
>   JIT_FTRACE,
>   JIT_BPF,
>   JIT_TYPE_MAX,
> };
> 
> struct jit_alloc_params {
>   struct jit_rangeranges[JIT_TYPE_MAX];
>   /* ... */
> };
> 
> > > Can you give more detail on what parameters you need? If the only extra
> > > parameter is just "does this allocation need to live close to kernel
> > > text", that's not that big of a deal.
> > 
> > My thinking was that we at least need the start + end for each caller. That
> > might be it, tbh.
> 
> Do you mean that modules will have something like
> 
>   jit_text_alloc(size, MODULES_START, MODULES_END);
> 
> and kprobes will have
> 
> 

Re: [PATCH 00/13] mm: jit/text allocator

2023-06-05 Thread Mike Rapoport
On Fri, Jun 02, 2023 at 10:35:09AM +0100, Mark Rutland wrote:
> On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote:
> > On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote:
> > > For a while I have wanted to give kprobes its own allocator so that it 
> > > can work
> > > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space 
> > > in
> > > the modules area.
> > > 
> > > Given that, I think these should have their own allocator functions that 
> > > can be
> > > provided independently, even if those happen to use common infrastructure.
> > 
> > How much memory can kprobes conceivably use? I think we also want to try
> > to push back on combinatorial new allocators, if we can.
> 
> That depends on who's using it, and how (e.g. via BPF).
> 
> To be clear, I'm not necessarily asking for entirely different allocators, but
> I do think that we want wrappers that can at least pass distinct start+end
> parameters to a common allocator, and for arm64's modules code I'd expect that
> we'd keep the range fallback logic out of the common allocator, and just call
> it twice.
> 
> > > > Several architectures override module_alloc() because of various
> > > > constraints where the executable memory can be located and this causes
> > > > additional obstacles for improvements of code allocation.
> > > > 
> > > > This set splits code allocation from modules by introducing
> > > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> > > > sites of module_alloc() and module_memfree() with the new APIs and
> > > > implements core text and related allocation in a central place.
> > > > 
> > > > Instead of architecture specific overrides for module_alloc(), the
> > > > architectures that require non-default behaviour for text allocation 
> > > > must
> > > > fill jit_alloc_params structure and implement jit_alloc_arch_params() 
> > > > that
> > > > returns a pointer to that structure. If an architecture does not 
> > > > implement
> > > > jit_alloc_arch_params(), the defaults compatible with the current
> > > > modules::module_alloc() are used.
> > > 
> > > As above, I suspect that each of the callsites should probably be using 
> > > common
> > > infrastructure, but I don't think that a single jit_alloc_arch_params() 
> > > makes
> > > sense, since the parameters for each case may need to be distinct.
> > 
> > I don't see how that follows. The whole point of function parameters is
> > that they may be different :)
> 
> What I mean is that jit_alloc_arch_params() tries to aggregate common
> parameters, but they aren't actually common (e.g. the actual start+end range
> for allocation).

jit_alloc_arch_params() tries to aggregate architecture constraints and
requirements for allocations of executable memory, and this is exactly what
the first 6 patches of this set do.

A while ago Thomas suggested to use a structure that parametrizes
architecture constraints by the memory type used in modules [1] and Song
implemented the infrastructure for it and x86 part [2].

I liked the idea of defining parameters in a single structure, but I
thought that approaching the problem from the arch side rather than from
the modules perspective will be a better starting point, hence these patches.

I don't see a fundamental reason why a single structure cannot describe
what is needed for different code allocation cases, be it modules, kprobes
or bpf. There is of course an assumption that the core allocations will be
the same for all the users, and it seems to me that something like 

* allocate physical memory if allocator caches are empty
* map it in vmalloc or modules address space
* return memory from the allocator cache to the caller

will work for all usecases.

We might need separate caches for different cases on different
architectures, and a way to specify what cache should be used in the
allocator API, but that does not contradict a single structure for arch
specific parameters, but only makes it more elaborate, e.g. something like

enum jit_type {
JIT_MODULES_TEXT,
JIT_MODULES_DATA,
JIT_KPROBES,
JIT_FTRACE,
JIT_BPF,
JIT_TYPE_MAX,
};

struct jit_alloc_params {
struct jit_rangeranges[JIT_TYPE_MAX];
/* ... */
};

> > Can you give more detail on what parameters you need? If the only extra
> > parameter is just "does this allocation need to live close to kernel
> > text", that's not that big of a deal.
> 
> My thinking was that we at least need the start + end for each caller. That
> might be it, tbh.

Do you mean that modules will have something like

jit_text_alloc(size, MODULES_START, MODULES_END);

and kprobes will have

jit_text_alloc(size, KPROBES_START, KPROBES_END);
?

It still can be achieved with a single jit_alloc_arch_params(), just by
adding enum jit_type parameter to jit_text_alloc().
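
As a concrete reading of the above, a minimal sketch of what such a
typed allocator could look like; the jit_range fields (start, end,
align, pgprot) are assumptions extrapolated from the snippet above,
and __vmalloc_node_range() is just one plausible backend:

   /* Hypothetical sketch only: jit_params and jit_range fields assumed */
   static struct jit_alloc_params jit_params;

   void *jit_text_alloc(size_t size, enum jit_type type)
   {
           struct jit_range *r = &jit_params.ranges[type];

           /* one range per use case, common mapping machinery */
           return __vmalloc_node_range(size, r->align, r->start, r->end,
                                       GFP_KERNEL, r->pgprot, 0,
                                       NUMA_NO_NODE,
                                       __builtin_return_address(0));
   }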

[1] https://lore.kernel.org/linux-mm/87v8mndy3y.ffs@tglx/ 
[2] 

[PATCH] powerpc/interrupt: Don't read MSR from interrupt_exit_kernel_prepare()

2023-06-05 Thread Christophe Leroy
A disassembly of interrupt_exit_kernel_prepare() shows a useless read
of the MSR register. This is shown by r9 being re-used immediately without
doing anything with the value read.

  c000e0e0:   60 00 00 00 nop
  c000e0e4:   7d 3a c2 a6 mfmd_ap r9
  c000e0e8:   7d 20 00 a6 mfmsr   r9
  c000e0ec:   7c 51 13 a6 mtspr   81,r2
  c000e0f0:   81 3f 00 84 lwz r9,132(r31)
  c000e0f4:   71 29 80 00 andi.   r9,r9,32768

This is due to the use of local_irq_save(). The flags read by
local_irq_save() are never used, so use local_irq_disable() instead.

Fixes: 13799748b957 ("powerpc/64: use interrupt restart table to speed up 
return from interrupt")
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/interrupt.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
index e34c72285b4e..f3fc5fe919d9 100644
--- a/arch/powerpc/kernel/interrupt.c
+++ b/arch/powerpc/kernel/interrupt.c
@@ -368,7 +368,6 @@ void preempt_schedule_irq(void);
 
 notrace unsigned long interrupt_exit_kernel_prepare(struct pt_regs *regs)
 {
-   unsigned long flags;
unsigned long ret = 0;
unsigned long kuap;
bool stack_store = read_thread_flags() & _TIF_EMULATE_STACK_STORE;
@@ -392,7 +391,7 @@ notrace unsigned long interrupt_exit_kernel_prepare(struct pt_regs *regs)
 
kuap = kuap_get_and_assert_locked();
 
-   local_irq_save(flags);
+   local_irq_disable();
 
if (!arch_irq_disabled_regs(regs)) {
/* Returning to a kernel context with local irqs enabled. */
-- 
2.40.1



[PATCH] powerpc/signal32: Force inlining of __unsafe_save_user_regs() and save_tm_user_regs_unsafe()

2023-06-05 Thread Christophe Leroy
Looking at generated code for handle_signal32() shows calls to a
function called __unsafe_save_user_regs.constprop.0 while user access
is open.

And that __unsafe_save_user_regs.constprop.0 function has two nops at
the beginning, allowing it to be traced, which is unexpected while the
user access window is open.

The solution could be to mark __unsafe_save_user_regs() notrace, but
to be on the safe side the most efficient fix is to flag it
__always_inline, as already done for __unsafe_restore_general_regs().
The function is relatively small and only called twice, so the size
increase will remain in the noise.

Do the same with save_tm_user_regs_unsafe() as it may suffer the
same issue.
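
(For context on why the nops appear at all: with dynamic ftrace the
compiler emits a patchable call site at the entry of every out-of-line
function, which the kernel nops out until tracing is enabled;
__always_inline removes the out-of-line instance and with it the patch
site. A rough sketch of the idea, not the exact ppc32 code generation:)

   static int traced_helper(int x)        /* out-of-line: entry carries a
                                             patchable (nop'ed) call site */
   {
           return x * 2;
   }

   static __always_inline int safe_helper(int x)
   {
           return x * 2;                  /* inlined: no entry, no nops,
                                             cannot be traced */
   }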

Fixes: ef75e7318294 ("powerpc/signal32: Transform save_user_regs() and 
save_tm_user_regs() in 'unsafe' version")
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/signal_32.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index c114c7f25645..7a718ed32b27 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -264,8 +264,9 @@ static void prepare_save_user_regs(int ctx_has_vsx_region)
 #endif
 }
 
-static int __unsafe_save_user_regs(struct pt_regs *regs, struct mcontext __user *frame,
-				   struct mcontext __user *tm_frame, int ctx_has_vsx_region)
+static __always_inline int
+__unsafe_save_user_regs(struct pt_regs *regs, struct mcontext __user *frame,
+			struct mcontext __user *tm_frame, int ctx_has_vsx_region)
 {
unsigned long msr = regs->msr;
 
@@ -364,8 +365,9 @@ static void prepare_save_tm_user_regs(void)
current->thread.ckvrsave = mfspr(SPRN_VRSAVE);
 }
 
-static int save_tm_user_regs_unsafe(struct pt_regs *regs, struct mcontext __user *frame,
-				    struct mcontext __user *tm_frame, unsigned long msr)
+static __always_inline int
+save_tm_user_regs_unsafe(struct pt_regs *regs, struct mcontext __user *frame,
+			 struct mcontext __user *tm_frame, unsigned long msr)
 {
/* Save both sets of general registers */
	unsafe_save_general_regs(&current->thread.ckpt_regs, frame, failed);
@@ -444,8 +446,9 @@ static int save_tm_user_regs_unsafe(struct pt_regs *regs, struct mcontext __user
 #else
 static void prepare_save_tm_user_regs(void) { }
 
-static int save_tm_user_regs_unsafe(struct pt_regs *regs, struct mcontext __user *frame,
-				    struct mcontext __user *tm_frame, unsigned long msr)
+static __always_inline int
+save_tm_user_regs_unsafe(struct pt_regs *regs, struct mcontext __user *frame,
+			 struct mcontext __user *tm_frame, unsigned long msr)
 {
return 0;
 }
-- 
2.40.1



Re: [PATCH 12/13] x86/jitalloc: prepare to allocate executable memory as ROX

2023-06-05 Thread Mike Rapoport
On Sun, Jun 04, 2023 at 10:52:44PM -0400, Steven Rostedt wrote:
> On Thu, 1 Jun 2023 16:54:36 -0700
> Nadav Amit  wrote:
> 
> > > The way text_poke() is used here, it is creating a new writable alias
> > > and flushing it for *each* write to the module (like for each write of
> > > an individual relocation, etc). I was just thinking it might warrant
> > > some batching or something.  

> > I am not advocating to do so, but if you want to have many efficient
> > writes, perhaps you can just disable CR0.WP. Just saying that if you
> > are about to write all over the memory, text_poke() does not provide
> > too much security for the poking thread.

Heh, this is definitely an easier hack to implement :)

> Batching does exist, which is what the text_poke_queue() thing does.

For module loading, text_poke_queue() will still be much slower than a bunch
of memset()s for no good reason: we don't need all the complexity of
text_poke_bp_batch() for module initialization, because we are sure we are
not patching live code.

What we'd need here is a new batching mode that will create a writable
alias mapping at the beginning of apply_relocate_*() and module_finalize(),
then it will use memcpy() to that writable alias and will tear the mapping
down in the end.
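
A hedged sketch of the flow being proposed; none of these names exist
today, they only illustrate the open/copy/close shape:

   /* Hypothetical API, for illustration only */
   struct reloc { unsigned long offset; const void *val; size_t len; };

   static int apply_relocs_batched(void *mod_mem, size_t size,
                                   const struct reloc *r, int n)
   {
           u8 *alias = writable_alias_begin(mod_mem, size); /* map once */
           int i;

           if (!alias)
                   return -ENOMEM;
           for (i = 0; i < n; i++)                  /* plain memcpy() writes */
                   memcpy(alias + r[i].offset, r[i].val, r[i].len);
           writable_alias_end(alias, mod_mem, size); /* unmap + flush once */
           return 0;
   }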

Another option is to teach alternatives to update a writable copy rather
than do in place changes like Song suggested. My feeling is that it will be
more intrusive change though.

> -- Steve
> 

-- 
Sincerely yours,
Mike.


Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

2023-06-05 Thread Tobias Huschle

On 2023-05-16 15:36, Vincent Guittot wrote:
On Mon, 15 May 2023 at 13:46, Tobias Huschle wrote:


The current load balancer implementation implies that scheduler groups,
within the same domain, all host the same number of CPUs. This is
reflected in the condition, that a scheduler group, which is load
balancing and classified as having spare capacity, should pull work
from the busiest group, if the local group runs less processes than
the busiest one. This implies that these two groups should run the
same number of processes, which is problematic if the groups are not
of the same size.

The assumption that scheduler groups within the same scheduler domain
host the same number of CPUs appears to be true for non-s390
architectures. Nevertheless, s390 can have scheduler groups of unequal
size.

This introduces a performance degradation in the following scenario:

Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
the remaining 2 are located on another socket:

Socket   -----1-----    --2--
CPU      1 2 3 4 5 6    7 8

Placing some workload ( x = one task ) yields the following
scenarios:

The first 5 tasks are distributed evenly across the two groups.

Socket   -----1-----    --2--
CPU      1 2 3 4 5 6    7 8
         x x x          x x

Adding a 6th task yields the following distribution:

Socket   -----1-----    --2--
CPU      1 2 3 4 5 6    7 8
SMT1     x x x          x x
SMT2                    x


Your description is a bit confusing for me. What you name CPU above
should be named Core, doesn't it?

Could you share with us your scheduler topology?



You are correct, it should say core instead of CPU.

One actual configuration from one of my test machines (lscpu -e):

CPU NODE DRAWER BOOK SOCKET CORE L1d:L1i:L2 ONLINE CONFIGURED POLARIZATION ADDRESS
0   0    0      0    0      0    0:0:0      yes    yes        horizontal   0
1   0    0      0    0      0    1:1:0      yes    yes        horizontal   1
2   0    0      0    0      1    2:2:0      yes    yes        horizontal   2
3   0    0      0    0      1    3:3:0      yes    yes        horizontal   3
4   0    0      0    0      2    4:4:0      yes    yes        horizontal   4
5   0    0      0    0      2    5:5:0      yes    yes        horizontal   5
6   0    0      0    0      3    6:6:0      yes    yes        horizontal   6
7   0    0      0    0      3    7:7:0      yes    yes        horizontal   7
8   0    0      0    0      4    8:8:0      yes    yes        horizontal   8
9   0    0      0    0      4    9:9:0      yes    yes        horizontal   9
10  0    0      0    0      5    10:10:0    yes    yes        horizontal   10
11  0    0      0    0      5    11:11:0    yes    yes        horizontal   11
12  0    0      0    1      6    12:12:0    yes    yes        horizontal   12
13  0    0      0    1      6    13:13:0    yes    yes        horizontal   13
14  0    0      0    1      7    14:14:0    yes    yes        horizontal   14
15  0    0      0    1      7    15:15:0    yes    yes        horizontal   15


So, 6 cores / 12 CPUs in one group, 2 cores / 4 CPUs in the other.

If I run stress-ng with 8 cpu stressors on the original code I get a
distribution like this:

00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
 x     x     x     x                     x  x  x  x

Which means that the two cores in the smaller group are running into SMT
while two cores in the larger group are still idle. This is caused by the
prefer_sibling path, which really wants to see both groups run the same
number of tasks.



The task is added to the 2nd scheduler group, as the scheduler has the
assumption that scheduler groups are of the same size, so they should
also host the same number of tasks. This makes CPU 7 run into SMT
thread, which comes with a performance penalty. This means, that in
the window of 6-8 tasks, load balancing is done suboptimally, because
SMT is used although there is no reason to do so as fully idle CPUs
are still available.

Taking the weight of the scheduler groups into account, ensures that
a load balancing CPU within a smaller group will not try to pull tasks
from a bigger group while the bigger group still has idle CPUs
available.

Signed-off-by: Tobias Huschle 
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48b6f0ca13ac..b1307d7e4065 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10426,7 +10426,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
	 * group's child domain.
	 */
	if (sds.prefer_sibling && local->group_type == group_has_spare &&
-	    busiest->sum_nr_running > local->sum_nr_running + 1)
+	    busiest->sum_nr_running * local->group_weight >
+	    local->sum_nr_running * busiest->group_weight + 1)
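
(A worked example, added for clarity: with the 16-CPU topology above and
8 running tasks ideally split 6/2, the smaller group doing the balancing
has local->group_weight = 4 and local->sum_nr_running = 2, while
busiest->sum_nr_running = 6 and busiest->group_weight = 12. The old
condition pulls because 6 > 2 + 1; the new one does not, because
6 * 4 = 24 is not greater than 2 * 12 + 1 = 25, so the smaller group
stops stealing tasks once load is proportional to group size.)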


This is the prefer_sibling path. Could it be that you should disable
prefer_sibling between your sockets for 

Re: [PATCH] sound: Switch i2c drivers back to use .probe()

2023-06-05 Thread Takashi Iwai
On Thu, 25 May 2023 22:36:40 +0200,
Uwe Kleine-König wrote:
> 
> After commit b8a1a4cd5a98 ("i2c: Provide a temporary .probe_new()
> call-back type"), all drivers were converted to .probe_new() and then
> 03c835f498b5 ("i2c: Switch .probe() to not take an id parameter") converted
> back to (the new) .probe() to be able to eventually drop .probe_new() from
> struct i2c_driver.
> 
> Signed-off-by: Uwe Kleine-König 

Applied now to for-next branch.  Thanks.


Takashi


[RFC PATCH v2 5/6] KVM: PPC: Add support for nested PAPR guests

2023-06-05 Thread Jordan Niethe
A series of hcalls have been added to the PAPR which allow a regular
guest partition to create and manage guest partitions of its own. Add
support to KVM to utilize these hcalls to enable running nested guests.

Overview of the new hcall usage:

- L1 and L0 negotiate capabilities with
  H_GUEST_{G,S}ET_CAPABILITIES()

- L1 requests the L0 create a L2 with
  H_GUEST_CREATE() and receives a handle to use in future hcalls

- L1 requests the L0 create a L2 vCPU with
  H_GUEST_CREATE_VCPU()

- L1 sets up the L2 using H_GUEST_SET and the
  H_GUEST_VCPU_RUN input buffer

- L1 requests the L0 runs the L2 vCPU using H_GUEST_VCPU_RUN()

- L2 returns to L1 with an exit reason and L1 reads the
  H_GUEST_VCPU_RUN output buffer populated by the L0

- L1 handles the exit using H_GET_STATE if necessary

- L1 reruns L2 vCPU with H_GUEST_VCPU_RUN

- L1 frees the L2 in the L0 with H_GUEST_DELETE()

Support for the new API is determined by trying
H_GUEST_GET_CAPABILITIES. On a successful return, the new API will then
be used.

Use the vcpu register state setters for tracking modified guest state
elements and copy the thread wide values into the H_GUEST_VCPU_RUN input
buffer immediately before running an L2. The guest wide
elements cannot be added to the input buffer, so send them with a
separate H_GUEST_SET call if necessary.

Make the vcpu register getter load the corresponding value from the real
host with H_GUEST_GET. To avoid unnecessarily calling H_GUEST_GET, track
which values have already been loaded between H_GUEST_VCPU_RUN calls. If
an element is present in the H_GUEST_VCPU_RUN output buffer it also does
not need to be loaded again.
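
A condensed sketch of that lazy-load pattern; kvmhv_on_papr(),
kvmhv_papr_get_one(), the gs_valid bitmap and GSID_TAR are illustrative
placeholders, not the patch's actual API:

   static u64 kvmppc_get_tar_lazy(struct kvm_vcpu *vcpu)
   {
           if (kvmhv_on_papr(vcpu) &&
               !test_bit(GSID_TAR, vcpu->arch.gs_valid)) {
                   kvmhv_papr_get_one(vcpu, GSID_TAR);     /* H_GUEST_GET */
                   set_bit(GSID_TAR, vcpu->arch.gs_valid); /* now cached */
           }
           return vcpu->arch.tar;
   }

If the element was already delivered in the H_GUEST_VCPU_RUN output
buffer, the bit is set when the buffer is parsed and no hcall is made.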

There is existing support for running nested guests on KVM
with powernv. However the interface used for this is not supported by
other PAPR hosts. This existing API is still supported.

Signed-off-by: Jordan Niethe 
---
v2:
  - Declare op structs as static
  - Use expressions in switch case with local variables
  - Do not use the PVR for the LOGICAL PVR ID
  - Handle emul_inst as now a double word
  - Use new GPR(), etc macros
  - Determine PAPR nested capabilities from cpu features
---
 arch/powerpc/include/asm/guest-state-buffer.h | 105 +-
 arch/powerpc/include/asm/hvcall.h |  30 +
 arch/powerpc/include/asm/kvm_book3s.h | 122 ++-
 arch/powerpc/include/asm/kvm_book3s_64.h  |   6 +
 arch/powerpc/include/asm/kvm_host.h   |  21 +
 arch/powerpc/include/asm/kvm_ppc.h|  64 +-
 arch/powerpc/include/asm/plpar_wrappers.h | 198 
 arch/powerpc/kvm/Makefile |   1 +
 arch/powerpc/kvm/book3s_hv.c  | 126 ++-
 arch/powerpc/kvm/book3s_hv.h  |  74 +-
 arch/powerpc/kvm/book3s_hv_nested.c   |  38 +-
 arch/powerpc/kvm/book3s_hv_papr.c | 940 ++
 arch/powerpc/kvm/emulate_loadstore.c  |   4 +-
 arch/powerpc/kvm/guest-state-buffer.c |  49 +
 14 files changed, 1684 insertions(+), 94 deletions(-)
 create mode 100644 arch/powerpc/kvm/book3s_hv_papr.c

diff --git a/arch/powerpc/include/asm/guest-state-buffer.h 
b/arch/powerpc/include/asm/guest-state-buffer.h
index 65a840abf1bb..116126edd8e2 100644
--- a/arch/powerpc/include/asm/guest-state-buffer.h
+++ b/arch/powerpc/include/asm/guest-state-buffer.h
@@ -5,6 +5,7 @@
 #ifndef _ASM_POWERPC_GUEST_STATE_BUFFER_H
 #define _ASM_POWERPC_GUEST_STATE_BUFFER_H
 
+#include "asm/hvcall.h"
 #include 
 #include 
 #include 
@@ -14,16 +15,16 @@
  **/
 #define GSID_BLANK 0x
 
-#define GSID_HOST_STATE_SIZE   0x0001 /* Size of Hypervisor Internal 
Format VCPU state */
-#define GSID_RUN_OUTPUT_MIN_SIZE   0x0002 /* Minimum size of the Run VCPU 
output buffer */
-#define GSID_LOGICAL_PVR   0x0003 /* Logical PVR */
-#define GSID_TB_OFFSET 0x0004 /* Timebase Offset */
-#define GSID_PARTITION_TABLE   0x0005 /* Partition Scoped Page Table */
-#define GSID_PROCESS_TABLE 0x0006 /* Process Table */
+#define GSID_HOST_STATE_SIZE   0x0001
+#define GSID_RUN_OUTPUT_MIN_SIZE   0x0002
+#define GSID_LOGICAL_PVR   0x0003
+#define GSID_TB_OFFSET 0x0004
+#define GSID_PARTITION_TABLE   0x0005
+#define GSID_PROCESS_TABLE 0x0006
 
-#define GSID_RUN_INPUT 0x0C00 /* Run VCPU Input Buffer */
-#define GSID_RUN_OUTPUT0x0C01 /* Run VCPU Out Buffer */
-#define GSID_VPA   0x0C02 /* HRA to Guest VCPU VPA */
+#define GSID_RUN_INPUT 0x0C00
+#define GSID_RUN_OUTPUT0x0C01
+#define GSID_VPA   0x0C02
 
 #define GSID_GPR(x)(0x1000 + (x))
 #define GSID_HDEC_EXPIRY_TB0x1020
@@ -300,6 +301,8 @@ struct gs_buff *gsb_new(size_t size, unsigned long guest_id,
unsigned long vcpu_id, gfp_t 

[RFC PATCH v2 6/6] docs: powerpc: Document nested KVM on POWER

2023-06-05 Thread Jordan Niethe
From: Michael Neuling 

Document support for nested KVM on POWER using the existing API as well
as the new PAPR API. This includes the new HCALL interface and how it is
used by KVM.

Signed-off-by: Michael Neuling 
Signed-off-by: Jordan Niethe 
---
v2:
  - Separated into individual patch
---
 Documentation/powerpc/index.rst  |   1 +
 Documentation/powerpc/kvm-nested.rst | 636 +++
 2 files changed, 637 insertions(+)
 create mode 100644 Documentation/powerpc/kvm-nested.rst

diff --git a/Documentation/powerpc/index.rst b/Documentation/powerpc/index.rst
index 85e80e30160b..5a15dc6389ab 100644
--- a/Documentation/powerpc/index.rst
+++ b/Documentation/powerpc/index.rst
@@ -25,6 +25,7 @@ powerpc
 isa-versions
 kaslr-booke32
 mpc52xx
+kvm-nested
 papr_hcalls
 pci_iov_resource_on_powernv
 pmu-ebb
diff --git a/Documentation/powerpc/kvm-nested.rst 
b/Documentation/powerpc/kvm-nested.rst
new file mode 100644
index ..c0c2e29a59d3
--- /dev/null
+++ b/Documentation/powerpc/kvm-nested.rst
@@ -0,0 +1,636 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+
+===================
+Nested KVM on POWER
+===================
+
+Introduction
+============
+
+hypervisor and run nested guests through the use of hypercalls, if the
+hypervisor has implemented them. The terms L0, L1, and L2 are used to
+refer to different software entities. L0 is the hypervisor mode entity
+that would normally be called the "host" or "hypervisor". L1 is a
+guest virtual machine that is directly run under L0 and is initiated
+and controlled by L0. L2 is a guest virtual machine that is initiated
+and controlled by L1 acting as a hypervisor.
+
+Existing API
+============
+
+Linux/KVM has had support for Nesting as an L0 or L1 since 2018
+
+The L0 code was added::
+
+   commit 8e3f5fc1045dc49fd175b978c5457f5f51e7a2ce
+   Author: Paul Mackerras 
+   Date:   Mon Oct 8 16:31:03 2018 +1100
+   KVM: PPC: Book3S HV: Framework and hcall stubs for nested virtualization
+
+The L1 code was added::
+
+   commit 360cae313702cdd0b90f82c261a8302fecef030a
+   Author: Paul Mackerras 
+   Date:   Mon Oct 8 16:31:04 2018 +1100
+   KVM: PPC: Book3S HV: Nested guest entry via hypercall
+
+This API works primarily using a single hcall h_enter_nested(). This
+call made by the L1 to tell the L0 to start an L2 vCPU with the given
+state. The L0 then starts this L2 and runs until an L2 exit condition
+is reached. Once the L2 exits, the state of the L2 is given back to
+the L1 by the L0. The full L2 vCPU state is always transferred from
+and to L1 when the L2 is run. The L0 doesn't keep any state on the L2
+vCPU (except in the short sequence in the L0 on L1 -> L2 entry and L2
+-> L1 exit).
+
+The only state kept by the L0 is the partition table. The L1 registers
+its partition table using the h_set_partition_table() hcall. All
+other state held by the L0 about the L2s is cached state (such as
+shadow page tables).
+
+The L1 may run any L2 or vCPU without first informing the L0. It
+simply starts the vCPU using h_enter_nested(). The creation of L2s and
+vCPUs is done implicitly whenever h_enter_nested() is called.
+
+In this document, we call this existing API the v1 API.
+
+New PAPR API
+============
+
+The new PAPR API changes from the v1 API such that the creating L2 and
+associated vCPUs is explicit. In this document, we call this the v2
+API.
+
+h_enter_nested() is replaced with H_GUEST_VCPU_RUN().  Before this can
+be called the L1 must explicitly create the L2 using h_guest_create()
+and any associated vCPUs() created with h_guest_create_vCPU(). Getting
+and setting vCPU state can also be performed using h_guest_{g|s}et
+hcall.
+
+The basic execution flow for an L1 to create an L2, run it, and
+delete it is:
+
+- L1 and L0 negotiate capabilities with H_GUEST_{G,S}ET_CAPABILITIES()
+  (normally at L1 boot time).
+
+- L1 requests the L0 create an L2 with H_GUEST_CREATE() and receives a token
+
+- L1 requests the L0 create an L2 vCPU with H_GUEST_CREATE_VCPU()
+
+- L1 and L0 communicate the vCPU state using the H_GUEST_{G,S}ET() hcall
+
+- L1 requests the L0 runs the vCPU running H_GUEST_VCPU_RUN() hcall
+
+- L1 deletes L2 with H_GUEST_DELETE()
+
+More details of the individual hcalls follows:
+
+HCALL Details
+=============
+
+This documentation is provided to give an overall understanding of the
+API. It doesn't aim to provide all the details required to implement
+an L1 or L0. The latest version of PAPR can be referred to for more details.
+
+All these HCALLs are made by the L1 to the L0.
+
+H_GUEST_GET_CAPABILITIES()
+--------------------------
+
+This is called to get the capabilities of the L0 nested
+hypervisor. This includes capabilities such the CPU versions (eg
+POWER9, POWER10) that are supported as L2s::
+
+  H_GUEST_GET_CAPABILITIES(uint64 flags)
+
+  Parameters:
+Input:
+  flags: Reserved
+Output:
+ 

[RFC PATCH v2 1/6] KVM: PPC: Use getters and setters for vcpu register state

2023-06-05 Thread Jordan Niethe
There are already some getter and setter functions used for accessing
vcpu register state, e.g. kvmppc_get_pc(). There are also more
complicated examples that are generated by macros like
kvmppc_get_sprg0() which are generated by the SHARED_SPRNG_WRAPPER()
macro.

In the new PAPR API for nested guest partitions the L1 is required to
communicate with the L0 to modify and read nested guest state.

Prepare to support this by replacing direct accesses to vcpu register
state with wrapper functions. Follow the existing pattern of using
macros to generate individual wrappers. These wrappers will
be augmented for supporting PAPR nested guests later.
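
For readers skimming the diff, a BOOK3S_WRAPPER(tar, 64) invocation
(see the macro introduced below) expands to nothing more than:

   static inline void kvmppc_set_tar(struct kvm_vcpu *vcpu, u64 val)
   {
           vcpu->arch.tar = val;
   }

   static inline u64 kvmppc_get_tar(struct kvm_vcpu *vcpu)
   {
           return vcpu->arch.tar;
   }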

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/include/asm/kvm_book3s.h  |  68 +++-
 arch/powerpc/include/asm/kvm_ppc.h |  48 --
 arch/powerpc/kvm/book3s.c  |  22 +--
 arch/powerpc/kvm/book3s_64_mmu_hv.c|   4 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c |   9 +-
 arch/powerpc/kvm/book3s_64_vio.c   |   4 +-
 arch/powerpc/kvm/book3s_hv.c   | 222 +
 arch/powerpc/kvm/book3s_hv.h   |  59 +++
 arch/powerpc/kvm/book3s_hv_builtin.c   |  10 +-
 arch/powerpc/kvm/book3s_hv_p9_entry.c  |   4 +-
 arch/powerpc/kvm/book3s_hv_ras.c   |   5 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c|   8 +-
 arch/powerpc/kvm/book3s_hv_rm_xics.c   |   4 +-
 arch/powerpc/kvm/book3s_xive.c |   9 +-
 arch/powerpc/kvm/powerpc.c |   4 +-
 15 files changed, 322 insertions(+), 158 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index bbf5e2c5fe09..4e91f54a3f9f 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -392,6 +392,16 @@ static inline ulong kvmppc_get_pc(struct kvm_vcpu *vcpu)
return vcpu->arch.regs.nip;
 }
 
+static inline void kvmppc_set_pid(struct kvm_vcpu *vcpu, u32 val)
+{
+   vcpu->arch.pid = val;
+}
+
+static inline u32 kvmppc_get_pid(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.pid;
+}
+
 static inline u64 kvmppc_get_msr(struct kvm_vcpu *vcpu);
 static inline bool kvmppc_need_byteswap(struct kvm_vcpu *vcpu)
 {
@@ -403,10 +413,66 @@ static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu *vcpu)
return vcpu->arch.fault_dar;
 }
 
+#define BOOK3S_WRAPPER_SET(reg, size)  \
+static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)	\
+{  \
+   \
+   vcpu->arch.reg = val;   \
+}
+
+#define BOOK3S_WRAPPER_GET(reg, size)  \
+static inline u##size kvmppc_get_##reg(struct kvm_vcpu *vcpu)  \
+{  \
+   return vcpu->arch.reg;  \
+}
+
+#define BOOK3S_WRAPPER(reg, size)  \
+   BOOK3S_WRAPPER_SET(reg, size)   \
+   BOOK3S_WRAPPER_GET(reg, size)   \
+
+BOOK3S_WRAPPER(tar, 64)
+BOOK3S_WRAPPER(ebbhr, 64)
+BOOK3S_WRAPPER(ebbrr, 64)
+BOOK3S_WRAPPER(bescr, 64)
+BOOK3S_WRAPPER(ic, 64)
+BOOK3S_WRAPPER(vrsave, 64)
+
+
+#define VCORE_WRAPPER_SET(reg, size)   \
+static inline void kvmppc_set_##reg ##_hv(struct kvm_vcpu *vcpu, u##size val)	\
+{  \
+   vcpu->arch.vcore->reg = val;\
+}
+
+#define VCORE_WRAPPER_GET(reg, size)   \
+static inline u##size kvmppc_get_##reg ##_hv(struct kvm_vcpu *vcpu)\
+{  \
+   return vcpu->arch.vcore->reg;   \
+}
+
+#define VCORE_WRAPPER(reg, size)   \
+   VCORE_WRAPPER_SET(reg, size)\
+   VCORE_WRAPPER_GET(reg, size)\
+
+
+VCORE_WRAPPER(vtb, 64)
+VCORE_WRAPPER(tb_offset, 64)
+VCORE_WRAPPER(lpcr, 64)
+
+static inline u64 kvmppc_get_dec_expires(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.dec_expires;
+}
+
+static inline void kvmppc_set_dec_expires(struct kvm_vcpu *vcpu, u64 val)
+{
+   vcpu->arch.dec_expires = val;
+}
+
 /* Expiry time of vcpu DEC relative to host TB */
 static inline u64 kvmppc_dec_expires_host_tb(struct kvm_vcpu *vcpu)
 {
-   return vcpu->arch.dec_expires - vcpu->arch.vcore->tb_offset;
+   return kvmppc_get_dec_expires(vcpu) - kvmppc_get_tb_offset_hv(vcpu);
 }
 
 static inline bool is_kvmppc_resume_guest(int r)
diff --git a/arch/powerpc/include/asm/kvm_ppc.h 
b/arch/powerpc/include/asm/kvm_ppc.h
index 79a9c0bb8bba..fbac353ac46b 100644
--- 

[RFC PATCH v2 2/6] KVM: PPC: Add fpr getters and setters

2023-06-05 Thread Jordan Niethe
Add wrappers for fpr registers to prepare for supporting PAPR nested
guests.

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/include/asm/kvm_book3s.h | 31 +++
 arch/powerpc/include/asm/kvm_booke.h  | 10 +
 arch/powerpc/kvm/book3s.c | 16 +++---
 arch/powerpc/kvm/emulate_loadstore.c  |  2 +-
 arch/powerpc/kvm/powerpc.c| 22 +--
 5 files changed, 61 insertions(+), 20 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
b/arch/powerpc/include/asm/kvm_book3s.h
index 4e91f54a3f9f..a632e79639f0 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -413,6 +413,37 @@ static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu *vcpu)
return vcpu->arch.fault_dar;
 }
 
+static inline u64 kvmppc_get_fpr(struct kvm_vcpu *vcpu, int i)
+{
+   return vcpu->arch.fp.fpr[i][TS_FPROFFSET];
+}
+
+static inline void kvmppc_set_fpr(struct kvm_vcpu *vcpu, int i, u64 val)
+{
+   vcpu->arch.fp.fpr[i][TS_FPROFFSET] = val;
+}
+
+static inline u64 kvmppc_get_fpscr(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.fp.fpscr;
+}
+
+static inline void kvmppc_set_fpscr(struct kvm_vcpu *vcpu, u64 val)
+{
+   vcpu->arch.fp.fpscr = val;
+}
+
+
+static inline u64 kvmppc_get_vsx_fpr(struct kvm_vcpu *vcpu, int i, int j)
+{
+   return vcpu->arch.fp.fpr[i][j];
+}
+
+static inline void kvmppc_set_vsx_fpr(struct kvm_vcpu *vcpu, int i, int j, u64 val)
+{
+   vcpu->arch.fp.fpr[i][j] = val;
+}
+
 #define BOOK3S_WRAPPER_SET(reg, size)  \
+static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)	\
 {  \
diff --git a/arch/powerpc/include/asm/kvm_booke.h 
b/arch/powerpc/include/asm/kvm_booke.h
index 0c3401b2e19e..7c3291aa8922 100644
--- a/arch/powerpc/include/asm/kvm_booke.h
+++ b/arch/powerpc/include/asm/kvm_booke.h
@@ -89,6 +89,16 @@ static inline ulong kvmppc_get_pc(struct kvm_vcpu *vcpu)
return vcpu->arch.regs.nip;
 }
 
+static inline void kvmppc_set_fpr(struct kvm_vcpu *vcpu, int i, u64 val)
+{
+   vcpu->arch.fp.fpr[i][TS_FPROFFSET] = val;
+}
+
+static inline u64 kvmppc_get_fpr(struct kvm_vcpu *vcpu, int i)
+{
+   return vcpu->arch.fp.fpr[i][TS_FPROFFSET];
+}
+
 #ifdef CONFIG_BOOKE
 static inline ulong kvmppc_get_fault_dar(struct kvm_vcpu *vcpu)
 {
diff --git a/arch/powerpc/kvm/book3s.c b/arch/powerpc/kvm/book3s.c
index 2fe31b518886..6cd20ab9e94e 100644
--- a/arch/powerpc/kvm/book3s.c
+++ b/arch/powerpc/kvm/book3s.c
@@ -636,17 +636,17 @@ int kvmppc_get_one_reg(struct kvm_vcpu *vcpu, u64 id,
break;
case KVM_REG_PPC_FPR0 ... KVM_REG_PPC_FPR31:
i = id - KVM_REG_PPC_FPR0;
-   *val = get_reg_val(id, VCPU_FPR(vcpu, i));
+   *val = get_reg_val(id, kvmppc_get_fpr(vcpu, i));
break;
case KVM_REG_PPC_FPSCR:
-   *val = get_reg_val(id, vcpu->arch.fp.fpscr);
+   *val = get_reg_val(id, kvmppc_get_fpscr(vcpu));
break;
 #ifdef CONFIG_VSX
case KVM_REG_PPC_VSR0 ... KVM_REG_PPC_VSR31:
if (cpu_has_feature(CPU_FTR_VSX)) {
i = id - KVM_REG_PPC_VSR0;
-   val->vsxval[0] = vcpu->arch.fp.fpr[i][0];
-   val->vsxval[1] = vcpu->arch.fp.fpr[i][1];
+   val->vsxval[0] = kvmppc_get_vsx_fpr(vcpu, i, 0);
+   val->vsxval[1] = kvmppc_get_vsx_fpr(vcpu, i, 1);
} else {
r = -ENXIO;
}
@@ -724,7 +724,7 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
break;
case KVM_REG_PPC_FPR0 ... KVM_REG_PPC_FPR31:
i = id - KVM_REG_PPC_FPR0;
-   VCPU_FPR(vcpu, i) = set_reg_val(id, *val);
+   kvmppc_set_fpr(vcpu, i, set_reg_val(id, *val));
break;
case KVM_REG_PPC_FPSCR:
vcpu->arch.fp.fpscr = set_reg_val(id, *val);
@@ -733,8 +733,8 @@ int kvmppc_set_one_reg(struct kvm_vcpu *vcpu, u64 id,
case KVM_REG_PPC_VSR0 ... KVM_REG_PPC_VSR31:
if (cpu_has_feature(CPU_FTR_VSX)) {
i = id - KVM_REG_PPC_VSR0;
-   vcpu->arch.fp.fpr[i][0] = val->vsxval[0];
-   vcpu->arch.fp.fpr[i][1] = val->vsxval[1];
+   kvmppc_set_vsx_fpr(vcpu, i, 0, val->vsxval[0]);
+   kvmppc_set_vsx_fpr(vcpu, i, 1, val->vsxval[1]);
} else {
r = -ENXIO;
}
@@ 

[RFC PATCH v2 0/6] KVM: PPC: Nested PAPR guests

2023-06-05 Thread Jordan Niethe
There is existing support for nested guests on powernv hosts; however, the
hcall interface this uses is not supported by other PAPR hosts. A set of
new hcalls will be added to PAPR to facilitate creating and managing
guests by a regular partition in the following way:

  - L1 and L0 negotiate capabilities with
H_GUEST_{G,S}ET_CAPABILITIES

  - L1 requests the L0 create a L2 with
H_GUEST_CREATE and receives a handle to use in future hcalls

  - L1 requests the L0 create a L2 vCPU with
H_GUEST_CREATE_VCPU

  - L1 sets up the L2 using H_GUEST_SET and the
H_GUEST_VCPU_RUN input buffer

  - L1 requests the L0 runs the L2 vCPU using H_GUEST_VCPU_RUN

  - L2 returns to L1 with an exit reason and L1 reads the
H_GUEST_VCPU_RUN output buffer populated by the L0

  - L1 handles the exit using H_GET_STATE if necessary

  - L1 reruns L2 vCPU with H_GUEST_VCPU_RUN

  - L1 frees the L2 in the L0 with H_GUEST_DELETE

Further details are available in Documentation/powerpc/kvm-nested.rst.

This series adds KVM support for using this hcall interface as a regular
PAPR partition, i.e. the L1. It does not add support for running as the
L0.

The new hcalls have been implemented in the spapr qemu model for
testing.

This is available at https://github.com/mikey/qemu/tree/kvm-papr

There are scripts available to assist in setting up an environment for
testing nested guests at https://github.com/mikey/kvm-powervm-test

A tree with this series is available at
https://github.com/iamjpn/linux/tree/features/kvm-papr

Thanks to Amit Machhiwal, Kautuk Consul, Vaibhav Jain, Michael Neuling,
Shivaprasad Bhat, Harsh Prateek Bora, Paul Mackerras and Nicholas
Piggin. 

Change overview in v2:
  - Rebase on top of kvm ppc prefix instruction support
  - Make documentation an individual patch
  - Move guest state buffer files from arch/powerpc/lib/ to
arch/powerpc/kvm/
  - Use kunit for testing guest state buffer
  - Fix some build errors
  - Change HEIR element from 4 bytes to 8 bytes

Previous revisions:

  - v1: 
https://lore.kernel.org/linuxppc-dev/20230508072332.2937883-1-...@linux.vnet.ibm.com/

Jordan Niethe (5):
  KVM: PPC: Use getters and setters for vcpu register state
  KVM: PPC: Add fpr getters and setters
  KVM: PPC: Add vr getters and setters
  KVM: PPC: Add helper library for Guest State Buffers
  KVM: PPC: Add support for nested PAPR guests

Michael Neuling (1):
  docs: powerpc: Document nested KVM on POWER

 Documentation/powerpc/index.rst   |   1 +
 Documentation/powerpc/kvm-nested.rst  | 636 +++
 arch/powerpc/Kconfig.debug|  12 +
 arch/powerpc/include/asm/guest-state-buffer.h | 988 ++
 arch/powerpc/include/asm/hvcall.h |  30 +
 arch/powerpc/include/asm/kvm_book3s.h | 205 +++-
 arch/powerpc/include/asm/kvm_book3s_64.h  |   6 +
 arch/powerpc/include/asm/kvm_booke.h  |  10 +
 arch/powerpc/include/asm/kvm_host.h   |  21 +
 arch/powerpc/include/asm/kvm_ppc.h|  80 +-
 arch/powerpc/include/asm/plpar_wrappers.h | 198 
 arch/powerpc/kvm/Makefile |   4 +
 arch/powerpc/kvm/book3s.c |  38 +-
 arch/powerpc/kvm/book3s_64_mmu_hv.c   |   4 +-
 arch/powerpc/kvm/book3s_64_mmu_radix.c|   9 +-
 arch/powerpc/kvm/book3s_64_vio.c  |   4 +-
 arch/powerpc/kvm/book3s_hv.c  | 336 --
 arch/powerpc/kvm/book3s_hv.h  |  65 ++
 arch/powerpc/kvm/book3s_hv_builtin.c  |  10 +-
 arch/powerpc/kvm/book3s_hv_nested.c   |  38 +-
 arch/powerpc/kvm/book3s_hv_p9_entry.c |   4 +-
 arch/powerpc/kvm/book3s_hv_papr.c | 940 +
 arch/powerpc/kvm/book3s_hv_ras.c  |   5 +-
 arch/powerpc/kvm/book3s_hv_rm_mmu.c   |   8 +-
 arch/powerpc/kvm/book3s_hv_rm_xics.c  |   4 +-
 arch/powerpc/kvm/book3s_xive.c|   9 +-
 arch/powerpc/kvm/emulate_loadstore.c  |   6 +-
 arch/powerpc/kvm/guest-state-buffer.c | 612 +++
 arch/powerpc/kvm/powerpc.c|  76 +-
 arch/powerpc/kvm/test-guest-state-buffer.c| 321 ++
 30 files changed, 4467 insertions(+), 213 deletions(-)
 create mode 100644 Documentation/powerpc/kvm-nested.rst
 create mode 100644 arch/powerpc/include/asm/guest-state-buffer.h
 create mode 100644 arch/powerpc/kvm/book3s_hv_papr.c
 create mode 100644 arch/powerpc/kvm/guest-state-buffer.c
 create mode 100644 arch/powerpc/kvm/test-guest-state-buffer.c

-- 
2.31.1



[RFC PATCH v2 4/6] KVM: PPC: Add helper library for Guest State Buffers

2023-06-05 Thread Jordan Niethe
The new PAPR nested guest API introduces the concept of a Guest State
Buffer for communication about L2 guests between L1 and L0 hosts.

In the new API, the L0 manages the L2 on behalf of the L1. This means
that if the L1 needs to change L2 state (e.g. GPRs, SPRs, partition
table...), it must request that the L0 perform the modification. Likewise,
if the L1 needs to read L2 state, the request must go through the L0.

The Guest State Buffer is a Type-Length-Value style data format defined
in the PAPR, which assigns each piece of relevant partition state a
unique identity. Unlike a typical TLV format, the length is redundant,
since the length of each identity is fixed; it is nonetheless included
so that buffers can be checked for correctness.

A guest state buffer consists of an element count followed by a stream
of elements, where elements are composed of an ID number, data length,
then the data:

  Header:

   <---4 bytes---->
  +----------------+-----
  | Element Count  | Elements...
  +----------------+-----

  Element:

   <---2 bytes----> <-2 bytes-> <-Length bytes->
  +----------------+-----------+----------------+
  | Guest State ID |  Length   |  Data          |
  +----------------+-----------+----------------+
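
A rough C rendering of that layout, with illustrative type names (the
patch below defines its own; PAPR data is big-endian, hence the __be
fields):

    #include <linux/types.h>

    /* Hypothetical view of the wire format sketched above. */
    struct gsb_header {
            __be32 nelems;          /* Element Count */
            u8 data[];              /* stream of packed elements */
    } __packed;

    struct gsb_elem {
            __be16 iden;            /* Guest State ID */
            __be16 len;             /* fixed per ID, kept for validation */
            u8 data[];              /* 'len' bytes of value */
    } __packed;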

Guest State IDs have other attributes defined in the PAPR, such as
whether they are per-thread or per-guest, or read-only.
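
Purely as an illustration, such attributes could be modelled as per-ID
flags (the names and encoding below are invented, not the PAPR's):

    /* Hypothetical per-ID attribute table; encoding is invented. */
    #define GS_ATTR_GUEST_WIDE  0x1 /* whole-L2 state, not per vCPU thread */
    #define GS_ATTR_READ_ONLY   0x2 /* L1 may get but never set this ID */

    struct gs_id_attr {
            u16 size;               /* the fixed data length for this ID */
            u16 flags;              /* GS_ATTR_* */
    };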

Introduce a library for using guest state buffers. This includes support
for actions such as creating buffers, adding elements to buffers,
reading the value of elements and parsing buffers. This will be used
later by the PAPR nested guest support.
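
A hypothetical usage flow (every gsb_*/gse_* name below is an invented
placeholder, not the library's actual API):

    /* Illustrative only: invented function names; guest_id and vcpu_id
     * are assumed to have been obtained from the create hcalls. */
    struct gs_buff *gsb;
    u64 pvr;

    gsb = gsb_new(PAGE_SIZE, guest_id, vcpu_id, GFP_KERNEL); /* create */
    if (!gsb)
            return -ENOMEM;

    gse_put_u64(gsb, GSID_LOGICAL_PVR, 0x00800200ULL); /* add an element */
    /* ... hand the buffer to the L0 via the set-state hcall ... */

    /* parse a buffer returned by the L0 and read back an element */
    pvr = gse_get_u64(gsb_lookup(gsb, GSID_LOGICAL_PVR));
    gsb_free(gsb);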

Signed-off-by: Jordan Niethe 
---
v2:
  - Add missing #ifdef CONFIG_VSX guards
  - Move files from lib/ to kvm/
  - Guard compilation on CONFIG_KVM_BOOK3S_HV_POSSIBLE
  - Use kunit for guest state buffer tests
  - Add configuration option for the tests
  - Use macros for contiguous id ranges like GPRs
  - Add some missing EXPORTs to functions
  - HEIR element is a double word not a word
---
 arch/powerpc/Kconfig.debug|  12 +
 arch/powerpc/include/asm/guest-state-buffer.h | 901 ++
 arch/powerpc/include/asm/kvm_book3s.h |   2 +
 arch/powerpc/kvm/Makefile |   3 +
 arch/powerpc/kvm/guest-state-buffer.c | 563 +++
 arch/powerpc/kvm/test-guest-state-buffer.c| 321 +++
 6 files changed, 1802 insertions(+)
 create mode 100644 arch/powerpc/include/asm/guest-state-buffer.h
 create mode 100644 arch/powerpc/kvm/guest-state-buffer.c
 create mode 100644 arch/powerpc/kvm/test-guest-state-buffer.c

diff --git a/arch/powerpc/Kconfig.debug b/arch/powerpc/Kconfig.debug
index 6aaf8dc60610..ed830a714720 100644
--- a/arch/powerpc/Kconfig.debug
+++ b/arch/powerpc/Kconfig.debug
@@ -82,6 +82,18 @@ config MSI_BITMAP_SELFTEST
bool "Run self-tests of the MSI bitmap code"
depends on DEBUG_KERNEL
 
+config GUEST_STATE_BUFFER_TEST
+   def_tristate n
+   prompt "Enable Guest State Buffer unit tests"
+   depends on KUNIT
+   depends on KVM_BOOK3S_HV_POSSIBLE
+   default KUNIT_ALL_TESTS
+   help
+ The Guest State Buffer is a data format specified in the PAPR.
+ It is used by hcalls to communicate the state of L2 guests between
+ the L1 and L0 hypervisors. Enable unit tests for the library
+ used to create and use guest state buffers.
+
 config PPC_IRQ_SOFT_MASK_DEBUG
bool "Include extra checks for powerpc irq soft masking"
depends on PPC64
diff --git a/arch/powerpc/include/asm/guest-state-buffer.h b/arch/powerpc/include/asm/guest-state-buffer.h
new file mode 100644
index ..65a840abf1bb
--- /dev/null
+++ b/arch/powerpc/include/asm/guest-state-buffer.h
@@ -0,0 +1,901 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+/*
+ * Interface based on include/net/netlink.h
+ */
+#ifndef _ASM_POWERPC_GUEST_STATE_BUFFER_H
+#define _ASM_POWERPC_GUEST_STATE_BUFFER_H
+
+#include 
+#include 
+#include 
+
+/**
+ * Guest State Buffer Constants
+ **/
+#define GSID_BLANK 0x0000
+
+#define GSID_HOST_STATE_SIZE   0x0001 /* Size of Hypervisor Internal Format VCPU state */
+#define GSID_RUN_OUTPUT_MIN_SIZE   0x0002 /* Minimum size of the Run VCPU output buffer */
+#define GSID_LOGICAL_PVR   0x0003 /* Logical PVR */
+#define GSID_TB_OFFSET 0x0004 /* Timebase Offset */
+#define GSID_PARTITION_TABLE   0x0005 /* Partition Scoped Page Table */
+#define GSID_PROCESS_TABLE 0x0006 /* Process Table */
+
+#define GSID_RUN_INPUT 0x0C00 /* Run VCPU Input Buffer */
+#define GSID_RUN_OUTPUT 0x0C01 /* Run VCPU Out Buffer */
+#define GSID_VPA   0x0C02 /* HRA to Guest VCPU VPA */
+
+#define GSID_GPR(x) (0x1000 + (x))
+#define GSID_HDEC_EXPIRY_TB 0x1020
+#define GSID_NIA

[RFC PATCH v2 3/6] KVM: PPC: Add vr getters and setters

2023-06-05 Thread Jordan Niethe
Add wrappers for vr registers to prepare for supporting PAPR nested
guests.
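
The conversions that follow all take the same shape: the accessors copy
the vector128 in and out rather than exposing an lvalue, so a field
update becomes a get/modify/set sequence, e.g. (taken from the hunks
below):

    val.vval = kvmppc_get_vsx_vr(vcpu, index - 32); /* fetch the vector */
    val.vsxval[offset] = gpr;                       /* update one dword */
    kvmppc_set_vsx_vr(vcpu, index - 32, val.vval);  /* write it back */

Funnelling every access through a helper means a later patch can change
where the state actually lives (e.g. a guest state buffer) without
touching these callers.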

Signed-off-by: Jordan Niethe 
---
 arch/powerpc/include/asm/kvm_book3s.h | 20 +++
 arch/powerpc/kvm/powerpc.c| 50 +--
 2 files changed, 45 insertions(+), 25 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_book3s.h b/arch/powerpc/include/asm/kvm_book3s.h
index a632e79639f0..77653c5b356b 100644
--- a/arch/powerpc/include/asm/kvm_book3s.h
+++ b/arch/powerpc/include/asm/kvm_book3s.h
@@ -444,6 +444,26 @@ static inline void kvmppc_set_vsx_fpr(struct kvm_vcpu *vcpu, int i, int j, u64 v
vcpu->arch.fp.fpr[i][j] = val;
 }
 
+static inline vector128 kvmppc_get_vsx_vr(struct kvm_vcpu *vcpu, int i)
+{
+   return vcpu->arch.vr.vr[i];
+}
+
+static inline void kvmppc_set_vsx_vr(struct kvm_vcpu *vcpu, int i, vector128 val)
+{
+   vcpu->arch.vr.vr[i] = val;
+}
+
+static inline u32 kvmppc_get_vscr(struct kvm_vcpu *vcpu)
+{
+   return vcpu->arch.vr.vscr.u[3];
+}
+
+static inline void kvmppc_set_vscr(struct kvm_vcpu *vcpu, u32 val)
+{
+   vcpu->arch.vr.vscr.u[3] = val;
+}
+
 #define BOOK3S_WRAPPER_SET(reg, size)  \
 static inline void kvmppc_set_##reg(struct kvm_vcpu *vcpu, u##size val)
\
 {  \
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 7f913e68342a..10436213aea2 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -934,9 +934,9 @@ static inline void kvmppc_set_vsr_dword(struct kvm_vcpu *vcpu,
return;
 
if (index >= 32) {
-   val.vval = VCPU_VSX_VR(vcpu, index - 32);
+   val.vval = kvmppc_get_vsx_vr(vcpu, index - 32);
val.vsxval[offset] = gpr;
-   VCPU_VSX_VR(vcpu, index - 32) = val.vval;
+   kvmppc_set_vsx_vr(vcpu, index - 32, val.vval);
} else {
kvmppc_set_vsx_fpr(vcpu, index, offset, gpr);
}
@@ -949,10 +949,10 @@ static inline void kvmppc_set_vsr_dword_dump(struct kvm_vcpu *vcpu,
int index = vcpu->arch.io_gpr & KVM_MMIO_REG_MASK;
 
if (index >= 32) {
-   val.vval = VCPU_VSX_VR(vcpu, index - 32);
+   val.vval = kvmppc_get_vsx_vr(vcpu, index - 32);
val.vsxval[0] = gpr;
val.vsxval[1] = gpr;
-   VCPU_VSX_VR(vcpu, index - 32) = val.vval;
+   kvmppc_set_vsx_vr(vcpu, index - 32, val.vval);
} else {
kvmppc_set_vsx_fpr(vcpu, index, 0, gpr);
kvmppc_set_vsx_fpr(vcpu, index, 1,  gpr);
@@ -970,7 +970,7 @@ static inline void kvmppc_set_vsr_word_dump(struct kvm_vcpu *vcpu,
val.vsx32val[1] = gpr;
val.vsx32val[2] = gpr;
val.vsx32val[3] = gpr;
-   VCPU_VSX_VR(vcpu, index - 32) = val.vval;
+   kvmppc_set_vsx_vr(vcpu, index - 32, val.vval);
} else {
val.vsx32val[0] = gpr;
val.vsx32val[1] = gpr;
@@ -991,9 +991,9 @@ static inline void kvmppc_set_vsr_word(struct kvm_vcpu *vcpu,
return;
 
if (index >= 32) {
-   val.vval = VCPU_VSX_VR(vcpu, index - 32);
+   val.vval = kvmppc_get_vsx_vr(vcpu, index - 32);
val.vsx32val[offset] = gpr32;
-   VCPU_VSX_VR(vcpu, index - 32) = val.vval;
+   kvmppc_set_vsx_vr(vcpu, index - 32, val.vval);
} else {
dword_offset = offset / 2;
word_offset = offset % 2;
@@ -1058,9 +1058,9 @@ static inline void kvmppc_set_vmx_dword(struct kvm_vcpu *vcpu,
if (offset == -1)
return;
 
-   val.vval = VCPU_VSX_VR(vcpu, index);
+   val.vval = kvmppc_get_vsx_vr(vcpu, index);
val.vsxval[offset] = gpr;
-   VCPU_VSX_VR(vcpu, index) = val.vval;
+   kvmppc_set_vsx_vr(vcpu, index, val.vval);
 }
 
 static inline void kvmppc_set_vmx_word(struct kvm_vcpu *vcpu,
@@ -1074,9 +1074,9 @@ static inline void kvmppc_set_vmx_word(struct kvm_vcpu *vcpu,
if (offset == -1)
return;
 
-   val.vval = VCPU_VSX_VR(vcpu, index);
+   val.vval = kvmppc_get_vsx_vr(vcpu, index);
val.vsx32val[offset] = gpr32;
-   VCPU_VSX_VR(vcpu, index) = val.vval;
+   kvmppc_set_vsx_vr(vcpu, index, val.vval);
 }
 
 static inline void kvmppc_set_vmx_hword(struct kvm_vcpu *vcpu,
@@ -1090,9 +1090,9 @@ static inline void kvmppc_set_vmx_hword(struct kvm_vcpu *vcpu,
if (offset == -1)
return;
 
-   val.vval = VCPU_VSX_VR(vcpu, index);
+   val.vval = kvmppc_get_vsx_vr(vcpu, index);
val.vsx16val[offset] = gpr16;
-   VCPU_VSX_VR(vcpu, index) = val.vval;
+   kvmppc_set_vsx_vr(vcpu, index, val.vval);
 }
 
 static inline void kvmppc_set_vmx_byte(struct kvm_vcpu *vcpu,
@@ -1106,9 +1106,9 @@ static inline void kvmppc_set_vmx_byte(struct kvm_vcpu