Re: [RFC v2 01/13] mm/kfence: Add a new kunit test test_use_after_free_read_nofault()

2024-10-02 Thread IBM


Hello Kasan/kfence-devs, 

Wanted your inputs on this kfence kunit test [PATCH-1] and its respective
powerpc fix [PATCH-2]. The commit messages have a good description of it. I
see that the same problem was noticed on s390 as well [1] a while ago.
That makes me believe we should perhaps have a kunit test for this, to make
sure all architectures handle it properly.

Thoughts?

[1]: https://lore.kernel.org/all/20230213183858.1473681-1-...@linux.ibm.com/

-ritesh


"Ritesh Harjani (IBM)"  writes:

> From: Nirjhar Roy 
>
> Faults from copy_from_kernel_nofault() needs to be handled by fixup
> table and should not be handled by kfence. Otherwise while reading
> /proc/kcore which uses copy_from_kernel_nofault(), kfence can generate
> false negatives. This can happen when /proc/kcore ends up reading an
> unmapped address from kfence pool.
>
> Let's add a testcase to cover this case.
>
> Co-developed-by: Ritesh Harjani (IBM) 
> Signed-off-by: Ritesh Harjani (IBM) 
> Signed-off-by: Nirjhar Roy 
> Cc: kasan-...@googlegroups.com
> Cc: Alexander Potapenko 
> Cc: linux...@kvack.org
> ---
>  mm/kfence/kfence_test.c | 17 +
>  1 file changed, 17 insertions(+)
>
> diff --git a/mm/kfence/kfence_test.c b/mm/kfence/kfence_test.c
> index 00fd17285285..f65fb182466d 100644
> --- a/mm/kfence/kfence_test.c
> +++ b/mm/kfence/kfence_test.c
> @@ -383,6 +383,22 @@ static void test_use_after_free_read(struct kunit *test)
>   KUNIT_EXPECT_TRUE(test, report_matches(&expect));
>  }
>
> +static void test_use_after_free_read_nofault(struct kunit *test)
> +{
> + const size_t size = 32;
> + char *addr;
> + char dst;
> + int ret;
> +
> + setup_test_cache(test, size, 0, NULL);
> + addr = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
> + test_free(addr);
> + /* Use after free with *_nofault() */
> + ret = copy_from_kernel_nofault(&dst, addr, 1);
> + KUNIT_EXPECT_EQ(test, ret, -EFAULT);
> + KUNIT_EXPECT_FALSE(test, report_available());
> +}
> +
>  static void test_double_free(struct kunit *test)
>  {
>   const size_t size = 32;
> @@ -780,6 +796,7 @@ static struct kunit_case kfence_test_cases[] = {
>   KFENCE_KUNIT_CASE(test_out_of_bounds_read),
>   KFENCE_KUNIT_CASE(test_out_of_bounds_write),
>   KFENCE_KUNIT_CASE(test_use_after_free_read),
> + KFENCE_KUNIT_CASE(test_use_after_free_read_nofault),
>   KFENCE_KUNIT_CASE(test_double_free),
>   KFENCE_KUNIT_CASE(test_invalid_addr_free),
>   KFENCE_KUNIT_CASE(test_corruption),
> --
> 2.46.0



Re: [PATCH v2] selftests/powerpc: Remove the path after initialization.

2024-10-01 Thread IBM
zhangjiao2  writes:

> From: zhang jiao 
>
> If there were no anamolies noted, then we can
> simply remove the log file and return, 

after the path variable has been initialized.

(minor nit)


>
> Signed-off-by: zhang jiao 
> ---
> v1->v2:
>   Remove the path after initialization.
>
>  tools/testing/selftests/powerpc/mm/tlbie_test.c | 10 +-
>  1 file changed, 5 insertions(+), 5 deletions(-)


Thanks for the fix. Looks good to me. 
Please feel free to add - 

Reviewed-by: Ritesh Harjani (IBM) 



Re: [PATCH v2] powerpc/xive: Use cpumask_intersects()

2024-09-27 Thread IBM
Costa Shulyupin  writes:

> Replace `cpumask_any_and(a, b) >= nr_cpu_ids`
> with the more readable `!cpumask_intersects(a, b)`.
>
> Comparison between cpumask_any_and() and cpumask_intersects()
>
> The cpumask_any_and() function expands using FIND_FIRST_BIT(),
> resulting in a loop that iterates through each bit of the bitmask:
>
> for (idx = 0; idx * BITS_PER_LONG < sz; idx++) {
>   val = (FETCH);
>   if (val) {
>   sz = min(idx * BITS_PER_LONG + __ffs(MUNGE(val)), sz);
>   break;
>   }
> }
>
> The cpumask_intersects() function expands using __bitmap_intersects(),
> resulting in that the first loop iterates through each long word of the 
> bitmask,
> and the second through each bit within a long word:
>
> unsigned int k, lim = bits/BITS_PER_LONG;
> for (k = 0; k < lim; ++k)
>   if (bitmap1[k] & bitmap2[k])
>   return true;
>
> if (bits % BITS_PER_LONG)
>   if ((bitmap1[k] & bitmap2[k]) & BITMAP_LAST_WORD_MASK(bits))
>   return true;
>
> Conclusion: cpumask_intersects() is at least as efficient as 
> cpumask_any_and(),
> if not more so, as it typically performs fewer loops and comparisons.
>

I agree with the analysis above. cpumask_any_and() has to get the
first set bit common to the two cpumasks, for which it also does some
additional calculations like __ffs(),

whereas cpumask_intersects() only has to check whether any bit is set,
and hence does fewer operations.
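
To make the readability argument concrete, here is a small illustrative
sketch (not taken from the patch) of the two equivalent checks:

    /* Old form: find the index of some common CPU, then compare
     * against nr_cpu_ids just to learn whether the masks overlap. */
    if (cpumask_any_and(cpumask, cpu_online_mask) >= nr_cpu_ids)
            return -EINVAL;

    /* New form: ask the question directly -- do the masks share any
     * bit? No bit index needs to be computed. */
    if (!cpumask_intersects(cpumask, cpu_online_mask))
            return -EINVAL;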


Looks good to me. Please feel free to add - 

Reviewed-by: Ritesh Harjani (IBM) 


> Signed-off-by: Costa Shulyupin 
> Reviewed-by: Ming Lei 
>
> ---
>
> v2: add comparison between cpumask_any_and() and cpumask_intersects()
>
> ---
>  arch/powerpc/sysdev/xive/common.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/sysdev/xive/common.c 
> b/arch/powerpc/sysdev/xive/common.c
> index fa01818c1972c..a6c388bdf5d08 100644
> --- a/arch/powerpc/sysdev/xive/common.c
> +++ b/arch/powerpc/sysdev/xive/common.c
> @@ -726,7 +726,7 @@ static int xive_irq_set_affinity(struct irq_data *d,
>   pr_debug("%s: irq %d/0x%x\n", __func__, d->irq, hw_irq);
>  
>   /* Is this valid ? */
> - if (cpumask_any_and(cpumask, cpu_online_mask) >= nr_cpu_ids)
> + if (!cpumask_intersects(cpumask, cpu_online_mask))
>   return -EINVAL;
>  
>   /*
> -- 
> 2.45.0



Re: [PATCH] powerpc: remove dead config options for MPC85xx platform support

2024-09-27 Thread IBM
Lukas Bulwahn  writes:

> From: Lukas Bulwahn 
>
> Commit 384e338a9187 ("powerpc: drop MPC8540_ADS and MPC8560_ADS platform
> support") and commit b751ed04bc5e ("powerpc: drop MPC85xx_CDS platform
> support") removes the platform support for MPC8540_ADS, MPC8560_ADS and
> MPC85xx_CDS in the source tree, but misses to remove the config options in
> the Kconfig file. Hence, these three config options are without any effect
> since then.
>
> Drop these three dead config options.
>

Indeed, these look to be leftover dead config options.

> Fixes: 384e338a9187 ("powerpc: drop MPC8540_ADS and MPC8560_ADS platform 
> support")
> Fixes: b751ed04bc5e ("powerpc: drop MPC85xx_CDS platform support")
> Signed-off-by: Lukas Bulwahn 
> ---
>  arch/powerpc/platforms/85xx/Kconfig | 21 -
>  1 file changed, 21 deletions(-)

I couldn't find any remaining reference to MPC8540_ADS, MPC8560_ADS or
MPC85xx_CDS after this patch.

So please feel free to add - 

Reviewed-by: Ritesh Harjani (IBM) 

>
> diff --git a/arch/powerpc/platforms/85xx/Kconfig 
> b/arch/powerpc/platforms/85xx/Kconfig
> index 9315a3b69d6d..604c1b4b6d45 100644
> --- a/arch/powerpc/platforms/85xx/Kconfig
> +++ b/arch/powerpc/platforms/85xx/Kconfig
> @@ -40,27 +40,6 @@ config BSC9132_QDS
> and dual StarCore SC3850 DSP cores.
> Manufacturer : Freescale Semiconductor, Inc
>  
> -config MPC8540_ADS
> - bool "Freescale MPC8540 ADS"
> - select DEFAULT_UIMAGE
> - help
> -   This option enables support for the MPC 8540 ADS board
> -
> -config MPC8560_ADS
> - bool "Freescale MPC8560 ADS"
> - select DEFAULT_UIMAGE
> - select CPM2
> - help
> -   This option enables support for the MPC 8560 ADS board
> -
> -config MPC85xx_CDS
> - bool "Freescale MPC85xx CDS"
> - select DEFAULT_UIMAGE
> - select PPC_I8259
> - select HAVE_RAPIDIO
> - help
> -   This option enables support for the MPC85xx CDS board
> -
>  config MPC85xx_MDS
>   bool "Freescale MPC8568 MDS / MPC8569 MDS / P1021 MDS"
>   select DEFAULT_UIMAGE
> -- 
> 2.46.1



Re: [PATCH] selftests/powerpc: Rm the unnecessary remove function.

2024-09-27 Thread IBM
zhangjiao2  writes:

> From: zhang jiao 
>
> Path is not initialized before use,
> remove the unnecessary remove function.
>
> Signed-off-by: zhang jiao 
> ---
>  tools/testing/selftests/powerpc/mm/tlbie_test.c | 1 -
>  1 file changed, 1 deletion(-)
>
> diff --git a/tools/testing/selftests/powerpc/mm/tlbie_test.c 
> b/tools/testing/selftests/powerpc/mm/tlbie_test.c
> index 48344a74b212..fd1456d16a7d 100644
> --- a/tools/testing/selftests/powerpc/mm/tlbie_test.c
> +++ b/tools/testing/selftests/powerpc/mm/tlbie_test.c
> @@ -314,7 +314,6 @@ static inline void end_verification_log(unsigned int tid, 
> unsigned nr_anamolies)
>   fclose(f);
>  
>   if (nr_anamolies == 0) {
> - remove(path);
>   return;
>   }

Nice catch. Indeed the path is uninitialized here.

However, I believe the above "if" block should come after initializing
the path. The idea is that if there were no anomalies noted, we can
simply remove the log file and return.

Something like below. Thoughts?

diff --git a/tools/testing/selftests/powerpc/mm/tlbie_test.c 
b/tools/testing/selftests/powerpc/mm/tlbie_test.c
index 48344a74b212..35f0098399cc 100644
--- a/tools/testing/selftests/powerpc/mm/tlbie_test.c
+++ b/tools/testing/selftests/powerpc/mm/tlbie_test.c
@@ -313,16 +313,16 @@ static inline void end_verification_log(unsigned int tid, 
unsigned nr_anamolies)

fclose(f);

-   if (nr_anamolies == 0) {
-   remove(path);
-   return;
-   }
-
sprintf(logfile, logfilename, tid);
strcpy(path, logdir);
strcat(path, separator);
strcat(path, logfile);

+   if (nr_anamolies == 0) {
+   remove(path);
+   return;
+   }
+
printf("Thread %02d chunk has %d corrupted words. For details check 
%s\n",
tid, nr_anamolies, path);
 }


-ritesh



Re: [PATCH v3] powerpc/pseries/eeh: Fix pseries_eeh_err_inject

2024-09-23 Thread IBM
Guenter Roeck  writes:

> Hi,
>
> On Mon, Sep 09, 2024 at 09:02:20AM -0500, Narayana Murty N wrote:
>> VFIO_EEH_PE_INJECT_ERR ioctl is currently failing on pseries
>> due to missing implementation of err_inject eeh_ops for pseries.
>> This patch implements pseries_eeh_err_inject in eeh_ops/pseries
>> eeh_ops. Implements support for injecting MMIO load/store error
>> for testing from user space.
>> 
>> The check on PCI error type (bus type) code is moved to platform
>> code, since the eeh_pe_inject_err can be allowed to more error
>> types depending on platform requirement. Removal of the check for
>> 'type' in eeh_pe_inject_err() doesn't impact PowerNV as
>> pnv_eeh_err_inject() already has an equivalent check in place.
>> 
>> Signed-off-by: Narayana Murty N 
>> Reviewed-by: Vaibhav Jain 
>> 
>> ---
>>  arch/powerpc/include/asm/eeh.h   |  2 +-
>>  arch/powerpc/kernel/eeh.c|  9 +++--
>>  arch/powerpc/platforms/pseries/eeh_pseries.c | 39 +++-
>>  3 files changed, 44 insertions(+), 6 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/eeh.h b/arch/powerpc/include/asm/eeh.h
>> index 91a9fd53254f..317b12fc1fe4 100644
>> --- a/arch/powerpc/include/asm/eeh.h
>> +++ b/arch/powerpc/include/asm/eeh.h
>> @@ -308,7 +308,7 @@ int eeh_pe_reset(struct eeh_pe *pe, int option, bool 
>> include_passed);
>>  int eeh_pe_configure(struct eeh_pe *pe);
>>  int eeh_pe_inject_err(struct eeh_pe *pe, int type, int func,
>>unsigned long addr, unsigned long mask);
>> -
>> +int eeh_pe_inject_mmio_error(struct pci_dev *pdev);
>>  /**
>>   * EEH_POSSIBLE_ERROR() -- test for possible MMIO failure.
>>   *
>> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
>> index d03f17987fca..49ab11a287a3 100644
>> --- a/arch/powerpc/kernel/eeh.c
>> +++ b/arch/powerpc/kernel/eeh.c
>> @@ -1537,10 +1537,6 @@ int eeh_pe_inject_err(struct eeh_pe *pe, int type, 
>> int func,
>>  if (!eeh_ops || !eeh_ops->err_inject)
>>  return -ENOENT;
>>  
>> -/* Check on PCI error type */
>> -if (type != EEH_ERR_TYPE_32 && type != EEH_ERR_TYPE_64)
>> -return -EINVAL;
>> -
>>  /* Check on PCI error function */
>>  if (func < EEH_ERR_FUNC_MIN || func > EEH_ERR_FUNC_MAX)
>>  return -EINVAL;
>> @@ -1851,6 +1847,11 @@ static const struct file_operations 
>> eeh_dev_break_fops = {
>>  .read   = eeh_debugfs_dev_usage,
>>  };
>>  
>> +int eeh_pe_inject_mmio_error(struct pci_dev *pdev)
>> +{
>> +return eeh_debugfs_break_device(pdev);
>> +}
>> +
>
> The new function, as the context suggests, is only compiled if 
> CONFIG_DEBUG_FS=y.
> However, it is called unconditionally. With CONFIG_DEBUG_FS=n, this results in
>
> powerpc64-linux-ld: arch/powerpc/platforms/pseries/eeh_pseries.o: in function 
> `pseries_eeh_err_inject':
> /opt/buildbot/slave/qemu-ppc64/build/arch/powerpc/platforms/pseries/eeh_pseries.c:814:(.text+0x554):
>  undefined reference to `eeh_pe_inject_mmio_error'
> make[3]: *** 
> [/opt/buildbot/slave/qemu-ppc64/build/scripts/Makefile.vmlinux:34: vmlinux] 
> Error 1
> make[2]: *** [/opt/buildbot/slave/qemu-ppc64/build/Makefile:1157: vmlinux] 
> Error 2
>
> I'll enable CONFIG_DEBUG_FS in my tests and won't report this further,
> but you might want to consider fixing the problem at some point.
>

Yes, this is fixed and picked up in the powerpc tree.

https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git/commit/?h=merge&id=3af2e2f68cc6baf0a11f662d30b0bf981f77bfea

-ritesh



Re: [RFC PATCH] powerpc/tlb: enable arch want batched unmap tlb flush

2024-09-22 Thread IBM
Luming Yu  writes:

> On Sun, Sep 22, 2024 at 04:39:53PM +0530, Ritesh Harjani wrote:
>> Luming Yu  writes:
>> 
>> > From: Yu Luming 
>> >
>> > ppc always do its own tracking for batch tlb. By trivially enabling
>> > the ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH in ppc, ppc arch can re-use
>> > common code in rmap and reduce overhead and do optimization it could not
>> > have without a tlb flushing context at low architecture level.
>> 
>> I looked at this patch and other than the compile failure, this patch
>> still won't optimize anything. The idea of this config is that we want
>> to batch all the tlb flush operation at the end. By returning false from
>> should_defer_flush() (in this patch), we are saying we cannot defer
>> the flush and hence we do tlb flush in the same context of unmap.
> not exactly, as false return implies, we currently do nothing but relying on
> book3S_64's tlb batch implementation which contains a bit of defer 
> optimization
> that we need to use a real benchmark to do some performance characterization.
>
> And I need to get my test bed ready for patch testing first. So I have to
> defer the real optimization in this area.
>> 
>> Anyway, I took a quick look at ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>> and I have a quick PoC for the same. I will soon post it.
> thanks for picking up the baton for the future collaboration on the
> potential common performance benefits among us for powerpc arch.

Sure, thanks Luming.
I have posted this work here [1].

[1]: 
https://lore.kernel.org/linuxppc-dev/cover.1727001426.git.ritesh.l...@gmail.com/
-ritesh



[RFC / PoC v1 1/1] powerpc: Add support for batched unmap TLB flush

2024-09-22 Thread Ritesh Harjani (IBM)
=== NOT FOR MERGE YET ===

This adds the support for ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH.
More details are added to the cover letter.

---
 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/include/asm/book3s/64/tlbflush.h |  5 +++
 arch/powerpc/include/asm/tlbbatch.h   | 14 
 arch/powerpc/mm/book3s64/radix_tlb.c  | 32 +++
 4 files changed, 52 insertions(+)
 create mode 100644 arch/powerpc/include/asm/tlbbatch.h

diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
index 0b8b2e3a6381..c3a23c1894dd 100644
--- a/arch/powerpc/Kconfig
+++ b/arch/powerpc/Kconfig
@@ -171,6 +171,7 @@ config PPC
select ARCH_USE_CMPXCHG_LOCKREF if PPC64
select ARCH_USE_MEMTEST
select ARCH_USE_QUEUED_RWLOCKS  if PPC_QUEUED_SPINLOCKS
+   select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH if PPC64 && PPC_BOOK3S_64
select ARCH_WANT_DEFAULT_BPF_JIT
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
select ARCH_WANT_IPC_PARSE_VERSION
diff --git a/arch/powerpc/include/asm/book3s/64/tlbflush.h 
b/arch/powerpc/include/asm/book3s/64/tlbflush.h
index fd642b729775..f872537715e7 100644
--- a/arch/powerpc/include/asm/book3s/64/tlbflush.h
+++ b/arch/powerpc/include/asm/book3s/64/tlbflush.h
@@ -222,4 +222,9 @@ static inline bool cputlb_use_tlbie(void)
return tlbie_enabled;
 }

+bool arch_tlbbatch_should_defer(struct mm_struct *mm);
+void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+  struct mm_struct *mm, unsigned long uaddr);
+void arch_flush_tlb_batched_pending(struct mm_struct *mm);
+void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);
 #endif /*  _ASM_POWERPC_BOOK3S_64_TLBFLUSH_H */
diff --git a/arch/powerpc/include/asm/tlbbatch.h 
b/arch/powerpc/include/asm/tlbbatch.h
new file mode 100644
index ..fa738462a242
--- /dev/null
+++ b/arch/powerpc/include/asm/tlbbatch.h
@@ -0,0 +1,14 @@
+/* SPDX-License-Identifier: GPL-2.0-only */
+/*
+ * Copyright (C) 2024 IBM Corporation.
+ */
+#ifndef _ASM_POWERPC_TLBBATCH_H
+#define _ASM_POWERPC_TLBBATCH_H
+
+#include 
+
+struct arch_tlbflush_unmap_batch {
+   struct cpumask cpumask;
+};
+
+#endif /* _ASM_POWERPC_TLBBATCH_H */
diff --git a/arch/powerpc/mm/book3s64/radix_tlb.c 
b/arch/powerpc/mm/book3s64/radix_tlb.c
index 9e1f6558d026..2b1b2f7429fc 100644
--- a/arch/powerpc/mm/book3s64/radix_tlb.c
+++ b/arch/powerpc/mm/book3s64/radix_tlb.c
@@ -11,6 +11,7 @@
 #include 
 #include 
 #include 
+#include 

 #include 
 #include 
@@ -1585,3 +1586,34 @@ static int __init 
create_tlb_single_page_flush_ceiling(void)
 }
 late_initcall(create_tlb_single_page_flush_ceiling);

+#ifdef CONFIG_ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
+bool arch_tlbbatch_should_defer(struct mm_struct *mm)
+{
+   if (!radix_enabled())
+   return false;
+   return true;
+}
+
+void arch_tlbbatch_add_pending(struct arch_tlbflush_unmap_batch *batch,
+  struct mm_struct *mm,
+  unsigned long uaddr)
+{
+   cpumask_or(&batch->cpumask, &batch->cpumask, mm_cpumask(mm));
+}
+
+void arch_flush_tlb_batched_pending(struct mm_struct *mm)
+{
+   flush_tlb_mm(mm);
+}
+
+static inline void tlbiel_flush_all_lpid(void *arg)
+{
+   tlbiel_all_lpid(true);
+}
+
+void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch)
+{
+   on_each_cpu_mask(&batch->cpumask, tlbiel_flush_all_lpid, NULL, 1);
+   cpumask_clear(&batch->cpumask);
+}
+#endif
--
2.46.0




[RFC / PoC v1 0/1] powerpc: Add support for batched unmap TLB flush

2024-09-22 Thread Ritesh Harjani (IBM)
Hello All,

This is a quick PoC to add ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH support to
powerpc for book3s64. Section 6.10 of the ISA, "Translation Table Update
Synchronization Requirements", says that the architecture allows optimizing
the translation cache invalidation by doing it in bulk later, after the PTE
change has been done.
That means if we can add ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH support, it will be
possible to utilize optimizations in the reclaim and migrate pages paths, which
can defer the tlb invalidations to be done in bulk after all the page unmap
operations have been completed.
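
As a rough illustration of the contract these arch hooks implement (an
assumed, simplified sketch only -- not the actual mm/rmap.c code; the
sketch_*() helper names are made up for illustration):

    /*
     * During a reclaim/migration pass, each unmapped page only records
     * which CPUs will need an invalidation instead of flushing the TLB
     * right away.
     */
    static void sketch_unmap_one_page(struct arch_tlbflush_unmap_batch *batch,
                                      struct vm_area_struct *vma,
                                      unsigned long uaddr)
    {
            struct mm_struct *mm = vma->vm_mm;

            /* The PTE for uaddr has already been cleared by the caller. */
            if (arch_tlbbatch_should_defer(mm))
                    arch_tlbbatch_add_pending(batch, mm, uaddr); /* record only */
            else
                    flush_tlb_page(vma, uaddr);                  /* flush now */
    }

    /* Once the whole pass is done: a single bulk invalidation. */
    static void sketch_finish_unmap_pass(struct arch_tlbflush_unmap_batch *batch)
    {
            arch_tlbbatch_flush(batch);
    }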

This is a quick PoC for the same. Note that this may not be a complete patch yet;
TLB on Power is already complex from the hardware side :) and then there are many
optimizations done in the software (e.g. exit_lazy_flush_tlb to avoid tlbies).
But since the current patch looked somewhat sane to me, I wanted to share it to
get early feedback from people who are well versed with this side of the code.

Meanwhile I have many TODOs to look into, which I am working on in parallel for
this work. Later I will also get some benchmarks w.r.t. promotion / demotion.

I ran a micro-benchmark which was shared in other commits that add this
support on other archs. I can see some good initial improvements.

without patch (perf report showing 7% in radix__flush_tlb_page_psize, even with
single thread)
==
root# time ./a.out
real0m23.538s
user0m0.191s
sys 0m5.270s

# Overhead  Command  Shared Object   Symbol
#   ...  ..  
.
#
 7.19%  a.out[kernel.vmlinux][k] radix__flush_tlb_page_psize
 5.63%  a.out[kernel.vmlinux][k] _raw_spin_lock
 3.21%  a.outa.out   [.] main
 2.93%  a.out[kernel.vmlinux][k] page_counter_cancel
 2.58%  a.out[kernel.vmlinux][k] page_counter_try_charge
 2.56%  a.out[kernel.vmlinux][k] _raw_spin_lock_irq
 2.30%  a.out[kernel.vmlinux][k] try_to_unmap_one

with patch

root# time ./a.out
real0m8.593s
user0m0.064s
sys 0m1.610s

# Overhead  Command  Shared Object   Symbol
#   ...  ..  
.
#
 5.10%  a.out[kernel.vmlinux][k] _raw_spin_lock
 3.55%  a.out[kernel.vmlinux][k] __mod_memcg_lruvec_state
 3.13%  a.outa.out   [.] main
 3.00%  a.out[kernel.vmlinux][k] page_counter_try_charge
 2.62%  a.out[kernel.vmlinux][k] _raw_spin_lock_irq
 2.58%  a.out[kernel.vmlinux][k] page_counter_cancel
 2.22%  a.out[kernel.vmlinux][k] try_to_unmap_one




#include <sys/mman.h>
#include <string.h>

#define PAGESIZE 65536
#define SIZE (1 * 1024 * 1024 * 10)

/* Touch a 10MB anonymous mapping, then MADV_PAGEOUT it, to exercise the
 * unmap + TLB flush path in reclaim. */
int main()
{
        volatile unsigned char *p = mmap(NULL, SIZE, PROT_READ | PROT_WRITE,
                                         MAP_SHARED | MAP_ANONYMOUS, -1, 0);

        memset((void *)p, 0x88, SIZE);

        for (int k = 0; k < 1; k++) {
                /* swap in */
                for (int i = 0; i < SIZE; i += PAGESIZE) {
                        (void)p[i];
                }

                /* swap out */
                madvise((void *)p, SIZE, MADV_PAGEOUT);
        }
}



Ritesh Harjani (IBM) (1):
  powerpc: Add support for batched unmap TLB flush

 arch/powerpc/Kconfig  |  1 +
 arch/powerpc/include/asm/book3s/64/tlbflush.h |  5 +++
 arch/powerpc/include/asm/tlbbatch.h   | 14 
 arch/powerpc/mm/book3s64/radix_tlb.c  | 32 +++
 4 files changed, 52 insertions(+)
 create mode 100644 arch/powerpc/include/asm/tlbbatch.h

--
2.46.0




Re: [RFC PATCH] powerpc/tlb: enable arch want batched unmap tlb flush

2024-09-22 Thread IBM
Luming Yu  writes:

> From: Yu Luming 
>
> ppc always do its own tracking for batch tlb. By trivially enabling
> the ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH in ppc, ppc arch can re-use
> common code in rmap and reduce overhead and do optimization it could not
> have without a tlb flushing context at low architecture level.

I looked at this patch and other than the compile failure, this patch
still won't optimize anything. The idea of this config is that we want
to batch all the tlb flush operation at the end. By returning false from
should_defer_flush() (in this patch), we are saying we cannot defer
the flush and hence we do tlb flush in the same context of unmap.

Anyway, I took a quick look at ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
and I have a quick PoC for the same. I will soon post it.

-ritesh

>
> Signed-off-by: Luming Yu 
> ---
>  arch/powerpc/Kconfig|  1 +
>  arch/powerpc/include/asm/tlbbatch.h | 30 +
>  2 files changed, 31 insertions(+)
>  create mode 100644 arch/powerpc/include/asm/tlbbatch.h
>
> diff --git a/arch/powerpc/Kconfig b/arch/powerpc/Kconfig
> index e94e7e4bfd40..e6db84dd014a 100644
> --- a/arch/powerpc/Kconfig
> +++ b/arch/powerpc/Kconfig
> @@ -175,6 +175,7 @@ config PPC
>   select ARCH_WANT_IPC_PARSE_VERSION
>   select ARCH_WANT_IRQS_OFF_ACTIVATE_MM
>   select ARCH_WANT_LD_ORPHAN_WARN
> + select ARCH_WANT_BATCHED_UNMAP_TLB_FLUSH
>   select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP   if PPC_RADIX_MMU
>   select ARCH_WANTS_MODULES_DATA_IN_VMALLOC   if PPC_BOOK3S_32 || 
> PPC_8xx
>   select ARCH_WEAK_RELEASE_ACQUIRE
> diff --git a/arch/powerpc/include/asm/tlbbatch.h 
> b/arch/powerpc/include/asm/tlbbatch.h
> new file mode 100644
> index ..484628460057
> --- /dev/null
> +++ b/arch/powerpc/include/asm/tlbbatch.h
> @@ -0,0 +1,30 @@
> +/* SPDX-License-Identifier: GPL-2.0 */
> +#ifndef _ARCH_PPC_TLBBATCH_H
> +#define _ARCH_PPC_TLBBATCH_H
> +
> +struct arch_tlbflush_unmap_batch {
> + /*
> + *
> +  */
> +};
> +
> +static inline void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch 
> *batch)
> +{
> +}
> +
> +static inline void arch_tlbbatch_add_pending(struct 
> arch_tlbflush_unmap_batch *batch,
> + struct mm_struct *mm,
> + unsigned long uarddr)
> +{
> +}
> +
> +static inline bool arch_tlbbatch_should_defer(struct mm_struct *mm)
> +{
> + /*ppc always do tlb flush in batch*/
> + return false;
> +}
> +
> +static inline void arch_flush_tlb_batched_pending(struct mm_struct *mm)
> +{
> +}
> +#endif /* _ARCH_PPC_TLBBATCH_H */
> -- 
> 2.42.0.windows.2



Re: [PATCH v3 2/3] powerpc/pseries: Export hardware trace macro dump via debugfs

2024-09-20 Thread IBM
Madhavan Srinivasan  writes:

> This patch adds debugfs interface to export Hardware Trace Macro (HTM)
> function data in a LPAR. New hypervisor call "H_HTM" has been
> defined to setup, configure, control and dump the HTM data.
> This patch supports only dumping of HTM data in a LPAR.
> New debugfs folder called "htmdump" has been added under
> /sys/kernel/debug/arch path which contains files need to
> pass required parameters for the H_HTM dump function. New Kconfig
> option called "CONFIG_HTMDUMP" has been in platform/pseries for the same.
>
> With this module loaded, list of files in debugfs path
>
> /sys/kernel/debug/powerpc/htmdump
> coreindexonchip  htmtype  nodalchipindex  nodeindex  trace
>
> Signed-off-by: Madhavan Srinivasan 
> ---
> Changelog v2:
> - Made driver as modules based on review comments
> Changelog v1:
> - Changed from tristate to bool with dependency flags
> - Trimmed the include headers
>
>  arch/powerpc/platforms/pseries/Kconfig   |   9 ++
>  arch/powerpc/platforms/pseries/Makefile  |   1 +
>  arch/powerpc/platforms/pseries/htmdump.c | 130 +++
>  3 files changed, 140 insertions(+)
>  create mode 100644 arch/powerpc/platforms/pseries/htmdump.c
>
> diff --git a/arch/powerpc/platforms/pseries/Kconfig 
> b/arch/powerpc/platforms/pseries/Kconfig
> index afc0f6a61337..a66be66d690e 100644
> --- a/arch/powerpc/platforms/pseries/Kconfig
> +++ b/arch/powerpc/platforms/pseries/Kconfig
> @@ -128,6 +128,15 @@ config CMM
> will be reused for other LPARs. The interface allows firmware to
> balance memory across many LPARs.
>  
> +config HTMDUMP
> + tristate "PHYP HTM data dumper"
> + depends on PPC_PSERIES && DEBUG_FS
> + default m
> + help
> +   Select this option, if you want to enable the kernel debugfs
> +   interface to dump the Hardware Trace Macro (HTM) function data
> +   in the LPAR.
> +
>  config HV_PERF_CTRS
>   bool "Hypervisor supplied PMU events (24x7 & GPCI)"
>   default y
> diff --git a/arch/powerpc/platforms/pseries/Makefile 
> b/arch/powerpc/platforms/pseries/Makefile
> index 7bf506f6b8c8..3f3e3492e436 100644
> --- a/arch/powerpc/platforms/pseries/Makefile
> +++ b/arch/powerpc/platforms/pseries/Makefile
> @@ -19,6 +19,7 @@ obj-$(CONFIG_HVC_CONSOLE)   += hvconsole.o
>  obj-$(CONFIG_HVCS)   += hvcserver.o
>  obj-$(CONFIG_HCALL_STATS)+= hvCall_inst.o
>  obj-$(CONFIG_CMM)+= cmm.o
> +obj-$(CONFIG_HTMDUMP)+= htmdump.o
>  obj-$(CONFIG_IO_EVENT_IRQ)   += io_event_irq.o
>  obj-$(CONFIG_LPARCFG)+= lparcfg.o
>  obj-$(CONFIG_IBMVIO) += vio.o
> diff --git a/arch/powerpc/platforms/pseries/htmdump.c 
> b/arch/powerpc/platforms/pseries/htmdump.c
> new file mode 100644
> index ..54c28525c4a7
> --- /dev/null
> +++ b/arch/powerpc/platforms/pseries/htmdump.c
> @@ -0,0 +1,130 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) IBM Corporation, 2024
> + */
> +
> +#define pr_fmt(fmt) "htmdump: " fmt
> +
> +#include 
> +
> +#include 
> +#include 
> +
> +/* This enables us to keep track of the memory removed from each node. */
> +struct htmdump_entry {
> + void *buf;
> + struct dentry *dir;
> + char name[16];
> +};
>

How do dir and name get used?
It isn't that obvious, so maybe a comment would be great.

> +static u32 nodeindex;
> +static u32 nodalchipindex;
> +static u32 coreindexonchip;
> +static u32 htmtype;
> +static struct dentry *htmdump_debugfs_dir;
> +static struct htmdump_entry *ent;
> +
> +#define BUFFER_SIZE PAGE_SIZE
> +
> +static ssize_t htmdump_read(struct file *filp, char __user *ubuf,
> +  size_t count, loff_t *ppos)
> +{
> + struct htmdump_entry *ent = filp->private_data;
> + unsigned long page, read_size, available;
> + loff_t offset;
> + long rc;
> +
> + page = ALIGN_DOWN(*ppos, BUFFER_SIZE);
> + offset = (*ppos) % BUFFER_SIZE;
> +
> + rc = htm_get_dump_hardware(nodeindex, nodalchipindex, coreindexonchip,
> +htmtype, virt_to_phys(ent->buf), 
> BUFFER_SIZE, page);
> +
> + switch (rc) {
> + case H_SUCCESS:
> + case H_PARTIAL:
> + break;
> + case H_NOT_AVAILABLE:
> + return 0;

Minor nits for error returns here...

Is returning 0 correct here? Maybe it is (since 0 means no data read),
but I wanted to confirm whether we should return -ENODATA or -ENODEV
(not sure what H_NOT_AVAILABLE means here).

#define ENODATA 61  /* No data available */

Re: [PATCH] powerpc/kvm: Fix typo in the kvm functions

2024-09-20 Thread IBM
Kajol Jain  writes:

> Fix typo in the following kvm function names from:
>
> kmvhv_counters_tracepoint_regfunc -> kvmhv_counters_tracepoint_regfunc
> kmvhv_counters_tracepoint_unregfunc -> kvmhv_counters_tracepoint_unregfunc

Great spotting!
It took some time to realize that k[mv] vs k[vm] is the change :)

>
> Fixes: e1f288d2f9c6 ("KVM: PPC: Book3S HV nestedv2: Add support for reading 
> VPA counters for pseries guests")

Right. This commit added the registration and unregistration helpers
for the TRACE_EVENT_FN_COND tracepoint, which mainly collects the
observability stats for nested guests on pseries.


> Reported-by: Madhavan Srinivasan 
> Signed-off-by: Kajol Jain 
> ---
>  arch/powerpc/include/asm/kvm_book3s_64.h | 4 ++--
>  arch/powerpc/kvm/book3s_hv.c | 4 ++--
>  arch/powerpc/kvm/trace_hv.h  | 2 +-
>  3 files changed, 5 insertions(+), 5 deletions(-)

Not an expert in the kvm area, but the change looks very straightforward to
me. Searching for the "kmv" string in arch/powerpc/ after applying this
patch indeed resulted in zero hits.

Please feel free to add - 

Reviewed-by: Ritesh Harjani (IBM) 




Re: [PATCH] powerpc/pseries/eeh: move pseries_eeh_err_inject() outside CONFIG_DEBUG_FS block

2024-09-19 Thread IBM
Vaibhav Jain  writes:

> Hi Ritesh,
>
> Thanks for looking into this patch. My responses your review inline
> below:
>
> Ritesh Harjani (IBM)  writes:
>
>> Narayana Murty N  writes:
>>
>>> Makes pseries_eeh_err_inject() available even when debugfs
>>> is disabled (CONFIG_DEBUG_FS=n). It moves eeh_debugfs_break_device()
>>> and eeh_pe_inject_mmio_error() out of the CONFIG_DEBUG_FS block
>>> and renames it as eeh_break_device().
>>>
>>> Reported-by: kernel test robot 
>>> Closes: 
>>> https://lore.kernel.org/oe-kbuild-all/202409170509.vwc6jadc-...@intel.com/
>>> Fixes: b0e2b828dfca ("powerpc/pseries/eeh: Fix pseries_eeh_err_inject")
>>> Signed-off-by: Narayana Murty N 
>>> ---
>>>  arch/powerpc/kernel/eeh.c | 198 +++---
>>>  1 file changed, 99 insertions(+), 99 deletions(-)
>>
>> Ok, so in your original patch you implemented eeh_inject ops for pseries
>> using mmio based eeh error injection (eeh_pe_inject_mmio_error()), which
>> uses the functions defined under debugfs -> eeh_debugfs_break_device(). 
>>
>> This was failing when CONFIG_DEBUGFS is not defined, thus referring to
>> undefined function definition. 
>>
>> Minor nit below.
>>
>>>
>>> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
>>> index 49ab11a287a3..0fe25e907ea6 100644
>>> --- a/arch/powerpc/kernel/eeh.c
>>> +++ b/arch/powerpc/kernel/eeh.c
>>> @@ -1574,6 +1574,104 @@ static int proc_eeh_show(struct seq_file *m, void 
>>> *v)
>>>  }
>>>  #endif /* CONFIG_PROC_FS */
>>>  
>>> +static int eeh_break_device(struct pci_dev *pdev)
>>> +{
>>> +   struct resource *bar = NULL;
>>> +   void __iomem *mapped;
>>> +   u16 old, bit;
>>> +   int i, pos;
>>> +
>>> +   /* Do we have an MMIO BAR to disable? */
>>> +   for (i = 0; i <= PCI_STD_RESOURCE_END; i++) {
>>> +   struct resource *r = &pdev->resource[i];
>>> +
>>> +   if (!r->flags || !r->start)
>>> +   continue;
>>> +   if (r->flags & IORESOURCE_IO)
>>> +   continue;
>>> +   if (r->flags & IORESOURCE_UNSET)
>>> +   continue;
>>> +
>>> +   bar = r;
>>> +   break;
>>> +   }
>>> +
>>> +   if (!bar) {
>>> +   pci_err(pdev, "Unable to find Memory BAR to cause EEH with\n");
>>> +   return -ENXIO;
>>> +   }
>>> +
>>> +   pci_err(pdev, "Going to break: %pR\n", bar);
>>> +
>>> +   if (pdev->is_virtfn) {
>>> +#ifndef CONFIG_PCI_IOV
>>> +   return -ENXIO;
>>> +#else
>>> +   /*
>>> +* VFs don't have a per-function COMMAND register, so the best
>>> +* we can do is clear the Memory Space Enable bit in the PF's
>>> +* SRIOV control reg.
>>> +*
>>> +* Unfortunately, this requires that we have a PF (i.e doesn't
>>> +* work for a passed-through VF) and it has the potential side
>>> +* effect of also causing an EEH on every other VF under the
>>> +* PF. Oh well.
>>> +*/
>>> +   pdev = pdev->physfn;
>>> +   if (!pdev)
>>> +   return -ENXIO; /* passed through VFs have no PF */
>>> +
>>> +   pos  = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
>>> +   pos += PCI_SRIOV_CTRL;
>>> +   bit  = PCI_SRIOV_CTRL_MSE;
>>> +#endif /* !CONFIG_PCI_IOV */
>>> +   } else {
>>> +   bit = PCI_COMMAND_MEMORY;
>>> +   pos = PCI_COMMAND;
>>> +   }
>>> +
>>> +   /*
>>> +* Process here is:
>>> +*
>>> +* 1. Disable Memory space.
>>> +*
>>> +* 2. Perform an MMIO to the device. This should result in an error
>>> +*(CA  / UR) being raised by the device which results in an EEH
>>> +*PE freeze. Using the in_8() accessor skips the eeh detection hook
>>> +*so the freeze hook so the EEH Detection machinery won't be
>>> +*triggered here. This is to match the usual behaviour of EEH
>>> +*where the HW will asynchronously freeze a PE and it's up to
>>>

Re: [RFC v2 03/13] book3s64/hash: Remove kfence support temporarily

2024-09-18 Thread IBM
Christophe Leroy  writes:

> Le 19/09/2024 à 04:56, Ritesh Harjani (IBM) a écrit :
>> Kfence on book3s Hash on pseries is anyways broken. It fails to boot
>> due to RMA size limitation. That is because, kfence with Hash uses
>> debug_pagealloc infrastructure. debug_pagealloc allocates linear map
>> for entire dram size instead of just kfence relevant objects.
>> This means for 16TB of DRAM it will require (16TB >> PAGE_SHIFT)
>> which is 256MB which is half of RMA region on P8.
>> crash kernel reserves 256MB and we also need 2048 * 16KB * 3 for
>> emergency stack and some more for paca allocations.
>> That means there is not enough memory for reserving the full linear map
>> in the RMA region, if the DRAM size is too big (>=16TB)
>> (The issue is seen above 8TB with crash kernel 256 MB reservation).
>> 
>> Now Kfence does not require linear memory map for entire DRAM.
>> It only needs for kfence objects. So this patch temporarily removes the
>> kfence functionality since debug_pagealloc code needs some refactoring.
>> We will bring in kfence on Hash support in later patches.
>> 
>> Signed-off-by: Ritesh Harjani (IBM) 
>> ---
>>   arch/powerpc/include/asm/kfence.h |  5 +
>>   arch/powerpc/mm/book3s64/hash_utils.c | 16 +++-
>>   2 files changed, 16 insertions(+), 5 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/kfence.h 
>> b/arch/powerpc/include/asm/kfence.h
>> index fab124ada1c7..f3a9476a71b3 100644
>> --- a/arch/powerpc/include/asm/kfence.h
>> +++ b/arch/powerpc/include/asm/kfence.h
>> @@ -10,6 +10,7 @@
>>   
>>   #include 
>>   #include 
>> +#include 
>>   
>>   #ifdef CONFIG_PPC64_ELF_ABI_V1
>>   #define ARCH_FUNC_PREFIX "."
>> @@ -25,6 +26,10 @@ static inline void disable_kfence(void)
>>   
>>   static inline bool arch_kfence_init_pool(void)
>>   {
>> +#ifdef CONFIG_PPC64
>> +if (!radix_enabled())
>
> No need for a #ifdef here, you can just do:
>
>   if (IS_ENABLED(CONFIG_PPC64) && !radix_enabled())
>   return false;
>
>

This special radix handling is anyway dropped in later patches,
so I didn't bother changing it here.

>> +return false;
>> +#endif
>>  return !kfence_disabled;
>
> But why not just set kfence_disabled to true by calling disable_kfence() 
> from one of the powerpc init functions ?
>

This patch only temporarily disables kfence support for Hash.
This special Hash handling gets removed in patch 10, which brings back
kfence support.

-ritesh



Re: [RFC v2 02/13] powerpc: mm: Fix kfence page fault reporting

2024-09-18 Thread IBM
Christophe Leroy  writes:

> Le 19/09/2024 à 04:56, Ritesh Harjani (IBM) a écrit :
>> copy_from_kernel_nofault() can be called when doing read of /proc/kcore.
>> /proc/kcore can have some unmapped kfence objects which when read via
>> copy_from_kernel_nofault() can cause page faults. Since *_nofault()
>> functions define their own fixup table for handling fault, use that
>> instead of asking kfence to handle such faults.
>> 
>> Hence we search the exception tables for the nip which generated the
>> fault. If there is an entry then we let the fixup table handler handle the
>> page fault by returning an error from within ___do_page_fault().
>
> Searching the exception table is a heavy operation and all has been done 
> in the past to minimise the number of times it is called, see for 
> instance commit cbd7e6ca0210 ("powerpc/fault: Avoid heavy 
> search_exception_tables() verification")

This should not cause latency in user page fault paths. We call
search_exception_tables() only when there is a page fault for a kernel
address (which isn't that common, right?), which kfence would otherwise handle.

>
> Also, by trying to hide false positives you also hide real ones. For 

I believe these should be false negatives. If a kernel function provides an
exception table entry to handle such a fault, then shouldn't it be handled
via the provided fixup table rather than via kfence?

> instance if csum_partial_copy_generic() is using a kfence protected 
> area, it will now go undetected.

I can go and look into usages of csum_partial_copy_generic(). But can
you please expand a bit more on what you meant here?

... so if a fault occurs for the above case, this patch will just let the
fixup table handle that fault rather than kfence reporting it and
returning 0.


The issue we see here is when unmapped kfence addresses get accessed via
the *_nofault() variants, which causes kfence to report a false negative
(this happens when we read /proc/kcore, or when tools like perf read it).

This is because, as per my understanding, copy_from_kernel_nofault()
should return -EFAULT from its fixup table if a fault occurs...
whereas with kfence it will report the warning and return 0 after
kfence has handled the fault.

I also see other archs consulting the fixup/exception tables in their
fault handling routines before allowing kfence to handle the fault.
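
For reference, a minimal sketch of the ordering being argued for
(hypothetical, simplified flow -- not the actual ___do_page_fault() code):

        /*
         * For a kernel-mode fault, consult the exception/fixup tables
         * first; a fault covered by a fixup entry (e.g. from
         * copy_from_kernel_nofault()) is resolved there and ends up as
         * -EFAULT for the caller. Only faults with no such entry are
         * handed to kfence for reporting.
         */
        if (!user_mode(regs) && is_kernel_addr(address)) {
                if (search_exception_tables(regs->nip))
                        return SIGSEGV; /* let the fixup handler run */
                if (kfence_handle_page_fault(address, is_write, regs))
                        return 0;       /* kfence handled it */
        }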

>
> IIUC, here your problem is limited to copy_from_kernel_nofault(). You 
> should handle the root cause, not its effects. For that, you could 
> perform additional verifications in copy_from_kernel_nofault_allowed().

Sorry, why make copy_from_kernel_nofault() a special case for powerpc?
I don't see any other arch making copy_from_kernel_nofault() a
special case. Shouldn't kernel faults be handled via the fixup table, if
one is supplied, before kfence handles them?
(maybe I am missing something)


-ritesh



[RFC v2 12/13] book3s64/hash: Disable kfence if not early init

2024-09-18 Thread Ritesh Harjani (IBM)
Enable kfence on book3s64 hash only when early init is enabled.
This is because kfence could cause the kernel linear map to be mapped
at PAGE_SIZE level instead of 16M (which I guess we don't want).

Also, currently there is no way to -
1. Make multiple page size entries for the SLB used for the kernel linear
   map.
2. Easily get the hash slot details after the page table mapping for the
   kernel linear map is set up. So even if kfence allocates the pool in
   late init, we won't be able to get the hash slot details for the
   kfence linear map.

Thus this patch disables kfence on hash if kfence early init is not
enabled.

Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 53e6f3a524eb..b6da25719e37 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -410,6 +410,8 @@ static phys_addr_t kfence_pool;
 
 static inline void hash_kfence_alloc_pool(void)
 {
+   if (!kfence_early_init_enabled())
+   goto err;
 
// allocate linear map for kfence within RMA region
linear_map_kf_hash_count = KFENCE_POOL_SIZE >> PAGE_SHIFT;
@@ -1074,7 +1076,7 @@ static void __init htab_init_page_sizes(void)
bool aligned = true;
init_hpte_page_sizes();
 
-   if (!debug_pagealloc_enabled_or_kfence()) {
+   if (!debug_pagealloc_enabled() && !kfence_early_init_enabled()) {
/*
 * Pick a size for the linear mapping. Currently, we only
 * support 16M, 1M and 4K which is the default
-- 
2.46.0




[RFC v2 13/13] book3s64/hash: Early detect debug_pagealloc size requirement

2024-09-18 Thread Ritesh Harjani (IBM)
Add a hash_supports_debug_pagealloc() helper to detect whether
debug_pagealloc can be supported on hash or not. This checks both whether
the debug_pagealloc config is enabled and whether the linear map fits
within the rma_size/4 region.

This can then be used early during htab_init_page_sizes() to decide the
linear map pagesize if hash supports either debug_pagealloc or
kfence.

Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 25 +
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index b6da25719e37..3ffc98b3deb1 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -329,25 +329,26 @@ static void kernel_unmap_linear_page(unsigned long vaddr, 
unsigned long idx,
 }
 #endif

+static inline bool hash_supports_debug_pagealloc(void)
+{
+   unsigned long max_hash_count = ppc64_rma_size / 4;
+   unsigned long linear_map_count = memblock_end_of_DRAM() >> PAGE_SHIFT;
+
+   if (!debug_pagealloc_enabled() || linear_map_count > max_hash_count)
+   return false;
+   return true;
+}
+
 #ifdef CONFIG_DEBUG_PAGEALLOC
 static u8 *linear_map_hash_slots;
 static unsigned long linear_map_hash_count;
 static DEFINE_RAW_SPINLOCK(linear_map_hash_lock);
 static void hash_debug_pagealloc_alloc_slots(void)
 {
-   unsigned long max_hash_count = ppc64_rma_size / 4;
-
-   if (!debug_pagealloc_enabled())
-   return;
-   linear_map_hash_count = memblock_end_of_DRAM() >> PAGE_SHIFT;
-   if (unlikely(linear_map_hash_count > max_hash_count)) {
-   pr_info("linear map size (%llu) greater than 4 times RMA region 
(%llu). Disabling debug_pagealloc\n",
-   ((u64)linear_map_hash_count << PAGE_SHIFT),
-   ppc64_rma_size);
-   linear_map_hash_count = 0;
+   if (!hash_supports_debug_pagealloc())
return;
-   }

+   linear_map_hash_count = memblock_end_of_DRAM() >> PAGE_SHIFT;
linear_map_hash_slots = memblock_alloc_try_nid(
linear_map_hash_count, 1, MEMBLOCK_LOW_LIMIT,
ppc64_rma_size, NUMA_NO_NODE);
@@ -1076,7 +1077,7 @@ static void __init htab_init_page_sizes(void)
bool aligned = true;
init_hpte_page_sizes();

-   if (!debug_pagealloc_enabled() && !kfence_early_init_enabled()) {
+   if (!hash_supports_debug_pagealloc() && !kfence_early_init_enabled()) {
/*
 * Pick a size for the linear mapping. Currently, we only
 * support 16M, 1M and 4K which is the default
--
2.46.0




[RFC v2 11/13] book3s64/radix: Refactoring common kfence related functions

2024-09-18 Thread Ritesh Harjani (IBM)
Both radix and hash on book3s need to detect whether kfence
early init is enabled or not. Hash needs to disable kfence
if early init is not enabled, because with kfence the linear map is
mapped using PAGE_SIZE rather than a 16M mapping.
We don't support multiple page sizes for the SLB entry used for the
kernel linear map on book3s64.

This patch refactors out the common functions required to detect
whether kfence early init is enabled or not.

Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/include/asm/kfence.h|  8 ++--
 arch/powerpc/mm/book3s64/pgtable.c   | 13 +
 arch/powerpc/mm/book3s64/radix_pgtable.c | 12 
 arch/powerpc/mm/init-common.c|  1 +
 4 files changed, 20 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/include/asm/kfence.h 
b/arch/powerpc/include/asm/kfence.h
index fab124ada1c7..1f7cab58ab2c 100644
--- a/arch/powerpc/include/asm/kfence.h
+++ b/arch/powerpc/include/asm/kfence.h
@@ -15,7 +15,7 @@
 #define ARCH_FUNC_PREFIX "."
 #endif
 
-#ifdef CONFIG_KFENCE
+extern bool kfence_early_init;
 extern bool kfence_disabled;
 
 static inline void disable_kfence(void)
@@ -27,7 +27,11 @@ static inline bool arch_kfence_init_pool(void)
 {
return !kfence_disabled;
 }
-#endif
+
+static inline bool kfence_early_init_enabled(void)
+{
+   return IS_ENABLED(CONFIG_KFENCE) && kfence_early_init;
+}
 
 #ifdef CONFIG_PPC64
 static inline bool kfence_protect_page(unsigned long addr, bool protect)
diff --git a/arch/powerpc/mm/book3s64/pgtable.c 
b/arch/powerpc/mm/book3s64/pgtable.c
index f4d8d3c40e5c..1563a8c28feb 100644
--- a/arch/powerpc/mm/book3s64/pgtable.c
+++ b/arch/powerpc/mm/book3s64/pgtable.c
@@ -37,6 +37,19 @@ EXPORT_SYMBOL(__pmd_frag_nr);
 unsigned long __pmd_frag_size_shift;
 EXPORT_SYMBOL(__pmd_frag_size_shift);
 
+#ifdef CONFIG_KFENCE
+extern bool kfence_early_init;
+static int __init parse_kfence_early_init(char *arg)
+{
+   int val;
+
+   if (get_option(&arg, &val))
+   kfence_early_init = !!val;
+   return 0;
+}
+early_param("kfence.sample_interval", parse_kfence_early_init);
+#endif
+
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 /*
  * This is called when relaxing access to a hugepage. It's also called in the 
page
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index b0d927009af8..311e2112d782 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -363,18 +363,6 @@ static int __meminit create_physical_mapping(unsigned long 
start,
 }
 
 #ifdef CONFIG_KFENCE
-static bool __ro_after_init kfence_early_init = 
!!CONFIG_KFENCE_SAMPLE_INTERVAL;
-
-static int __init parse_kfence_early_init(char *arg)
-{
-   int val;
-
-   if (get_option(&arg, &val))
-   kfence_early_init = !!val;
-   return 0;
-}
-early_param("kfence.sample_interval", parse_kfence_early_init);
-
 static inline phys_addr_t alloc_kfence_pool(void)
 {
phys_addr_t kfence_pool;
diff --git a/arch/powerpc/mm/init-common.c b/arch/powerpc/mm/init-common.c
index 9b4a675eb8f8..85875820b113 100644
--- a/arch/powerpc/mm/init-common.c
+++ b/arch/powerpc/mm/init-common.c
@@ -33,6 +33,7 @@ bool disable_kuep = !IS_ENABLED(CONFIG_PPC_KUEP);
 bool disable_kuap = !IS_ENABLED(CONFIG_PPC_KUAP);
 #ifdef CONFIG_KFENCE
 bool __ro_after_init kfence_disabled;
+bool __ro_after_init kfence_early_init = !!CONFIG_KFENCE_SAMPLE_INTERVAL;
 #endif
 
 static int __init parse_nosmep(char *p)
-- 
2.46.0




[RFC v2 09/13] book3s64/hash: Disable debug_pagealloc if it requires more memory

2024-09-18 Thread Ritesh Harjani (IBM)
Make the size of the linear map to be allocated in the RMA region be
ppc64_rma_size / 4. If debug_pagealloc requires more memory than that,
then do not allocate any memory and disable debug_pagealloc.

Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index cc2eaa97982c..cffbb6499ac4 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -331,9 +331,19 @@ static unsigned long linear_map_hash_count;
 static DEFINE_RAW_SPINLOCK(linear_map_hash_lock);
 static inline void hash_debug_pagealloc_alloc_slots(void)
 {
+   unsigned long max_hash_count = ppc64_rma_size / 4;
+
if (!debug_pagealloc_enabled())
return;
linear_map_hash_count = memblock_end_of_DRAM() >> PAGE_SHIFT;
+   if (unlikely(linear_map_hash_count > max_hash_count)) {
+   pr_info("linear map size (%llu) greater than 4 times RMA region 
(%llu). Disabling debug_pagealloc\n",
+   ((u64)linear_map_hash_count << PAGE_SHIFT),
+   ppc64_rma_size);
+   linear_map_hash_count = 0;
+   return;
+   }
+
linear_map_hash_slots = memblock_alloc_try_nid(
linear_map_hash_count, 1, MEMBLOCK_LOW_LIMIT,
ppc64_rma_size, NUMA_NO_NODE);
@@ -344,7 +354,7 @@ static inline void hash_debug_pagealloc_alloc_slots(void)
 
 static inline void hash_debug_pagealloc_add_slot(phys_addr_t paddr, int slot)
 {
-   if (!debug_pagealloc_enabled())
+   if (!debug_pagealloc_enabled() || !linear_map_hash_count)
return;
if ((paddr >> PAGE_SHIFT) < linear_map_hash_count)
linear_map_hash_slots[paddr >> PAGE_SHIFT] = slot | 0x80;
@@ -356,6 +366,9 @@ static int hash_debug_pagealloc_map_pages(struct page 
*page, int numpages,
unsigned long flags, vaddr, lmi;
int i;
 
+   if (!debug_pagealloc_enabled() || !linear_map_hash_count)
+   return 0;
+
local_irq_save(flags);
for (i = 0; i < numpages; i++, page++) {
vaddr = (unsigned long)page_address(page);
-- 
2.46.0




[RFC v2 10/13] book3s64/hash: Add kfence functionality

2024-09-18 Thread Ritesh Harjani (IBM)
Now that the linear map functionality of debug_pagealloc is made generic,
enable kfence to use this generic infrastructure.

1. Define kfence related linear map variables.
   - u8 *linear_map_kf_hash_slots;
   - unsigned long linear_map_kf_hash_count;
   - DEFINE_RAW_SPINLOCK(linear_map_kf_hash_lock);
2. The linear map size allocated in the RMA region is quite small:
   (KFENCE_POOL_SIZE >> PAGE_SHIFT), which is 512 bytes by default.
3. kfence pool memory is reserved using memblock_phys_alloc(), which
   can come from anywhere.
   (default 255 objects => ((1+255) * 2) << PAGE_SHIFT = 32MB)
4. The hash slot information for kfence memory gets added to the linear map
   in hash_linear_map_add_slot() (which also adds slots for debug_pagealloc).

Reported-by: Pavithra Prakash 
Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/include/asm/kfence.h |   5 -
 arch/powerpc/mm/book3s64/hash_utils.c | 162 +++---
 2 files changed, 149 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/kfence.h 
b/arch/powerpc/include/asm/kfence.h
index f3a9476a71b3..fab124ada1c7 100644
--- a/arch/powerpc/include/asm/kfence.h
+++ b/arch/powerpc/include/asm/kfence.h
@@ -10,7 +10,6 @@
 
 #include 
 #include 
-#include 
 
 #ifdef CONFIG_PPC64_ELF_ABI_V1
 #define ARCH_FUNC_PREFIX "."
@@ -26,10 +25,6 @@ static inline void disable_kfence(void)
 
 static inline bool arch_kfence_init_pool(void)
 {
-#ifdef CONFIG_PPC64
-   if (!radix_enabled())
-   return false;
-#endif
return !kfence_disabled;
 }
 #endif
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index cffbb6499ac4..53e6f3a524eb 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -66,6 +67,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -271,7 +273,7 @@ void hash__tlbiel_all(unsigned int action)
WARN(1, "%s called on pre-POWER7 CPU\n", __func__);
 }
 
-#ifdef CONFIG_DEBUG_PAGEALLOC
+#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KFENCE)
 static void kernel_map_linear_page(unsigned long vaddr, unsigned long idx,
   u8 *slots, raw_spinlock_t *lock)
 {
@@ -325,11 +327,13 @@ static void kernel_unmap_linear_page(unsigned long vaddr, 
unsigned long idx,
 mmu_linear_psize,
 mmu_kernel_ssize, 0);
 }
+#endif
 
+#ifdef CONFIG_DEBUG_PAGEALLOC
 static u8 *linear_map_hash_slots;
 static unsigned long linear_map_hash_count;
 static DEFINE_RAW_SPINLOCK(linear_map_hash_lock);
-static inline void hash_debug_pagealloc_alloc_slots(void)
+static void hash_debug_pagealloc_alloc_slots(void)
 {
unsigned long max_hash_count = ppc64_rma_size / 4;
 
@@ -352,7 +356,8 @@ static inline void hash_debug_pagealloc_alloc_slots(void)
  __func__, linear_map_hash_count, &ppc64_rma_size);
 }
 
-static inline void hash_debug_pagealloc_add_slot(phys_addr_t paddr, int slot)
+static inline void hash_debug_pagealloc_add_slot(phys_addr_t paddr,
+   int slot)
 {
if (!debug_pagealloc_enabled() || !linear_map_hash_count)
return;
@@ -386,20 +391,148 @@ static int hash_debug_pagealloc_map_pages(struct page 
*page, int numpages,
return 0;
 }
 
-int hash__kernel_map_pages(struct page *page, int numpages, int enable)
+#else /* CONFIG_DEBUG_PAGEALLOC */
+static inline void hash_debug_pagealloc_alloc_slots(void) {}
+static inline void hash_debug_pagealloc_add_slot(phys_addr_t paddr, int slot) 
{}
+static int __maybe_unused
+hash_debug_pagealloc_map_pages(struct page *page, int numpages, int enable)
 {
-   return hash_debug_pagealloc_map_pages(page, numpages, enable);
+   return 0;
 }
+#endif /* CONFIG_DEBUG_PAGEALLOC */
 
-#else /* CONFIG_DEBUG_PAGEALLOC */
-int hash__kernel_map_pages(struct page *page, int numpages,
-int enable)
+#ifdef CONFIG_KFENCE
+static u8 *linear_map_kf_hash_slots;
+static unsigned long linear_map_kf_hash_count;
+static DEFINE_RAW_SPINLOCK(linear_map_kf_hash_lock);
+
+static phys_addr_t kfence_pool;
+
+static inline void hash_kfence_alloc_pool(void)
+{
+
+   // allocate linear map for kfence within RMA region
+   linear_map_kf_hash_count = KFENCE_POOL_SIZE >> PAGE_SHIFT;
+   linear_map_kf_hash_slots = memblock_alloc_try_nid(
+   linear_map_kf_hash_count, 1,
+   MEMBLOCK_LOW_LIMIT, ppc64_rma_size,
+   NUMA_NO_NODE);
+   if (!linear_map_kf_hash_slots) {
+   pr_err("%s: memblock for linear map (%lu) failed\n", __func__,
+   linear_map_kf_hash_count);
+   goto

[RFC v2 06/13] book3s64/hash: Add hash_debug_pagealloc_alloc_slots() function

2024-09-18 Thread Ritesh Harjani (IBM)
This adds a hash_debug_pagealloc_alloc_slots() function instead of open
coding that in htab_initialize(). This is required since we will be
separating out the kfence functionality so that it does not depend upon
debug_pagealloc.

Now that everything required for debug_pagealloc is under an #ifdef
config, bring the linear_map_hash_slots and linear_map_hash_count
variables under the same config too.

Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 29 ---
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 6e3860224351..030c120d1399 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -123,8 +123,6 @@ EXPORT_SYMBOL_GPL(mmu_slb_size);
 #ifdef CONFIG_PPC_64K_PAGES
 int mmu_ci_restrictions;
 #endif
-static u8 *linear_map_hash_slots;
-static unsigned long linear_map_hash_count;
 struct mmu_hash_ops mmu_hash_ops __ro_after_init;
 EXPORT_SYMBOL(mmu_hash_ops);
 
@@ -274,6 +272,8 @@ void hash__tlbiel_all(unsigned int action)
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
+static u8 *linear_map_hash_slots;
+static unsigned long linear_map_hash_count;
 static DEFINE_RAW_SPINLOCK(linear_map_hash_lock);
 
 static void kernel_map_linear_page(unsigned long vaddr, unsigned long lmi)
@@ -328,6 +328,19 @@ static void kernel_unmap_linear_page(unsigned long vaddr, 
unsigned long lmi)
 mmu_kernel_ssize, 0);
 }
 
+static inline void hash_debug_pagealloc_alloc_slots(void)
+{
+   if (!debug_pagealloc_enabled())
+   return;
+   linear_map_hash_count = memblock_end_of_DRAM() >> PAGE_SHIFT;
+   linear_map_hash_slots = memblock_alloc_try_nid(
+   linear_map_hash_count, 1, MEMBLOCK_LOW_LIMIT,
+   ppc64_rma_size, NUMA_NO_NODE);
+   if (!linear_map_hash_slots)
+   panic("%s: Failed to allocate %lu bytes max_addr=%pa\n",
+ __func__, linear_map_hash_count, &ppc64_rma_size);
+}
+
 static inline void hash_debug_pagealloc_add_slot(phys_addr_t paddr, int slot)
 {
if (!debug_pagealloc_enabled())
@@ -361,6 +374,7 @@ int hash__kernel_map_pages(struct page *page, int numpages,
 {
return 0;
 }
+static inline void hash_debug_pagealloc_alloc_slots(void) {}
 static inline void hash_debug_pagealloc_add_slot(phys_addr_t paddr, int slot) 
{}
 #endif /* CONFIG_DEBUG_PAGEALLOC */
 
@@ -1223,16 +1237,7 @@ static void __init htab_initialize(void)
 
prot = pgprot_val(PAGE_KERNEL);
 
-   if (debug_pagealloc_enabled()) {
-   linear_map_hash_count = memblock_end_of_DRAM() >> PAGE_SHIFT;
-   linear_map_hash_slots = memblock_alloc_try_nid(
-   linear_map_hash_count, 1, MEMBLOCK_LOW_LIMIT,
-   ppc64_rma_size, NUMA_NO_NODE);
-   if (!linear_map_hash_slots)
-   panic("%s: Failed to allocate %lu bytes max_addr=%pa\n",
- __func__, linear_map_hash_count, &ppc64_rma_size);
-   }
-
+   hash_debug_pagealloc_alloc_slots();
/* create bolted the linear mapping in the hash table */
for_each_mem_range(i, &base, &end) {
size = end - base;
-- 
2.46.0




[RFC v2 08/13] book3s64/hash: Make kernel_map_linear_page() generic

2024-09-18 Thread Ritesh Harjani (IBM)
Currently the kernel_map_linear_page() function assumes it is working on
the linear_map_hash_slots array. But since later patches need a
separate linear map array for kfence, make kernel_map_linear_page()
take the linear map array and its lock as function arguments.

This is needed to separate out kfence from the debug_pagealloc
infrastructure.

Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 47 ++-
 1 file changed, 25 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index da9b089c8e8b..cc2eaa97982c 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -272,11 +272,8 @@ void hash__tlbiel_all(unsigned int action)
 }
 
 #ifdef CONFIG_DEBUG_PAGEALLOC
-static u8 *linear_map_hash_slots;
-static unsigned long linear_map_hash_count;
-static DEFINE_RAW_SPINLOCK(linear_map_hash_lock);
-
-static void kernel_map_linear_page(unsigned long vaddr, unsigned long lmi)
+static void kernel_map_linear_page(unsigned long vaddr, unsigned long idx,
+  u8 *slots, raw_spinlock_t *lock)
 {
unsigned long hash;
unsigned long vsid = get_kernel_vsid(vaddr, mmu_kernel_ssize);
@@ -290,7 +287,7 @@ static void kernel_map_linear_page(unsigned long vaddr, 
unsigned long lmi)
if (!vsid)
return;
 
-   if (linear_map_hash_slots[lmi] & 0x80)
+   if (slots[idx] & 0x80)
return;
 
ret = hpte_insert_repeating(hash, vpn, __pa(vaddr), mode,
@@ -298,36 +295,40 @@ static void kernel_map_linear_page(unsigned long vaddr, 
unsigned long lmi)
mmu_linear_psize, mmu_kernel_ssize);
 
BUG_ON (ret < 0);
-   raw_spin_lock(&linear_map_hash_lock);
-   BUG_ON(linear_map_hash_slots[lmi] & 0x80);
-   linear_map_hash_slots[lmi] = ret | 0x80;
-   raw_spin_unlock(&linear_map_hash_lock);
+   raw_spin_lock(lock);
+   BUG_ON(slots[idx] & 0x80);
+   slots[idx] = ret | 0x80;
+   raw_spin_unlock(lock);
 }
 
-static void kernel_unmap_linear_page(unsigned long vaddr, unsigned long lmi)
+static void kernel_unmap_linear_page(unsigned long vaddr, unsigned long idx,
+u8 *slots, raw_spinlock_t *lock)
 {
-   unsigned long hash, hidx, slot;
+   unsigned long hash, hslot, slot;
unsigned long vsid = get_kernel_vsid(vaddr, mmu_kernel_ssize);
unsigned long vpn = hpt_vpn(vaddr, vsid, mmu_kernel_ssize);
 
hash = hpt_hash(vpn, PAGE_SHIFT, mmu_kernel_ssize);
-   raw_spin_lock(&linear_map_hash_lock);
-   if (!(linear_map_hash_slots[lmi] & 0x80)) {
-   raw_spin_unlock(&linear_map_hash_lock);
+   raw_spin_lock(lock);
+   if (!(slots[idx] & 0x80)) {
+   raw_spin_unlock(lock);
return;
}
-   hidx = linear_map_hash_slots[lmi] & 0x7f;
-   linear_map_hash_slots[lmi] = 0;
-   raw_spin_unlock(&linear_map_hash_lock);
-   if (hidx & _PTEIDX_SECONDARY)
+   hslot = slots[idx] & 0x7f;
+   slots[idx] = 0;
+   raw_spin_unlock(lock);
+   if (hslot & _PTEIDX_SECONDARY)
hash = ~hash;
slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-   slot += hidx & _PTEIDX_GROUP_IX;
+   slot += hslot & _PTEIDX_GROUP_IX;
mmu_hash_ops.hpte_invalidate(slot, vpn, mmu_linear_psize,
 mmu_linear_psize,
 mmu_kernel_ssize, 0);
 }
 
+static u8 *linear_map_hash_slots;
+static unsigned long linear_map_hash_count;
+static DEFINE_RAW_SPINLOCK(linear_map_hash_lock);
 static inline void hash_debug_pagealloc_alloc_slots(void)
 {
if (!debug_pagealloc_enabled())
@@ -362,9 +363,11 @@ static int hash_debug_pagealloc_map_pages(struct page 
*page, int numpages,
if (lmi >= linear_map_hash_count)
continue;
if (enable)
-   kernel_map_linear_page(vaddr, lmi);
+   kernel_map_linear_page(vaddr, lmi,
+   linear_map_hash_slots, &linear_map_hash_lock);
else
-   kernel_unmap_linear_page(vaddr, lmi);
+   kernel_unmap_linear_page(vaddr, lmi,
+   linear_map_hash_slots, &linear_map_hash_lock);
}
local_irq_restore(flags);
return 0;
-- 
2.46.0




[RFC v2 07/13] book3s64/hash: Refactor hash__kernel_map_pages() function

2024-09-18 Thread Ritesh Harjani (IBM)
This refactors the hash__kernel_map_pages() function to call
hash_debug_pagealloc_map_pages(). This will come in useful when we add
kfence support.

No functionality changes in this patch.

Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 030c120d1399..da9b089c8e8b 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -349,7 +349,8 @@ static inline void 
hash_debug_pagealloc_add_slot(phys_addr_t paddr, int slot)
linear_map_hash_slots[paddr >> PAGE_SHIFT] = slot | 0x80;
 }
 
-int hash__kernel_map_pages(struct page *page, int numpages, int enable)
+static int hash_debug_pagealloc_map_pages(struct page *page, int numpages,
+ int enable)
 {
unsigned long flags, vaddr, lmi;
int i;
@@ -368,6 +369,12 @@ int hash__kernel_map_pages(struct page *page, int 
numpages, int enable)
local_irq_restore(flags);
return 0;
 }
+
+int hash__kernel_map_pages(struct page *page, int numpages, int enable)
+{
+   return hash_debug_pagealloc_map_pages(page, numpages, enable);
+}
+
 #else /* CONFIG_DEBUG_PAGEALLOC */
 int hash__kernel_map_pages(struct page *page, int numpages,
 int enable)
-- 
2.46.0




[RFC v2 05/13] book3s64/hash: Add hash_debug_pagealloc_add_slot() function

2024-09-18 Thread Ritesh Harjani (IBM)
This adds the hash_debug_pagealloc_add_slot() function instead of open
coding it in htab_bolt_mapping(). This is required since we will be
separating the kfence functionality so that it does not depend upon
debug_pagealloc.

No functionality change in this patch.

Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 82151fff9648..6e3860224351 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -328,6 +328,14 @@ static void kernel_unmap_linear_page(unsigned long vaddr, 
unsigned long lmi)
 mmu_kernel_ssize, 0);
 }
 
+static inline void hash_debug_pagealloc_add_slot(phys_addr_t paddr, int slot)
+{
+   if (!debug_pagealloc_enabled())
+   return;
+   if ((paddr >> PAGE_SHIFT) < linear_map_hash_count)
+   linear_map_hash_slots[paddr >> PAGE_SHIFT] = slot | 0x80;
+}
+
 int hash__kernel_map_pages(struct page *page, int numpages, int enable)
 {
unsigned long flags, vaddr, lmi;
@@ -353,6 +361,7 @@ int hash__kernel_map_pages(struct page *page, int numpages,
 {
return 0;
 }
+static inline void hash_debug_pagealloc_add_slot(phys_addr_t paddr, int slot) 
{}
 #endif /* CONFIG_DEBUG_PAGEALLOC */
 
 /*
@@ -513,9 +522,7 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long 
vend,
break;
 
cond_resched();
-   if (debug_pagealloc_enabled() &&
-   (paddr >> PAGE_SHIFT) < linear_map_hash_count)
-   linear_map_hash_slots[paddr >> PAGE_SHIFT] = ret | 0x80;
+   hash_debug_pagealloc_add_slot(paddr, ret);
}
return ret < 0 ? ret : 0;
 }
-- 
2.46.0




[RFC v2 04/13] book3s64/hash: Refactor kernel linear map related calls

2024-09-18 Thread Ritesh Harjani (IBM)
This just brings all linear map related handling to one place instead of
having those functions scattered across the hash_utils file.
This makes it easier to review.

No functionality changes in this patch.

Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 164 +-
 1 file changed, 82 insertions(+), 82 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 296bb74dbf40..82151fff9648 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -273,6 +273,88 @@ void hash__tlbiel_all(unsigned int action)
WARN(1, "%s called on pre-POWER7 CPU\n", __func__);
 }
 
+#ifdef CONFIG_DEBUG_PAGEALLOC
+static DEFINE_RAW_SPINLOCK(linear_map_hash_lock);
+
+static void kernel_map_linear_page(unsigned long vaddr, unsigned long lmi)
+{
+   unsigned long hash;
+   unsigned long vsid = get_kernel_vsid(vaddr, mmu_kernel_ssize);
+   unsigned long vpn = hpt_vpn(vaddr, vsid, mmu_kernel_ssize);
+   unsigned long mode = htab_convert_pte_flags(pgprot_val(PAGE_KERNEL), 
HPTE_USE_KERNEL_KEY);
+   long ret;
+
+   hash = hpt_hash(vpn, PAGE_SHIFT, mmu_kernel_ssize);
+
+   /* Don't create HPTE entries for bad address */
+   if (!vsid)
+   return;
+
+   if (linear_map_hash_slots[lmi] & 0x80)
+   return;
+
+   ret = hpte_insert_repeating(hash, vpn, __pa(vaddr), mode,
+   HPTE_V_BOLTED,
+   mmu_linear_psize, mmu_kernel_ssize);
+
+   BUG_ON (ret < 0);
+   raw_spin_lock(&linear_map_hash_lock);
+   BUG_ON(linear_map_hash_slots[lmi] & 0x80);
+   linear_map_hash_slots[lmi] = ret | 0x80;
+   raw_spin_unlock(&linear_map_hash_lock);
+}
+
+static void kernel_unmap_linear_page(unsigned long vaddr, unsigned long lmi)
+{
+   unsigned long hash, hidx, slot;
+   unsigned long vsid = get_kernel_vsid(vaddr, mmu_kernel_ssize);
+   unsigned long vpn = hpt_vpn(vaddr, vsid, mmu_kernel_ssize);
+
+   hash = hpt_hash(vpn, PAGE_SHIFT, mmu_kernel_ssize);
+   raw_spin_lock(&linear_map_hash_lock);
+   if (!(linear_map_hash_slots[lmi] & 0x80)) {
+   raw_spin_unlock(&linear_map_hash_lock);
+   return;
+   }
+   hidx = linear_map_hash_slots[lmi] & 0x7f;
+   linear_map_hash_slots[lmi] = 0;
+   raw_spin_unlock(&linear_map_hash_lock);
+   if (hidx & _PTEIDX_SECONDARY)
+   hash = ~hash;
+   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+   slot += hidx & _PTEIDX_GROUP_IX;
+   mmu_hash_ops.hpte_invalidate(slot, vpn, mmu_linear_psize,
+mmu_linear_psize,
+mmu_kernel_ssize, 0);
+}
+
+int hash__kernel_map_pages(struct page *page, int numpages, int enable)
+{
+   unsigned long flags, vaddr, lmi;
+   int i;
+
+   local_irq_save(flags);
+   for (i = 0; i < numpages; i++, page++) {
+   vaddr = (unsigned long)page_address(page);
+   lmi = __pa(vaddr) >> PAGE_SHIFT;
+   if (lmi >= linear_map_hash_count)
+   continue;
+   if (enable)
+   kernel_map_linear_page(vaddr, lmi);
+   else
+   kernel_unmap_linear_page(vaddr, lmi);
+   }
+   local_irq_restore(flags);
+   return 0;
+}
+#else /* CONFIG_DEBUG_PAGEALLOC */
+int hash__kernel_map_pages(struct page *page, int numpages,
+int enable)
+{
+   return 0;
+}
+#endif /* CONFIG_DEBUG_PAGEALLOC */
+
 /*
  * 'R' and 'C' update notes:
  *  - Under pHyp or KVM, the updatepp path will not set C, thus it *will*
@@ -2120,88 +2202,6 @@ void hpt_do_stress(unsigned long ea, unsigned long 
hpte_group)
}
 }
 
-#ifdef CONFIG_DEBUG_PAGEALLOC
-static DEFINE_RAW_SPINLOCK(linear_map_hash_lock);
-
-static void kernel_map_linear_page(unsigned long vaddr, unsigned long lmi)
-{
-   unsigned long hash;
-   unsigned long vsid = get_kernel_vsid(vaddr, mmu_kernel_ssize);
-   unsigned long vpn = hpt_vpn(vaddr, vsid, mmu_kernel_ssize);
-   unsigned long mode = htab_convert_pte_flags(pgprot_val(PAGE_KERNEL), 
HPTE_USE_KERNEL_KEY);
-   long ret;
-
-   hash = hpt_hash(vpn, PAGE_SHIFT, mmu_kernel_ssize);
-
-   /* Don't create HPTE entries for bad address */
-   if (!vsid)
-   return;
-
-   if (linear_map_hash_slots[lmi] & 0x80)
-   return;
-
-   ret = hpte_insert_repeating(hash, vpn, __pa(vaddr), mode,
-   HPTE_V_BOLTED,
-   mmu_linear_psize, mmu_kernel_ssize);
-
-   BUG_ON (ret < 0);
-   raw_spin_lock(&linear_map_hash_lock);
-   BUG_ON(linear_map_hash_slots[lmi] &

[RFC v2 03/13] book3s64/hash: Remove kfence support temporarily

2024-09-18 Thread Ritesh Harjani (IBM)
Kfence on book3s Hash on pseries is anyway broken. It fails to boot
due to the RMA size limitation. That is because kfence with Hash uses
the debug_pagealloc infrastructure. debug_pagealloc allocates a linear
map for the entire DRAM size instead of just the kfence relevant
objects. This means for 16TB of DRAM it will require
(16TB >> PAGE_SHIFT) bytes, i.e. 256MB, which is half of the RMA
region on P8. The crash kernel reserves 256MB and we also need
2048 * 16KB * 3 for emergency stacks and some more for paca
allocations. That means there is not enough memory for reserving the
full linear map in the RMA region if the DRAM size is too big (>=16TB).
(The issue is seen above 8TB with a 256MB crash kernel reservation.)

Now kfence does not require a linear memory map for the entire DRAM.
It only needs one for the kfence objects. So this patch temporarily
removes the kfence functionality, since the debug_pagealloc code needs
some refactoring. We will bring kfence support on Hash back in later
patches.

Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/include/asm/kfence.h |  5 +
 arch/powerpc/mm/book3s64/hash_utils.c | 16 +++-
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/kfence.h 
b/arch/powerpc/include/asm/kfence.h
index fab124ada1c7..f3a9476a71b3 100644
--- a/arch/powerpc/include/asm/kfence.h
+++ b/arch/powerpc/include/asm/kfence.h
@@ -10,6 +10,7 @@
 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_PPC64_ELF_ABI_V1
 #define ARCH_FUNC_PREFIX "."
@@ -25,6 +26,10 @@ static inline void disable_kfence(void)
 
 static inline bool arch_kfence_init_pool(void)
 {
+#ifdef CONFIG_PPC64
+   if (!radix_enabled())
+   return false;
+#endif
return !kfence_disabled;
 }
 #endif
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index e1eadd03f133..296bb74dbf40 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -431,7 +431,7 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long 
vend,
break;
 
cond_resched();
-   if (debug_pagealloc_enabled_or_kfence() &&
+   if (debug_pagealloc_enabled() &&
(paddr >> PAGE_SHIFT) < linear_map_hash_count)
linear_map_hash_slots[paddr >> PAGE_SHIFT] = ret | 0x80;
}
@@ -814,7 +814,7 @@ static void __init htab_init_page_sizes(void)
bool aligned = true;
init_hpte_page_sizes();
 
-   if (!debug_pagealloc_enabled_or_kfence()) {
+   if (!debug_pagealloc_enabled()) {
/*
 * Pick a size for the linear mapping. Currently, we only
 * support 16M, 1M and 4K which is the default
@@ -1134,7 +1134,7 @@ static void __init htab_initialize(void)
 
prot = pgprot_val(PAGE_KERNEL);
 
-   if (debug_pagealloc_enabled_or_kfence()) {
+   if (debug_pagealloc_enabled()) {
linear_map_hash_count = memblock_end_of_DRAM() >> PAGE_SHIFT;
linear_map_hash_slots = memblock_alloc_try_nid(
linear_map_hash_count, 1, MEMBLOCK_LOW_LIMIT,
@@ -2120,7 +2120,7 @@ void hpt_do_stress(unsigned long ea, unsigned long 
hpte_group)
}
 }
 
-#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KFENCE)
+#ifdef CONFIG_DEBUG_PAGEALLOC
 static DEFINE_RAW_SPINLOCK(linear_map_hash_lock);
 
 static void kernel_map_linear_page(unsigned long vaddr, unsigned long lmi)
@@ -2194,7 +2194,13 @@ int hash__kernel_map_pages(struct page *page, int 
numpages, int enable)
local_irq_restore(flags);
return 0;
 }
-#endif /* CONFIG_DEBUG_PAGEALLOC || CONFIG_KFENCE */
+#else /* CONFIG_DEBUG_PAGEALLOC */
+int hash__kernel_map_pages(struct page *page, int numpages,
+int enable)
+{
+   return 0;
+}
+#endif /* CONFIG_DEBUG_PAGEALLOC */
 
 void hash__setup_initial_memory_limit(phys_addr_t first_memblock_base,
phys_addr_t first_memblock_size)
-- 
2.46.0




[RFC v2 02/13] powerpc: mm: Fix kfence page fault reporting

2024-09-18 Thread Ritesh Harjani (IBM)
copy_from_kernel_nofault() can be called when doing a read of
/proc/kcore. /proc/kcore can have some unmapped kfence objects which,
when read via copy_from_kernel_nofault(), can cause page faults. Since
the *_nofault() functions define their own fixup table for handling
faults, use that instead of asking kfence to handle such faults.

Hence we search the exception tables for the nip which generated the
fault. If there is an entry then we let the fixup table handler handle the
page fault by returning an error from within ___do_page_fault().

This can be easily triggered if someone tries to do dd from /proc/kcore.
dd if=/proc/kcore of=/dev/null bs=1M


===
BUG: KFENCE: invalid read in copy_from_kernel_nofault+0xb0/0x1c8
Invalid read at 0x4f749d2e:
 copy_from_kernel_nofault+0xb0/0x1c8
 0xc57f7950
 read_kcore_iter+0x41c/0x9ac
 proc_reg_read_iter+0xe4/0x16c
 vfs_read+0x2e4/0x3b0
 ksys_read+0x88/0x154
 system_call_exception+0x124/0x340
 system_call_common+0x160/0x2c4

BUG: KFENCE: use-after-free read in copy_from_kernel_nofault+0xb0/0x1c8
Use-after-free read at 0x8fbb08ad (in kfence-#0):
 copy_from_kernel_nofault+0xb0/0x1c8
 0xc57f7950
 read_kcore_iter+0x41c/0x9ac
 proc_reg_read_iter+0xe4/0x16c
 vfs_read+0x2e4/0x3b0
 ksys_read+0x88/0x154
 system_call_exception+0x124/0x340
 system_call_common+0x160/0x2c4

Guessing the fix should go back to when we first got kfence on PPC32.

Fixes: 90cbac0e995d ("powerpc: Enable KFENCE for PPC32")
Reported-by: Disha Goel 
Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/mm/fault.c | 10 +-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/fault.c b/arch/powerpc/mm/fault.c
index 81c77ddce2e3..fa825198f29f 100644
--- a/arch/powerpc/mm/fault.c
+++ b/arch/powerpc/mm/fault.c
@@ -439,9 +439,17 @@ static int ___do_page_fault(struct pt_regs *regs, unsigned 
long address,
/*
 * The kernel should never take an execute fault nor should it
 * take a page fault to a kernel address or a page fault to a user
-* address outside of dedicated places
+* address outside of dedicated places.
+*
+* Rather than kfence reporting false negatives, let the fixup table
+* handler handle the page fault by returning SIGSEGV, if the fault
+* has come from functions like copy_from_kernel_nofault().
 */
if (unlikely(!is_user && bad_kernel_fault(regs, error_code, address, 
is_write))) {
+
+   if (search_exception_tables(instruction_pointer(regs)))
+   return SIGSEGV;
+
if (kfence_handle_page_fault(address, is_write, regs))
return 0;
 
-- 
2.46.0




[RFC v2 01/13] mm/kfence: Add a new kunit test test_use_after_free_read_nofault()

2024-09-18 Thread Ritesh Harjani (IBM)
From: Nirjhar Roy 

Faults from copy_from_kernel_nofault() needs to be handled by fixup
table and should not be handled by kfence. Otherwise while reading
/proc/kcore which uses copy_from_kernel_nofault(), kfence can generate
false negatives. This can happen when /proc/kcore ends up reading an
unmapped address from kfence pool.

Let's add a testcase to cover this case.

Co-developed-by: Ritesh Harjani (IBM) 
Signed-off-by: Ritesh Harjani (IBM) 
Signed-off-by: Nirjhar Roy 
Cc: kasan-...@googlegroups.com
Cc: Alexander Potapenko 
Cc: linux...@kvack.org
---
 mm/kfence/kfence_test.c | 17 +
 1 file changed, 17 insertions(+)

diff --git a/mm/kfence/kfence_test.c b/mm/kfence/kfence_test.c
index 00fd17285285..f65fb182466d 100644
--- a/mm/kfence/kfence_test.c
+++ b/mm/kfence/kfence_test.c
@@ -383,6 +383,22 @@ static void test_use_after_free_read(struct kunit *test)
KUNIT_EXPECT_TRUE(test, report_matches(&expect));
 }

+static void test_use_after_free_read_nofault(struct kunit *test)
+{
+   const size_t size = 32;
+   char *addr;
+   char dst;
+   int ret;
+
+   setup_test_cache(test, size, 0, NULL);
+   addr = test_alloc(test, size, GFP_KERNEL, ALLOCATE_ANY);
+   test_free(addr);
+   /* Use after free with *_nofault() */
+   ret = copy_from_kernel_nofault(&dst, addr, 1);
+   KUNIT_EXPECT_EQ(test, ret, -EFAULT);
+   KUNIT_EXPECT_FALSE(test, report_available());
+}
+
 static void test_double_free(struct kunit *test)
 {
const size_t size = 32;
@@ -780,6 +796,7 @@ static struct kunit_case kfence_test_cases[] = {
KFENCE_KUNIT_CASE(test_out_of_bounds_read),
KFENCE_KUNIT_CASE(test_out_of_bounds_write),
KFENCE_KUNIT_CASE(test_use_after_free_read),
+   KFENCE_KUNIT_CASE(test_use_after_free_read_nofault),
KFENCE_KUNIT_CASE(test_double_free),
KFENCE_KUNIT_CASE(test_invalid_addr_free),
KFENCE_KUNIT_CASE(test_corruption),
--
2.46.0




[RFC v2 00/13] powerpc/kfence: Improve kfence support

2024-09-18 Thread Ritesh Harjani (IBM)
This patch series addresses the following to improve kfence support on Powerpc.

1. Usage of copy_from_kernel_nofault() within the kernel, such as a read
   from /proc/kcore, can cause kfence to report false negatives.

2. (book3s64) Kfence depends upon the debug_pagealloc infrastructure on Hash.
   debug_pagealloc allocates a linear map based on the size of the DRAM, i.e.
   1 byte for every 64K page. That means for a 16TB DRAM, it will need 256MB
   of memory for the linear map. Memory for the linear map on pseries comes
   from the RMA region, which has a size limitation. On P8 the RMA is 512MB,
   in which we also fit the crash kernel at 256MB, paca allocations and
   emergency stacks. That means there is not enough memory in the RMA region
   for the linear map based on the DRAM size (as required by debug_pagealloc).

   Now kfence only requires memory for its own objects. kfence by default
   requires only (255 + 1) * 2 pages, i.e. 32MB for a 64K page size (see the
   sketch below).
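
   A quick userspace sketch of that arithmetic, assuming the default
   CONFIG_KFENCE_NUM_OBJECTS=255 and a 64K page size (the
   KFENCE_POOL_SIZE define below only mirrors the generic kfence
   formula and is illustrative, not authoritative):

   #include <stdio.h>

   #define PAGE_SHIFT         16      /* 64K pages */
   #define PAGE_SIZE          (1UL << PAGE_SHIFT)
   #define KFENCE_NUM_OBJECTS 255     /* default CONFIG_KFENCE_NUM_OBJECTS */
   #define KFENCE_POOL_SIZE   ((KFENCE_NUM_OBJECTS + 1) * 2 * PAGE_SIZE)

   int main(void)
   {
           /* (255 + 1) * 2 = 512 pages of 64K each => 32MB of pool */
           printf("kfence pool size: %lu MB\n", KFENCE_POOL_SIZE >> 20);
           /* one linear map byte per pool page => 512 bytes on Hash */
           printf("hash linear map: %lu bytes\n",
                  KFENCE_POOL_SIZE >> PAGE_SHIFT);
           return 0;
   }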

Patch-1 of this series adds a kfence kunit testcase to detect the
copy_from_kernel_nofault() case. I assume the same should be needed for
all other archs as well.

Patch-2 adds a fix to handle these false negatives from
copy_from_kernel_nofault().

Patches [3-9] remove the direct dependency of kfence on the debug_pagealloc
infrastructure. We make the Hash kernel linear map functions take the linear
map array as a parameter so that they can support debug_pagealloc and kfence
individually. That means we don't need to keep the size of the linear map at
DRAM_SIZE >> PAGE_SHIFT anymore for kfence.

Patch-10: Adds kfence support with the above (abstracted out) kernel linear
map infrastructure. With it, this also fixes the boot failure problem when
kfence gets enabled on Hash with >=16TB of RAM.

Patch-11 & Patch-12: Ensure late initialization of kfence is disabled for
both Hash and Radix due to linear mapping size limitations. The commit
messages give more description.

Patch-13: Detects early if debug_pagealloc cannot be enabled (due to the RMA
size limitation) so that the linear mapping size can be set correctly during
init.

Testing:

It passes kfence kunit tests with Hash and Radix.
[   44.355173][T1] # kfence: pass:27 fail:0 skip:0 total:27
[   44.358631][T1] # Totals: pass:27 fail:0 skip:0 total:27
[   44.365570][T1] ok 1 kfence


Future TODO:

When kfence on Hash gets enabled, the kernel linear map uses PAGE_SIZE
mappings rather than the 16MB mapping.


v1 -> v2:
=
1. Added a kunit testcase patch-1.
2. Fixed a false negative with copy_from_kernel_nofault() in patch-2.
3. Addressed review comments from Christophe Leroy.
4. Added patch-13.


Nirjhar Roy (1):
  mm/kfence: Add a new kunit test test_use_after_free_read_nofault()

Ritesh Harjani (IBM) (12):
  powerpc: mm: Fix kfence page fault reporting
  book3s64/hash: Remove kfence support temporarily
  book3s64/hash: Refactor kernel linear map related calls
  book3s64/hash: Add hash_debug_pagealloc_add_slot() function
  book3s64/hash: Add hash_debug_pagealloc_alloc_slots() function
  book3s64/hash: Refactor hash__kernel_map_pages() function
  book3s64/hash: Make kernel_map_linear_page() generic
  book3s64/hash: Disable debug_pagealloc if it requires more memory
  book3s64/hash: Add kfence functionality
  book3s64/radix: Refactoring common kfence related functions
  book3s64/hash: Disable kfence if not early init
  book3s64/hash: Early detect debug_pagealloc size requirement

 arch/powerpc/include/asm/kfence.h|   8 +-
 arch/powerpc/mm/book3s64/hash_utils.c| 364 +--
 arch/powerpc/mm/book3s64/pgtable.c   |  13 +
 arch/powerpc/mm/book3s64/radix_pgtable.c |  12 -
 arch/powerpc/mm/fault.c  |  10 +-
 arch/powerpc/mm/init-common.c|   1 +
 mm/kfence/kfence_test.c  |  17 ++
 7 files changed, 318 insertions(+), 107 deletions(-)

--
2.46.0




Re: [PATCH] powerpc/pseries/eeh: move pseries_eeh_err_inject() outside CONFIG_DEBUG_FS block

2024-09-17 Thread IBM
Narayana Murty N  writes:

> Makes pseries_eeh_err_inject() available even when debugfs
> is disabled (CONFIG_DEBUG_FS=n). It moves eeh_debugfs_break_device()
> and eeh_pe_inject_mmio_error() out of the CONFIG_DEBUG_FS block
> and renames it as eeh_break_device().
>
> Reported-by: kernel test robot 
> Closes: 
> https://lore.kernel.org/oe-kbuild-all/202409170509.vwc6jadc-...@intel.com/
> Fixes: b0e2b828dfca ("powerpc/pseries/eeh: Fix pseries_eeh_err_inject")
> Signed-off-by: Narayana Murty N 
> ---
>  arch/powerpc/kernel/eeh.c | 198 +++---
>  1 file changed, 99 insertions(+), 99 deletions(-)

Ok, so in your original patch you implemented the eeh_inject ops for pseries
using MMIO based EEH error injection (eeh_pe_inject_mmio_error()), which
uses the functions defined under debugfs -> eeh_debugfs_break_device().

This was failing when CONFIG_DEBUG_FS is not defined, since it then
referred to an undefined function.

Minor nit below.

>
> diff --git a/arch/powerpc/kernel/eeh.c b/arch/powerpc/kernel/eeh.c
> index 49ab11a287a3..0fe25e907ea6 100644
> --- a/arch/powerpc/kernel/eeh.c
> +++ b/arch/powerpc/kernel/eeh.c
> @@ -1574,6 +1574,104 @@ static int proc_eeh_show(struct seq_file *m, void *v)
>  }
>  #endif /* CONFIG_PROC_FS */
>  
> +static int eeh_break_device(struct pci_dev *pdev)
> +{
> + struct resource *bar = NULL;
> + void __iomem *mapped;
> + u16 old, bit;
> + int i, pos;
> +
> + /* Do we have an MMIO BAR to disable? */
> + for (i = 0; i <= PCI_STD_RESOURCE_END; i++) {
> + struct resource *r = &pdev->resource[i];
> +
> + if (!r->flags || !r->start)
> + continue;
> + if (r->flags & IORESOURCE_IO)
> + continue;
> + if (r->flags & IORESOURCE_UNSET)
> + continue;
> +
> + bar = r;
> + break;
> + }
> +
> + if (!bar) {
> + pci_err(pdev, "Unable to find Memory BAR to cause EEH with\n");
> + return -ENXIO;
> + }
> +
> + pci_err(pdev, "Going to break: %pR\n", bar);
> +
> + if (pdev->is_virtfn) {
> +#ifndef CONFIG_PCI_IOV
> + return -ENXIO;
> +#else
> + /*
> +  * VFs don't have a per-function COMMAND register, so the best
> +  * we can do is clear the Memory Space Enable bit in the PF's
> +  * SRIOV control reg.
> +  *
> +  * Unfortunately, this requires that we have a PF (i.e doesn't
> +  * work for a passed-through VF) and it has the potential side
> +  * effect of also causing an EEH on every other VF under the
> +  * PF. Oh well.
> +  */
> + pdev = pdev->physfn;
> + if (!pdev)
> + return -ENXIO; /* passed through VFs have no PF */
> +
> + pos  = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_SRIOV);
> + pos += PCI_SRIOV_CTRL;
> + bit  = PCI_SRIOV_CTRL_MSE;
> +#endif /* !CONFIG_PCI_IOV */
> + } else {
> + bit = PCI_COMMAND_MEMORY;
> + pos = PCI_COMMAND;
> + }
> +
> + /*
> +  * Process here is:
> +  *
> +  * 1. Disable Memory space.
> +  *
> +  * 2. Perform an MMIO to the device. This should result in an error
> +  *(CA  / UR) being raised by the device which results in an EEH
> +  *PE freeze. Using the in_8() accessor skips the eeh detection hook
> +  *so the freeze hook so the EEH Detection machinery won't be
> +  *triggered here. This is to match the usual behaviour of EEH
> +  *where the HW will asynchronously freeze a PE and it's up to
> +  *the kernel to notice and deal with it.
> +  *
> +  * 3. Turn Memory space back on. This is more important for VFs
> +  *since recovery will probably fail if we don't. For normal
> +  *the COMMAND register is reset as a part of re-initialising
> +  *the device.
> +  *
> +  * Breaking stuff is the point so who cares if it's racy ;)
> +  */
> + pci_read_config_word(pdev, pos, &old);
> +
> + mapped = ioremap(bar->start, PAGE_SIZE);
> + if (!mapped) {
> + pci_err(pdev, "Unable to map MMIO BAR %pR\n", bar);
> + return -ENXIO;
> + }
> +
> + pci_write_config_word(pdev, pos, old & ~bit);
> + in_8(mapped);
> + pci_write_config_word(pdev, pos, old);
> +
> + iounmap(mapped);
> +
> + return 0;
> +}
> +
> +int eeh_pe_inject_mmio_error(struct pci_dev *pdev)
> +{
> + return eeh_break_device(pdev);
> +}
> +

Why have an extra eeh_pe_inject_mmio_error() function which only calls
eeh_break_device()?

Maybe we can rename eeh_break_device() to eeh_mmio_break_device() and use
this function itself at both call sites?
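
Something along these lines is what I mean -- a sketch only to
illustrate the suggestion, not the actual patch (the function body
stays exactly as in the hunk above):

static int eeh_mmio_break_device(struct pci_dev *pdev)
{
	/* same body as eeh_break_device() in the hunk above, elided */
	return 0;
}

/*
 * ... and then both the debugfs handler and the pseries err_inject op
 * call eeh_mmio_break_device(pdev) directly, so the extra
 * eeh_pe_inject_mmio_error() wrapper can be dropped.
 */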

-ritesh



Re: [RFC v1 10/10] book3s64/hash: Disable kfence if not early init

2024-09-04 Thread IBM
Christophe Leroy  writes:

> On 31/07/2024 at 09:56, Ritesh Harjani (IBM) wrote:
>> 
>> Enable kfence on book3s64 hash only when early init is enabled.
>> This is because, kfence could cause the kernel linear map to be mapped
>> at PAGE_SIZE level instead of 16M (which I guess we don't want).
>> 
>> Also currently there is no way to -
>> 1. Make multiple page size entries for the SLB used for kernel linear
>> map.
>> 2. No easy way of getting the hash slot details after the page table
>> mapping for kernel linear setup. So even if kfence allocate the
>> pool in late init, we won't be able to get the hash slot details in
>> kfence linear map.
>> 
>> Thus this patch disables kfence on hash if kfence early init is not
>> enabled.
>> 
>> Signed-off-by: Ritesh Harjani (IBM) 
>> ---
>>   arch/powerpc/mm/book3s64/hash_utils.c | 5 -
>>   1 file changed, 4 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
>> b/arch/powerpc/mm/book3s64/hash_utils.c
>> index c66b9921fc7d..759dbcbf1483 100644
>> --- a/arch/powerpc/mm/book3s64/hash_utils.c
>> +++ b/arch/powerpc/mm/book3s64/hash_utils.c
>> @@ -410,6 +410,8 @@ static phys_addr_t kfence_pool;
>> 
>>   static inline void hash_kfence_alloc_pool(void)
>>   {
>> +   if (!kfence_early_init)
>> +   goto err;
>> 
>>  // allocate linear map for kfence within RMA region
>>  linear_map_kf_hash_count = KFENCE_POOL_SIZE >> PAGE_SHIFT;
>> @@ -1074,7 +1076,8 @@ static void __init htab_init_page_sizes(void)
>>  bool aligned = true;
>>  init_hpte_page_sizes();
>> 
>> -   if (!debug_pagealloc_enabled_or_kfence()) {
>> +   if (!debug_pagealloc_enabled() &&
>> +   !(IS_ENABLED(CONFIG_KFENCE) && kfence_early_init)) {
>
> Looks complex, can we do simpler ?
>

Yes, kfence_early_init anyway needs a cleanup. Will make it simpler.
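
Something like the below is one possible direction (a sketch only, the
helper name is made up):

static inline bool hash_supports_large_linear_map(void)
{
	/*
	 * A PAGE_SIZE linear map is needed when either debug_pagealloc
	 * or early-init kfence is active, in which case we cannot use
	 * the 16M/1M linear mapping.
	 */
	return !debug_pagealloc_enabled() &&
	       !(IS_ENABLED(CONFIG_KFENCE) && kfence_early_init);
}

and then htab_init_page_sizes() would just do:

	if (hash_supports_large_linear_map()) {
		/* pick 16M/1M/4K for the linear mapping as before */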

Thanks for the review!

-ritesh



Re: [RFC v1 09/10] book3s64/radix: Refactoring common kfence related functions

2024-09-04 Thread IBM
Christophe Leroy  writes:

> On 31/07/2024 at 09:56, Ritesh Harjani (IBM) wrote:
>> 
>> Both radix and hash on book3s requires to detect if kfence
>> early init is enabled or not. Hash needs to disable kfence
>> if early init is not enabled because with kfence the linear map is
>> mapped using PAGE_SIZE rather than 16M mapping.
>> We don't support multiple page sizes for slb entry used for kernel
>> linear map in book3s64.
>> 
>> This patch refactors out the common functions required to detect kfence
>> early init is enabled or not.
>> 
>> Signed-off-by: Ritesh Harjani (IBM) 
>> ---
>>   arch/powerpc/include/asm/kfence.h|  2 ++
>>   arch/powerpc/mm/book3s64/radix_pgtable.c | 12 
>>   arch/powerpc/mm/init-common.c| 12 
>>   3 files changed, 14 insertions(+), 12 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/kfence.h 
>> b/arch/powerpc/include/asm/kfence.h
>> index fab124ada1c7..5975688d8de1 100644
>> --- a/arch/powerpc/include/asm/kfence.h
>> +++ b/arch/powerpc/include/asm/kfence.h
>> @@ -15,6 +15,8 @@
>>   #define ARCH_FUNC_PREFIX "."
>>   #endif
>> 
>> +extern bool kfence_early_init;
>> +
>>   #ifdef CONFIG_KFENCE
>>   extern bool kfence_disabled;
>> 
>> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
>> b/arch/powerpc/mm/book3s64/radix_pgtable.c
>> index b0d927009af8..311e2112d782 100644
>> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
>> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
>> @@ -363,18 +363,6 @@ static int __meminit create_physical_mapping(unsigned 
>> long start,
>>   }
>> 
>>   #ifdef CONFIG_KFENCE
>> -static bool __ro_after_init kfence_early_init = 
>> !!CONFIG_KFENCE_SAMPLE_INTERVAL;
>> -
>> -static int __init parse_kfence_early_init(char *arg)
>> -{
>> -   int val;
>> -
>> -   if (get_option(&arg, &val))
>> -   kfence_early_init = !!val;
>> -   return 0;
>> -}
>> -early_param("kfence.sample_interval", parse_kfence_early_init);
>> -
>>   static inline phys_addr_t alloc_kfence_pool(void)
>>   {
>>  phys_addr_t kfence_pool;
>> diff --git a/arch/powerpc/mm/init-common.c b/arch/powerpc/mm/init-common.c
>> index 21131b96d209..259821a4db62 100644
>> --- a/arch/powerpc/mm/init-common.c
>> +++ b/arch/powerpc/mm/init-common.c
>> @@ -33,6 +33,18 @@ bool disable_kuep = !IS_ENABLED(CONFIG_PPC_KUEP);
>>   bool disable_kuap = !IS_ENABLED(CONFIG_PPC_KUAP);
>>   #ifdef CONFIG_KFENCE
>>   bool __ro_after_init kfence_disabled;
>> +bool __ro_after_init kfence_early_init = !!CONFIG_KFENCE_SAMPLE_INTERVAL;
>> +static int __init parse_kfence_early_init(char *arg)
>
> If I understand correctly, previously it was only for radix, now it is 
> for every platform including PPC32 ?
>

Ok. I see what you mean. Let me see how I can limit this cmdline parsing
of kfence and/or special-case the kfence handling to book3s64 only.
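
Roughly something like the below (a sketch only, not the final patch):
keep the early_param parsing in init-common.c, but compile it only for
book3s64, where the Hash/Radix code actually consumes kfence_early_init:

#if defined(CONFIG_KFENCE) && defined(CONFIG_PPC_BOOK3S_64)
bool __ro_after_init kfence_early_init = !!CONFIG_KFENCE_SAMPLE_INTERVAL;

static int __init parse_kfence_early_init(char *arg)
{
	int val;

	if (get_option(&arg, &val))
		kfence_early_init = !!val;
	return 0;
}
early_param("kfence.sample_interval", parse_kfence_early_init);
#endif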

>> +{
>> +   int val;
>> +
>> +   if (get_option(&arg, &val))
>> +   kfence_early_init = !!val;
>> +   return 0;
>> +}
>> +early_param("kfence.sample_interval", parse_kfence_early_init);
>> +#else
>> +bool __ro_after_init kfence_early_init;
>
> I don't understand, why do you need that in the #else case ?
>

Yes, I don't like it either. Let me clean this up.
This was required in htab_init_page_sizes().

Thanks for pointing out.

-ritesh



Re: [RFC v1 01/10] book3s64/hash: Remove kfence support temporarily

2024-09-04 Thread IBM


Sorry for the delayed response. Was pulled into something else.

Christophe Leroy  writes:

> On 31/07/2024 at 09:56, Ritesh Harjani (IBM) wrote:
>> 
>> Kfence on book3s Hash on pseries is anyways broken. It fails to boot
>> due to RMA size limitation. That is because, kfence with Hash uses
>> debug_pagealloc infrastructure. debug_pagealloc allocates linear map
>> for entire dram size instead of just kfence relevant objects.
>> This means for 16TB of DRAM it will require (16TB >> PAGE_SHIFT)
>> which is 256MB which is half of RMA region on P8.
>> crash kernel reserves 256MB and we also need 2048 * 16KB * 3 for
>> emergency stack and some more for paca allocations.
>> That means there is not enough memory for reserving the full linear map
>> in the RMA region, if the DRAM size is too big (>=16TB)
>> (The issue is seen above 8TB with crash kernel 256 MB reservation).
>> 
>> Now Kfence does not require linear memory map for entire DRAM.
>> It only needs for kfence objects. So this patch temporarily removes the
>> kfence functionality since debug_pagealloc code needs some refactoring.
>> We will bring in kfence on Hash support in later patches.
>> 
>> Signed-off-by: Ritesh Harjani (IBM) 
>> ---
>>   arch/powerpc/include/asm/kfence.h |  5 +
>>   arch/powerpc/mm/book3s64/hash_utils.c | 16 +++-
>>   2 files changed, 16 insertions(+), 5 deletions(-)
>> 
>> diff --git a/arch/powerpc/include/asm/kfence.h 
>> b/arch/powerpc/include/asm/kfence.h
>> index fab124ada1c7..f3a9476a71b3 100644
>> --- a/arch/powerpc/include/asm/kfence.h
>> +++ b/arch/powerpc/include/asm/kfence.h
>> @@ -10,6 +10,7 @@
>> 
>>   #include 
>>   #include 
>> +#include 
>> 
>>   #ifdef CONFIG_PPC64_ELF_ABI_V1
>>   #define ARCH_FUNC_PREFIX "."
>> @@ -25,6 +26,10 @@ static inline void disable_kfence(void)
>> 
>>   static inline bool arch_kfence_init_pool(void)
>>   {
>> +#ifdef CONFIG_PPC64
>> +   if (!radix_enabled())
>> +   return false;
>> +#endif
>
> Avoid #ifdefs whenever possible. Here you can do:
>
>   if (IS_ENABLED(CONFIG_PPC64) && !radix_enabled())
>   return false;
>

Sure. I will change it.

>>  return !kfence_disabled;
>>   }
>>   #endif
>> diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
>> b/arch/powerpc/mm/book3s64/hash_utils.c
>> index 01c3b4b65241..1a1b50735fa0 100644
>> --- a/arch/powerpc/mm/book3s64/hash_utils.c
>> +++ b/arch/powerpc/mm/book3s64/hash_utils.c
>> @@ -431,7 +431,7 @@ int htab_bolt_mapping(unsigned long vstart, unsigned 
>> long vend,
>>  break;
>> 
>>  cond_resched();
>> -   if (debug_pagealloc_enabled_or_kfence() &&
>> +   if (debug_pagealloc_enabled() &&
>>  (paddr >> PAGE_SHIFT) < linear_map_hash_count)
>>  linear_map_hash_slots[paddr >> PAGE_SHIFT] = ret | 
>> 0x80;
>>  }
>> @@ -814,7 +814,7 @@ static void __init htab_init_page_sizes(void)
>>  bool aligned = true;
>>  init_hpte_page_sizes();
>> 
>> -   if (!debug_pagealloc_enabled_or_kfence()) {
>> +   if (!debug_pagealloc_enabled()) {
>>  /*
>>   * Pick a size for the linear mapping. Currently, we only
>>   * support 16M, 1M and 4K which is the default
>> @@ -1134,7 +1134,7 @@ static void __init htab_initialize(void)
>> 
>>  prot = pgprot_val(PAGE_KERNEL);
>> 
>> -   if (debug_pagealloc_enabled_or_kfence()) {
>> +   if (debug_pagealloc_enabled()) {
>>  linear_map_hash_count = memblock_end_of_DRAM() >> 
>> PAGE_SHIFT;
>>  linear_map_hash_slots = memblock_alloc_try_nid(
>>  linear_map_hash_count, 1, 
>> MEMBLOCK_LOW_LIMIT,
>> @@ -2117,7 +2117,7 @@ void hpt_do_stress(unsigned long ea, unsigned long 
>> hpte_group)
>>  }
>>   }
>> 
>> -#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KFENCE)
>> +#if defined(CONFIG_DEBUG_PAGEALLOC)
>
> Use #ifdef
>

Sure. Thanks!

-ritesh



Re: [PATCH] powerpc: Use printk instead of WARN in change_memory_attr

2024-08-27 Thread IBM
Christophe Leroy  writes:

> On 27/08/2024 at 11:12, Ritesh Harjani (IBM) wrote:
>> 
>> Use pr_warn_once instead of WARN_ON_ONCE as discussed here [1]
>> for printing possible use of set_memory_* on linear map on Hash.
>> 
>> [1]: https://lore.kernel.org/all/877cc2fpi2.fsf@mail.lhotse/#t
>> 
>> Signed-off-by: Ritesh Harjani (IBM) 
>> ---
>>   arch/powerpc/mm/pageattr.c | 5 -
>>   1 file changed, 4 insertions(+), 1 deletion(-)
>> 
>> diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c
>> index ac22bf28086f..c8c2d664c6f3 100644
>> --- a/arch/powerpc/mm/pageattr.c
>> +++ b/arch/powerpc/mm/pageattr.c
>> @@ -94,8 +94,11 @@ int change_memory_attr(unsigned long addr, int numpages, 
>> long action)
>>  if (!radix_enabled()) {
>>  int region = get_region_id(addr);
>> 
>> -   if (WARN_ON_ONCE(region != VMALLOC_REGION_ID && region != 
>> IO_REGION_ID))
>> +   if (region != VMALLOC_REGION_ID && region != IO_REGION_ID) {
>> +   pr_warn_once("%s: possible use of set_memory_* on 
>> linear map on Hash from (%ps)\n",
>> +   __func__, 
>> __builtin_return_address(0));
>
> Is it really only linear map ?
>
> What about "possible user of set_memory_* outside of vmalloc or io region.

"warning: possible user of set_memory_* outside of vmalloc and io region."

I am thinking of adding the word "warning" too. I can make the above change
and send a v2.

>
> Maybe a show_stack() would also be worth it ?

IMO, since we have the caller, we need not pollute dmesg with the entire
call stack. Besides, I am not aware of any dump_stack_once() style print
helper.
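
(For completeness, if we ever did want the backtrace once, an open-coded
guard like the below would do -- just a sketch, the flag name is made up:)

	if (region != VMALLOC_REGION_ID && region != IO_REGION_ID) {
		static bool stack_dumped;

		pr_warn_once("%s: possible use of set_memory_* on linear map on Hash from (%ps)\n",
			     __func__, __builtin_return_address(0));
		if (!stack_dumped) {
			stack_dumped = true;
			dump_stack();
		}
		return -EINVAL;
	}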

>
>
> But in principle I think it would be better to keep the WARN_ONCE until 
> we can add __must_check to set_memory_xxx() functions to be sure all 
> callers check the result, as mandated by 
> https://github.com/KSPP/linux/issues/7

Fixing the callers to check the return value is something that need not
depend on this change, no?

The intention of this change is mainly to remove the heavy WARN_ON_ONCE
from the powerpc-specific change_memory_attr() and convert it to a printk
warning.

-ritesh



[PATCH] powerpc: Use printk instead of WARN in change_memory_attr

2024-08-27 Thread Ritesh Harjani (IBM)
Use pr_warn_once instead of WARN_ON_ONCE, as discussed here [1], for
printing a possible use of set_memory_* on the linear map on Hash.

[1]: https://lore.kernel.org/all/877cc2fpi2.fsf@mail.lhotse/#t

Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/mm/pageattr.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c
index ac22bf28086f..c8c2d664c6f3 100644
--- a/arch/powerpc/mm/pageattr.c
+++ b/arch/powerpc/mm/pageattr.c
@@ -94,8 +94,11 @@ int change_memory_attr(unsigned long addr, int numpages, 
long action)
if (!radix_enabled()) {
int region = get_region_id(addr);

-   if (WARN_ON_ONCE(region != VMALLOC_REGION_ID && region != 
IO_REGION_ID))
+   if (region != VMALLOC_REGION_ID && region != IO_REGION_ID) {
+   pr_warn_once("%s: possible use of set_memory_* on 
linear map on Hash from (%ps)\n",
+   __func__, __builtin_return_address(0));
return -EINVAL;
+   }
}
 #endif

--
2.39.2




Re: linux-next: boot warning after merge of the vfs-brauner tree

2024-08-26 Thread IBM
Luis Chamberlain  writes:

> On Mon, Aug 26, 2024 at 02:10:49PM -0700, Darrick J. Wong wrote:
>> On Mon, Aug 26, 2024 at 01:52:54PM -0700, Luis Chamberlain wrote:
>> > On Mon, Aug 26, 2024 at 07:43:20PM +0200, Christophe Leroy wrote:
>> > > 
>> > > 
>> > > On 26/08/2024 at 17:48, Pankaj Raghav (Samsung) wrote:
>> > > > On Mon, Aug 26, 2024 at 05:59:31PM +1000, Stephen Rothwell wrote:
>> > > > > Hi all,
>> > > > > 
>> > > > > After merging the vfs-brauner tree, today's linux-next boot test 
>> > > > > (powerpc
>> > > > > pseries_le_defconfig) produced this warning:
>> > > > 
>> > > > iomap dio calls set_memory_ro() on the page that is used for sub block
>> > > > zeroing.
>> > > > 
>> > > > But looking at powerpc code, they don't support set_memory_ro() for
>> > > > memory region that belongs to the kernel(LINEAR_MAP_REGION_ID).
>> > > > 
>> > > > /*
>> > > >   * On hash, the linear mapping is not in the Linux page table so
>> > > >   * apply_to_existing_page_range() will have no effect. If in the 
>> > > > future
>> > > >   * the set_memory_* functions are used on the linear map this will 
>> > > > need
>> > > >   * to be updated.
>> > > >   */
>> > > > if (!radix_enabled()) {
>> > > >  int region = get_region_id(addr);
>> > > > 
>> > > >  if (WARN_ON_ONCE(region != VMALLOC_REGION_ID && region != 
>> > > > IO_REGION_ID))
>> > > >  return -EINVAL;
>> > > > }
>> > > > 
>> > > > We call set_memory_ro() on the zero page as a extra security measure.
>> > > > I don't know much about powerpc, but looking at the comment, is it just
>> > > > adding the following to support it in powerpc:
>> > > > 
>> > > > diff --git a/arch/powerpc/mm/pageattr.c b/arch/powerpc/mm/pageattr.c
>> > > > index ac22bf28086fa..e6e0b40ba6db4 100644
>> > > > --- a/arch/powerpc/mm/pageattr.c
>> > > > +++ b/arch/powerpc/mm/pageattr.c
>> > > > @@ -94,7 +94,9 @@ int change_memory_attr(unsigned long addr, int 
>> > > > numpages, long action)
>> > > >  if (!radix_enabled()) {
>> > > >  int region = get_region_id(addr);
>> > > > -   if (WARN_ON_ONCE(region != VMALLOC_REGION_ID && region 
>> > > > != IO_REGION_ID))
>> > > > +   if (WARN_ON_ONCE(region != VMALLOC_REGION_ID &&
>> > > > +region != IO_REGION_ID &&
>> > > > +region != LINEAR_MAP_REGION_ID))
>> > > >  return -EINVAL;
>> > > >  }
>> > > >   #endif
>> > > 
>> > > By doing this you will just hide the fact that it didn't work.
>> > > 
>> > > See commit 1f9ad21c3b38 ("powerpc/mm: Implement set_memory() routines") 
>> > > for
>> > > details. The linear memory region is not mapped using page tables so
>> > > set_memory_ro() will have no effect on it.
>> > > 
>> > > You can either use vmalloc'ed pages, or do a const static allocation at
>> > > buildtime so that it will be allocated in the kernel static rodata area.
>> > > 
>> > > By the way, your code should check the value returned by set_memory_ro(),
>> > > there is some work in progress to make it mandatory, see
>> > > https://github.com/KSPP/linux/issues/7
>> > 
>> > Our users expect contiguous memory [0] and so we use alloc_pages() here,
>> > so if we're architecture limitted by this I'd rather we just remove the
>> > set_memory_ro() only for PPC, I don't see why other have to skip this.

Looks like it is not a standard thing to do for the kernel linear memory
map region then, and maybe a few other archs could be ignoring it too?

>> 
>> Just drop it, then.
>
> OK sent a patch for that.
>

Thanks for fixing it!

-ritesh



[RFC v1 10/10] book3s64/hash: Disable kfence if not early init

2024-07-31 Thread Ritesh Harjani (IBM)
Enable kfence on book3s64 Hash only when early init is enabled.
This is because kfence could cause the kernel linear map to be mapped
at PAGE_SIZE level instead of 16M (which I guess we don't want).

Also, currently there is no way to -
1. Make multiple page size entries for the SLB used for the kernel
   linear map.
2. Easily get the hash slot details after the page table mapping for
   the kernel linear map is set up. So even if kfence allocates the
   pool in late init, we won't be able to get the hash slot details for
   the kfence linear map.

Thus this patch disables kfence on Hash if kfence early init is not
enabled.

Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index c66b9921fc7d..759dbcbf1483 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -410,6 +410,8 @@ static phys_addr_t kfence_pool;

 static inline void hash_kfence_alloc_pool(void)
 {
+   if (!kfence_early_init)
+   goto err;

// allocate linear map for kfence within RMA region
linear_map_kf_hash_count = KFENCE_POOL_SIZE >> PAGE_SHIFT;
@@ -1074,7 +1076,8 @@ static void __init htab_init_page_sizes(void)
bool aligned = true;
init_hpte_page_sizes();

-   if (!debug_pagealloc_enabled_or_kfence()) {
+   if (!debug_pagealloc_enabled() &&
+   !(IS_ENABLED(CONFIG_KFENCE) && kfence_early_init)) {
/*
 * Pick a size for the linear mapping. Currently, we only
 * support 16M, 1M and 4K which is the default
--
2.45.2



[RFC v1 09/10] book3s64/radix: Refactoring common kfence related functions

2024-07-31 Thread Ritesh Harjani (IBM)
Both Radix and Hash on book3s need to detect whether kfence early init
is enabled or not. Hash needs to disable kfence if early init is not
enabled, because with kfence the linear map is mapped using PAGE_SIZE
rather than the 16M mapping. We don't support multiple page sizes for
the SLB entry used for the kernel linear map in book3s64.

This patch refactors out the common functions required to detect
whether kfence early init is enabled or not.

Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/include/asm/kfence.h|  2 ++
 arch/powerpc/mm/book3s64/radix_pgtable.c | 12 
 arch/powerpc/mm/init-common.c| 12 
 3 files changed, 14 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/include/asm/kfence.h 
b/arch/powerpc/include/asm/kfence.h
index fab124ada1c7..5975688d8de1 100644
--- a/arch/powerpc/include/asm/kfence.h
+++ b/arch/powerpc/include/asm/kfence.h
@@ -15,6 +15,8 @@
 #define ARCH_FUNC_PREFIX "."
 #endif
 
+extern bool kfence_early_init;
+
 #ifdef CONFIG_KFENCE
 extern bool kfence_disabled;
 
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index b0d927009af8..311e2112d782 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -363,18 +363,6 @@ static int __meminit create_physical_mapping(unsigned long 
start,
 }
 
 #ifdef CONFIG_KFENCE
-static bool __ro_after_init kfence_early_init = 
!!CONFIG_KFENCE_SAMPLE_INTERVAL;
-
-static int __init parse_kfence_early_init(char *arg)
-{
-   int val;
-
-   if (get_option(&arg, &val))
-   kfence_early_init = !!val;
-   return 0;
-}
-early_param("kfence.sample_interval", parse_kfence_early_init);
-
 static inline phys_addr_t alloc_kfence_pool(void)
 {
phys_addr_t kfence_pool;
diff --git a/arch/powerpc/mm/init-common.c b/arch/powerpc/mm/init-common.c
index 21131b96d209..259821a4db62 100644
--- a/arch/powerpc/mm/init-common.c
+++ b/arch/powerpc/mm/init-common.c
@@ -33,6 +33,18 @@ bool disable_kuep = !IS_ENABLED(CONFIG_PPC_KUEP);
 bool disable_kuap = !IS_ENABLED(CONFIG_PPC_KUAP);
 #ifdef CONFIG_KFENCE
 bool __ro_after_init kfence_disabled;
+bool __ro_after_init kfence_early_init = !!CONFIG_KFENCE_SAMPLE_INTERVAL;
+static int __init parse_kfence_early_init(char *arg)
+{
+   int val;
+
+   if (get_option(&arg, &val))
+   kfence_early_init = !!val;
+   return 0;
+}
+early_param("kfence.sample_interval", parse_kfence_early_init);
+#else
+bool __ro_after_init kfence_early_init;
 #endif
 
 static int __init parse_nosmep(char *p)
-- 
2.45.2



[RFC v1 08/10] book3s64/hash: Add kfence functionality

2024-07-31 Thread Ritesh Harjani (IBM)
Now that the linear map functionality of debug_pagealloc has been made
generic, enable kfence to use this generic infrastructure.

1. Define kfence related linear map variables.
   - u8 *linear_map_kf_hash_slots;
   - unsigned long linear_map_kf_hash_count;
   - DEFINE_RAW_SPINLOCK(linear_map_kf_hash_lock);
2. The linear map size allocated in the RMA region is quite small
   (KFENCE_POOL_SIZE >> PAGE_SHIFT), which is 512 bytes by default.
3. kfence pool memory is reserved using memblock_phys_alloc(), so it
   can come from anywhere.
   (default 255 objects => ((1+255) * 2) << PAGE_SHIFT = 32MB)
4. The hash slot information for kfence memory gets added to the linear
   map in hash_linear_map_add_slot() (which also adds it for
   debug_pagealloc).

Reported-by: Pavithra Prakash 
Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/include/asm/kfence.h |   5 -
 arch/powerpc/mm/book3s64/hash_utils.c | 162 +++---
 2 files changed, 149 insertions(+), 18 deletions(-)

diff --git a/arch/powerpc/include/asm/kfence.h 
b/arch/powerpc/include/asm/kfence.h
index f3a9476a71b3..fab124ada1c7 100644
--- a/arch/powerpc/include/asm/kfence.h
+++ b/arch/powerpc/include/asm/kfence.h
@@ -10,7 +10,6 @@
 
 #include 
 #include 
-#include 
 
 #ifdef CONFIG_PPC64_ELF_ABI_V1
 #define ARCH_FUNC_PREFIX "."
@@ -26,10 +25,6 @@ static inline void disable_kfence(void)
 
 static inline bool arch_kfence_init_pool(void)
 {
-#ifdef CONFIG_PPC64
-   if (!radix_enabled())
-   return false;
-#endif
return !kfence_disabled;
 }
 #endif
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 906cd167180a..c66b9921fc7d 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -40,6 +40,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 #include 
@@ -66,6 +67,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -271,7 +273,7 @@ void hash__tlbiel_all(unsigned int action)
WARN(1, "%s called on pre-POWER7 CPU\n", __func__);
 }
 
-#if defined(CONFIG_DEBUG_PAGEALLOC)
+#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KFENCE)
 static void kernel_map_linear_page(unsigned long vaddr, unsigned long idx,
   u8 *slots, raw_spinlock_t *lock)
 {
@@ -325,11 +327,13 @@ static void kernel_unmap_linear_page(unsigned long vaddr, 
unsigned long idx,
 mmu_linear_psize,
 mmu_kernel_ssize, 0);
 }
+#endif
 
+#if defined(CONFIG_DEBUG_PAGEALLOC)
 static u8 *linear_map_hash_slots;
 static unsigned long linear_map_hash_count;
 static DEFINE_RAW_SPINLOCK(linear_map_hash_lock);
-static inline void hash_debug_pagealloc_alloc_slots(void)
+static void hash_debug_pagealloc_alloc_slots(void)
 {
unsigned long max_hash_count = (ppc64_rma_size / 4) >> PAGE_SHIFT;
 
@@ -352,7 +356,8 @@ static inline void hash_debug_pagealloc_alloc_slots(void)
  __func__, linear_map_hash_count, &ppc64_rma_size);
 }
 
-static inline void hash_debug_pagealloc_add_slot(phys_addr_t paddr, int slot)
+static inline void hash_debug_pagealloc_add_slot(phys_addr_t paddr,
+   int slot)
 {
if (!debug_pagealloc_enabled() || !linear_map_hash_count)
return;
@@ -386,20 +391,148 @@ static int hash_debug_pagealloc_map_pages(struct page 
*page, int numpages,
return 0;
 }
 
-int hash__kernel_map_pages(struct page *page, int numpages, int enable)
+#else /* CONFIG_DEBUG_PAGEALLOC */
+static inline void hash_debug_pagealloc_alloc_slots(void) {}
+static inline void hash_debug_pagealloc_add_slot(phys_addr_t paddr, int slot) 
{}
+static int __maybe_unused
+hash_debug_pagealloc_map_pages(struct page *page, int numpages, int enable)
 {
-   return hash_debug_pagealloc_map_pages(page, numpages, enable);
+   return 0;
 }
+#endif /* CONFIG_DEBUG_PAGEALLOC */
 
-#else /* CONFIG_DEBUG_PAGEALLOC */
-int hash__kernel_map_pages(struct page *page, int numpages,
-int enable)
+#if defined(CONFIG_KFENCE)
+static u8 *linear_map_kf_hash_slots;
+static unsigned long linear_map_kf_hash_count;
+static DEFINE_RAW_SPINLOCK(linear_map_kf_hash_lock);
+
+static phys_addr_t kfence_pool;
+
+static inline void hash_kfence_alloc_pool(void)
+{
+
+   // allocate linear map for kfence within RMA region
+   linear_map_kf_hash_count = KFENCE_POOL_SIZE >> PAGE_SHIFT;
+   linear_map_kf_hash_slots = memblock_alloc_try_nid(
+   linear_map_kf_hash_count, 1,
+   MEMBLOCK_LOW_LIMIT, ppc64_rma_size,
+   NUMA_NO_NODE);
+   if (!linear_map_kf_hash_slots) {
+   pr_err("%s: memblock for linear map (%lu) failed\n", __func__,
+ 

[RFC v1 07/10] book3s64/hash: Disable debug_pagealloc if it requires more memory

2024-07-31 Thread Ritesh Harjani (IBM)
Limit the size of the linear map allocated in the RMA region to
ppc64_rma_size / 4. If debug_pagealloc requires more memory than that,
then do not allocate any memory and disable debug_pagealloc.

Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 15 ++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 3f3eaf0a254b..906cd167180a 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -331,9 +331,19 @@ static unsigned long linear_map_hash_count;
 static DEFINE_RAW_SPINLOCK(linear_map_hash_lock);
 static inline void hash_debug_pagealloc_alloc_slots(void)
 {
+   unsigned long max_hash_count = (ppc64_rma_size / 4) >> PAGE_SHIFT;
+
if (!debug_pagealloc_enabled())
return;
linear_map_hash_count = memblock_end_of_DRAM() >> PAGE_SHIFT;
+   if (unlikely(linear_map_hash_count > max_hash_count)) {
+   pr_info("linear map size (%llu) greater than 4 times RMA region 
(%llu). Disabling debug_pagealloc\n",
+   ((u64)linear_map_hash_count << PAGE_SHIFT),
+   ppc64_rma_size);
+   linear_map_hash_count = 0;
+   return;
+   }
+
linear_map_hash_slots = memblock_alloc_try_nid(
linear_map_hash_count, 1, MEMBLOCK_LOW_LIMIT,
ppc64_rma_size, NUMA_NO_NODE);
@@ -344,7 +354,7 @@ static inline void hash_debug_pagealloc_alloc_slots(void)
 
 static inline void hash_debug_pagealloc_add_slot(phys_addr_t paddr, int slot)
 {
-   if (!debug_pagealloc_enabled())
+   if (!debug_pagealloc_enabled() || !linear_map_hash_count)
return;
if ((paddr >> PAGE_SHIFT) < linear_map_hash_count)
linear_map_hash_slots[paddr >> PAGE_SHIFT] = slot | 0x80;
@@ -356,6 +366,9 @@ static int hash_debug_pagealloc_map_pages(struct page 
*page, int numpages,
unsigned long flags, vaddr, lmi;
int i;
 
+   if (!debug_pagealloc_enabled() || !linear_map_hash_count)
+   return 0;
+
local_irq_save(flags);
for (i = 0; i < numpages; i++, page++) {
vaddr = (unsigned long)page_address(page);
-- 
2.45.2



[RFC v1 06/10] book3s64/hash: Make kernel_map_linear_page() generic

2024-07-31 Thread Ritesh Harjani (IBM)
Currently the kernel_map_linear_page() function assumes it is working on
the linear_map_hash_slots array. But since later patches need a separate
linear map array for kfence, make kernel_map_linear_page() take the
linear map array and its lock as function arguments.

This is needed to separate out kfence from debug_pagealloc
infrastructure.

Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 47 ++-
 1 file changed, 25 insertions(+), 22 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index b96bbb0025fb..3f3eaf0a254b 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -272,11 +272,8 @@ void hash__tlbiel_all(unsigned int action)
 }
 
 #if defined(CONFIG_DEBUG_PAGEALLOC)
-static u8 *linear_map_hash_slots;
-static unsigned long linear_map_hash_count;
-static DEFINE_RAW_SPINLOCK(linear_map_hash_lock);
-
-static void kernel_map_linear_page(unsigned long vaddr, unsigned long lmi)
+static void kernel_map_linear_page(unsigned long vaddr, unsigned long idx,
+  u8 *slots, raw_spinlock_t *lock)
 {
unsigned long hash;
unsigned long vsid = get_kernel_vsid(vaddr, mmu_kernel_ssize);
@@ -290,7 +287,7 @@ static void kernel_map_linear_page(unsigned long vaddr, 
unsigned long lmi)
if (!vsid)
return;
 
-   if (linear_map_hash_slots[lmi] & 0x80)
+   if (slots[idx] & 0x80)
return;
 
ret = hpte_insert_repeating(hash, vpn, __pa(vaddr), mode,
@@ -298,36 +295,40 @@ static void kernel_map_linear_page(unsigned long vaddr, 
unsigned long lmi)
mmu_linear_psize, mmu_kernel_ssize);
 
BUG_ON (ret < 0);
-   raw_spin_lock(&linear_map_hash_lock);
-   BUG_ON(linear_map_hash_slots[lmi] & 0x80);
-   linear_map_hash_slots[lmi] = ret | 0x80;
-   raw_spin_unlock(&linear_map_hash_lock);
+   raw_spin_lock(lock);
+   BUG_ON(slots[idx] & 0x80);
+   slots[idx] = ret | 0x80;
+   raw_spin_unlock(lock);
 }
 
-static void kernel_unmap_linear_page(unsigned long vaddr, unsigned long lmi)
+static void kernel_unmap_linear_page(unsigned long vaddr, unsigned long idx,
+u8 *slots, raw_spinlock_t *lock)
 {
-   unsigned long hash, hidx, slot;
+   unsigned long hash, hslot, slot;
unsigned long vsid = get_kernel_vsid(vaddr, mmu_kernel_ssize);
unsigned long vpn = hpt_vpn(vaddr, vsid, mmu_kernel_ssize);
 
hash = hpt_hash(vpn, PAGE_SHIFT, mmu_kernel_ssize);
-   raw_spin_lock(&linear_map_hash_lock);
-   if (!(linear_map_hash_slots[lmi] & 0x80)) {
-   raw_spin_unlock(&linear_map_hash_lock);
+   raw_spin_lock(lock);
+   if (!(slots[idx] & 0x80)) {
+   raw_spin_unlock(lock);
return;
}
-   hidx = linear_map_hash_slots[lmi] & 0x7f;
-   linear_map_hash_slots[lmi] = 0;
-   raw_spin_unlock(&linear_map_hash_lock);
-   if (hidx & _PTEIDX_SECONDARY)
+   hslot = slots[idx] & 0x7f;
+   slots[idx] = 0;
+   raw_spin_unlock(lock);
+   if (hslot & _PTEIDX_SECONDARY)
hash = ~hash;
slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
-   slot += hidx & _PTEIDX_GROUP_IX;
+   slot += hslot & _PTEIDX_GROUP_IX;
mmu_hash_ops.hpte_invalidate(slot, vpn, mmu_linear_psize,
 mmu_linear_psize,
 mmu_kernel_ssize, 0);
 }
 
+static u8 *linear_map_hash_slots;
+static unsigned long linear_map_hash_count;
+static DEFINE_RAW_SPINLOCK(linear_map_hash_lock);
 static inline void hash_debug_pagealloc_alloc_slots(void)
 {
if (!debug_pagealloc_enabled())
@@ -362,9 +363,11 @@ static int hash_debug_pagealloc_map_pages(struct page 
*page, int numpages,
if (lmi >= linear_map_hash_count)
continue;
if (enable)
-   kernel_map_linear_page(vaddr, lmi);
+   kernel_map_linear_page(vaddr, lmi,
+   linear_map_hash_slots, &linear_map_hash_lock);
else
-   kernel_unmap_linear_page(vaddr, lmi);
+   kernel_unmap_linear_page(vaddr, lmi,
+   linear_map_hash_slots, &linear_map_hash_lock);
}
local_irq_restore(flags);
return 0;
-- 
2.45.2



[RFC v1 05/10] book3s64/hash: Refactor hash__kernel_map_pages() function

2024-07-31 Thread Ritesh Harjani (IBM)
This refactors the hash__kernel_map_pages() function to call
hash_debug_pagealloc_map_pages(). This will be useful when we add
kfence support.

No functionality changes in this patch.

Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 9 -
 1 file changed, 8 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 6af47b996e79..b96bbb0025fb 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -349,7 +349,8 @@ static inline void 
hash_debug_pagealloc_add_slot(phys_addr_t paddr, int slot)
linear_map_hash_slots[paddr >> PAGE_SHIFT] = slot | 0x80;
 }
 
-int hash__kernel_map_pages(struct page *page, int numpages, int enable)
+static int hash_debug_pagealloc_map_pages(struct page *page, int numpages,
+ int enable)
 {
unsigned long flags, vaddr, lmi;
int i;
@@ -368,6 +369,12 @@ int hash__kernel_map_pages(struct page *page, int 
numpages, int enable)
local_irq_restore(flags);
return 0;
 }
+
+int hash__kernel_map_pages(struct page *page, int numpages, int enable)
+{
+   return hash_debug_pagealloc_map_pages(page, numpages, enable);
+}
+
 #else /* CONFIG_DEBUG_PAGEALLOC */
 int hash__kernel_map_pages(struct page *page, int numpages,
 int enable)
-- 
2.45.2



[RFC v1 04/10] book3s64/hash: Add hash_debug_pagealloc_alloc_slots() function

2024-07-31 Thread Ritesh Harjani (IBM)
This adds hash_debug_pagealloc_alloc_slots() function instead of open
coding that in htab_initialize(). This is required since we will be
separating the kfence functionality to not depend upon debug_pagealloc.

Now that everything required for debug_pagealloc is under an #ifdef
config, bring the linear_map_hash_slots and linear_map_hash_count
variables under the same config too.

Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 29 ---
 1 file changed, 17 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 47b40b9b49d6..6af47b996e79 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -123,8 +123,6 @@ EXPORT_SYMBOL_GPL(mmu_slb_size);
 #ifdef CONFIG_PPC_64K_PAGES
 int mmu_ci_restrictions;
 #endif
-static u8 *linear_map_hash_slots;
-static unsigned long linear_map_hash_count;
 struct mmu_hash_ops mmu_hash_ops;
 EXPORT_SYMBOL(mmu_hash_ops);
 
@@ -274,6 +272,8 @@ void hash__tlbiel_all(unsigned int action)
 }
 
 #if defined(CONFIG_DEBUG_PAGEALLOC)
+static u8 *linear_map_hash_slots;
+static unsigned long linear_map_hash_count;
 static DEFINE_RAW_SPINLOCK(linear_map_hash_lock);
 
 static void kernel_map_linear_page(unsigned long vaddr, unsigned long lmi)
@@ -328,6 +328,19 @@ static void kernel_unmap_linear_page(unsigned long vaddr, 
unsigned long lmi)
 mmu_kernel_ssize, 0);
 }
 
+static inline void hash_debug_pagealloc_alloc_slots(void)
+{
+   if (!debug_pagealloc_enabled())
+   return;
+   linear_map_hash_count = memblock_end_of_DRAM() >> PAGE_SHIFT;
+   linear_map_hash_slots = memblock_alloc_try_nid(
+   linear_map_hash_count, 1, MEMBLOCK_LOW_LIMIT,
+   ppc64_rma_size, NUMA_NO_NODE);
+   if (!linear_map_hash_slots)
+   panic("%s: Failed to allocate %lu bytes max_addr=%pa\n",
+ __func__, linear_map_hash_count, &ppc64_rma_size);
+}
+
 static inline void hash_debug_pagealloc_add_slot(phys_addr_t paddr, int slot)
 {
if (!debug_pagealloc_enabled())
@@ -361,6 +374,7 @@ int hash__kernel_map_pages(struct page *page, int numpages,
 {
return 0;
 }
+static inline void hash_debug_pagealloc_alloc_slots(void) {}
 static inline void hash_debug_pagealloc_add_slot(phys_addr_t paddr, int slot) 
{}
 #endif /* CONFIG_DEBUG_PAGEALLOC */
 
@@ -1223,16 +1237,7 @@ static void __init htab_initialize(void)
 
prot = pgprot_val(PAGE_KERNEL);
 
-   if (debug_pagealloc_enabled()) {
-   linear_map_hash_count = memblock_end_of_DRAM() >> PAGE_SHIFT;
-   linear_map_hash_slots = memblock_alloc_try_nid(
-   linear_map_hash_count, 1, MEMBLOCK_LOW_LIMIT,
-   ppc64_rma_size, NUMA_NO_NODE);
-   if (!linear_map_hash_slots)
-   panic("%s: Failed to allocate %lu bytes max_addr=%pa\n",
- __func__, linear_map_hash_count, &ppc64_rma_size);
-   }
-
+   hash_debug_pagealloc_alloc_slots();
/* create bolted the linear mapping in the hash table */
for_each_mem_range(i, &base, &end) {
size = end - base;
-- 
2.45.2



[RFC v1 03/10] book3s64/hash: Add hash_debug_pagealloc_add_slot() function

2024-07-31 Thread Ritesh Harjani (IBM)
This adds hash_debug_pagealloc_add_slot() function instead of open
coding that in htab_bolt_mapping(). This is required since we will be
separating kfence functionality to not depend upon debug_pagealloc.

No functionality change in this patch.

Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 13 ++---
 1 file changed, 10 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index b6ae955971bf..47b40b9b49d6 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -328,6 +328,14 @@ static void kernel_unmap_linear_page(unsigned long vaddr, 
unsigned long lmi)
 mmu_kernel_ssize, 0);
 }
 
+static inline void hash_debug_pagealloc_add_slot(phys_addr_t paddr, int slot)
+{
+   if (!debug_pagealloc_enabled())
+   return;
+   if ((paddr >> PAGE_SHIFT) < linear_map_hash_count)
+   linear_map_hash_slots[paddr >> PAGE_SHIFT] = slot | 0x80;
+}
+
 int hash__kernel_map_pages(struct page *page, int numpages, int enable)
 {
unsigned long flags, vaddr, lmi;
@@ -353,6 +361,7 @@ int hash__kernel_map_pages(struct page *page, int numpages,
 {
return 0;
 }
+static inline void hash_debug_pagealloc_add_slot(phys_addr_t paddr, int slot) 
{}
 #endif /* CONFIG_DEBUG_PAGEALLOC */
 
 /*
@@ -513,9 +522,7 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long 
vend,
break;
 
cond_resched();
-   if (debug_pagealloc_enabled() &&
-   (paddr >> PAGE_SHIFT) < linear_map_hash_count)
-   linear_map_hash_slots[paddr >> PAGE_SHIFT] = ret | 0x80;
+   hash_debug_pagealloc_add_slot(paddr, ret);
}
return ret < 0 ? ret : 0;
 }
-- 
2.45.2



[RFC v1 02/10] book3s64/hash: Refactor kernel linear map related calls

2024-07-31 Thread Ritesh Harjani (IBM)
This just brings all linear map related handling to one place instead of
having those functions scattered in the hash_utils file.
This makes it easier to review.

No functionality changes in this patch.

Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/mm/book3s64/hash_utils.c | 164 +-
 1 file changed, 82 insertions(+), 82 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 1a1b50735fa0..b6ae955971bf 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -273,6 +273,88 @@ void hash__tlbiel_all(unsigned int action)
WARN(1, "%s called on pre-POWER7 CPU\n", __func__);
 }
 
+#if defined(CONFIG_DEBUG_PAGEALLOC)
+static DEFINE_RAW_SPINLOCK(linear_map_hash_lock);
+
+static void kernel_map_linear_page(unsigned long vaddr, unsigned long lmi)
+{
+   unsigned long hash;
+   unsigned long vsid = get_kernel_vsid(vaddr, mmu_kernel_ssize);
+   unsigned long vpn = hpt_vpn(vaddr, vsid, mmu_kernel_ssize);
+   unsigned long mode = htab_convert_pte_flags(pgprot_val(PAGE_KERNEL), 
HPTE_USE_KERNEL_KEY);
+   long ret;
+
+   hash = hpt_hash(vpn, PAGE_SHIFT, mmu_kernel_ssize);
+
+   /* Don't create HPTE entries for bad address */
+   if (!vsid)
+   return;
+
+   if (linear_map_hash_slots[lmi] & 0x80)
+   return;
+
+   ret = hpte_insert_repeating(hash, vpn, __pa(vaddr), mode,
+   HPTE_V_BOLTED,
+   mmu_linear_psize, mmu_kernel_ssize);
+
+   BUG_ON (ret < 0);
+   raw_spin_lock(&linear_map_hash_lock);
+   BUG_ON(linear_map_hash_slots[lmi] & 0x80);
+   linear_map_hash_slots[lmi] = ret | 0x80;
+   raw_spin_unlock(&linear_map_hash_lock);
+}
+
+static void kernel_unmap_linear_page(unsigned long vaddr, unsigned long lmi)
+{
+   unsigned long hash, hidx, slot;
+   unsigned long vsid = get_kernel_vsid(vaddr, mmu_kernel_ssize);
+   unsigned long vpn = hpt_vpn(vaddr, vsid, mmu_kernel_ssize);
+
+   hash = hpt_hash(vpn, PAGE_SHIFT, mmu_kernel_ssize);
+   raw_spin_lock(&linear_map_hash_lock);
+   if (!(linear_map_hash_slots[lmi] & 0x80)) {
+   raw_spin_unlock(&linear_map_hash_lock);
+   return;
+   }
+   hidx = linear_map_hash_slots[lmi] & 0x7f;
+   linear_map_hash_slots[lmi] = 0;
+   raw_spin_unlock(&linear_map_hash_lock);
+   if (hidx & _PTEIDX_SECONDARY)
+   hash = ~hash;
+   slot = (hash & htab_hash_mask) * HPTES_PER_GROUP;
+   slot += hidx & _PTEIDX_GROUP_IX;
+   mmu_hash_ops.hpte_invalidate(slot, vpn, mmu_linear_psize,
+mmu_linear_psize,
+mmu_kernel_ssize, 0);
+}
+
+int hash__kernel_map_pages(struct page *page, int numpages, int enable)
+{
+   unsigned long flags, vaddr, lmi;
+   int i;
+
+   local_irq_save(flags);
+   for (i = 0; i < numpages; i++, page++) {
+   vaddr = (unsigned long)page_address(page);
+   lmi = __pa(vaddr) >> PAGE_SHIFT;
+   if (lmi >= linear_map_hash_count)
+   continue;
+   if (enable)
+   kernel_map_linear_page(vaddr, lmi);
+   else
+   kernel_unmap_linear_page(vaddr, lmi);
+   }
+   local_irq_restore(flags);
+   return 0;
+}
+#else /* CONFIG_DEBUG_PAGEALLOC */
+int hash__kernel_map_pages(struct page *page, int numpages,
+int enable)
+{
+   return 0;
+}
+#endif /* CONFIG_DEBUG_PAGEALLOC */
+
 /*
  * 'R' and 'C' update notes:
  *  - Under pHyp or KVM, the updatepp path will not set C, thus it *will*
@@ -2117,88 +2199,6 @@ void hpt_do_stress(unsigned long ea, unsigned long 
hpte_group)
}
 }
 
-#if defined(CONFIG_DEBUG_PAGEALLOC)
-static DEFINE_RAW_SPINLOCK(linear_map_hash_lock);
-
-static void kernel_map_linear_page(unsigned long vaddr, unsigned long lmi)
-{
-   unsigned long hash;
-   unsigned long vsid = get_kernel_vsid(vaddr, mmu_kernel_ssize);
-   unsigned long vpn = hpt_vpn(vaddr, vsid, mmu_kernel_ssize);
-   unsigned long mode = htab_convert_pte_flags(pgprot_val(PAGE_KERNEL), 
HPTE_USE_KERNEL_KEY);
-   long ret;
-
-   hash = hpt_hash(vpn, PAGE_SHIFT, mmu_kernel_ssize);
-
-   /* Don't create HPTE entries for bad address */
-   if (!vsid)
-   return;
-
-   if (linear_map_hash_slots[lmi] & 0x80)
-   return;
-
-   ret = hpte_insert_repeating(hash, vpn, __pa(vaddr), mode,
-   HPTE_V_BOLTED,
-   mmu_linear_psize, mmu_kernel_ssize);
-
-   BUG_ON (ret < 0);
-   raw_spin_lock(&linear_map_hash_lock);
-   BUG_ON(linear_map_hash_

[RFC v1 01/10] book3s64/hash: Remove kfence support temporarily

2024-07-31 Thread Ritesh Harjani (IBM)
Kfence on book3s Hash on pseries is anyway broken. It fails to boot
due to the RMA size limitation. That is because kfence with Hash uses
the debug_pagealloc infrastructure. debug_pagealloc allocates a linear
map for the entire DRAM size instead of just the kfence relevant objects.
This means for 16TB of DRAM it will require (16TB >> PAGE_SHIFT),
which is 256MB, i.e. half of the RMA region on P8.
The crash kernel reserves 256MB and we also need 2048 * 16KB * 3 for
emergency stacks and some more for paca allocations.
That means there is not enough memory for reserving the full linear map
in the RMA region, if the DRAM size is too big (>=16TB)
(the issue is seen above 8TB with a 256MB crash kernel reservation).
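As a rough worked example with 64K pages (just restating the numbers above):

  linear map slots : 16TB >> 16 = 256M entries x 1 byte = 256MB
  crash kernel     :                                      256MB
  emergency stacks : 2048 * 16KB * 3                    =  96MB

which already exceeds the 512MB RMA on P8 before pacas and other early
allocations are even counted.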

Now kfence does not require a linear memory map for the entire DRAM.
It only needs one for the kfence objects. So this patch temporarily
removes the kfence functionality, since the debug_pagealloc code needs
some refactoring first. We will bring back kfence support on Hash in
later patches.

Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/include/asm/kfence.h |  5 +
 arch/powerpc/mm/book3s64/hash_utils.c | 16 +++-
 2 files changed, 16 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/include/asm/kfence.h 
b/arch/powerpc/include/asm/kfence.h
index fab124ada1c7..f3a9476a71b3 100644
--- a/arch/powerpc/include/asm/kfence.h
+++ b/arch/powerpc/include/asm/kfence.h
@@ -10,6 +10,7 @@
 
 #include 
 #include 
+#include 
 
 #ifdef CONFIG_PPC64_ELF_ABI_V1
 #define ARCH_FUNC_PREFIX "."
@@ -25,6 +26,10 @@ static inline void disable_kfence(void)
 
 static inline bool arch_kfence_init_pool(void)
 {
+#ifdef CONFIG_PPC64
+   if (!radix_enabled())
+   return false;
+#endif
return !kfence_disabled;
 }
 #endif
diff --git a/arch/powerpc/mm/book3s64/hash_utils.c 
b/arch/powerpc/mm/book3s64/hash_utils.c
index 01c3b4b65241..1a1b50735fa0 100644
--- a/arch/powerpc/mm/book3s64/hash_utils.c
+++ b/arch/powerpc/mm/book3s64/hash_utils.c
@@ -431,7 +431,7 @@ int htab_bolt_mapping(unsigned long vstart, unsigned long 
vend,
break;
 
cond_resched();
-   if (debug_pagealloc_enabled_or_kfence() &&
+   if (debug_pagealloc_enabled() &&
(paddr >> PAGE_SHIFT) < linear_map_hash_count)
linear_map_hash_slots[paddr >> PAGE_SHIFT] = ret | 0x80;
}
@@ -814,7 +814,7 @@ static void __init htab_init_page_sizes(void)
bool aligned = true;
init_hpte_page_sizes();
 
-   if (!debug_pagealloc_enabled_or_kfence()) {
+   if (!debug_pagealloc_enabled()) {
/*
 * Pick a size for the linear mapping. Currently, we only
 * support 16M, 1M and 4K which is the default
@@ -1134,7 +1134,7 @@ static void __init htab_initialize(void)
 
prot = pgprot_val(PAGE_KERNEL);
 
-   if (debug_pagealloc_enabled_or_kfence()) {
+   if (debug_pagealloc_enabled()) {
linear_map_hash_count = memblock_end_of_DRAM() >> PAGE_SHIFT;
linear_map_hash_slots = memblock_alloc_try_nid(
linear_map_hash_count, 1, MEMBLOCK_LOW_LIMIT,
@@ -2117,7 +2117,7 @@ void hpt_do_stress(unsigned long ea, unsigned long 
hpte_group)
}
 }
 
-#if defined(CONFIG_DEBUG_PAGEALLOC) || defined(CONFIG_KFENCE)
+#if defined(CONFIG_DEBUG_PAGEALLOC)
 static DEFINE_RAW_SPINLOCK(linear_map_hash_lock);
 
 static void kernel_map_linear_page(unsigned long vaddr, unsigned long lmi)
@@ -2191,7 +2191,13 @@ int hash__kernel_map_pages(struct page *page, int 
numpages, int enable)
local_irq_restore(flags);
return 0;
 }
-#endif /* CONFIG_DEBUG_PAGEALLOC || CONFIG_KFENCE */
+#else /* CONFIG_DEBUG_PAGEALLOC */
+int hash__kernel_map_pages(struct page *page, int numpages,
+int enable)
+{
+   return 0;
+}
+#endif /* CONFIG_DEBUG_PAGEALLOC */
 
 void hash__setup_initial_memory_limit(phys_addr_t first_memblock_base,
phys_addr_t first_memblock_size)
-- 
2.45.2



[RFC v1 00/10] book3s64/hash: Improve kfence support

2024-07-31 Thread Ritesh Harjani (IBM)
Kfence on book3s64 Hash is broken. Kfence depends upon debug_pagealloc
infrastructure on Hash. debug_pagealloc allocates a linear map based on the size
of the DRAM i.e. 1 byte for every 64k page. That means for a 16TB DRAM, it will
need 256MB of memory for the linear map. Memory for the linear map on pseries
comes from the RMA region, which has a size limitation. On P8 the RMA is 512MB,
in which we also fit the crash kernel at 256MB, paca allocations and emergency stacks.
That means there is not enough memory in the RMA region for the linear map
based on DRAM size (required by debug_pagealloc).

Now kfence only requires memory for its kfence objects. kfence by default
requires only (255 + 1) * 2 pages, i.e. 32 MB with a 64k page size.
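For example, a quick sketch of that math with the default
CONFIG_KFENCE_NUM_OBJECTS=255:

  KFENCE_POOL_SIZE = (255 + 1) * 2 pages = 512 pages
                   = 512 * 64KB          = 32MB

which fits comfortably within the RMA region, unlike the full linear map.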

This patch series removes the direct dependency of kfence on debug_pagealloc
infrastructure. We separate the Hash kernel linear map functions to take
the linear map array as a parameter so that they can support debug_pagealloc
and kfence individually. That means we no longer need to keep a linear map
region of size DRAM_SIZE >> PAGE_SHIFT for kfence.

Hence, the current patch series solves the boot failure problem when kfence is
enabled by reducing the memory it requires for the linear map within the RMA region.

On radix we don't have this problem because there is no SLB and no RMA region
size limitation.

Testing:

The patch series is still undergoing some testing. However, given that it's in
good shape, I wanted to send out for review.
Note: It passes kfence kunit tests.
  
  [   48.715649][T1] # kfence: pass:23 fail:0 skip:2 total:25
  [   48.716697][T1] # Totals: pass:23 fail:0 skip:2 total:25
  [   48.717842][T1] ok 1 kfence


TODOs: (for future patches)
===
However, there is still another problem which IMO makes kfence not suitable to
be enabled by default on production kernels with the Hash MMU: when kfence is
enabled, the kernel linear map uses PAGE_SIZE mappings rather than 16MB
mappings as in the original case. Correct me if I am wrong, but theoretically
at least this could cause TLB pressure in certain cases, which makes it not
really suitable to be enabled by default on production kernels on Hash.

This is because on P8 book3s64, we don't support mapping multiple pagesizes
(MPSS) within the kernel linear map segment. Is this understanding correct?


Ritesh Harjani (IBM) (10):
  book3s64/hash: Remove kfence support temporarily
  book3s64/hash: Refactor kernel linear map related calls
  book3s64/hash: Add hash_debug_pagealloc_add_slot() function
  book3s64/hash: Add hash_debug_pagealloc_alloc_slots() function
  book3s64/hash: Refactor hash__kernel_map_pages() function
  book3s64/hash: Make kernel_map_linear_page() generic
  book3s64/hash: Disable debug_pagealloc if it requires more memory
  book3s64/hash: Add kfence functionality
  book3s64/radix: Refactoring common kfence related functions
  book3s64/hash: Disable kfence if not early init

 arch/powerpc/include/asm/kfence.h|   2 +
 arch/powerpc/mm/book3s64/hash_utils.c| 364 +--
 arch/powerpc/mm/book3s64/radix_pgtable.c |  12 -
 arch/powerpc/mm/init-common.c|  12 +
 4 files changed, 286 insertions(+), 104 deletions(-)

--
2.45.2



Re: [PATCH 3/3] powerpc: Document details on H_HTM hcall

2024-06-22 Thread IBM
Madhavan Srinivasan  writes:

> Add documentation to 'papr_hcalls.rst' describing the
> input, output and return values of the H_HTM hcall as
> per the internal specification.
>
> Signed-off-by: Madhavan Srinivasan 
> ---
>  Documentation/arch/powerpc/papr_hcalls.rst | 11 +++
>  1 file changed, 11 insertions(+)
>
> diff --git a/Documentation/arch/powerpc/papr_hcalls.rst 
> b/Documentation/arch/powerpc/papr_hcalls.rst
> index 80d2c0aadab5..805e1cb9bab9 100644
> --- a/Documentation/arch/powerpc/papr_hcalls.rst
> +++ b/Documentation/arch/powerpc/papr_hcalls.rst
> @@ -289,6 +289,17 @@ to be issued multiple times in order to be completely 
> serviced. The
>  subsequent hcalls to the hypervisor until the hcall is completely serviced
>  at which point H_SUCCESS or other error is returned by the hypervisor.
>  
> +**H_HTM**
> +
> +| Input: flags, target, operation (op), op-param1, op-param2, op-param3
> +| Out: *dumphtmbufferdata*
> +| Return Value: *H_Success,H_Busy,H_LongBusyOrder,H_Partial,H_Parameter,
> +  H_P2,H_P3,H_P4,H_P5,H_P6,H_State,H_Not_Available,H_Authority*
> +
> +H_HTM supports setup, configuration, control and dumping of Hardware Trace
> +Macro (HTM) function and its data. HTM buffer stores tracing data for 
> functions
> +like core instruction, core LLAT and nest.
> +

Minor nit: maybe also document the set of debugfs commands used to collect the
trace, and an example trace log, if that is not confidential?

>  References
>  ==
>  .. [1] "Power Architecture Platform Reference"
> -- 
> 2.45.2


Re: [PATCH 2/3] powerpc/pseries: Export hardware trace macro dump via debugfs

2024-06-22 Thread IBM


This is a generic review and I haven't looked into the PAPR spec for
htmdump hcall and it's interface.

Madhavan Srinivasan  writes:

> This patch adds debugfs interface to export Hardware Trace Macro (HTM)
> function data in a LPAR. New hypervisor call "H_HTM" has been
> defined to setup, configure, control and dump the HTM data.
> This patch supports only dumping of HTM data in a LPAR.
> New debugfs folder called "htmdump" has been added under
> /sys/kernel/debug/arch path which contains files need to
> pass required parameters for the H_HTM dump function. New Kconfig
> option called "CONFIG_HTMDUMP" has been in platform/pseries for the same.
>
> With patch series applied and booted, list of files in debugfs path
>
> # pwd
> /sys/kernel/debug/powerpc/htmdump
> # ls
> coreindexonchip  htmtype  nodalchipindex  nodeindex  trace
>
> Signed-off-by: Madhavan Srinivasan 
> ---
>  arch/powerpc/platforms/pseries/Kconfig   |   8 ++
>  arch/powerpc/platforms/pseries/Makefile  |   1 +
>  arch/powerpc/platforms/pseries/htmdump.c | 130 +++
>  3 files changed, 139 insertions(+)
>  create mode 100644 arch/powerpc/platforms/pseries/htmdump.c
>
> diff --git a/arch/powerpc/platforms/pseries/Kconfig 
> b/arch/powerpc/platforms/pseries/Kconfig
> index afc0f6a61337..46c0ea605e33 100644
> --- a/arch/powerpc/platforms/pseries/Kconfig
> +++ b/arch/powerpc/platforms/pseries/Kconfig
> @@ -128,6 +128,14 @@ config CMM
> will be reused for other LPARs. The interface allows firmware to
> balance memory across many LPARs.
>
> +config HTMDUMP
> + tristate "PHYP HTM data dumper"

Not sure if we can make machine_device_initcall() work as a tristate.
Did we try compiling it as a module?

If we would like to keep this as a module - then why not use a module_init
call and then make it depend upon...

depends on PPC_PSERIES && DEBUG_FS (??)

> + default y

and since this is mostly a debug trace facility, we need not enable
it by default, right?

> + help
> +   Select this option, if you want to enable the kernel debugfs
> +   interface to dump the Hardware Trace Macro (HTM) function data
> +   in the LPAR.
> +
>  config HV_PERF_CTRS
>   bool "Hypervisor supplied PMU events (24x7 & GPCI)"
>   default y
> diff --git a/arch/powerpc/platforms/pseries/Makefile 
> b/arch/powerpc/platforms/pseries/Makefile
> index 7bf506f6b8c8..3f3e3492e436 100644
> --- a/arch/powerpc/platforms/pseries/Makefile
> +++ b/arch/powerpc/platforms/pseries/Makefile
> @@ -19,6 +19,7 @@ obj-$(CONFIG_HVC_CONSOLE)   += hvconsole.o
>  obj-$(CONFIG_HVCS)   += hvcserver.o
>  obj-$(CONFIG_HCALL_STATS)+= hvCall_inst.o
>  obj-$(CONFIG_CMM)+= cmm.o
> +obj-$(CONFIG_HTMDUMP)+= htmdump.o
>  obj-$(CONFIG_IO_EVENT_IRQ)   += io_event_irq.o
>  obj-$(CONFIG_LPARCFG)+= lparcfg.o
>  obj-$(CONFIG_IBMVIO) += vio.o
> diff --git a/arch/powerpc/platforms/pseries/htmdump.c 
> b/arch/powerpc/platforms/pseries/htmdump.c
> new file mode 100644
> index ..540cdb7e069c
> --- /dev/null
> +++ b/arch/powerpc/platforms/pseries/htmdump.c
> @@ -0,0 +1,130 @@
> +// SPDX-License-Identifier: GPL-2.0-or-later
> +/*
> + * Copyright (C) IBM Corporation, 2024
> + */
> +
> +#define pr_fmt(fmt) "htmdump: " fmt
> +
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 
> +#include 

Do we need all of the above?
e.g. slab, memory_hotplug etc are not needed IMO.

Maybe only?

#include 
#include 
#include 
#include 

#include 
#include 

(module.h depending upon if we make it module_init())
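If we do go the module route, I was thinking of something roughly like the
below (untested sketch only; htmdump_debugfs_dir and the init/exit names are
made up here, and it assumes arch_debugfs_dir is usable from this code):

static struct dentry *htmdump_debugfs_dir;

static int __init htmdump_module_init(void)
{
	/* nothing to do unless we are running as a pseries LPAR */
	if (!firmware_has_feature(FW_FEATURE_LPAR))
		return -ENODEV;

	htmdump_debugfs_dir = debugfs_create_dir("htmdump", arch_debugfs_dir);

	/* create the htmtype/nodeindex/... parameter files and the trace file here */

	return 0;
}

static void __exit htmdump_module_exit(void)
{
	debugfs_remove_recursive(htmdump_debugfs_dir);
}

module_init(htmdump_module_init);
module_exit(htmdump_module_exit);
MODULE_LICENSE("GPL");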


> +
> +/* This enables us to keep track of the memory removed from each node. */
> +struct htmdump_entry {
> + void *buf;
> + struct dentry *dir;
> + char name[16];
> +};
> +
> +static u32 nodeindex = 0;
> +static u32 nodalchipindex = 0;
> +static u32 coreindexonchip = 0;
> +static u32 htmtype = 0;
> +
> +#define BUFFER_SIZE PAGE_SIZE
> +
> +static ssize_t htmdump_read(struct file *filp, char __user *ubuf,
> +  size_t count, loff_t *ppos)
> +{
> + struct htmdump_entry *ent = filp->private_data;
> + unsigned long page, read_size, available;
> + loff_t offset;
> + long rc;
> +
> + page = ALIGN_DOWN(*ppos, BUFFER_SIZE);
> + offset = (*ppos) % BUFFER_SIZE;
> +
> + rc = htm_get_dump_hardware(nodeindex, nodalchipindex, coreindexonchip,
> +htmtype, virt_to_phys(ent->buf), 
> B

Re: [PATCH 2/2] radix/kfence: support late __kfence_pool allocation

2024-05-01 Thread IBM
Hari Bathini  writes:

> With commit b33f778bba5ef ("kfence: alloc kfence_pool after system
> startup"), KFENCE pool can be allocated after system startup via the
> page allocator. This can lead to problems as all memory is not mapped
> at page granularity anymore with CONFIG_KFENCE. Address this by direct
> mapping all memory at PMD level and split the mapping for PMD pages
> that overlap with __kfence_pool to page level granularity if and when
> __kfence_pool is allocated after system startup.
>
> Signed-off-by: Hari Bathini 
> ---
>  arch/powerpc/include/asm/book3s/64/radix.h |  2 +
>  arch/powerpc/include/asm/kfence.h  | 14 +-
>  arch/powerpc/mm/book3s64/radix_pgtable.c   | 50 +-
>  3 files changed, 64 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/radix.h 
> b/arch/powerpc/include/asm/book3s/64/radix.h
> index 8f55ff74bb68..0423ddbcf73c 100644
> --- a/arch/powerpc/include/asm/book3s/64/radix.h
> +++ b/arch/powerpc/include/asm/book3s/64/radix.h
> @@ -340,6 +340,8 @@ extern void radix__vmemmap_remove_mapping(unsigned long 
> start,
>  extern int radix__map_kernel_page(unsigned long ea, unsigned long pa,
>pgprot_t flags, unsigned int psz);
>  
> +extern bool radix_kfence_init_pool(void);
> +
>  static inline unsigned long radix__get_tree_size(void)
>  {
>   unsigned long rts_field;
> diff --git a/arch/powerpc/include/asm/kfence.h 
> b/arch/powerpc/include/asm/kfence.h
> index 18ec2b06ba1e..c5d2fb2f9ecb 100644
> --- a/arch/powerpc/include/asm/kfence.h
> +++ b/arch/powerpc/include/asm/kfence.h
> @@ -18,12 +18,24 @@
>  
>  #ifdef CONFIG_KFENCE
>  extern bool kfence_early_init;
> -#endif
> +
> +static inline bool kfence_alloc_pool_late(void)
> +{
> + return !kfence_early_init;
> +}

Minor nit, but do we need kfence_alloc_pool_late()?
The function name looks confusing. Can we not just use
!kfence_early_init? If not then maybe bool kfence_late_init?

>  
>  static inline bool arch_kfence_init_pool(void)
>  {
> +#ifdef CONFIG_PPC_BOOK3S_64
> + if (radix_enabled())
> + return radix_kfence_init_pool();

Can we directly check...
if (radix_enabled() && !kfence_early_init)
... instead of embedding the check inside radix_kfence_late_init_pool()

> +#endif
> +
>   return true;
>  }
> +#else
> +static inline bool kfence_alloc_pool_late(void) { return false; }
> +#endif
>  
>  #ifdef CONFIG_PPC64
>  static inline bool kfence_protect_page(unsigned long addr, bool protect)
> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
> b/arch/powerpc/mm/book3s64/radix_pgtable.c
> index fccbf92f279b..f4374e3e31e1 100644
> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
> @@ -253,6 +253,53 @@ void radix__mark_initmem_nx(void)
>  }
>  #endif /* CONFIG_STRICT_KERNEL_RWX */
>  
> +#ifdef CONFIG_KFENCE
> +static inline int radix_split_pmd_page(pmd_t *pmd, unsigned long addr)
> +{
> + pte_t *pte = pte_alloc_one_kernel(&init_mm);
> + unsigned long pfn = PFN_DOWN(__pa(addr));

Minor nit. Since addr will always be page aligned, maybe PHYS_PFN() is better
suited, although it does not really matter.

> + int i;
> +
> + if (!pte)
> + return -ENOMEM;
> +
> + for (i = 0; i < PTRS_PER_PTE; i++) {
> + __set_pte_at(&init_mm, addr, pte + i, pfn_pte(pfn + i, 
> PAGE_KERNEL), 0);
> + asm volatile("ptesync": : :"memory");
> + }

Maybe add a comment above the loop on why __set_pte_at() is OK for late
kfence init, and why pte_update() is not needed? [1]

[1]: https://lore.kernel.org/linuxppc-dev/87y318wp9r@linux.ibm.com/
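Something along these lines, assuming my understanding of why it is safe is
right (i.e. that the new PTE table is not yet reachable from the page tables
at this point):

	/*
	 * The freshly allocated PTE table is not yet visible to any
	 * other context; the PMD is only switched over to it below in
	 * pmd_populate_kernel(). So plain __set_pte_at() is enough and
	 * no atomic pte_update() style sequence is required.
	 */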


> + pmd_populate_kernel(&init_mm, pmd, pte);
> +
> + flush_tlb_kernel_range(addr, addr + PMD_SIZE);
> + return 0;
> +}
> +
> +bool radix_kfence_init_pool(void)
> +{
> + unsigned int page_psize, pmd_psize;
> + unsigned long addr;
> + pmd_t *pmd;
> +
> + if (!kfence_alloc_pool_late())
> + return true;
> +
> + page_psize = shift_to_mmu_psize(PAGE_SHIFT);
> + pmd_psize = shift_to_mmu_psize(PMD_SHIFT);
> + for (addr = (unsigned long)__kfence_pool; is_kfence_address((void 
> *)addr);
> +  addr += PAGE_SIZE) {
> + pmd = pmd_off_k(addr);
> +
> + if (pmd_leaf(*pmd)) {
> + if (radix_split_pmd_page(pmd, addr & PMD_MASK))
> + return false;
> + update_page_count(pmd_psize, -1);
> + update_page_count(page_psize, PTRS_PER_PTE);
> + }
> + }
> +
> + return true;
> +}
> +#endif
> +
>  static inline void __meminit
>  print_mapping(unsigned long start, unsigned long end, unsigned long size, 
> bool exec)
>  {
> @@ -391,7 +438,8 @@ static void __init radix_init_pgtable(void)
>   continue;
>   }
>  
> - WARN_ON(create_physical_mapping(start, end, -1, PAGE_KERNEL, 
> ~0UL));
> + WARN_ON

Re: [PATCH 1/2] radix/kfence: map __kfence_pool at page granularity

2024-04-30 Thread IBM
Hari Bathini  writes:

> When KFENCE is enabled, total system memory is mapped at page level
> granularity. But in radix MMU mode, ~3GB additional memory is needed
> to map 100GB of system memory at page level granularity when compared
> to using 2MB direct mapping. This is not desired considering KFENCE is
> designed to be enabled in production kernels [1]. Also, mapping memory
> allocated for KFENCE pool at page granularity seems sufficient enough
> to enable KFENCE support. So, allocate __kfence_pool during bootup and
> map it at page granularity instead of mapping all system memory at
> page granularity.
>
> Without patch:
> # cat /proc/meminfo
> MemTotal:   101201920 kB
>
> With patch:
> # cat /proc/meminfo
> MemTotal:   104483904 kB
>
> All kfence_test.c testcases passed with this patch.
>
> [1] https://lore.kernel.org/all/20201103175841.3495947-2-el...@google.com/
>
> Signed-off-by: Hari Bathini 
> ---
>  arch/powerpc/include/asm/kfence.h|  5 
>  arch/powerpc/mm/book3s64/radix_pgtable.c | 34 ++--
>  arch/powerpc/mm/init_64.c| 14 ++

New at this. But the patch looked interesting, hence my review comments.

>  3 files changed, 45 insertions(+), 8 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kfence.h 
> b/arch/powerpc/include/asm/kfence.h
> index 424ceef82ae6..18ec2b06ba1e 100644
> --- a/arch/powerpc/include/asm/kfence.h
> +++ b/arch/powerpc/include/asm/kfence.h
> @@ -8,6 +8,7 @@
>  #ifndef __ASM_POWERPC_KFENCE_H
>  #define __ASM_POWERPC_KFENCE_H
>  
> +#include 
>  #include 
>  #include 
>  
> @@ -15,6 +16,10 @@
>  #define ARCH_FUNC_PREFIX "."
>  #endif
>  
> +#ifdef CONFIG_KFENCE
> +extern bool kfence_early_init;
> +#endif
> +
>  static inline bool arch_kfence_init_pool(void)
>  {
>   return true;

Shouldn't we return false for !kfence_early_init?
Because otherwise, this patch may break the late init case which your
next patch is fixing, and maybe git bisect will break?
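i.e. until the next patch adds the late-init handling, maybe just (sketch):

static inline bool arch_kfence_init_pool(void)
{
	/* late pool allocation is only handled by the next patch */
	return kfence_early_init;
}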


> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
> b/arch/powerpc/mm/book3s64/radix_pgtable.c
> index 15e88f1439ec..fccbf92f279b 100644
> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
> @@ -31,6 +31,7 @@
>  #include 
>  #include 
>  #include 
> +#include 
>  
>  #include 
>  
> @@ -291,9 +292,8 @@ static unsigned long next_boundary(unsigned long addr, 
> unsigned long end)
>   return end;
>  }
>  
> -static int __meminit create_physical_mapping(unsigned long start,
> -  unsigned long end,
> -  int nid, pgprot_t _prot)
> +static int __meminit create_physical_mapping(unsigned long start, unsigned 
> long end, int nid,
> +  pgprot_t _prot, unsigned long 
> mapping_sz_limit)

lines over 80 chars.

>  {
>   unsigned long vaddr, addr, mapping_size = 0;
>   bool prev_exec, exec = false;
> @@ -301,7 +301,10 @@ static int __meminit create_physical_mapping(unsigned 
> long start,
>   int psize;
>   unsigned long max_mapping_size = memory_block_size;
>  
> - if (debug_pagealloc_enabled_or_kfence())
> + if (mapping_sz_limit < max_mapping_size)
> + max_mapping_size = mapping_sz_limit;
> +
> + if (debug_pagealloc_enabled())
>   max_mapping_size = PAGE_SIZE;
>  
>   start = ALIGN(start, PAGE_SIZE);
> @@ -358,6 +361,7 @@ static int __meminit create_physical_mapping(unsigned 
> long start,
>  
>  static void __init radix_init_pgtable(void)
>  {
> + phys_addr_t kfence_pool __maybe_unused;
>   unsigned long rts_field;
>   phys_addr_t start, end;
>   u64 i;
> @@ -365,6 +369,13 @@ static void __init radix_init_pgtable(void)
>   /* We don't support slb for radix */
>   slb_set_size(0);
>  
> +#ifdef CONFIG_KFENCE
> + if (kfence_early_init) {
> + kfence_pool = memblock_phys_alloc(KFENCE_POOL_SIZE, PAGE_SIZE);

What if memblock_phys_alloc() failed? error handling?
> + memblock_mark_nomap(kfence_pool, KFENCE_POOL_SIZE);
> + }
> +#endif
> +

Instead of #ifdef CONFIG_KFENCE in the function,
maybe we can define radix_kfence_alloc_pool()? Then we won't need
__maybe_unused too.
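i.e. maybe something like the below (untested sketch, the helper name is
made up):

#ifdef CONFIG_KFENCE
static phys_addr_t __init radix_kfence_alloc_pool(void)
{
	phys_addr_t kfence_pool;

	if (!kfence_early_init)
		return 0;

	kfence_pool = memblock_phys_alloc(KFENCE_POOL_SIZE, PAGE_SIZE);
	if (!kfence_pool)
		return 0;

	memblock_mark_nomap(kfence_pool, KFENCE_POOL_SIZE);
	return kfence_pool;
}
#else
static phys_addr_t __init radix_kfence_alloc_pool(void) { return 0; }
#endif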

>   /*
>* Create the linear mapping
>*/
> @@ -380,10 +391,18 @@ static void __init radix_init_pgtable(void)
>   continue;
>   }
>  
> - WARN_ON(create_physical_mapping(start, end,
> - -1, PAGE_KERNEL));
> + WARN_ON(create_physical_mapping(start, end, -1, PAGE_KERNEL, 
> ~0UL));
>   }
>  
> +#ifdef CONFIG_KFENCE
> + if (kfence_early_init) {
> + create_physical_mapping(kfence_pool, kfence_pool + 
> KFENCE_POOL_SIZE, -1,
> + PAGE_KERNEL, PAGE_SIZE);

Even this can return an error. Maybe WARN_ON_ONCE()? Or disable kfence
on error?


[PATCH] powerpc/ptdump: Fix walk_vmemmap to also print first vmemmap entry

2024-04-17 Thread Ritesh Harjani (IBM)
walk_vmemmap() was skipping the first vmemmap entry pointed to by the
vmemmap_list pointer itself. This patch fixes that.

With this we should see the vmemmap entry at 0xc00c for hash
which wasn't getting printed on doing

"cat /sys/kernel/debug/kernel_hash_pagetable"

Signed-off-by: Ritesh Harjani (IBM) 
---
 arch/powerpc/mm/ptdump/hashpagetable.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/ptdump/hashpagetable.c 
b/arch/powerpc/mm/ptdump/hashpagetable.c
index 9a601587836b..a6baa6166d94 100644
--- a/arch/powerpc/mm/ptdump/hashpagetable.c
+++ b/arch/powerpc/mm/ptdump/hashpagetable.c
@@ -491,7 +491,7 @@ static void walk_vmemmap(struct pg_state *st)
 * Traverse the vmemmaped memory and dump pages that are in the hash
 * pagetable.
 */
-   while (ptr->list) {
+   while (ptr) {
hpte_find(st, ptr->virt_addr, mmu_vmemmap_psize);
ptr = ptr->list;
}
--
2.44.0



[PATCH 3/3] powerpc/mm: Update the memory limit based on direct mapping restrictions

2024-04-03 Thread Aneesh Kumar K.V (IBM)
The memory limit value specified by the user is further updated such that
the value is 16MB aligned. This is because hash translation mode uses
16MB as the direct mapping page size. Make sure we update the global
variable 'memory_limit' with the 16MB-aligned value so that all kernel
components will see the new aligned value of the memory limit.

Signed-off-by: Aneesh Kumar K.V (IBM) 
---
 arch/powerpc/kernel/prom.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index 7451bedad1f4..b8f764453eaa 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -779,7 +779,6 @@ static inline void save_fscr_to_task(void) {}
 
 void __init early_init_devtree(void *params)
 {
-   phys_addr_t limit;
 
DBG(" -> early_init_devtree(%px)\n", params);
 
@@ -850,8 +849,8 @@ void __init early_init_devtree(void *params)
memory_limit = 0;
 
/* Align down to 16 MB which is large page size with hash page 
translation */
-   limit = ALIGN_DOWN(memory_limit ?: memblock_phys_mem_size(), SZ_16M);
-   memblock_enforce_memory_limit(limit);
+   memory_limit = ALIGN_DOWN(memory_limit ?: memblock_phys_mem_size(), 
SZ_16M);
+   memblock_enforce_memory_limit(memory_limit);
 
 #if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_4K_PAGES)
if (!early_radix_enabled())
-- 
2.44.0



[PATCH 2/3] powerpc/fadump: Don't update the user-specified memory limit

2024-04-03 Thread Aneesh Kumar K.V (IBM)
If the user specifies the memory limit, the kernel should honor it such
that all allocations and reservations are made within the memory limit
specified. fadump was breaking that rule. Remove the code which updates
the memory limit such that fadump reservations are done within the
limit specified.

Cc: Mahesh Salgaonkar  
Signed-off-by: Aneesh Kumar K.V (IBM) 
---
 arch/powerpc/kernel/fadump.c | 16 
 1 file changed, 16 deletions(-)

diff --git a/arch/powerpc/kernel/fadump.c b/arch/powerpc/kernel/fadump.c
index d14eda1e8589..4e768d93c6d4 100644
--- a/arch/powerpc/kernel/fadump.c
+++ b/arch/powerpc/kernel/fadump.c
@@ -573,22 +573,6 @@ int __init fadump_reserve_mem(void)
}
}
 
-   /*
-* Calculate the memory boundary.
-* If memory_limit is less than actual memory boundary then reserve
-* the memory for fadump beyond the memory_limit and adjust the
-* memory_limit accordingly, so that the running kernel can run with
-* specified memory_limit.
-*/
-   if (memory_limit && memory_limit < memblock_end_of_DRAM()) {
-   size = get_fadump_area_size();
-   if ((memory_limit + size) < memblock_end_of_DRAM())
-   memory_limit += size;
-   else
-   memory_limit = memblock_end_of_DRAM();
-   printk(KERN_INFO "Adjusted memory_limit for firmware-assisted"
-   " dump, now %#016llx\n", memory_limit);
-   }
if (memory_limit)
mem_boundary = memory_limit;
else
-- 
2.44.0



[PATCH 1/3] powerpc/mm: Align memory_limit value specified using mem= kernel parameter

2024-04-03 Thread Aneesh Kumar K.V (IBM)
The value specified for the memory limit is used to set a restriction on
memory usage. It is important to ensure that this restriction is within
the linear map kernel address space range. The hash page table
translation uses a 16MB page size to map the kernel linear map address
space. htab_bolt_mapping() function aligns down the size of the range
while mapping kernel linear address space. Since the memblock limit is
enforced very early during boot, before we can detect the type of memory
translation (radix vs hash), we align the memory limit value specified
as a kernel parameter to 16MB. This alignment value will work for both
hash and radix translations.
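For example, a quick sketch of what the alignment does:

  mem=4095M  ->  ALIGN_DOWN(4095M, SZ_16M) = 4080M

so the enforced limit is always a multiple of the 16MB page size used for
the hash kernel linear mapping.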

Signed-off-by: Aneesh Kumar K.V (IBM) 
---
 arch/powerpc/kernel/prom.c  | 7 +--
 arch/powerpc/kernel/prom_init.c | 4 ++--
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/powerpc/kernel/prom.c b/arch/powerpc/kernel/prom.c
index cd8d8883de90..7451bedad1f4 100644
--- a/arch/powerpc/kernel/prom.c
+++ b/arch/powerpc/kernel/prom.c
@@ -846,8 +846,11 @@ void __init early_init_devtree(void *params)
reserve_crashkernel();
early_reserve_mem();
 
-   /* Ensure that total memory size is page-aligned. */
-   limit = ALIGN(memory_limit ?: memblock_phys_mem_size(), PAGE_SIZE);
+   if (memory_limit > memblock_phys_mem_size())
+   memory_limit = 0;
+
+   /* Align down to 16 MB which is large page size with hash page 
translation */
+   limit = ALIGN_DOWN(memory_limit ?: memblock_phys_mem_size(), SZ_16M);
memblock_enforce_memory_limit(limit);
 
 #if defined(CONFIG_PPC_BOOK3S_64) && defined(CONFIG_PPC_4K_PAGES)
diff --git a/arch/powerpc/kernel/prom_init.c b/arch/powerpc/kernel/prom_init.c
index 0ef358285337..fbb68fc28ed3 100644
--- a/arch/powerpc/kernel/prom_init.c
+++ b/arch/powerpc/kernel/prom_init.c
@@ -817,8 +817,8 @@ static void __init early_cmdline_parse(void)
opt += 4;
prom_memory_limit = prom_memparse(opt, (const char **)&opt);
 #ifdef CONFIG_PPC64
-   /* Align to 16 MB == size of ppc64 large page */
-   prom_memory_limit = ALIGN(prom_memory_limit, 0x100);
+   /* Align down to 16 MB which is large page size with hash page 
translation */
+   prom_memory_limit = ALIGN_DOWN(prom_memory_limit, SZ_16M);
 #endif
}
 

base-commit: 3e92c1e6cd876754b64d1998ec0a01800ed954a6
-- 
2.44.0



[PATCH] mm/debug_vm_pgtable: Fix BUG_ON with pud advanced test

2024-01-28 Thread Aneesh Kumar K.V (IBM)
Architectures like powerpc add debug checks to ensure we find only devmap
PUD pte entries. These debug checks are only done with CONFIG_DEBUG_VM.
This patch marks the ptes used for the PUD advanced test as devmap pte
entries so that we don't hit the debug checks on architectures like
ppc64, as shown below.

WARNING: CPU: 2 PID: 1 at arch/powerpc/mm/book3s64/radix_pgtable.c:1382 
radix__pud_hugepage_update+0x38/0x138

NIP [c00a7004] radix__pud_hugepage_update+0x38/0x138
LR [c00a77a8] radix__pudp_huge_get_and_clear+0x28/0x60
Call Trace:
[c4a2f950] [c4a2f9a0] 0xc4a2f9a0 (unreliable)
[c4a2f980] [000d34c1] 0xd34c1
[c4a2f9a0] [c206ba98] pud_advanced_tests+0x118/0x334
[c4a2fa40] [c206db34] debug_vm_pgtable+0xcbc/0x1c48
[c4a2fc10] [c000fd28] do_one_initcall+0x60/0x388

Also

 kernel BUG at arch/powerpc/mm/book3s64/pgtable.c:202!
 

 NIP [c0096510] pudp_huge_get_and_clear_full+0x98/0x174
 LR [c206bb34] pud_advanced_tests+0x1b4/0x334
 Call Trace:
 [c4a2f950] [000d34c1] 0xd34c1 (unreliable)
 [c4a2f9a0] [c206bb34] pud_advanced_tests+0x1b4/0x334
 [c4a2fa40] [c206db34] debug_vm_pgtable+0xcbc/0x1c48
 [c4a2fc10] [c000fd28] do_one_initcall+0x60/0x388

Fixes: 27af67f35631 ("powerpc/book3s64/mm: enable transparent pud hugepage")
Signed-off-by: Aneesh Kumar K.V (IBM) 
---
 mm/debug_vm_pgtable.c | 8 
 1 file changed, 8 insertions(+)

diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index 5662e29fe253..65c19025da3d 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -362,6 +362,12 @@ static void __init pud_advanced_tests(struct 
pgtable_debug_args *args)
vaddr &= HPAGE_PUD_MASK;
 
pud = pfn_pud(args->pud_pfn, args->page_prot);
+   /*
+* Some architectures have debug checks to make sure
+* huge pud mapping are only found with devmap entries
+* For now test with only devmap entries.
+*/
+   pud = pud_mkdevmap(pud);
set_pud_at(args->mm, vaddr, args->pudp, pud);
flush_dcache_page(page);
pudp_set_wrprotect(args->mm, vaddr, args->pudp);
@@ -374,6 +380,7 @@ static void __init pud_advanced_tests(struct 
pgtable_debug_args *args)
WARN_ON(!pud_none(pud));
 #endif /* __PAGETABLE_PMD_FOLDED */
pud = pfn_pud(args->pud_pfn, args->page_prot);
+   pud = pud_mkdevmap(pud);
pud = pud_wrprotect(pud);
pud = pud_mkclean(pud);
set_pud_at(args->mm, vaddr, args->pudp, pud);
@@ -391,6 +398,7 @@ static void __init pud_advanced_tests(struct 
pgtable_debug_args *args)
 #endif /* __PAGETABLE_PMD_FOLDED */
 
pud = pfn_pud(args->pud_pfn, args->page_prot);
+   pud = pud_mkdevmap(pud);
pud = pud_mkyoung(pud);
set_pud_at(args->mm, vaddr, args->pudp, pud);
flush_dcache_page(page);
-- 
2.43.0



Re: [PATCH] MAINTAINERS: powerpc: Add Aneesh & Naveen

2023-12-13 Thread IBM
Michael Ellerman  writes:

> Aneesh and Naveen are helping out with some aspects of upstream
> maintenance, add them as reviewers.
>

Acked-by: Aneesh Kumar K.V (IBM) 

> Signed-off-by: Michael Ellerman 
> ---
>  MAINTAINERS | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/MAINTAINERS b/MAINTAINERS
> index ea790149af79..562d048863ee 100644
> --- a/MAINTAINERS
> +++ b/MAINTAINERS
> @@ -12240,6 +12240,8 @@ LINUX FOR POWERPC (32-BIT AND 64-BIT)
>  M:   Michael Ellerman 
>  R:   Nicholas Piggin 
>  R:   Christophe Leroy 
> +R:   Aneesh Kumar K.V 
> +R:   Naveen N. Rao 
>  L:   linuxppc-dev@lists.ozlabs.org
>  S:   Supported
>  W:   https://github.com/linuxppc/wiki/wiki
> -- 
> 2.43.0


Re: [PATCH 09/12] KVM: PPC: Book3S HV nestedv2: Do not call H_COPY_TOFROM_GUEST

2023-12-08 Thread IBM
Vaibhav Jain  writes:

> From: Jordan Niethe 
>
> H_COPY_TOFROM_GUEST is part of the nestedv1 API and so should not be
> called by a nestedv2 host. Do not attempt to call it.
>

Maybe we should use
firmware_has_feature(FW_FEATURE_H_COPY_TOFROM_GUEST)?

The nestedv2 host can end up using the above hcall if it is supported by the
hypervisor, right? In its absence we will have to translate the guest ea
using xlate and then use kvm_guest_read to read the location using the guest
real address, right? That xlate will also involve multiple kvm_guest_read calls.
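i.e. roughly something like this instead of the unconditional check (sketch
only, reusing the feature name suggested above):

	/* prefer the hcall whenever the hypervisor advertises it */
	if (kvmhv_is_nestedv2() &&
	    !firmware_has_feature(FW_FEATURE_H_COPY_TOFROM_GUEST))
		return H_UNSUPPORTED;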


> Signed-off-by: Jordan Niethe 
> ---
>  arch/powerpc/kvm/book3s_64_mmu_radix.c | 3 +++
>  1 file changed, 3 insertions(+)
>
> diff --git a/arch/powerpc/kvm/book3s_64_mmu_radix.c 
> b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> index 916af6c153a5..4a1abb9f7c05 100644
> --- a/arch/powerpc/kvm/book3s_64_mmu_radix.c
> +++ b/arch/powerpc/kvm/book3s_64_mmu_radix.c
> @@ -40,6 +40,9 @@ unsigned long __kvmhv_copy_tofrom_guest_radix(int lpid, int 
> pid,
>   unsigned long quadrant, ret = n;
>   bool is_load = !!to;
>  
> + if (kvmhv_is_nestedv2())
> + return H_UNSUPPORTED;
> +
>   /* Can't access quadrants 1 or 2 in non-HV mode, call the HV to do it */
>   if (kvmhv_on_pseries())
>   return plpar_hcall_norets(H_COPY_TOFROM_GUEST, lpid, pid, eaddr,
> -- 
> 2.42.0


Re: [PATCH 01/12] KVM: PPC: Book3S HV nestedv2: Invalidate RPT before deleting a guest

2023-12-07 Thread IBM
Vaibhav Jain  writes:

> From: Jordan Niethe 
>
> An L0 must invalidate the L2's RPT during H_GUEST_DELETE if this has not
> already been done. This is a slow operation that means H_GUEST_DELETE
> must return H_BUSY multiple times before completing. Invalidating the
> tables before deleting the guest so there is less work for the L0 to do.
>
> Signed-off-by: Jordan Niethe 
> ---
>  arch/powerpc/include/asm/kvm_book3s.h | 1 +
>  arch/powerpc/kvm/book3s_hv.c  | 6 --
>  arch/powerpc/kvm/book3s_hv_nested.c   | 2 +-
>  3 files changed, 6 insertions(+), 3 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/kvm_book3s.h 
> b/arch/powerpc/include/asm/kvm_book3s.h
> index 4f527d09c92b..a37736ed3728 100644
> --- a/arch/powerpc/include/asm/kvm_book3s.h
> +++ b/arch/powerpc/include/asm/kvm_book3s.h
> @@ -302,6 +302,7 @@ void kvmhv_nested_exit(void);
>  void kvmhv_vm_nested_init(struct kvm *kvm);
>  long kvmhv_set_partition_table(struct kvm_vcpu *vcpu);
>  long kvmhv_copy_tofrom_guest_nested(struct kvm_vcpu *vcpu);
> +void kvmhv_flush_lpid(u64 lpid);
>  void kvmhv_set_ptbl_entry(u64 lpid, u64 dw0, u64 dw1);
>  void kvmhv_release_all_nested(struct kvm *kvm);
>  long kvmhv_enter_nested_guest(struct kvm_vcpu *vcpu);
> diff --git a/arch/powerpc/kvm/book3s_hv.c b/arch/powerpc/kvm/book3s_hv.c
> index 1ed6ec140701..5543e8490cd9 100644
> --- a/arch/powerpc/kvm/book3s_hv.c
> +++ b/arch/powerpc/kvm/book3s_hv.c
> @@ -5691,10 +5691,12 @@ static void kvmppc_core_destroy_vm_hv(struct kvm *kvm)
>   kvmhv_set_ptbl_entry(kvm->arch.lpid, 0, 0);
>   }
>  
> - if (kvmhv_is_nestedv2())
> + if (kvmhv_is_nestedv2()) {
> + kvmhv_flush_lpid(kvm->arch.lpid);
>   plpar_guest_delete(0, kvm->arch.lpid);
>

I am not sure I follow the optimization here. I would expect the
hypervisor to kill all the translation caches as part of guest_delete.
What is the benefit of doing an lpid flush outside the guest delete?

> - else
> + } else {
>   kvmppc_free_lpid(kvm->arch.lpid);
> + }
>  
>   kvmppc_free_pimap(kvm);
>  }
> diff --git a/arch/powerpc/kvm/book3s_hv_nested.c 
> b/arch/powerpc/kvm/book3s_hv_nested.c
> index 3b658b8696bc..5c375ec1a3c6 100644
> --- a/arch/powerpc/kvm/book3s_hv_nested.c
> +++ b/arch/powerpc/kvm/book3s_hv_nested.c
> @@ -503,7 +503,7 @@ void kvmhv_nested_exit(void)
>   }
>  }
>  
> -static void kvmhv_flush_lpid(u64 lpid)
> +void kvmhv_flush_lpid(u64 lpid)
>  {
>   long rc;
>  
> -- 
> 2.42.0


Re: [PATCH v3] powerpc/pseries/vas: Use usleep_range() to support HCALL delay

2023-12-04 Thread IBM
Haren Myneni  writes:

> VAS allocate, modify and deallocate HCALLs returns
> H_LONG_BUSY_ORDER_1_MSEC or H_LONG_BUSY_ORDER_10_MSEC for busy
> delay and expects OS to reissue HCALL after that delay. But using
> msleep() will often sleep at least 20 msecs even though the
> hypervisor suggests OS reissue these HCALLs after 1 or 10msecs.
> The open and close VAS window functions hold mutex and then issue
> these HCALLs. So these operations can take longer than the
> necessary when multiple threads issue open or close window APIs
> simultaneously.
>
> So instead of msleep(), use usleep_range() to ensure sleep with
> the expected value before issuing HCALL again.
>

Can you summarize whether there is a user-observable impact with the current
code? We have other code paths using msleep(get_longbusy_msecs()). Should
we audit those usages?


>
> Signed-off-by: Haren Myneni 
> Suggested-by: Nathan Lynch 
>
> ---
> v1 -> v2:
> - Use usleep_range instead of using RTAS sleep routine as
>   suggested by Nathan
> v2 -> v3:
> - Sleep 10MSecs even for HCALL delay > 10MSecs and the other
>   commit / comemnt changes as suggested by Nathan and Ellerman.
> ---
>  arch/powerpc/platforms/pseries/vas.c | 25 -
>  1 file changed, 24 insertions(+), 1 deletion(-)
>
> diff --git a/arch/powerpc/platforms/pseries/vas.c 
> b/arch/powerpc/platforms/pseries/vas.c
> index 71d52a670d95..5cf81c564d4b 100644
> --- a/arch/powerpc/platforms/pseries/vas.c
> +++ b/arch/powerpc/platforms/pseries/vas.c
> @@ -38,7 +38,30 @@ static long hcall_return_busy_check(long rc)
>  {
>   /* Check if we are stalled for some time */
>   if (H_IS_LONG_BUSY(rc)) {
> - msleep(get_longbusy_msecs(rc));
> + unsigned int ms;
> + /*
> +  * Allocate, Modify and Deallocate HCALLs returns
> +  * H_LONG_BUSY_ORDER_1_MSEC or H_LONG_BUSY_ORDER_10_MSEC
> +  * for the long delay. So the sleep time should always
> +  * be either 1 or 10msecs, but in case if the HCALL
> +  * returns the long delay > 10 msecs, clamp the sleep
> +  * time to 10msecs.
> +  */
> + ms = clamp(get_longbusy_msecs(rc), 1, 10);
> +
> + /*
> +  * msleep() will often sleep at least 20 msecs even
> +  * though the hypervisor suggests that the OS reissue
> +  * HCALLs after 1 or 10msecs. Also the delay hint from
> +  * the HCALL is just a suggestion. So OK to pause for
> +  * less time than the hinted delay. Use usleep_range()
> +  * to ensure we don't sleep much longer than actually
> +  * needed.
> +  *
> +  * See Documentation/timers/timers-howto.rst for
> +  * explanation of the range used here.
> +  */
> + usleep_range(ms * 100, ms * 1000);
>   rc = H_BUSY;
>   } else if (rc == H_BUSY) {
>   cond_resched();
> -- 
> 2.26.3


Re: [PATCH v2] powerpc/book3s/hash: Drop _PAGE_PRIVILEGED from PAGE_NONE

2023-12-04 Thread IBM
Michael Ellerman  writes:

> "Aneesh Kumar K.V"  writes:
>> There used to be a dependency on _PAGE_PRIVILEGED with pte_savedwrite.
>> But that got dropped by
>> commit 6a56ccbcf6c6 ("mm/autonuma: use can_change_(pte|pmd)_writable() to 
>> replace savedwrite")
>>
>> With the change in this patch numa fault pte (pte_protnone()) gets mapped as 
>> regular user pte
>> with RWX cleared (no-access) whereas earlier it used to be mapped 
>> _PAGE_PRIVILEGED.
>>
>> Hash fault handling code did get some WARN_ON added because those
>> functions are not expected to get called with _PAGE_READ cleared.
>> commit 18061c17c8ec ("powerpc/mm: Update PROTFAULT handling in the page 
>> fault path")
>> explains the details.
>  
> You say "did get" which makes me think you're talking about the past.
> But I think you're referring to the WARN_ON you are adding in this patch?

That is correct. Will update this as "Hash fault handling code gets some
WARN_ON added in this patch ..." ?
>

>
>> Also revert commit 1abce0580b89 ("powerpc/64s: Fix __pte_needs_flush() false 
>> positive warning")
>
> That could be done separately as a follow-up couldn't it? Would reduce
> the diff size.
>

Will split that to a separate patch.


>> Signed-off-by: Aneesh Kumar K.V 
>> ---
>>  arch/powerpc/include/asm/book3s/64/pgtable.h  | 9 +++--
>>  arch/powerpc/include/asm/book3s/64/tlbflush.h | 9 ++---
>>  arch/powerpc/mm/book3s64/hash_utils.c | 7 +++
>>  3 files changed, 12 insertions(+), 13 deletions(-)
>>
>> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
>> b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> index cb77eddca54b..2cc58ac74080 100644
>> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
>> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
>> @@ -17,12 +17,6 @@
>>  #define _PAGE_EXEC  0x1 /* execute permission */
>>  #define _PAGE_WRITE 0x2 /* write access allowed */
>>  #define _PAGE_READ  0x4 /* read access allowed */
>> -#define _PAGE_NA_PAGE_PRIVILEGED
>  
>> -#define _PAGE_NAX   _PAGE_EXEC
>> -#define _PAGE_RO_PAGE_READ
>> -#define _PAGE_ROX   (_PAGE_READ | _PAGE_EXEC)
>> -#define _PAGE_RW(_PAGE_READ | _PAGE_WRITE)
>> -#define _PAGE_RWX   (_PAGE_READ | _PAGE_WRITE | _PAGE_EXEC)
>  
> Those are unrelated I think?
>

If we don't require _PAGE_NA we can fall back to the generic version.


>>  #define _PAGE_PRIVILEGED0x8 /* kernel access only */
>>  #define _PAGE_SAO   0x00010 /* Strong access order */
>>  #define _PAGE_NON_IDEMPOTENT0x00020 /* non idempotent memory */
>> @@ -529,6 +523,9 @@ static inline bool pte_user(pte_t pte)
>>  }
>>  
>>  #define pte_access_permitted pte_access_permitted
>> +/*
>> + * execute-only mappings return false
>> + */
>
> That would fit better in the existing comment block inside the function
> I think. Normally this location would be a function description comment.
>

Will move.

>>  static inline bool pte_access_permitted(pte_t pte, bool write)
>>  {
>>  /*
>   ie. here
>
> cheers

Thanks
-aneesh


Re: [PATCH v4 06/13] powerpc/rtas: Serialize firmware activation sequences

2023-11-28 Thread IBM
Nathan Lynch  writes:

> "Aneesh Kumar K.V (IBM)"  writes:
>> Nathan Lynch via B4 Relay 
>> writes:
>>
>>>
>>> Use the function lock API to prevent interleaving call sequences of
>>> the ibm,activate-firmware RTAS function, which typically requires
>>> multiple calls to complete the update. While the spec does not
>>> specifically prohibit interleaved sequences, there's almost certainly
>>> no advantage to allowing them.
>>>
>>
>> Can we document what is the equivalent thing the userspace does?
>
> I'm not sure what we would document.
>
> As best I can tell, the activate_firmware command in powerpc-utils does
> not make any effort to protect its use of the ibm,activate-firmware RTAS
> function. The command is not intended to be run manually and I guess
> it's relying on the platform's management console to serialize its
> invocations.
>
> drmgr (also from powerpc-utils) has some dead code for LPM that calls
> ibm,activate-firmware; it should probably be removed. The command uses a
> lock file to serialize all of its executions.
>
> Something that could happen with interleaved ibm,activate-firmware
> sequences is something like this:
>
> 1. Process A initiates an ibm,activate-firmware sequence and receives a
>    "retry" status (-2/990x).
> 2. Process B calls ibm,activate-firmware and receives the "done" status
>(0), concluding the sequence A began.
> 3. Process A, unaware of B, calls ibm,activate-firmware again,
>inadvertently beginning a new sequence.
>

So this patch won't protect us against a parallel userspace invocation.
We can add static bool call_in_progress to track the ongoing
ibm,activate-firmware call from userspace? My only concern is we are
adding locks to protect against parallel calls in the kernel, but at the
same time, we ignore any userspace call doing the same thing. We should at
least document this if it is not considered important enough to fix.
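Something roughly like the below is what I had in mind for the
call_in_progress idea (sketch only, all names made up; whether we return
-EBUSY or just document the behaviour is a separate question):

/*
 * Set while an ibm,activate-firmware sequence started via sys_rtas is
 * known to be in flight; checked before the kernel starts its own
 * sequence. Caller must hold the per-function lock.
 */
static bool activate_fw_in_progress;

static int activate_fw_seq_begin(void)
{
	if (activate_fw_in_progress)
		return -EBUSY;

	activate_fw_in_progress = true;
	return 0;
}

static void activate_fw_seq_end(void)
{
	activate_fw_in_progress = false;
}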

-aneesh


Re: [PATCH RFC 06/12] mm/gup: Drop folio_fast_pin_allowed() in hugepd processing

2023-11-23 Thread IBM
Peter Xu  writes:

> On Thu, Nov 23, 2023 at 06:22:33PM +, Christophe Leroy wrote:
>> > For fast-gup I think the hugepd code is in use, however for walk_page_*
>> > apis hugepd code shouldn't be reached iiuc as we have the hugetlb specific
>> > handling (walk_hugetlb_range()), so anything within walk_pgd_range() to hit
>> > a hugepd can be dead code to me (but note that this "dead code" is good
>> > stuff to me, if one would like to merge hugetlb instead into generic mm).
>> 
>> Not sure what you mean here. What do you mean by "dead code" ?
>> A hugepage directory can be plugged at any page level, from PGD to PMD.
>> So the following bit in walk_pgd_range() is valid and not dead:
>> 
>>  if (is_hugepd(__hugepd(pgd_val(*pgd
>>  err = walk_hugepd_range((hugepd_t *)pgd, addr, next, 
>> walk, PGDIR_SHIFT);
>
> IMHO it boils down to the question on whether hugepd is only used in
> hugetlbfs.  I think I already mentioned that above, but I can be more
> explicit; what I see is that from higher stack in __walk_page_range():
>
>   if (is_vm_hugetlb_page(vma)) {
>   if (ops->hugetlb_entry)
>   err = walk_hugetlb_range(start, end, walk);
>   } else
>   err = walk_pgd_range(start, end, walk);
>
> It means to me as long as the vma is hugetlb, it'll not trigger any code in
> walk_pgd_range(), but only walk_hugetlb_range().  Do you perhaps mean
> hugepd is used outside hugetlbfs?
>

walk_pgd_range() also gets called from walk_page_range_novma().
IIRC commit e17eae2b839937817d771e2f5d2b30e5e2b81bb7 added the hugepd
handling to the pagewalk code to support ptdump.
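
Roughly, a novma walk in the style of ptdump (hypothetical walker below,
not from any in-tree user) goes straight into walk_pgd_range(), which is
why the is_hugepd() branch there can be reached:

	#include <linux/pagewalk.h>
	#include <linux/mm.h>

	static int note_pte(pte_t *pte, unsigned long addr, unsigned long next,
			    struct mm_walk *walk)
	{
		/* a real dumper would record addr and the pte attributes here */
		return 0;
	}

	static const struct mm_walk_ops dump_ops = {
		.pte_entry = note_pte,
	};

	static void dump_kernel_range(unsigned long start, unsigned long end)
	{
		/* no vma here, so walk_hugetlb_range() is never taken ... */
		mmap_write_lock(&init_mm);	/* novma walks expect mmap lock held, IIUC */
		/* ... and walk_pgd_range() has to recognise hugepd entries itself */
		walk_page_range_novma(&init_mm, start, end, &dump_ops, NULL, NULL);
		mmap_write_unlock(&init_mm);
	}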

There is also a desire to use hugepd format in vmap mappings. 
https://lore.kernel.org/linuxppc-dev/cover.1620795204.git.christophe.le...@csgroup.eu

-aneesh


Re: [PATCH] powerpc: add crtsavres.o to always-y instead of extra-y

2023-11-21 Thread IBM
"Nicholas Piggin"  writes:

> On Tue Nov 21, 2023 at 9:23 AM AEST, Masahiro Yamada wrote:
>> crtsavres.o is linked to modules. However, as explained in commit
>> d0e628cd817f ("kbuild: doc: clarify the difference between extra-y
>> and always-y"), 'make modules' does not build extra-y.
>>
>> For example, the following command fails:
>>
>>   $ make ARCH=powerpc LLVM=1 KBUILD_MODPOST_WARN=1 mrproper ps3_defconfig 
>> modules
>> [snip]
>> LD [M]  arch/powerpc/platforms/cell/spufs/spufs.ko
>>   ld.lld: error: cannot open arch/powerpc/lib/crtsavres.o: No such file or 
>> directory
>>   make[3]: *** [scripts/Makefile.modfinal:56: 
>> arch/powerpc/platforms/cell/spufs/spufs.ko] Error 1
>>   make[2]: *** [Makefile:1844: modules] Error 2
>>   make[1]: *** [/home/masahiro/workspace/linux-kbuild/Makefile:350: 
>> __build_one_by_one] Error 2
>>   make: *** [Makefile:234: __sub-make] Error 2
>>
>
> Thanks. Is this the correct Fixes tag?
>
> Fixes: d0e628cd817f ("powerpc/64: Do not link crtsavres.o in vmlinux")
>

I am finding a different commit ID:

commit baa25b571a168aff5a13bfdc973f1229e2b12b63
Author: Nicholas Piggin 
Date:   Fri May 12 01:56:49 2017 +1000

powerpc/64: Do not link crtsavres.o in vmlinux
 
The 64-bit linker creates save/restore functions on demand with final
links, so vmlinux does not require crtsavres.o.
 

-aneesh


Re: [PATCH v4 07/13] powerpc/rtas: Warn if per-function lock isn't held

2023-11-20 Thread IBM
Nathan Lynch via B4 Relay 
writes:

> From: Nathan Lynch 
>
> If the function descriptor has a populated lock member, then callers
> are required to hold it across calls. Now that the firmware activation
> sequence is appropriately guarded, we can warn when the requirement
> isn't satisfied.
>
> __do_enter_rtas_trace() gets reorganized a bit as a result of
> performing the function descriptor lookup unconditionally now.
>

Reviewed-by: Aneesh Kumar K.V (IBM) 

> Signed-off-by: Nathan Lynch 
> ---
>  arch/powerpc/kernel/rtas.c | 21 +
>  1 file changed, 9 insertions(+), 12 deletions(-)
>
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index e38ba05ad613..deb6289fcf9c 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -685,28 +685,25 @@ static void __do_enter_rtas(struct rtas_args *args)
>  
>  static void __do_enter_rtas_trace(struct rtas_args *args)
>  {
> - const char *name = NULL;
> + struct rtas_function *func = 
> rtas_token_to_function(be32_to_cpu(args->token));
>  
> - if (args == &rtas_args)
> - lockdep_assert_held(&rtas_lock);
>   /*
> -  * If the tracepoints that consume the function name aren't
> -  * active, avoid the lookup.
> +  * If there is a per-function lock, it must be held by the
> +  * caller.
>*/
> - if ((trace_rtas_input_enabled() || trace_rtas_output_enabled())) {
> - const s32 token = be32_to_cpu(args->token);
> - const struct rtas_function *func = 
> rtas_token_to_function(token);
> + if (func->lock)
> + WARN_ON(!mutex_is_locked(func->lock));
>  
> - name = func->name;
> - }
> + if (args == &rtas_args)
> + lockdep_assert_held(&rtas_lock);
>  
> - trace_rtas_input(args, name);
> + trace_rtas_input(args, func->name);
>   trace_rtas_ll_entry(args);
>  
>   __do_enter_rtas(args);
>  
>   trace_rtas_ll_exit(args);
> - trace_rtas_output(args, name);
> + trace_rtas_output(args, func->name);
>  }
>  
>  static void do_enter_rtas(struct rtas_args *args)
>
> -- 
> 2.41.0


Re: [PATCH v4 06/13] powerpc/rtas: Serialize firmware activation sequences

2023-11-20 Thread IBM
Nathan Lynch via B4 Relay 
writes:

> From: Nathan Lynch 
>
> Use the function lock API to prevent interleaving call sequences of
> the ibm,activate-firmware RTAS function, which typically requires
> multiple calls to complete the update. While the spec does not
> specifically prohibit interleaved sequences, there's almost certainly
> no advantage to allowing them.
>

Can we document what is the equivalent thing the userspace does? 

Reviewed-by: Aneesh Kumar K.V (IBM) 

> Signed-off-by: Nathan Lynch 
> ---
>  arch/powerpc/kernel/rtas.c | 4 
>  1 file changed, 4 insertions(+)
>
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index 52f2242d0c28..e38ba05ad613 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -1753,10 +1753,14 @@ void rtas_activate_firmware(void)
>   return;
>   }
>  
> + rtas_function_lock(RTAS_FN_IBM_ACTIVATE_FIRMWARE);
> +
>   do {
>   fwrc = rtas_call(token, 0, 1, NULL);
>   } while (rtas_busy_delay(fwrc));
>  
> + rtas_function_unlock(RTAS_FN_IBM_ACTIVATE_FIRMWARE);
> +
>   if (fwrc)
>   pr_err("ibm,activate-firmware failed (%i)\n", fwrc);
>  }
>
> -- 
> 2.41.0


Re: [PATCH v4 05/13] powerpc/rtas: Facilitate high-level call sequences

2023-11-20 Thread IBM
Nathan Lynch via B4 Relay 
writes:

> From: Nathan Lynch 
>
> On RTAS platforms there is a general restriction that the OS must not
> enter RTAS on more than one CPU at a time. This low-level
> serialization requirement is satisfied by holding a spin
> lock (rtas_lock) across most RTAS function invocations.
>
> However, some pseries RTAS functions require multiple successive calls
> to complete a logical operation. Beginning a new call sequence for such a
> function may disrupt any other sequences of that function already in
> progress. Safe and reliable use of these functions effectively
> requires higher-level serialization beyond what is already done at the
> level of RTAS entry and exit.
>
> Where a sequence-based RTAS function is invoked only through
> sys_rtas(), with no in-kernel users, there is no issue as far as the
> kernel is concerned. User space is responsible for appropriately
> serializing its call sequences. (Whether user space code actually
> takes measures to prevent sequence interleaving is another matter.)
> Examples of such functions currently include ibm,platform-dump and
> ibm,get-vpd.
>
> But where a sequence-based RTAS function has both user space and
> in-kernel uesrs, there is a hazard. Even if the in-kernel call sites
> of such a function serialize their sequences correctly, a user of
> sys_rtas() can invoke the same function at any time, potentially
> disrupting a sequence in progress.
>
> So in order to prevent disruption of kernel-based RTAS call sequences,
> they must serialize not only with themselves but also with sys_rtas()
> users, somehow. Preferably without adding global locks or adding more
> function-specific hacks to sys_rtas(). This is a prerequisite for
> adding an in-kernel call sequence of ibm,get-vpd, which is in a change
> to follow.
>
> Note that it has never been feasible for the kernel to prevent
> sys_rtas()-based sequences from being disrupted because control
> returns to user space on every call. sys_rtas()-based users of these
> functions have always been, and continue to be, responsible for
> coordinating their call sequences with other users, even those which
> may invoke the RTAS functions through less direct means than
> sys_rtas(). This is an unavoidable consequence of exposing
> sequence-based RTAS functions through sys_rtas().
>
> * Add new rtas_function_lock() and rtas_function_unlock() APIs for use
>   with sequence-based RTAS functions.
>
> * Add an optional per-function mutex to struct rtas_function. When this
>   member is set, kernel-internal callers of the RTAS function are
>   required to guard their call sequences with rtas_function_lock() and
>   rtas_function_unlock(). This requirement will be enforced in a later
>   change, after all affected call sites are updated.
>
> * Populate the lock members of function table entries where
>   serialization of call sequences is known to be necessary, along with
>   justifying commentary.
>
> * In sys_rtas(), acquire the per-function mutex when it is present.
>
> There should be no perceivable change introduced here except that
> concurrent callers of the same RTAS function via sys_rtas() may block
> on a mutex instead of spinning on rtas_lock. Changes to follow will
> add rtas_function_lock()/unlock() pairs to kernel-based call
> sequences.
>

Can you add an example of the last part? I had to look at patch 06 to
find the details:

rtas_function_lock(RTAS_FN_IBM_ACTIVATE_FIRMWARE);

    do {
fwrc = rtas_call(token, 0, 1, NULL);
} while (rtas_busy_delay(fwrc));

rtas_function_unlock(RTAS_FN_IBM_ACTIVATE_FIRMWARE);

Reviewed-by: Aneesh Kumar K.V (IBM) 

>
> Signed-off-by: Nathan Lynch 
> ---
>  arch/powerpc/include/asm/rtas.h |   2 +
>  arch/powerpc/kernel/rtas.c  | 101 
> +++-
>  2 files changed, 101 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
> index b73010583a8d..9a20caba6858 100644
> --- a/arch/powerpc/include/asm/rtas.h
> +++ b/arch/powerpc/include/asm/rtas.h
> @@ -421,6 +421,8 @@ static inline bool rtas_function_implemented(const 
> rtas_fn_handle_t handle)
>  {
>   return rtas_function_token(handle) != RTAS_UNKNOWN_SERVICE;
>  }
> +void rtas_function_lock(rtas_fn_handle_t handle);
> +void rtas_function_unlock(rtas_fn_handle_t handle);
>  extern int rtas_token(const char *service);
>  extern int rtas_service_present(const char *service);
>  extern int rtas_call(int token, int, int, int *, ...);
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index 1fc0b3fffdd1..52f2242d0c28 100644
> --- a/arch/powerpc/kernel/rtas.c
>

Re: [PATCH v4 04/13] powerpc/rtas: Factor out function descriptor lookup

2023-11-20 Thread IBM
Nathan Lynch via B4 Relay 
writes:

> From: Nathan Lynch 
>
> Move the function descriptor table lookup out of rtas_function_token()
> into a separate routine for use in new code to follow. No functional
> change.
>

Reviewed-by: Aneesh Kumar K.V (IBM) 

> Signed-off-by: Nathan Lynch 
> ---
>  arch/powerpc/kernel/rtas.c | 31 +++
>  1 file changed, 19 insertions(+), 12 deletions(-)
>
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index f0051881348a..1fc0b3fffdd1 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -469,29 +469,36 @@ static struct rtas_function rtas_function_table[] 
> __ro_after_init = {
>  static DEFINE_RAW_SPINLOCK(rtas_lock);
>  static struct rtas_args rtas_args;
>  
> -/**
> - * rtas_function_token() - RTAS function token lookup.
> - * @handle: Function handle, e.g. RTAS_FN_EVENT_SCAN.
> - *
> - * Context: Any context.
> - * Return: the token value for the function if implemented by this platform,
> - * otherwise RTAS_UNKNOWN_SERVICE.
> - */
> -s32 rtas_function_token(const rtas_fn_handle_t handle)
> +static struct rtas_function *rtas_function_lookup(const rtas_fn_handle_t 
> handle)
>  {
>   const size_t index = handle.index;
>   const bool out_of_bounds = index >= ARRAY_SIZE(rtas_function_table);
>  
>   if (WARN_ONCE(out_of_bounds, "invalid function index %zu", index))
> - return RTAS_UNKNOWN_SERVICE;
> + return NULL;
>   /*
>* Various drivers attempt token lookups on non-RTAS
>* platforms.
>*/
>   if (!rtas.dev)
> - return RTAS_UNKNOWN_SERVICE;
> + return NULL;
> +
> + return &rtas_function_table[index];
> +}
> +
> +/**
> + * rtas_function_token() - RTAS function token lookup.
> + * @handle: Function handle, e.g. RTAS_FN_EVENT_SCAN.
> + *
> + * Context: Any context.
> + * Return: the token value for the function if implemented by this platform,
> + * otherwise RTAS_UNKNOWN_SERVICE.
> + */
> +s32 rtas_function_token(const rtas_fn_handle_t handle)
> +{
> + const struct rtas_function *func = rtas_function_lookup(handle);
>  
> - return rtas_function_table[index].token;
> + return func ? func->token : RTAS_UNKNOWN_SERVICE;
>  }
>  EXPORT_SYMBOL_GPL(rtas_function_token);
>  
>
> -- 
> 2.41.0


Re: [PATCH v4 03/13] powerpc/rtas: Add function return status constants

2023-11-20 Thread IBM
Nathan Lynch via B4 Relay 
writes:

> From: Nathan Lynch 
>
> Not all of the generic RTAS function statuses specified in PAPR have
> symbolic constants and descriptions in rtas.h. Fix this, providing a
> little more background, slightly updating the existing wording, and
> improving the formatting.
>

Reviewed-by: Aneesh Kumar K.V (IBM) 

> Signed-off-by: Nathan Lynch 
> ---
>  arch/powerpc/include/asm/rtas.h | 25 +++--
>  1 file changed, 19 insertions(+), 6 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/rtas.h b/arch/powerpc/include/asm/rtas.h
> index c697c3c74694..b73010583a8d 100644
> --- a/arch/powerpc/include/asm/rtas.h
> +++ b/arch/powerpc/include/asm/rtas.h
> @@ -201,12 +201,25 @@ typedef struct {
>  /* Memory set aside for sys_rtas to use with calls that need a work area. */
>  #define RTAS_USER_REGION_SIZE (64 * 1024)
>  
> -/* RTAS return status codes */
> -#define RTAS_HARDWARE_ERROR  -1/* Hardware Error */
> -#define RTAS_BUSY-2/* RTAS Busy */
> -#define RTAS_INVALID_PARAMETER   -3/* Invalid 
> indicator/domain/sensor etc. */
> -#define RTAS_EXTENDED_DELAY_MIN  9900
> -#define RTAS_EXTENDED_DELAY_MAX  9905
> +/*
> + * Common RTAS function return values, derived from the table "RTAS
> + * Status Word Values" in PAPR+ 7.2.8: "Return Codes". If a function
> + * can return a value in this table then generally it has the meaning
> + * listed here. More extended commentary in the documentation for
> + * rtas_call().
> + *
> + * RTAS functions may use negative and positive numbers not in this
> + * set for function-specific error and success conditions,
> + * respectively.
> + */
> +#define RTAS_SUCCESS 0 /* Success. */
> +#define RTAS_HARDWARE_ERROR -1 /* Hardware or other unspecified 
> error. */
> +#define RTAS_BUSY   -2 /* Retry immediately. */
> +#define RTAS_INVALID_PARAMETER  -3 /* Invalid 
> indicator/domain/sensor etc. */
> +#define RTAS_UNEXPECTED_STATE_CHANGE-7 /* Seems limited to EEH and slot 
> reset. */
> +#define RTAS_EXTENDED_DELAY_MIN   9900 /* Retry after delaying for ~1ms. 
> */
> +#define RTAS_EXTENDED_DELAY_MAX   9905 /* Retry after delaying for 
> ~100s. */
> +#define RTAS_ML_ISOLATION_ERROR  -9000 /* Multi-level isolation error. */
>  
>  /* statuses specific to ibm,suspend-me */
>  #define RTAS_SUSPEND_ABORTED 9000 /* Suspension aborted */
>
> -- 
> 2.41.0


Re: [PATCH v4 02/13] powerpc/rtas: Fall back to linear search on failed token->function lookup

2023-11-20 Thread IBM
Nathan Lynch via B4 Relay 
writes:

> From: Nathan Lynch 
>
> Enabling any of the powerpc:rtas_* tracepoints at boot is likely to
> result in an oops on RTAS platforms. For example, booting a QEMU
> pseries model with 'trace_event=powerpc:rtas_input' in the command
> line leads to:
>
>   BUG: Kernel NULL pointer dereference on read at 0x0008
>   Oops: Kernel access of bad area, sig: 7 [#1]
>   NIP [c004231c] do_enter_rtas+0x1bc/0x460
>   LR [c004231c] do_enter_rtas+0x1bc/0x460
>   Call Trace:
> do_enter_rtas+0x1bc/0x460 (unreliable)
> rtas_call+0x22c/0x4a0
> rtas_get_boot_time+0x80/0x14c
> read_persistent_clock64+0x124/0x150
> read_persistent_wall_and_boot_offset+0x28/0x58
> timekeeping_init+0x70/0x348
> start_kernel+0xa0c/0xc1c
> start_here_common+0x1c/0x20
>
> (This is preceded by a warning for the failed lookup in
> rtas_token_to_function().)
>
> This happens when __do_enter_rtas_trace() attempts a token to function
> descriptor lookup before the xarray containing the mappings has been
> set up.
>
> Fall back to linear scan of the table if rtas_token_to_function_xarray
> is empty.
>

Reviewed-by: Aneesh Kumar K.V (IBM) 

> Signed-off-by: Nathan Lynch 
> Fixes: 24098f580e2b ("powerpc/rtas: add tracepoints around RTAS entry")
> ---
>  arch/powerpc/kernel/rtas.c | 18 ++
>  1 file changed, 14 insertions(+), 4 deletions(-)
>
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index 1ad1869e2e96..f0051881348a 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -557,11 +557,21 @@ static const struct rtas_function 
> *rtas_token_to_function(s32 token)
>   return NULL;
>  
>   func = xa_load(&rtas_token_to_function_xarray, token);
> + if (func)
> + return func;
> + /*
> +  * Fall back to linear scan in case the reverse mapping hasn't
> +  * been initialized yet.
> +  */
> + if (xa_empty(&rtas_token_to_function_xarray)) {
> + for_each_rtas_function(func) {
> + if (func->token == token)
> + return func;
> + }
> + }
>  
> - if (WARN_ONCE(!func, "unexpected failed lookup for token %d", token))
> - return NULL;
> -
> - return func;
> + WARN_ONCE(true, "unexpected failed lookup for token %d", token);
> + return NULL;
>  }
>  
>  /* This is here deliberately so it's only used in this file */
>
> -- 
> 2.41.0


Re: [PATCH v4 01/13] powerpc/rtas: Add for_each_rtas_function() iterator

2023-11-20 Thread IBM
Nathan Lynch via B4 Relay 
writes:

> From: Nathan Lynch 
>
> Add a convenience macro for iterating over every element of the
> internal function table and convert the one site that can use it. An
> additional user of the macro is anticipated in changes to follow.
>

Reviewed-by: Aneesh Kumar K.V (IBM) 

> Signed-off-by: Nathan Lynch 
> ---
>  arch/powerpc/kernel/rtas.c | 9 +++--
>  1 file changed, 7 insertions(+), 2 deletions(-)
>
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index eddc031c4b95..1ad1869e2e96 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -454,6 +454,11 @@ static struct rtas_function rtas_function_table[] 
> __ro_after_init = {
>   },
>  };
>  
> +#define for_each_rtas_function(funcp)   \
> + for (funcp = &rtas_function_table[0];   \
> +  funcp < &rtas_function_table[ARRAY_SIZE(rtas_function_table)]; \
> +  ++funcp)
> +
>  /*
>   * Nearly all RTAS calls need to be serialized. All uses of the
>   * default rtas_args block must hold rtas_lock.
> @@ -525,10 +530,10 @@ static DEFINE_XARRAY(rtas_token_to_function_xarray);
>  
>  static int __init rtas_token_to_function_xarray_init(void)
>  {
> + const struct rtas_function *func;
>   int err = 0;
>  
> - for (size_t i = 0; i < ARRAY_SIZE(rtas_function_table); ++i) {
> - const struct rtas_function *func = &rtas_function_table[i];
> + for_each_rtas_function(func) {
>   const s32 token = func->token;
>  
>   if (token == RTAS_UNKNOWN_SERVICE)
>
> -- 
> 2.41.0


Re: [PATCH v2 14/16] powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap handling function

2023-06-27 Thread IBM
"Aneesh Kumar K.V"  writes:

> This is in preparation to update radix to implement vmemmap optimization
> for devdax. Below are the rules w.r.t radix vmemmap mapping
>
> 1. First try to map things using PMD (2M)
> 2. With altmap if altmap cross-boundary check returns true, fall back to
>PAGE_SIZE
> 3. If we can't allocate PMD_SIZE backing memory for vmemmap, fallback to
>PAGE_SIZE
>
> On removing vmemmap mapping, check if every subsection that is using the
> vmemmap area is invalid. If found to be invalid, that implies we can safely
> free the vmemmap area. We don't use the PAGE_UNUSED pattern used by x86
> because with 64K page size, we need to do the above check even at the
> PAGE_SIZE granularity.
>
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/book3s/64/radix.h |   2 +
>  arch/powerpc/include/asm/pgtable.h |   3 +
>  arch/powerpc/mm/book3s64/radix_pgtable.c   | 319 +++--
>  arch/powerpc/mm/init_64.c  |  26 +-
>  4 files changed, 319 insertions(+), 31 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/radix.h 
> b/arch/powerpc/include/asm/book3s/64/radix.h
> index 8cdff5a05011..87d4c1e62491 100644
> --- a/arch/powerpc/include/asm/book3s/64/radix.h
> +++ b/arch/powerpc/include/asm/book3s/64/radix.h
> @@ -332,6 +332,8 @@ extern int __meminit 
> radix__vmemmap_create_mapping(unsigned long start,
>unsigned long phys);
>  int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end,
> int node, struct vmem_altmap *altmap);
> +void __ref radix__vmemmap_free(unsigned long start, unsigned long end,
> +struct vmem_altmap *altmap);
>  extern void radix__vmemmap_remove_mapping(unsigned long start,
>   unsigned long page_size);
>  
> diff --git a/arch/powerpc/include/asm/pgtable.h 
> b/arch/powerpc/include/asm/pgtable.h
> index 9972626ddaf6..6d4cd2ebae6e 100644
> --- a/arch/powerpc/include/asm/pgtable.h
> +++ b/arch/powerpc/include/asm/pgtable.h
> @@ -168,6 +168,9 @@ static inline bool is_ioremap_addr(const void *x)
>  
>  struct seq_file;
>  void arch_report_meminfo(struct seq_file *m);
> +int __meminit vmemmap_populated(unsigned long vmemmap_addr, int 
> vmemmap_map_size);
> +bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
> +unsigned long page_size);
>  #endif /* CONFIG_PPC64 */
>  
>  #endif /* __ASSEMBLY__ */
> diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
> b/arch/powerpc/mm/book3s64/radix_pgtable.c
> index d7e2dd3d4add..ef886fab643d 100644
> --- a/arch/powerpc/mm/book3s64/radix_pgtable.c
> +++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
> @@ -742,8 +742,57 @@ static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
>   p4d_clear(p4d);
>  }
>  
> +static bool __meminit vmemmap_pmd_is_unused(unsigned long addr, unsigned 
> long end)
> +{
> + unsigned long start = ALIGN_DOWN(addr, PMD_SIZE);
> +
> + return !vmemmap_populated(start, PMD_SIZE);
> +}
> +
> +static bool __meminit vmemmap_page_is_unused(unsigned long addr, unsigned 
> long end)
> +{
> + unsigned long start = ALIGN_DOWN(addr, PAGE_SIZE);
> +
> + return !vmemmap_populated(start, PAGE_SIZE);
> +
> +}
> +
> +static void __meminit free_vmemmap_pages(struct page *page,
> +  struct vmem_altmap *altmap,
> +  int order)
> +{
> + unsigned int nr_pages = 1 << order;
> +
> + if (altmap) {
> + unsigned long alt_start, alt_end;
> + unsigned long base_pfn = page_to_pfn(page);
> +
> + /*
> +  * with 1G vmemmap mmaping we can have things setup
> +  * such that even though atlmap is specified we never
> +  * used altmap.
> +  */
> + alt_start = altmap->base_pfn;
> + alt_end = altmap->base_pfn + altmap->reserve +
> + altmap->free + altmap->alloc + altmap->align;
> +
> + if (base_pfn >= alt_start && base_pfn < alt_end) {
> + vmem_altmap_free(altmap, nr_pages);
> + return;
> + }
> + }
> +
> + if (PageReserved(page)) {
> + /* allocated from memblock */
> + while (nr_pages--)
> + free_reserved_page(page++);
> + } else
> + free_pages((unsigned long)page_address(page), order);
> +}
> +
>  static void remove_pte_table(pte_t *pte_start, unsigned long addr,
> -  unsigned long end, bool direct)
> +  unsigned long end, bool direct,
> +  struct vmem_altmap *altmap)
>  {
>   unsigned long next, pages = 0;
>   pte_t *pte;
> @@ -757,24 +806,23 @@ static void remove_pte_table(pte_t *pte_start, unsigned 
> long addr,
>   if (!pte_present(*pte))
>

Re: [PATCH v2 13/16] powerpc/book3s64/mm: Enable transparent pud hugepage

2023-06-27 Thread IBM
"Aneesh Kumar K.V"  writes:

These are just some minor nits in case you are going to send another
revision.

> This is enabled only with radix translation and 1G hugepage size. This will
> be used with devdax device memory with a namespace alignment of 1G.
>
> Anon transparent hugepage is not supported even though we do have helpers
> checking pud_trans_huge(). We should never find that return true. The only
> expected pte bit combination is _PAGE_PTE | _PAGE_DEVMAP.
>
> Some of the helpers are never expected to get called on hash translation
> and hence is marked to call BUG() in such a case.
>
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/powerpc/include/asm/book3s/64/pgtable.h  | 156 --
>  arch/powerpc/include/asm/book3s/64/radix.h|  37 +
>  .../include/asm/book3s/64/tlbflush-radix.h|   2 +
>  arch/powerpc/include/asm/book3s/64/tlbflush.h |   8 +
>  arch/powerpc/mm/book3s64/pgtable.c|  78 +
>  arch/powerpc/mm/book3s64/radix_pgtable.c  |  28 
>  arch/powerpc/mm/book3s64/radix_tlb.c  |   7 +
>  arch/powerpc/platforms/Kconfig.cputype|   1 +
>  include/trace/events/thp.h|  17 ++
>  9 files changed, 323 insertions(+), 11 deletions(-)
>
> diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
> b/arch/powerpc/include/asm/book3s/64/pgtable.h
> index 4acc9690f599..9a05de007956 100644
> --- a/arch/powerpc/include/asm/book3s/64/pgtable.h
> +++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
> @@ -921,8 +921,29 @@ static inline pud_t pte_pud(pte_t pte)
>  {
>   return __pud_raw(pte_raw(pte));
>  }
> +
> +static inline pte_t *pudp_ptep(pud_t *pud)
> +{
> + return (pte_t *)pud;
> +}
> +
> +#define pud_pfn(pud) pte_pfn(pud_pte(pud))
> +#define pud_dirty(pud)   pte_dirty(pud_pte(pud))
> +#define pud_young(pud)   pte_young(pud_pte(pud))
> +#define pud_mkold(pud)   pte_pud(pte_mkold(pud_pte(pud)))
> +#define pud_wrprotect(pud)   pte_pud(pte_wrprotect(pud_pte(pud)))
> +#define pud_mkdirty(pud) pte_pud(pte_mkdirty(pud_pte(pud)))
> +#define pud_mkclean(pud) pte_pud(pte_mkclean(pud_pte(pud)))
> +#define pud_mkyoung(pud) pte_pud(pte_mkyoung(pud_pte(pud)))
> +#define pud_mkwrite(pud) pte_pud(pte_mkwrite(pud_pte(pud)))
>  #define pud_write(pud)   pte_write(pud_pte(pud))
>  
> +#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
> +#define pud_soft_dirty(pmd)pte_soft_dirty(pud_pte(pud))
> +#define pud_mksoft_dirty(pmd)  pte_pud(pte_mksoft_dirty(pud_pte(pud)))
> +#define pud_clear_soft_dirty(pmd) pte_pud(pte_clear_soft_dirty(pud_pte(pud)))
> +#endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
> +
>  static inline int pud_bad(pud_t pud)
>  {
>   if (radix_enabled())
> @@ -1115,15 +1136,24 @@ static inline bool pmd_access_permitted(pmd_t pmd, 
> bool write)
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
>  extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
> +extern pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot);
>  extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
>  extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot);
>  extern void set_pmd_at(struct mm_struct *mm, unsigned long addr,
>  pmd_t *pmdp, pmd_t pmd);
> +extern void set_pud_at(struct mm_struct *mm, unsigned long addr,
> +pud_t *pudp, pud_t pud);
> +
>  static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
>   unsigned long addr, pmd_t *pmd)
>  {
>  }
>  
> +static inline void update_mmu_cache_pud(struct vm_area_struct *vma,
> + unsigned long addr, pud_t *pud)
> +{
> +}
> +
>  extern int hash__has_transparent_hugepage(void);
>  static inline int has_transparent_hugepage(void)
>  {
> @@ -1133,6 +1163,14 @@ static inline int has_transparent_hugepage(void)
>  }
>  #define has_transparent_hugepage has_transparent_hugepage
>  
> +static inline int has_transparent_pud_hugepage(void)
> +{
> + if (radix_enabled())
> + return radix__has_transparent_pud_hugepage();
> + return 0;
> +}
> +#define has_transparent_pud_hugepage has_transparent_pud_hugepage
> +
>  static inline unsigned long
>  pmd_hugepage_update(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp,
>   unsigned long clr, unsigned long set)
> @@ -1142,6 +1180,16 @@ pmd_hugepage_update(struct mm_struct *mm, unsigned 
> long addr, pmd_t *pmdp,
>   return hash__pmd_hugepage_update(mm, addr, pmdp, clr, set);
>  }
>  
> +static inline unsigned long
> +pud_hugepage_update(struct mm_struct *mm, unsigned long addr, pud_t *pudp,
> + unsigned long clr, unsigned long set)
> +{
> + if (radix_enabled())
> + return radix__pud_hugepage_update(mm, addr, pudp, clr, set);
> + BUG();
> + return pud_val(*pudp);
> +}
> +
>  /*
>   * returns true for pmd migration entries, THP, devmap, hugetlb
>   * But compile time dependent on THP config
> @@ -1151,6 +1199,11 @@ st

Re: [PATCH v2 12/16] mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization

2023-06-27 Thread IBM
"Aneesh Kumar K.V"  writes:

> Arm disabled hugetlb vmemmap optimization [1] because hugetlb vmemmap
> optimization includes an update of both the permissions (writeable to
> read-only) and the output address (pfn) of the vmemmap ptes. That is not
> supported without unmapping of pte(marking it invalid) by some
> architectures.
>
> With DAX vmemmap optimization we don't require such pte updates and
> architectures can enable DAX vmemmap optimization while having hugetlb
> vmemmap optimization disabled. Hence split DAX optimization support into a
> different config.
>
> loongarch and riscv don't have devdax support. So the DAX config is not
> enabled for them. With this change, arm64 should be able to select DAX
> optimization
>
> [1] commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable 
> HUGETLB_PAGE_OPTIMIZE_VMEMMAP")
>
> Signed-off-by: Aneesh Kumar K.V 
> ---
>  arch/loongarch/Kconfig | 2 +-
>  arch/riscv/Kconfig | 2 +-
>  arch/x86/Kconfig   | 3 ++-
>  fs/Kconfig | 2 +-
>  include/linux/mm.h | 2 +-
>  mm/Kconfig | 5 -
>  6 files changed, 10 insertions(+), 6 deletions(-)

what about s390?

git grep "ARCH_WANT_OPTIMIZE_VMEMMAP" .
arch/s390/Kconfig:  select ARCH_WANT_OPTIMIZE_VMEMMAP

> diff --git a/mm/Kconfig b/mm/Kconfig
> index 7672a22647b4..7b388c10baab 100644
> --- a/mm/Kconfig
> +++ b/mm/Kconfig
> @@ -461,7 +461,10 @@ config SPARSEMEM_VMEMMAP
>  # Select this config option from the architecture Kconfig, if it is preferred
>  # to enable the feature of HugeTLB/dev_dax vmemmap optimization.
>  #
> -config ARCH_WANT_OPTIMIZE_VMEMMAP
> +config ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
> + bool
> +
> +config ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
>   bool
>  
>  config HAVE_MEMBLOCK_PHYS_MAP
> -- 
> 2.40.1

-ritesh


Re: [PATCH 2/4] fs: define a firmware security filesystem named fwsecurityfs

2022-11-19 Thread Ritesh Harjani (IBM)
Hello Nayna, 

On 22/11/09 03:10PM, Nayna wrote:
> 
> On 11/9/22 08:46, Greg Kroah-Hartman wrote:
> > On Sun, Nov 06, 2022 at 04:07:42PM -0500, Nayna Jain wrote:
> > > securityfs is meant for Linux security subsystems to expose policies/logs
> > > or any other information. However, there are various firmware security
> > > features which expose their variables for user management via the kernel.
> > > There is currently no single place to expose these variables. Different
> > > platforms use sysfs/platform specific filesystem(efivarfs)/securityfs
> > > interface as they find it appropriate. Thus, there is a gap in kernel
> > > interfaces to expose variables for security features.
> > > 
> > > Define a firmware security filesystem (fwsecurityfs) to be used by
> > > security features enabled by the firmware. These variables are platform
> > > specific. This filesystem provides platforms a way to implement their
> > >   own underlying semantics by defining own inode and file operations.
> > > 
> > > Similar to securityfs, the firmware security filesystem is recommended
> > > to be exposed on a well known mount point /sys/firmware/security.
> > > Platforms can define their own directory or file structure under this 
> > > path.
> > > 
> > > Example:
> > > 
> > > # mount -t fwsecurityfs fwsecurityfs /sys/firmware/security
> > Why not juset use securityfs in /sys/security/firmware/ instead?  Then
> > you don't have to create a new filesystem and convince userspace to
> > mount it in a specific location?

I am also curious why not use securityfs, given the similarity
between the two. :)
More specifics on that below...

> 
> From man 5 sysfs page:
> 
> /sys/firmware: This subdirectory contains interfaces for viewing and
> manipulating firmware-specific objects and attributes.
> 
> /sys/kernel: This subdirectory contains various files and subdirectories
> that provide information about the running kernel.
> 
> The security variables which are being exposed via fwsecurityfs are managed
> by firmware, stored in firmware managed space and also often consumed by
> firmware for enabling various security features.

That's ok. As I see it, users of securityfs can define their own fileops
(like you are doing in fwsecurityfs).
See securityfs_create_file() & securityfs_create_symlink(), which can accept
the fops & iops. The exception is securityfs_create_dir(), presumably because
there has been no usecase for custom directory ops so far. Whether you
actually need that in your case is the question to ask.
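
To make that concrete, a rough sketch (hypothetical names, error handling
trimmed, not taken from the fwsecurityfs patches) of exposing a firmware
variable under securityfs with platform-defined fops, instead of
introducing a new filesystem:

	#include <linux/security.h>
	#include <linux/fs.h>
	#include <linux/err.h>
	#include <linux/init.h>

	static ssize_t secvar_read(struct file *file, char __user *buf,
				   size_t count, loff_t *ppos)
	{
		/* platform-specific read of the firmware variable goes here */
		return 0;
	}

	static const struct file_operations secvar_fops = {
		.read	= secvar_read,
		.llseek	= default_llseek,
	};

	static struct dentry *fw_dir;

	static int __init fw_security_init(void)
	{
		/*
		 * Shows up under the well-known securityfs mount point,
		 * i.e. /sys/kernel/security/firmware/example-var
		 */
		fw_dir = securityfs_create_dir("firmware", NULL);
		if (IS_ERR(fw_dir))
			return PTR_ERR(fw_dir);

		return PTR_ERR_OR_ZERO(securityfs_create_file("example-var", 0600,
							      fw_dir, NULL,
							      &secvar_fops));
	}
	late_initcall(fw_security_init);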

> 
> From git commit b67dbf9d4c1987c370fd18fdc4cf9d8aaea604c2, the purpose of
> securityfs(/sys/kernel/security) is to provide a common place for all kernel
> LSMs. The idea of

Which was then separated out by commit
da31894ed7b654e2 ("securityfs: do not depend on CONFIG_SECURITY").

securityfs now has a separate CONFIG_SECURITYFS config option. In fact, I was
even wondering why we shouldn't move security/inode.c into fs/securityfs/inode.c.
fs/* is the common place for all filesystems. Users of securityfs can call its
exported kernel APIs to create files/dirs/symlinks.

If we move security/inode.c to fs/security/inode.c, then the below call
within securityfs_init() should be moved into some LSM-specific file:

#ifdef CONFIG_SECURITY
static struct dentry *lsm_dentry;
static ssize_t lsm_read(struct file *filp, char __user *buf, size_t count,
loff_t *ppos)
{
return simple_read_from_buffer(buf, count, ppos, lsm_names,
strlen(lsm_names));
}

static const struct file_operations lsm_ops = {
.read = lsm_read,
.llseek = generic_file_llseek,
};
#endif

securityfs_init()

#ifdef CONFIG_SECURITY
lsm_dentry = securityfs_create_file("lsm", 0444, NULL, NULL,
&lsm_ops);
#endif

So why not move it? Maybe others can comment on whether it's a good idea
to move security/inode.c into fs/security/inode.c.
That would make the securityfs filesystem easier for everyone to notice in
fs/security/ and to utilize for their use cases.

> fwsecurityfs(/sys/firmware/security) is to similarly provide a common place
> for all firmware security objects.
> 
> /sys/firmware already exists. The patch now defines a new /security
> directory in it for firmware security features. Using /sys/kernel/security
> would mean scattering firmware objects in multiple places and confusing the
> purpose of /sys/kernel and /sys/firmware.

We can also think of it this way: all security-related exports should
happen via /sys/kernel/security/. Then /sys/kernel/security/firmware/ becomes
the place for security-related firmware exports.

If you run find /sys -iname firmware, I am sure you will find other
firmware-specific directories belonging to other subsystems
(e.g.
root@qemu:/home/qemu# find /sys -iname firmware
/sys/devices/ndbus0/nmem0/firmware
/sys/devices/ndbus0/firmware
/sys/firmware
)

But it could be, I am not an expert here, although I was thinking a

[PATCH] powerpc: declare unmodified attribute_group usages const

2022-02-28 Thread IBM IMAP
From ec1a16a15a86c6224cc0129ab3c2ae9f69f2c7c5 Mon Sep 17 00:00:00 2001
From: Rohan McLure 
Date: Mon, 28 Feb 2022 10:19:19 +1100
Subject: [PATCH] powerpc: declare unmodified attribute_group usages const
To: linuxppc-dev@lists.ozlabs.org

Inspired by commit bd75b4ef4977 ("Constify static attribute_group structs"),
accepted into linux-next:
https://patchwork.ozlabs.org/project/linuxppc-dev/patch/20220210202805.7750-4-rikard.falkeb...@gmail.com/

Nearly all singletons of type struct attribute_group are never modified,
and so are candidates for being const. Declare them as const.

Signed-off-by: Rohan McLure 
---
 arch/powerpc/perf/generic-compat-pmu.c  | 4 ++--
 arch/powerpc/perf/hv-24x7.c | 6 +++---
 arch/powerpc/perf/hv-gpci.c | 8 
 arch/powerpc/perf/imc-pmu.c | 6 +++---
 arch/powerpc/perf/isa207-common.c   | 2 +-
 arch/powerpc/perf/power10-pmu.c | 6 +++---
 arch/powerpc/perf/power7-pmu.c  | 4 ++--
 arch/powerpc/perf/power8-pmu.c  | 4 ++--
 arch/powerpc/perf/power9-pmu.c  | 6 +++---
 arch/powerpc/platforms/cell/cbe_thermal.c   | 4 ++--
 arch/powerpc/platforms/powernv/opal-core.c  | 2 +-
 arch/powerpc/platforms/powernv/opal-dump.c  | 2 +-
 arch/powerpc/platforms/powernv/opal-flash.c | 2 +-
 arch/powerpc/platforms/pseries/papr_scm.c   | 2 +-
 arch/powerpc/platforms/pseries/power.c  | 2 +-
 15 files changed, 30 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/perf/generic-compat-pmu.c b/arch/powerpc/perf/generic-compat-pmu.c
index b6e25f75109d..f3db88aee4dd 100644
--- a/arch/powerpc/perf/generic-compat-pmu.c
+++ b/arch/powerpc/perf/generic-compat-pmu.c
@@ -130,7 +130,7 @@ static struct attribute *generic_compat_events_attr[] = {
NULL
 };
 
-static struct attribute_group generic_compat_pmu_events_group = {
+static const struct attribute_group generic_compat_pmu_events_group = {
.name = "events",
.attrs = generic_compat_events_attr,
 };
@@ -146,7 +146,7 @@ static struct attribute *generic_compat_pmu_format_attr[] = {
NULL,
 };
 
-static struct attribute_group generic_compat_pmu_format_group = {
+static const struct attribute_group generic_compat_pmu_format_group = {
.name = "format",
.attrs = generic_compat_pmu_format_attr,
 };
diff --git a/arch/powerpc/perf/hv-24x7.c b/arch/powerpc/perf/hv-24x7.c
index 1e8aa934e37e..12c1777187fc 100644
--- a/arch/powerpc/perf/hv-24x7.c
+++ b/arch/powerpc/perf/hv-24x7.c
@@ -204,7 +204,7 @@ static struct attribute *format_attrs[] = {
NULL,
 };
 
-static struct attribute_group format_group = {
+static const struct attribute_group format_group = {
.name = "format",
.attrs = format_attrs,
 };
@@ -1148,7 +1148,7 @@ static struct attribute *cpumask_attrs[] = {
NULL,
 };
 
-static struct attribute_group cpumask_attr_group = {
+static const struct attribute_group cpumask_attr_group = {
.attrs = cpumask_attrs,
 };
 
@@ -1162,7 +1162,7 @@ static struct attribute *if_attrs[] = {
NULL,
 };
 
-static struct attribute_group if_group = {
+static const struct attribute_group if_group = {
.name = "interface",
.bin_attrs = if_bin_attrs,
.attrs = if_attrs,
diff --git a/arch/powerpc/perf/hv-gpci.c b/arch/powerpc/perf/hv-gpci.c
index c756228a081f..5eb60ed5b5e8 100644
--- a/arch/powerpc/perf/hv-gpci.c
+++ b/arch/powerpc/perf/hv-gpci.c
@@ -65,12 +65,12 @@ static struct attribute *format_attrs[] = {
NULL,
 };
 
-static struct attribute_group format_group = {
+static const struct attribute_group format_group = {
.name = "format",
.attrs = format_attrs,
 };
 
-static struct attribute_group event_group = {
+static const struct attribute_group event_group = {
.name  = "events",
.attrs = hv_gpci_event_attrs,
 };
@@ -126,11 +126,11 @@ static struct attribute *cpumask_attrs[] = {
NULL,
 };
 
-static struct attribute_group cpumask_attr_group = {
+static const struct attribute_group cpumask_attr_group = {
.attrs = cpumask_attrs,
 };
 
-static struct attribute_group interface_group = {
+static const struct attribute_group interface_group = {
.name = "interface",
.attrs = interface_attrs,
 };
diff --git a/arch/powerpc/perf/imc-pmu.c b/arch/powerpc/perf/imc-pmu.c
index e106909ff9c3..70981a321036 100644
--- a/arch/powerpc/perf/imc-pmu.c
+++ b/arch/powerpc/perf/imc-pmu.c
@@ -71,7 +71,7 @@ static struct attribute *imc_format_attrs[] = {
NULL,
 };
 
-static struct attribute_group imc_format_group = {
+static const struct attribute_group imc_format_group = {
.name = "format",
.attrs = imc_format_attrs,
 };
@@ -90,7 +90,7 @@ static struct attribute *trace_imc_format_attrs[] = {
NULL,
 };
 
-static struct attribute_group trace_imc_format_group = {
+static const struct attribute_group trace_imc_format_group = {
 .name = "format",
 .attrs = trace_imc_format_attrs,
 };
@@ -125