date:20230605

Re: [PATCH] sound: Switch i2c drivers back to use .probe()

2023-06-05 Thread Takashi Iwai

On Thu, 25 May 2023 22:36:40 +0200,
Uwe Kleine-König wrote:
> 
> After commit b8a1a4cd5a98 ("i2c: Provide a temporary .probe_new()
> call-back type"), all drivers being converted to .probe_new() and then
> 03c835f498b5 ("i2c: Switch .probe() to not take an id parameter") convert
> back to (the new) .probe() to be able to eventually drop .probe_new() from
> struct i2c_driver.
> 
> Signed-off-by: Uwe Kleine-König 

Applied now to for-next branch.  Thanks.


Takashi

Re: [RFC 1/1] sched/fair: Consider asymmetric scheduler groups in load balancer

2023-06-05 Thread Tobias Huschle


On 2023-05-16 15:36, Vincent Guittot wrote:
On Mon, 15 May 2023 at 13:46, Tobias Huschle  
wrote:


The current load balancer implementation implies that scheduler 
groups,

within the same domain, all host the same number of CPUs. This is
reflected in the condition, that a scheduler group, which is load
balancing and classified as having spare capacity, should pull work
from the busiest group, if the local group runs less processes than
the busiest one. This implies that these two groups should run the
same number of processes, which is problematic if the groups are not
of the same size.

The assumption that scheduler groups within the same scheduler domain
host the same number of CPUs appears to be true for non-s390
architectures. Nevertheless, s390 can have scheduler groups of unequal
size.

This introduces a performance degredation in the following scenario:

Consider a system with 8 CPUs, 6 CPUs are located on one CPU socket,
the remaining 2 are located on another socket:

Socket   -1--2-
CPU  1 2 3 4 5 67 8

Placing some workload ( x = one task ) yields the following
scenarios:

The first 5 tasks are distributed evenly across the two groups.

Socket   -1--2-
CPU  1 2 3 4 5 67 8
 x x x  x x

Adding a 6th task yields the following distribution:

Socket   -1--2-
CPU  1 2 3 4 5 67 8
SMT1 x x x  x x
SMT2x


Your description is a bit confusing for me. What you name CPU above
should be named Core, doesn' it ?

Could you share with us your scheduler topology ?



You are correct, it should say core instead of CPU.

One actual configuration from one of my test machines (lscpu -e):

CPU NODE DRAWER BOOK SOCKET CORE L1d:L1i:L2 ONLINE CONFIGURED 
POLARIZATION ADDRESS
  00  00  00 0:0:0 yes yeshorizontal 
  0
  10  00  00 1:1:0 yes yeshorizontal 
  1
  20  00  01 2:2:0 yes yeshorizontal 
  2
  30  00  01 3:3:0 yes yeshorizontal 
  3
  40  00  02 4:4:0 yes yeshorizontal 
  4
  50  00  02 5:5:0 yes yeshorizontal 
  5
  60  00  03 6:6:0 yes yeshorizontal 
  6
  70  00  03 7:7:0 yes yeshorizontal 
  7
  80  00  04 8:8:0 yes yeshorizontal 
  8
  90  00  04 9:9:0 yes yeshorizontal 
  9
 100  00  05 10:10:0   yes yeshorizontal 
  10
 110  00  05 11:11:0   yes yeshorizontal 
  11
 120  00  16 12:12:0   yes yeshorizontal 
  12
 130  00  16 13:13:0   yes yeshorizontal 
  13
 140  00  17 14:14:0   yes yeshorizontal 
  14
 150  00  17 15:15:0   yes yeshorizontal 
  15


So, 6 cores / 12 CPUs in one group 2 cores / 4 CPUs in the other.

If I run stress-ng with 8 cpu stressors on the original code I get a 
distribution

like this:

00 01 02 03 04 05 06 07 08 09 10 11  || 12 13 14 15
x x x x  x  x  x  x

Which means that the two cores in the smaller group are running into SMT 
while two
cores in the larger group are still idle. This is caused by the 
prefer_sibling path

which really wants to see both groups run the same number of tasks.



The task is added to the 2nd scheduler group, as the scheduler has the
assumption that scheduler groups are of the same size, so they should
also host the same number of tasks. This makes CPU 7 run into SMT
thread, which comes with a performance penalty. This means, that in
the window of 6-8 tasks, load balancing is done suboptimally, because
SMT is used although there is no reason to do so as fully idle CPUs
are still available.

Taking the weight of the scheduler groups into account, ensures that
a load balancing CPU within a smaller group will not try to pull tasks
from a bigger group while the bigger group still has idle CPUs
available.

Signed-off-by: Tobias Huschle 
---
 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48b6f0ca13ac..b1307d7e4065 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -10426,7 +10426,8 @@ static struct sched_group 
*find_busiest_group(struct lb_env *env)

 * group's child domain.
 */
if (sds.prefer_sibling && local->group_type == group_has_spare 
&&

-   busiest->sum_nr_running > local->sum_nr_running + 1)
+   busiest->sum_nr_running * local->group_weight >
+   local->sum_nr_running * busiest->group_weight 
+ 1)


This is the prefer_sibling path. Could it be that you should disable
prefer_siling between your sockets for s

Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX

2023-06-05 Thread Mike Rapoport

On Sun, Jun 04, 2023 at 10:52:44PM -0400, Steven Rostedt wrote:
> On Thu, 1 Jun 2023 16:54:36 -0700
> Nadav Amit  wrote:
> 
> > > The way text_poke() is used here, it is creating a new writable alias
> > > and flushing it for *each* write to the module (like for each write of
> > > an individual relocation, etc). I was just thinking it might warrant
> > > some batching or something.  

> > I am not advocating to do so, but if you want to have many efficient
> > writes, perhaps you can just disable CR0.WP. Just saying that if you
> > are about to write all over the memory, text_poke() does not provide
> > too much security for the poking thread.

Heh, this is definitely and easier hack to implement :)

> Batching does exist, which is what the text_poke_queue() thing does.

For module loading text_poke_queue() will still be much slower than a bunch
of memset()s for no good reason because we don't need all the complexity of
text_poke_bp_batch() for module initialization because we are sure we are
not patching live code.

What we'd need here is a new batching mode that will create a writable
alias mapping at the beginning of apply_relocate_*() and module_finalize(),
then it will use memcpy() to that writable alias and will tear the mapping
down in the end.

Another option is to teach alternatives to update a writable copy rather
than do in place changes like Song suggested. My feeling is that it will be
more intrusive change though.

> -- Steve
> 

-- 
Sincerely yours,
Mike.

[PATCH] powerpc/signal32: Force inlining of __unsafe_save_user_regs() and save_tm_user_regs_unsafe()

2023-06-05 Thread Christophe Leroy

Looking at generated code for handle_signal32() shows calls to a
function called __unsafe_save_user_regs.constprop.0 while user access
is open.

And that __unsafe_save_user_regs.constprop.0 function has two nops at
the begining, allowing it to be traced, which is unexpected during
user access open window.

The solution could be to mark __unsafe_save_user_regs() no trace, but
to be on the safe side the most efficient is to flag it __always_inline
as already done for function __unsafe_restore_general_regs(). The
function is relatively small and only called twice, so the size
increase will remain in the noise.

Do the same with save_tm_user_regs_unsafe() as it may suffer the
same issue.

Fixes: ef75e7318294 ("powerpc/signal32: Transform save_user_regs() and 
save_tm_user_regs() in 'unsafe' version")
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/signal_32.c | 15 +--
 1 file changed, 9 insertions(+), 6 deletions(-)

diff --git a/arch/powerpc/kernel/signal_32.c b/arch/powerpc/kernel/signal_32.c
index c114c7f25645..7a718ed32b27 100644
--- a/arch/powerpc/kernel/signal_32.c
+++ b/arch/powerpc/kernel/signal_32.c
@@ -264,8 +264,9 @@ static void prepare_save_user_regs(int ctx_has_vsx_region)
 #endif
 }
 
-static int __unsafe_save_user_regs(struct pt_regs *regs, struct mcontext 
__user *frame,
-  struct mcontext __user *tm_frame, int 
ctx_has_vsx_region)
+static __always_inline int
+__unsafe_save_user_regs(struct pt_regs *regs, struct mcontext __user *frame,
+   struct mcontext __user *tm_frame, int 
ctx_has_vsx_region)
 {
unsigned long msr = regs->msr;
 
@@ -364,8 +365,9 @@ static void prepare_save_tm_user_regs(void)
current->thread.ckvrsave = mfspr(SPRN_VRSAVE);
 }
 
-static int save_tm_user_regs_unsafe(struct pt_regs *regs, struct mcontext 
__user *frame,
-   struct mcontext __user *tm_frame, unsigned 
long msr)
+static __always_inline int
+save_tm_user_regs_unsafe(struct pt_regs *regs, struct mcontext __user *frame,
+struct mcontext __user *tm_frame, unsigned long msr)
 {
/* Save both sets of general registers */
unsafe_save_general_regs(¤t->thread.ckpt_regs, frame, failed);
@@ -444,8 +446,9 @@ static int save_tm_user_regs_unsafe(struct pt_regs *regs, 
struct mcontext __user
 #else
 static void prepare_save_tm_user_regs(void) { }
 
-static int save_tm_user_regs_unsafe(struct pt_regs *regs, struct mcontext 
__user *frame,
-   struct mcontext __user *tm_frame, unsigned 
long msr)
+static __always_inline int
+save_tm_user_regs_unsafe(struct pt_regs *regs, struct mcontext __user *frame,
+struct mcontext __user *tm_frame, unsigned long msr)
 {
return 0;
 }
-- 
2.40.1

[PATCH] powerpc/interrupt: Don't read MSR from interrupt_exit_kernel_prepare()

2023-06-05 Thread Christophe Leroy

A disassembly of interrupt_exit_kernel_prepare() shows a useless read
of MSR register. This is shown by r9 being re-used immediately without
doing anything with the value read.

  c000e0e0:   60 00 00 00 nop
  c000e0e4:   7d 3a c2 a6 mfmd_ap r9
  c000e0e8:   7d 20 00 a6 mfmsr   r9
  c000e0ec:   7c 51 13 a6 mtspr   81,r2
  c000e0f0:   81 3f 00 84 lwz r9,132(r31)
  c000e0f4:   71 29 80 00 andi.   r9,r9,32768

This is due to the use of local_irq_save(). The flags read by
local_irq_save() are never used, use local_irq_disable() instead.

Fixes: 13799748b957 ("powerpc/64: use interrupt restart table to speed up 
return from interrupt")
Signed-off-by: Christophe Leroy 
---
 arch/powerpc/kernel/interrupt.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/kernel/interrupt.c b/arch/powerpc/kernel/interrupt.c
index e34c72285b4e..f3fc5fe919d9 100644
--- a/arch/powerpc/kernel/interrupt.c
+++ b/arch/powerpc/kernel/interrupt.c
@@ -368,7 +368,6 @@ void preempt_schedule_irq(void);
 
 notrace unsigned long interrupt_exit_kernel_prepare(struct pt_regs *regs)
 {
-   unsigned long flags;
unsigned long ret = 0;
unsigned long kuap;
bool stack_store = read_thread_flags() & _TIF_EMULATE_STACK_STORE;
@@ -392,7 +391,7 @@ notrace unsigned long interrupt_exit_kernel_prepare(struct 
pt_regs *regs)
 
kuap = kuap_get_and_assert_locked();
 
-   local_irq_save(flags);
+   local_irq_disable();
 
if (!arch_irq_disabled_regs(regs)) {
/* Returning to a kernel context with local irqs enabled. */
-- 
2.40.1

Re: [PATCH 00/13] mm: jit/text allocator

2023-06-05 Thread Mike Rapoport

On Fri, Jun 02, 2023 at 10:35:09AM +0100, Mark Rutland wrote:
> On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote:
> > On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote:
> > > For a while I have wanted to give kprobes its own allocator so that it 
> > > can work
> > > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA space 
> > > in
> > > the modules area.
> > > 
> > > Given that, I think these should have their own allocator functions that 
> > > can be
> > > provided independently, even if those happen to use common infrastructure.
> > 
> > How much memory can kprobes conceivably use? I think we also want to try
> > to push back on combinatorial new allocators, if we can.
> 
> That depends on who's using it, and how (e.g. via BPF).
> 
> To be clear, I'm not necessarily asking for entirely different allocators, but
> I do thinkg that we want wrappers that can at least pass distinct start+end
> parameters to a common allocator, and for arm64's modules code I'd expect that
> we'd keep the range falblack logic out of the common allcoator, and just call
> it twice.
> 
> > > > Several architectures override module_alloc() because of various
> > > > constraints where the executable memory can be located and this causes
> > > > additional obstacles for improvements of code allocation.
> > > > 
> > > > This set splits code allocation from modules by introducing
> > > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> > > > sites of module_alloc() and module_memfree() with the new APIs and
> > > > implements core text and related allocation in a central place.
> > > > 
> > > > Instead of architecture specific overrides for module_alloc(), the
> > > > architectures that require non-default behaviour for text allocation 
> > > > must
> > > > fill jit_alloc_params structure and implement jit_alloc_arch_params() 
> > > > that
> > > > returns a pointer to that structure. If an architecture does not 
> > > > implement
> > > > jit_alloc_arch_params(), the defaults compatible with the current
> > > > modules::module_alloc() are used.
> > > 
> > > As above, I suspect that each of the callsites should probably be using 
> > > common
> > > infrastructure, but I don't think that a single jit_alloc_arch_params() 
> > > makes
> > > sense, since the parameters for each case may need to be distinct.
> > 
> > I don't see how that follows. The whole point of function parameters is
> > that they may be different :)
> 
> What I mean is that jit_alloc_arch_params() tries to aggregate common
> parameters, but they aren't actually common (e.g. the actual start+end range
> for allocation).

jit_alloc_arch_params() tries to aggregate architecture constraints and
requirements for allocations of executable memory and this exactly what
the first 6 patches of this set do.

A while ago Thomas suggested to use a structure that parametrizes
architecture constraints by the memory type used in modules [1] and Song
implemented the infrastructure for it and x86 part [2].

I liked the idea of defining parameters in a single structure, but I
thought that approaching the problem from the arch side rather than from
modules perspective will be better starting point, hence these patches.

I don't see a fundamental reason why a single structure cannot describe
what is needed for different code allocation cases, be it modules, kprobes
or bpf. There is of course an assumption that the core allocations will be
the same for all the users, and it seems to me that something like 

* allocate physical memory if allocator caches are empty
* map it in vmalloc or modules address space
* return memory from the allocator cache to the caller

will work for all usecases.

We might need separate caches for different cases on different
architectures, and a way to specify what cache should be used in the
allocator API, but that does not contradict a single structure for arch
specific parameters, but only makes it more elaborate, e.g. something like

enum jit_type {
JIT_MODULES_TEXT,
JIT_MODULES_DATA,
JIT_KPROBES,
JIT_FTRACE,
JIT_BPF,
JIT_TYPE_MAX,
};

struct jit_alloc_params {
struct jit_rangeranges[JIT_TYPE_MAX];
/* ... */
};

> > Can you give more detail on what parameters you need? If the only extra
> > parameter is just "does this allocation need to live close to kernel
> > text", that's not that big of a deal.
> 
> My thinking was that we at least need the start + end for each caller. That
> might be it, tbh.

Do you mean that modules will have something like

jit_text_alloc(size, MODULES_START, MODULES_END);

and kprobes will have

jit_text_alloc(size, KPROBES_START, KPROBES_END);
?

It sill can be achieved with a single jit_alloc_arch_params(), just by
adding enum jit_type parameter to jit_text_alloc().

[1] https://lore.kernel.org/linux-mm/87v8mndy3y.ffs@tglx/ 
[2] https://lore.kernel.org/all/20230526051529.338710

Re: [PATCH 00/13] mm: jit/text allocator

2023-06-05 Thread Mark Rutland

On Mon, Jun 05, 2023 at 12:20:40PM +0300, Mike Rapoport wrote:
> On Fri, Jun 02, 2023 at 10:35:09AM +0100, Mark Rutland wrote:
> > On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote:
> > > On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote:
> > > > For a while I have wanted to give kprobes its own allocator so that it 
> > > > can work
> > > > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA 
> > > > space in
> > > > the modules area.
> > > > 
> > > > Given that, I think these should have their own allocator functions 
> > > > that can be
> > > > provided independently, even if those happen to use common 
> > > > infrastructure.
> > > 
> > > How much memory can kprobes conceivably use? I think we also want to try
> > > to push back on combinatorial new allocators, if we can.
> > 
> > That depends on who's using it, and how (e.g. via BPF).
> > 
> > To be clear, I'm not necessarily asking for entirely different allocators, 
> > but
> > I do thinkg that we want wrappers that can at least pass distinct start+end
> > parameters to a common allocator, and for arm64's modules code I'd expect 
> > that
> > we'd keep the range falblack logic out of the common allcoator, and just 
> > call
> > it twice.
> > 
> > > > > Several architectures override module_alloc() because of various
> > > > > constraints where the executable memory can be located and this causes
> > > > > additional obstacles for improvements of code allocation.
> > > > > 
> > > > > This set splits code allocation from modules by introducing
> > > > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> > > > > sites of module_alloc() and module_memfree() with the new APIs and
> > > > > implements core text and related allocation in a central place.
> > > > > 
> > > > > Instead of architecture specific overrides for module_alloc(), the
> > > > > architectures that require non-default behaviour for text allocation 
> > > > > must
> > > > > fill jit_alloc_params structure and implement jit_alloc_arch_params() 
> > > > > that
> > > > > returns a pointer to that structure. If an architecture does not 
> > > > > implement
> > > > > jit_alloc_arch_params(), the defaults compatible with the current
> > > > > modules::module_alloc() are used.
> > > > 
> > > > As above, I suspect that each of the callsites should probably be using 
> > > > common
> > > > infrastructure, but I don't think that a single jit_alloc_arch_params() 
> > > > makes
> > > > sense, since the parameters for each case may need to be distinct.
> > > 
> > > I don't see how that follows. The whole point of function parameters is
> > > that they may be different :)
> > 
> > What I mean is that jit_alloc_arch_params() tries to aggregate common
> > parameters, but they aren't actually common (e.g. the actual start+end range
> > for allocation).
> 
> jit_alloc_arch_params() tries to aggregate architecture constraints and
> requirements for allocations of executable memory and this exactly what
> the first 6 patches of this set do.
> 
> A while ago Thomas suggested to use a structure that parametrizes
> architecture constraints by the memory type used in modules [1] and Song
> implemented the infrastructure for it and x86 part [2].
> 
> I liked the idea of defining parameters in a single structure, but I
> thought that approaching the problem from the arch side rather than from
> modules perspective will be better starting point, hence these patches.
> 
> I don't see a fundamental reason why a single structure cannot describe
> what is needed for different code allocation cases, be it modules, kprobes
> or bpf. There is of course an assumption that the core allocations will be
> the same for all the users, and it seems to me that something like 
> 
> * allocate physical memory if allocator caches are empty
> * map it in vmalloc or modules address space
> * return memory from the allocator cache to the caller
> 
> will work for all usecases.
> 
> We might need separate caches for different cases on different
> architectures, and a way to specify what cache should be used in the
> allocator API, but that does not contradict a single structure for arch
> specific parameters, but only makes it more elaborate, e.g. something like
> 
> enum jit_type {
>   JIT_MODULES_TEXT,
>   JIT_MODULES_DATA,
>   JIT_KPROBES,
>   JIT_FTRACE,
>   JIT_BPF,
>   JIT_TYPE_MAX,
> };
> 
> struct jit_alloc_params {
>   struct jit_rangeranges[JIT_TYPE_MAX];
>   /* ... */
> };
> 
> > > Can you give more detail on what parameters you need? If the only extra
> > > parameter is just "does this allocation need to live close to kernel
> > > text", that's not that big of a deal.
> > 
> > My thinking was that we at least need the start + end for each caller. That
> > might be it, tbh.
> 
> Do you mean that modules will have something like
> 
>   jit_text_alloc(size, MODULES_START, MODULES_END);
> 
> and kprobes will have
> 
>

[PATCH 1/4] powerpc/kuap: Avoid unnecessary reads of MD_AP

2023-06-05 Thread Christophe Leroy

A disassembly of interrupt_exit_kernel_prepare() shows a useless read
of MD_AP register. This is shown by r9 being re-used immediately without
doing anything with the value read.

  c000e0e0:   60 00 00 00 nop
  c000e0e4: ===>  7d 3a c2 a6 mfmd_ap r9<
  c000e0e8:   7d 20 00 a6 mfmsr   r9
  c000e0ec:   7c 51 13 a6 mtspr   81,r2
  c000e0f0:   81 3f 00 84 lwz r9,132(r31)
  c000e0f4:   71 29 80 00 andi.   r9,r9,32768

kuap_get_and_assert_locked() is paired with kuap_kernel_restore()
and are only used in interrupt_exit_kernel_prepare(). The value
returned by kuap_get_and_assert_locked() is only used by
kuap_kernel_restore().

On 8xx, kuap_kernel_restore() doesn't use the value read by
kuap_get_and_assert_locked() so modify kuap_get_and_assert_locked()
to not perform the read of MD_AP and return 0 instead.

The same applies on BOOKE.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/nohash/32/kup-8xx.h | 8 ++--
 arch/powerpc/include/asm/nohash/kup-booke.h  | 6 ++
 2 files changed, 4 insertions(+), 10 deletions(-)

diff --git a/arch/powerpc/include/asm/nohash/32/kup-8xx.h 
b/arch/powerpc/include/asm/nohash/32/kup-8xx.h
index c44d97751723..8579210f2a6a 100644
--- a/arch/powerpc/include/asm/nohash/32/kup-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/kup-8xx.h
@@ -41,14 +41,10 @@ static inline void __kuap_kernel_restore(struct pt_regs 
*regs, unsigned long kua
 
 static inline unsigned long __kuap_get_and_assert_locked(void)
 {
-   unsigned long kuap;
-
-   kuap = mfspr(SPRN_MD_AP);
-
if (IS_ENABLED(CONFIG_PPC_KUAP_DEBUG))
-   WARN_ON_ONCE(kuap >> 16 != MD_APG_KUAP >> 16);
+   WARN_ON_ONCE(mfspr(SPRN_MD_AP) >> 16 != MD_APG_KUAP >> 16);
 
-   return kuap;
+   return 0;
 }
 
 static inline void __allow_user_access(void __user *to, const void __user 
*from,
diff --git a/arch/powerpc/include/asm/nohash/kup-booke.h 
b/arch/powerpc/include/asm/nohash/kup-booke.h
index 49bb41ed0816..823c5a3a96d8 100644
--- a/arch/powerpc/include/asm/nohash/kup-booke.h
+++ b/arch/powerpc/include/asm/nohash/kup-booke.h
@@ -58,12 +58,10 @@ static inline void __kuap_kernel_restore(struct pt_regs 
*regs, unsigned long kua
 
 static inline unsigned long __kuap_get_and_assert_locked(void)
 {
-   unsigned long kuap = mfspr(SPRN_PID);
-
if (IS_ENABLED(CONFIG_PPC_KUAP_DEBUG))
-   WARN_ON_ONCE(kuap);
+   WARN_ON_ONCE(mfspr(SPRN_PID));
 
-   return kuap;
+   return 0;
 }
 
 static inline void __allow_user_access(void __user *to, const void __user 
*from,
-- 
2.40.1

[PATCH 2/4] powerpc/kuap: Avoid useless jump_label on empty function

2023-06-05 Thread Christophe Leroy

Disassembly of interrupt_enter_prepare() shows a pointless nop before the mftb

  c000abf0 :
  c000abf0:   81 23 00 84 lwz r9,132(r3)
  c000abf4:   71 29 40 00 andi.   r9,r9,16384
  c000abf8:   41 82 00 28 beq-c000ac20 

  c000abfc: ===>  60 00 00 00 nop   <
  c000ac00:   7d 0c 42 e6 mftbr8
  c000ac04:   80 e2 00 08 lwz r7,8(r2)
  c000ac08:   81 22 00 28 lwz r9,40(r2)
  c000ac0c:   91 02 00 24 stw r8,36(r2)
  c000ac10:   7d 29 38 50 subfr9,r9,r7
  c000ac14:   7d 29 42 14 add r9,r9,r8
  c000ac18:   91 22 00 08 stw r9,8(r2)
  c000ac1c:   4e 80 00 20 blr
  c000ac20:   60 00 00 00 nop
  c000ac24:   7d 5a c2 a6 mfmd_ap r10
  c000ac28:   3d 20 de 00 lis r9,-8704
  c000ac2c:   91 43 00 b0 stw r10,176(r3)
  c000ac30:   7d 3a c3 a6 mtspr   794,r9
  c000ac34:   4e 80 00 20 blr

That comes from the call to kuap_loc(), allthough __kuap_lock() is an empty
function on the 8xx.

To avoid that, only perform kuap_is_disabled() check when there is something
to do with __kuap_lock().

Do the same with __kuap_save_and_lock() and __kuap_get_and_assert_locked().

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/book3s/32/kup.h |  6 ++---
 arch/powerpc/include/asm/book3s/64/kup.h | 10 +---
 arch/powerpc/include/asm/kup.h   | 25 ++--
 arch/powerpc/include/asm/nohash/32/kup-8xx.h | 11 -
 arch/powerpc/include/asm/nohash/kup-booke.h  |  8 +--
 5 files changed, 26 insertions(+), 34 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/32/kup.h 
b/arch/powerpc/include/asm/book3s/32/kup.h
index 678f9c9d89b6..466a19cfb4df 100644
--- a/arch/powerpc/include/asm/book3s/32/kup.h
+++ b/arch/powerpc/include/asm/book3s/32/kup.h
@@ -77,10 +77,6 @@ static inline void kuap_unlock(unsigned long addr, bool ool)
kuap_unlock_all_ool();
 }
 
-static inline void __kuap_lock(void)
-{
-}
-
 static inline void __kuap_save_and_lock(struct pt_regs *regs)
 {
unsigned long kuap = current->thread.kuap;
@@ -92,6 +88,7 @@ static inline void __kuap_save_and_lock(struct pt_regs *regs)
current->thread.kuap = KUAP_NONE;
kuap_lock_addr(kuap, false);
 }
+#define __kuap_save_and_lock __kuap_save_and_lock
 
 static inline void kuap_user_restore(struct pt_regs *regs)
 {
@@ -120,6 +117,7 @@ static inline unsigned long 
__kuap_get_and_assert_locked(void)
 
return kuap;
 }
+#define __kuap_get_and_assert_locked __kuap_get_and_assert_locked
 
 static __always_inline void __allow_user_access(void __user *to, const void 
__user *from,
u32 size, unsigned long dir)
diff --git a/arch/powerpc/include/asm/book3s/64/kup.h 
b/arch/powerpc/include/asm/book3s/64/kup.h
index 54cf46808157..1b0215ff3710 100644
--- a/arch/powerpc/include/asm/book3s/64/kup.h
+++ b/arch/powerpc/include/asm/book3s/64/kup.h
@@ -297,15 +297,7 @@ static inline unsigned long 
__kuap_get_and_assert_locked(void)
WARN_ON_ONCE(amr != AMR_KUAP_BLOCKED);
return amr;
 }
-
-/* Do nothing, book3s/64 does that in ASM */
-static inline void __kuap_lock(void)
-{
-}
-
-static inline void __kuap_save_and_lock(struct pt_regs *regs)
-{
-}
+#define __kuap_get_and_assert_locked __kuap_get_and_assert_locked
 
 /*
  * We support individually allowing read or write, but we don't support nesting
diff --git a/arch/powerpc/include/asm/kup.h b/arch/powerpc/include/asm/kup.h
index d751ddd08110..0a4e07175612 100644
--- a/arch/powerpc/include/asm/kup.h
+++ b/arch/powerpc/include/asm/kup.h
@@ -52,16 +52,9 @@ __bad_kuap_fault(struct pt_regs *regs, unsigned long 
address, bool is_write)
return false;
 }
 
-static inline void __kuap_lock(void) { }
-static inline void __kuap_save_and_lock(struct pt_regs *regs) { }
 static inline void kuap_user_restore(struct pt_regs *regs) { }
 static inline void __kuap_kernel_restore(struct pt_regs *regs, unsigned long 
amr) { }
 
-static inline unsigned long __kuap_get_and_assert_locked(void)
-{
-   return 0;
-}
-
 /*
  * book3s/64/kup-radix.h defines these functions for the !KUAP case to flush
  * the L1D cache after user accesses. Only include the empty stubs for other
@@ -87,27 +80,32 @@ bad_kuap_fault(struct pt_regs *regs, unsigned long address, 
bool is_write)
 
 static __always_inline void kuap_assert_locked(void)
 {
+#if defined(CONFIG_PPC_KUAP_DEBUG) && defined(__kuap_get_and_assert_locked)
if (kuap_is_disabled())
return;
 
-   if (IS_ENABLED(CONFIG_PPC_KUAP_DEBUG))
-   __kuap_get_and_assert_locked();
+   __kuap_get_and_assert_locked();
+#endif
 }
 
 static __always_inline void kuap_lock(void)
 {
+#ifdef __kuap_lock
if (kuap_is_disabled())
return;
 
__kuap_lock();
+#endif
 }
 
 static __always_inline void kuap_save_and_lock(struct pt_regs *regs)

[PATCH 3/4] powerpc/kuap: Refactor static branch for disabling kuap

2023-06-05 Thread Christophe Leroy

All but book3s/64 use a static branch key for disabling kuap.
book3s/64 uses a memory feature.

Refactor all targets except book3s/64.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/book3s/32/kup.h |  7 ---
 arch/powerpc/include/asm/book3s/64/kup.h |  1 +
 arch/powerpc/include/asm/kup.h   | 15 +++
 arch/powerpc/include/asm/nohash/32/kup-8xx.h |  7 ---
 arch/powerpc/include/asm/nohash/kup-booke.h  |  7 ---
 arch/powerpc/mm/book3s32/kuap.c  |  3 ---
 arch/powerpc/mm/init-common.c|  3 +++
 arch/powerpc/mm/nohash/kup.c |  3 ---
 8 files changed, 19 insertions(+), 27 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/32/kup.h 
b/arch/powerpc/include/asm/book3s/32/kup.h
index 466a19cfb4df..8da9997a67ba 100644
--- a/arch/powerpc/include/asm/book3s/32/kup.h
+++ b/arch/powerpc/include/asm/book3s/32/kup.h
@@ -11,8 +11,6 @@
 
 #include 
 
-extern struct static_key_false disable_kuap_key;
-
 static __always_inline bool kuep_is_disabled(void)
 {
return !IS_ENABLED(CONFIG_PPC_KUEP);
@@ -25,11 +23,6 @@ static __always_inline bool kuep_is_disabled(void)
 #define KUAP_NONE  (~0UL)
 #define KUAP_ALL   (~1UL)
 
-static __always_inline bool kuap_is_disabled(void)
-{
-   return static_branch_unlikely(&disable_kuap_key);
-}
-
 static inline void kuap_lock_one(unsigned long addr)
 {
mtsr(mfsr(addr) | SR_KS, addr);
diff --git a/arch/powerpc/include/asm/book3s/64/kup.h 
b/arch/powerpc/include/asm/book3s/64/kup.h
index 1b0215ff3710..f8b8e93c488c 100644
--- a/arch/powerpc/include/asm/book3s/64/kup.h
+++ b/arch/powerpc/include/asm/book3s/64/kup.h
@@ -233,6 +233,7 @@ static __always_inline bool kuap_is_disabled(void)
 {
return !mmu_has_feature(MMU_FTR_BOOK3S_KUAP);
 }
+#define kuap_is_disabled kuap_is_disabled
 
 static inline void kuap_user_restore(struct pt_regs *regs)
 {
diff --git a/arch/powerpc/include/asm/kup.h b/arch/powerpc/include/asm/kup.h
index 0a4e07175612..74b7f4cee2ed 100644
--- a/arch/powerpc/include/asm/kup.h
+++ b/arch/powerpc/include/asm/kup.h
@@ -6,6 +6,12 @@
 #define KUAP_WRITE 2
 #define KUAP_READ_WRITE(KUAP_READ | KUAP_WRITE)
 
+#ifndef __ASSEMBLY__
+#include 
+
+static __always_inline bool kuap_is_disabled(void);
+#endif
+
 #ifdef CONFIG_PPC_BOOK3S_64
 #include 
 #endif
@@ -41,6 +47,15 @@ void setup_kuep(bool disabled);
 
 #ifdef CONFIG_PPC_KUAP
 void setup_kuap(bool disabled);
+
+#ifndef kuap_is_disabled
+extern struct static_key_false disable_kuap_key;
+
+static __always_inline bool kuap_is_disabled(void)
+{
+   return static_branch_unlikely(&disable_kuap_key);
+}
+#endif
 #else
 static inline void setup_kuap(bool disabled) { }
 
diff --git a/arch/powerpc/include/asm/nohash/32/kup-8xx.h 
b/arch/powerpc/include/asm/nohash/32/kup-8xx.h
index a372cd822887..1d53f38c5cd5 100644
--- a/arch/powerpc/include/asm/nohash/32/kup-8xx.h
+++ b/arch/powerpc/include/asm/nohash/32/kup-8xx.h
@@ -13,13 +13,6 @@
 
 #include 
 
-extern struct static_key_false disable_kuap_key;
-
-static __always_inline bool kuap_is_disabled(void)
-{
-   return static_branch_unlikely(&disable_kuap_key);
-}
-
 static inline void __kuap_save_and_lock(struct pt_regs *regs)
 {
regs->kuap = mfspr(SPRN_MD_AP);
diff --git a/arch/powerpc/include/asm/nohash/kup-booke.h 
b/arch/powerpc/include/asm/nohash/kup-booke.h
index 71182cbe20c3..07759ae9117b 100644
--- a/arch/powerpc/include/asm/nohash/kup-booke.h
+++ b/arch/powerpc/include/asm/nohash/kup-booke.h
@@ -18,13 +18,6 @@
 
 #include 
 
-extern struct static_key_false disable_kuap_key;
-
-static __always_inline bool kuap_is_disabled(void)
-{
-   return static_branch_unlikely(&disable_kuap_key);
-}
-
 static inline void __kuap_lock(void)
 {
mtspr(SPRN_PID, 0);
diff --git a/arch/powerpc/mm/book3s32/kuap.c b/arch/powerpc/mm/book3s32/kuap.c
index 28676cabb005..c5484729b595 100644
--- a/arch/powerpc/mm/book3s32/kuap.c
+++ b/arch/powerpc/mm/book3s32/kuap.c
@@ -3,9 +3,6 @@
 #include 
 #include 
 
-struct static_key_false disable_kuap_key;
-EXPORT_SYMBOL(disable_kuap_key);
-
 void kuap_lock_all_ool(void)
 {
kuap_lock_all();
diff --git a/arch/powerpc/mm/init-common.c b/arch/powerpc/mm/init-common.c
index 119ef491f797..74e140b1efef 100644
--- a/arch/powerpc/mm/init-common.c
+++ b/arch/powerpc/mm/init-common.c
@@ -32,6 +32,9 @@ EXPORT_SYMBOL_GPL(kernstart_virt_addr);
 bool disable_kuep = !IS_ENABLED(CONFIG_PPC_KUEP);
 bool disable_kuap = !IS_ENABLED(CONFIG_PPC_KUAP);
 
+struct static_key_false disable_kuap_key;
+EXPORT_SYMBOL(disable_kuap_key);
+
 static int __init parse_nosmep(char *p)
 {
if (!IS_ENABLED(CONFIG_PPC_BOOK3S_64))
diff --git a/arch/powerpc/mm/nohash/kup.c b/arch/powerpc/mm/nohash/kup.c
index 552becf90e97..4e22adfa2aa8 100644
--- a/arch/powerpc/mm/nohash/kup.c
+++ b/arch/powerpc/mm/nohash/kup.c
@@ -13,9 +13,6 @@
 #include 
 
 #ifdef CONFIG_PPC_KUAP
-struct static_key_false disable_kuap_key;
-EXPORT_SYMBOL(

[PATCH 4/4] powerpc/kuap: Make disabling KUAP at boottime optional

2023-06-05 Thread Christophe Leroy

It is possible to disable KUAP at boottime with 'nosmap' parameter.

That is implemented with jump_label hence adds a 'nop' in front
of each open/close of userspace access.

>From a security point of view it makes sence to disallow disabling
KUAP. And on processors like the 8xx where 'nop' is not seamless,
it saves a few cycles.

So add a CONFIG item to make it optionnal.

Signed-off-by: Christophe Leroy 
---
 arch/powerpc/include/asm/kup.h |  2 +-
 arch/powerpc/mm/init-common.c  |  3 +++
 arch/powerpc/platforms/Kconfig.cputype | 10 ++
 3 files changed, 14 insertions(+), 1 deletion(-)

diff --git a/arch/powerpc/include/asm/kup.h b/arch/powerpc/include/asm/kup.h
index 74b7f4cee2ed..f3280169aeec 100644
--- a/arch/powerpc/include/asm/kup.h
+++ b/arch/powerpc/include/asm/kup.h
@@ -53,7 +53,7 @@ extern struct static_key_false disable_kuap_key;
 
 static __always_inline bool kuap_is_disabled(void)
 {
-   return static_branch_unlikely(&disable_kuap_key);
+   return IS_ENABLED(CONFIG_PPC_KUAP_BOOTTIME) && 
static_branch_unlikely(&disable_kuap_key);
 }
 #endif
 #else
diff --git a/arch/powerpc/mm/init-common.c b/arch/powerpc/mm/init-common.c
index 74e140b1efef..994ee58f0092 100644
--- a/arch/powerpc/mm/init-common.c
+++ b/arch/powerpc/mm/init-common.c
@@ -48,6 +48,9 @@ early_param("nosmep", parse_nosmep);
 
 static int __init parse_nosmap(char *p)
 {
+   if (!IS_ENABLED(CONFIG_PPC_KUAP_BOOTTIME))
+   return 0;
+
disable_kuap = true;
pr_warn("Disabling Kernel Userspace Access Protection\n");
return 0;
diff --git a/arch/powerpc/platforms/Kconfig.cputype 
b/arch/powerpc/platforms/Kconfig.cputype
index 45fd975ef521..f75c2d5cd182 100644
--- a/arch/powerpc/platforms/Kconfig.cputype
+++ b/arch/powerpc/platforms/Kconfig.cputype
@@ -502,6 +502,16 @@ config PPC_KUAP
 
  If you're unsure, say Y.
 
+config PPC_KUAP_BOOTTIME
+   bool "Allow disabling Kernel Userspace Access Protection at boottime"
+   depends on PPC_KUAP
+   default y
+   help
+ Allow the user to disable Kernel Userspace Access Protection (KUAP)
+ at boot time using 'nosmap' kernel parameter.
+
+ If you're unsure, say Y.
+
 config PPC_KUAP_DEBUG
bool "Extra debugging for Kernel Userspace Access Protection"
depends on PPC_KUAP
-- 
2.40.1

[PATCH v3 1/2] powerpc/legacy_serial: Warn about 8250 devices operated without active FSL workarounds

2023-06-05 Thread Uwe Kleine-König

If the 8250 driver is built as a module (or built-in without console
support) the Freescale specific workaround were silently not activated.
Add a warning in this case.

Currently CONFIG_SERIAL_8250_FSL=y implies that the function
fsl8250_handle_irq() is built-in and can be used. However with the
changes of the next commit CONFIG_SERIAL_8250_FSL might be enabled also
when the 8250 driver is a module and so more care is needed when
fsl8250_handle_irq() is to be used. The code added here is able to
handle the new situation already.

Signed-off-by: Uwe Kleine-König 
---
 arch/powerpc/kernel/legacy_serial.c | 14 +-
 1 file changed, 9 insertions(+), 5 deletions(-)

diff --git a/arch/powerpc/kernel/legacy_serial.c 
b/arch/powerpc/kernel/legacy_serial.c
index c9ad12461d44..fdbd85aafeb1 100644
--- a/arch/powerpc/kernel/legacy_serial.c
+++ b/arch/powerpc/kernel/legacy_serial.c
@@ -508,12 +508,16 @@ static void __init fixup_port_irq(int index,
 
port->irq = virq;
 
-#ifdef CONFIG_SERIAL_8250_FSL
-   if (of_device_is_compatible(np, "fsl,ns16550")) {
-   port->handle_irq = fsl8250_handle_irq;
-   port->has_sysrq = IS_ENABLED(CONFIG_SERIAL_8250_CONSOLE);
+   if (IS_ENABLED(CONFIG_SERIAL_8250) &&
+   of_device_is_compatible(np, "fsl,ns16550")) {
+   if (IS_REACHABLE(CONFIG_SERIAL_8250)) {
+   port->handle_irq = fsl8250_handle_irq;
+   port->has_sysrq = 
IS_ENABLED(CONFIG_SERIAL_8250_CONSOLE);
+   } else {
+   pr_warn_once("Not activating Freescale specific 
workaround for device %pOFP\n",
+np);
+   }
}
-#endif
 }
 
 static void __init fixup_port_pio(int index,
-- 
2.39.2

[PATCH v3 0/2] serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE

2023-06-05 Thread Uwe Kleine-König

Hello,

this is v3 of the series that now also copes for
arch/powerpc/kernel/legacy_serial.c using fsl8250_handle_irq().

For kernel configurations that already before were correctly using the
Freescale workarounds, this is the case with this series applied, too.
So in all cases the situation doesn't get worse. The upside is that even
with the 8250 driver compiled as a module (or built-in but without
console support) the workarounds are now applied for all devices but for
the ones instantiated in arch/powerpc/kernel/legacy_serial.c. (And even
for these there might not be a problem as they might benefit from
enabling the workarounds in drivers/tty/serial/8250/8250_of.c. Not sure
though.)

Patch #1 is new here. Patch #2 only changed lightly: I restored
alphabetic order in drivers/tty/serial/8250/Kconfig.

As patch #1 is needed for patch #2 to not introduce a build failure,
both patches should be taken together. I suggest to add them to Greg's
serial tree, but the changes pending there should not conflict with this
series such that taking them both via powerpc works, too.

Best regards
Uwe

Uwe Kleine-König (2):
  powerpc/legacy_serial: Warn about 8250 devices operated without active
FSL workarounds
  serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE

 arch/powerpc/kernel/legacy_serial.c | 14 +-
 drivers/tty/serial/8250/Kconfig |  2 +-
 drivers/tty/serial/8250/Makefile|  2 +-
 3 files changed, 11 insertions(+), 7 deletions(-)


base-commit: ac9a78681b921877518763ba0e89202254349d1b
-- 
2.39.2

[PATCH v3 2/2] serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE

2023-06-05 Thread Uwe Kleine-König

The need to handle the FSL variant of 8250 in a special way is also
present without console support. So soften the dependency for
SERIAL_8250_FSL accordingly. Note that with the 8250 driver compiled as
a module, some devices still might not make use of the needed
workarounds. That affects the ports instantiated in
arch/powerpc/kernel/legacy_serial.c.

This issue was identified by Dominik Andreas Schorpp.

To cope for CONFIG_SERIAL_8250=m + CONFIG_SERIAL_8250_FSL=y, 8250_fsl.o
must be put in the same compilation unit as 8250_port.o because the
latter defines some functions needed in the former and so 8250_fsl.o
must not be built-in if 8250_port.o is available in a module.

Acked-by: Ilpo Järvinen 
Link: 
https://lore.kernel.org/r/20230531083230.2702181-1-u.kleine-koe...@pengutronix.de
Signed-off-by: Uwe Kleine-König 
---
 drivers/tty/serial/8250/Kconfig  | 2 +-
 drivers/tty/serial/8250/Makefile | 2 +-
 2 files changed, 2 insertions(+), 2 deletions(-)

diff --git a/drivers/tty/serial/8250/Kconfig b/drivers/tty/serial/8250/Kconfig
index 5313aa31930f..10c09b19c871 100644
--- a/drivers/tty/serial/8250/Kconfig
+++ b/drivers/tty/serial/8250/Kconfig
@@ -378,7 +378,7 @@ config SERIAL_8250_BCM2835AUX
 
 config SERIAL_8250_FSL
bool "Freescale 16550 UART support" if COMPILE_TEST && !(PPC || ARM || 
ARM64)
-   depends on SERIAL_8250_CONSOLE
+   depends on SERIAL_8250
default PPC || ARM || ARM64
help
  Selecting this option enables a workaround for a break-detection
diff --git a/drivers/tty/serial/8250/Makefile b/drivers/tty/serial/8250/Makefile
index 4fc2fc1f41b6..8824ba5295b6 100644
--- a/drivers/tty/serial/8250/Makefile
+++ b/drivers/tty/serial/8250/Makefile
@@ -12,6 +12,7 @@ obj-$(CONFIG_SERIAL_8250) += 8250.o 8250_base.o
 8250_base-$(CONFIG_SERIAL_8250_DMA)+= 8250_dma.o
 8250_base-$(CONFIG_SERIAL_8250_DWLIB)  += 8250_dwlib.o
 8250_base-$(CONFIG_SERIAL_8250_FINTEK) += 8250_fintek.o
+8250_base-$(CONFIG_SERIAL_8250_FSL)+= 8250_fsl.o
 8250_base-$(CONFIG_SERIAL_8250_PCILIB) += 8250_pcilib.o
 obj-$(CONFIG_SERIAL_8250_PARISC)   += 8250_parisc.o
 obj-$(CONFIG_SERIAL_8250_PCI)  += 8250_pci.o
@@ -28,7 +29,6 @@ obj-$(CONFIG_SERIAL_8250_BOCA)+= 8250_boca.o
 obj-$(CONFIG_SERIAL_8250_EXAR_ST16C554)+= 8250_exar_st16c554.o
 obj-$(CONFIG_SERIAL_8250_HUB6) += 8250_hub6.o
 obj-$(CONFIG_SERIAL_8250_PCI1) += 8250_pci1.o
-obj-$(CONFIG_SERIAL_8250_FSL)  += 8250_fsl.o
 obj-$(CONFIG_SERIAL_8250_MEN_MCB)  += 8250_men_mcb.o
 obj-$(CONFIG_SERIAL_8250_DFL)  += 8250_dfl.o
 obj-$(CONFIG_SERIAL_8250_DW)   += 8250_dw.o
-- 
2.39.2

Re: [PATCH v3 2/2] serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE

2023-06-05 Thread Ilpo Järvinen

On Mon, 5 Jun 2023, Uwe Kleine-König wrote:

> The need to handle the FSL variant of 8250 in a special way is also
> present without console support. So soften the dependency for
> SERIAL_8250_FSL accordingly. Note that with the 8250 driver compiled as
> a module, some devices still might not make use of the needed
> workarounds. That affects the ports instantiated in
> arch/powerpc/kernel/legacy_serial.c.
> 
> This issue was identified by Dominik Andreas Schorpp.
> 
> To cope for CONFIG_SERIAL_8250=m + CONFIG_SERIAL_8250_FSL=y, 8250_fsl.o
> must be put in the same compilation unit as 8250_port.o because the
> latter defines some functions needed in the former and so 8250_fsl.o
> must not be built-in if 8250_port.o is available in a module.
> 
> Acked-by: Ilpo Järvinen 
> Link: 
> https://lore.kernel.org/r/20230531083230.2702181-1-u.kleine-koe...@pengutronix.de
> Signed-off-by: Uwe Kleine-König 
> ---
>  drivers/tty/serial/8250/Kconfig  | 2 +-
>  drivers/tty/serial/8250/Makefile | 2 +-
>  2 files changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/drivers/tty/serial/8250/Kconfig b/drivers/tty/serial/8250/Kconfig
> index 5313aa31930f..10c09b19c871 100644
> --- a/drivers/tty/serial/8250/Kconfig
> +++ b/drivers/tty/serial/8250/Kconfig
> @@ -378,7 +378,7 @@ config SERIAL_8250_BCM2835AUX
>  
>  config SERIAL_8250_FSL
>   bool "Freescale 16550 UART support" if COMPILE_TEST && !(PPC || ARM || 
> ARM64)
> - depends on SERIAL_8250_CONSOLE
> + depends on SERIAL_8250

Just one additional thought: After the adding the arch side 
workaround/hack, SERIAL_8250_FSL could become a tristate?

(1/2 might need a small change to take into account that it can now be a 
module).

>   default PPC || ARM || ARM64
>   help
> Selecting this option enables a workaround for a break-detection
> diff --git a/drivers/tty/serial/8250/Makefile 
> b/drivers/tty/serial/8250/Makefile
> index 4fc2fc1f41b6..8824ba5295b6 100644
> --- a/drivers/tty/serial/8250/Makefile
> +++ b/drivers/tty/serial/8250/Makefile
> @@ -12,6 +12,7 @@ obj-$(CONFIG_SERIAL_8250)   += 8250.o 8250_base.o
>  8250_base-$(CONFIG_SERIAL_8250_DMA)  += 8250_dma.o
>  8250_base-$(CONFIG_SERIAL_8250_DWLIB)+= 8250_dwlib.o
>  8250_base-$(CONFIG_SERIAL_8250_FINTEK)   += 8250_fintek.o
> +8250_base-$(CONFIG_SERIAL_8250_FSL)  += 8250_fsl.o
>  8250_base-$(CONFIG_SERIAL_8250_PCILIB)   += 8250_pcilib.o
>  obj-$(CONFIG_SERIAL_8250_PARISC) += 8250_parisc.o
>  obj-$(CONFIG_SERIAL_8250_PCI)+= 8250_pci.o
> @@ -28,7 +29,6 @@ obj-$(CONFIG_SERIAL_8250_BOCA)  += 8250_boca.o
>  obj-$(CONFIG_SERIAL_8250_EXAR_ST16C554)  += 8250_exar_st16c554.o
>  obj-$(CONFIG_SERIAL_8250_HUB6)   += 8250_hub6.o
>  obj-$(CONFIG_SERIAL_8250_PCI1)   += 8250_pci1.o
> -obj-$(CONFIG_SERIAL_8250_FSL)+= 8250_fsl.o
>  obj-$(CONFIG_SERIAL_8250_MEN_MCB)+= 8250_men_mcb.o
>  obj-$(CONFIG_SERIAL_8250_DFL)+= 8250_dfl.o
>  obj-$(CONFIG_SERIAL_8250_DW) += 8250_dw.o

-- 
 i.

Re: [PATCH v3 2/2] serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE

2023-06-05 Thread Uwe Kleine-König

On Mon, Jun 05, 2023 at 04:22:55PM +0300, Ilpo Järvinen wrote:
> On Mon, 5 Jun 2023, Uwe Kleine-König wrote:
> 
> > The need to handle the FSL variant of 8250 in a special way is also
> > present without console support. So soften the dependency for
> > SERIAL_8250_FSL accordingly. Note that with the 8250 driver compiled as
> > a module, some devices still might not make use of the needed
> > workarounds. That affects the ports instantiated in
> > arch/powerpc/kernel/legacy_serial.c.
> > 
> > This issue was identified by Dominik Andreas Schorpp.
> > 
> > To cope for CONFIG_SERIAL_8250=m + CONFIG_SERIAL_8250_FSL=y, 8250_fsl.o
> > must be put in the same compilation unit as 8250_port.o because the
> > latter defines some functions needed in the former and so 8250_fsl.o
> > must not be built-in if 8250_port.o is available in a module.
> > 
> > Acked-by: Ilpo Järvinen 
> > Link: 
> > https://lore.kernel.org/r/20230531083230.2702181-1-u.kleine-koe...@pengutronix.de
> > Signed-off-by: Uwe Kleine-König 
> > ---
> >  drivers/tty/serial/8250/Kconfig  | 2 +-
> >  drivers/tty/serial/8250/Makefile | 2 +-
> >  2 files changed, 2 insertions(+), 2 deletions(-)
> > 
> > diff --git a/drivers/tty/serial/8250/Kconfig 
> > b/drivers/tty/serial/8250/Kconfig
> > index 5313aa31930f..10c09b19c871 100644
> > --- a/drivers/tty/serial/8250/Kconfig
> > +++ b/drivers/tty/serial/8250/Kconfig
> > @@ -378,7 +378,7 @@ config SERIAL_8250_BCM2835AUX
> >  
> >  config SERIAL_8250_FSL
> > bool "Freescale 16550 UART support" if COMPILE_TEST && !(PPC || ARM || 
> > ARM64)
> > -   depends on SERIAL_8250_CONSOLE
> > +   depends on SERIAL_8250
> 
> Just one additional thought: After the adding the arch side 
> workaround/hack, SERIAL_8250_FSL could become a tristate?

I see no benefit for a module separate from 8250_base.ko. There are
dependencies in both directions between 8250_port.o and 8250_fsl.o[1].
So in my book a bool SERIAL_8250_FSL that modifies 8250_base.ko (with
SERIAL_8250=m) is fine.

Best regards
Uwe

[1] 8250_port.o uses fsl8250_handle_irq() from 8250_fsl.o, and
8250_fsl.o uses serial8250_modem_status from 8250_port.o.

-- 
Pengutronix e.K.   | Uwe Kleine-König|
Industrial Linux Solutions | https://www.pengutronix.de/ |


signature.asc
Description: PGP signature

Re: [PATCH v3 2/2] serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE

2023-06-05 Thread Ilpo Järvinen

On Mon, 5 Jun 2023, Uwe Kleine-König wrote:

> On Mon, Jun 05, 2023 at 04:22:55PM +0300, Ilpo Järvinen wrote:
> > On Mon, 5 Jun 2023, Uwe Kleine-König wrote:
> > 
> > > The need to handle the FSL variant of 8250 in a special way is also
> > > present without console support. So soften the dependency for
> > > SERIAL_8250_FSL accordingly. Note that with the 8250 driver compiled as
> > > a module, some devices still might not make use of the needed
> > > workarounds. That affects the ports instantiated in
> > > arch/powerpc/kernel/legacy_serial.c.
> > > 
> > > This issue was identified by Dominik Andreas Schorpp.
> > > 
> > > To cope for CONFIG_SERIAL_8250=m + CONFIG_SERIAL_8250_FSL=y, 8250_fsl.o
> > > must be put in the same compilation unit as 8250_port.o because the
> > > latter defines some functions needed in the former and so 8250_fsl.o
> > > must not be built-in if 8250_port.o is available in a module.
> > > 
> > > Acked-by: Ilpo Järvinen 
> > > Link: 
> > > https://lore.kernel.org/r/20230531083230.2702181-1-u.kleine-koe...@pengutronix.de
> > > Signed-off-by: Uwe Kleine-König 
> > > ---
> > >  drivers/tty/serial/8250/Kconfig  | 2 +-
> > >  drivers/tty/serial/8250/Makefile | 2 +-
> > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > 
> > > diff --git a/drivers/tty/serial/8250/Kconfig 
> > > b/drivers/tty/serial/8250/Kconfig
> > > index 5313aa31930f..10c09b19c871 100644
> > > --- a/drivers/tty/serial/8250/Kconfig
> > > +++ b/drivers/tty/serial/8250/Kconfig
> > > @@ -378,7 +378,7 @@ config SERIAL_8250_BCM2835AUX
> > >  
> > >  config SERIAL_8250_FSL
> > >   bool "Freescale 16550 UART support" if COMPILE_TEST && !(PPC || ARM || 
> > > ARM64)
> > > - depends on SERIAL_8250_CONSOLE
> > > + depends on SERIAL_8250
> > 
> > Just one additional thought: After the adding the arch side 
> > workaround/hack, SERIAL_8250_FSL could become a tristate?
> 
> I see no benefit for a module separate from 8250_base.ko. There are
> dependencies in both directions between 8250_port.o and 8250_fsl.o[1].
> So in my book a bool SERIAL_8250_FSL that modifies 8250_base.ko (with
> SERIAL_8250=m) is fine.
> 
> Best regards
> Uwe
> 
> [1] 8250_port.o uses fsl8250_handle_irq() from 8250_fsl.o

Is that after some fix which isn't in tty-next? I see only these:

$ git grep -l fsl8250_handle_irq
arch/powerpc/kernel/legacy_serial.c
drivers/tty/serial/8250/8250_fsl.c
drivers/tty/serial/8250/8250_of.c
include/linux/serial_8250.h

No users of fsl8250_handle_irq in 8250_port.c.

-- 
 i.

>, and 8250_fsl.o uses serial8250_modem_status from 8250_port.o.

Re: [PATCH v3 2/2] serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE

2023-06-05 Thread Andy Shevchenko

On Mon, Jun 05, 2023 at 04:44:08PM +0300, Ilpo Järvinen wrote:
> On Mon, 5 Jun 2023, Uwe Kleine-König wrote:
> > On Mon, Jun 05, 2023 at 04:22:55PM +0300, Ilpo Järvinen wrote:
> > > On Mon, 5 Jun 2023, Uwe Kleine-König wrote:
> > > > The need to handle the FSL variant of 8250 in a special way is also
> > > > present without console support. So soften the dependency for
> > > > SERIAL_8250_FSL accordingly. Note that with the 8250 driver compiled as
> > > > a module, some devices still might not make use of the needed
> > > > workarounds. That affects the ports instantiated in
> > > > arch/powerpc/kernel/legacy_serial.c.
> > > > 
> > > > This issue was identified by Dominik Andreas Schorpp.
> > > > 
> > > > To cope for CONFIG_SERIAL_8250=m + CONFIG_SERIAL_8250_FSL=y, 8250_fsl.o
> > > > must be put in the same compilation unit as 8250_port.o because the
> > > > latter defines some functions needed in the former and so 8250_fsl.o
> > > > must not be built-in if 8250_port.o is available in a module.
> > > > 
> > > > Acked-by: Ilpo Järvinen 
> > > > Link: 
> > > > https://lore.kernel.org/r/20230531083230.2702181-1-u.kleine-koe...@pengutronix.de
> > > > Signed-off-by: Uwe Kleine-König 
> > > > ---
> > > >  drivers/tty/serial/8250/Kconfig  | 2 +-
> > > >  drivers/tty/serial/8250/Makefile | 2 +-
> > > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/drivers/tty/serial/8250/Kconfig 
> > > > b/drivers/tty/serial/8250/Kconfig
> > > > index 5313aa31930f..10c09b19c871 100644
> > > > --- a/drivers/tty/serial/8250/Kconfig
> > > > +++ b/drivers/tty/serial/8250/Kconfig
> > > > @@ -378,7 +378,7 @@ config SERIAL_8250_BCM2835AUX
> > > >  
> > > >  config SERIAL_8250_FSL
> > > > bool "Freescale 16550 UART support" if COMPILE_TEST && !(PPC || 
> > > > ARM || ARM64)
> > > > -   depends on SERIAL_8250_CONSOLE
> > > > +   depends on SERIAL_8250
> > > 
> > > Just one additional thought: After the adding the arch side 
> > > workaround/hack, SERIAL_8250_FSL could become a tristate?
> > 
> > I see no benefit for a module separate from 8250_base.ko. There are
> > dependencies in both directions between 8250_port.o and 8250_fsl.o[1].
> > So in my book a bool SERIAL_8250_FSL that modifies 8250_base.ko (with
> > SERIAL_8250=m) is fine.

> > [1] 8250_port.o uses fsl8250_handle_irq() from 8250_fsl.o
> 
> Is that after some fix which isn't in tty-next? I see only these:
> 
> $ git grep -l fsl8250_handle_irq
> arch/powerpc/kernel/legacy_serial.c
> drivers/tty/serial/8250/8250_fsl.c
> drivers/tty/serial/8250/8250_of.c
> include/linux/serial_8250.h
> 
> No users of fsl8250_handle_irq in 8250_port.c.

> >, and 8250_fsl.o uses serial8250_modem_status from 8250_port.o.

I don't like 8250_base to be fattened by some stuff that has no
generic meaning. Can we avoid putting every quirk there?

-- 
With Best Regards,
Andy Shevchenko

Re: [PATCH v8 0/7] Add pci_dev_for_each_resource() helper and update users

2023-06-05 Thread Andy Shevchenko

On Wed, May 31, 2023 at 04:30:28PM -0500, Bjorn Helgaas wrote:
> On Wed, May 31, 2023 at 08:48:35PM +0200, Jonas Gorski wrote:

...

> > Looking at the code I understand where coverity is coming from:
> > 
> > #define __pci_dev_for_each_res0(dev, res, ...) \
> >for (unsigned int __b = 0;  \
> > res = pci_resource_n(dev, __b), __b < PCI_NUM_RESOURCES;   \
> > __b++)
> > 
> >  res will be assigned before __b is checked for being less than
> > PCI_NUM_RESOURCES, making it point to behind the array at the end of
> > the last loop iteration.
> > 
> > Rewriting the test expression as
> > 
> > __b < PCI_NUM_RESOURCES && (res = pci_resource_n(dev, __b));
> > 
> > should avoid the (coverity) warning by making use of lazy evaluation.
> > 
> > It probably makes the code slightly less performant as res will now be
> > checked for being not NULL (which will always be true), but I doubt it
> > will be significant (or in any hot paths).
> 
> Thanks a lot for looking into this!  I think you're right, and I think
> the rewritten expression is more logical as well.  Do you want to post
> a patch for it?

Gimme some time, I was on a long leave and now it's a pile to handle.

-- 
With Best Regards,
Andy Shevchenko

Re: [PATCH v3 2/2] serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE

2023-06-05 Thread Uwe Kleine-König

Hello Ilpo,

On Mon, Jun 05, 2023 at 04:44:08PM +0300, Ilpo Järvinen wrote:
> On Mon, 5 Jun 2023, Uwe Kleine-König wrote:
> > On Mon, Jun 05, 2023 at 04:22:55PM +0300, Ilpo Järvinen wrote:
> > > On Mon, 5 Jun 2023, Uwe Kleine-König wrote:
> > > 
> > > > The need to handle the FSL variant of 8250 in a special way is also
> > > > present without console support. So soften the dependency for
> > > > SERIAL_8250_FSL accordingly. Note that with the 8250 driver compiled as
> > > > a module, some devices still might not make use of the needed
> > > > workarounds. That affects the ports instantiated in
> > > > arch/powerpc/kernel/legacy_serial.c.
> > > > 
> > > > This issue was identified by Dominik Andreas Schorpp.
> > > > 
> > > > To cope for CONFIG_SERIAL_8250=m + CONFIG_SERIAL_8250_FSL=y, 8250_fsl.o
> > > > must be put in the same compilation unit as 8250_port.o because the
> > > > latter defines some functions needed in the former and so 8250_fsl.o
> > > > must not be built-in if 8250_port.o is available in a module.
> > > > 
> > > > Acked-by: Ilpo Järvinen 
> > > > Link: 
> > > > https://lore.kernel.org/r/20230531083230.2702181-1-u.kleine-koe...@pengutronix.de
> > > > Signed-off-by: Uwe Kleine-König 
> > > > ---
> > > >  drivers/tty/serial/8250/Kconfig  | 2 +-
> > > >  drivers/tty/serial/8250/Makefile | 2 +-
> > > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > > 
> > > > diff --git a/drivers/tty/serial/8250/Kconfig 
> > > > b/drivers/tty/serial/8250/Kconfig
> > > > index 5313aa31930f..10c09b19c871 100644
> > > > --- a/drivers/tty/serial/8250/Kconfig
> > > > +++ b/drivers/tty/serial/8250/Kconfig
> > > > @@ -378,7 +378,7 @@ config SERIAL_8250_BCM2835AUX
> > > >  
> > > >  config SERIAL_8250_FSL
> > > > bool "Freescale 16550 UART support" if COMPILE_TEST && !(PPC || 
> > > > ARM || ARM64)
> > > > -   depends on SERIAL_8250_CONSOLE
> > > > +   depends on SERIAL_8250
> > > 
> > > Just one additional thought: After the adding the arch side 
> > > workaround/hack, SERIAL_8250_FSL could become a tristate?
> > 
> > I see no benefit for a module separate from 8250_base.ko. There are
> > dependencies in both directions between 8250_port.o and 8250_fsl.o[1].
> > So in my book a bool SERIAL_8250_FSL that modifies 8250_base.ko (with
> > SERIAL_8250=m) is fine.
> > 
> > Best regards
> > Uwe
> > 
> > [1] 8250_port.o uses fsl8250_handle_irq() from 8250_fsl.o
> 
> Is that after some fix which isn't in tty-next? I see only these:
> 
> $ git grep -l fsl8250_handle_irq
> arch/powerpc/kernel/legacy_serial.c
> drivers/tty/serial/8250/8250_fsl.c
> drivers/tty/serial/8250/8250_of.c
> include/linux/serial_8250.h
> 
> No users of fsl8250_handle_irq in 8250_port.c.

Ah right, I was too quick:

8250_of.o uses fsl8250_handle_irq() from 8250_fsl.o
8250_fsl.o uses serial8250_modem_status() from 8250_port.o (which is in 
8250_base.o)


However linking 8250_fsl.o in 8250_of.o isn't a good solution either as
8250_fsl.o should also be available with CONFIG_SERIAL_OF_PLATFORM
disabled to provide the ACPI driver. And as 8250_of.o already depends on
8250_port.o (e.g. via serial8250_em485_config()) adding 8250_fsl.o
together with 8250_port.o into 8250_base.ko is fine and doesn't add new
dependencies.

Best regards
Uwe

-- 
Pengutronix e.K.   | Uwe Kleine-König|
Industrial Linux Solutions | https://www.pengutronix.de/ |


signature.asc
Description: PGP signature

Re: [PATCH v3 2/2] serial: 8250: Apply FSL workarounds also without SERIAL_8250_CONSOLE

2023-06-05 Thread Ilpo Järvinen

On Mon, 5 Jun 2023, Uwe Kleine-König wrote:

> Hello Ilpo,
> 
> On Mon, Jun 05, 2023 at 04:44:08PM +0300, Ilpo Järvinen wrote:
> > On Mon, 5 Jun 2023, Uwe Kleine-König wrote:
> > > On Mon, Jun 05, 2023 at 04:22:55PM +0300, Ilpo Järvinen wrote:
> > > > On Mon, 5 Jun 2023, Uwe Kleine-König wrote:
> > > > 
> > > > > The need to handle the FSL variant of 8250 in a special way is also
> > > > > present without console support. So soften the dependency for
> > > > > SERIAL_8250_FSL accordingly. Note that with the 8250 driver compiled 
> > > > > as
> > > > > a module, some devices still might not make use of the needed
> > > > > workarounds. That affects the ports instantiated in
> > > > > arch/powerpc/kernel/legacy_serial.c.
> > > > > 
> > > > > This issue was identified by Dominik Andreas Schorpp.
> > > > > 
> > > > > To cope for CONFIG_SERIAL_8250=m + CONFIG_SERIAL_8250_FSL=y, 
> > > > > 8250_fsl.o
> > > > > must be put in the same compilation unit as 8250_port.o because the
> > > > > latter defines some functions needed in the former and so 8250_fsl.o
> > > > > must not be built-in if 8250_port.o is available in a module.
> > > > > 
> > > > > Acked-by: Ilpo Järvinen 
> > > > > Link: 
> > > > > https://lore.kernel.org/r/20230531083230.2702181-1-u.kleine-koe...@pengutronix.de
> > > > > Signed-off-by: Uwe Kleine-König 
> > > > > ---
> > > > >  drivers/tty/serial/8250/Kconfig  | 2 +-
> > > > >  drivers/tty/serial/8250/Makefile | 2 +-
> > > > >  2 files changed, 2 insertions(+), 2 deletions(-)
> > > > > 
> > > > > diff --git a/drivers/tty/serial/8250/Kconfig 
> > > > > b/drivers/tty/serial/8250/Kconfig
> > > > > index 5313aa31930f..10c09b19c871 100644
> > > > > --- a/drivers/tty/serial/8250/Kconfig
> > > > > +++ b/drivers/tty/serial/8250/Kconfig
> > > > > @@ -378,7 +378,7 @@ config SERIAL_8250_BCM2835AUX
> > > > >  
> > > > >  config SERIAL_8250_FSL
> > > > >   bool "Freescale 16550 UART support" if COMPILE_TEST && !(PPC || 
> > > > > ARM || ARM64)
> > > > > - depends on SERIAL_8250_CONSOLE
> > > > > + depends on SERIAL_8250
> > > > 
> > > > Just one additional thought: After the adding the arch side 
> > > > workaround/hack, SERIAL_8250_FSL could become a tristate?
> > > 
> > > I see no benefit for a module separate from 8250_base.ko. There are
> > > dependencies in both directions between 8250_port.o and 8250_fsl.o[1].
> > > So in my book a bool SERIAL_8250_FSL that modifies 8250_base.ko (with
> > > SERIAL_8250=m) is fine.
> > > 
> > > Best regards
> > > Uwe
> > > 
> > > [1] 8250_port.o uses fsl8250_handle_irq() from 8250_fsl.o
> > 
> > Is that after some fix which isn't in tty-next? I see only these:
> > 
> > $ git grep -l fsl8250_handle_irq
> > arch/powerpc/kernel/legacy_serial.c
> > drivers/tty/serial/8250/8250_fsl.c
> > drivers/tty/serial/8250/8250_of.c
> > include/linux/serial_8250.h
> > 
> > No users of fsl8250_handle_irq in 8250_port.c.
> 
> Ah right, I was too quick:
> 
>   8250_of.o uses fsl8250_handle_irq() from 8250_fsl.o
>   8250_fsl.o uses serial8250_modem_status() from 8250_port.o (which is in 
> 8250_base.o)
> 
> 
> However linking 8250_fsl.o in 8250_of.o isn't a good solution either as
> 8250_fsl.o should also be available with CONFIG_SERIAL_OF_PLATFORM
> disabled to provide the ACPI driver. And as 8250_of.o already depends on
> 8250_port.o (e.g. via serial8250_em485_config()) adding 8250_fsl.o
> together with 8250_port.o into 8250_base.ko is fine and doesn't add new
> dependencies.

So we have dependencies one-way only:

8250_port

/\|\
\
8250_fsl \
 /
/\  /

8250_of

There's no loop here, both they be indepedent modules and configured 
independently (with a correct IS_*() in 8250_of.c).

-- 
 i.

Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX

2023-06-05 Thread Edgecombe, Rick P

On Mon, 2023-06-05 at 11:11 +0300, Mike Rapoport wrote:
> On Sun, Jun 04, 2023 at 10:52:44PM -0400, Steven Rostedt wrote:
> > On Thu, 1 Jun 2023 16:54:36 -0700
> > Nadav Amit  wrote:
> > 
> > > > The way text_poke() is used here, it is creating a new writable
> > > > alias
> > > > and flushing it for *each* write to the module (like for each
> > > > write of
> > > > an individual relocation, etc). I was just thinking it might
> > > > warrant
> > > > some batching or something.  
> 
> > > I am not advocating to do so, but if you want to have many
> > > efficient
> > > writes, perhaps you can just disable CR0.WP. Just saying that if
> > > you
> > > are about to write all over the memory, text_poke() does not
> > > provide
> > > too much security for the poking thread.
> 
> Heh, this is definitely and easier hack to implement :)

I don't know the details, but previously there was some strong dislike
of CR0.WP toggling. And now there is also the problem of CET. Setting
CR0.WP=0 will #GP if CR4.CET is 1 (as it currently is for kernel IBT).
I guess you might get away with toggling them both in some controlled
situation, but it might be a lot easier to hack up then to be made
fully acceptable. It does sound much more efficient though.

> 
> > Batching does exist, which is what the text_poke_queue() thing
> > does.
> 
> For module loading text_poke_queue() will still be much slower than a
> bunch
> of memset()s for no good reason because we don't need all the
> complexity of
> text_poke_bp_batch() for module initialization because we are sure we
> are
> not patching live code.
> 
> What we'd need here is a new batching mode that will create a
> writable
> alias mapping at the beginning of apply_relocate_*() and
> module_finalize(),
> then it will use memcpy() to that writable alias and will tear the
> mapping
> down in the end.

It's probably only a tiny bit faster than keeping a separate writable
allocation and text_poking it in at the end.

> 
> Another option is to teach alternatives to update a writable copy
> rather
> than do in place changes like Song suggested. My feeling is that it
> will be
> more intrusive change though.

You mean keeping a separate RW allocation and then text_poking() the
whole thing in when you are done? That is what I was trying to say at
the beginning of this thread. The other benefit is you don't make the
intermediate loading states of the module, executable.

I tried this technique previously [0], and I thought it was not too
bad. In most of the callers it looks similar to what you have in
do_text_poke(). Sometimes less, sometimes more. It might need
enlightening of some of the stuff currently using text_poke() during
module loading, like jump labels. So that bit is more intrusive, yea.
But it sounds so much cleaner and well controlled. Did you have a
particular trouble spot in mind?

[0]
https://lore.kernel.org/lkml/20201120202426.18009-5-rick.p.edgeco...@intel.com/

[PATCH] powerpc/iommu: Only build sPAPR access functions on pSeries

2023-06-05 Thread Timothy Pearson

 and PowerNV

A build failure with CONFIG_HAVE_PCI=y set without PSERIES or POWERNV
set was caught by the random configuration checker.  Guard the sPAPR
specific IOMMU functions on CONFIG_PPC_PSERIES || CONFIG_PPC_POWERNV.

Signed-off-by: Timothy Pearson 
---
 arch/powerpc/kernel/iommu.c | 4 
 1 file changed, 4 insertions(+)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index 67f0b01e6ff5..c52449ae6936 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -1090,6 +1090,7 @@ void iommu_tce_kill(struct iommu_table *tbl,
 }
 EXPORT_SYMBOL_GPL(iommu_tce_kill);
 
+#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
 static int iommu_take_ownership(struct iommu_table *tbl)
 {
unsigned long flags, i, sz = (tbl->it_size + 7) >> 3;
@@ -1140,6 +1141,7 @@ static void iommu_release_ownership(struct iommu_table 
*tbl)
spin_unlock(&tbl->pools[i].lock);
spin_unlock_irqrestore(&tbl->large_pool.lock, flags);
 }
+#endif
 
 int iommu_add_device(struct iommu_table_group *table_group, struct device *dev)
 {
@@ -1171,6 +1173,7 @@ int iommu_add_device(struct iommu_table_group 
*table_group, struct device *dev)
 }
 EXPORT_SYMBOL_GPL(iommu_add_device);
 
+#if defined(CONFIG_PPC_PSERIES) || defined(CONFIG_PPC_POWERNV)
 /*
  * A simple iommu_table_group_ops which only allows reusing the existing
  * iommu_table. This handles VFIO for POWER7 or the nested KVM.
@@ -1398,5 +1401,6 @@ static int __init 
spapr_tce_setup_phb_iommus_initcall(void)
return 0;
 }
 postcore_initcall_sync(spapr_tce_setup_phb_iommus_initcall);
+#endif
 
 #endif /* CONFIG_IOMMU_API */
-- 
2.30.2

Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX

2023-06-05 Thread Mike Rapoport

On Mon, Jun 05, 2023 at 04:10:21PM +, Edgecombe, Rick P wrote:
> On Mon, 2023-06-05 at 11:11 +0300, Mike Rapoport wrote:
> > On Sun, Jun 04, 2023 at 10:52:44PM -0400, Steven Rostedt wrote:
> > > On Thu, 1 Jun 2023 16:54:36 -0700
> > > Nadav Amit  wrote:
> > > 
> > > > > The way text_poke() is used here, it is creating a new writable
> > > > > alias
> > > > > and flushing it for *each* write to the module (like for each
> > > > > write of
> > > > > an individual relocation, etc). I was just thinking it might
> > > > > warrant
> > > > > some batching or something.  
> > 
> > > > I am not advocating to do so, but if you want to have many
> > > > efficient
> > > > writes, perhaps you can just disable CR0.WP. Just saying that if
> > > > you
> > > > are about to write all over the memory, text_poke() does not
> > > > provide
> > > > too much security for the poking thread.
> > 
> > Heh, this is definitely and easier hack to implement :)
> 
> I don't know the details, but previously there was some strong dislike
> of CR0.WP toggling. And now there is also the problem of CET. Setting
> CR0.WP=0 will #GP if CR4.CET is 1 (as it currently is for kernel IBT).
> I guess you might get away with toggling them both in some controlled
> situation, but it might be a lot easier to hack up then to be made
> fully acceptable. It does sound much more efficient though.
 
I don't think we'd really want that, especially looking at 

WARN_ONCE(bits_missing, "CR0 WP bit went missing!?\n");

at native_write_cr0().
 
> > > Batching does exist, which is what the text_poke_queue() thing
> > > does.
> > 
> > For module loading text_poke_queue() will still be much slower than a
> > bunch
> > of memset()s for no good reason because we don't need all the
> > complexity of
> > text_poke_bp_batch() for module initialization because we are sure we
> > are
> > not patching live code.
> > 
> > What we'd need here is a new batching mode that will create a
> > writable
> > alias mapping at the beginning of apply_relocate_*() and
> > module_finalize(),
> > then it will use memcpy() to that writable alias and will tear the
> > mapping
> > down in the end.
> 
> It's probably only a tiny bit faster than keeping a separate writable
> allocation and text_poking it in at the end.

Right, but it still will be faster than text_poking every relocation.
 
> > Another option is to teach alternatives to update a writable copy
> > rather
> > than do in place changes like Song suggested. My feeling is that it
> > will be
> > more intrusive change though.
> 
> You mean keeping a separate RW allocation and then text_poking() the
> whole thing in when you are done? That is what I was trying to say at
> the beginning of this thread. The other benefit is you don't make the
> intermediate loading states of the module, executable.
> 
> I tried this technique previously [0], and I thought it was not too
> bad. In most of the callers it looks similar to what you have in
> do_text_poke(). Sometimes less, sometimes more. It might need
> enlightening of some of the stuff currently using text_poke() during
> module loading, like jump labels. So that bit is more intrusive, yea.
> But it sounds so much cleaner and well controlled. Did you have a
> particular trouble spot in mind?

Nothing in particular, except the intrusive part. Except the changes in
modules.c we'd need to teach alternatives to deal with a writable copy.
 
> [0]
> https://lore.kernel.org/lkml/20201120202426.18009-5-rick.p.edgeco...@intel.com/

-- 
Sincerely yours,
Mike.

Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX

2023-06-05 Thread Edgecombe, Rick P

On Mon, 2023-06-05 at 23:42 +0300, Mike Rapoport wrote:
> > I tried this technique previously [0], and I thought it was not too
> > bad. In most of the callers it looks similar to what you have in
> > do_text_poke(). Sometimes less, sometimes more. It might need
> > enlightening of some of the stuff currently using text_poke()
> > during
> > module loading, like jump labels. So that bit is more intrusive,
> > yea.
> > But it sounds so much cleaner and well controlled. Did you have a
> > particular trouble spot in mind?
> 
> Nothing in particular, except the intrusive part. Except the changes
> in
> modules.c we'd need to teach alternatives to deal with a writable
> copy.

I didn't think alternatives piece looked too bad on the caller side (if
that's what you meant):
https://lore.kernel.org/lkml/20201120202426.18009-7-rick.p.edgeco...@intel.com/

The ugly part was in the (poorly named) module_adjust_writable_addr():

+static inline void *module_adjust_writable_addr(void *addr)
+{
+   unsigned long laddr = (unsigned long)addr;
+   struct module *mod;
+
+   mutex_lock(&module_mutex);
+   mod = __module_address(laddr);
+   if (!mod) {
+   mutex_unlock(&module_mutex);
+   return addr;
+   }
+   mutex_unlock(&module_mutex);
+   /* The module shouldn't be going away if someone is trying to
write to it */
+
+   return (void *)perm_writable_addr(module_get_allocation(mod,
laddr), laddr);
+}
+

It took module_mutex and looked up the module in order to find the
writable buffer from just the executable address. Basically all the
loading code external to modules had to go through that interface. But
now I'm wondering what I was thinking, it seems this could just be an
RCU read lock. That doesn't seem to bad...

Re: [PATCH 12/13] x86/jitalloc: prepare to allocate exectuatble memory as ROX

2023-06-05 Thread Nadav Amit




> On Jun 5, 2023, at 9:10 AM, Edgecombe, Rick P  
> wrote:
> 
> On Mon, 2023-06-05 at 11:11 +0300, Mike Rapoport wrote:
>> On Sun, Jun 04, 2023 at 10:52:44PM -0400, Steven Rostedt wrote:
>>> On Thu, 1 Jun 2023 16:54:36 -0700
>>> Nadav Amit  wrote:
>>> 
> The way text_poke() is used here, it is creating a new writable
> alias
> and flushing it for *each* write to the module (like for each
> write of
> an individual relocation, etc). I was just thinking it might
> warrant
> some batching or something.  
>> 
 I am not advocating to do so, but if you want to have many
 efficient
 writes, perhaps you can just disable CR0.WP. Just saying that if
 you
 are about to write all over the memory, text_poke() does not
 provide
 too much security for the poking thread.
>> 
>> Heh, this is definitely and easier hack to implement :)
> 
> I don't know the details, but previously there was some strong dislike
> of CR0.WP toggling. And now there is also the problem of CET. Setting
> CR0.WP=0 will #GP if CR4.CET is 1 (as it currently is for kernel IBT).
> I guess you might get away with toggling them both in some controlled
> situation, but it might be a lot easier to hack up then to be made
> fully acceptable. It does sound much more efficient though.

Thanks for highlighting this issue. I understand the limitations of
CR0.WP. There is also always the concerns that without CET or other
control flow integrity mechanism, someone would abuse (using ROP/JOP)
functions that clear CR0.WP…

Re: [PATCH 00/13] mm: jit/text allocator

2023-06-05 Thread Kent Overstreet

On Mon, Jun 05, 2023 at 12:20:40PM +0300, Mike Rapoport wrote:
> On Fri, Jun 02, 2023 at 10:35:09AM +0100, Mark Rutland wrote:
> > On Thu, Jun 01, 2023 at 02:14:56PM -0400, Kent Overstreet wrote:
> > > On Thu, Jun 01, 2023 at 05:12:03PM +0100, Mark Rutland wrote:
> > > > For a while I have wanted to give kprobes its own allocator so that it 
> > > > can work
> > > > even with CONFIG_MODULES=n, and so that it doesn't have to waste VA 
> > > > space in
> > > > the modules area.
> > > > 
> > > > Given that, I think these should have their own allocator functions 
> > > > that can be
> > > > provided independently, even if those happen to use common 
> > > > infrastructure.
> > > 
> > > How much memory can kprobes conceivably use? I think we also want to try
> > > to push back on combinatorial new allocators, if we can.
> > 
> > That depends on who's using it, and how (e.g. via BPF).
> > 
> > To be clear, I'm not necessarily asking for entirely different allocators, 
> > but
> > I do thinkg that we want wrappers that can at least pass distinct start+end
> > parameters to a common allocator, and for arm64's modules code I'd expect 
> > that
> > we'd keep the range falblack logic out of the common allcoator, and just 
> > call
> > it twice.
> > 
> > > > > Several architectures override module_alloc() because of various
> > > > > constraints where the executable memory can be located and this causes
> > > > > additional obstacles for improvements of code allocation.
> > > > > 
> > > > > This set splits code allocation from modules by introducing
> > > > > jit_text_alloc(), jit_data_alloc() and jit_free() APIs, replaces call
> > > > > sites of module_alloc() and module_memfree() with the new APIs and
> > > > > implements core text and related allocation in a central place.
> > > > > 
> > > > > Instead of architecture specific overrides for module_alloc(), the
> > > > > architectures that require non-default behaviour for text allocation 
> > > > > must
> > > > > fill jit_alloc_params structure and implement jit_alloc_arch_params() 
> > > > > that
> > > > > returns a pointer to that structure. If an architecture does not 
> > > > > implement
> > > > > jit_alloc_arch_params(), the defaults compatible with the current
> > > > > modules::module_alloc() are used.
> > > > 
> > > > As above, I suspect that each of the callsites should probably be using 
> > > > common
> > > > infrastructure, but I don't think that a single jit_alloc_arch_params() 
> > > > makes
> > > > sense, since the parameters for each case may need to be distinct.
> > > 
> > > I don't see how that follows. The whole point of function parameters is
> > > that they may be different :)
> > 
> > What I mean is that jit_alloc_arch_params() tries to aggregate common
> > parameters, but they aren't actually common (e.g. the actual start+end range
> > for allocation).
> 
> jit_alloc_arch_params() tries to aggregate architecture constraints and
> requirements for allocations of executable memory and this exactly what
> the first 6 patches of this set do.
> 
> A while ago Thomas suggested to use a structure that parametrizes
> architecture constraints by the memory type used in modules [1] and Song
> implemented the infrastructure for it and x86 part [2].
> 
> I liked the idea of defining parameters in a single structure, but I
> thought that approaching the problem from the arch side rather than from
> modules perspective will be better starting point, hence these patches.
> 
> I don't see a fundamental reason why a single structure cannot describe
> what is needed for different code allocation cases, be it modules, kprobes
> or bpf. There is of course an assumption that the core allocations will be
> the same for all the users, and it seems to me that something like 
> 
> * allocate physical memory if allocator caches are empty
> * map it in vmalloc or modules address space
> * return memory from the allocator cache to the caller
> 
> will work for all usecases.
> 
> We might need separate caches for different cases on different
> architectures, and a way to specify what cache should be used in the
> allocator API, but that does not contradict a single structure for arch
> specific parameters, but only makes it more elaborate, e.g. something like
> 
> enum jit_type {
>   JIT_MODULES_TEXT,
>   JIT_MODULES_DATA,
>   JIT_KPROBES,
>   JIT_FTRACE,
>   JIT_BPF,
>   JIT_TYPE_MAX,
> };

Why would we actually need different enums for modules_text, kprobes,
ftrace and bpf? Why can't we treat all text allocations the same?

The reason we can't do that currently is because modules need to go in a
128Mb region on some archs, and without sub page allocation
bpf/kprobes/etc. burn a full page for each allocation. But we're doing
sub page allocation - right?

That leaves module data - which really needs to be split out into rw,
ro, ro_after_init - but I'm not sure we'd even want the same API for
those, they need fairly different page permissions handl

Re: [PATCH v4 RESEND 0/3] sed-opal: keyrings, discovery, revert, key store

2023-06-05 Thread Jens Axboe

On 6/1/23 4:37 PM, gjo...@linux.vnet.ibm.com wrote:
> From: Greg Joyce 
> 
> This patchset has gone through numerous rounds of review and
> all comments/suggetions have been addressed. I believe that
> this patchset is ready for inclusion.
> 
> TCG SED Opal is a specification from The Trusted Computing Group
> that allows self encrypting storage devices (SED) to be locked at
> power on and require an authentication key to unlock the drive.
> 
> The current SED Opal implementation in the block driver
> requires that authentication keys be provided in an ioctl
> so that they can be presented to the underlying SED
> capable drive. Currently, the key is typically entered by
> a user with an application like sedutil or sedcli. While
> this process works, it does not lend itself to automation
> like unlock by a udev rule.
> 
> The SED block driver has been extended so it can alternatively
> obtain a key from a sed-opal kernel keyring. The SED ioctls
> will indicate the source of the key, either directly in the
> ioctl data or from the keyring.
> 
> Two new SED ioctls have also been added. These are:
>   1) IOC_OPAL_REVERT_LSP to revert LSP state
>   2) IOC_OPAL_DISCOVERY to discover drive capabilities/state
> 
> change log v4:
> - rebase to 6.3-rc7
>   - replaced "255" magic number with U8_MAX

None of this applies for for-6.5/block, and I'm a little puzzled
as to why you'd rebase to an old kernel rather than a 6.4-rc at
least?

Please resend one that is current.

-- 
Jens Axboe

Re: [PATCH 00/89] i2c: Convert to platform remove callback returning void

2023-06-05 Thread Wolfram Sang

On Thu, Jun 01, 2023 at 03:54:50PM +0200, Wolfram Sang wrote:
> 
> > I wonder how this series will go in. My expectation was that Wolfram
> > picks up the whole series via his tree?!
> 
> Will do. I am currently super-busy, though.

Whole series applied to for-next. I squashed all the commits into one.
These are mostly simple changes which we won't revert anyhow, but fix
incrementally if we ever find an issue.

signature.asc
Description: PGP signature

Re: [PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page

2023-06-05 Thread Hugh Dickins

On Fri, 2 Jun 2023, Jason Gunthorpe wrote:
> On Mon, May 29, 2023 at 03:02:02PM +0100, Matthew Wilcox wrote:
> > On Sun, May 28, 2023 at 11:20:21PM -0700, Hugh Dickins wrote:
> > > +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> > > +{
> > > + struct page *page;
> > > +
> > > + page = virt_to_page(pgtable);
> > > + call_rcu(&page->rcu_head, pte_free_now);
> > > +}
> > 
> > This can't be safe (on ppc).  IIRC you might have up to 16x4k page
> > tables sharing one 64kB page.  So if you have two page tables from the
> > same page being defer-freed simultaneously, you'll reuse the rcu_head
> > and I cannot imagine things go well from that point.
> > 
> > I have no idea how to solve this problem.
> 
> Maybe power and s390 should allocate a side structure, sort of a
> pre-memdesc thing to store enough extra data?
> 
> If we can get enough bytes then something like this would let a single
> rcu head be shared to manage the free bits.
> 
> struct 64k_page {
> u8 free_pages;
> u8 pending_rcu_free_pages;
> struct rcu_head head;
> }
> 
> free_sub_page(sub_id)
> if (atomic_fetch_or(1 << sub_id, &64k_page->pending_rcu_free_pages))
>  call_rcu(&64k_page->head)
> 
> rcu_func()
>64k_page->free_pages |= atomic_xchg(0, &64k_page->pending_rcu_free_pages)
> 
>if (64k_pages->free_pages == all_ones)
>   free_pgea(64k_page);

Or simply allocate as many rcu_heads as page tables.

I have not thought through your suggestion above, because I'm against
asking s390, or any other architecture, to degrade its page table
implementation by demanding more memory, just for the sake of my patch
series.  In a future memdesc world it might turn out to be reasonable,
but not for this (if I can possibly avoid it).

Below is what I believe to be the correct powerpc patch (built but not
retested).  sparc I thought was going to be an equal problem, but turns
out not: I'll comment on 06/12.  And let's move s390 discussion to 07/12.

[PATCH 05/12] powerpc: add pte_free_defer() for pgtables sharing page

Add powerpc-specific pte_free_defer(), to call pte_free() via call_rcu().
pte_free_defer() will be called inside khugepaged's retract_page_tables()
loop, where allocating extra memory cannot be relied upon.  This precedes
the generic version to avoid build breakage from incompatible pgtable_t.

This is awkward because the struct page contains only one rcu_head, but
that page may be shared between PTE_FRAG_NR pagetables, each wanting to
use the rcu_head at the same time: account concurrent deferrals with a
heightened refcount, only the first making use of the rcu_head, but
re-deferring if more deferrals arrived during its grace period.

Signed-off-by: Hugh Dickins 
---
 arch/powerpc/include/asm/pgalloc.h |  4 +++
 arch/powerpc/mm/pgtable-frag.c | 51 ++
 2 files changed, 55 insertions(+)

diff --git a/arch/powerpc/include/asm/pgalloc.h 
b/arch/powerpc/include/asm/pgalloc.h
index 3360cad78ace..3a971e2a8c73 100644
--- a/arch/powerpc/include/asm/pgalloc.h
+++ b/arch/powerpc/include/asm/pgalloc.h
@@ -45,6 +45,10 @@ static inline void pte_free(struct mm_struct *mm, pgtable_t 
ptepage)
pte_fragment_free((unsigned long *)ptepage, 0);
 }
 
+/* arch use pte_free_defer() implementation in arch/powerpc/mm/pgtable-frag.c 
*/
+#define pte_free_defer pte_free_defer
+void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
+
 /*
  * Functions that deal with pagetables that could be at any level of
  * the table need to be passed an "index_size" so they know how to
diff --git a/arch/powerpc/mm/pgtable-frag.c b/arch/powerpc/mm/pgtable-frag.c
index 20652daa1d7e..e4f58c5fc2ac 100644
--- a/arch/powerpc/mm/pgtable-frag.c
+++ b/arch/powerpc/mm/pgtable-frag.c
@@ -120,3 +120,54 @@ void pte_fragment_free(unsigned long *table, int kernel)
__free_page(page);
}
 }
+
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
+#define PTE_FREE_DEFERRED 0x1 /* beyond any PTE_FRAG_NR */
+
+static void pte_free_now(struct rcu_head *head)
+{
+   struct page *page;
+   int refcount;
+
+   page = container_of(head, struct page, rcu_head);
+   refcount = atomic_sub_return(PTE_FREE_DEFERRED - 1,
+&page->pt_frag_refcount);
+   if (refcount < PTE_FREE_DEFERRED) {
+   pte_fragment_free((unsigned long *)page_address(page), 0);
+   return;
+   }
+   /*
+* One page may be shared between PTE_FRAG_NR pagetables.
+* At least one more call to pte_free_defer() came in while we
+* were already deferring, so the free must be deferred again;
+* but just for one grace period, however many calls came in.
+*/
+   while (refcount >= PTE_FREE_DEFERRED + PTE_FREE_DEFERRED) {
+   refcount = atomic_sub_return(PTE_FREE_DEFERRED,
+&page->pt_frag_refcount);
+   }
+   /* Remove that refcount of 1 left for fragment freeing ab

Re: [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors

2023-06-05 Thread Grant Grundler

[plain text only this time...]

On Wed, May 17, 2023 at 11:11 PM Grant Grundler  wrote:
>
> On Fri, Apr 7, 2023 at 12:46 PM Bjorn Helgaas  wrote:
> ...
> > But I don't think we need output in a single step; we just need a
> > single instance of ratelimit_state (or one for CPER path and another
> > for native AER path), and that can control all the output for a single
> > error.  E.g., print_hmi_event_info() looks like this:
> >
> >   static void print_hmi_event_info(...)
> >   {
> > static DEFINE_RATELIMIT_STATE(rs, ...);
> >
> > if (__ratelimit(&rs)) {
> >   printk("%s%s Hypervisor Maintenance interrupt ...");
> >   printk("%s Error detail: %s\n", ...);
> >   printk("%s  HMER: %016llx\n", ...);
> > }
> >   }
> >
> > I think it's nice that the struct ratelimit_state is explicit and
> > there's no danger of breaking it when adding another printk later.
>
> Since the output is spread across at least two functions, I think your
> proposal is a better solution.
>
> I'm not happy with the patch series I sent in my previous reply as an
> attachment. It's only marginally better than the original code.

Despite not being happy about it, after a week of vacation I now think
it would be better to include them as is since they solve the
immediate problems and then solve the above two issues in additional
patches. The two changes I have prepared so far correctly fix the
original issues they intended to fix and don't affect the new issues
we've found.

I'll post a V3 of this series tonight after making sure it at least
compiles and "looks right".

cheers,
grant

>
> I need another day or two to see if I can implement your proposal correctly.
>
> cheers,
> grant

Re: [PATCH 06/12] sparc: add pte_free_defer() for pgtables sharing page

2023-06-05 Thread Hugh Dickins

On Sun, 28 May 2023, Hugh Dickins wrote:

> Add sparc-specific pte_free_defer(), to call pte_free() via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon.  This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.

sparc32 supports pagetables sharing a page, but does not support THP;
sparc64 supports THP, but does not support pagetables sharing a page.
So the sparc-specific pte_free_defer() is as simple as the generic one,
except for converting between pte_t *pgtable_t and struct page *.
The patch should be fine as posted (except its title is misleading).

> 
> Signed-off-by: Hugh Dickins 
> ---
>  arch/sparc/include/asm/pgalloc_64.h |  4 
>  arch/sparc/mm/init_64.c | 16 
>  2 files changed, 20 insertions(+)
> 
> diff --git a/arch/sparc/include/asm/pgalloc_64.h 
> b/arch/sparc/include/asm/pgalloc_64.h
> index 7b5561d17ab1..caa7632be4c2 100644
> --- a/arch/sparc/include/asm/pgalloc_64.h
> +++ b/arch/sparc/include/asm/pgalloc_64.h
> @@ -65,6 +65,10 @@ pgtable_t pte_alloc_one(struct mm_struct *mm);
>  void pte_free_kernel(struct mm_struct *mm, pte_t *pte);
>  void pte_free(struct mm_struct *mm, pgtable_t ptepage);
>  
> +/* arch use pte_free_defer() implementation in arch/sparc/mm/init_64.c */
> +#define pte_free_defer pte_free_defer
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable);
> +
>  #define pmd_populate_kernel(MM, PMD, PTE)pmd_set(MM, PMD, PTE)
>  #define pmd_populate(MM, PMD, PTE)   pmd_set(MM, PMD, PTE)
>  
> diff --git a/arch/sparc/mm/init_64.c b/arch/sparc/mm/init_64.c
> index 04f9db0c3111..b7c6aa085ef6 100644
> --- a/arch/sparc/mm/init_64.c
> +++ b/arch/sparc/mm/init_64.c
> @@ -2930,6 +2930,22 @@ void pgtable_free(void *table, bool is_page)
>  }
>  
>  #ifdef CONFIG_TRANSPARENT_HUGEPAGE
> +static void pte_free_now(struct rcu_head *head)
> +{
> + struct page *page;
> +
> + page = container_of(head, struct page, rcu_head);
> + __pte_free((pgtable_t)page_to_virt(page));
> +}
> +
> +void pte_free_defer(struct mm_struct *mm, pgtable_t pgtable)
> +{
> + struct page *page;
> +
> + page = virt_to_page(pgtable);
> + call_rcu(&page->rcu_head, pte_free_now);
> +}
> +
>  void update_mmu_cache_pmd(struct vm_area_struct *vma, unsigned long addr,
> pmd_t *pmd)
>  {
> -- 
> 2.35.3
> 
>

[PATCH 1/2] PCI/AER: correctable error message as KERN_INFO

2023-06-05 Thread Grant Grundler

Since correctable errors have been corrected (and counted), the dmesg output
should not be reported as a warning, but rather as "informational".

Otherwise, using a certain well known vendor's PCIe parts in a USB4 docking
station, the dmesg buffer can be spammed with correctable errors, 717 bytes
per instance, potentially many MB per day.

Given the "WARN" priority, these messages have already confused the typical
user that stumbles across them, support staff (triaging feedback reports),
and more than a few linux kernel devs. Changing to INFO will hide these
messages from most audiences.

Signed-off-by: Grant Grundler 
---
 drivers/pci/pcie/aer.c | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index f6c24ded134c..d7bfc6070ddb 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -692,7 +692,7 @@ static void __aer_print_error(struct pci_dev *dev,
 
if (info->severity == AER_CORRECTABLE) {
strings = aer_correctable_error_string;
-   level = KERN_WARNING;
+   level = KERN_INFO;
} else {
strings = aer_uncorrectable_error_string;
level = KERN_ERR;
@@ -724,7 +724,7 @@ void aer_print_error(struct pci_dev *dev, struct 
aer_err_info *info)
layer = AER_GET_LAYER_ERROR(info->severity, info->status);
agent = AER_GET_AGENT(info->severity, info->status);
 
-   level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
+   level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR;
 
pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
   aer_error_severity_string[info->severity],
@@ -797,14 +797,22 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
info.mask = mask;
info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
 
-   pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
+   if (aer_severity == AER_CORRECTABLE)
+   pci_info(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, 
mask);
+   else
+   pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, 
mask);
+
__aer_print_error(dev, &info);
-   pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
-   aer_error_layer[layer], aer_agent_string[agent]);
 
-   if (aer_severity != AER_CORRECTABLE)
+   if (aer_severity == AER_CORRECTABLE) {
+   pci_info(dev, "aer_layer=%s, aer_agent=%s\n",
+   aer_error_layer[layer], aer_agent_string[agent]);
+   } else {
+   pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
+   aer_error_layer[layer], aer_agent_string[agent]);
pci_err(dev, "aer_uncor_severity: 0x%08x\n",
aer->uncor_severity);
+   }
 
if (tlp_header_valid)
__print_tlp_header(dev, &aer->header_log);
-- 
2.41.0.rc0.172.g3f132b7071-goog

[PATCH 2/2] PCI/AER: Rate limit the reporting of the correctable errors

2023-06-05 Thread Grant Grundler

From: Rajat Khandelwal 

There are many instances where correctable errors tend to inundate
the message buffer. We observe such instances during thunderbolt PCIe
tunneling.

It's true that they are mitigated by the hardware and are non-fatal
but we shouldn't be spamming the logs with such correctable errors as it
confuses other kernel developers less familiar with PCI errors, support
staff, and users who happen to look at the logs, hence rate limit them.

A typical example log inside an HP TBT4 dock:
[54912.661142] pcieport :00:07.0: AER: Multiple Corrected error received: 
:2b:00.0
[54912.661194] igc :2b:00.0: PCIe Bus Error: severity=Corrected, type=Data 
Link Layer, (Transmitter ID)
[54912.661203] igc :2b:00.0:   device [8086:5502] error 
status/mask=1100/2000
[54912.661211] igc :2b:00.0:[ 8] Rollover
[54912.661219] igc :2b:00.0:[12] Timeout
[54982.838760] pcieport :00:07.0: AER: Corrected error received: 
:2b:00.0
[54982.838798] igc :2b:00.0: PCIe Bus Error: severity=Corrected, type=Data 
Link Layer, (Transmitter ID)
[54982.838808] igc :2b:00.0:   device [8086:5502] error 
status/mask=1000/2000
[54982.838817] igc :2b:00.0:[12] Timeout

This gets repeated continuously, thus inundating the buffer.

Signed-off-by: Rajat Khandelwal 
Signed-off-by: Grant Grundler 
---
 drivers/pci/pcie/aer.c | 80 +++---
 1 file changed, 51 insertions(+), 29 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d7bfc6070ddb..830f5a1261c9 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -686,26 +686,36 @@ static void __aer_print_error(struct pci_dev *dev,
  struct aer_err_info *info)
 {
const char **strings;
+   char aer_msg[512];
unsigned long status = info->status & ~info->mask;
-   const char *level, *errmsg;
int i;
 
-   if (info->severity == AER_CORRECTABLE) {
-   strings = aer_correctable_error_string;
-   level = KERN_INFO;
-   } else {
-   strings = aer_uncorrectable_error_string;
-   level = KERN_ERR;
-   }
+   memset(aer_msg, 0, sizeof(*aer_msg));
+   snprintf(aer_msg, sizeof(*aer_msg), "aer_status: 0x%08x, aer_mask: 
0x%08x\n",
+   info->status, info->mask);
+
+   strings = (info->severity == AER_CORRECTABLE) ?
+   aer_correctable_error_string : aer_uncorrectable_error_string;
 
for_each_set_bit(i, &status, 32) {
-   errmsg = strings[i];
+   const char *errmsg = strings[i];
+   char bitmsg[64];
+   memset(bitmsg, 0, sizeof(*bitmsg));
+
if (!errmsg)
errmsg = "Unknown Error Bit";
 
-   pci_printk(level, dev, "   [%2d] %-22s%s\n", i, errmsg,
-   info->first_error == i ? " (First)" : "");
+   snprintf(bitmsg, sizeof(*bitmsg), "   [%2d] %-22s%s\n", i, 
errmsg,
+   info->first_error == i ? " (First)" : "");
+
+   strlcat(aer_msg, bitmsg, sizeof(*aer_msg));
}
+
+   if (info->severity == AER_CORRECTABLE)
+   pci_info_ratelimited(dev, "%s", aer_msg);
+   else
+   pci_err(dev, "%s", aer_msg):
+
pci_dev_aer_stats_incr(dev, info);
 }
 
@@ -713,7 +723,6 @@ void aer_print_error(struct pci_dev *dev, struct 
aer_err_info *info)
 {
int layer, agent;
int id = ((dev->bus->number << 8) | dev->devfn);
-   const char *level;
 
if (!info->status) {
pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, 
(Unregistered Agent ID)\n",
@@ -724,14 +733,19 @@ void aer_print_error(struct pci_dev *dev, struct 
aer_err_info *info)
layer = AER_GET_LAYER_ERROR(info->severity, info->status);
agent = AER_GET_AGENT(info->severity, info->status);
 
-   level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR;
-
-   pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
-  aer_error_severity_string[info->severity],
-  aer_error_layer[layer], aer_agent_string[agent]);
-
-   pci_printk(level, dev, "  device [%04x:%04x] error 
status/mask=%08x/%08x\n",
-  dev->vendor, dev->device, info->status, info->mask);
+   if (info->severity == AER_CORRECTABLE) {
+   pci_info_ratelimited(dev, "PCIe Bus Error: severity=%s, 
type=%s, (%s)\n"
+   "  device [%04x:%04x] error 
status/mask=%08x/%08x\n",
+aer_error_severity_string[info->severity],
+aer_error_layer[layer], 
aer_agent_string[agent],
+dev->vendor, dev->device, info->status, 
info->mask);
+   } else {
+   pci_err(dev, "PCIe Bus Error: severity=%s, type=%s, (%

[PATCHv3 pci-next 1/2] PCI/AER: correctable error message as KERN_INFO

2023-06-05 Thread Grant Grundler

Since correctable errors have been corrected (and counted), the dmesg output
should not be reported as a warning, but rather as "informational".

Otherwise, using a certain well known vendor's PCIe parts in a USB4 docking
station, the dmesg buffer can be spammed with correctable errors, 717 bytes
per instance, potentially many MB per day.

Given the "WARN" priority, these messages have already confused the typical
user that stumbles across them, support staff (triaging feedback reports),
and more than a few linux kernel devs. Changing to INFO will hide these
messages from most audiences.

Signed-off-by: Grant Grundler 
---
 drivers/pci/pcie/aer.c | 20 ++--
 1 file changed, 14 insertions(+), 6 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index f6c24ded134c..d7bfc6070ddb 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -692,7 +692,7 @@ static void __aer_print_error(struct pci_dev *dev,
 
if (info->severity == AER_CORRECTABLE) {
strings = aer_correctable_error_string;
-   level = KERN_WARNING;
+   level = KERN_INFO;
} else {
strings = aer_uncorrectable_error_string;
level = KERN_ERR;
@@ -724,7 +724,7 @@ void aer_print_error(struct pci_dev *dev, struct 
aer_err_info *info)
layer = AER_GET_LAYER_ERROR(info->severity, info->status);
agent = AER_GET_AGENT(info->severity, info->status);
 
-   level = (info->severity == AER_CORRECTABLE) ? KERN_WARNING : KERN_ERR;
+   level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR;
 
pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
   aer_error_severity_string[info->severity],
@@ -797,14 +797,22 @@ void cper_print_aer(struct pci_dev *dev, int aer_severity,
info.mask = mask;
info.first_error = PCI_ERR_CAP_FEP(aer->cap_control);
 
-   pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, mask);
+   if (aer_severity == AER_CORRECTABLE)
+   pci_info(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, 
mask);
+   else
+   pci_err(dev, "aer_status: 0x%08x, aer_mask: 0x%08x\n", status, 
mask);
+
__aer_print_error(dev, &info);
-   pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
-   aer_error_layer[layer], aer_agent_string[agent]);
 
-   if (aer_severity != AER_CORRECTABLE)
+   if (aer_severity == AER_CORRECTABLE) {
+   pci_info(dev, "aer_layer=%s, aer_agent=%s\n",
+   aer_error_layer[layer], aer_agent_string[agent]);
+   } else {
+   pci_err(dev, "aer_layer=%s, aer_agent=%s\n",
+   aer_error_layer[layer], aer_agent_string[agent]);
pci_err(dev, "aer_uncor_severity: 0x%08x\n",
aer->uncor_severity);
+   }
 
if (tlp_header_valid)
__print_tlp_header(dev, &aer->header_log);
-- 
2.41.0.rc0.172.g3f132b7071-goog

[PATCHv3 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors

2023-06-05 Thread Grant Grundler

From: Rajat Khandelwal 

There are many instances where correctable errors tend to inundate
the message buffer. We observe such instances during thunderbolt PCIe
tunneling.

It's true that they are mitigated by the hardware and are non-fatal
but we shouldn't be spamming the logs with such correctable errors as it
confuses other kernel developers less familiar with PCI errors, support
staff, and users who happen to look at the logs, hence rate limit them.

A typical example log inside an HP TBT4 dock:
[54912.661142] pcieport :00:07.0: AER: Multiple Corrected error received: 
:2b:00.0
[54912.661194] igc :2b:00.0: PCIe Bus Error: severity=Corrected, type=Data 
Link Layer, (Transmitter ID)
[54912.661203] igc :2b:00.0:   device [8086:5502] error 
status/mask=1100/2000
[54912.661211] igc :2b:00.0:[ 8] Rollover
[54912.661219] igc :2b:00.0:[12] Timeout
[54982.838760] pcieport :00:07.0: AER: Corrected error received: 
:2b:00.0
[54982.838798] igc :2b:00.0: PCIe Bus Error: severity=Corrected, type=Data 
Link Layer, (Transmitter ID)
[54982.838808] igc :2b:00.0:   device [8086:5502] error 
status/mask=1000/2000
[54982.838817] igc :2b:00.0:[12] Timeout

This gets repeated continuously, thus inundating the buffer.

Signed-off-by: Rajat Khandelwal 
Signed-off-by: Grant Grundler 
---
 drivers/pci/pcie/aer.c | 80 +++---
 1 file changed, 51 insertions(+), 29 deletions(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index d7bfc6070ddb..830f5a1261c9 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -686,26 +686,36 @@ static void __aer_print_error(struct pci_dev *dev,
  struct aer_err_info *info)
 {
const char **strings;
+   char aer_msg[512];
unsigned long status = info->status & ~info->mask;
-   const char *level, *errmsg;
int i;
 
-   if (info->severity == AER_CORRECTABLE) {
-   strings = aer_correctable_error_string;
-   level = KERN_INFO;
-   } else {
-   strings = aer_uncorrectable_error_string;
-   level = KERN_ERR;
-   }
+   memset(aer_msg, 0, sizeof(*aer_msg));
+   snprintf(aer_msg, sizeof(*aer_msg), "aer_status: 0x%08x, aer_mask: 
0x%08x\n",
+   info->status, info->mask);
+
+   strings = (info->severity == AER_CORRECTABLE) ?
+   aer_correctable_error_string : aer_uncorrectable_error_string;
 
for_each_set_bit(i, &status, 32) {
-   errmsg = strings[i];
+   const char *errmsg = strings[i];
+   char bitmsg[64];
+   memset(bitmsg, 0, sizeof(*bitmsg));
+
if (!errmsg)
errmsg = "Unknown Error Bit";
 
-   pci_printk(level, dev, "   [%2d] %-22s%s\n", i, errmsg,
-   info->first_error == i ? " (First)" : "");
+   snprintf(bitmsg, sizeof(*bitmsg), "   [%2d] %-22s%s\n", i, 
errmsg,
+   info->first_error == i ? " (First)" : "");
+
+   strlcat(aer_msg, bitmsg, sizeof(*aer_msg));
}
+
+   if (info->severity == AER_CORRECTABLE)
+   pci_info_ratelimited(dev, "%s", aer_msg);
+   else
+   pci_err(dev, "%s", aer_msg):
+
pci_dev_aer_stats_incr(dev, info);
 }
 
@@ -713,7 +723,6 @@ void aer_print_error(struct pci_dev *dev, struct 
aer_err_info *info)
 {
int layer, agent;
int id = ((dev->bus->number << 8) | dev->devfn);
-   const char *level;
 
if (!info->status) {
pci_err(dev, "PCIe Bus Error: severity=%s, type=Inaccessible, 
(Unregistered Agent ID)\n",
@@ -724,14 +733,19 @@ void aer_print_error(struct pci_dev *dev, struct 
aer_err_info *info)
layer = AER_GET_LAYER_ERROR(info->severity, info->status);
agent = AER_GET_AGENT(info->severity, info->status);
 
-   level = (info->severity == AER_CORRECTABLE) ? KERN_INFO : KERN_ERR;
-
-   pci_printk(level, dev, "PCIe Bus Error: severity=%s, type=%s, (%s)\n",
-  aer_error_severity_string[info->severity],
-  aer_error_layer[layer], aer_agent_string[agent]);
-
-   pci_printk(level, dev, "  device [%04x:%04x] error 
status/mask=%08x/%08x\n",
-  dev->vendor, dev->device, info->status, info->mask);
+   if (info->severity == AER_CORRECTABLE) {
+   pci_info_ratelimited(dev, "PCIe Bus Error: severity=%s, 
type=%s, (%s)\n"
+   "  device [%04x:%04x] error 
status/mask=%08x/%08x\n",
+aer_error_severity_string[info->severity],
+aer_error_layer[layer], 
aer_agent_string[agent],
+dev->vendor, dev->device, info->status, 
info->mask);
+   } else {
+   pci_err(dev, "PCIe Bus Error: severity=%s, type=%s, (%

Re: [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors

2023-06-05 Thread Grant Grundler

On Wed, May 17, 2023 at 11:11 PM Grant Grundler 
wrote:

> On Fri, Apr 7, 2023 at 12:46 PM Bjorn Helgaas  wrote:
> ...
> > But I don't think we need output in a single step; we just need a
> > single instance of ratelimit_state (or one for CPER path and another
> > for native AER path), and that can control all the output for a single
> > error.  E.g., print_hmi_event_info() looks like this:
> >
> >   static void print_hmi_event_info(...)
> >   {
> > static DEFINE_RATELIMIT_STATE(rs, ...);
> >
> > if (__ratelimit(&rs)) {
> >   printk("%s%s Hypervisor Maintenance interrupt ...");
> >   printk("%s Error detail: %s\n", ...);
> >   printk("%s  HMER: %016llx\n", ...);
> > }
> >   }
> >
> > I think it's nice that the struct ratelimit_state is explicit and
> > there's no danger of breaking it when adding another printk later.
>
> Since the output is spread across at least two functions, I think your
> proposal is a better solution.
>
> I'm not happy with the patch series I sent in my previous reply as an
> attachment. It's only marginally better than the original code.
>

Despite not being happy about it, after a week of vacation I now think it
would be better to include them as is since they solve the immediate
problems and then solve the above two issues in additional patches. The two
changes I have prepared so far correctly fix the original issues they
intended to fix and don't affect the new issues we've found.

I'll post a V3 of this series tonight after making sure it at least
compiles and "looks right".

cheers,
grant

> I need another day or two to see if I can implement your proposal
> correctly.
>
> cheers,
> grant
>

[PATCH 00/16] Add support for DAX vmemmap optimization for ppc64

2023-06-05 Thread Aneesh Kumar K.V

This patch series implements changes required to support DAX vmemmap
optimization for ppc64. The vmemmap optimization is only enabled with radix MMU
translation and 1GB PUD mapping with 64K page size. The patch series also split
hugetlb vmemmap optimization as a separate Kconfig variable so that
architectures can enable DAX vmemmap optimization without enabling hugetlb
vmemmap optimization. This should enable architectures like arm64 to enable DAX
vmemmap optimization while they can't enable hugetlb vmemmap optimization. More
details of the same are in patch "mm/vmemmap optimization: Split hugetlb and
d

Aneesh Kumar K.V (16):
  powerpc/mm/book3s64: Use pmdp_ptep helper instead of typecasting.
  powerpc/book3s64/mm: mmu_vmemmap_psize is used by radix
  powerpc/book3s64/mm: Fix DirectMap stats in /proc/meminfo
  powerpc/book3s64/mm: Use PAGE_KERNEL instead of opencoding
  powerpc/mm/dax: Fix the condition when checking if altmap vmemap can
cross-boundary
  mm/hugepage pud: Allow arch-specific helper function to check huge
page pud support
  mm: Change pudp_huge_get_and_clear_full take vm_area_struct as arg
  mm/vmemmap: Improve vmemmap_can_optimize and allow architectures to
override
  mm/vmemmap: Allow architectures to override how vmemmap optimization
works
  mm: Add __HAVE_ARCH_PUD_SAME similar to __HAVE_ARCH_P4D_SAME
  mm/huge pud: Use transparent huge pud helpers only with
CONFIG_TRANSPARENT_HUGEPAGE
  mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization
  powerpc/book3s64/mm: Enable transparent pud hugepage
  powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap
handling function
  powerpc/book3s64/radix: Add support for vmemmap optimization for radix
  powerpc/book3s64/radix: Remove mmu_vmemmap_psize

 Documentation/mm/vmemmap_dedup.rst|   1 +
 Documentation/powerpc/vmemmap_dedup.rst   | 101 
 arch/loongarch/Kconfig|   2 +-
 arch/powerpc/Kconfig  |   1 +
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 156 -
 arch/powerpc/include/asm/book3s/64/radix.h|  47 ++
 .../include/asm/book3s/64/tlbflush-radix.h|   2 +
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   8 +
 arch/powerpc/include/asm/pgtable.h|   3 +
 arch/powerpc/mm/book3s64/pgtable.c|  78 +++
 arch/powerpc/mm/book3s64/radix_pgtable.c  | 551 --
 arch/powerpc/mm/book3s64/radix_tlb.c  |   7 +
 arch/powerpc/mm/init_64.c |  39 +-
 arch/powerpc/platforms/Kconfig.cputype|   1 +
 arch/riscv/Kconfig|   2 +-
 arch/x86/Kconfig  |   3 +-
 drivers/nvdimm/pfn_devs.c |   2 +-
 fs/Kconfig|   2 +-
 include/linux/mm.h|  32 +-
 include/linux/pgtable.h   |  11 +-
 include/trace/events/thp.h|  17 +
 mm/Kconfig|   5 +-
 mm/debug_vm_pgtable.c |   2 +-
 mm/huge_memory.c  |   2 +-
 mm/mm_init.c  |   2 +-
 mm/mremap.c   |   2 +-
 mm/sparse-vmemmap.c   |   3 +
 27 files changed, 1005 insertions(+), 77 deletions(-)
 create mode 100644 Documentation/powerpc/vmemmap_dedup.rst

-- 
2.40.1

[PATCH 01/16] powerpc/mm/book3s64: Use pmdp_ptep helper instead of typecasting.

2023-06-05 Thread Aneesh Kumar K.V

No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 2297aa764ecd..5f8c6fbe8a69 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -952,7 +952,7 @@ unsigned long radix__pmd_hugepage_update(struct mm_struct 
*mm, unsigned long add
assert_spin_locked(pmd_lockptr(mm, pmdp));
 #endif
 
-   old = radix__pte_update(mm, addr, (pte_t *)pmdp, clr, set, 1);
+   old = radix__pte_update(mm, addr, pmdp_ptep(pmdp), clr, set, 1);
trace_hugepage_update(addr, old, clr, set);
 
return old;
-- 
2.40.1

[PATCH 02/16] powerpc/book3s64/mm: mmu_vmemmap_psize is used by radix

2023-06-05 Thread Aneesh Kumar K.V

This should not be within CONFIG_PPC_64S_HASHS_MMU. We use mmu_vmemmap_psize
on radix while mapping the vmemmap area.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 5f8c6fbe8a69..570add33c02d 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -594,7 +594,6 @@ void __init radix__early_init_mmu(void)
 {
unsigned long lpcr;
 
-#ifdef CONFIG_PPC_64S_HASH_MMU
 #ifdef CONFIG_PPC_64K_PAGES
/* PAGE_SIZE mappings */
mmu_virtual_psize = MMU_PAGE_64K;
@@ -611,7 +610,6 @@ void __init radix__early_init_mmu(void)
mmu_vmemmap_psize = MMU_PAGE_2M;
} else
mmu_vmemmap_psize = mmu_virtual_psize;
-#endif
 #endif
/*
 * initialize page table size
-- 
2.40.1

[PATCH 03/16] powerpc/book3s64/mm: Fix DirectMap stats in /proc/meminfo

2023-06-05 Thread Aneesh Kumar K.V

On memory unplug reduce DirectMap page count correctly.
root@ubuntu-guest:# grep Direct /proc/meminfo
DirectMap4k:   0 kB
DirectMap64k:   0 kB
DirectMap2M:115343360 kB
DirectMap1G:   0 kB

Before fix:
root@ubuntu-guest:# ndctl disable-namespace all
disabled 1 namespace
root@ubuntu-guest:# grep Direct /proc/meminfo
DirectMap4k:   0 kB
DirectMap64k:   0 kB
DirectMap2M:115343360 kB
DirectMap1G:   0 kB

After fix:
root@ubuntu-guest:# ndctl disable-namespace all
disabled 1 namespace
root@ubuntu-guest:# grep Direct /proc/meminfo
DirectMap4k:   0 kB
DirectMap64k:   0 kB
DirectMap2M:104857600 kB
DirectMap1G:   0 kB

Fixes: a2dc009afa9a ("powerpc/mm/book3s/radix: Add mapping statistics")
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 34 +++-
 1 file changed, 22 insertions(+), 12 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 570add33c02d..15a099e53cde 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -743,9 +743,9 @@ static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
 }
 
 static void remove_pte_table(pte_t *pte_start, unsigned long addr,
-unsigned long end)
+unsigned long end, bool direct)
 {
-   unsigned long next;
+   unsigned long next, pages = 0;
pte_t *pte;
 
pte = pte_start + pte_index(addr);
@@ -767,13 +767,16 @@ static void remove_pte_table(pte_t *pte_start, unsigned 
long addr,
}
 
pte_clear(&init_mm, addr, pte);
+   pages++;
}
+   if (direct)
+   update_page_count(mmu_virtual_psize, -pages);
 }
 
 static void __meminit remove_pmd_table(pmd_t *pmd_start, unsigned long addr,
-unsigned long end)
+  unsigned long end, bool direct)
 {
-   unsigned long next;
+   unsigned long next, pages = 0;
pte_t *pte_base;
pmd_t *pmd;
 
@@ -791,19 +794,22 @@ static void __meminit remove_pmd_table(pmd_t *pmd_start, 
unsigned long addr,
continue;
}
pte_clear(&init_mm, addr, (pte_t *)pmd);
+   pages++;
continue;
}
 
pte_base = (pte_t *)pmd_page_vaddr(*pmd);
-   remove_pte_table(pte_base, addr, next);
+   remove_pte_table(pte_base, addr, next, direct);
free_pte_table(pte_base, pmd);
}
+   if (direct)
+   update_page_count(MMU_PAGE_2M, -pages);
 }
 
 static void __meminit remove_pud_table(pud_t *pud_start, unsigned long addr,
-unsigned long end)
+  unsigned long end, bool direct)
 {
-   unsigned long next;
+   unsigned long next, pages = 0;
pmd_t *pmd_base;
pud_t *pud;
 
@@ -821,16 +827,20 @@ static void __meminit remove_pud_table(pud_t *pud_start, 
unsigned long addr,
continue;
}
pte_clear(&init_mm, addr, (pte_t *)pud);
+   pages++;
continue;
}
 
pmd_base = pud_pgtable(*pud);
-   remove_pmd_table(pmd_base, addr, next);
+   remove_pmd_table(pmd_base, addr, next, direct);
free_pmd_table(pmd_base, pud);
}
+   if (direct)
+   update_page_count(MMU_PAGE_1G, -pages);
 }
 
-static void __meminit remove_pagetable(unsigned long start, unsigned long end)
+static void __meminit remove_pagetable(unsigned long start, unsigned long end,
+  bool direct)
 {
unsigned long addr, next;
pud_t *pud_base;
@@ -859,7 +869,7 @@ static void __meminit remove_pagetable(unsigned long start, 
unsigned long end)
}
 
pud_base = p4d_pgtable(*p4d);
-   remove_pud_table(pud_base, addr, next);
+   remove_pud_table(pud_base, addr, next, direct);
free_pud_table(pud_base, p4d);
}
 
@@ -882,7 +892,7 @@ int __meminit radix__create_section_mapping(unsigned long 
start,
 
 int __meminit radix__remove_section_mapping(unsigned long start, unsigned long 
end)
 {
-   remove_pagetable(start, end);
+   remove_pagetable(start, end, true);
return 0;
 }
 #endif /* CONFIG_MEMORY_HOTPLUG */
@@ -918,7 +928,7 @@ int __meminit radix__vmemmap_create_mapping(unsigned long 
start,
 #ifdef CONFIG_MEMORY_HOTPLUG
 void __meminit radix__vmemmap_remove_mapping(unsigned long start, unsigned 
long page_size)
 {
-   remove_pagetable(start, start + page_size);
+   remove_pagetable(start, start + page_size, false);
 }
 #endif
 #e

[PATCH 04/16] powerpc/book3s64/mm: Use PAGE_KERNEL instead of opencoding

2023-06-05 Thread Aneesh Kumar K.V

No functional change in this patch.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 15a099e53cde..76f6a1f3b9d8 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -910,7 +910,6 @@ int __meminit radix__vmemmap_create_mapping(unsigned long 
start,
  unsigned long phys)
 {
/* Create a PTE encoding */
-   unsigned long flags = _PAGE_PRESENT | _PAGE_ACCESSED | _PAGE_KERNEL_RW;
int nid = early_pfn_to_nid(phys >> PAGE_SHIFT);
int ret;
 
@@ -919,7 +918,7 @@ int __meminit radix__vmemmap_create_mapping(unsigned long 
start,
return -1;
}
 
-   ret = __map_kernel_page_nid(start, phys, __pgprot(flags), page_size, 
nid);
+   ret = __map_kernel_page_nid(start, phys, PAGE_KERNEL, page_size, nid);
BUG_ON(ret);
 
return 0;
-- 
2.40.1

[PATCH 05/16] powerpc/mm/dax: Fix the condition when checking if altmap vmemap can cross-boundary

2023-06-05 Thread Aneesh Kumar K.V

Without this fix, the last subsection vmemmap can end up in memory even if the
namespace is created with -M mem and has sufficient space in the altmap area.

Fixes: cf387d9644d8 ("libnvdimm/altmap: Track namespace boundaries in altmap")
Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/init_64.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 05b0d584e50b..fe1b83020e0d 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -189,7 +189,7 @@ static bool altmap_cross_boundary(struct vmem_altmap 
*altmap, unsigned long star
unsigned long nr_pfn = page_size / sizeof(struct page);
unsigned long start_pfn = page_to_pfn((struct page *)start);
 
-   if ((start_pfn + nr_pfn) > altmap->end_pfn)
+   if ((start_pfn + nr_pfn - 1) > altmap->end_pfn)
return true;
 
if (start_pfn < altmap->base_pfn)
-- 
2.40.1

[PATCH 06/16] mm/hugepage pud: Allow arch-specific helper function to check huge page pud support

2023-06-05 Thread Aneesh Kumar K.V

Architectures like powerpc would like to enable transparent huge page pud
support only with radix translation. To support that add
has_transparent_pud_hugepage() helper that architectures can override.

Signed-off-by: Aneesh Kumar K.V 
---
 drivers/nvdimm/pfn_devs.c | 2 +-
 include/linux/pgtable.h   | 3 +++
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/nvdimm/pfn_devs.c b/drivers/nvdimm/pfn_devs.c
index af7d9301520c..18ad315581ca 100644
--- a/drivers/nvdimm/pfn_devs.c
+++ b/drivers/nvdimm/pfn_devs.c
@@ -100,7 +100,7 @@ static unsigned long *nd_pfn_supported_alignments(unsigned 
long *alignments)
 
if (has_transparent_hugepage()) {
alignments[1] = HPAGE_PMD_SIZE;
-   if (IS_ENABLED(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD))
+   if (has_transparent_pud_hugepage())
alignments[2] = HPAGE_PUD_SIZE;
}
 
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index c5a51481bbb9..b3f4dd0240f5 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -1597,6 +1597,9 @@ typedef unsigned int pgtbl_mod_mask;
 #define has_transparent_hugepage() IS_BUILTIN(CONFIG_TRANSPARENT_HUGEPAGE)
 #endif
 
+#ifndef has_transparent_pud_hugepage
+#define has_transparent_pud_hugepage() 
IS_BUILTIN(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
+#endif
 /*
  * On some architectures it depends on the mm if the p4d/pud or pmd
  * layer of the page table hierarchy is folded or not.
-- 
2.40.1

[PATCH 07/16] mm: Change pudp_huge_get_and_clear_full take vm_area_struct as arg

2023-06-05 Thread Aneesh Kumar K.V

We will use this in a later patch to do tlb flush when clearing pud entries on
powerpc. This is similar to
commit 93a98695f2f9 ("mm: change pmdp_huge_get_and_clear_full take 
vm_area_struct as arg")

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/pgtable.h | 4 ++--
 mm/debug_vm_pgtable.c   | 2 +-
 mm/huge_memory.c| 2 +-
 3 files changed, 4 insertions(+), 4 deletions(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index b3f4dd0240f5..2fe19720075e 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -442,11 +442,11 @@ static inline pmd_t pmdp_huge_get_and_clear_full(struct 
vm_area_struct *vma,
 #endif
 
 #ifndef __HAVE_ARCH_PUDP_HUGE_GET_AND_CLEAR_FULL
-static inline pud_t pudp_huge_get_and_clear_full(struct mm_struct *mm,
+static inline pud_t pudp_huge_get_and_clear_full(struct vm_area_struct *vma,
unsigned long address, pud_t *pudp,
int full)
 {
-   return pudp_huge_get_and_clear(mm, address, pudp);
+   return pudp_huge_get_and_clear(vma->vm_mm, address, pudp);
 }
 #endif
 #endif /* CONFIG_TRANSPARENT_HUGEPAGE */
diff --git a/mm/debug_vm_pgtable.c b/mm/debug_vm_pgtable.c
index c54177aabebd..c2bf25d5e5cd 100644
--- a/mm/debug_vm_pgtable.c
+++ b/mm/debug_vm_pgtable.c
@@ -382,7 +382,7 @@ static void __init pud_advanced_tests(struct 
pgtable_debug_args *args)
WARN_ON(!(pud_write(pud) && pud_dirty(pud)));
 
 #ifndef __PAGETABLE_PMD_FOLDED
-   pudp_huge_get_and_clear_full(args->mm, vaddr, args->pudp, 1);
+   pudp_huge_get_and_clear_full(args->vma, vaddr, args->pudp, 1);
pud = READ_ONCE(*args->pudp);
WARN_ON(!pud_none(pud));
 #endif /* __PAGETABLE_PMD_FOLDED */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 624671aaa60d..8774b4751a84 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1980,7 +1980,7 @@ int zap_huge_pud(struct mmu_gather *tlb, struct 
vm_area_struct *vma,
if (!ptl)
return 0;
 
-   pudp_huge_get_and_clear_full(tlb->mm, addr, pud, tlb->fullmm);
+   pudp_huge_get_and_clear_full(vma, addr, pud, tlb->fullmm);
tlb_remove_pud_tlb_entry(tlb, pud, addr);
if (vma_is_special_huge(vma)) {
spin_unlock(ptl);
-- 
2.40.1

[PATCH 08/16] mm/vmemmap: Improve vmemmap_can_optimize and allow architectures to override

2023-06-05 Thread Aneesh Kumar K.V

dax vmemmap optimization requires a minimum of 2 PAGE_SIZE area within vmemmap
such that tail page mapping can point to the second PAGE_SIZE area. Enforce that
in vmemmap_can_optimize() function.

Architectures like powerpc also want to enable vmemmap optimization
conditionally (only with radix MMU translation). Hence allow architecture
override.

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/mm.h | 30 ++
 mm/mm_init.c   |  2 +-
 2 files changed, 27 insertions(+), 5 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 27ce77080c79..9a45e61cd83f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -31,6 +31,8 @@
 #include 
 #include 
 
+#include 
+
 struct mempolicy;
 struct anon_vma;
 struct anon_vma_chain;
@@ -3550,13 +3552,33 @@ void vmemmap_free(unsigned long start, unsigned long 
end,
struct vmem_altmap *altmap);
 #endif
 
+#define VMEMMAP_RESERVE_NR 2
 #ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
-static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
-  struct dev_pagemap *pgmap)
+static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap,
+ struct dev_pagemap *pgmap)
 {
-   return is_power_of_2(sizeof(struct page)) &&
-   pgmap && (pgmap_vmemmap_nr(pgmap) > 1) && !altmap;
+   if (pgmap) {
+   unsigned long nr_pages;
+   unsigned long nr_vmemmap_pages;
+
+   nr_pages = pgmap_vmemmap_nr(pgmap);
+   nr_vmemmap_pages = ((nr_pages * sizeof(struct page)) >> 
PAGE_SHIFT);
+   /*
+* For vmemmap optimization with DAX we need minimum 2 vmemmap
+* pages. See layout diagram in 
Documentation/mm/vmemmap_dedup.rst
+*/
+   return is_power_of_2(sizeof(struct page)) &&
+   (nr_vmemmap_pages > VMEMMAP_RESERVE_NR) && !altmap;
+   }
+   return false;
 }
+/*
+ * If we don't have an architecture override, use the generic rule
+ */
+#ifndef vmemmap_can_optimize
+#define vmemmap_can_optimize __vmemmap_can_optimize
+#endif
+
 #else
 static inline bool vmemmap_can_optimize(struct vmem_altmap *altmap,
   struct dev_pagemap *pgmap)
diff --git a/mm/mm_init.c b/mm/mm_init.c
index 7f7f9c677854..d1676afc94f1 100644
--- a/mm/mm_init.c
+++ b/mm/mm_init.c
@@ -1020,7 +1020,7 @@ static inline unsigned long compound_nr_pages(struct 
vmem_altmap *altmap,
if (!vmemmap_can_optimize(altmap, pgmap))
return pgmap_vmemmap_nr(pgmap);
 
-   return 2 * (PAGE_SIZE / sizeof(struct page));
+   return VMEMMAP_RESERVE_NR * (PAGE_SIZE / sizeof(struct page));
 }
 
 static void __ref memmap_init_compound(struct page *head,
-- 
2.40.1

[PATCH 09/16] mm/vmemmap: Allow architectures to override how vmemmap optimization works

2023-06-05 Thread Aneesh Kumar K.V

Architectures like powerpc will like to use different page table allocators and
mapping mechanisms to implement vmemmap optimization. Similar to
vmemmap_populate allow architectures to implement vmemap_populate_compound_pages

Signed-off-by: Aneesh Kumar K.V 
---
 mm/sparse-vmemmap.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
index 10d73a0dfcec..0b83706c08fd 100644
--- a/mm/sparse-vmemmap.c
+++ b/mm/sparse-vmemmap.c
@@ -141,6 +141,7 @@ void __meminit vmemmap_verify(pte_t *pte, int node,
start, end - 1);
 }
 
+#ifndef vmemmap_populate_compound_pages
 pte_t * __meminit vmemmap_pte_populate(pmd_t *pmd, unsigned long addr, int 
node,
   struct vmem_altmap *altmap,
   struct page *reuse)
@@ -446,6 +447,8 @@ static int __meminit 
vmemmap_populate_compound_pages(unsigned long start_pfn,
return 0;
 }
 
+#endif
+
 struct page * __meminit __populate_section_memmap(unsigned long pfn,
unsigned long nr_pages, int nid, struct vmem_altmap *altmap,
struct dev_pagemap *pgmap)
-- 
2.40.1

[PATCH 10/16] mm: Add __HAVE_ARCH_PUD_SAME similar to __HAVE_ARCH_P4D_SAME

2023-06-05 Thread Aneesh Kumar K.V

This helps architectures to override pmd_same and pud_same
independently.

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/pgtable.h | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 2fe19720075e..8c5174d1f9db 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -681,7 +681,9 @@ static inline int pmd_same(pmd_t pmd_a, pmd_t pmd_b)
 {
return pmd_val(pmd_a) == pmd_val(pmd_b);
 }
+#endif
 
+#ifndef __HAVE_ARCH_PUD_SAME
 static inline int pud_same(pud_t pud_a, pud_t pud_b)
 {
return pud_val(pud_a) == pud_val(pud_b);
-- 
2.40.1

[PATCH 11/16] mm/huge pud: Use transparent huge pud helpers only with CONFIG_TRANSPARENT_HUGEPAGE

2023-06-05 Thread Aneesh Kumar K.V

pudp_set_wrprotect and move_huge_pud helpers are only used when
CONFIG_TRANSPARENT_HUGEPAGE is enabled. Similar to pmdp_set_wrprotect and
move_huge_pmd_helpers use architecture override only if
CONFIG_TRANSPARENT_HUGEPAGE is set

Signed-off-by: Aneesh Kumar K.V 
---
 include/linux/pgtable.h | 2 ++
 mm/mremap.c | 2 +-
 2 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index 8c5174d1f9db..c7f5806dc9d1 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -550,6 +550,7 @@ static inline void pmdp_set_wrprotect(struct mm_struct *mm,
 #endif
 #ifndef __HAVE_ARCH_PUDP_SET_WRPROTECT
 #ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+#ifdef CONFIG_TRANSPARENT_HUGEPAGE
 static inline void pudp_set_wrprotect(struct mm_struct *mm,
  unsigned long address, pud_t *pudp)
 {
@@ -563,6 +564,7 @@ static inline void pudp_set_wrprotect(struct mm_struct *mm,
 {
BUILD_BUG();
 }
+#endif /* CONFIG_TRANSPARENT_HUGEPAGE */
 #endif /* CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD */
 #endif
 
diff --git a/mm/mremap.c b/mm/mremap.c
index b11ce6c92099..6373db571e5c 100644
--- a/mm/mremap.c
+++ b/mm/mremap.c
@@ -338,7 +338,7 @@ static inline bool move_normal_pud(struct vm_area_struct 
*vma,
 }
 #endif
 
-#ifdef CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD
+#if defined(CONFIG_TRANSPARENT_HUGEPAGE) && 
defined(CONFIG_HAVE_ARCH_TRANSPARENT_HUGEPAGE_PUD)
 static bool move_huge_pud(struct vm_area_struct *vma, unsigned long old_addr,
  unsigned long new_addr, pud_t *old_pud, pud_t 
*new_pud)
 {
-- 
2.40.1

[PATCH 12/16] mm/vmemmap optimization: Split hugetlb and devdax vmemmap optimization

2023-06-05 Thread Aneesh Kumar K.V

Arm disabled hugetlb vmemmap optimization [1] because hugetlb vmemmap
optimization includes an update of both the permissions (writeable to read-only)
and the output address (pfn) of the vmemmap ptes. That is not supported without
unmapping of pte(marking it invalid) by some architectures.

With DAX vmemmap optimization we don't require such pte updates and
architectures can enable DAX vmemmap optimization while having hugetlb vmemmap
optimization disabled. Hence split DAX optimization support into a different
config.

loongarch and riscv don't have devdax support. So the DAX config is not enabled
for them. With this change, arm64 should be able to select DAX optimization

[1] commit 060a2c92d1b6 ("arm64: mm: hugetlb: Disable 
HUGETLB_PAGE_OPTIMIZE_VMEMMAP")

Signed-off-by: Aneesh Kumar K.V 
---
 arch/loongarch/Kconfig | 2 +-
 arch/riscv/Kconfig | 2 +-
 arch/x86/Kconfig   | 3 ++-
 fs/Kconfig | 2 +-
 include/linux/mm.h | 2 +-
 mm/Kconfig | 5 -
 6 files changed, 10 insertions(+), 6 deletions(-)

diff --git a/arch/loongarch/Kconfig b/arch/loongarch/Kconfig
index d38b066fc931..2060990c4612 100644
--- a/arch/loongarch/Kconfig
+++ b/arch/loongarch/Kconfig
@@ -55,7 +55,7 @@ config LOONGARCH
select ARCH_USE_QUEUED_SPINLOCKS
select ARCH_WANT_DEFAULT_TOPDOWN_MMAP_LAYOUT
select ARCH_WANT_LD_ORPHAN_WARN
-   select ARCH_WANT_OPTIMIZE_VMEMMAP
+   select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
select ARCH_WANTS_NO_INSTR
select BUILDTIME_TABLE_SORT
select COMMON_CLK
diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index 348c0fa1fc8c..aafbf201f708 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -49,7 +49,7 @@ config RISCV
select ARCH_WANT_GENERAL_HUGETLB if !RISCV_ISA_SVNAPOT
select ARCH_WANT_HUGE_PMD_SHARE if 64BIT
select ARCH_WANT_LD_ORPHAN_WARN if !XIP_KERNEL
-   select ARCH_WANT_OPTIMIZE_VMEMMAP
+   select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
select ARCH_WANTS_THP_SWAP if HAVE_ARCH_TRANSPARENT_HUGEPAGE
select BINFMT_FLAT_NO_DATA_START_OFFSET if !MMU
select BUILDTIME_TABLE_SORT if MMU
diff --git a/arch/x86/Kconfig b/arch/x86/Kconfig
index 53bab123a8ee..eb383960b6ee 100644
--- a/arch/x86/Kconfig
+++ b/arch/x86/Kconfig
@@ -127,7 +127,8 @@ config X86
select ARCH_WANT_GENERAL_HUGETLB
select ARCH_WANT_HUGE_PMD_SHARE
select ARCH_WANT_LD_ORPHAN_WARN
-   select ARCH_WANT_OPTIMIZE_VMEMMAP   if X86_64
+   select ARCH_WANT_OPTIMIZE_DAX_VMEMMAP   if X86_64
+   select ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP   if X86_64
select ARCH_WANTS_THP_SWAP  if X86_64
select ARCH_HAS_PARANOID_L1D_FLUSH
select BUILDTIME_TABLE_SORT
diff --git a/fs/Kconfig b/fs/Kconfig
index 18d034ec7953..9c104c130a6e 100644
--- a/fs/Kconfig
+++ b/fs/Kconfig
@@ -252,7 +252,7 @@ config HUGETLB_PAGE
 
 config HUGETLB_PAGE_OPTIMIZE_VMEMMAP
def_bool HUGETLB_PAGE
-   depends on ARCH_WANT_OPTIMIZE_VMEMMAP
+   depends on ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
depends on SPARSEMEM_VMEMMAP
 
 config HUGETLB_PAGE_OPTIMIZE_VMEMMAP_DEFAULT_ON
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9a45e61cd83f..6e56ae09f0c1 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3553,7 +3553,7 @@ void vmemmap_free(unsigned long start, unsigned long end,
 #endif
 
 #define VMEMMAP_RESERVE_NR 2
-#ifdef CONFIG_ARCH_WANT_OPTIMIZE_VMEMMAP
+#ifdef CONFIG_ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
 static inline bool __vmemmap_can_optimize(struct vmem_altmap *altmap,
  struct dev_pagemap *pgmap)
 {
diff --git a/mm/Kconfig b/mm/Kconfig
index 7672a22647b4..7b388c10baab 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -461,7 +461,10 @@ config SPARSEMEM_VMEMMAP
 # Select this config option from the architecture Kconfig, if it is preferred
 # to enable the feature of HugeTLB/dev_dax vmemmap optimization.
 #
-config ARCH_WANT_OPTIMIZE_VMEMMAP
+config ARCH_WANT_OPTIMIZE_DAX_VMEMMAP
+   bool
+
+config ARCH_WANT_OPTIMIZE_HUGETLB_VMEMMAP
bool
 
 config HAVE_MEMBLOCK_PHYS_MAP
-- 
2.40.1

[PATCH 13/16] powerpc/book3s64/mm: Enable transparent pud hugepage

2023-06-05 Thread Aneesh Kumar K.V

This is enabled only with radix translation and 1G hugepage size. This will be
used with devdax device memory with a namespace alignment of 1G.

Anon transparent hugepage is not supported even though we do have helpers
checking pud_trans_huge(). We should never find that return true. The only
expected pte bit combination is _PAGE_PTE | _PAGE_DEVMAP.

Some of the helpers are never expected to get called on hash translation and
hence is marked to call BUG() in such a case.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/pgtable.h  | 156 --
 arch/powerpc/include/asm/book3s/64/radix.h|  37 +
 .../include/asm/book3s/64/tlbflush-radix.h|   2 +
 arch/powerpc/include/asm/book3s/64/tlbflush.h |   8 +
 arch/powerpc/mm/book3s64/pgtable.c|  78 +
 arch/powerpc/mm/book3s64/radix_pgtable.c  |  28 
 arch/powerpc/mm/book3s64/radix_tlb.c  |   7 +
 arch/powerpc/platforms/Kconfig.cputype|   1 +
 include/trace/events/thp.h|  17 ++
 9 files changed, 323 insertions(+), 11 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/pgtable.h 
b/arch/powerpc/include/asm/book3s/64/pgtable.h
index 4acc9690f599..9a05de007956 100644
--- a/arch/powerpc/include/asm/book3s/64/pgtable.h
+++ b/arch/powerpc/include/asm/book3s/64/pgtable.h
@@ -921,8 +921,29 @@ static inline pud_t pte_pud(pte_t pte)
 {
return __pud_raw(pte_raw(pte));
 }
+
+static inline pte_t *pudp_ptep(pud_t *pud)
+{
+   return (pte_t *)pud;
+}
+
+#define pud_pfn(pud)   pte_pfn(pud_pte(pud))
+#define pud_dirty(pud) pte_dirty(pud_pte(pud))
+#define pud_young(pud) pte_young(pud_pte(pud))
+#define pud_mkold(pud) pte_pud(pte_mkold(pud_pte(pud)))
+#define pud_wrprotect(pud) pte_pud(pte_wrprotect(pud_pte(pud)))
+#define pud_mkdirty(pud)   pte_pud(pte_mkdirty(pud_pte(pud)))
+#define pud_mkclean(pud)   pte_pud(pte_mkclean(pud_pte(pud)))
+#define pud_mkyoung(pud)   pte_pud(pte_mkyoung(pud_pte(pud)))
+#define pud_mkwrite(pud)   pte_pud(pte_mkwrite(pud_pte(pud)))
 #define pud_write(pud) pte_write(pud_pte(pud))
 
+#ifdef CONFIG_HAVE_ARCH_SOFT_DIRTY
+#define pud_soft_dirty(pmd)pte_soft_dirty(pud_pte(pud))
+#define pud_mksoft_dirty(pmd)  pte_pud(pte_mksoft_dirty(pud_pte(pud)))
+#define pud_clear_soft_dirty(pmd) pte_pud(pte_clear_soft_dirty(pud_pte(pud)))
+#endif /* CONFIG_HAVE_ARCH_SOFT_DIRTY */
+
 static inline int pud_bad(pud_t pud)
 {
if (radix_enabled())
@@ -1115,15 +1136,24 @@ static inline bool pmd_access_permitted(pmd_t pmd, bool 
write)
 
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 extern pmd_t pfn_pmd(unsigned long pfn, pgprot_t pgprot);
+extern pud_t pfn_pud(unsigned long pfn, pgprot_t pgprot);
 extern pmd_t mk_pmd(struct page *page, pgprot_t pgprot);
 extern pmd_t pmd_modify(pmd_t pmd, pgprot_t newprot);
 extern void set_pmd_at(struct mm_struct *mm, unsigned long addr,
   pmd_t *pmdp, pmd_t pmd);
+extern void set_pud_at(struct mm_struct *mm, unsigned long addr,
+  pud_t *pudp, pud_t pud);
+
 static inline void update_mmu_cache_pmd(struct vm_area_struct *vma,
unsigned long addr, pmd_t *pmd)
 {
 }
 
+static inline void update_mmu_cache_pud(struct vm_area_struct *vma,
+   unsigned long addr, pud_t *pud)
+{
+}
+
 extern int hash__has_transparent_hugepage(void);
 static inline int has_transparent_hugepage(void)
 {
@@ -1133,6 +1163,14 @@ static inline int has_transparent_hugepage(void)
 }
 #define has_transparent_hugepage has_transparent_hugepage
 
+static inline int has_transparent_pud_hugepage(void)
+{
+   if (radix_enabled())
+   return radix__has_transparent_pud_hugepage();
+   return 0;
+}
+#define has_transparent_pud_hugepage has_transparent_pud_hugepage
+
 static inline unsigned long
 pmd_hugepage_update(struct mm_struct *mm, unsigned long addr, pmd_t *pmdp,
unsigned long clr, unsigned long set)
@@ -1142,6 +1180,16 @@ pmd_hugepage_update(struct mm_struct *mm, unsigned long 
addr, pmd_t *pmdp,
return hash__pmd_hugepage_update(mm, addr, pmdp, clr, set);
 }
 
+static inline unsigned long
+pud_hugepage_update(struct mm_struct *mm, unsigned long addr, pud_t *pudp,
+   unsigned long clr, unsigned long set)
+{
+   if (radix_enabled())
+   return radix__pud_hugepage_update(mm, addr, pudp, clr, set);
+   BUG();
+   return pud_val(*pudp);
+}
+
 /*
  * returns true for pmd migration entries, THP, devmap, hugetlb
  * But compile time dependent on THP config
@@ -1151,6 +1199,11 @@ static inline int pmd_large(pmd_t pmd)
return !!(pmd_raw(pmd) & cpu_to_be64(_PAGE_PTE));
 }
 
+static inline int pud_large(pud_t pud)
+{
+   return !!(pud_raw(pud) & cpu_to_be64(_PAGE_PTE));
+}
+
 /*
  * For radix we should always find H_PAGE_HASHPTE zero. Hence
  * the below will work for radix too
@@ -1166

[PATCH 14/16] powerpc/book3s64/vmemmap: Switch radix to use a different vmemmap handling function

2023-06-05 Thread Aneesh Kumar K.V

This is in preparation to update radix to implement vmemmap optimization for
devdax. Below are the rules w.r.t radix vmemmap mapping

1. First try to map things using PMD (2M)
2. With altmap if altmap cross-boundary check returns true, fall back to 
PAGE_SIZE
3. IF we can't allocate PMD_SIZE backing memory for vmemmap, fallback to 
PAGE_SIZE

On removing vmemmap mapping, check if every subsection that is using the vmemmap
area is invalid. If found to be invalid, that implies we can safely free the
vmemmap area. We don't use the PAGE_UNUSED pattern used by x86 because with 64K
page size, we need to do the above check even at the PAGE_SIZE granularity.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/include/asm/book3s/64/radix.h |   2 +
 arch/powerpc/include/asm/pgtable.h |   3 +
 arch/powerpc/mm/book3s64/radix_pgtable.c   | 293 +++--
 arch/powerpc/mm/init_64.c  |  26 +-
 4 files changed, 293 insertions(+), 31 deletions(-)

diff --git a/arch/powerpc/include/asm/book3s/64/radix.h 
b/arch/powerpc/include/asm/book3s/64/radix.h
index 8cdff5a05011..87d4c1e62491 100644
--- a/arch/powerpc/include/asm/book3s/64/radix.h
+++ b/arch/powerpc/include/asm/book3s/64/radix.h
@@ -332,6 +332,8 @@ extern int __meminit radix__vmemmap_create_mapping(unsigned 
long start,
 unsigned long phys);
 int __meminit radix__vmemmap_populate(unsigned long start, unsigned long end,
  int node, struct vmem_altmap *altmap);
+void __ref radix__vmemmap_free(unsigned long start, unsigned long end,
+  struct vmem_altmap *altmap);
 extern void radix__vmemmap_remove_mapping(unsigned long start,
unsigned long page_size);
 
diff --git a/arch/powerpc/include/asm/pgtable.h 
b/arch/powerpc/include/asm/pgtable.h
index 9972626ddaf6..6d4cd2ebae6e 100644
--- a/arch/powerpc/include/asm/pgtable.h
+++ b/arch/powerpc/include/asm/pgtable.h
@@ -168,6 +168,9 @@ static inline bool is_ioremap_addr(const void *x)
 
 struct seq_file;
 void arch_report_meminfo(struct seq_file *m);
+int __meminit vmemmap_populated(unsigned long vmemmap_addr, int 
vmemmap_map_size);
+bool altmap_cross_boundary(struct vmem_altmap *altmap, unsigned long start,
+  unsigned long page_size);
 #endif /* CONFIG_PPC64 */
 
 #endif /* __ASSEMBLY__ */
diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index d7e2dd3d4add..65de8630abcb 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -742,8 +742,57 @@ static void free_pud_table(pud_t *pud_start, p4d_t *p4d)
p4d_clear(p4d);
 }
 
+static bool __meminit vmemmap_pmd_is_unused(unsigned long addr, unsigned long 
end)
+{
+   unsigned long start = ALIGN_DOWN(addr, PMD_SIZE);
+
+   return vmemmap_populated(start, PMD_SIZE);
+}
+
+static bool __meminit vmemmap_page_is_unused(unsigned long addr, unsigned long 
end)
+{
+   unsigned long start = ALIGN_DOWN(addr, PAGE_SIZE);
+
+   return vmemmap_populated(start, PAGE_SIZE);
+
+}
+
+static void __meminit free_vmemmap_pages(struct page *page,
+struct vmem_altmap *altmap,
+int order)
+{
+   unsigned int nr_pages = 1 << order;
+
+   if (altmap) {
+   unsigned long alt_start, alt_end;
+   unsigned long base_pfn = page_to_pfn(page);
+
+   /*
+* with 1G vmemmap mmaping we can have things setup
+* such that even though atlmap is specified we never
+* used altmap.
+*/
+   alt_start = altmap->base_pfn;
+   alt_end = altmap->base_pfn + altmap->reserve +
+   altmap->free + altmap->alloc + altmap->align;
+
+   if (base_pfn >= alt_start && base_pfn < alt_end) {
+   vmem_altmap_free(altmap, nr_pages);
+   return;
+   }
+   }
+
+   if (PageReserved(page)) {
+   /* allocated from memblock */
+   while (nr_pages--)
+   free_reserved_page(page++);
+   } else
+   free_pages((unsigned long)page_address(page), order);
+}
+
 static void remove_pte_table(pte_t *pte_start, unsigned long addr,
-unsigned long end, bool direct)
+unsigned long end, bool direct,
+struct vmem_altmap *altmap)
 {
unsigned long next, pages = 0;
pte_t *pte;
@@ -757,24 +806,23 @@ static void remove_pte_table(pte_t *pte_start, unsigned 
long addr,
if (!pte_present(*pte))
continue;
 
-   if (!PAGE_ALIGNED(addr) || !PAGE_ALIGNED(next)) {
-   /*
-* The vmemmap_free() and remove_section_ma

[PATCH 15/16] powerpc/book3s64/radix: Add support for vmemmap optimization for radix

2023-06-05 Thread Aneesh Kumar K.V

With 2M PMD-level mapping, we require 32 struct pages and a single vmemmap page
can contain 1024 struct pages (PAGE_SIZE/sizeof(struct page)). Hence with 64K
page size, we don't use vmemmap deduplication for PMD-level mapping.

Signed-off-by: Aneesh Kumar K.V 
---
 Documentation/mm/vmemmap_dedup.rst |   1 +
 Documentation/powerpc/vmemmap_dedup.rst| 101 ++
 arch/powerpc/Kconfig   |   1 +
 arch/powerpc/include/asm/book3s/64/radix.h |   8 +
 arch/powerpc/mm/book3s64/radix_pgtable.c   | 203 +
 5 files changed, 314 insertions(+)
 create mode 100644 Documentation/powerpc/vmemmap_dedup.rst

diff --git a/Documentation/mm/vmemmap_dedup.rst 
b/Documentation/mm/vmemmap_dedup.rst
index a4b12ff906c4..c573e08b5043 100644
--- a/Documentation/mm/vmemmap_dedup.rst
+++ b/Documentation/mm/vmemmap_dedup.rst
@@ -210,6 +210,7 @@ the device (altmap).
 
 The following page sizes are supported in DAX: PAGE_SIZE (4K on x86_64),
 PMD_SIZE (2M on x86_64) and PUD_SIZE (1G on x86_64).
+For powerpc equivalent details see Documentation/powerpc/vmemmap_dedup.rst
 
 The differences with HugeTLB are relatively minor.
 
diff --git a/Documentation/powerpc/vmemmap_dedup.rst 
b/Documentation/powerpc/vmemmap_dedup.rst
new file mode 100644
index ..dc4db59fdf87
--- /dev/null
+++ b/Documentation/powerpc/vmemmap_dedup.rst
@@ -0,0 +1,101 @@
+.. SPDX-License-Identifier: GPL-2.0
+
+==
+Device DAX
+==
+
+The device-dax interface uses the tail deduplication technique explained in
+Documentation/mm/vmemmap_dedup.rst
+
+On powerpc, vmemmap deduplication is only used with radix MMU translation. Also
+with a 64K page size, only the devdax namespace with 1G alignment uses vmemmap
+deduplication.
+
+With 2M PMD level mapping, we require 32 struct pages and a single 64K vmemmap
+page can contain 1024 struct pages (64K/sizeof(struct page)). Hence there is no
+vmemmap deduplication possible.
+
+With 1G PUD level mapping, we require 16384 struct pages and a single 64K
+vmemmap page can contain 1024 struct pages (64K/sizeof(struct page)). Hence we
+require 16 64K pages in vmemmap to map the struct page for 1G PUD level 
mapping.
+
+Here's how things look like on device-dax after the sections are populated::
+ +---+ ---virt_to_page---> +---+   mapping to   +---+
+ |   | | 0 | -> | 0 |
+ |   | +---++---+
+ |   | | 1 | -> | 1 |
+ |   | +---++---+
+ |   | | 2 | ^ ^ ^ ^ ^ ^
+ |   | +---+   | | | | |
+ |   | | 3 | --+ | | | |
+ |   | +---+ | | | |
+ |   | | 4 | + | | |
+ |PUD| +---+   | | |
+ |   level   | | . | --+ | |
+ |  mapping  | +---+ | |
+ |   | | . | + |
+ |   | +---+   |
+ |   | | 15| --+
+ |   | +---+
+ |   |
+ |   |
+ |   |
+ +---+
+
+
+With 4K page size, 2M PMD level mapping requires 512 struct pages and a single
+4K vmemmap page contains 64 struct pages(4K/sizeof(struct page)). Hence we
+require 8 4K pages in vmemmap to map the struct page for 2M pmd level mapping.
+
+Here's how things look like on device-dax after the sections are populated::
+
+ +---+ ---virt_to_page---> +---+   mapping to   +---+
+ |   | | 0 | -> | 0 |
+ |   | +---++---+
+ |   | | 1 | -> | 1 |
+ |   | +---++---+
+ |   | | 2 | ^ ^ ^ ^ ^ ^
+ |   | +---+   | | | | |
+ |   | | 3 | --+ | | | |
+ |   | +---+ | | | |
+ |   | | 4 | + | | |
+ |PMD| +---+   | | |
+ |   level   | | 5 | --+ | |
+ |  mapping  | +---+ | |
+ |   |

[PATCH 16/16] powerpc/book3s64/radix: Remove mmu_vmemmap_psize

2023-06-05 Thread Aneesh Kumar K.V

This is not used by radix anymore.

Signed-off-by: Aneesh Kumar K.V 
---
 arch/powerpc/mm/book3s64/radix_pgtable.c | 10 --
 arch/powerpc/mm/init_64.c| 21 ++---
 2 files changed, 14 insertions(+), 17 deletions(-)

diff --git a/arch/powerpc/mm/book3s64/radix_pgtable.c 
b/arch/powerpc/mm/book3s64/radix_pgtable.c
index 82d36df52a8d..b59219751599 100644
--- a/arch/powerpc/mm/book3s64/radix_pgtable.c
+++ b/arch/powerpc/mm/book3s64/radix_pgtable.c
@@ -601,16 +601,6 @@ void __init radix__early_init_mmu(void)
mmu_virtual_psize = MMU_PAGE_4K;
 #endif
 
-#ifdef CONFIG_SPARSEMEM_VMEMMAP
-   /* vmemmap mapping */
-   if (mmu_psize_defs[MMU_PAGE_2M].shift) {
-   /*
-* map vmemmap using 2M if available
-*/
-   mmu_vmemmap_psize = MMU_PAGE_2M;
-   } else
-   mmu_vmemmap_psize = mmu_virtual_psize;
-#endif
/*
 * initialize page table size
 */
diff --git a/arch/powerpc/mm/init_64.c b/arch/powerpc/mm/init_64.c
index 5701faca39ef..6db7a063ba63 100644
--- a/arch/powerpc/mm/init_64.c
+++ b/arch/powerpc/mm/init_64.c
@@ -198,17 +198,12 @@ bool altmap_cross_boundary(struct vmem_altmap *altmap, 
unsigned long start,
return false;
 }
 
-int __meminit vmemmap_populate(unsigned long start, unsigned long end, int 
node,
-   struct vmem_altmap *altmap)
+int __meminit __vmemmap_populate(unsigned long start, unsigned long end, int 
node,
+struct vmem_altmap *altmap)
 {
bool altmap_alloc;
unsigned long page_size = 1 << mmu_psize_defs[mmu_vmemmap_psize].shift;
 
-#ifdef CONFIG_PPC_BOOK3S_64
-   if (radix_enabled())
-   return radix__vmemmap_populate(start, end, node, altmap);
-#endif
-
/* Align to the page size of the linear mapping. */
start = ALIGN_DOWN(start, page_size);
 
@@ -277,6 +272,18 @@ int __meminit vmemmap_populate(unsigned long start, 
unsigned long end, int node,
return 0;
 }
 
+int __meminit vmemmap_populate(unsigned long start, unsigned long end, int 
node,
+  struct vmem_altmap *altmap)
+{
+
+#ifdef CONFIG_PPC_BOOK3S_64
+   if (radix_enabled())
+   return radix__vmemmap_populate(start, end, node, altmap);
+#endif
+
+   return __vmemmap_populate(start, end, node, altmap);
+}
+
 #ifdef CONFIG_MEMORY_HOTPLUG
 static unsigned long vmemmap_list_free(unsigned long start)
 {
-- 
2.40.1

Re: [PATCH 07/12] s390: add pte_free_defer(), with use of mmdrop_async()

2023-06-05 Thread Hugh Dickins

On Sun, 28 May 2023, Hugh Dickins wrote:

> Add s390-specific pte_free_defer(), to call pte_free() via call_rcu().
> pte_free_defer() will be called inside khugepaged's retract_page_tables()
> loop, where allocating extra memory cannot be relied upon.  This precedes
> the generic version to avoid build breakage from incompatible pgtable_t.
> 
> This version is more complicated than others: because page_table_free()
> needs to know which fragment is being freed, and which mm to link it to.
> 
> page_table_free()'s fragment handling is clever, but I could too easily
> break it: what's done here in pte_free_defer() and pte_free_now() might
> be better integrated with page_table_free()'s cleverness, but not by me!
> 
> By the time that page_table_free() gets called via RCU, it's conceivable
> that mm would already have been freed: so mmgrab() in pte_free_defer()
> and mmdrop() in pte_free_now().  No, that is not a good context to call
> mmdrop() from, so make mmdrop_async() public and use that.

But Matthew Wilcox quickly pointed out that sharing one page->rcu_head
between multiple page tables is tricky: something I knew but had lost
sight of.  So the powerpc and s390 patches were broken: powerpc fairly
easily fixed, but s390 more painful.

In https://lore.kernel.org/linux-s390/20230601155751.7c949ca4@thinkpad-T15/
On Thu, 1 Jun 2023 15:57:51 +0200
Gerald Schaefer  wrote:
> 
> Yes, we have 2 pagetables in one 4K page, which could result in same
> rcu_head reuse. It might be possible to use the cleverness from our
> page_table_free() function, e.g. to only do the call_rcu() once, for
> the case where both 2K pagetable fragments become unused, similar to
> how we decide when to actually call __free_page().
> 
> However, it might be much worse, and page->rcu_head from a pagetable
> page cannot be used at all for s390, because we also use page->lru
> to keep our list of free 2K pagetable fragments. I always get confused
> by struct page unions, so not completely sure, but it seems to me that
> page->rcu_head would overlay with page->lru, right?

Sigh, yes, page->rcu_head overlays page->lru.  But (please correct me if
I'm wrong) I think that s390 could use exactly the same technique for
its list of free 2K pagetable fragments as it uses for its list of THP
"deposited" pagetable fragments, over in arch/s390/mm/pgtable.c: use
the first two longs of the page table itself for threading the list.

And while it could use third and fourth longs instead, I don't see any
need for that: a deposited pagetable has been allocated, so would not
be on the list of free fragments.

Below is one of the grossest patches I've ever posted: gross because
it's a rushed attempt to see whether that is viable, while it would take
me longer to understand all the s390 cleverness there (even though the
PP AA commentary above page_table_alloc() is excellent).

I'm hoping the use of page->lru in arch/s390/mm/gmap.c is disjoint.
And cmma_init_nodat()? Ah, that's __init so I guess disjoint.

Gerald, s390 folk: would it be possible for you to give this
a try, suggest corrections and improvements, and then I can make it
a separate patch of the series; and work on avoiding concurrent use
of the rcu_head by pagetable fragment buddies (ideally fit in with
the scheme already there, maybe DD bits to go along with the PP AA).

Why am I even asking you to move away from page->lru: why don't I
thread s390's pte_free_defer() pagetables like THP's deposit does?
I cannot, because the deferred pagetables have to remain accessible
as valid pagetables, until the RCU grace period has elapsed - unless
all the list pointers would appear as pte_none(), which I doubt.

(That may limit our possibilities with the deposited pagetables in
future: I can imagine them too wanting to remain accessible as valid
pagetables.  But that's not needed by this series, and s390 only uses
deposit/withdraw for anon THP; and some are hoping that we might be
able to move away from deposit/withdraw altogther - though powerpc's
special use will make that more difficult.)

Thanks!
Hugh

--- 6.4-rc5/arch/s390/mm/pgalloc.c
+++ linux/arch/s390/mm/pgalloc.c
@@ -232,6 +232,7 @@ void page_table_free_pgste(struct page *
  */
 unsigned long *page_table_alloc(struct mm_struct *mm)
 {
+   struct list_head *listed;
unsigned long *table;
struct page *page;
unsigned int mask, bit;
@@ -241,8 +242,8 @@ unsigned long *page_table_alloc(struct m
table = NULL;
spin_lock_bh(&mm->context.lock);
if (!list_empty(&mm->context.pgtable_list)) {
-   page = list_first_entry(&mm->context.pgtable_list,
-   struct page, lru);
+   listed = mm->context.pgtable_list.next;
+   page = virt_to_page(listed);
mask = atomic_read(&page->_refcount) >> 24;
/*
 * The pending removal bits

Re: [PATCH 09/12] mm/khugepaged: retract_page_tables() without mmap or vma lock

2023-06-05 Thread Hugh Dickins

On Wed, 31 May 2023, Jann Horn wrote:
> On Mon, May 29, 2023 at 8:25 AM Hugh Dickins  wrote:
> > +static void retract_page_tables(struct address_space *mapping, pgoff_t 
> > pgoff)
...
> > +* Note that vma->anon_vma check is racy: it can be set 
> > after
> > +* the check, but page locks (with XA_RETRY_ENTRYs in holes)
> > +* prevented establishing new ptes of the page. So we are 
> > safe
> > +* to remove page table below, without even checking it's 
> > empty.
> 
> This "we are safe to remove page table below, without even checking
> it's empty" assumes that the only way to create new anonymous PTEs is
> to use existing file PTEs, right? What about private shmem VMAs that
> are registered with userfaultfd as VM_UFFD_MISSING? I think for those,
> the UFFDIO_COPY ioctl lets you directly insert anonymous PTEs without
> looking at the mapping and its pages (except for checking that the
> insertion point is before end-of-file), protected only by mmap_lock
> (shared) and pte_offset_map_lock().

Right, from your comments and Peter's, thank you both, I can see that
userfaultfd breaks the usual assumptions here: so I'm putting an
if (unlikely(vma->anon_vma || userfaultfd_wp(vma)))
check in once we've got the ptlock; with a comment above it to point
the blame at uffd, though I gave up on describing all the detail.
And deleted this earlier "we are safe" paragraph.

You did suggest, in another mail, that perhaps there should be a scan
checking all pte_none() when we get the ptlock.  I wasn't keen on yet
another debug scan for bugs and didn't add that, thinking I was going
to add a patch on the end to do so in page_table_check_pte_clear_range().

But when I came to write that patch, found that I'd been misled by its
name: it's about checking or adjusting some accounting, not really a
suitable place to check for pte_none() at all; so just scrapped it.

...
> > -   collapse_and_free_pmd(mm, vma, addr, pmd);
> 
> The old code called collapse_and_free_pmd(), which involves MMU
> notifier invocation...

...
> > +   pml = pmd_lock(mm, pmd);
> > +   ptl = pte_lockptr(mm, pmd);
> > +   if (ptl != pml)
> > +   spin_lock_nested(ptl, SINGLE_DEPTH_NESTING);
> > +   pgt_pmd = pmdp_collapse_flush(vma, addr, pmd);
> 
> ... while the new code only does pmdp_collapse_flush(), which clears
> the pmd entry and does a TLB flush, but AFAICS doesn't use MMU
> notifiers. My understanding is that that's problematic - maybe (?) it
> is sort of okay with regards to classic MMU notifier users like KVM,
> but it's probably wrong for IOMMUv2 users, where an IOMMU directly
> consumes the normal page tables?

Right, I intentionally left out the MMU notifier invocation, knowing
that we have already done an MMU notifier invocation when unmapping
any PTEs which were mapped: it was necessary for collapse_and_free_pmd()
in the collapse_pte_mapped_thp() case, but there was no notifier in this
case for many years, and I was glad to be rid of it.

However, I now see that you were adding it intentionally even for this
case in your f268f6cf875f; and from later comments in this thread, it
looks like there is still uncertainty about whether it is needed here,
but safer to assume that it is needed: I'll add it back.

> 
> (FWIW, last I looked, there also seemed to be some other issues with
> MMU notifier usage wrt IOMMUv2, see the thread
> .)

Re: [PATCH 00/12] mm: free retracted page table by RCU

2023-06-05 Thread Hugh Dickins

On Fri, 2 Jun 2023, Jann Horn wrote:
> On Fri, Jun 2, 2023 at 6:37 AM Hugh Dickins  wrote:
> 
> > The most obvious vital thing (in the split ptlock case) is that it
> > remains a struct page with a usable ptl spinlock embedded in it.
> >
> > The question becomes more urgent when/if extending to replacing the
> > pagetable pmd by huge pmd in one go, without any mmap_lock: powerpc
> > wants to deposit the page table for later use even in the shmem/file
> > case (and all arches in the anon case): I did work out the details once
> > before, but I'm not sure whether I would still agree with myself; and was
> > glad to leave replacement out of this series, to revisit some time later.
> >
> > >
> > > So in particular, in handle_pte_fault() we can reach the "if
> > > (unlikely(!pte_same(*vmf->pte, entry)))" with vmf->pte pointing to a
> > > detached zeroed page table, but we're okay with that because in that
> > > case we know that !pte_none(vmf->orig_pte)&&pte_none(*vmf->pte) ,
> > > which implies !pte_same(*vmf->pte, entry) , which means we'll bail
> > > out?
> >
> > There is no current (even at end of series) circumstance in which we
> > could be pointing to a detached page table there; but yes, I want to
> > allow for that, and yes I agree with your analysis.
> 
> Hmm, what am I missing here?

I spent quite a while trying to reconstruct what I had been thinking,
what meaning of "detached" or "there" I had in mind when I asserted so
confidently "There is no current (even at end of series) circumstance
in which we could be pointing to a detached page table there".

But had to give up and get on with more useful work.
Of course you are right, and that is what this series is about.

Hugh

> 
> static vm_fault_t handle_pte_fault(struct vm_fault *vmf)
> {
>   pte_t entry;
> 
>   if (unlikely(pmd_none(*vmf->pmd))) {
> [not executed]
>   } else {
> /*
>  * A regular pmd is established and it can't morph into a huge
>  * pmd by anon khugepaged, since that takes mmap_lock in write
>  * mode; but shmem or file collapse to THP could still morph
>  * it into a huge pmd: just retry later if so.
>  */
> vmf->pte = pte_offset_map_nolock(vmf->vma->vm_mm, vmf->pmd,
>  vmf->address, &vmf->ptl);
> if (unlikely(!vmf->pte))
>   [not executed]
> // this reads a present readonly PTE
> vmf->orig_pte = ptep_get_lockless(vmf->pte);
> vmf->flags |= FAULT_FLAG_ORIG_PTE_VALID;
> 
> if (pte_none(vmf->orig_pte)) {
>   [not executed]
> }
>   }
> 
>   [at this point, a concurrent THP collapse operation detaches the page table]
>   // vmf->pte now points into a detached page table
> 
>   if (!vmf->pte)
> [not executed]
> 
>   if (!pte_present(vmf->orig_pte))
> [not executed]
> 
>   if (pte_protnone(vmf->orig_pte) && vma_is_accessible(vmf->vma))
> [not executed]
> 
>   spin_lock(vmf->ptl);
>   entry = vmf->orig_pte;
>   // vmf->pte still points into a detached page table
>   if (unlikely(!pte_same(*vmf->pte, entry))) {
> update_mmu_tlb(vmf->vma, vmf->address, vmf->pte);
> goto unlock;
>   }
>   [...]
> }

[PATCH v2 0/4] powerpc/boot: build flags refactoring

2023-06-05 Thread Nicholas Piggin

This is a rebase of the series here. It's no longer a fix because
the clang build issue got a minimal fix. A couple were merged, and
I pulled the rest of the boot specific ones into this series.

https://lore.kernel.org/linuxppc-dev/20230426055848.402993-1-npig...@gmail.com/

Thanks,
Nick

Nicholas Piggin (4):
  powerpc/boot: Separate target flags from BOOTCFLAGS
  powerpc/boot: Separate CPP flags from BOOTCFLAGS
  powerpc/boot: Separate BOOTCFLAGS from BOOTASFLAGS
  powerpc/boot: Clean up Makefile after cflags and asflags separation

 arch/powerpc/boot/Makefile | 66 --
 1 file changed, 34 insertions(+), 32 deletions(-)

-- 
2.40.1

[PATCH v2 1/4] powerpc/boot: Separate target flags from BOOTCFLAGS

2023-06-05 Thread Nicholas Piggin

Add BOOTTARGETFLAGS variable with target / ABI options common to
CFLAGS and AFLAGS.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/boot/Makefile | 41 --
 1 file changed, 22 insertions(+), 19 deletions(-)

diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index 771b79423bbc..ae80f7f1774e 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -39,33 +39,44 @@ BOOTCFLAGS:= -Wall -Wundef -Wstrict-prototypes 
-Wno-trigraphs \
 $(LINUXINCLUDE)
 
 ifdef CONFIG_PPC64_BOOT_WRAPPER
-BOOTCFLAGS += -m64
+BOOTTARGETFLAGS+= -m64
+BOOTTARGETFLAGS+= -mabi=elfv2
 ifdef CONFIG_PPC64_ELF_ABI_V2
-BOOTCFLAGS += $(call cc-option,-mabi=elfv2)
+BOOTTARGETFLAGS+= $(call cc-option,-mabi=elfv2)
 endif
 else
-BOOTCFLAGS += -m32
+BOOTTARGETFLAGS:= -m32
 endif
 
 ifdef CONFIG_TARGET_CPU_BOOL
-BOOTCFLAGS += -mcpu=$(CONFIG_TARGET_CPU)
+BOOTTARGETFLAGS+= -mcpu=$(CONFIG_TARGET_CPU)
 else ifdef CONFIG_PPC64_BOOT_WRAPPER
 ifdef CONFIG_CPU_LITTLE_ENDIAN
-BOOTCFLAGS += -mcpu=powerpc64le
+BOOTTARGETFLAGS+= -mcpu=powerpc64le
 else
-BOOTCFLAGS += -mcpu=powerpc64
+BOOTTARGETFLAGS+= -mcpu=powerpc64
 endif
 endif
 
+$(obj)/4xx.o: BOOTTARGETFLAGS += -mcpu=405
+$(obj)/ebony.o: BOOTTARGETFLAGS += -mcpu=440
+$(obj)/cuboot-hotfoot.o: BOOTTARGETFLAGS += -mcpu=405
+$(obj)/cuboot-taishan.o: BOOTTARGETFLAGS += -mcpu=440
+$(obj)/cuboot-katmai.o: BOOTTARGETFLAGS += -mcpu=440
+$(obj)/cuboot-acadia.o: BOOTTARGETFLAGS += -mcpu=405
+$(obj)/treeboot-iss4xx.o: BOOTTARGETFLAGS += -mcpu=405
+$(obj)/treeboot-currituck.o: BOOTTARGETFLAGS += -mcpu=405
+$(obj)/treeboot-akebono.o: BOOTTARGETFLAGS += -mcpu=405
+
 BOOTCFLAGS += -isystem $(shell $(BOOTCC) -print-file-name=include)
 
 ifdef CONFIG_CPU_BIG_ENDIAN
-BOOTCFLAGS += -mbig-endian
+BOOTTARGETFLAGS+= -mbig-endian
 else
-BOOTCFLAGS += -mlittle-endian
+BOOTTARGETFLAGS+= -mlittle-endian
 endif
 
-BOOTAFLAGS := -D__ASSEMBLY__ $(BOOTCFLAGS) -nostdinc
+BOOTAFLAGS := -D__ASSEMBLY__ $(BOOTTARGETFLAGS) $(BOOTCFLAGS) -nostdinc
 
 BOOTARFLAGS:= -crD
 
@@ -73,6 +84,8 @@ BOOTCFLAGS+= $(call cc-option,-mno-prefixed) \
   $(call cc-option,-mno-pcrel) \
   $(call cc-option,-mno-mma)
 
+BOOTCFLAGS += $(BOOTTARGETFLAGS)
+
 ifdef CONFIG_CC_IS_CLANG
 BOOTCFLAGS += $(CLANG_FLAGS)
 BOOTAFLAGS += $(CLANG_FLAGS)
@@ -91,16 +104,6 @@ BOOTCFLAGS  += -I$(objtree)/$(obj) -I$(srctree)/$(obj)
 
 DTC_FLAGS  ?= -p 1024
 
-$(obj)/4xx.o: BOOTCFLAGS += -mcpu=405
-$(obj)/ebony.o: BOOTCFLAGS += -mcpu=440
-$(obj)/cuboot-hotfoot.o: BOOTCFLAGS += -mcpu=405
-$(obj)/cuboot-taishan.o: BOOTCFLAGS += -mcpu=440
-$(obj)/cuboot-katmai.o: BOOTCFLAGS += -mcpu=440
-$(obj)/cuboot-acadia.o: BOOTCFLAGS += -mcpu=405
-$(obj)/treeboot-iss4xx.o: BOOTCFLAGS += -mcpu=405
-$(obj)/treeboot-currituck.o: BOOTCFLAGS += -mcpu=405
-$(obj)/treeboot-akebono.o: BOOTCFLAGS += -mcpu=405
-
 # The pre-boot decompressors pull in a lot of kernel headers and other source
 # files. This creates a bit of a dependency headache since we need to copy
 # these files into the build dir, fix up any includes and ensure that dependent
-- 
2.40.1

[PATCH v2 2/4] powerpc/boot: Separate CPP flags from BOOTCFLAGS

2023-06-05 Thread Nicholas Piggin

Add BOOTCPPFLAGS variable for the CPP options required by C and AS.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/boot/Makefile | 14 +++---
 1 file changed, 7 insertions(+), 7 deletions(-)

diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index ae80f7f1774e..9445ec442512 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -32,11 +32,13 @@ else
 BOOTAR := $(AR)
 endif
 
+BOOTCPPFLAGS   := -nostdinc $(LINUXINCLUDE)
+BOOTCPPFLAGS   += -isystem $(shell $(BOOTCC) -print-file-name=include)
+
 BOOTCFLAGS:= -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
 -fno-strict-aliasing -O2 -msoft-float -mno-altivec -mno-vsx \
 $(call cc-option,-mno-spe) $(call cc-option,-mspe=no) \
--pipe -fomit-frame-pointer -fno-builtin -fPIC -nostdinc \
-$(LINUXINCLUDE)
+-pipe -fomit-frame-pointer -fno-builtin -fPIC
 
 ifdef CONFIG_PPC64_BOOT_WRAPPER
 BOOTTARGETFLAGS+= -m64
@@ -68,15 +70,13 @@ $(obj)/treeboot-iss4xx.o: BOOTTARGETFLAGS += -mcpu=405
 $(obj)/treeboot-currituck.o: BOOTTARGETFLAGS += -mcpu=405
 $(obj)/treeboot-akebono.o: BOOTTARGETFLAGS += -mcpu=405
 
-BOOTCFLAGS += -isystem $(shell $(BOOTCC) -print-file-name=include)
-
 ifdef CONFIG_CPU_BIG_ENDIAN
 BOOTTARGETFLAGS+= -mbig-endian
 else
 BOOTTARGETFLAGS+= -mlittle-endian
 endif
 
-BOOTAFLAGS := -D__ASSEMBLY__ $(BOOTTARGETFLAGS) $(BOOTCFLAGS) -nostdinc
+BOOTAFLAGS := -D__ASSEMBLY__ $(BOOTTARGETFLAGS) $(BOOTCFLAGS)
 
 BOOTARFLAGS:= -crD
 
@@ -227,10 +227,10 @@ clean-files := $(zlib-) $(zlibheader-) 
$(zliblinuxheader-) \
empty.c zImage.coff.lds zImage.ps3.lds zImage.lds
 
 quiet_cmd_bootcc = BOOTCC  $@
-  cmd_bootcc = $(BOOTCC) -Wp,-MD,$(depfile) $(BOOTCFLAGS) -c -o $@ $<
+  cmd_bootcc = $(BOOTCC) -Wp,-MD,$(depfile) $(BOOTCPPFLAGS) $(BOOTCFLAGS) 
-c -o $@ $<
 
 quiet_cmd_bootas = BOOTAS  $@
-  cmd_bootas = $(BOOTCC) -Wp,-MD,$(depfile) $(BOOTAFLAGS) -c -o $@ $<
+  cmd_bootas = $(BOOTCC) -Wp,-MD,$(depfile) $(BOOTCPPFLAGS) $(BOOTAFLAGS) 
-c -o $@ $<
 
 quiet_cmd_bootar = BOOTAR  $@
   cmd_bootar = $(BOOTAR) $(BOOTARFLAGS) $@. $(real-prereqs); mv 
$@. $@
-- 
2.40.1

[PATCH v2 3/4] powerpc/boot: Separate BOOTCFLAGS from BOOTASFLAGS

2023-06-05 Thread Nicholas Piggin

BOOTCFLAGS no longer contains anything that BOOTASFLAGS needs (except
-pipe). Separate them to avoid fragility with cross-contamination of
flags which has caused several build problems.

Suggested-by: Linus Torvalds 
Link: 
https://lore.kernel.org/lkml/CAHk-=whywudjdeobn1hrwyskqkvzyiq5rbsw5rjjexgnbsn...@mail.gmail.com/
Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/boot/Makefile | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index 9445ec442512..9cdc0858b256 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -76,7 +76,7 @@ else
 BOOTTARGETFLAGS+= -mlittle-endian
 endif
 
-BOOTAFLAGS := -D__ASSEMBLY__ $(BOOTTARGETFLAGS) $(BOOTCFLAGS)
+BOOTAFLAGS := -D__ASSEMBLY__ $(BOOTTARGETFLAGS) -pipe
 
 BOOTARFLAGS:= -crD
 
-- 
2.40.1

[PATCH v2 4/4] powerpc/boot: Clean up Makefile after cflags and asflags separation

2023-06-05 Thread Nicholas Piggin

Tidy pass over boot Makefile. Move variables together where possible.

Signed-off-by: Nicholas Piggin 
---
 arch/powerpc/boot/Makefile | 27 +--
 1 file changed, 13 insertions(+), 14 deletions(-)

diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index 9cdc0858b256..6f9ef031bfc7 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -32,14 +32,6 @@ else
 BOOTAR := $(AR)
 endif
 
-BOOTCPPFLAGS   := -nostdinc $(LINUXINCLUDE)
-BOOTCPPFLAGS   += -isystem $(shell $(BOOTCC) -print-file-name=include)
-
-BOOTCFLAGS:= -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
--fno-strict-aliasing -O2 -msoft-float -mno-altivec -mno-vsx \
-$(call cc-option,-mno-spe) $(call cc-option,-mspe=no) \
--pipe -fomit-frame-pointer -fno-builtin -fPIC
-
 ifdef CONFIG_PPC64_BOOT_WRAPPER
 BOOTTARGETFLAGS+= -m64
 BOOTTARGETFLAGS+= -mabi=elfv2
@@ -76,15 +68,22 @@ else
 BOOTTARGETFLAGS+= -mlittle-endian
 endif
 
-BOOTAFLAGS := -D__ASSEMBLY__ $(BOOTTARGETFLAGS) -pipe
-
-BOOTARFLAGS:= -crD
+BOOTCPPFLAGS   := -nostdinc $(LINUXINCLUDE)
+BOOTCPPFLAGS   += -isystem $(shell $(BOOTCC) -print-file-name=include)
 
-BOOTCFLAGS += $(call cc-option,-mno-prefixed) \
+BOOTCFLAGS := -pipe $(BOOTTARGETFLAGS) \
+  -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
+  -fno-strict-aliasing -O2 \
+  -msoft-float -mno-altivec -mno-vsx \
+  $(call cc-option,-mno-prefixed) \
   $(call cc-option,-mno-pcrel) \
-  $(call cc-option,-mno-mma)
+  $(call cc-option,-mno-mma) \
+  $(call cc-option,-mno-spe) $(call cc-option,-mspe=no) \
+  -fomit-frame-pointer -fno-builtin -fPIC
 
-BOOTCFLAGS += $(BOOTTARGETFLAGS)
+BOOTAFLAGS := -pipe $(BOOTTARGETFLAGS) -D__ASSEMBLY__
+
+BOOTARFLAGS:= -crD
 
 ifdef CONFIG_CC_IS_CLANG
 BOOTCFLAGS += $(CLANG_FLAGS)
-- 
2.40.1

[PATCH v2] powerpc/build: Remove -pipe from compilation flags

2023-06-05 Thread Nicholas Piggin

x86 removed -pipe in commit 437e88ab8f9e2 ("x86/build: Remove -pipe from
KBUILD_CFLAGS") and the newer arm64 and riscv seem to have never used it,
so that seems to be the way the world's going.

Compile performance building defconfig on a POWER10 PowerNV system
was in the noise after 10 builds each. No point in adding options unless
they help something, so remove it.

Signed-off-by: Nicholas Piggin 
---
This applies after the boot flags series, but easy enough to rebase if
necessary.

Thanks,
Nick

 arch/powerpc/Makefile  | 2 +-
 arch/powerpc/boot/Makefile | 4 ++--
 2 files changed, 3 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/Makefile b/arch/powerpc/Makefile
index dca73f673d70..76fc7cc26780 100644
--- a/arch/powerpc/Makefile
+++ b/arch/powerpc/Makefile
@@ -166,7 +166,7 @@ asinstr := $(call as-instr,lis 
9$(comma)foo@high,-DHAVE_AS_ATHIGH=1)
 KBUILD_CPPFLAGS+= -I $(srctree)/arch/$(ARCH) $(asinstr)
 KBUILD_AFLAGS  += $(AFLAGS-y)
 KBUILD_CFLAGS  += $(call cc-option,-msoft-float)
-KBUILD_CFLAGS  += -pipe $(CFLAGS-y)
+KBUILD_CFLAGS  += $(CFLAGS-y)
 CPP= $(CC) -E $(KBUILD_CFLAGS)
 
 CHECKFLAGS += -m$(BITS) -D__powerpc__ -D__powerpc$(BITS)__
diff --git a/arch/powerpc/boot/Makefile b/arch/powerpc/boot/Makefile
index 6f9ef031bfc7..bf8976563e02 100644
--- a/arch/powerpc/boot/Makefile
+++ b/arch/powerpc/boot/Makefile
@@ -71,7 +71,7 @@ endif
 BOOTCPPFLAGS   := -nostdinc $(LINUXINCLUDE)
 BOOTCPPFLAGS   += -isystem $(shell $(BOOTCC) -print-file-name=include)
 
-BOOTCFLAGS := -pipe $(BOOTTARGETFLAGS) \
+BOOTCFLAGS := $(BOOTTARGETFLAGS) \
   -Wall -Wundef -Wstrict-prototypes -Wno-trigraphs \
   -fno-strict-aliasing -O2 \
   -msoft-float -mno-altivec -mno-vsx \
@@ -81,7 +81,7 @@ BOOTCFLAGS:= -pipe $(BOOTTARGETFLAGS) \
   $(call cc-option,-mno-spe) $(call cc-option,-mspe=no) \
   -fomit-frame-pointer -fno-builtin -fPIC
 
-BOOTAFLAGS := -pipe $(BOOTTARGETFLAGS) -D__ASSEMBLY__
+BOOTAFLAGS := $(BOOTTARGETFLAGS) -D__ASSEMBLY__
 
 BOOTARFLAGS:= -crD
 
-- 
2.40.1

63 matches

Mail list logo