Re: [PATCH v5 1/7] trace: Add trace_ipi_send_cpumask()
On Tue, Mar 07, 2023 at 02:35:52PM +, Valentin Schneider wrote:
> trace_ipi_raise() is unsuitable for generically tracing IPI sources due to
> its "reason" argument being an uninformative string (on arm64 all you get
> is "Function call interrupts" for SMP calls).
>
> Add a variant of it that exports a target cpumask, a callsite and a callback.
>
> Signed-off-by: Valentin Schneider
> Reviewed-by: Steven Rostedt (Google)
> ---
>  include/trace/events/ipi.h | 22 ++
>  1 file changed, 22 insertions(+)
>
> diff --git a/include/trace/events/ipi.h b/include/trace/events/ipi.h
> index 0be71dad6ec03..b1125dc27682c 100644
> --- a/include/trace/events/ipi.h
> +++ b/include/trace/events/ipi.h
> @@ -35,6 +35,28 @@ TRACE_EVENT(ipi_raise,
> 	TP_printk("target_mask=%s (%s)", __get_bitmask(target_cpus),
> 		  __entry->reason)
> );
>
> +TRACE_EVENT(ipi_send_cpumask,
> +
> +	TP_PROTO(const struct cpumask *cpumask, unsigned long callsite, void *callback),
> +
> +	TP_ARGS(cpumask, callsite, callback),
> +
> +	TP_STRUCT__entry(
> +		__cpumask(cpumask)
> +		__field(void *, callsite)
> +		__field(void *, callback)
> +	),
> +
> +	TP_fast_assign(
> +		__assign_cpumask(cpumask, cpumask_bits(cpumask));
> +		__entry->callsite = (void *)callsite;
> +		__entry->callback = callback;
> +	),
> +
> +	TP_printk("cpumask=%s callsite=%pS callback=%pS",
> +		  __get_cpumask(cpumask), __entry->callsite, __entry->callback)
> +);

Would it make sense to add a variant like: ipi_send_cpu() that records a
single cpu instead of a cpumask. A lot of sites seems to do:
cpumask_of(cpu) for that first argument, and it seems to me it is quite
daft to have to memcpy a full multi-word cpumask in those cases.

Remember, nr_possible_cpus > 64 is quite common these days.
Re: [PATCH v5 7/7] sched, smp: Trace smp callback causing an IPI
On Tue, Mar 07, 2023 at 02:35:58PM +, Valentin Schneider wrote:
> @@ -477,6 +490,25 @@ static __always_inline void csd_unlock(struct __call_single_data *csd)
> 	smp_store_release(&csd->node.u_flags, 0);
> }
>
> +static __always_inline void
> +raw_smp_call_single_queue(int cpu, struct llist_node *node, smp_call_func_t func)
> +{
> +	/*
> +	 * The list addition should be visible to the target CPU when it pops
> +	 * the head of the list to pull the entry off it in the IPI handler
> +	 * because of normal cache coherency rules implied by the underlying
> +	 * llist ops.
> +	 *
> +	 * If IPIs can go out of order to the cache coherency protocol
> +	 * in an architecture, sufficient synchronisation should be added
> +	 * to arch code to make it appear to obey cache coherency WRT
> +	 * locking and barrier primitives. Generic code isn't really
> +	 * equipped to do the right thing...
> +	 */
> +	if (llist_add(node, &per_cpu(call_single_queue, cpu)))
> +		send_call_function_single_ipi(cpu, func);
> +}
> +
> static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);
>
> void __smp_call_single_queue(int cpu, struct llist_node *node)
> @@ -493,21 +525,25 @@ void __smp_call_single_queue(int cpu, struct llist_node *node)
> 	}
> }
> #endif
> 	/*
> +	 * We have to check the type of the CSD before queueing it, because
> +	 * once queued it can have its flags cleared by
> +	 *   flush_smp_call_function_queue()
> +	 * even if we haven't sent the smp_call IPI yet (e.g. the stopper
> +	 * executes migration_cpu_stop() on the remote CPU).
> 	 */
> +	if (trace_ipi_send_cpumask_enabled()) {
> +		call_single_data_t *csd;
> +		smp_call_func_t func;
> +
> +		csd = container_of(node, call_single_data_t, node.llist);
> +		func = CSD_TYPE(csd) == CSD_TYPE_TTWU ?
> +			sched_ttwu_pending : csd->func;
> +
> +		raw_smp_call_single_queue(cpu, node, func);
> +	} else {
> +		raw_smp_call_single_queue(cpu, node, NULL);
> +	}
> }

Hurmph... so we only really consume @func when we IPI. Would it not be
more useful to trace this thing for *every* csd enqueued?
Re: [PATCH v5 1/7] trace: Add trace_ipi_send_cpumask()
On Wed, Mar 22, 2023 at 10:39:55AM +0100, Peter Zijlstra wrote:
> On Tue, Mar 07, 2023 at 02:35:52PM +, Valentin Schneider wrote:
> > trace_ipi_raise() is unsuitable for generically tracing IPI sources due to
> > its "reason" argument being an uninformative string (on arm64 all you get
> > is "Function call interrupts" for SMP calls).
> >
> > Add a variant of it that exports a target cpumask, a callsite and a callback.
> >
> > Signed-off-by: Valentin Schneider
> > Reviewed-by: Steven Rostedt (Google)
> > ---
> >  include/trace/events/ipi.h | 22 ++
> >  1 file changed, 22 insertions(+)
> >
> > diff --git a/include/trace/events/ipi.h b/include/trace/events/ipi.h
> > index 0be71dad6ec03..b1125dc27682c 100644
> > --- a/include/trace/events/ipi.h
> > +++ b/include/trace/events/ipi.h
> > @@ -35,6 +35,28 @@ TRACE_EVENT(ipi_raise,
> > 	TP_printk("target_mask=%s (%s)", __get_bitmask(target_cpus),
> > 		  __entry->reason)
> > );
> >
> > +TRACE_EVENT(ipi_send_cpumask,
> > +
> > +	TP_PROTO(const struct cpumask *cpumask, unsigned long callsite, void *callback),
> > +
> > +	TP_ARGS(cpumask, callsite, callback),
> > +
> > +	TP_STRUCT__entry(
> > +		__cpumask(cpumask)
> > +		__field(void *, callsite)
> > +		__field(void *, callback)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__assign_cpumask(cpumask, cpumask_bits(cpumask));
> > +		__entry->callsite = (void *)callsite;
> > +		__entry->callback = callback;
> > +	),
> > +
> > +	TP_printk("cpumask=%s callsite=%pS callback=%pS",
> > +		  __get_cpumask(cpumask), __entry->callsite, __entry->callback)
> > +);
>
> Would it make sense to add a variant like: ipi_send_cpu() that records a
> single cpu instead of a cpumask. A lot of sites seems to do:
> cpumask_of(cpu) for that first argument, and it seems to me it is quite
> daft to have to memcpy a full multi-word cpumask in those cases.
>
> Remember, nr_possible_cpus > 64 is quite common these days.

Something a little bit like so...
---
Subject: trace: Add trace_ipi_send_cpu()
From: Peter Zijlstra
Date: Wed Mar 22 11:28:36 CET 2023

Because copying cpumasks around when targeting a single CPU is a bit
daft...

Signed-off-by: Peter Zijlstra (Intel)
---
 include/linux/smp.h        |  6 +++---
 include/trace/events/ipi.h | 22 ++
 kernel/irq_work.c          |  6 ++
 kernel/smp.c               |  4 ++--
 4 files changed, 29 insertions(+), 9 deletions(-)

--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -130,9 +130,9 @@ extern void arch_smp_send_reschedule(int
  * scheduler_ipi() is inline so can't be passed as callback reason, but the
  * callsite IP should be sufficient for root-causing IPIs sent from here.
  */
-#define smp_send_reschedule(cpu) ({	\
-	trace_ipi_send_cpumask(cpumask_of(cpu), _RET_IP_, NULL);	\
-	arch_smp_send_reschedule(cpu);	\
+#define smp_send_reschedule(cpu) ({	\
+	trace_ipi_send_cpu(cpu, _RET_IP_, NULL);	\
+	arch_smp_send_reschedule(cpu);	\
 })

 /*
--- a/include/trace/events/ipi.h
+++ b/include/trace/events/ipi.h
@@ -35,6 +35,28 @@ TRACE_EVENT(ipi_raise,
 	TP_printk("target_mask=%s (%s)", __get_bitmask(target_cpus),
 		  __entry->reason)
 );

+TRACE_EVENT(ipi_send_cpu,
+
+	TP_PROTO(const unsigned int cpu, unsigned long callsite, void *callback),
+
+	TP_ARGS(cpu, callsite, callback),
+
+	TP_STRUCT__entry(
+		__field(unsigned int, cpu)
+		__field(void *, callsite)
+		__field(void *, callback)
+	),
+
+	TP_fast_assign(
+		__entry->cpu = cpu;
+		__entry->callsite = (void *)callsite;
+		__entry->callback = callback;
+	),
+
+	TP_printk("cpu=%s callsite=%pS callback=%pS",
+		  __entry->cpu, __entry->callsite, __entry->callback)
+);
+
 TRACE_EVENT(ipi_send_cpumask,

 	TP_PROTO(const struct cpumask *cpumask, unsigned long callsite, void *callback),
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -78,10 +78,8 @@ void __weak arch_irq_work_raise(void)

 static __always_inline void irq_work_raise(struct irq_work *work)
 {
-	if (trace_ipi_send_cpumask_enabled() && arch_irq_work_has_interrupt())
-		trace_ipi_send_cpumask(cpumask_of(smp_processor_id()),
-				       _RET_IP_,
-				       work->func);
+	if (trace_ipi_send_cpu_enabled() && arch_irq_work_has_interrupt())
+		trace_ipi_send_cpu(smp_processor_id(), _RET_IP_, work->func);

 	arch_irq_work_raise();
 }
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -109,7 +109,7 @@ static __always_inline void
 send_call_function_single_ipi(int cpu, smp_call_func_t func)
 {
 	if (call_function_single_prep_ipi(cpu)) {
-		trace_ipi_send_cpumask(cpumask_of(cpu), _RET_IP_, f
Re: [PATCH v5 1/7] trace: Add trace_ipi_send_cpumask()
On 22/03/23 11:30, Peter Zijlstra wrote:
> On Wed, Mar 22, 2023 at 10:39:55AM +0100, Peter Zijlstra wrote:
>> On Tue, Mar 07, 2023 at 02:35:52PM +, Valentin Schneider wrote:
>> > +TRACE_EVENT(ipi_send_cpumask,
>> > +
>> > +	TP_PROTO(const struct cpumask *cpumask, unsigned long callsite, void *callback),
>> > +
>> > +	TP_ARGS(cpumask, callsite, callback),
>> > +
>> > +	TP_STRUCT__entry(
>> > +		__cpumask(cpumask)
>> > +		__field(void *, callsite)
>> > +		__field(void *, callback)
>> > +	),
>> > +
>> > +	TP_fast_assign(
>> > +		__assign_cpumask(cpumask, cpumask_bits(cpumask));
>> > +		__entry->callsite = (void *)callsite;
>> > +		__entry->callback = callback;
>> > +	),
>> > +
>> > +	TP_printk("cpumask=%s callsite=%pS callback=%pS",
>> > +		  __get_cpumask(cpumask), __entry->callsite, __entry->callback)
>> > +);
>>
>> Would it make sense to add a variant like: ipi_send_cpu() that records a
>> single cpu instead of a cpumask. A lot of sites seems to do:
>> cpumask_of(cpu) for that first argument, and it seems to me it is quite
>> daft to have to memcpy a full multi-word cpumask in those cases.
>>
>> Remember, nr_possible_cpus > 64 is quite common these days.
>
> Something a little bit like so...
>

I was wondering whether we could stick with a single trace event, but let
ftrace be aware of weight=1 vs weight>1 cpumasks. For weight>1, it would
memcpy() as usual; for weight=1, it could write a pointer to a
cpu_bit_bitmap[] equivalent embedded in the trace itself.

Unfortunately, ftrace bitmasks are represented as a u32 made of two 16 bit
values: [offset in event record, size], so there isn't a straightforward
way to point to a "reusable" cpumask. AFAICT the only alternative would be
to do that via a different trace event, but then we should just go with a
plain old uint - i.e. do what you're doing here, so:

Tested-and-reviewed-by: Valentin Schneider

(with the tiny typo fix below)

> @@ -35,6 +35,28 @@ TRACE_EVENT(ipi_raise,
> 	TP_printk("target_mask=%s (%s)", __get_bitmask(target_cpus),
> 		  __entry->reason)
> );
>
> +TRACE_EVENT(ipi_send_cpu,
> +
> +	TP_PROTO(const unsigned int cpu, unsigned long callsite, void *callback),
> +
> +	TP_ARGS(cpu, callsite, callback),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned int, cpu)
> +		__field(void *, callsite)
> +		__field(void *, callback)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->cpu = cpu;
> +		__entry->callsite = (void *)callsite;
> +		__entry->callback = callback;
> +	),
> +
> +	TP_printk("cpu=%s callsite=%pS callback=%pS",
               ^
               s/s/u/
> +		  __entry->cpu, __entry->callsite, __entry->callback)
> +);
> +
Re: [PATCH v3 2/2] arch/powerpc/kvm: kvmppc_hv_entry: remove r4 argument
Kautuk Consul writes:
> kvmppc_hv_entry is called from only 2 locations within
> book3s_hv_rmhandlers.S. Both of those locations set r4
> as HSTATE_KVM_VCPU(r13) before calling kvmppc_hv_entry.
> So, shift the r4 load instruction to kvmppc_hv_entry and
> thus modify the calling convention of this function.
>
> Signed-off-by: Kautuk Consul
> ---
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b81ba4ee0521..b61f0b2c677b 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -85,7 +85,7 @@ _GLOBAL_TOC(kvmppc_hv_entry_trampoline)
> 	RFI_TO_KERNEL
>
> kvmppc_call_hv_entry:
> -	ld	r4, HSTATE_KVM_VCPU(r13)
> +	/* Enter guest. */
> 	bl	kvmppc_hv_entry
>
> 	/* Back from guest - restore host state and return to caller */
>
> @@ -352,9 +352,7 @@ kvm_secondary_got_guest:
> 	mtspr	SPRN_LDBAR, r0
> 	isync
> 63:
> -	/* Order load of vcpu after load of vcore */
> -	lwsync
> -	ld	r4, HSTATE_KVM_VCPU(r13)
> +	/* Enter guest. */
> 	bl	kvmppc_hv_entry
>
> 	/* Back from the guest, go back to nap */
>
> @@ -506,7 +504,6 @@ SYM_INNER_LABEL(kvmppc_hv_entry, SYM_L_LOCAL)
>
> 	/* Required state:
> 	 *
> -	 * R4 = vcpu pointer (or NULL)
> 	 * MSR = ~IR|DR
> 	 * R13 = PACA
> 	 * R1 = host R1
>
> @@ -524,6 +521,10 @@ SYM_INNER_LABEL(kvmppc_hv_entry, SYM_L_LOCAL)
> 	li	r6, KVM_GUEST_MODE_HOST_HV
> 	stb	r6, HSTATE_IN_GUEST(r13)
>
> +	/* Order load of vcpu after load of vcore */

Just copying the comment here doesn't work. It doesn't make sense on its
own here, because the VCORE is loaded (again) a few lines below (536). So
as written this comment seems backward vs the code.

The comment would need to expand to explain that the barrier is for the
case where we came from kvm_secondary_got_guest.

> +	lwsync
> +	ld	r4, HSTATE_KVM_VCPU(r13)
> +
> #ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING
> 	/* Store initial timestamp */
> 	cmpdi	r4, 0

But as Nick says I don't think it's worth investing effort in small
tweaks to this code. The risk of introducing bugs is too high for such a
small improvement to the code.

Thanks for trying, but I think this asm code is best left more or less
alone unless we find actual bugs in it - or unless we can make
substantial improvements to it, which would be rewriting in C, or at
least converting to a fully call/return style rather than the current
forest of labels.

I will take patch 1 though, as that's an obvious cleanup and poses no
risk (famous last words :D).

cheers
Re: [PATCH v5 7/7] sched, smp: Trace smp callback causing an IPI
On 22/03/23 10:53, Peter Zijlstra wrote:
> On Tue, Mar 07, 2023 at 02:35:58PM +, Valentin Schneider wrote:
>
>> @@ -477,6 +490,25 @@ static __always_inline void csd_unlock(struct __call_single_data *csd)
>> 	smp_store_release(&csd->node.u_flags, 0);
>> }
>>
>> +static __always_inline void
>> +raw_smp_call_single_queue(int cpu, struct llist_node *node, smp_call_func_t func)
>> +{
>> +	/*
>> +	 * The list addition should be visible to the target CPU when it pops
>> +	 * the head of the list to pull the entry off it in the IPI handler
>> +	 * because of normal cache coherency rules implied by the underlying
>> +	 * llist ops.
>> +	 *
>> +	 * If IPIs can go out of order to the cache coherency protocol
>> +	 * in an architecture, sufficient synchronisation should be added
>> +	 * to arch code to make it appear to obey cache coherency WRT
>> +	 * locking and barrier primitives. Generic code isn't really
>> +	 * equipped to do the right thing...
>> +	 */
>> +	if (llist_add(node, &per_cpu(call_single_queue, cpu)))
>> +		send_call_function_single_ipi(cpu, func);
>> +}
>> +
>> static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);
>>
>> void __smp_call_single_queue(int cpu, struct llist_node *node)
>> @@ -493,21 +525,25 @@ void __smp_call_single_queue(int cpu, struct llist_node *node)
>> 	}
>> }
>> #endif
>> 	/*
>> +	 * We have to check the type of the CSD before queueing it, because
>> +	 * once queued it can have its flags cleared by
>> +	 *   flush_smp_call_function_queue()
>> +	 * even if we haven't sent the smp_call IPI yet (e.g. the stopper
>> +	 * executes migration_cpu_stop() on the remote CPU).
>> 	 */
>> +	if (trace_ipi_send_cpumask_enabled()) {
>> +		call_single_data_t *csd;
>> +		smp_call_func_t func;
>> +
>> +		csd = container_of(node, call_single_data_t, node.llist);
>> +		func = CSD_TYPE(csd) == CSD_TYPE_TTWU ?
>> +			sched_ttwu_pending : csd->func;
>> +
>> +		raw_smp_call_single_queue(cpu, node, func);
>> +	} else {
>> +		raw_smp_call_single_queue(cpu, node, NULL);
>> +	}
>> }
>
> Hurmph... so we only really consume @func when we IPI. Would it not be
> more useful to trace this thing for *every* csd enqueued?

It's true that any CSD enqueued on that CPU's call_single_queue in the
[first CSD llist_add()'ed, IPI IRQ hits] timeframe is a potential source
of interference.

However, can we be sure that first CSD isn't an indirect cause for the
following ones? Say the target CPU exits RCU EQS due to the IPI; there's
a bit of time before it gets to flush_smp_call_function_queue() where
some other CSD could be enqueued *because* of that change in state.

I couldn't find an easy example of that. I might be biased, as this is
where I'd like to go wrt IPI'ing isolated CPUs in usermode. But
regardless, when correlating an IPI IRQ with its source, we'd always
have to look at the first CSD in that CSD stack.
Re: (subset) [PATCH v2 01/10] powerpc/machdep: Make machine name const
On Sat, 18 Feb 2023 10:15:44 +0100, Christophe Leroy wrote:
> Machine name in struct machdep_calls should never be modified.
>
> Mark it const.
>

Patches 1-7 applied to powerpc/next.

[01/10] powerpc/machdep: Make machine name const
        https://git.kernel.org/powerpc/c/0aafbdf35c75cbfec82636d01e6dc7950bc1507c
[02/10] powerpc/machdep: Define 'compatible' property in ppc_md and use it
        https://git.kernel.org/powerpc/c/2fc39acfcacf3dc1392d8062f6d7b7d94eb2537c
[03/10] powerpc/platforms: Use 'compatible' property for simple cases
        https://git.kernel.org/powerpc/c/1c96fcdef8c7492ecf34ed70102a1ae5253ef9d1
[04/10] powerpc/47x: Split ppc47x machine in two
        https://git.kernel.org/powerpc/c/357f82395cd8a0279067805841e5968f4e6dc932
[05/10] powerpc/gamecube|wii : Use machine_device_initcall()
        https://git.kernel.org/powerpc/c/f47b17d51997cd47e0e6fb1b90145d516ebe6b3e
[06/10] powerpc/85xx: Fix function naming for p1023_rdb platform
        https://git.kernel.org/powerpc/c/5a81c02d0cc5067170e49452d55a4dfd21333257
[07/10] powerpc: Make generic_calibrate_decr() the default
        https://git.kernel.org/powerpc/c/0aafbdf35c75cbfec82636d01e6dc7950bc1507c

cheers
Re: [PATCH] powerpc: Fix some kernel-doc warnings
On Mon, 31 Oct 2022 21:54:52 -0400, Bo Liu wrote:
> The current code provokes some kernel-doc warnings:
> 	arch/powerpc/kernel/process.c:1606: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
>

Applied to powerpc/next.

[1/1] powerpc: Fix some kernel-doc warnings
      https://git.kernel.org/powerpc/c/be994293544f1c0b032dabfe0832d9c1dfcea14b

cheers
Re: [PATCH 0/2] ppc: simplify sysctl registration
On Fri, 10 Mar 2023 15:28:48 -0800, Luis Chamberlain wrote:
> We can simplify the way we do sysctl registration both by
> reducing the number of lines and also avoiding callers which
> could do recursion. The docs are being updated to help reflect
> this better [0].
>
> [0] https://lore.kernel.org/all/20230310223947.3917711-1-mcg...@kernel.org/T/#u
>
> [...]

Applied to powerpc/next.

[1/2] ppc: simplify one-level sysctl registration for powersave_nap_ctl_table
      https://git.kernel.org/powerpc/c/bfedee5dc406ddcd70d667be1501659f1b232b7f
[2/2] ppc: simplify one-level sysctl registration for nmi_wd_lpm_factor_ctl_table
      https://git.kernel.org/powerpc/c/3a713753d3cb52e4e3039cdb906ef00f0b574219

cheers
Re: [PATCH] selftests/powerpc: Increase timeout for vsx_signal test
On Wed, 08 Mar 2023 08:36:14 +1100, Michael Neuling wrote:
> On the max config P10 machine (1920 threads and 64TB) this test fails
> with a timeout:
>
>     Sending signals to all threads 10 times...!! killing vmx_signal
>     !! child died by signal 15
>     failure: vmx_signal
>
> [...]

Applied to powerpc/next.

[1/1] selftests/powerpc: Increase timeout for vsx_signal test
      https://git.kernel.org/powerpc/c/493648d6795f00b6dcd6295b2b4221871bc1b25b

cheers
Re: [PATCH 0/3] Allow CONFIG_PPC64_BIG_ENDIAN_ELF_ABI_V2 with ld.lld 15+
On Wed, 15 Feb 2023 11:41:14 -0700, Nathan Chancellor wrote:
> Currently, CONFIG_PPC64_BIG_ENDIAN_ELF_ABI_V2 is not selectable with
> ld.lld because of an explicit dependency on GNU ld, due to lack of
> testing with LLVM.
>
> Erhard was kind enough to test this option on his hardware with LLVM 15,
> which ran without any issues. This should not be too surprising, as
> ld.lld does not have support for the ELFv1 ABI, only ELFv2, so it should
> have decent support. With this series, big endian kernels can be built
> with LLVM=1.
>
> [...]

Applied to powerpc/next.

[1/3] powerpc/boot: Only use '-mabi=elfv2' with CONFIG_PPC64_BOOT_WRAPPER
      https://git.kernel.org/powerpc/c/d1c5accacb234c3a9f1609a73b4b2eaa4ef07d1a
[2/3] powerpc: Fix use of '-mabi=elfv2' with clang
      https://git.kernel.org/powerpc/c/7c3bd8362b06cff0a4044a4975adb7d71db2dfba
[3/3] powerpc: Allow CONFIG_PPC64_BIG_ENDIAN_ELF_ABI_V2 with ld.lld 15+
      https://git.kernel.org/powerpc/c/a11334d8327b3fd7987cbfb38e956a44c722d88f

cheers
Re: [PATCH v2 1/4] powerpc/iommu: Add "borrowing" iommu_table_group_ops
On Mon, 6 Mar 2023 11:30:20 -0600 (CST), Timothy Pearson wrote:
> PPC64 IOMMU API defines iommu_table_group_ops which handles DMA windows
> for PEs: control the ownership, create/set/unset a table the hardware
> for dynamic DMA windows (DDW). VFIO uses the API to implement support
> on POWER.
>
> So far only PowerNV IODA2 (POWER8 and newer machines) implemented this and
> other cases (POWER7 or nested KVM) did not and instead reused
> existing iommu_table structs. This means 1) no DDW 2) ownership transfer
> is done directly in the VFIO SPAPR TCE driver.
>
> [...]

Applied to powerpc/next.

[1/4] powerpc/iommu: Add "borrowing" iommu_table_group_ops
      https://git.kernel.org/powerpc/c/9d67c94335096311c0bc7556ad1022de7385790b

cheers
Re: (subset) [PATCH 1/2] selftests/powerpc/pmu: Fix sample field check in the mmcra_thresh_marked_sample_test
On Wed, 01 Mar 2023 22:39:17 +0530, Kajol Jain wrote:
> The testcase verifies the setting of different fields in Monitor Mode
> Control Register A (MMCRA). In the current code, EV_CODE_EXTRACT macro
> is used to extract the "sample" field, which then needs to be further
> processed to fetch rand_samp_elig and rand_samp_mode bits. But the
> current code is not passing valid sample field to EV_CODE_EXTRACT
> macro. Patch addresses this by fixing the input for EV_CODE_EXTRACT.
>
> [...]

Patch 1 applied to powerpc/next.

[1/2] selftests/powerpc/pmu: Fix sample field check in the mmcra_thresh_marked_sample_test
      https://git.kernel.org/powerpc/c/8a32341cf04ba05974931b4664683c2c9fb84e56

cheers
Re: [PATCH v2 2/4] powerpc/pci_64: Init pcibios subsys a bit later
On Mon, 6 Mar 2023 11:30:42 -0600 (CST), Timothy Pearson wrote:
> The following patches are going to add dependency/use of iommu_ops which
> is initialized in subsys_initcall as well.
>
> This moves pcibios_init() to the next initcall level.
>
> This should not cause behavioral change.
>
> [...]

Applied to powerpc/next.

[2/4] powerpc/pci_64: Init pcibios subsys a bit later
      https://git.kernel.org/powerpc/c/76f351096c4516f38b9c901a21797fa958588e3a

cheers
Re: [PATCH v2 3/4] powerpc/iommu: Add iommu_ops to report capabilities and
On Mon, 6 Mar 2023 11:31:00 -0600 (CST), Timothy Pearson wrote:
> allow blocking domains
>
> Up until now PPC64 managed to avoid using iommu_ops. The VFIO driver
> uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in
> the Type1 VFIO driver. Recent development added 2 uses of iommu_ops to
> the generic VFIO which broke POWER:
> - a coherency capability check;
> - blocking IOMMU domain - iommu_group_dma_owner_claimed()/...
>
> [...]

Applied to powerpc/next.

[3/4] powerpc/iommu: Add iommu_ops to report capabilities and
      https://git.kernel.org/powerpc/c/a940904443e432623579245babe63e2486ff327b

cheers
Re: [PATCH v2 4/4] Add myself to MAINTAINERS for Power VFIO support
On Mon, 6 Mar 2023 11:31:28 -0600 (CST), Timothy Pearson wrote:
>

Applied to powerpc/next.

[4/4] Add myself to MAINTAINERS for Power VFIO support
      https://git.kernel.org/powerpc/c/a34d2f0d79ec890b9b1b156a90016b6330173b8a

cheers
Re: [PATCH] powerpc/iommu: Fix notifiers being shared by PCI and VIO buses
On 22/03/23 9:23 am, Russell Currey wrote:

fail_iommu_setup() registers the fail_iommu_bus_notifier struct to both
PCI and VIO buses. struct notifier_block is a linked list node, so this
causes any notifiers later registered to either bus type to also be
registered to the other since they share the same node.

This causes issues in (at least) the vgaarb code, which registers a
notifier for PCI buses. pci_notify() ends up being called on a vio
device, converted with to_pci_dev() even though it's not a PCI device,
and finally makes a bad access in vga_arbiter_add_pci_device() as
discovered with KASAN:

BUG: KASAN: slab-out-of-bounds in vga_arbiter_add_pci_device+0x60/0xe00
Read of size 4 at addr c00264c26fdc by task swapper/0/1

Call Trace:
[c00263607520] [c00010f7023c] dump_stack_lvl+0x1bc/0x2b8 (unreliable)
[c00263607560] [cf142a64] print_report+0x3f4/0xc60
[c00263607640] [cf142144] kasan_report+0x244/0x698
[c00263607740] [cf1460e8] __asan_load4+0xe8/0x250
[c00263607760] [cff4b850] vga_arbiter_add_pci_device+0x60/0xe00
[c00263607850] [cff4c678] pci_notify+0x88/0x444
[c002636078b0] [ce94dfc4] notifier_call_chain+0x104/0x320
[c00263607950] [ce94f050] blocking_notifier_call_chain+0xa0/0x140
[c00263607990] [c000100cb3b8] device_add+0xac8/0x1d30
[c00263607aa0] [c000100ccd98] device_register+0x58/0x80
[c00263607ad0] [ce84247c] vio_register_device_node+0x9ac/0xce0
[c00263607ba0] [c000126c95d8] vio_bus_scan_register_devices+0xc4/0x13c
[c00263607bd0] [c000126c96e4] __machine_initcall_pseries_vio_device_init+0x94/0xf0
[c00263607c00] [ce69467c] do_one_initcall+0x12c/0xaa8
[c00263607cf0] [c0001268b8a8] kernel_init_freeable+0xa48/0xba8
[c00263607dd0] [ce695f24] kernel_init+0x64/0x400
[c00263607e50] [ce68e0e4] ret_from_kernel_thread+0x5c/0x64

Fix this by creating separate notifier_block structs for each bus type.

Fixes: d6b9a81b2a45 ("powerpc: IOMMU fault injection")
Reported-by: Nageswara R Sastry
Signed-off-by: Russell Currey
Tested-by: Nageswara R Sastry
---
 arch/powerpc/kernel/iommu.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ee95937bdaf1..6f1117fe3870 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -171,17 +171,26 @@ static int fail_iommu_bus_notify(struct notifier_block *nb,
 	return 0;
 }
 
-static struct notifier_block fail_iommu_bus_notifier = {
+/*
+ * PCI and VIO buses need separate notifier_block structs, since they're linked
+ * list nodes. Sharing a notifier_block would mean that any notifiers later
+ * registered for PCI buses would also get called by VIO buses and vice versa.
+ */
+static struct notifier_block fail_iommu_pci_bus_notifier = {
+	.notifier_call = fail_iommu_bus_notify
+};
+
+static struct notifier_block fail_iommu_vio_bus_notifier = {
 	.notifier_call = fail_iommu_bus_notify
 };
 
 static int __init fail_iommu_setup(void)
 {
 #ifdef CONFIG_PCI
-	bus_register_notifier(&pci_bus_type, &fail_iommu_bus_notifier);
+	bus_register_notifier(&pci_bus_type, &fail_iommu_pci_bus_notifier);
 #endif
 #ifdef CONFIG_IBMVIO
-	bus_register_notifier(&vio_bus_type, &fail_iommu_bus_notifier);
+	bus_register_notifier(&vio_bus_type, &fail_iommu_vio_bus_notifier);
 #endif
 
 	return 0;
--

Thanks and Regards
R.Nageswara Sastry
Re: [PATCH v8 1/3] riscv: Introduce CONFIG_RELOCATABLE
@linux-kbuild: Does anyone have an idea to solve this? Thanks!

On 2/22/23 13:29, Alexandre Ghiti wrote:

+cc linux-kbuild, llvm, Nathan, Nick

On 2/15/23 15:36, Alexandre Ghiti wrote:

From: Alexandre Ghiti

This config allows to compile 64b kernel as PIE and to relocate it at
any virtual address at runtime: this paves the way to KASLR.
Runtime relocation is possible since relocation metadata are embedded
into the kernel.

Note that relocating at runtime introduces an overhead even if the
kernel is loaded at the same address it was linked at and that the
compiler options are those used in arm64 which uses the same RELA
relocation format.

Signed-off-by: Alexandre Ghiti
---
 arch/riscv/Kconfig              | 14 +
 arch/riscv/Makefile             |  7 +++--
 arch/riscv/kernel/efi-header.S  |  6 ++--
 arch/riscv/kernel/vmlinux.lds.S | 10 --
 arch/riscv/mm/Makefile          |  4 +++
 arch/riscv/mm/init.c            | 54 -
 6 files changed, 87 insertions(+), 8 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e2b656043abf..e0ee7ce4b2e3 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -544,6 +544,20 @@ config COMPAT
 	  If you want to execute 32-bit userspace applications, say Y.
 
+config RELOCATABLE
+	bool "Build a relocatable kernel"
+	depends on MMU && 64BIT && !XIP_KERNEL
+	help
+	  This builds a kernel as a Position Independent Executable (PIE),
+	  which retains all relocation metadata required to relocate the
+	  kernel binary at runtime to a different virtual address than the
+	  address it was linked at.
+	  Since RISCV uses the RELA relocation format, this requires a
+	  relocation pass at runtime even if the kernel is loaded at the
+	  same address it was linked at.
+
+	  If unsure, say N.
+
 endmenu # "Kernel features"
 
 menu "Boot options"

diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index 82153960ac00..97c34136b027 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -7,9 +7,12 @@
 #
 
 OBJCOPYFLAGS	:= -O binary
-LDFLAGS_vmlinux	:=
+ifeq ($(CONFIG_RELOCATABLE),y)
+	LDFLAGS_vmlinux += -shared -Bsymbolic -z notext -z norelro
+	KBUILD_CFLAGS += -fPIE
+endif
 ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
-	LDFLAGS_vmlinux := --no-relax
+	LDFLAGS_vmlinux += --no-relax
 	KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
 	CC_FLAGS_FTRACE := -fpatchable-function-entry=8
 endif

diff --git a/arch/riscv/kernel/efi-header.S b/arch/riscv/kernel/efi-header.S
index 8e733aa48ba6..f7ee09c4f12d 100644
--- a/arch/riscv/kernel/efi-header.S
+++ b/arch/riscv/kernel/efi-header.S
@@ -33,7 +33,7 @@ optional_header:
 	.byte	0x02					// MajorLinkerVersion
 	.byte	0x14					// MinorLinkerVersion
 	.long	__pecoff_text_end - efi_header_end	// SizeOfCode
-	.long	__pecoff_data_virt_size			// SizeOfInitializedData
+	.long	__pecoff_data_virt_end - __pecoff_text_end	// SizeOfInitializedData
 	.long	0					// SizeOfUninitializedData
 	.long	__efistub_efi_pe_entry - _start		// AddressOfEntryPoint
 	.long	efi_header_end - _start			// BaseOfCode
@@ -91,9 +91,9 @@ section_table:
 		IMAGE_SCN_MEM_EXECUTE			// Characteristics
 
 	.ascii	".data\0\0\0"
-	.long	__pecoff_data_virt_size			// VirtualSize
+	.long	__pecoff_data_virt_end - __pecoff_text_end	// VirtualSize
 	.long	__pecoff_text_end - _start		// VirtualAddress
-	.long	__pecoff_data_raw_size			// SizeOfRawData
+	.long	__pecoff_data_raw_end - __pecoff_text_end	// SizeOfRawData
 	.long	__pecoff_text_end - _start		// PointerToRawData
 
 	.long	0					// PointerToRelocations

diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
index 4e6c88aa4d87..8be2de3be08c 100644
--- a/arch/riscv/kernel/vmlinux.lds.S
+++ b/arch/riscv/kernel/vmlinux.lds.S
@@ -122,9 +122,15 @@ SECTIONS
 		*(.sdata*)
 	}
 
+	.rela.dyn : ALIGN(8) {
+		__rela_dyn_start = .;
+		*(.rela .rela*)
+		__rela_dyn_end = .;
+	}
+

So I realized those relocations would be better in the init section so we
can get rid of them at some point. So I tried the following:

diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
index 7ac215467fd5..6111023a89ef 100644
--- a/arch/riscv/kernel/vmlinux.lds.S
+++ b/arch/riscv/kernel/vmlinux.lds.S
@@ -93,6 +93,12 @@ SECTIONS
 		*(.rel.dyn*)
 	}
 
+	.rela.dyn : ALIGN(8) {
+		__rela_dyn_start = .;
+		*(.rela .rela*)
+		__rela_dyn_end = .;
+	}
+
 	__init_data_end = .;
 
 	. = ALIGN(8);
@@ -119,12 +125,6 @@ SECTIONS
 		*(.sdata*)
 	}
 
-	.rela.dyn
Re: [PATCH v4 0/7] introduce vm_flags modifier functions
On Tue, Mar 14, 2023 at 02:11:44PM -0600, Alex Williamson wrote:
> On Thu, 26 Jan 2023 11:37:45 -0800
> Suren Baghdasaryan wrote:
>
> > This patchset was originally published as a part of per-VMA locking [1] and
> > was split after suggestion that it's viable on its own and to facilitate
> > the review process. It is now a prerequisite for the next version of per-VMA
> > lock patchset, which reuses vm_flags modifier functions to lock the VMA when
> > vm_flags are being updated.
> >
> > VMA vm_flags modifications are usually done under exclusive mmap_lock
> > protection because this attribute affects other decisions like VMA merging
> > or splitting and races should be prevented. Introduce vm_flags modifier
> > functions to enforce correct locking.
> >
> > The patchset applies cleanly over mm-unstable branch of mm tree.
>
> With this series, vfio-pci developed a bunch of warnings around not
> holding the mmap_lock write semaphore while calling
> io_remap_pfn_range() from our fault handler, vfio_pci_mmap_fault().
>
> I suspect vdpa has the same issue for their use of remap_pfn_range()
> from their fault handler, JasonW, MST, FYI.

Yeah, IMHO this whole approach has always been a bit sketchy, it was
never intended that remap_pfn would be called from a fault handler, you
are supposed to use a vmf_insert_pfn() type API from fault handlers.

> The reason for using remap_pfn_range() on fault in vfio-pci is that
> we're mapping device MMIO to userspace, where that MMIO can be disabled
> and we'd rather zap the mapping when that occurs so that we can sigbus
> the user rather than allow the user to trigger potentially fatal bus
> errors on the host.
>
> Peter Xu has suggested offline that a non-lazy approach to reinsert the
> mappings might be more inline with mm expectations relative to touching
> vm_flags during fault.

Yes, I feel the same - along with completing the address_space
conversion you had started. IIRC that was part of the reason we needed
this design in vfio.

Jason
Re: [PATCH v5 7/7] sched, smp: Trace smp callback causing an IPI
On Wed, Mar 22, 2023 at 12:20:28PM +, Valentin Schneider wrote: > On 22/03/23 10:53, Peter Zijlstra wrote: > > Hurmph... so we only really consume @func when we IPI. Would it not be > > more useful to trace this thing for *every* csd enqueued? > > It's true that any CSD enqueued on that CPU's call_single_queue in the > [first CSD llist_add()'ed, IPI IRQ hits] timeframe is a potential source of > interference. > > However, can we be sure that first CSD isn't an indirect cause for the > following ones? say the target CPU exits RCU EQS due to the IPI, there's a > bit of time before it gets to flush_smp_call_function_queue() where some > other CSD > could be enqueued *because* of that change in state. > > I couldn't find an easy example of that, I might be biased as this is where > I'd like to go wrt IPI'ing isolated CPUs in usermode. But regardless, when > correlating an IPI IRQ with its source, we'd always have to look at the > first CSD in that CSD stack. So I was thinking something like this: --- Subject: trace,smp: Trace all smp_function_call*() invocations From: Peter Zijlstra Date: Wed Mar 22 14:58:36 CET 2023 (Ab)use the trace_ipi_send_cpu*() family to trace all smp_function_call*() invocations, not only those that result in an actual IPI. The queued entries log their callback function while the actual IPIs are traced on generic_smp_call_function_single_interrupt().
Signed-off-by: Peter Zijlstra (Intel) --- kernel/smp.c | 58 ++ 1 file changed, 30 insertions(+), 28 deletions(-) --- a/kernel/smp.c +++ b/kernel/smp.c @@ -106,18 +106,20 @@ void __init call_function_init(void) } static __always_inline void -send_call_function_single_ipi(int cpu, smp_call_func_t func) +send_call_function_single_ipi(int cpu) { if (call_function_single_prep_ipi(cpu)) { - trace_ipi_send_cpu(cpu, _RET_IP_, func); + trace_ipi_send_cpu(cpu, _RET_IP_, + generic_smp_call_function_single_interrupt); arch_send_call_function_single_ipi(cpu); } } static __always_inline void -send_call_function_ipi_mask(const struct cpumask *mask, smp_call_func_t func) +send_call_function_ipi_mask(const struct cpumask *mask) { - trace_ipi_send_cpumask(mask, _RET_IP_, func); + trace_ipi_send_cpumask(mask, _RET_IP_, + generic_smp_call_function_single_interrupt); arch_send_call_function_ipi_mask(mask); } @@ -318,25 +320,6 @@ static __always_inline void csd_unlock(s smp_store_release(&csd->node.u_flags, 0); } -static __always_inline void -raw_smp_call_single_queue(int cpu, struct llist_node *node, smp_call_func_t func) -{ - /* -* The list addition should be visible to the target CPU when it pops -* the head of the list to pull the entry off it in the IPI handler -* because of normal cache coherency rules implied by the underlying -* llist ops. -* -* If IPIs can go out of order to the cache coherency protocol -* in an architecture, sufficient synchronisation should be added -* to arch code to make it appear to obey cache coherency WRT -* locking and barrier primitives. Generic code isn't really -* equipped to do the right thing... -*/ - if (llist_add(node, &per_cpu(call_single_queue, cpu))) - send_call_function_single_ipi(cpu, func); -} - static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data); void __smp_call_single_queue(int cpu, struct llist_node *node) @@ -356,10 +339,23 @@ void __smp_call_single_queue(int cpu, st func = CSD_TYPE(csd) == CSD_TYPE_TTWU ? 
sched_ttwu_pending : csd->func; - raw_smp_call_single_queue(cpu, node, func); - } else { - raw_smp_call_single_queue(cpu, node, NULL); + trace_ipi_send_cpu(cpu, _RET_IP_, func); } + + /* +* The list addition should be visible to the target CPU when it pops +* the head of the list to pull the entry off it in the IPI handler +* because of normal cache coherency rules implied by the underlying +* llist ops. +* +* If IPIs can go out of order to the cache coherency protocol +* in an architecture, sufficient synchronisation should be added +* to arch code to make it appear to obey cache coherency WRT +* locking and barrier primitives. Generic code isn't really +* equipped to do the right thing... +*/ + if (llist_add(node, &per_cpu(call_single_queue, cpu))) + send_call_function_single_ipi(cpu); } /* @@ -798,14 +794,20 @@ static void smp_call_function_many_cond( } /* +* Trace each smp_function_call_*() as an IPI, actual IPIs +* will be traced with func==generic_smp_call_fu
Re: [PATCH v5 7/7] sched, smp: Trace smp callback causing an IPI
On 22/03/23 15:04, Peter Zijlstra wrote: > On Wed, Mar 22, 2023 at 12:20:28PM +, Valentin Schneider wrote: >> On 22/03/23 10:53, Peter Zijlstra wrote: > >> > Hurmph... so we only really consume @func when we IPI. Would it not be >> > more useful to trace this thing for *every* csd enqueued? >> >> It's true that any CSD enqueued on that CPU's call_single_queue in the >> [first CSD llist_add()'ed, IPI IRQ hits] timeframe is a potential source of >> interference. >> >> However, can we be sure that first CSD isn't an indirect cause for the >> following ones? say the target CPU exits RCU EQS due to the IPI, there's a >> bit of time before it gets to flush_smp_call_function_queue() where some >> other CSD >> could be enqueued *because* of that change in state. >> >> I couldn't find an easy example of that, I might be biased as this is where >> I'd like to go wrt IPI'ing isolated CPUs in usermode. But regardless, when >> correlating an IPI IRQ with its source, we'd always have to look at the >> first CSD in that CSD stack. > > So I was thinking something like this: > > --- > Subject: trace,smp: Trace all smp_function_call*() invocations > From: Peter Zijlstra > Date: Wed Mar 22 14:58:36 CET 2023 > > (Ab)use the trace_ipi_send_cpu*() family to trace all > smp_function_call*() invocations, not only those that result in an > actual IPI. > > The queued entries log their callback function while the actual IPIs > are traced on generic_smp_call_function_single_interrupt(). 
> > Signed-off-by: Peter Zijlstra (Intel) > --- > kernel/smp.c | 58 > ++ > 1 file changed, 30 insertions(+), 28 deletions(-) > > --- a/kernel/smp.c > +++ b/kernel/smp.c > @@ -106,18 +106,20 @@ void __init call_function_init(void) > } > > static __always_inline void > -send_call_function_single_ipi(int cpu, smp_call_func_t func) > +send_call_function_single_ipi(int cpu) > { > if (call_function_single_prep_ipi(cpu)) { > - trace_ipi_send_cpu(cpu, _RET_IP_, func); > + trace_ipi_send_cpu(cpu, _RET_IP_, > +generic_smp_call_function_single_interrupt); Hm, this does get rid of the func being passed down the helpers, but this means the trace events are now stateful, i.e. I need the first and last events in a CSD stack to figure out which one actually caused the IPI. It also requires whoever is looking at the trace to be aware of which IPIs are attached to a CSD, and which ones aren't. ATM that's only the resched IPI, but per the cover letter there's more to come (e.g. tick_broadcast() for arm64/riscv and a few others). For instance: hackbench-157 [001]10.894320: ipi_send_cpu: cpu=3 callsite=check_preempt_curr+0x37 callback=0x0 hackbench-157 [001]10.895068: ipi_send_cpu: cpu=3 callsite=try_to_wake_up+0x29e callback=sched_ttwu_pending+0x0 hackbench-157 [001]10.895068: ipi_send_cpu: cpu=3 callsite=try_to_wake_up+0x29e callback=generic_smp_call_function_single_interrupt+0x0 That first one sent a RESCHEDULE IPI, the second one a CALL_FUNCTION one, but you really have to know what you're looking at... Are you worried about the @func being pushed down? Staring at x86 asm is not good for the soul, but AFAICT this does cause an extra register to be popped in the prologue because all of the helpers are __always_inline, so both paths of the static key(s) are in the same stackframe. 
I can "improve" this with: --- diff --git a/kernel/smp.c b/kernel/smp.c index 5cd680a7e78ef..55f120dae1713 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -511,6 +511,26 @@ raw_smp_call_single_queue(int cpu, struct llist_node *node, smp_call_func_t func static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data); +static noinline void __smp_call_single_queue_trace(int cpu, struct llist_node *node) +{ + call_single_data_t *csd; + smp_call_func_t func; + + + /* +* We have to check the type of the CSD before queueing it, because +* once queued it can have its flags cleared by +* flush_smp_call_function_queue() +* even if we haven't sent the smp_call IPI yet (e.g. the stopper +* executes migration_cpu_stop() on the remote CPU). +*/ + csd = container_of(node, call_single_data_t, node.llist); + func = CSD_TYPE(csd) == CSD_TYPE_TTWU ? + sched_ttwu_pending : csd->func; + + raw_smp_call_single_queue(cpu, node, func); +} + void __smp_call_single_queue(int cpu, struct llist_node *node) { #ifdef CONFIG_CSD_LOCK_WAIT_DEBUG @@ -525,25 +545,10 @@ void __smp_call_single_queue(int cpu, struct llist_node *node) } } #endif - /* -* We have to check the type of the CSD before queueing it, because -* once queued it can have its flags cleared by -* flush_smp_call_function_queue() -* even if we haven't sent the smp_call IPI yet (e.g. th
Re: [PATCH v5 7/7] sched, smp: Trace smp callback causing an IPI
On Wed, Mar 22, 2023 at 05:01:13PM +, Valentin Schneider wrote: > > So I was thinking something like this: > Hm, this does get rid of the func being passed down the helpers, but this > means the trace events are now stateful, i.e. I need the first and last > events in a CSD stack to figure out which one actually caused the IPI. Isn't much of tracing stateful? I mean, why am I always writing awk programs to parse trace output? The one that is directly followed by generic_smp_call_function_single_interrupt() (horrible name that), is the one that tripped the IPI. > It also requires whoever is looking at the trace to be aware of which IPIs > are attached to a CSD, and which ones aren't. ATM that's only the resched > IPI, but per the cover letter there's more to come (e.g. tick_broadcast() > for arm64/riscv and a few others). For instance: > >hackbench-157 [001]10.894320: ipi_send_cpu: cpu=3 > callsite=check_preempt_curr+0x37 callback=0x0 Arguably we should be setting callback to scheduler_ipi(), except of course, that's not an actual function... Maybe we can do "extern inline" for the actual users and provide a dummy function for the symbol when tracing. >hackbench-157 [001]10.895068: ipi_send_cpu: cpu=3 > callsite=try_to_wake_up+0x29e callback=sched_ttwu_pending+0x0 >hackbench-157 [001]10.895068: ipi_send_cpu: cpu=3 > callsite=try_to_wake_up+0x29e > callback=generic_smp_call_function_single_interrupt+0x0 > > That first one sent a RESCHEDULE IPI, the second one a CALL_FUNCTION one, > but you really have to know what you're looking at... But you have to know that anyway, you can't do tracing and not know wtf you're doing. Or rather, if you do, I don't give a crap and you can keep the pieces :-) Grepping the callback should be pretty quick resolution as to what trips it, no? (also, if you *really* can't manage, we can always add yet another argument that gives a type thingy) > Are you worried about the @func being pushed down? 
Not really, I was finding it odd that only the first csd was being logged. Either you should log them all (after all, the target CPU will run them all and you might still wonder where the heck they came from) or it should log none and always report that hideous long function name I can't be arsed to type again :-) > Staring at x86 asm is not good for the soul, Scarred for life :-) What's worse, due to being exposed to Intel syntax at a young age, I'm now permanently confused as to the argument order of x86 asm.
Re: [PATCH 0/2] KVM: PPC: support kvm selftests
On Thu, Mar 16, 2023, Michael Ellerman wrote: > Nicholas Piggin writes: > > Hi, > > > > This series adds initial KVM selftests support for powerpc > > (64-bit, BookS). > > Awesome. > > > It spans 3 maintainers but it does not really > > affect arch/powerpc, and it is well contained in selftests > > code, just touches some makefiles and a tiny bit headers so > > conflicts should be unlikely and trivial. > > > > I guess Paolo is the best point to merge these, if no comments > > or objections? > > Yeah. If it helps: > > Acked-by: Michael Ellerman (powerpc) What is the long term plan for KVM PPC maintenance? I was under the impression that KVM PPC was trending toward "bug fixes only", but the addition of selftests support suggests otherwise. I ask primarily because routing KVM PPC patches through the PPC tree is going to be problematic if KVM PPC sees significant development. The current situation is ok because the volume of patches is low and KVM PPC isn't trying to drive anything substantial into common KVM code, but if that changes... My other concern is that for selftests specifically, us KVM folks are taking on more maintenance burden by supporting PPC. AFAIK, none of the people that focus on KVM selftests in any meaningful capacity have access to PPC hardware, let alone know enough about the architecture to make intelligent code changes. Don't get me wrong, I'm very much in favor of more testing, I just don't want KVM to get left holding the bag.
Re: [next-20230317][PPC/MLX5][bisected 4d5ab0a] Boot WARNING: CPU: 0 PID: 9 at net/core/dev.c:1928 call_netdevice_notifiers_info
On 3/20/23 6:55 PM, Lorenzo Bianconi wrote: Greetings Warning is seen while booting kernels from 6.3.0-rc3-next-20230317 on my powerpc Power 10 LPAR Boots fine without warnings when below patch is reverted commit 4d5ab0ad964df178beba031b89429a601893ff61 Author: Lorenzo Bianconi Date: Thu Mar 9 13:25:31 2023 +0100 net/mlx5e: take into account device reconfiguration for xdp_features flag Take into account LRO and GRO configuration setting device xdp_features flag. Consider channel rq_wq_type enabling rx scatter-gather support in xdp_features flag and disable NETDEV_XDP_ACT_NDO_XMIT_SG since it is not supported yet by the driver. Moreover always enable NETDEV_XDP_ACT_NDO_XMIT as the ndo_xdp_xmit 4d5ab0ad got introduced in next-20230314 @Lorenzo Could you please look into this I would say this issue has already been fixed by Jakub here: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/net/core/xdp.c?id=769639c1fe8a98129aa97c8ee981639db1e8955c Thanks Lorenzo, Verified the patch and it fixes the problem and next-20230321 kernel boots fine on my powerpc lpar Tested-by: Abdul Haleem -- Regards Abdul Haleem IBM Linux Technology Center
Re: [PATCH v5 7/7] sched, smp: Trace smp callback causing an IPI
On 22/03/23 18:22, Peter Zijlstra wrote: > On Wed, Mar 22, 2023 at 05:01:13PM +, Valentin Schneider wrote: > >> > So I was thinking something like this: > >> Hm, this does get rid of the func being passed down the helpers, but this >> means the trace events are now stateful, i.e. I need the first and last >> events in a CSD stack to figure out which one actually caused the IPI. > > Isn't much of tracing stateful? I mean, why am I always writing awk > programs to parse trace output? > > The one that is directly followed by > generic_smp_call_function_single_interrupt() (horrible name that), is > the one that tripped the IPI. > Right. >> It also requires whoever is looking at the trace to be aware of which IPIs >> are attached to a CSD, and which ones aren't. ATM that's only the resched >> IPI, but per the cover letter there's more to come (e.g. tick_broadcast() >> for arm64/riscv and a few others). For instance: >> >>hackbench-157 [001]10.894320: ipi_send_cpu: cpu=3 >> callsite=check_preempt_curr+0x37 callback=0x0 > > Arguably we should be setting callback to scheduler_ipi(), except > ofcourse, that's not an actual function... > > Maybe we can do "extern inline" for the actual users and provide a dummy > function for the symbol when tracing. > Huh, I wasn't aware that was an option, I'll look into that. I did scribble down a comment next to smp_send_reschedule(), but having a decodable function name would be better! >>hackbench-157 [001]10.895068: ipi_send_cpu: cpu=3 >> callsite=try_to_wake_up+0x29e callback=sched_ttwu_pending+0x0 >>hackbench-157 [001]10.895068: ipi_send_cpu: cpu=3 >> callsite=try_to_wake_up+0x29e >> callback=generic_smp_call_function_single_interrupt+0x0 >> >> That first one sent a RESCHEDULE IPI, the second one a CALL_FUNCTION one, >> but you really have to know what you're looking at... > > But you have to know that anyway, you can't do tracing and not know wtf > you're doing. 
Or rather, if you do, I don't give a crap and you can keep > the pieces :-) > > Grepping the callback should be pretty quick resolution at to what trips > it, no? > > (also, if you *really* can't manage, we can always add yet another > argument that gives a type thingy) > Ah, I was a bit unclear here - I don't care too much about the IPI type being used, but rather being able to figure out on IRQ entry where that IPI came from - thinking some more about now, I don't think logging *all* CSDs causes an issue there, as you'd look at the earliest-not-seen-yet event targeting this CPU anyway. That'll be made easy once I get to having cpumask filters for ftrace, so I can just issue something like: trace-cmd record -e 'ipi_send_cpu' -f "cpu == 3" -e 'ipi_send_cpumask' -f "cpus \in {3}" -T hackbench (it's somewhere on the todolist...) TL;DR: I *think* I've convinced myself logging all of them isn't an issue - I'm going to play with this on something "smarter" than just hackbench under QEMU just to drill it in.
Re: [PATCH v8 1/3] riscv: Introduce CONFIG_RELOCATABLE
On Fri, Feb 24, 2023 at 7:58 AM Björn Töpel wrote: > > Alexandre Ghiti writes: > > > +cc linux-kbuild, llvm, Nathan, Nick > > > > On 2/15/23 15:36, Alexandre Ghiti wrote: > >> From: Alexandre Ghiti > >> > > I tried a lot of things, but I struggle to understand, does anyone have > > any idea? FYI, the same problem happens with LLVM. Off the top of my head, no idea. (Maybe as a follow up to this series, I wonder if pursuing ARCH_HAS_RELR for ARCH=riscv is worthwhile?) > > Don't ask me *why*, but adding --emit-relocs to your linker flags solves > "the NULL .rela.dyn" both for GCC and LLVM. > > The downside is that you end up with a bunch of .rela cruft in your > vmlinux. There was a patch just this week to use $(OBJCOPY) to strip these from vmlinux (for x86). Looks like x86 uses --emit-relocs for KASLR: https://lore.kernel.org/lkml/20230320121006.4863-1-petr.pa...@suse.com/ -- Thanks, ~Nick Desaulniers
Re: [PATCH v6 1/4] PCI: Introduce pci_dev_for_each_resource()
Hi Andy and Mika, I really like the improvements here. They make the code read much better. On Mon, Mar 20, 2023 at 03:16:30PM +0200, Andy Shevchenko wrote: > From: Mika Westerberg > ... > static void fixup_winbond_82c105(struct pci_dev* dev) > { > - int i; > + struct resource *r; > unsigned int reg; > > if (!machine_is(pseries)) > @@ -251,14 +251,14 @@ static void fixup_winbond_82c105(struct pci_dev* dev) > /* Enable LEGIRQ to use INTC instead of ISA interrupts */ > pci_write_config_dword(dev, 0x40, reg | (1<<11)); > > - for (i = 0; i < DEVICE_COUNT_RESOURCE; ++i) { > + pci_dev_for_each_resource_p(dev, r) { > /* zap the 2nd function of the winbond chip */ > - if (dev->resource[i].flags & IORESOURCE_IO > - && dev->bus->number == 0 && dev->devfn == 0x81) > - dev->resource[i].flags &= ~IORESOURCE_IO; > - if (dev->resource[i].start == 0 && dev->resource[i].end) { > - dev->resource[i].flags = 0; > - dev->resource[i].end = 0; > + if (dev->bus->number == 0 && dev->devfn == 0x81 && > + r->flags & IORESOURCE_IO) This is a nice literal conversion, but it's kind of lame to test bus->number and devfn *inside* the loop here, since they can't change inside the loop. > + r->flags &= ~IORESOURCE_IO; > + if (r->start == 0 && r->end) { > + r->flags = 0; > + r->end = 0; > } > } > #define pci_resource_len(dev,bar) \ > ((pci_resource_end((dev), (bar)) == 0) ? 0 :\ > \ > - (pci_resource_end((dev), (bar)) - \ > - pci_resource_start((dev), (bar)) + 1)) > + resource_size(pci_resource_n((dev), (bar I like this change, but it's unrelated to pci_dev_for_each_resource() and unmentioned in the commit log. 
> +#define __pci_dev_for_each_resource(dev, res, __i, vartype) \ > + for (vartype __i = 0; \ > + res = pci_resource_n(dev, __i), __i < PCI_NUM_RESOURCES; \ > + __i++) > + > +#define pci_dev_for_each_resource(dev, res, i) > \ > + __pci_dev_for_each_resource(dev, res, i, ) > + > +#define pci_dev_for_each_resource_p(dev, res) > \ > + __pci_dev_for_each_resource(dev, res, __i, unsigned int) This series converts many cases to drop the iterator variable ("i"), which is fantastic. Several of the remaining places need the iterator variable only to call pci_claim_resource(), which could be converted to take a "struct resource *" directly without much trouble. We don't have to do that pci_claim_resource() conversion now, but since we're converging on the "(dev, res)" style, I think we should reverse the names so we have something like: pci_dev_for_each_resource(dev, res) pci_dev_for_each_resource_idx(dev, res, i) Not sure __pci_dev_for_each_resource() is worthwhile since it only avoids repeating that single "for" statement, and passing in "vartype" (sometimes empty to implicitly avoid the declaration) is a little complicated to read. I think it'd be easier to read like this: #define pci_dev_for_each_resource(dev, res) \ for (unsigned int __i = 0; \ res = pci_resource_n(dev, __i), __i < PCI_NUM_RESOURCES; \ __i++) #define pci_dev_for_each_resource_idx(dev, res, idx) \ for (idx = 0; \ res = pci_resource_n(dev, idx), idx < PCI_NUM_RESOURCES; \ idx++) Bjorn
Re: [PATCH v6 2/4] PCI: Split pci_bus_for_each_resource_p() out of pci_bus_for_each_resource()
On Mon, Mar 20, 2023 at 03:16:31PM +0200, Andy Shevchenko wrote: > ... > -#define pci_bus_for_each_resource(bus, res, i) > \ > - for (i = 0; \ > - (res = pci_bus_resource_n(bus, i)) || i < PCI_BRIDGE_RESOURCE_NUM; \ > - i++) > +#define __pci_bus_for_each_resource(bus, res, __i, vartype) > \ > + for (vartype __i = 0; > \ > + res = pci_bus_resource_n(bus, __i), __i < PCI_BRIDGE_RESOURCE_NUM; > \ > + __i++) > + > +#define pci_bus_for_each_resource(bus, res, i) > \ > + __pci_bus_for_each_resource(bus, res, i, ) > + > +#define pci_bus_for_each_resource_p(bus, res) > \ > + __pci_bus_for_each_resource(bus, res, __i, unsigned int) I like these changes a lot, too! Same comments about _p vs _idx and __pci_bus_for_each_resource(..., vartype). Also would prefer 80 char max instead of 81.
Re: [Skiboot] [PATCH V4 1/3] core/device: Add function to return child node using name at substring "@"
Hi Athira, On Mon, Mar 06, 2023 at 09:09:11AM +0530, Athira Rajeev wrote: Add a function dt_find_by_name_substr() that returns the child node if it matches till the first occurrence of "@" in a given name, otherwise NULL. Given this summary, I don't understand the following: + assert(dt_find_by_name_substr(root, "node@1") == addr1); + assert(dt_find_by_name_substr(root, "node0_1@2") == addr2); Is this behavior required? I don't think it makes sense to call this function with a second argument containing '@', so I wouldn't expect it to match anything in these cases. The function seems to specifically enable it: +struct dt_node *dt_find_by_name_substr(struct dt_node *root, const char *name) +{ [snip] + node = strdup(name); + if (!node) + return NULL; + node = strtok(node, "@"); Seems like you could get rid of this and just use name as-is. I was curious about something else; say we have 'node@1' and 'node@2'. Is there an expectation of which it should match? addr1 = dt_new_addr(root, "node", 0x1); addr2 = dt_new_addr(root, "node", 0x2); assert(dt_find_by_name_substr(root, "node") == ???); ^^^ +/* Find a child node by name and substring */ +struct dt_node *dt_find_by_name_substr(struct dt_node *root, const char *name); I think this name fit better in previous versions of the patch, but since you're specifically looking for '@' now, maybe call it something like dt_find_by_name_before_addr? -- Reza Arbab
Re: [Skiboot] [PATCH V4 3/3] skiboot: Update IMC PMU node names for power10
On Mon, Mar 06, 2023 at 09:09:13AM +0530, Athira Rajeev wrote: + } else if (proc_gen == proc_gen_p10) { + int val; + char al[8], xl[8], otl[8], phb[8]; Are four different variables needed here? I think you could just reuse one temporary variable. char name[8]; + for (i=0; i<8; i++) { + val = ((avl_vec & (0x7ULL << (29 + (3 * i >> (29 + (3 * i))); + switch (val) { + case 0x5: //xlink configured and functional + + snprintf(al, sizeof(al), "alink%1d",(7-i)); + target = dt_find_by_name_substr(dev, al); + if (target) + dt_free(target); + + snprintf(otl, sizeof(otl),"otl%1d_0",(7-i)); + target = dt_find_by_name_substr(dev, otl); + if (target) + dt_free(target); + + snprintf(otl,sizeof(otl),"otl%1d_1",(7-i)); + target = dt_find_by_name_substr(dev, otl); + if (target) + dt_free(target); + + break; + case 0x6: //alink configured and functional + + snprintf(xl,sizeof(xl),"xlink%1d",(7-i)); + target = dt_find_by_name_substr(dev, xl); + if (target) + dt_free(target); + + snprintf(otl,sizeof(otl),"otl%1d_0",(7-i)); + target = dt_find_by_name_substr(dev, otl); + if (target) + dt_free(target); + + snprintf(otl,sizeof(otl),"otl%1d_1",(7-i)); + target = dt_find_by_name_substr(dev, otl); + if (target) + dt_free(target); + break; + + case 0x7: //CAPI configured and functional + snprintf(al,sizeof(al),"alink%1d",(7-i)); + target = dt_find_by_name_substr(dev, al); + if (target) + dt_free(target); + + snprintf(xl,sizeof(xl),"xlink%1d",(7-i)); + target = dt_find_by_name_substr(dev, xl); + if (target) + dt_free(target); + break; + default: + snprintf(xl,sizeof(xl),"xlink%1d",(7-i)); + target = dt_find_by_name_substr(dev, xl); + if (target) + dt_free(target); + + snprintf(al,sizeof(al),"alink%1d",(7-i)); + target = dt_find_by_name_substr(dev, al); + if (target) + dt_free(target); + + snprintf(otl,sizeof(otl),"otl%1d_0",(7-i)); + target = dt_find_by_name_substr(dev, otl); + if (target) + dt_free(target); + + snprintf(otl,sizeof(otl),"otl%1d_1",(7-i)); + target = dt_find_by_name_substr(dev, 
otl); + if (target) + dt_free(target); + break; As far as I know skiboot follows the kernel coding style. Would you mind fixing up the minor style nits checkpatch.pl reports for this patch? -- Reza Arbab
Re: [PATCH v5 7/7] sched, smp: Trace smp callback causing an IPI
On Wed, Mar 22, 2023 at 06:22:28PM +, Valentin Schneider wrote: > On 22/03/23 18:22, Peter Zijlstra wrote: > >>hackbench-157 [001]10.894320: ipi_send_cpu: cpu=3 > >> callsite=check_preempt_curr+0x37 callback=0x0 > > > > Arguably we should be setting callback to scheduler_ipi(), except > > ofcourse, that's not an actual function... > > > > Maybe we can do "extern inline" for the actual users and provide a dummy > > function for the symbol when tracing. > > > > Huh, I wasn't aware that was an option, I'll look into that. I did scribble > down a comment next to smp_send_reschedule(), but having a decodable > function name would be better! So clang-15 builds the below (and generates the expected code), but gcc-12 vomits nonsense about a non-static inline calling a static inline or somesuch bollocks :-/ --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1991,7 +1991,7 @@ extern char *__get_task_comm(char *to, s }) #ifdef CONFIG_SMP -static __always_inline void scheduler_ipi(void) +extern __always_inline void scheduler_ipi(void) { /* * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting --- a/include/linux/smp.h +++ b/include/linux/smp.h @@ -130,9 +130,9 @@ extern void arch_smp_send_reschedule(int * scheduler_ipi() is inline so can't be passed as callback reason, but the * callsite IP should be sufficient for root-causing IPIs sent from here. */ -#define smp_send_reschedule(cpu) ({ \ - trace_ipi_send_cpu(cpu, _RET_IP_, NULL); \ - arch_smp_send_reschedule(cpu);\ +#define smp_send_reschedule(cpu) ({\ + trace_ipi_send_cpu(cpu, _RET_IP_, &scheduler_ipi); \ + arch_smp_send_reschedule(cpu); \ }) /* --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3790,6 +3790,15 @@ static int ttwu_runnable(struct task_str } #ifdef CONFIG_SMP +void scheduler_ipi(void) +{ + /* +* Actual users should end up using the extern inline, this is only +* here for the symbol. +*/ + BUG(); +} + void sched_ttwu_pending(void *arg) { struct llist_node *llist = arg;
Re: [PATCH 4/8] powerpc/rtas: fix miswording in rtas_function kerneldoc
On Mon, 2023-03-06 at 15:33 -0600, Nathan Lynch via B4 Relay wrote: > From: Nathan Lynch > > The 'filter' member is a pointer, not a bool; fix the wording > accordingly. > > Signed-off-by: Nathan Lynch Reviewed-by: Andrew Donnellan > --- > arch/powerpc/kernel/rtas.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c > index c73b01d722f6..c29c38b1a55a 100644 > --- a/arch/powerpc/kernel/rtas.c > +++ b/arch/powerpc/kernel/rtas.c > @@ -68,7 +68,7 @@ struct rtas_filter { > * functions are believed to have no > users on > * ppc64le, and we want to keep it that > way. It does > * not make sense for this to be set when > @filter > - * is false. > + * is NULL. > */ > struct rtas_function { > s32 token; > -- Andrew Donnellan, OzLabs, ADL Canberra a...@linux.ibm.com IBM Australia Limited
[powerpc:next] BUILD SUCCESS 3a713753d3cb52e4e3039cdb906ef00f0b574219
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next branch HEAD: 3a713753d3cb52e4e3039cdb906ef00f0b574219 powerpc: Simplify sysctl registration for nmi_wd_lpm_factor_ctl_table elapsed time: 732m configs tested: 98 configs skipped: 5 The following configs have been built successfully. More configs may be tested in the coming days. tested configs: alphaallyesconfig gcc alpha defconfig gcc arc allyesconfig gcc arc buildonly-randconfig-r003-20230322 gcc arc buildonly-randconfig-r004-20230322 gcc arc defconfig gcc arc randconfig-r003-20230322 gcc arc randconfig-r043-20230322 gcc arm allmodconfig gcc arm allyesconfig gcc arm buildonly-randconfig-r001-20230322 clang arm defconfig gcc arm randconfig-r046-20230322 clang arm64allyesconfig gcc arm64 defconfig gcc cskydefconfig gcc csky randconfig-r011-20230322 gcc csky randconfig-r012-20230322 gcc csky randconfig-r016-20230322 gcc csky randconfig-r024-20230322 gcc csky randconfig-r026-20230322 gcc hexagon randconfig-r014-20230322 clang hexagon randconfig-r041-20230322 clang hexagon randconfig-r045-20230322 clang i386 allyesconfig gcc i386 debian-10.3 gcc i386defconfig gcc i386 randconfig-a001 gcc i386 randconfig-a002 clang i386 randconfig-a003 gcc i386 randconfig-a004 clang i386 randconfig-a005 gcc i386 randconfig-a006 clang i386 randconfig-a011 clang i386 randconfig-a012 gcc i386 randconfig-a013 clang i386 randconfig-a014 gcc i386 randconfig-a015 clang i386 randconfig-a016 gcc ia64 allmodconfig gcc ia64defconfig gcc loongarchallmodconfig gcc loongarch allnoconfig gcc loongarchbuildonly-randconfig-r002-20230322 gcc loongarch defconfig gcc m68k allmodconfig gcc m68kdefconfig gcc mips allmodconfig gcc mips allyesconfig gcc mips randconfig-r035-20230322 gcc mips randconfig-r036-20230322 gcc nios2 defconfig gcc openrisc randconfig-r015-20230322 gcc openrisc randconfig-r023-20230322 gcc openrisc randconfig-r031-20230322 gcc parisc defconfig gcc parisc randconfig-r034-20230322 gcc parisc64defconfig 
gcc powerpc allmodconfig gcc powerpc allnoconfig gcc powerpc randconfig-r001-20230322 clang powerpc randconfig-r013-20230322 gcc riscvallmodconfig gcc riscv allnoconfig gcc riscv defconfig gcc riscvrandconfig-r021-20230322 gcc riscvrandconfig-r042-20230322 gcc riscv rv32_defconfig gcc s390 allmodconfig gcc s390 allyesconfig gcc s390defconfig gcc s390 randconfig-r004-20230322 clang s390 randconfig-r022-20230322 gcc s390 randconfig-r044-20230322 gcc sh allmodconfig gcc sparcbuildonly-randconfig-r006-20230322 gcc sparc defconfig gcc sparc64 randconfig-r002-20230322 gcc sparc64 randconfig-r032-20230322 gcc sparc64 randconfig-r033-20230322 gcc um i386_defconfig gcc um x86_64_defconfig gcc x86_64allnoconfig gcc x86_64 allyesconfig gcc x86_64 defconfig gcc x86_64
Re: [PATCH 1/8] powerpc/rtas: ensure 8-byte alignment for struct rtas_args
On Mon, 2023-03-06 at 15:33 -0600, Nathan Lynch via B4 Relay wrote:
> > From: Nathan Lynch
> >
> > CHRP and PAPR agree: "In order to make an RTAS call, the operating
> > system must construct an argument call buffer aligned on an eight byte
> > boundary in physically contiguous real memory [...]." (7.2.7 Calling
> > Mechanism and Conventions).
> >
> > struct rtas_args is the type used for this argument call buffer. The
> > unarchitected 'rets' member happens to produce 8-byte alignment for
> > the struct on 64-bit targets in practice. But without an alignment
> > directive the structure will have only 4-byte alignment on 32-bit
> > targets:
> >
> > $ nm b/{before,after}/chrp32/vmlinux | grep rtas_args
> > c096881c b rtas_args
> > c0968820 b rtas_args
> >
> > Add an alignment directive to the struct rtas_args declaration so all
> > instances have the alignment required by the specs. rtas-types.h no
> > longer refers to any spinlock types, so drop the spinlock_types.h
> > inclusion while we're here.
> >
> > Signed-off-by: Nathan Lynch

Reviewed-by: Andrew Donnellan

> > ---
> >  arch/powerpc/include/asm/rtas-types.h | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/rtas-types.h b/arch/powerpc/include/asm/rtas-types.h
> > index f2ad4a96cbc5..861145c8a021 100644
> > --- a/arch/powerpc/include/asm/rtas-types.h
> > +++ b/arch/powerpc/include/asm/rtas-types.h
> > @@ -2,7 +2,8 @@
> >  #ifndef _ASM_POWERPC_RTAS_TYPES_H
> >  #define _ASM_POWERPC_RTAS_TYPES_H
> >
> > -#include <linux/spinlock_types.h>
> > +#include <linux/compiler_attributes.h>
> > +#include <linux/sizes.h>
> >
> >  typedef __be32 rtas_arg_t;
> >
> > @@ -12,7 +13,7 @@ struct rtas_args {
> >  	__be32 nret;
> >  	rtas_arg_t args[16];
> >  	rtas_arg_t *rets;	/* Pointer to return values in args[]. */
> > -};
> > +} __aligned(SZ_8);

Nowhere else in the kernel is __aligned(SZ_8) used over just __aligned(8), which I suppose would also save an include, but I don't care either way.

> >
> >  struct rtas_t {
> >  	unsigned long entry;	/* physical address pointer */

--
Andrew Donnellan              OzLabs, ADL Canberra
a...@linux.ibm.com            IBM Australia Limited
Re: [PATCH 2/8] powerpc/rtas: use memmove for potentially overlapping buffer copy
On Mon, 2023-03-06 at 15:33 -0600, Nathan Lynch via B4 Relay wrote:
> From: Nathan Lynch
>
> Using memcpy() isn't safe when buf is identical to rtas_err_buf, which
> can happen during boot before slab is up. Full context which may not
> be obvious from the diff:
>
> 	if (altbuf) {
> 		buf = altbuf;
> 	} else {
> 		buf = rtas_err_buf;
> 		if (slab_is_available())
> 			buf = kmalloc(RTAS_ERROR_LOG_MAX, GFP_ATOMIC);
> 	}
> 	if (buf)
> 		memcpy(buf, rtas_err_buf, RTAS_ERROR_LOG_MAX);
>
> This was found by inspection and I'm not aware of it causing problems
> in practice. It appears to have been introduced by commit
> 033ef338b6e0 ("powerpc: Merge rtas.c into arch/powerpc/kernel"); the
> old ppc64 version of this code did not have this problem.
>
> Use memmove() instead.
>
> Fixes: 033ef338b6e0 ("powerpc: Merge rtas.c into arch/powerpc/kernel")
> Signed-off-by: Nathan Lynch

Reviewed-by: Andrew Donnellan

--
Andrew Donnellan              OzLabs, ADL Canberra
a...@linux.ibm.com            IBM Australia Limited
Re: [PATCH 3/8] powerpc/rtas: rtas_call_unlocked() kerneldoc
On Mon, 2023-03-06 at 15:33 -0600, Nathan Lynch via B4 Relay wrote:
> From: Nathan Lynch
>
> Add documentation for rtas_call_unlocked(), including details on how
> it differs from rtas_call().
>
> Signed-off-by: Nathan Lynch

Reviewed-by: Andrew Donnellan

--
Andrew Donnellan              OzLabs, ADL Canberra
a...@linux.ibm.com            IBM Australia Limited
Re: [PATCH 5/8] powerpc/rtas: rename va_rtas_call_unlocked() to va_rtas_call()
On Mon, 2023-03-06 at 15:33 -0600, Nathan Lynch via B4 Relay wrote:
> From: Nathan Lynch
>
> The function name va_rtas_call_unlocked() is confusing: it may be
> called with or without rtas_lock held. Rename it to va_rtas_call().
>
> Signed-off-by: Nathan Lynch

Not a huge fan of the name; the va_ prefix suggests that the only difference between this function and rtas_call() is the varargs handling. Perhaps something like __rtas_call()?

> ---
>  arch/powerpc/kernel/rtas.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index c29c38b1a55a..96a10a0abe3a 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -996,9 +996,8 @@ static void __init init_error_log_max(void) {}
>  #endif
>
>
> -static void
> -va_rtas_call_unlocked(struct rtas_args *args, int token, int nargs, int nret,
> -		      va_list list)
> +static void va_rtas_call(struct rtas_args *args, int token, int nargs, int nret,
> +			 va_list list)
>  {
>  	int i;
>
> @@ -1038,7 +1037,7 @@ void rtas_call_unlocked(struct rtas_args *args, int token, int nargs, int nret,
>  	va_list list;
>
>  	va_start(list, nret);
> -	va_rtas_call_unlocked(args, token, nargs, nret, list);
> +	va_rtas_call(args, token, nargs, nret, list);
>  	va_end(list);
>  }
>
> @@ -1138,7 +1137,7 @@ int rtas_call(int token, int nargs, int nret, int *outputs, ...)
>  	args = &rtas_args;
>
>  	va_start(list, outputs);
> -	va_rtas_call_unlocked(args, token, nargs, nret, list);
> +	va_rtas_call(args, token, nargs, nret, list);
>  	va_end(list);
>
>  	/* A -1 return code indicates that the last command couldn't

--
Andrew Donnellan              OzLabs, ADL Canberra
a...@linux.ibm.com            IBM Australia Limited
Re: [PATCH 7/8] powerpc/rtas: warn on unsafe argument to rtas_call_unlocked()
On Mon, 2023-03-06 at 15:33 -0600, Nathan Lynch via B4 Relay wrote:
> From: Nathan Lynch
>
> Any caller of rtas_call_unlocked() must provide an rtas_args parameter
> block distinct from the core rtas_args buffer used by the rtas_call()
> path. It's an unlikely error to make, but the potential consequences
> are grim, and it's trivial to check.
>
> Signed-off-by: Nathan Lynch

call_rtas_display_status() seems to do exactly this, or am I missing something?

> ---
>  arch/powerpc/kernel/rtas.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index 633c925164e7..47a2aa43d7d4 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -1042,6 +1042,13 @@ void rtas_call_unlocked(struct rtas_args *args, int token, int nargs, int nret,
>  {
>  	va_list list;
>
> +	/*
> +	 * Callers must not use rtas_args; otherwise they risk
> +	 * corrupting the state of the rtas_call() path, which is
> +	 * serialized by rtas_lock.
> +	 */
> +	WARN_ON(args == &rtas_args);
> +
>  	va_start(list, nret);
>  	va_rtas_call(args, token, nargs, nret, list);
>  	va_end(list);

--
Andrew Donnellan              OzLabs, ADL Canberra
a...@linux.ibm.com            IBM Australia Limited
Re: [PATCH v3 2/2] arch/powerpc/kvm: kvmppc_hv_entry: remove r4 argument
Hi,

On 2023-03-22 23:17:35, Michael Ellerman wrote:
> Kautuk Consul writes:
> > kvmppc_hv_entry is called from only 2 locations within
> > book3s_hv_rmhandlers.S. Both of those locations set r4
> > as HSTATE_KVM_VCPU(r13) before calling kvmppc_hv_entry.
> > So, shift the r4 load instruction to kvmppc_hv_entry and
> > thus modify the calling convention of this function.
> >
> > Signed-off-by: Kautuk Consul
> > ---
> >  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 11 ++++++-----
> >  1 file changed, 6 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > index b81ba4ee0521..b61f0b2c677b 100644
> > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > @@ -85,7 +85,7 @@ _GLOBAL_TOC(kvmppc_hv_entry_trampoline)
> >  	RFI_TO_KERNEL
> >
> >  kvmppc_call_hv_entry:
> > -	ld	r4, HSTATE_KVM_VCPU(r13)
> > +	/* Enter guest. */
> >  	bl	kvmppc_hv_entry
> >
> >  	/* Back from guest - restore host state and return to caller */
> > @@ -352,9 +352,7 @@ kvm_secondary_got_guest:
> >  	mtspr	SPRN_LDBAR, r0
> >  	isync
> > 63:
> > -	/* Order load of vcpu after load of vcore */
> > -	lwsync
> > -	ld	r4, HSTATE_KVM_VCPU(r13)
> > +	/* Enter guest. */
> >  	bl	kvmppc_hv_entry
> >
> >  	/* Back from the guest, go back to nap */
> > @@ -506,7 +504,6 @@ SYM_INNER_LABEL(kvmppc_hv_entry, SYM_L_LOCAL)
> >
> >  	/* Required state:
> >  	 *
> > -	 * R4 = vcpu pointer (or NULL)
> >  	 * MSR = ~IR|DR
> >  	 * R13 = PACA
> >  	 * R1 = host R1
> > @@ -524,6 +521,10 @@ SYM_INNER_LABEL(kvmppc_hv_entry, SYM_L_LOCAL)
> >  	li	r6, KVM_GUEST_MODE_HOST_HV
> >  	stb	r6, HSTATE_IN_GUEST(r13)
> >
> > +	/* Order load of vcpu after load of vcore */
>
> Just copying the comment here doesn't work. It doesn't make sense on its
> own here, because the VCORE is loaded (again) a few lines below (536).
> So as written this comment seems backward vs the code.
>
> The comment would need to expand to explain that the barrier is for the
> case where we came from kvm_secondary_got_guest.
>
Agreed.

> > +	lwsync
> > +	ld	r4, HSTATE_KVM_VCPU(r13)
> > +
> >  #ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING
> >  	/* Store initial timestamp */
> >  	cmpdi	r4, 0
>
> But as Nick says I don't think it's worth investing effort in small
> tweaks to this code. The risk of introducing bugs is too high for such a
> small improvement to the code.
>
> Thanks for trying, but I think this asm code is best left more or less
> alone unless we find actual bugs in it - or unless we can make
> substantial improvements to it, which would be rewriting in C, or at
> least converting to a fully call/return style rather than the current
> forest of labels.
>
> I will take patch 1 though, as that's an obvious cleanup and poses no
> risk (famous last words :D).

Thanks for taking patch 1. I completely agree that it would not be wise to introduce even small alterations to stable legacy code. But sending patches like this and conversing with this mailing list's reviewers is giving me a good picture of what you are open to accepting at this stage. Thanks again!

> cheers
Re: [PATCH 6/8] powerpc/rtas: lockdep annotations
On Mon, 2023-03-06 at 15:33 -0600, Nathan Lynch via B4 Relay wrote:
> From: Nathan Lynch
>
> Add lockdep annotations for the following properties that must hold:
>
> * Any error log retrieval must be atomically coupled with the prior
>   RTAS call, without a window for another RTAS call to occur before
>   the error log can be retrieved.
>
> * All users of the core rtas_args parameter block must hold
>   rtas_lock.
>
> Move the definitions of rtas_lock and rtas_args up in the file so that
> __do_enter_rtas_trace() can refer to them.
>
> Signed-off-by: Nathan Lynch

I'm no lockdep expert and I haven't checked if every possible case that can be annotated has been annotated, but these changes make sense as far as I can tell from my limited inspection.

Reviewed-by: Andrew Donnellan

--
Andrew Donnellan              OzLabs, ADL Canberra
a...@linux.ibm.com            IBM Australia Limited
Re: [PATCH 8/8] powerpc/rtas: consume retry statuses in sys_rtas()
On Mon, 2023-03-06 at 15:33 -0600, Nathan Lynch via B4 Relay wrote:
> From: Nathan Lynch
>
> The kernel can handle retrying RTAS function calls in response to
> -2/990x in the sys_rtas() handler instead of relaying the intermediate
> status to user space.
>
> Justifications:
>
> * Currently it's nondeterministic and quite variable in practice
>   whether a retry status is returned for any given invocation of
>   sys_rtas(). Therefore user space code cannot be expecting a retry
>   result without already being broken.
>
> * This tends to significantly reduce the total number of system calls
>   issued by programs such as drmgr which make use of sys_rtas(),
>   improving the experience of tracing and debugging such
>   programs. This is the main motivation for me: I think this change
>   will make it easier for us to characterize current sys_rtas() use
>   cases as we move them to other interfaces over time.
>
> * It reduces the number of opportunities for user space to leave
>   complex operations, such as those associated with DLPAR, incomplete
>   and difficult to recover.
>
> * We can expect performance improvements for existing sys_rtas()
>   users, not only because of overall reduction in the number of
>   system calls issued, but also due to the better handling of -2/990x
>   in the kernel. For example, librtas still sleeps for 1ms on -2,
>   which is completely unnecessary.

Would be good to see this fixed on the librtas side.

> Performance differences for PHB add and remove on a small P10 PowerVM
> partition are included below. For add, elapsed time is slightly
> reduced. For remove, there are more significant improvements: the
> number of context switches is reduced by an order of magnitude, and
> elapsed time is reduced by over half.
>
> (- before, + after):
>
> Performance counter stats for 'drmgr -c phb -a -s PHB 23' (5 runs):
>
> -      1,847.58 msec task-clock       #  0.135 CPUs utilized  ( +- 14.15% )
> -        10,867      cs               #  9.800 K/sec          ( +- 14.14% )
> +      1,901.15 msec task-clock       #  0.148 CPUs utilized  ( +- 14.13% )
> +        10,451      cs               #  9.158 K/sec          ( +- 14.14% )
>
> -     13.656557 +- 0.000124 seconds time elapsed  ( +- 0.00% )
> +      12.88080 +- 0.00404 seconds time elapsed   ( +- 0.03% )
>
> Performance counter stats for 'drmgr -c phb -r -s PHB 23' (5 runs):
>
> -      1,473.75 msec task-clock       #  0.092 CPUs utilized  ( +- 14.15% )
> -         2,652      cs               #  3.000 K/sec          ( +- 14.16% )
> +      1,444.55 msec task-clock       #  0.221 CPUs utilized  ( +- 14.14% )
> +           104      cs               # 119.957 /sec          ( +- 14.63% )
>
> -      15.99718 +- 0.00801 seconds time elapsed  ( +- 0.05% )
> +       6.54256 +- 0.00830 seconds time elapsed  ( +- 0.13% )
>
> Move the existing rtas_lock-guarded critical section in sys_rtas()
> into a conventional rtas_busy_delay()-based loop, returning to user
> space only when a final success or failure result is available.
>
> Signed-off-by: Nathan Lynch

Should there be some kind of timeout? I'm a bit worried by sleeping in a syscall for an extended period.

--
Andrew Donnellan              OzLabs, ADL Canberra
a...@linux.ibm.com            IBM Australia Limited