Re: [PATCH v5 1/7] trace: Add trace_ipi_send_cpumask()
On Tue, Mar 07, 2023 at 02:35:52PM +, Valentin Schneider wrote:
> trace_ipi_raise() is unsuitable for generically tracing IPI sources due to
> its "reason" argument being an uninformative string (on arm64 all you get
> is "Function call interrupts" for SMP calls).
>
> Add a variant of it that exports a target cpumask, a callsite and a callback.
>
> Signed-off-by: Valentin Schneider
> Reviewed-by: Steven Rostedt (Google)
> ---
>  include/trace/events/ipi.h | 22 ++
>  1 file changed, 22 insertions(+)
>
> diff --git a/include/trace/events/ipi.h b/include/trace/events/ipi.h
> index 0be71dad6ec03..b1125dc27682c 100644
> --- a/include/trace/events/ipi.h
> +++ b/include/trace/events/ipi.h
> @@ -35,6 +35,28 @@ TRACE_EVENT(ipi_raise,
> 	TP_printk("target_mask=%s (%s)", __get_bitmask(target_cpus),
> 		  __entry->reason)
> );
>
> +TRACE_EVENT(ipi_send_cpumask,
> +
> +	TP_PROTO(const struct cpumask *cpumask, unsigned long callsite, void *callback),
> +
> +	TP_ARGS(cpumask, callsite, callback),
> +
> +	TP_STRUCT__entry(
> +		__cpumask(cpumask)
> +		__field(void *, callsite)
> +		__field(void *, callback)
> +	),
> +
> +	TP_fast_assign(
> +		__assign_cpumask(cpumask, cpumask_bits(cpumask));
> +		__entry->callsite = (void *)callsite;
> +		__entry->callback = callback;
> +	),
> +
> +	TP_printk("cpumask=%s callsite=%pS callback=%pS",
> +		  __get_cpumask(cpumask), __entry->callsite, __entry->callback)
> +);

Would it make sense to add a variant like: ipi_send_cpu() that records a
single cpu instead of a cpumask. A lot of sites seems to do:
cpumask_of(cpu) for that first argument, and it seems to me it is quite
daft to have to memcpy a full multi-word cpumask in those cases.

Remember, nr_possible_cpus > 64 is quite common these days.
Re: [PATCH v5 7/7] sched, smp: Trace smp callback causing an IPI
On Tue, Mar 07, 2023 at 02:35:58PM +, Valentin Schneider wrote:
> @@ -477,6 +490,25 @@ static __always_inline void csd_unlock(struct __call_single_data *csd)
> 	smp_store_release(&csd->node.u_flags, 0);
> }
>
> +static __always_inline void
> +raw_smp_call_single_queue(int cpu, struct llist_node *node, smp_call_func_t func)
> +{
> +	/*
> +	 * The list addition should be visible to the target CPU when it pops
> +	 * the head of the list to pull the entry off it in the IPI handler
> +	 * because of normal cache coherency rules implied by the underlying
> +	 * llist ops.
> +	 *
> +	 * If IPIs can go out of order to the cache coherency protocol
> +	 * in an architecture, sufficient synchronisation should be added
> +	 * to arch code to make it appear to obey cache coherency WRT
> +	 * locking and barrier primitives. Generic code isn't really
> +	 * equipped to do the right thing...
> +	 */
> +	if (llist_add(node, &per_cpu(call_single_queue, cpu)))
> +		send_call_function_single_ipi(cpu, func);
> +}
> +
> static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);
>
> void __smp_call_single_queue(int cpu, struct llist_node *node)
> @@ -493,21 +525,25 @@ void __smp_call_single_queue(int cpu, struct llist_node *node)
> 	}
> }
> #endif
> 	/*
> +	 * We have to check the type of the CSD before queueing it, because
> +	 * once queued it can have its flags cleared by
> +	 *   flush_smp_call_function_queue()
> +	 * even if we haven't sent the smp_call IPI yet (e.g. the stopper
> +	 * executes migration_cpu_stop() on the remote CPU).
> 	 */
> +	if (trace_ipi_send_cpumask_enabled()) {
> +		call_single_data_t *csd;
> +		smp_call_func_t func;
> +
> +		csd = container_of(node, call_single_data_t, node.llist);
> +		func = CSD_TYPE(csd) == CSD_TYPE_TTWU ?
> +			sched_ttwu_pending : csd->func;
> +
> +		raw_smp_call_single_queue(cpu, node, func);
> +	} else {
> +		raw_smp_call_single_queue(cpu, node, NULL);
> +	}
> }

Hurmph... so we only really consume @func when we IPI. Would it not be
more useful to trace this thing for *every* csd enqueued?
Re: [PATCH v5 1/7] trace: Add trace_ipi_send_cpumask()
On Wed, Mar 22, 2023 at 10:39:55AM +0100, Peter Zijlstra wrote:
> On Tue, Mar 07, 2023 at 02:35:52PM +, Valentin Schneider wrote:
> > trace_ipi_raise() is unsuitable for generically tracing IPI sources due to
> > its "reason" argument being an uninformative string (on arm64 all you get
> > is "Function call interrupts" for SMP calls).
> >
> > Add a variant of it that exports a target cpumask, a callsite and a callback.
> >
> > Signed-off-by: Valentin Schneider
> > Reviewed-by: Steven Rostedt (Google)
> > ---
> >  include/trace/events/ipi.h | 22 ++
> >  1 file changed, 22 insertions(+)
> >
> > diff --git a/include/trace/events/ipi.h b/include/trace/events/ipi.h
> > index 0be71dad6ec03..b1125dc27682c 100644
> > --- a/include/trace/events/ipi.h
> > +++ b/include/trace/events/ipi.h
> > @@ -35,6 +35,28 @@ TRACE_EVENT(ipi_raise,
> > 	TP_printk("target_mask=%s (%s)", __get_bitmask(target_cpus),
> > 		  __entry->reason)
> > );
> >
> > +TRACE_EVENT(ipi_send_cpumask,
> > +
> > +	TP_PROTO(const struct cpumask *cpumask, unsigned long callsite, void *callback),
> > +
> > +	TP_ARGS(cpumask, callsite, callback),
> > +
> > +	TP_STRUCT__entry(
> > +		__cpumask(cpumask)
> > +		__field(void *, callsite)
> > +		__field(void *, callback)
> > +	),
> > +
> > +	TP_fast_assign(
> > +		__assign_cpumask(cpumask, cpumask_bits(cpumask));
> > +		__entry->callsite = (void *)callsite;
> > +		__entry->callback = callback;
> > +	),
> > +
> > +	TP_printk("cpumask=%s callsite=%pS callback=%pS",
> > +		  __get_cpumask(cpumask), __entry->callsite, __entry->callback)
> > +);
>
> Would it make sense to add a variant like: ipi_send_cpu() that records a
> single cpu instead of a cpumask. A lot of sites seems to do:
> cpumask_of(cpu) for that first argument, and it seems to me it is quite
> daft to have to memcpy a full multi-word cpumask in those cases.
>
> Remember, nr_possible_cpus > 64 is quite common these days.

Something a little bit like so...
---
Subject: trace: Add trace_ipi_send_cpu()
From: Peter Zijlstra
Date: Wed Mar 22 11:28:36 CET 2023

Because copying cpumasks around when targeting a single CPU is a bit
daft...

Signed-off-by: Peter Zijlstra (Intel)
---
 include/linux/smp.h        |  6 +++---
 include/trace/events/ipi.h | 22 ++
 kernel/irq_work.c          |  6 ++
 kernel/smp.c               |  4 ++--
 4 files changed, 29 insertions(+), 9 deletions(-)

--- a/include/linux/smp.h
+++ b/include/linux/smp.h
@@ -130,9 +130,9 @@ extern void arch_smp_send_reschedule(int
  * scheduler_ipi() is inline so can't be passed as callback reason, but the
  * callsite IP should be sufficient for root-causing IPIs sent from here.
  */
-#define smp_send_reschedule(cpu) ({	\
-	trace_ipi_send_cpumask(cpumask_of(cpu), _RET_IP_, NULL);	\
-	arch_smp_send_reschedule(cpu);	\
+#define smp_send_reschedule(cpu) ({	\
+	trace_ipi_send_cpu(cpu, _RET_IP_, NULL);	\
+	arch_smp_send_reschedule(cpu);	\
 })

 /*
--- a/include/trace/events/ipi.h
+++ b/include/trace/events/ipi.h
@@ -35,6 +35,28 @@ TRACE_EVENT(ipi_raise,
 	TP_printk("target_mask=%s (%s)", __get_bitmask(target_cpus),
 		  __entry->reason)
 );

+TRACE_EVENT(ipi_send_cpu,
+
+	TP_PROTO(const unsigned int cpu, unsigned long callsite, void *callback),
+
+	TP_ARGS(cpu, callsite, callback),
+
+	TP_STRUCT__entry(
+		__field(unsigned int, cpu)
+		__field(void *, callsite)
+		__field(void *, callback)
+	),
+
+	TP_fast_assign(
+		__entry->cpu = cpu;
+		__entry->callsite = (void *)callsite;
+		__entry->callback = callback;
+	),
+
+	TP_printk("cpu=%s callsite=%pS callback=%pS",
+		  __entry->cpu, __entry->callsite, __entry->callback)
+);
+
 TRACE_EVENT(ipi_send_cpumask,

 	TP_PROTO(const struct cpumask *cpumask, unsigned long callsite, void *callback),
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -78,10 +78,8 @@ void __weak arch_irq_work_raise(void)

 static __always_inline void irq_work_raise(struct irq_work *work)
 {
-	if (trace_ipi_send_cpumask_enabled() && arch_irq_work_has_interrupt())
-		trace_ipi_send_cpumask(cpumask_of(smp_processor_id()),
-				       _RET_IP_,
-				       work->func);
+	if (trace_ipi_send_cpu_enabled() && arch_irq_work_has_interrupt())
+		trace_ipi_send_cpu(smp_processor_id(), _RET_IP_, work->func);

 	arch_irq_work_raise();
 }
--- a/kernel/smp.c
+++ b/kernel/smp.c
@@ -109,7 +109,7 @@ static __always_inline void
 send_call_function_single_ipi(int cpu, smp_call_func_t func)
 {
 	if (call_function_single_prep_ipi(cpu)) {
-		trace_ipi_send_cpumask(cpumask_of(cpu), _RET_IP_, f
Re: [PATCH v5 1/7] trace: Add trace_ipi_send_cpumask()
On 22/03/23 11:30, Peter Zijlstra wrote:
> On Wed, Mar 22, 2023 at 10:39:55AM +0100, Peter Zijlstra wrote:
>> On Tue, Mar 07, 2023 at 02:35:52PM +, Valentin Schneider wrote:
>> > +TRACE_EVENT(ipi_send_cpumask,
>> > +
>> > +	TP_PROTO(const struct cpumask *cpumask, unsigned long callsite, void *callback),
>> > +
>> > +	TP_ARGS(cpumask, callsite, callback),
>> > +
>> > +	TP_STRUCT__entry(
>> > +		__cpumask(cpumask)
>> > +		__field(void *, callsite)
>> > +		__field(void *, callback)
>> > +	),
>> > +
>> > +	TP_fast_assign(
>> > +		__assign_cpumask(cpumask, cpumask_bits(cpumask));
>> > +		__entry->callsite = (void *)callsite;
>> > +		__entry->callback = callback;
>> > +	),
>> > +
>> > +	TP_printk("cpumask=%s callsite=%pS callback=%pS",
>> > +		  __get_cpumask(cpumask), __entry->callsite, __entry->callback)
>> > +);
>>
>> Would it make sense to add a variant like: ipi_send_cpu() that records a
>> single cpu instead of a cpumask. A lot of sites seems to do:
>> cpumask_of(cpu) for that first argument, and it seems to me it is quite
>> daft to have to memcpy a full multi-word cpumask in those cases.
>>
>> Remember, nr_possible_cpus > 64 is quite common these days.
>
> Something a little bit like so...
>

I was wondering whether we could stick with a single trace event, but let
ftrace be aware of weight=1 vs weight>1 cpumasks. For weight>1, it would
memcpy() as usual; for weight=1, it could write a pointer to a
cpu_bit_bitmap[] equivalent embedded in the trace itself.

Unfortunately, ftrace bitmasks are represented as a u32 made of two 16 bit
values: [offset in event record, size], so there isn't a straightforward
way to point to a "reusable" cpumask. AFAICT the only alternative would be
to do that via a different trace event, but then we should just go with a
plain old uint - i.e. do what you're doing here, so:

Tested-and-reviewed-by: Valentin Schneider

(with the tiny typo fix below)

> @@ -35,6 +35,28 @@ TRACE_EVENT(ipi_raise,
> 	TP_printk("target_mask=%s (%s)", __get_bitmask(target_cpus),
> 		  __entry->reason)
> );
>
> +TRACE_EVENT(ipi_send_cpu,
> +
> +	TP_PROTO(const unsigned int cpu, unsigned long callsite, void *callback),
> +
> +	TP_ARGS(cpu, callsite, callback),
> +
> +	TP_STRUCT__entry(
> +		__field(unsigned int, cpu)
> +		__field(void *, callsite)
> +		__field(void *, callback)
> +	),
> +
> +	TP_fast_assign(
> +		__entry->cpu = cpu;
> +		__entry->callsite = (void *)callsite;
> +		__entry->callback = callback;
> +	),
> +
> +	TP_printk("cpu=%s callsite=%pS callback=%pS",
               ^
               s/s/u/
> +		  __entry->cpu, __entry->callsite, __entry->callback)
> +);
> +
Re: [PATCH v3 2/2] arch/powerpc/kvm: kvmppc_hv_entry: remove r4 argument
Kautuk Consul writes:
> kvmppc_hv_entry is called from only 2 locations within
> book3s_hv_rmhandlers.S. Both of those locations set r4
> as HSTATE_KVM_VCPU(r13) before calling kvmppc_hv_entry.
> So, shift the r4 load instruction to kvmppc_hv_entry and
> thus modify the calling convention of this function.
>
> Signed-off-by: Kautuk Consul
> ---
>  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 11 ++-
>  1 file changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> index b81ba4ee0521..b61f0b2c677b 100644
> --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> @@ -85,7 +85,7 @@ _GLOBAL_TOC(kvmppc_hv_entry_trampoline)
> 	RFI_TO_KERNEL
>
> kvmppc_call_hv_entry:
> -	ld	r4, HSTATE_KVM_VCPU(r13)
> +	/* Enter guest. */
> 	bl	kvmppc_hv_entry
>
> 	/* Back from guest - restore host state and return to caller */
>
> @@ -352,9 +352,7 @@ kvm_secondary_got_guest:
> 	mtspr	SPRN_LDBAR, r0
> 	isync
> 63:
> -	/* Order load of vcpu after load of vcore */
> -	lwsync
> -	ld	r4, HSTATE_KVM_VCPU(r13)
> +	/* Enter guest. */
> 	bl	kvmppc_hv_entry
>
> 	/* Back from the guest, go back to nap */
>
> @@ -506,7 +504,6 @@ SYM_INNER_LABEL(kvmppc_hv_entry, SYM_L_LOCAL)
>
> 	/* Required state:
> 	 *
> -	 * R4 = vcpu pointer (or NULL)
> 	 * MSR = ~IR|DR
> 	 * R13 = PACA
> 	 * R1 = host R1
>
> @@ -524,6 +521,10 @@ SYM_INNER_LABEL(kvmppc_hv_entry, SYM_L_LOCAL)
> 	li	r6, KVM_GUEST_MODE_HOST_HV
> 	stb	r6, HSTATE_IN_GUEST(r13)
>
> +	/* Order load of vcpu after load of vcore */

Just copying the comment here doesn't work. It doesn't make sense on its
own here, because the VCORE is loaded (again) a few lines below (536). So
as written this comment seems backward vs the code.

The comment would need to expand to explain that the barrier is for the
case where we came from kvm_secondary_got_guest.

> +	lwsync
> +	ld	r4, HSTATE_KVM_VCPU(r13)
> +
> #ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING
> 	/* Store initial timestamp */
> 	cmpdi	r4, 0

But as Nick says I don't think it's worth investing effort in small
tweaks to this code. The risk of introducing bugs is too high for such a
small improvement to the code.

Thanks for trying, but I think this asm code is best left more or less
alone unless we find actual bugs in it - or unless we can make
substantial improvements to it, which would be rewriting in C, or at
least converting to a fully call/return style rather than the current
forest of labels.

I will take patch 1 though, as that's an obvious cleanup and poses no
risk (famous last words :D).

cheers
Re: [PATCH v5 7/7] sched, smp: Trace smp callback causing an IPI
On 22/03/23 10:53, Peter Zijlstra wrote:
> On Tue, Mar 07, 2023 at 02:35:58PM +, Valentin Schneider wrote:
>
>> @@ -477,6 +490,25 @@ static __always_inline void csd_unlock(struct __call_single_data *csd)
>> 	smp_store_release(&csd->node.u_flags, 0);
>> }
>>
>> +static __always_inline void
>> +raw_smp_call_single_queue(int cpu, struct llist_node *node, smp_call_func_t func)
>> +{
>> +	/*
>> +	 * The list addition should be visible to the target CPU when it pops
>> +	 * the head of the list to pull the entry off it in the IPI handler
>> +	 * because of normal cache coherency rules implied by the underlying
>> +	 * llist ops.
>> +	 *
>> +	 * If IPIs can go out of order to the cache coherency protocol
>> +	 * in an architecture, sufficient synchronisation should be added
>> +	 * to arch code to make it appear to obey cache coherency WRT
>> +	 * locking and barrier primitives. Generic code isn't really
>> +	 * equipped to do the right thing...
>> +	 */
>> +	if (llist_add(node, &per_cpu(call_single_queue, cpu)))
>> +		send_call_function_single_ipi(cpu, func);
>> +}
>> +
>> static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data);
>>
>> void __smp_call_single_queue(int cpu, struct llist_node *node)
>> @@ -493,21 +525,25 @@ void __smp_call_single_queue(int cpu, struct llist_node *node)
>> 	}
>> }
>> #endif
>> 	/*
>> +	 * We have to check the type of the CSD before queueing it, because
>> +	 * once queued it can have its flags cleared by
>> +	 *   flush_smp_call_function_queue()
>> +	 * even if we haven't sent the smp_call IPI yet (e.g. the stopper
>> +	 * executes migration_cpu_stop() on the remote CPU).
>> 	 */
>> +	if (trace_ipi_send_cpumask_enabled()) {
>> +		call_single_data_t *csd;
>> +		smp_call_func_t func;
>> +
>> +		csd = container_of(node, call_single_data_t, node.llist);
>> +		func = CSD_TYPE(csd) == CSD_TYPE_TTWU ?
>> +			sched_ttwu_pending : csd->func;
>> +
>> +		raw_smp_call_single_queue(cpu, node, func);
>> +	} else {
>> +		raw_smp_call_single_queue(cpu, node, NULL);
>> +	}
>> }
>
> Hurmph... so we only really consume @func when we IPI. Would it not be
> more useful to trace this thing for *every* csd enqueued?

It's true that any CSD enqueued on that CPU's call_single_queue in the
[first CSD llist_add()'ed, IPI IRQ hits] timeframe is a potential source
of interference.

However, can we be sure that first CSD isn't an indirect cause for the
following ones? Say the target CPU exits RCU EQS due to the IPI; there's
a bit of time before it gets to flush_smp_call_function_queue() where
some other CSD could be enqueued *because* of that change in state.

I couldn't find an easy example of that. I might be biased, as this is
where I'd like to go wrt IPI'ing isolated CPUs in usermode. But
regardless, when correlating an IPI IRQ with its source, we'd always
have to look at the first CSD in that CSD stack.
Re: (subset) [PATCH v2 01/10] powerpc/machdep: Make machine name const
On Sat, 18 Feb 2023 10:15:44 +0100, Christophe Leroy wrote:
> Machine name in struct machdep_calls should never be modified.
>
> Mark it const.
>

Patches 1-7 applied to powerpc/next.

[01/10] powerpc/machdep: Make machine name const
        https://git.kernel.org/powerpc/c/0aafbdf35c75cbfec82636d01e6dc7950bc1507c
[02/10] powerpc/machdep: Define 'compatible' property in ppc_md and use it
        https://git.kernel.org/powerpc/c/2fc39acfcacf3dc1392d8062f6d7b7d94eb2537c
[03/10] powerpc/platforms: Use 'compatible' property for simple cases
        https://git.kernel.org/powerpc/c/1c96fcdef8c7492ecf34ed70102a1ae5253ef9d1
[04/10] powerpc/47x: Split ppc47x machine in two
        https://git.kernel.org/powerpc/c/357f82395cd8a0279067805841e5968f4e6dc932
[05/10] powerpc/gamecube|wii : Use machine_device_initcall()
        https://git.kernel.org/powerpc/c/f47b17d51997cd47e0e6fb1b90145d516ebe6b3e
[06/10] powerpc/85xx: Fix function naming for p1023_rdb platform
        https://git.kernel.org/powerpc/c/5a81c02d0cc5067170e49452d55a4dfd21333257
[07/10] powerpc: Make generic_calibrate_decr() the default
        https://git.kernel.org/powerpc/c/0aafbdf35c75cbfec82636d01e6dc7950bc1507c

cheers
Re: [PATCH] powerpc: Fix some kernel-doc warnings
On Mon, 31 Oct 2022 21:54:52 -0400, Bo Liu wrote:
> The current code provokes some kernel-doc warnings:
> 	arch/powerpc/kernel/process.c:1606: warning: This comment starts with '/**', but isn't a kernel-doc comment. Refer Documentation/doc-guide/kernel-doc.rst
>

Applied to powerpc/next.

[1/1] powerpc: Fix some kernel-doc warnings
      https://git.kernel.org/powerpc/c/be994293544f1c0b032dabfe0832d9c1dfcea14b

cheers
Re: [PATCH 0/2] ppc: simplify sysctl registration
On Fri, 10 Mar 2023 15:28:48 -0800, Luis Chamberlain wrote:
> We can simplify the way we do sysctl registration both by
> reducing the number of lines and also avoiding callers which
> could do recursion. The docs are being updated to help reflect
> this better [0].
>
> [0] https://lore.kernel.org/all/20230310223947.3917711-1-mcg...@kernel.org/T/#u
>
> [...]

Applied to powerpc/next.

[1/2] ppc: simplify one-level sysctl registration for powersave_nap_ctl_table
      https://git.kernel.org/powerpc/c/bfedee5dc406ddcd70d667be1501659f1b232b7f
[2/2] ppc: simplify one-level sysctl registration for nmi_wd_lpm_factor_ctl_table
      https://git.kernel.org/powerpc/c/3a713753d3cb52e4e3039cdb906ef00f0b574219

cheers
Re: [PATCH] selftests/powerpc: Increase timeout for vsx_signal test
On Wed, 08 Mar 2023 08:36:14 +1100, Michael Neuling wrote:
> On the max config P10 machine (1920 threads and 64TB) this test fails
> with a timeout:
>
>     Sending signals to all threads 10 times...!! killing vmx_signal
>     !! child died by signal 15
>     failure: vmx_signal
>
> [...]

Applied to powerpc/next.

[1/1] selftests/powerpc: Increase timeout for vsx_signal test
      https://git.kernel.org/powerpc/c/493648d6795f00b6dcd6295b2b4221871bc1b25b

cheers
Re: [PATCH 0/3] Allow CONFIG_PPC64_BIG_ENDIAN_ELF_ABI_V2 with ld.lld 15+
On Wed, 15 Feb 2023 11:41:14 -0700, Nathan Chancellor wrote:
> Currently, CONFIG_PPC64_BIG_ENDIAN_ELF_ABI_V2 is not selectable with
> ld.lld because of an explicit dependency on GNU ld, due to lack of
> testing with LLVM.
>
> Erhard was kind enough to test this option on his hardware with LLVM 15,
> which ran without any issues. This should not be too surprising, as
> ld.lld does not have support for the ELFv1 ABI, only ELFv2, so it should
> have decent support. With this series, big endian kernels can be built
> with LLVM=1.
>
> [...]

Applied to powerpc/next.

[1/3] powerpc/boot: Only use '-mabi=elfv2' with CONFIG_PPC64_BOOT_WRAPPER
      https://git.kernel.org/powerpc/c/d1c5accacb234c3a9f1609a73b4b2eaa4ef07d1a
[2/3] powerpc: Fix use of '-mabi=elfv2' with clang
      https://git.kernel.org/powerpc/c/7c3bd8362b06cff0a4044a4975adb7d71db2dfba
[3/3] powerpc: Allow CONFIG_PPC64_BIG_ENDIAN_ELF_ABI_V2 with ld.lld 15+
      https://git.kernel.org/powerpc/c/a11334d8327b3fd7987cbfb38e956a44c722d88f

cheers
Re: [PATCH v2 1/4] powerpc/iommu: Add "borrowing" iommu_table_group_ops
On Mon, 6 Mar 2023 11:30:20 -0600 (CST), Timothy Pearson wrote:
> PPC64 IOMMU API defines iommu_table_group_ops which handles DMA windows
> for PEs: control the ownership, create/set/unset a table the hardware
> for dynamic DMA windows (DDW). VFIO uses the API to implement support
> on POWER.
>
> So far only PowerNV IODA2 (POWER8 and newer machines) implemented this and
> other cases (POWER7 or nested KVM) did not and instead reused
> existing iommu_table structs. This means 1) no DDW 2) ownership transfer
> is done directly in the VFIO SPAPR TCE driver.
>
> [...]

Applied to powerpc/next.

[1/4] powerpc/iommu: Add "borrowing" iommu_table_group_ops
      https://git.kernel.org/powerpc/c/9d67c94335096311c0bc7556ad1022de7385790b

cheers
Re: (subset) [PATCH 1/2] selftests/powerpc/pmu: Fix sample field check in the mmcra_thresh_marked_sample_test
On Wed, 01 Mar 2023 22:39:17 +0530, Kajol Jain wrote:
> The testcase verifies the setting of different fields in Monitor Mode
> Control Register A (MMCRA). In the current code, EV_CODE_EXTRACT macro
> is used to extract the "sample" field, which then needs to be further
> processed to fetch rand_samp_elig and rand_samp_mode bits. But the
> current code is not passing valid sample field to EV_CODE_EXTRACT
> macro. Patch addresses this by fixing the input for EV_CODE_EXTRACT.
>
> [...]

Patch 1 applied to powerpc/next.

[1/2] selftests/powerpc/pmu: Fix sample field check in the mmcra_thresh_marked_sample_test
      https://git.kernel.org/powerpc/c/8a32341cf04ba05974931b4664683c2c9fb84e56

cheers
Re: [PATCH v2 2/4] powerpc/pci_64: Init pcibios subsys a bit later
On Mon, 6 Mar 2023 11:30:42 -0600 (CST), Timothy Pearson wrote:
> The following patches are going to add dependency/use of iommu_ops which
> is initialized in subsys_initcall as well.
>
> This moves pcibios_init() to the next initcall level.
>
> This should not cause behavioral change.
>
> [...]

Applied to powerpc/next.

[2/4] powerpc/pci_64: Init pcibios subsys a bit later
      https://git.kernel.org/powerpc/c/76f351096c4516f38b9c901a21797fa958588e3a

cheers
Re: [PATCH v2 3/4] powerpc/iommu: Add iommu_ops to report capabilities and
On Mon, 6 Mar 2023 11:31:00 -0600 (CST), Timothy Pearson wrote:
> allow blocking domains
>
> Up until now PPC64 managed to avoid using iommu_ops. The VFIO driver
> uses a SPAPR TCE sub-driver and all iommu_ops uses were kept in
> the Type1 VFIO driver. Recent development added 2 uses of iommu_ops to
> the generic VFIO which broke POWER:
> - a coherency capability check;
> - blocking IOMMU domain - iommu_group_dma_owner_claimed()/...
>
> [...]

Applied to powerpc/next.

[3/4] powerpc/iommu: Add iommu_ops to report capabilities and
      https://git.kernel.org/powerpc/c/a940904443e432623579245babe63e2486ff327b

cheers
Re: [PATCH v2 4/4] Add myself to MAINTAINERS for Power VFIO support
On Mon, 6 Mar 2023 11:31:28 -0600 (CST), Timothy Pearson wrote:
>

Applied to powerpc/next.

[4/4] Add myself to MAINTAINERS for Power VFIO support
      https://git.kernel.org/powerpc/c/a34d2f0d79ec890b9b1b156a90016b6330173b8a

cheers
Re: [PATCH] powerpc/iommu: Fix notifiers being shared by PCI and VIO buses
On 22/03/23 9:23 am, Russell Currey wrote:

fail_iommu_setup() registers the fail_iommu_bus_notifier struct to both
PCI and VIO buses. struct notifier_block is a linked list node, so this
causes any notifiers later registered to either bus type to also be
registered to the other since they share the same node.

This causes issues in (at least) the vgaarb code, which registers a
notifier for PCI buses. pci_notify() ends up being called on a vio
device, converted with to_pci_dev() even though it's not a PCI device,
and finally makes a bad access in vga_arbiter_add_pci_device() as
discovered with KASAN:

BUG: KASAN: slab-out-of-bounds in vga_arbiter_add_pci_device+0x60/0xe00
Read of size 4 at addr c00264c26fdc by task swapper/0/1

Call Trace:
[c00263607520] [c00010f7023c] dump_stack_lvl+0x1bc/0x2b8 (unreliable)
[c00263607560] [cf142a64] print_report+0x3f4/0xc60
[c00263607640] [cf142144] kasan_report+0x244/0x698
[c00263607740] [cf1460e8] __asan_load4+0xe8/0x250
[c00263607760] [cff4b850] vga_arbiter_add_pci_device+0x60/0xe00
[c00263607850] [cff4c678] pci_notify+0x88/0x444
[c002636078b0] [ce94dfc4] notifier_call_chain+0x104/0x320
[c00263607950] [ce94f050] blocking_notifier_call_chain+0xa0/0x140
[c00263607990] [c000100cb3b8] device_add+0xac8/0x1d30
[c00263607aa0] [c000100ccd98] device_register+0x58/0x80
[c00263607ad0] [ce84247c] vio_register_device_node+0x9ac/0xce0
[c00263607ba0] [c000126c95d8] vio_bus_scan_register_devices+0xc4/0x13c
[c00263607bd0] [c000126c96e4] __machine_initcall_pseries_vio_device_init+0x94/0xf0
[c00263607c00] [ce69467c] do_one_initcall+0x12c/0xaa8
[c00263607cf0] [c0001268b8a8] kernel_init_freeable+0xa48/0xba8
[c00263607dd0] [ce695f24] kernel_init+0x64/0x400
[c00263607e50] [ce68e0e4] ret_from_kernel_thread+0x5c/0x64

Fix this by creating separate notifier_block structs for each bus type.

Fixes: d6b9a81b2a45 ("powerpc: IOMMU fault injection")
Reported-by: Nageswara R Sastry
Signed-off-by: Russell Currey
Tested-by: Nageswara R Sastry
---
 arch/powerpc/kernel/iommu.c | 15 ---
 1 file changed, 12 insertions(+), 3 deletions(-)

diff --git a/arch/powerpc/kernel/iommu.c b/arch/powerpc/kernel/iommu.c
index ee95937bdaf1..6f1117fe3870 100644
--- a/arch/powerpc/kernel/iommu.c
+++ b/arch/powerpc/kernel/iommu.c
@@ -171,17 +171,26 @@ static int fail_iommu_bus_notify(struct notifier_block *nb,
 	return 0;
 }
 
-static struct notifier_block fail_iommu_bus_notifier = {
+/*
+ * PCI and VIO buses need separate notifier_block structs, since they're linked
+ * list nodes. Sharing a notifier_block would mean that any notifiers later
+ * registered for PCI buses would also get called by VIO buses and vice versa.
+ */
+static struct notifier_block fail_iommu_pci_bus_notifier = {
+	.notifier_call = fail_iommu_bus_notify
+};
+
+static struct notifier_block fail_iommu_vio_bus_notifier = {
 	.notifier_call = fail_iommu_bus_notify
 };
 
 static int __init fail_iommu_setup(void)
 {
 #ifdef CONFIG_PCI
-	bus_register_notifier(&pci_bus_type, &fail_iommu_bus_notifier);
+	bus_register_notifier(&pci_bus_type, &fail_iommu_pci_bus_notifier);
 #endif
 #ifdef CONFIG_IBMVIO
-	bus_register_notifier(&vio_bus_type, &fail_iommu_bus_notifier);
+	bus_register_notifier(&vio_bus_type, &fail_iommu_vio_bus_notifier);
 #endif
 
 	return 0;
--

Thanks and Regards
R.Nageswara Sastry
Re: [PATCH v8 1/3] riscv: Introduce CONFIG_RELOCATABLE
@linux-kbuild: Does anyone have an idea to solve this? Thanks!

On 2/22/23 13:29, Alexandre Ghiti wrote:

+cc linux-kbuild, llvm, Nathan, Nick

On 2/15/23 15:36, Alexandre Ghiti wrote:

From: Alexandre Ghiti

This config allows to compile 64b kernel as PIE and to relocate it at
any virtual address at runtime: this paves the way to KASLR.
Runtime relocation is possible since relocation metadata are embedded
into the kernel.

Note that relocating at runtime introduces an overhead even if the
kernel is loaded at the same address it was linked at and that the
compiler options are those used in arm64 which uses the same RELA
relocation format.

Signed-off-by: Alexandre Ghiti
---
 arch/riscv/Kconfig              | 14 +
 arch/riscv/Makefile             |  7 +++--
 arch/riscv/kernel/efi-header.S  |  6 ++--
 arch/riscv/kernel/vmlinux.lds.S | 10 --
 arch/riscv/mm/Makefile          |  4 +++
 arch/riscv/mm/init.c            | 54 -
 6 files changed, 87 insertions(+), 8 deletions(-)

diff --git a/arch/riscv/Kconfig b/arch/riscv/Kconfig
index e2b656043abf..e0ee7ce4b2e3 100644
--- a/arch/riscv/Kconfig
+++ b/arch/riscv/Kconfig
@@ -544,6 +544,20 @@ config COMPAT
 	  If you want to execute 32-bit userspace applications, say Y.
 
+config RELOCATABLE
+	bool "Build a relocatable kernel"
+	depends on MMU && 64BIT && !XIP_KERNEL
+	help
+	  This builds a kernel as a Position Independent Executable (PIE),
+	  which retains all relocation metadata required to relocate the
+	  kernel binary at runtime to a different virtual address than the
+	  address it was linked at.
+	  Since RISCV uses the RELA relocation format, this requires a
+	  relocation pass at runtime even if the kernel is loaded at the
+	  same address it was linked at.
+
+	  If unsure, say N.
+
 endmenu # "Kernel features"
 
 menu "Boot options"

diff --git a/arch/riscv/Makefile b/arch/riscv/Makefile
index 82153960ac00..97c34136b027 100644
--- a/arch/riscv/Makefile
+++ b/arch/riscv/Makefile
@@ -7,9 +7,12 @@
 #
 
 OBJCOPYFLAGS	:= -O binary
-LDFLAGS_vmlinux	:=
+ifeq ($(CONFIG_RELOCATABLE),y)
+	LDFLAGS_vmlinux += -shared -Bsymbolic -z notext -z norelro
+	KBUILD_CFLAGS += -fPIE
+endif
 ifeq ($(CONFIG_DYNAMIC_FTRACE),y)
-	LDFLAGS_vmlinux := --no-relax
+	LDFLAGS_vmlinux += --no-relax
 	KBUILD_CPPFLAGS += -DCC_USING_PATCHABLE_FUNCTION_ENTRY
 	CC_FLAGS_FTRACE := -fpatchable-function-entry=8
 endif

diff --git a/arch/riscv/kernel/efi-header.S b/arch/riscv/kernel/efi-header.S
index 8e733aa48ba6..f7ee09c4f12d 100644
--- a/arch/riscv/kernel/efi-header.S
+++ b/arch/riscv/kernel/efi-header.S
@@ -33,7 +33,7 @@ optional_header:
 	.byte	0x02					// MajorLinkerVersion
 	.byte	0x14					// MinorLinkerVersion
 	.long	__pecoff_text_end - efi_header_end	// SizeOfCode
-	.long	__pecoff_data_virt_size			// SizeOfInitializedData
+	.long	__pecoff_data_virt_end - __pecoff_text_end	// SizeOfInitializedData
 	.long	0					// SizeOfUninitializedData
 	.long	__efistub_efi_pe_entry - _start		// AddressOfEntryPoint
 	.long	efi_header_end - _start			// BaseOfCode
@@ -91,9 +91,9 @@ section_table:
 		IMAGE_SCN_MEM_EXECUTE			// Characteristics
 
 	.ascii	".data\0\0\0"
-	.long	__pecoff_data_virt_size			// VirtualSize
+	.long	__pecoff_data_virt_end - __pecoff_text_end	// VirtualSize
 	.long	__pecoff_text_end - _start		// VirtualAddress
-	.long	__pecoff_data_raw_size			// SizeOfRawData
+	.long	__pecoff_data_raw_end - __pecoff_text_end	// SizeOfRawData
 	.long	__pecoff_text_end - _start		// PointerToRawData
 
 	.long	0					// PointerToRelocations

diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
index 4e6c88aa4d87..8be2de3be08c 100644
--- a/arch/riscv/kernel/vmlinux.lds.S
+++ b/arch/riscv/kernel/vmlinux.lds.S
@@ -122,9 +122,15 @@ SECTIONS
 		*(.sdata*)
 	}
 
+	.rela.dyn : ALIGN(8) {
+		__rela_dyn_start = .;
+		*(.rela .rela*)
+		__rela_dyn_end = .;
+	}
+

So I realized those relocations would be better in the init section so we
can get rid of them at some point. So I tried the following:

diff --git a/arch/riscv/kernel/vmlinux.lds.S b/arch/riscv/kernel/vmlinux.lds.S
index 7ac215467fd5..6111023a89ef 100644
--- a/arch/riscv/kernel/vmlinux.lds.S
+++ b/arch/riscv/kernel/vmlinux.lds.S
@@ -93,6 +93,12 @@ SECTIONS
 		*(.rel.dyn*)
 	}
 
+	.rela.dyn : ALIGN(8) {
+		__rela_dyn_start = .;
+		*(.rela .rela*)
+		__rela_dyn_end = .;
+	}
+
 	__init_data_end = .;
 
 	. = ALIGN(8);
@@ -119,12 +125,6 @@ SECTIONS
 		*(.sdata*)
 	}
 
-	.rela.dyn
Re: [PATCH v4 0/7] introduce vm_flags modifier functions
On Tue, Mar 14, 2023 at 02:11:44PM -0600, Alex Williamson wrote:
> On Thu, 26 Jan 2023 11:37:45 -0800
> Suren Baghdasaryan wrote:
>
> > This patchset was originally published as a part of per-VMA locking [1] and
> > was split after suggestion that it's viable on its own and to facilitate
> > the review process. It is now a prerequisite for the next version of per-VMA
> > lock patchset, which reuses vm_flags modifier functions to lock the VMA when
> > vm_flags are being updated.
> >
> > VMA vm_flags modifications are usually done under exclusive mmap_lock
> > protection because this attribute affects other decisions like VMA merging
> > or splitting and races should be prevented. Introduce vm_flags modifier
> > functions to enforce correct locking.
> >
> > The patchset applies cleanly over mm-unstable branch of mm tree.
>
> With this series, vfio-pci developed a bunch of warnings around not
> holding the mmap_lock write semaphore while calling
> io_remap_pfn_range() from our fault handler, vfio_pci_mmap_fault().
>
> I suspect vdpa has the same issue for their use of remap_pfn_range()
> from their fault handler, JasonW, MST, FYI.

Yeah, IMHO this whole approach has always been a bit sketchy, it was
never intended that remap_pfn would be called from a fault handler, you
are supposed to use a vmf_insert_pfn() type API from fault handlers.

> The reason for using remap_pfn_range() on fault in vfio-pci is that
> we're mapping device MMIO to userspace, where that MMIO can be disabled
> and we'd rather zap the mapping when that occurs so that we can sigbus
> the user rather than allow the user to trigger potentially fatal bus
> errors on the host.
>
> Peter Xu has suggested offline that a non-lazy approach to reinsert the
> mappings might be more inline with mm expectations relative to touching
> vm_flags during fault.

Yes, I feel the same - along with completing the address_space
conversion you had started. IIRC that was part of the reason we needed
this design in vfio.

Jason
Re: [PATCH v5 7/7] sched, smp: Trace smp callback causing an IPI
On Wed, Mar 22, 2023 at 12:20:28PM +, Valentin Schneider wrote: > On 22/03/23 10:53, Peter Zijlstra wrote: > > Hurmph... so we only really consume @func when we IPI. Would it not be > > more useful to trace this thing for *every* csd enqueued? > > It's true that any CSD enqueued on that CPU's call_single_queue in the > [first CSD llist_add()'ed, IPI IRQ hits] timeframe is a potential source of > interference. > > However, can we be sure that first CSD isn't an indirect cause for the > following ones? say the target CPU exits RCU EQS due to the IPI, there's a > bit of time before it gets to flush_smp_call_function_queue() where some > other CSD > could be enqueued *because* of that change in state. > > I couldn't find an easy example of that, I might be biased as this is where > I'd like to go wrt IPI'ing isolated CPUs in usermode. But regardless, when > correlating an IPI IRQ with its source, we'd always have to look at the > first CSD in that CSD stack. So I was thinking something like this: --- Subject: trace,smp: Trace all smp_function_call*() invocations From: Peter Zijlstra Date: Wed Mar 22 14:58:36 CET 2023 (Ab)use the trace_ipi_send_cpu*() family to trace all smp_function_call*() invocations, not only those that result in an actual IPI. The queued entries log their callback function while the actual IPIs are traced on generic_smp_call_function_single_interrupt().
Signed-off-by: Peter Zijlstra (Intel) --- kernel/smp.c | 58 ++ 1 file changed, 30 insertions(+), 28 deletions(-) --- a/kernel/smp.c +++ b/kernel/smp.c @@ -106,18 +106,20 @@ void __init call_function_init(void) } static __always_inline void -send_call_function_single_ipi(int cpu, smp_call_func_t func) +send_call_function_single_ipi(int cpu) { if (call_function_single_prep_ipi(cpu)) { - trace_ipi_send_cpu(cpu, _RET_IP_, func); + trace_ipi_send_cpu(cpu, _RET_IP_, + generic_smp_call_function_single_interrupt); arch_send_call_function_single_ipi(cpu); } } static __always_inline void -send_call_function_ipi_mask(const struct cpumask *mask, smp_call_func_t func) +send_call_function_ipi_mask(const struct cpumask *mask) { - trace_ipi_send_cpumask(mask, _RET_IP_, func); + trace_ipi_send_cpumask(mask, _RET_IP_, + generic_smp_call_function_single_interrupt); arch_send_call_function_ipi_mask(mask); } @@ -318,25 +320,6 @@ static __always_inline void csd_unlock(s smp_store_release(&csd->node.u_flags, 0); } -static __always_inline void -raw_smp_call_single_queue(int cpu, struct llist_node *node, smp_call_func_t func) -{ - /* -* The list addition should be visible to the target CPU when it pops -* the head of the list to pull the entry off it in the IPI handler -* because of normal cache coherency rules implied by the underlying -* llist ops. -* -* If IPIs can go out of order to the cache coherency protocol -* in an architecture, sufficient synchronisation should be added -* to arch code to make it appear to obey cache coherency WRT -* locking and barrier primitives. Generic code isn't really -* equipped to do the right thing... -*/ - if (llist_add(node, &per_cpu(call_single_queue, cpu))) - send_call_function_single_ipi(cpu, func); -} - static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data); void __smp_call_single_queue(int cpu, struct llist_node *node) @@ -356,10 +339,23 @@ void __smp_call_single_queue(int cpu, st func = CSD_TYPE(csd) == CSD_TYPE_TTWU ? 
sched_ttwu_pending : csd->func; - raw_smp_call_single_queue(cpu, node, func); - } else { - raw_smp_call_single_queue(cpu, node, NULL); + trace_ipi_send_cpu(cpu, _RET_IP_, func); } + + /* +* The list addition should be visible to the target CPU when it pops +* the head of the list to pull the entry off it in the IPI handler +* because of normal cache coherency rules implied by the underlying +* llist ops. +* +* If IPIs can go out of order to the cache coherency protocol +* in an architecture, sufficient synchronisation should be added +* to arch code to make it appear to obey cache coherency WRT +* locking and barrier primitives. Generic code isn't really +* equipped to do the right thing... +*/ + if (llist_add(node, &per_cpu(call_single_queue, cpu))) + send_call_function_single_ipi(cpu); } /* @@ -798,14 +794,20 @@ static void smp_call_function_many_cond( } /* +* Trace each smp_function_call_*() as an IPI, actual IPIs +* will be traced with func==generic_smp_call_fu
Re: [PATCH v5 7/7] sched, smp: Trace smp callback causing an IPI
On 22/03/23 15:04, Peter Zijlstra wrote: > On Wed, Mar 22, 2023 at 12:20:28PM +, Valentin Schneider wrote: >> On 22/03/23 10:53, Peter Zijlstra wrote: > >> > Hurmph... so we only really consume @func when we IPI. Would it not be >> > more useful to trace this thing for *every* csd enqueued? >> >> It's true that any CSD enqueued on that CPU's call_single_queue in the >> [first CSD llist_add()'ed, IPI IRQ hits] timeframe is a potential source of >> interference. >> >> However, can we be sure that first CSD isn't an indirect cause for the >> following ones? say the target CPU exits RCU EQS due to the IPI, there's a >> bit of time before it gets to flush_smp_call_function_queue() where some >> other CSD >> could be enqueued *because* of that change in state. >> >> I couldn't find an easy example of that, I might be biased as this is where >> I'd like to go wrt IPI'ing isolated CPUs in usermode. But regardless, when >> correlating an IPI IRQ with its source, we'd always have to look at the >> first CSD in that CSD stack. > > So I was thinking something like this: > > --- > Subject: trace,smp: Trace all smp_function_call*() invocations > From: Peter Zijlstra > Date: Wed Mar 22 14:58:36 CET 2023 > > (Ab)use the trace_ipi_send_cpu*() family to trace all > smp_function_call*() invocations, not only those that result in an > actual IPI. > > The queued entries log their callback function while the actual IPIs > are traced on generic_smp_call_function_single_interrupt(). 
> > Signed-off-by: Peter Zijlstra (Intel) > --- > kernel/smp.c | 58 > ++ > 1 file changed, 30 insertions(+), 28 deletions(-) > > --- a/kernel/smp.c > +++ b/kernel/smp.c > @@ -106,18 +106,20 @@ void __init call_function_init(void) > } > > static __always_inline void > -send_call_function_single_ipi(int cpu, smp_call_func_t func) > +send_call_function_single_ipi(int cpu) > { > if (call_function_single_prep_ipi(cpu)) { > - trace_ipi_send_cpu(cpu, _RET_IP_, func); > + trace_ipi_send_cpu(cpu, _RET_IP_, > +generic_smp_call_function_single_interrupt); Hm, this does get rid of the func being passed down the helpers, but this means the trace events are now stateful, i.e. I need the first and last events in a CSD stack to figure out which one actually caused the IPI. It also requires whoever is looking at the trace to be aware of which IPIs are attached to a CSD, and which ones aren't. ATM that's only the resched IPI, but per the cover letter there's more to come (e.g. tick_broadcast() for arm64/riscv and a few others). For instance: hackbench-157 [001]10.894320: ipi_send_cpu: cpu=3 callsite=check_preempt_curr+0x37 callback=0x0 hackbench-157 [001]10.895068: ipi_send_cpu: cpu=3 callsite=try_to_wake_up+0x29e callback=sched_ttwu_pending+0x0 hackbench-157 [001]10.895068: ipi_send_cpu: cpu=3 callsite=try_to_wake_up+0x29e callback=generic_smp_call_function_single_interrupt+0x0 That first one sent a RESCHEDULE IPI, the second one a CALL_FUNCTION one, but you really have to know what you're looking at... Are you worried about the @func being pushed down? Staring at x86 asm is not good for the soul, but AFAICT this does cause an extra register to be popped in the prologue because all of the helpers are __always_inline, so both paths of the static key(s) are in the same stackframe. 
I can "improve" this with: --- diff --git a/kernel/smp.c b/kernel/smp.c index 5cd680a7e78ef..55f120dae1713 100644 --- a/kernel/smp.c +++ b/kernel/smp.c @@ -511,6 +511,26 @@ raw_smp_call_single_queue(int cpu, struct llist_node *node, smp_call_func_t func static DEFINE_PER_CPU_SHARED_ALIGNED(call_single_data_t, csd_data); +static noinline void __smp_call_single_queue_trace(int cpu, struct llist_node *node) +{ + call_single_data_t *csd; + smp_call_func_t func; + + + /* +* We have to check the type of the CSD before queueing it, because +* once queued it can have its flags cleared by +* flush_smp_call_function_queue() +* even if we haven't sent the smp_call IPI yet (e.g. the stopper +* executes migration_cpu_stop() on the remote CPU). +*/ + csd = container_of(node, call_single_data_t, node.llist); + func = CSD_TYPE(csd) == CSD_TYPE_TTWU ? + sched_ttwu_pending : csd->func; + + raw_smp_call_single_queue(cpu, node, func); +} + void __smp_call_single_queue(int cpu, struct llist_node *node) { #ifdef CONFIG_CSD_LOCK_WAIT_DEBUG @@ -525,25 +545,10 @@ void __smp_call_single_queue(int cpu, struct llist_node *node) } } #endif - /* -* We have to check the type of the CSD before queueing it, because -* once queued it can have its flags cleared by -* flush_smp_call_function_queue() -* even if we haven't sent the smp_call IPI yet (e.g. th
Re: [PATCH v5 7/7] sched, smp: Trace smp callback causing an IPI
On Wed, Mar 22, 2023 at 05:01:13PM +, Valentin Schneider wrote: > > So I was thinking something like this: > Hm, this does get rid of the func being passed down the helpers, but this > means the trace events are now stateful, i.e. I need the first and last > events in a CSD stack to figure out which one actually caused the IPI. Isn't much of tracing stateful? I mean, why am I always writing awk programs to parse trace output? The one that is directly followed by generic_smp_call_function_single_interrupt() (horrible name that), is the one that tripped the IPI. > It also requires whoever is looking at the trace to be aware of which IPIs > are attached to a CSD, and which ones aren't. ATM that's only the resched > IPI, but per the cover letter there's more to come (e.g. tick_broadcast() > for arm64/riscv and a few others). For instance: > >hackbench-157 [001]10.894320: ipi_send_cpu: cpu=3 > callsite=check_preempt_curr+0x37 callback=0x0 Arguably we should be setting callback to scheduler_ipi(), except of course, that's not an actual function... Maybe we can do "extern inline" for the actual users and provide a dummy function for the symbol when tracing. >hackbench-157 [001]10.895068: ipi_send_cpu: cpu=3 > callsite=try_to_wake_up+0x29e callback=sched_ttwu_pending+0x0 >hackbench-157 [001]10.895068: ipi_send_cpu: cpu=3 > callsite=try_to_wake_up+0x29e > callback=generic_smp_call_function_single_interrupt+0x0 > > That first one sent a RESCHEDULE IPI, the second one a CALL_FUNCTION one, > but you really have to know what you're looking at... But you have to know that anyway, you can't do tracing and not know wtf you're doing. Or rather, if you do, I don't give a crap and you can keep the pieces :-) Grepping the callback should be pretty quick resolution as to what trips it, no? (also, if you *really* can't manage, we can always add yet another argument that gives a type thingy) > Are you worried about the @func being pushed down? 
Not really, I was finding it odd that only the first csd was being logged. Either you should log them all (after all, the target CPU will run them all and you might still wonder where the heck they came from) or it should log none and always report that hideous long function name I can't be arsed to type again :-) > Staring at x86 asm is not good for the soul, Scarred for life :-) What's worse, due to being exposed to Intel syntax at a young age, I'm now permanently confused as to the argument order of x86 asm.
Re: [PATCH 0/2] KVM: PPC: support kvm selftests
On Thu, Mar 16, 2023, Michael Ellerman wrote: > Nicholas Piggin writes: > > Hi, > > > > This series adds initial KVM selftests support for powerpc > > (64-bit, BookS). > > Awesome. > > > It spans 3 maintainers but it does not really > > affect arch/powerpc, and it is well contained in selftests > > code, just touches some makefiles and a tiny bit headers so > > conflicts should be unlikely and trivial. > > > > I guess Paolo is the best point to merge these, if no comments > > or objections? > > Yeah. If it helps: > > Acked-by: Michael Ellerman (powerpc) What is the long term plan for KVM PPC maintenance? I was under the impression that KVM PPC was trending toward "bug fixes only", but the addition of selftests support suggests otherwise. I ask primarily because routing KVM PPC patches through the PPC tree is going to be problematic if KVM PPC sees significant development. The current situation is ok because the volume of patches is low and KVM PPC isn't trying to drive anything substantial into common KVM code, but if that changes... My other concern is that for selftests specifically, us KVM folks are taking on more maintenance burden by supporting PPC. AFAIK, none of the people that focus on KVM selftests in any meaningful capacity have access to PPC hardware, let alone know enough about the architecture to make intelligent code changes. Don't get me wrong, I'm very much in favor of more testing, I just don't want KVM to get left holding the bag.
Re: [next-20230317][PPC/MLX5][bisected 4d5ab0a] Boot WARNING: CPU: 0 PID: 9 at net/core/dev.c:1928 call_netdevice_notifiers_info
On 3/20/23 6:55 PM, Lorenzo Bianconi wrote: Greetings Warning is seen while booting kernels from 6.3.0-rc3-next-20230317 on my powerpc Power 10 LPAR Boots fine without warnings when below patch is reverted commit 4d5ab0ad964df178beba031b89429a601893ff61 Author: Lorenzo Bianconi Date: Thu Mar 9 13:25:31 2023 +0100 net/mlx5e: take into account device reconfiguration for xdp_features flag Take into account LRO and GRO configuration setting device xdp_features flag. Consider channel rq_wq_type enabling rx scatter-gather support in xdp_features flag and disable NETDEV_XDP_ACT_NDO_XMIT_SG since it is not supported yet by the driver. Moreover always enable NETDEV_XDP_ACT_NDO_XMIT as the ndo_xdp_xmit 4d5ab0ad got introduced in next-20230314 @Lorenzo Could you please look into this I would say this issue has already been fixed by Jakub here: https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/net/core/xdp.c?id=769639c1fe8a98129aa97c8ee981639db1e8955c Thanks Lorenzo, Verified the patch and it fixes the problem and next-20230321 kernel boots fine on my powerpc lpar Tested-by: Abdul Haleem -- Regards Abdul Haleem IBM Linux Technology Center
Re: [PATCH v5 7/7] sched, smp: Trace smp callback causing an IPI
On 22/03/23 18:22, Peter Zijlstra wrote: > On Wed, Mar 22, 2023 at 05:01:13PM +, Valentin Schneider wrote: > >> > So I was thinking something like this: > >> Hm, this does get rid of the func being passed down the helpers, but this >> means the trace events are now stateful, i.e. I need the first and last >> events in a CSD stack to figure out which one actually caused the IPI. > > Isn't much of tracing stateful? I mean, why am I always writing awk > programs to parse trace output? > > The one that is directly followed by > generic_smp_call_function_single_interrupt() (horrible name that), is > the one that tripped the IPI. > Right. >> It also requires whoever is looking at the trace to be aware of which IPIs >> are attached to a CSD, and which ones aren't. ATM that's only the resched >> IPI, but per the cover letter there's more to come (e.g. tick_broadcast() >> for arm64/riscv and a few others). For instance: >> >>hackbench-157 [001]10.894320: ipi_send_cpu: cpu=3 >> callsite=check_preempt_curr+0x37 callback=0x0 > > Arguably we should be setting callback to scheduler_ipi(), except > ofcourse, that's not an actual function... > > Maybe we can do "extern inline" for the actual users and provide a dummy > function for the symbol when tracing. > Huh, I wasn't aware that was an option, I'll look into that. I did scribble down a comment next to smp_send_reschedule(), but having a decodable function name would be better! >>hackbench-157 [001]10.895068: ipi_send_cpu: cpu=3 >> callsite=try_to_wake_up+0x29e callback=sched_ttwu_pending+0x0 >>hackbench-157 [001]10.895068: ipi_send_cpu: cpu=3 >> callsite=try_to_wake_up+0x29e >> callback=generic_smp_call_function_single_interrupt+0x0 >> >> That first one sent a RESCHEDULE IPI, the second one a CALL_FUNCTION one, >> but you really have to know what you're looking at... > > But you have to know that anyway, you can't do tracing and not know wtf > you're doing. 
Or rather, if you do, I don't give a crap and you can keep > the pieces :-) > > Grepping the callback should be pretty quick resolution at to what trips > it, no? > > (also, if you *really* can't manage, we can always add yet another > argument that gives a type thingy) > Ah, I was a bit unclear here - I don't care too much about the IPI type being used, but rather being able to figure out on IRQ entry where that IPI came from - thinking some more about now, I don't think logging *all* CSDs causes an issue there, as you'd look at the earliest-not-seen-yet event targeting this CPU anyway. That'll be made easy once I get to having cpumask filters for ftrace, so I can just issue something like: trace-cmd record -e 'ipi_send_cpu' -f "cpu == 3" -e 'ipi_send_cpumask' -f "cpus \in {3}" -T hackbench (it's somewhere on the todolist...) TL;DR: I *think* I've convinced myself logging all of them isn't an issue - I'm going to play with this on something "smarter" than just hackbench under QEMU just to drill it in.
Re: [PATCH v8 1/3] riscv: Introduce CONFIG_RELOCATABLE
On Fri, Feb 24, 2023 at 7:58 AM Björn Töpel wrote: > > Alexandre Ghiti writes: > > > +cc linux-kbuild, llvm, Nathan, Nick > > > > On 2/15/23 15:36, Alexandre Ghiti wrote: > >> From: Alexandre Ghiti > >> > > I tried a lot of things, but I struggle to understand, does anyone have > > any idea? FYI, the same problem happens with LLVM. Off the top of my head, no idea. (Maybe as a follow up to this series, I wonder if pursuing ARCH_HAS_RELR for ARCH=riscv is worthwhile?) > > Don't ask me *why*, but adding --emit-relocs to your linker flags solves > "the NULL .rela.dyn" both for GCC and LLVM. > > The downside is that you end up with a bunch of .rela cruft in your > vmlinux. There was a patch just this week to use $(OBJCOPY) to strip these from vmlinux (for x86). Looks like x86 uses --emit-relocs for KASLR: https://lore.kernel.org/lkml/20230320121006.4863-1-petr.pa...@suse.com/ -- Thanks, ~Nick Desaulniers
Re: [PATCH v6 1/4] PCI: Introduce pci_dev_for_each_resource()
Hi Andy and Mika, I really like the improvements here. They make the code read much better. On Mon, Mar 20, 2023 at 03:16:30PM +0200, Andy Shevchenko wrote: > From: Mika Westerberg > ... > static void fixup_winbond_82c105(struct pci_dev* dev) > { > - int i; > + struct resource *r; > unsigned int reg; > > if (!machine_is(pseries)) > @@ -251,14 +251,14 @@ static void fixup_winbond_82c105(struct pci_dev* dev) > /* Enable LEGIRQ to use INTC instead of ISA interrupts */ > pci_write_config_dword(dev, 0x40, reg | (1<<11)); > > - for (i = 0; i < DEVICE_COUNT_RESOURCE; ++i) { > + pci_dev_for_each_resource_p(dev, r) { > /* zap the 2nd function of the winbond chip */ > - if (dev->resource[i].flags & IORESOURCE_IO > - && dev->bus->number == 0 && dev->devfn == 0x81) > - dev->resource[i].flags &= ~IORESOURCE_IO; > - if (dev->resource[i].start == 0 && dev->resource[i].end) { > - dev->resource[i].flags = 0; > - dev->resource[i].end = 0; > + if (dev->bus->number == 0 && dev->devfn == 0x81 && > + r->flags & IORESOURCE_IO) This is a nice literal conversion, but it's kind of lame to test bus->number and devfn *inside* the loop here, since they can't change inside the loop. > + r->flags &= ~IORESOURCE_IO; > + if (r->start == 0 && r->end) { > + r->flags = 0; > + r->end = 0; > } > } > #define pci_resource_len(dev,bar) \ > ((pci_resource_end((dev), (bar)) == 0) ? 0 :\ > \ > - (pci_resource_end((dev), (bar)) - \ > - pci_resource_start((dev), (bar)) + 1)) > + resource_size(pci_resource_n((dev), (bar I like this change, but it's unrelated to pci_dev_for_each_resource() and unmentioned in the commit log. 
> +#define __pci_dev_for_each_resource(dev, res, __i, vartype) \ > + for (vartype __i = 0; \ > + res = pci_resource_n(dev, __i), __i < PCI_NUM_RESOURCES; \ > + __i++) > + > +#define pci_dev_for_each_resource(dev, res, i) > \ > + __pci_dev_for_each_resource(dev, res, i, ) > + > +#define pci_dev_for_each_resource_p(dev, res) > \ > + __pci_dev_for_each_resource(dev, res, __i, unsigned int) This series converts many cases to drop the iterator variable ("i"), which is fantastic. Several of the remaining places need the iterator variable only to call pci_claim_resource(), which could be converted to take a "struct resource *" directly without much trouble. We don't have to do that pci_claim_resource() conversion now, but since we're converging on the "(dev, res)" style, I think we should reverse the names so we have something like: pci_dev_for_each_resource(dev, res) pci_dev_for_each_resource_idx(dev, res, i) Not sure __pci_dev_for_each_resource() is worthwhile since it only avoids repeating that single "for" statement, and passing in "vartype" (sometimes empty to implicitly avoid the declaration) is a little complicated to read. I think it'd be easier to read like this: #define pci_dev_for_each_resource(dev, res) \ for (unsigned int __i = 0; \ res = pci_resource_n(dev, __i), __i < PCI_NUM_RESOURCES; \ __i++) #define pci_dev_for_each_resource_idx(dev, res, idx) \ for (idx = 0; \ res = pci_resource_n(dev, idx), idx < PCI_NUM_RESOURCES; \ idx++) Bjorn
Re: [PATCH v6 2/4] PCI: Split pci_bus_for_each_resource_p() out of pci_bus_for_each_resource()
On Mon, Mar 20, 2023 at 03:16:31PM +0200, Andy Shevchenko wrote: > ... > -#define pci_bus_for_each_resource(bus, res, i) > \ > - for (i = 0; \ > - (res = pci_bus_resource_n(bus, i)) || i < PCI_BRIDGE_RESOURCE_NUM; \ > - i++) > +#define __pci_bus_for_each_resource(bus, res, __i, vartype) > \ > + for (vartype __i = 0; > \ > + res = pci_bus_resource_n(bus, __i), __i < PCI_BRIDGE_RESOURCE_NUM; > \ > + __i++) > + > +#define pci_bus_for_each_resource(bus, res, i) > \ > + __pci_bus_for_each_resource(bus, res, i, ) > + > +#define pci_bus_for_each_resource_p(bus, res) > \ > + __pci_bus_for_each_resource(bus, res, __i, unsigned int) I like these changes a lot, too! Same comments about _p vs _idx and __pci_bus_for_each_resource(..., vartype). Also would prefer 80 char max instead of 81.
Re: [Skiboot] [PATCH V4 1/3] core/device: Add function to return child node using name at substring "@"
Hi Athira, On Mon, Mar 06, 2023 at 09:09:11AM +0530, Athira Rajeev wrote: Add a function dt_find_by_name_substr() that returns the child node if it matches till the first occurrence of "@" in a given name, otherwise NULL. Given this summary, I don't understand the following: + assert(dt_find_by_name_substr(root, "node@1") == addr1); + assert(dt_find_by_name_substr(root, "node0_1@2") == addr2); Is this behavior required? I don't think it makes sense to call this function with a second argument containing '@', so I wouldn't expect it to match anything in these cases. The function seems to specifically enable it: +struct dt_node *dt_find_by_name_substr(struct dt_node *root, const char *name) +{ [snip] + node = strdup(name); + if (!node) + return NULL; + node = strtok(node, "@"); Seems like you could get rid of this and just use name as-is. I was curious about something else; say we have 'node@1' and 'node@2'. Is there an expectation of which it should match? addr1 = dt_new_addr(root, "node", 0x1); addr2 = dt_new_addr(root, "node", 0x2); assert(dt_find_by_name_substr(root, "node") == ???); ^^^ +/* Find a child node by name and substring */ +struct dt_node *dt_find_by_name_substr(struct dt_node *root, const char *name); I think this name fit better in previous versions of the patch, but since you're specifically looking for '@' now, maybe call it something like dt_find_by_name_before_addr? -- Reza Arbab
Re: [Skiboot] [PATCH V4 3/3] skiboot: Update IMC PMU node names for power10
On Mon, Mar 06, 2023 at 09:09:13AM +0530, Athira Rajeev wrote: + } else if (proc_gen == proc_gen_p10) { + int val; + char al[8], xl[8], otl[8], phb[8]; Are four different variables needed here? I think you could just reuse one temporary variable. char name[8]; + for (i=0; i<8; i++) { + val = ((avl_vec & (0x7ULL << (29 + (3 * i >> (29 + (3 * i))); + switch (val) { + case 0x5: //xlink configured and functional + + snprintf(al, sizeof(al), "alink%1d",(7-i)); + target = dt_find_by_name_substr(dev, al); + if (target) + dt_free(target); + + snprintf(otl, sizeof(otl),"otl%1d_0",(7-i)); + target = dt_find_by_name_substr(dev, otl); + if (target) + dt_free(target); + + snprintf(otl,sizeof(otl),"otl%1d_1",(7-i)); + target = dt_find_by_name_substr(dev, otl); + if (target) + dt_free(target); + + break; + case 0x6: //alink configured and functional + + snprintf(xl,sizeof(xl),"xlink%1d",(7-i)); + target = dt_find_by_name_substr(dev, xl); + if (target) + dt_free(target); + + snprintf(otl,sizeof(otl),"otl%1d_0",(7-i)); + target = dt_find_by_name_substr(dev, otl); + if (target) + dt_free(target); + + snprintf(otl,sizeof(otl),"otl%1d_1",(7-i)); + target = dt_find_by_name_substr(dev, otl); + if (target) + dt_free(target); + break; + + case 0x7: //CAPI configured and functional + snprintf(al,sizeof(al),"alink%1d",(7-i)); + target = dt_find_by_name_substr(dev, al); + if (target) + dt_free(target); + + snprintf(xl,sizeof(xl),"xlink%1d",(7-i)); + target = dt_find_by_name_substr(dev, xl); + if (target) + dt_free(target); + break; + default: + snprintf(xl,sizeof(xl),"xlink%1d",(7-i)); + target = dt_find_by_name_substr(dev, xl); + if (target) + dt_free(target); + + snprintf(al,sizeof(al),"alink%1d",(7-i)); + target = dt_find_by_name_substr(dev, al); + if (target) + dt_free(target); + + snprintf(otl,sizeof(otl),"otl%1d_0",(7-i)); + target = dt_find_by_name_substr(dev, otl); + if (target) + dt_free(target); + + snprintf(otl,sizeof(otl),"otl%1d_1",(7-i)); + target = dt_find_by_name_substr(dev, 
otl); + if (target) + dt_free(target); + break; As far as I know skiboot follows the kernel coding style. Would you mind fixing up the minor style nits checkpatch.pl reports for this patch? -- Reza Arbab
Re: [PATCH v5 7/7] sched, smp: Trace smp callback causing an IPI
On Wed, Mar 22, 2023 at 06:22:28PM +, Valentin Schneider wrote: > On 22/03/23 18:22, Peter Zijlstra wrote: > >>hackbench-157 [001]10.894320: ipi_send_cpu: cpu=3 > >> callsite=check_preempt_curr+0x37 callback=0x0 > > > > Arguably we should be setting callback to scheduler_ipi(), except > > ofcourse, that's not an actual function... > > > > Maybe we can do "extern inline" for the actual users and provide a dummy > > function for the symbol when tracing. > > > > Huh, I wasn't aware that was an option, I'll look into that. I did scribble > down a comment next to smp_send_reschedule(), but having a decodable > function name would be better! So clang-15 builds the below (and generates the expected code), but gcc-12 vomits nonsense about a non-static inline calling a static inline or somesuch bollocks :-/ --- a/include/linux/sched.h +++ b/include/linux/sched.h @@ -1991,7 +1991,7 @@ extern char *__get_task_comm(char *to, s }) #ifdef CONFIG_SMP -static __always_inline void scheduler_ipi(void) +extern __always_inline void scheduler_ipi(void) { /* * Fold TIF_NEED_RESCHED into the preempt_count; anybody setting --- a/include/linux/smp.h +++ b/include/linux/smp.h @@ -130,9 +130,9 @@ extern void arch_smp_send_reschedule(int * scheduler_ipi() is inline so can't be passed as callback reason, but the * callsite IP should be sufficient for root-causing IPIs sent from here. */ -#define smp_send_reschedule(cpu) ({ \ - trace_ipi_send_cpu(cpu, _RET_IP_, NULL); \ - arch_smp_send_reschedule(cpu);\ +#define smp_send_reschedule(cpu) ({\ + trace_ipi_send_cpu(cpu, _RET_IP_, &scheduler_ipi); \ + arch_smp_send_reschedule(cpu); \ }) /* --- a/kernel/sched/core.c +++ b/kernel/sched/core.c @@ -3790,6 +3790,15 @@ static int ttwu_runnable(struct task_str } #ifdef CONFIG_SMP +void scheduler_ipi(void) +{ + /* +* Actual users should end up using the extern inline, this is only +* here for the symbol. +*/ + BUG(); +} + void sched_ttwu_pending(void *arg) { struct llist_node *llist = arg;
Re: [PATCH 4/8] powerpc/rtas: fix miswording in rtas_function kerneldoc
On Mon, 2023-03-06 at 15:33 -0600, Nathan Lynch via B4 Relay wrote: > From: Nathan Lynch > > The 'filter' member is a pointer, not a bool; fix the wording > accordingly. > > Signed-off-by: Nathan Lynch Reviewed-by: Andrew Donnellan > --- > arch/powerpc/kernel/rtas.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c > index c73b01d722f6..c29c38b1a55a 100644 > --- a/arch/powerpc/kernel/rtas.c > +++ b/arch/powerpc/kernel/rtas.c > @@ -68,7 +68,7 @@ struct rtas_filter { > * functions are believed to have no > users on > * ppc64le, and we want to keep it that > way. It does > * not make sense for this to be set when > @filter > - * is false. > + * is NULL. > */ > struct rtas_function { > s32 token; > -- Andrew Donnellan, OzLabs, ADL Canberra a...@linux.ibm.com IBM Australia Limited
[powerpc:next] BUILD SUCCESS 3a713753d3cb52e4e3039cdb906ef00f0b574219
tree/branch: https://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux.git next branch HEAD: 3a713753d3cb52e4e3039cdb906ef00f0b574219 powerpc: Simplify sysctl registration for nmi_wd_lpm_factor_ctl_table elapsed time: 732m configs tested: 98 configs skipped: 5 The following configs have been built successfully. More configs may be tested in the coming days. tested configs: alphaallyesconfig gcc alpha defconfig gcc arc allyesconfig gcc arc buildonly-randconfig-r003-20230322 gcc arc buildonly-randconfig-r004-20230322 gcc arc defconfig gcc arc randconfig-r003-20230322 gcc arc randconfig-r043-20230322 gcc arm allmodconfig gcc arm allyesconfig gcc arm buildonly-randconfig-r001-20230322 clang arm defconfig gcc arm randconfig-r046-20230322 clang arm64allyesconfig gcc arm64 defconfig gcc cskydefconfig gcc csky randconfig-r011-20230322 gcc csky randconfig-r012-20230322 gcc csky randconfig-r016-20230322 gcc csky randconfig-r024-20230322 gcc csky randconfig-r026-20230322 gcc hexagon randconfig-r014-20230322 clang hexagon randconfig-r041-20230322 clang hexagon randconfig-r045-20230322 clang i386 allyesconfig gcc i386 debian-10.3 gcc i386defconfig gcc i386 randconfig-a001 gcc i386 randconfig-a002 clang i386 randconfig-a003 gcc i386 randconfig-a004 clang i386 randconfig-a005 gcc i386 randconfig-a006 clang i386 randconfig-a011 clang i386 randconfig-a012 gcc i386 randconfig-a013 clang i386 randconfig-a014 gcc i386 randconfig-a015 clang i386 randconfig-a016 gcc ia64 allmodconfig gcc ia64defconfig gcc loongarchallmodconfig gcc loongarch allnoconfig gcc loongarchbuildonly-randconfig-r002-20230322 gcc loongarch defconfig gcc m68k allmodconfig gcc m68kdefconfig gcc mips allmodconfig gcc mips allyesconfig gcc mips randconfig-r035-20230322 gcc mips randconfig-r036-20230322 gcc nios2 defconfig gcc openrisc randconfig-r015-20230322 gcc openrisc randconfig-r023-20230322 gcc openrisc randconfig-r031-20230322 gcc parisc defconfig gcc parisc randconfig-r034-20230322 gcc parisc64defconfig 
gcc powerpc allmodconfig gcc powerpc allnoconfig gcc powerpc randconfig-r001-20230322 clang powerpc randconfig-r013-20230322 gcc riscvallmodconfig gcc riscv allnoconfig gcc riscv defconfig gcc riscvrandconfig-r021-20230322 gcc riscvrandconfig-r042-20230322 gcc riscv rv32_defconfig gcc s390 allmodconfig gcc s390 allyesconfig gcc s390defconfig gcc s390 randconfig-r004-20230322 clang s390 randconfig-r022-20230322 gcc s390 randconfig-r044-20230322 gcc sh allmodconfig gcc sparcbuildonly-randconfig-r006-20230322 gcc sparc defconfig gcc sparc64 randconfig-r002-20230322 gcc sparc64 randconfig-r032-20230322 gcc sparc64 randconfig-r033-20230322 gcc um i386_defconfig gcc um x86_64_defconfig gcc x86_64allnoconfig gcc x86_64 allyesconfig gcc x86_64 defconfig gcc x86_64
Re: [PATCH 1/8] powerpc/rtas: ensure 8-byte alignment for struct rtas_args
On Mon, 2023-03-06 at 15:33 -0600, Nathan Lynch via B4 Relay wrote:
> > From: Nathan Lynch
> >
> > CHRP and PAPR agree: "In order to make an RTAS call, the operating
> > system must construct an argument call buffer aligned on an eight byte
> > boundary in physically contiguous real memory [...]." (7.2.7 Calling
> > Mechanism and Conventions).
> >
> > struct rtas_args is the type used for this argument call buffer. The
> > unarchitected 'rets' member happens to produce 8-byte alignment for
> > the struct on 64-bit targets in practice. But without an alignment
> > directive the structure will have only 4-byte alignment on 32-bit
> > targets:
> >
> > $ nm b/{before,after}/chrp32/vmlinux | grep rtas_args
> > c096881c b rtas_args
> > c0968820 b rtas_args
> >
> > Add an alignment directive to the struct rtas_args declaration so all
> > instances have the alignment required by the specs. rtas-types.h no
> > longer refers to any spinlock types, so drop the spinlock_types.h
> > inclusion while we're here.
> >
> > Signed-off-by: Nathan Lynch

Reviewed-by: Andrew Donnellan

> > ---
> >  arch/powerpc/include/asm/rtas-types.h | 5 +++--
> >  1 file changed, 3 insertions(+), 2 deletions(-)
> >
> > diff --git a/arch/powerpc/include/asm/rtas-types.h b/arch/powerpc/include/asm/rtas-types.h
> > index f2ad4a96cbc5..861145c8a021 100644
> > --- a/arch/powerpc/include/asm/rtas-types.h
> > +++ b/arch/powerpc/include/asm/rtas-types.h
> > @@ -2,7 +2,8 @@
> >  #ifndef _ASM_POWERPC_RTAS_TYPES_H
> >  #define _ASM_POWERPC_RTAS_TYPES_H
> >
> > -#include <linux/spinlock_types.h>
> > +#include <linux/compiler_attributes.h>
> > +#include <linux/sizes.h>
> >
> >  typedef __be32 rtas_arg_t;
> >
> > @@ -12,7 +13,7 @@ struct rtas_args {
> >  	__be32 nret;
> >  	rtas_arg_t args[16];
> >  	rtas_arg_t *rets;	/* Pointer to return values in args[]. */
> > -};
> > +} __aligned(SZ_8);

Nowhere else in the kernel is __aligned(SZ_8) used over just __aligned(8), which I suppose would also save an include, but I don't care either way.

> >
> >  struct rtas_t {
> >  	unsigned long entry;	/* physical address pointer */

--
Andrew Donnellan              OzLabs, ADL Canberra
a...@linux.ibm.com            IBM Australia Limited
Re: [PATCH 2/8] powerpc/rtas: use memmove for potentially overlapping buffer copy
On Mon, 2023-03-06 at 15:33 -0600, Nathan Lynch via B4 Relay wrote:
> From: Nathan Lynch
>
> Using memcpy() isn't safe when buf is identical to rtas_err_buf, which
> can happen during boot before slab is up. Full context which may not
> be obvious from the diff:
>
> 	if (altbuf) {
> 		buf = altbuf;
> 	} else {
> 		buf = rtas_err_buf;
> 		if (slab_is_available())
> 			buf = kmalloc(RTAS_ERROR_LOG_MAX, GFP_ATOMIC);
> 	}
> 	if (buf)
> 		memcpy(buf, rtas_err_buf, RTAS_ERROR_LOG_MAX);
>
> This was found by inspection and I'm not aware of it causing problems
> in practice. It appears to have been introduced by commit
> 033ef338b6e0 ("powerpc: Merge rtas.c into arch/powerpc/kernel"); the
> old ppc64 version of this code did not have this problem.
>
> Use memmove() instead.
>
> Fixes: 033ef338b6e0 ("powerpc: Merge rtas.c into arch/powerpc/kernel")
> Signed-off-by: Nathan Lynch

Reviewed-by: Andrew Donnellan

--
Andrew Donnellan              OzLabs, ADL Canberra
a...@linux.ibm.com            IBM Australia Limited
Re: [PATCH 3/8] powerpc/rtas: rtas_call_unlocked() kerneldoc
On Mon, 2023-03-06 at 15:33 -0600, Nathan Lynch via B4 Relay wrote:
> From: Nathan Lynch
>
> Add documentation for rtas_call_unlocked(), including details on how
> it differs from rtas_call().
>
> Signed-off-by: Nathan Lynch

Reviewed-by: Andrew Donnellan

--
Andrew Donnellan              OzLabs, ADL Canberra
a...@linux.ibm.com            IBM Australia Limited
Re: [PATCH 5/8] powerpc/rtas: rename va_rtas_call_unlocked() to va_rtas_call()
On Mon, 2023-03-06 at 15:33 -0600, Nathan Lynch via B4 Relay wrote:
> From: Nathan Lynch
>
> The function name va_rtas_call_unlocked() is confusing: it may be
> called with or without rtas_lock held. Rename it to va_rtas_call().
>
> Signed-off-by: Nathan Lynch

Not a huge fan of the name; the va_ prefix suggests that the only difference between this function and rtas_call() is the varargs handling. Perhaps something like __rtas_call()?

> ---
>  arch/powerpc/kernel/rtas.c | 9 ++++-----
>  1 file changed, 4 insertions(+), 5 deletions(-)
>
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index c29c38b1a55a..96a10a0abe3a 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -996,9 +996,8 @@ static void __init init_error_log_max(void) {}
>  #endif
>
>
> -static void
> -va_rtas_call_unlocked(struct rtas_args *args, int token, int nargs, int nret,
> -		      va_list list)
> +static void va_rtas_call(struct rtas_args *args, int token, int nargs, int nret,
> +			 va_list list)
>  {
>  	int i;
>
> @@ -1038,7 +1037,7 @@ void rtas_call_unlocked(struct rtas_args *args, int token, int nargs, int nret,
>  	va_list list;
>
>  	va_start(list, nret);
> -	va_rtas_call_unlocked(args, token, nargs, nret, list);
> +	va_rtas_call(args, token, nargs, nret, list);
>  	va_end(list);
>  }
>
> @@ -1138,7 +1137,7 @@ int rtas_call(int token, int nargs, int nret, int *outputs, ...)
>  	args = &rtas_args;
>
>  	va_start(list, outputs);
> -	va_rtas_call_unlocked(args, token, nargs, nret, list);
> +	va_rtas_call(args, token, nargs, nret, list);
>  	va_end(list);
>
>  	/* A -1 return code indicates that the last command couldn't

--
Andrew Donnellan              OzLabs, ADL Canberra
a...@linux.ibm.com            IBM Australia Limited
Re: [PATCH 7/8] powerpc/rtas: warn on unsafe argument to rtas_call_unlocked()
On Mon, 2023-03-06 at 15:33 -0600, Nathan Lynch via B4 Relay wrote:
> From: Nathan Lynch
>
> Any caller of rtas_call_unlocked() must provide an rtas_args parameter
> block distinct from the core rtas_args buffer used by the rtas_call()
> path. It's an unlikely error to make, but the potential consequences
> are grim, and it's trivial to check.
>
> Signed-off-by: Nathan Lynch

call_rtas_display_status() seems to do exactly this, or am I missing something?

> ---
>  arch/powerpc/kernel/rtas.c | 7 +++++++
>  1 file changed, 7 insertions(+)
>
> diff --git a/arch/powerpc/kernel/rtas.c b/arch/powerpc/kernel/rtas.c
> index 633c925164e7..47a2aa43d7d4 100644
> --- a/arch/powerpc/kernel/rtas.c
> +++ b/arch/powerpc/kernel/rtas.c
> @@ -1042,6 +1042,13 @@ void rtas_call_unlocked(struct rtas_args *args, int token, int nargs, int nret,
>  {
>  	va_list list;
>
> +	/*
> +	 * Callers must not use rtas_args; otherwise they risk
> +	 * corrupting the state of the rtas_call() path, which is
> +	 * serialized by rtas_lock.
> +	 */
> +	WARN_ON(args == &rtas_args);
> +
>  	va_start(list, nret);
>  	va_rtas_call(args, token, nargs, nret, list);
>  	va_end(list);

--
Andrew Donnellan              OzLabs, ADL Canberra
a...@linux.ibm.com            IBM Australia Limited
Re: [PATCH v3 2/2] arch/powerpc/kvm: kvmppc_hv_entry: remove r4 argument
Hi,

On 2023-03-22 23:17:35, Michael Ellerman wrote:
> Kautuk Consul writes:
> > kvmppc_hv_entry is called from only 2 locations within
> > book3s_hv_rmhandlers.S. Both of those locations set r4
> > as HSTATE_KVM_VCPU(r13) before calling kvmppc_hv_entry.
> > So, shift the r4 load instruction to kvmppc_hv_entry and
> > thus modify the calling convention of this function.
> >
> > Signed-off-by: Kautuk Consul
> > ---
> >  arch/powerpc/kvm/book3s_hv_rmhandlers.S | 11 ++++++-----
> >  1 file changed, 6 insertions(+), 5 deletions(-)
> >
> > diff --git a/arch/powerpc/kvm/book3s_hv_rmhandlers.S b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > index b81ba4ee0521..b61f0b2c677b 100644
> > --- a/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > +++ b/arch/powerpc/kvm/book3s_hv_rmhandlers.S
> > @@ -85,7 +85,7 @@ _GLOBAL_TOC(kvmppc_hv_entry_trampoline)
> >  	RFI_TO_KERNEL
> >
> >  kvmppc_call_hv_entry:
> > -	ld	r4, HSTATE_KVM_VCPU(r13)
> > +	/* Enter guest. */
> >  	bl	kvmppc_hv_entry
> >
> >  	/* Back from guest - restore host state and return to caller */
> > @@ -352,9 +352,7 @@ kvm_secondary_got_guest:
> >  	mtspr	SPRN_LDBAR, r0
> >  	isync
> > 63:
> > -	/* Order load of vcpu after load of vcore */
> > -	lwsync
> > -	ld	r4, HSTATE_KVM_VCPU(r13)
> > +	/* Enter guest. */
> >  	bl	kvmppc_hv_entry
> >
> >  	/* Back from the guest, go back to nap */
> > @@ -506,7 +504,6 @@ SYM_INNER_LABEL(kvmppc_hv_entry, SYM_L_LOCAL)
> >
> >  	/* Required state:
> >  	 *
> > -	 * R4 = vcpu pointer (or NULL)
> >  	 * MSR = ~IR|DR
> >  	 * R13 = PACA
> >  	 * R1 = host R1
> > @@ -524,6 +521,10 @@ SYM_INNER_LABEL(kvmppc_hv_entry, SYM_L_LOCAL)
> >  	li	r6, KVM_GUEST_MODE_HOST_HV
> >  	stb	r6, HSTATE_IN_GUEST(r13)
> >
> > +	/* Order load of vcpu after load of vcore */
>
> Just copying the comment here doesn't work. It doesn't make sense on its
> own here, because the VCORE is loaded (again) a few lines below (536).
> So as written this comment seems backward vs the code.
>
> The comment would need to expand to explain that the barrier is for the
> case where we came from kvm_secondary_got_guest.
>
Agreed.

> > +	lwsync
> > +	ld	r4, HSTATE_KVM_VCPU(r13)
> > +
> >  #ifdef CONFIG_KVM_BOOK3S_HV_P8_TIMING
> >  	/* Store initial timestamp */
> >  	cmpdi	r4, 0
>
> But as Nick says I don't think it's worth investing effort in small
> tweaks to this code. The risk of introducing bugs is too high for such a
> small improvement to the code.
>
> Thanks for trying, but I think this asm code is best left more or less
> alone unless we find actual bugs in it - or unless we can make
> substantial improvements to it, which would be rewriting in C, or at
> least converting to a fully call/return style rather than the current
> forest of labels.
>
> I will take patch 1 though, as that's an obvious cleanup and poses no
> risk (famous last words :D).

Thanks for taking patch 1. I completely agree that it would not be wise to introduce even small alterations to stable legacy code. But sending patches like this and conversing with this mailing list's reviewers is giving me a good picture of what you are open to accepting at this stage. Thanks again!

> cheers
Re: [PATCH 6/8] powerpc/rtas: lockdep annotations
On Mon, 2023-03-06 at 15:33 -0600, Nathan Lynch via B4 Relay wrote:
> From: Nathan Lynch
>
> Add lockdep annotations for the following properties that must hold:
>
> * Any error log retrieval must be atomically coupled with the prior
>   RTAS call, without a window for another RTAS call to occur before
>   the error log can be retrieved.
>
> * All users of the core rtas_args parameter block must hold
>   rtas_lock.
>
> Move the definitions of rtas_lock and rtas_args up in the file so that
> __do_enter_rtas_trace() can refer to them.
>
> Signed-off-by: Nathan Lynch

I'm no lockdep expert and I haven't checked if every possible case that can be annotated has been annotated, but these changes make sense as far as I can tell from my limited inspection.

Reviewed-by: Andrew Donnellan

--
Andrew Donnellan              OzLabs, ADL Canberra
a...@linux.ibm.com            IBM Australia Limited
Re: [PATCH 8/8] powerpc/rtas: consume retry statuses in sys_rtas()
On Mon, 2023-03-06 at 15:33 -0600, Nathan Lynch via B4 Relay wrote:
> From: Nathan Lynch
>
> The kernel can handle retrying RTAS function calls in response to
> -2/990x in the sys_rtas() handler instead of relaying the intermediate
> status to user space.
>
> Justifications:
>
> * Currently it's nondeterministic and quite variable in practice
>   whether a retry status is returned for any given invocation of
>   sys_rtas(). Therefore user space code cannot be expecting a retry
>   result without already being broken.
>
> * This tends to significantly reduce the total number of system calls
>   issued by programs such as drmgr which make use of sys_rtas(),
>   improving the experience of tracing and debugging such
>   programs. This is the main motivation for me: I think this change
>   will make it easier for us to characterize current sys_rtas() use
>   cases as we move them to other interfaces over time.
>
> * It reduces the number of opportunities for user space to leave
>   complex operations, such as those associated with DLPAR, incomplete
>   and difficult to recover.
>
> * We can expect performance improvements for existing sys_rtas()
>   users, not only because of overall reduction in the number of
>   system calls issued, but also due to the better handling of -2/990x
>   in the kernel. For example, librtas still sleeps for 1ms on -2,
>   which is completely unnecessary.

Would be good to see this fixed on the librtas side.

> Performance differences for PHB add and remove on a small P10 PowerVM
> partition are included below. For add, elapsed time is slightly
> reduced. For remove, there are more significant improvements: the
> number of context switches is reduced by an order of magnitude, and
> elapsed time is reduced by over half.
>
> (- before, + after):
>
> Performance counter stats for 'drmgr -c phb -a -s PHB 23' (5 runs):
>
> -      1,847.58 msec task-clock       #  0.135 CPUs utilized  ( +- 14.15% )
> -        10,867      cs               #  9.800 K/sec          ( +- 14.14% )
> +      1,901.15 msec task-clock       #  0.148 CPUs utilized  ( +- 14.13% )
> +        10,451      cs               #  9.158 K/sec          ( +- 14.14% )
>
> -     13.656557 +- 0.000124 seconds time elapsed  ( +- 0.00% )
> +      12.88080 +- 0.00404 seconds time elapsed   ( +- 0.03% )
>
> Performance counter stats for 'drmgr -c phb -r -s PHB 23' (5 runs):
>
> -      1,473.75 msec task-clock       #  0.092 CPUs utilized  ( +- 14.15% )
> -         2,652      cs               #  3.000 K/sec          ( +- 14.16% )
> +      1,444.55 msec task-clock       #  0.221 CPUs utilized  ( +- 14.14% )
> +           104      cs               # 119.957 /sec          ( +- 14.63% )
>
> -      15.99718 +- 0.00801 seconds time elapsed  ( +- 0.05% )
> +       6.54256 +- 0.00830 seconds time elapsed  ( +- 0.13% )
>
> Move the existing rtas_lock-guarded critical section in sys_rtas()
> into a conventional rtas_busy_delay()-based loop, returning to user
> space only when a final success or failure result is available.
>
> Signed-off-by: Nathan Lynch

Should there be some kind of timeout? I'm a bit worried by sleeping in a syscall for an extended period.

--
Andrew Donnellan              OzLabs, ADL Canberra
a...@linux.ibm.com            IBM Australia Limited